# pg_tracing

pg_tracing extension allows to generate spans from with a PostgreSQL instance, providing informations like: 

- Top Level Query: A span representing the executed query with the query text and parameters used
- Planner: A span representing the time spent planning the query
- ExecutorRun: A span representing the execution of the query
- Nested queries support: Queries executed within another query (like calling a pgsql function) will be represented by its own top level query and subsequent spans
- Trigger support: Before and After triggers will be represented by their own top level queries

## Tracecontext Propagation

SQLcommenter or `pg_tracing.trace_context` can be used to propagate a trace context to the database. It will have the following format: 
```sql
/*dddbs='postgres.db',traceparent='00-00000000000000000000000000000009-0000000000000005-01'*/ select 1;

-- SQLCommenter may be added to the end
select 1 /*dddbs='postgres.db',traceparent='00-00000000000000000000000000000009-0000000000000005-01'*/

-- Propagation through trace_context GUC
BEGIN;
SET LOCAL pg_tracing.trace_context='traceparent=''00-00000000000000000000000000000005-0000000000000005-01''';
UPDATE pgbench_accounts SET abalance=1 where aid=1;
COMMIT;
```

The traceparent fields are detailed in [w3c's trace-context](https://www.w3.org/TR/trace-context/#traceparent-header-field-values)

```
00000000000000000000000000000009: trace id
0000000000000005: parent id
01: trace flags (01 == sampled)
```

When a query with a trace context is detected, it is extracted and used by pg_tracing.

## Sampling

Spans will only be generated for sampled queries. A query is sampled if:
- It has a trace context with the sampled flag enabled and it passes the `pg_tracing.caller_sample_rate`
- It has no trace context but the query randomly passes the global `pg_tracing.sample_rate`

With the default values `pg_tracing.caller_sample_rate = 1` and `pg_tracing.sample_rate = 0`, only queries with a trace context and a sampled flag `sampled := 01` will be sampled, effectively offloading sampling decision to the callers.

## Views

### pg_tracing_consume_spans/pg_tracing_peek_spans

The spans generated by the module are made available via a view named `pg_tracing_consume_spans`. This view contains one row for each generated span. The `pg_tracing_peek_spans` view provides the same output as `pg_tracing_consume_spans`. `pg_tracing_peek_spans` doesn't delete read spans.

| Column | Type | Description |
| --- | --- | --- 
| `traceid` | char(32) | Trace identifier of the span |
| `parent_id` | char(16) | Identifier of the span |
| `queryid` | bigint | Hash code to identify identical normalized queries, 0 if `query_id` is disabled |
| `span_type` | text | Type of span (`Planner`, `Executor`...) |
| `span_operation` | text | Name of the span's operation |
| `span_start` | timestamp with time zone | Start time of the span |
| `span_end` | timestamp with time zone | End time of the span |
| `sql_error_code` | char(5) | Error code of the query. `00000` for a succesful query |
| `userid` | oid | OID of user who executed the statement |
| `dbid` | oid | OID of database in which the statement was executed |
| `pid` | integer | Pid of the backend process that processed the query |
| `subxact_count` | smallint | Active count of backend's subtransaction when the span was created |
| `plan_startup_cost` | float8 | Estimated cost before fetching any tuples by the query planner |
| `plan_total_cost` | double precision | Estimated total cost by the query planner |
| `plan_rows` | double precision | Estimated number of row plan is expected to emit |
| `plan_width` | integer | Estimated row width in bytes by the query planner |
| `rows` | bigint | Number of tuples processed |
| `nloops` | bigint | Number of cycles for this node |
| `shared_blks_hit` | bigint | Total number of shared block cache hits by the node |
| `shared_blks_read` | bigint | Total number of shared blocks read by the node |
| `shared_blks_dirtied` | bigint | Total number of shared blocks dirtied by the node |
| `shared_blks_written` | bigint | Total number of shared blocks written by the node |
| `local_blks_hit` | bigint | Total number of local block cache hits by the node |
| `local_blks_read` | bigint | Total number of local blocks read by the node |
| `local_blks_dirtied` | bigint | Total number of local blocks dirtied by the node |
| `local_blks_written` | bigint | Total number of local blocks written by the node |
| `blk_read_time` | double precision | Time spent reading blocks in milliseconds |
| `blk_write_time` | double precision | Time spent writing blocks in milliseconds |
| `temp_blks_read` | bigint | Total number of temp blocks read by the node |
| `temp_blks_written` | bigint | Total number of temp blocks written by the node |
| `temp_blk_read_time` | double precision | Total time the node spent reading temporary file blocks, in milliseconds (if `track_io_timing` is enabled, otherwise zero) |
| `temp_blk_write_time` | double precision | Total time the node spent writing temporary file blocks, in milliseconds (if `track_io_timing` is enabled, otherwise zero) |
| `wal_records` | bigint | Total number of WAL records generated by the node |
| `wal_fpi` | bigint | Total number of WAL full page images generated by the node |
| `wal_bytes` | numeric | Total amount of WAL generated by the node in bytes |
| `jit_functions` | bigint | Total number of functions JIT-compiled by the node |
| `jit_generation_time` | double precision | Total time spent by the node on generating JIT code, in milliseconds |
| `jit_inlining_time` | double precision | Total time spent by the node on inlining functions, in milliseconds |
| `jit_optimization_time` | double precision | Total time spent by the node on optimizing, in milliseconds |
| `startup` | bigint | Time to the first tuple in nanoseconds |
| `parameters` | text[] | Value of the query's parameters |
| `deparse_info` | text | Information extracted from deparsing a plan node |

## Functions

### pg_tracing_info

The statistics of the pg_tracing extension itself are tracked and made available through `pg_tracing_info`. This function will only return a single row.

| Column | Type | Description |
| --- | --- | --- 
| `processed_traces` | bigint | Total number of traces processed |
| `processed_spans` | bigint | Total number of spans processed |
| `dropped_traces` | bigint | Total number of traces dropped due to exceeding `pg_tracing.max_span` spans |
| `dropped_spans` | bigint | Total number of spans dropped due to exceeding `pg_tracing.max_span` spans |
| `otel_sent_spans` | bigint | Total number of spans dropped successfully sent to the otel collector |
| `otel_failures` | bigint | Total number of failures to send spans to the otel collector  |
| `last_consume` | timestamp with time zone | Time at which spans were last consumed |
| `stats_reset` | timestamp with time zone | Time at which all statistics in the `pg_tracing_info` view were last reset |

### pg_tracing_reset()

Discards all statistics gathered so far by `pg_tracing`. Span buffer is not emptied by this function. By default, this function can only be executed by superusers. Access may be granted to others using `GRANT`.

### pg_tracing_spans(boolean)

The `pg_tracing_consume_spans` and `pg_tracing_peek_spans` views are defined in terms of a function named `pg_tracing_spans`. It is possible for clients to call the `pg_tracing_spans` function directly. By calling `pg_tracing_spans` with consume false, spans won't be removed from the span buffer. With consume true, read spans will be deleted. By default, this function can only be executed by superusers. Access may be granted to others using `GRANT`.

## Parameters

### pg_tracing.buffer_mode (enum)

Controls span buffer's behaviour when `pg_tracing.max_span` spans is reached. If `keep_on_full`, the existing buffer is kept while new spans are dropped. If `drop_on_full`, the existing buffer is dropped and new spans are added. The default value is `keep_on_full`.

### pg_tracing.caller_sample_rate (real)

Controls the fraction of statements with SQLCommenter tracecontext and an enabled sampled flag that will generate spans. The default value is 1.

### pg_tracing.deparse_plan (boolean)

Controls whether the query plan should be deparsed. Deparsing plan allows to add more details in span's name. A `BitmapHeapScan` node may have `Recheck Cond: (a = 1)` as deparsing information. The default value is `on`.

### pg_tracing.export_parameters (boolean)

Controls whether the query's parameters should be exported in spans metadata. The default value is `on`.

### pg_tracing.filter_query_ids (string)

Restrict sampling to the provided queryIds. An empty value won't filter any queries. The default value is empty.

### pg_tracing.max_parameter_size (integer)

Controls the maximum size of the parameter string. The default value is 1024.

### pg_tracing.max_span (integer)

Specifies the maximum number of spans stored by the extension. If more spans are generated, the span buffer will be emptied if `pg_tracing.buffer_mode` is set to `drop_on_full`. If `pg_tracing.buffer_mode` is set to `keep_on_full`, the new spans will be dropped and tracing will be aborted. The default value is 5000. This parameter can only be set at server start.

> [!NOTE]  
> A span takes around 370 bytes of memory so 5000 spans will use 1.7MB while 10000 spans will use 3.5MB. You can use the following query to get the exact size used by the pg_tracing's spans:
```sql
SELECT
    (SELECT setting from pg_settings where name='pg_tracing.max_span') AS max_span,
    pg_size_pretty(size) AS total_span_size,
    pg_size_pretty(size/(SELECT setting from pg_settings 
                WHERE name='pg_tracing.max_span')::int)
        AS size_per_span
FROM pg_shmem_allocations
WHERE name ='PgTracing Spans';
```

### pg_tracing.planstate_spans (boolean)

Controls if spans should be generated from the executed query plan. The default value is true.

### pg_tracing.sample_rate (real)

Controls the fraction of statements that generate spans. Statements with tracecontext propagated with SQLCommenter and sampled flag enabled are not impacted by this parameter. For traces with nested statements, either all will be explained or none. The default value is 0.

### pg_tracing.trace_parallel_workers (boolean)

Controls whether spans should be generated for workers created by parallel queries. The default value is true.

### pg_tracing.track (enum)

Controls which statements should be generated. Specify `top` to only generate spans for top-level statements (those issued directly by clients), `all` to also generate spans for nested statements (such as statements invoked within functions), or `none` to disable span generation. The default value is `all`.

### pg_tracing.track_utility (boolean)

Controls whether spans should be generated for utility statements. Utility commands are all those other than `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `TABLE`, and `MERGE`. The default value is `on`.

### pg_tracing.sample_rate (real)

Controls the fraction of statements that generate spans. Statements with tracecontext propagated with SQLCommenter and sampled flag enabled are not impacted by this parameter. For traces with nested statements, either all will be explained or none. The default value is 0.

### pg_tracing.caller_sample_rate (real)

Controls the fraction of statements with SQLCommenter tracecontext and an enabled sampled flag that will generate spans. The default value is 1.

### pg_tracing.otel_endpoint (string)

URL of the otel collector to send spans to. Example: 'http://127.0.0.1:4318/v1/traces'. This parameter can only be set at server start. The default value is NULL.

### pg_tracing.otel_naptime (integer)

Interval in milliseconds between upload of spans to the otel collector. This parameter can only be set at server start.

### pg_tracing.otel_service_name (string)

Service name to set in traces sent to otel collector. This parameter can only be set at server start.

### pg_tracing.otel_connect_timeout_ms (integer)

Maximum time in milliseconds to connect to the otel collector. This includes DNS resolution and protocol handshake. This parameter can only be set at server start.