Query Tracing
BanyanDB supports query tracing, which records how a query was executed. The tracing data includes the query plan, execution time, and other diagnostic information. To enable it, set the QueryRequest.trace field to true when sending a query request.
The following command queries data from the last 30 minutes with tracing enabled:
bydbctl measure query --start -30m -f - <<EOF
name: "service_cpm_minute"
groups: ["measure-minute"]
tagProjection:
  tagFamilies:
    - name: "storage-only"
      tags: ["entity_id"]
fieldProjection:
  names: ["total", "value"]
trace: true
EOF
The response includes the tracing data. Span durations are reported in nanoseconds.
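Because durations arrive as raw nanosecond counts, it can help to convert them before comparing spans by eye. The following is a minimal Python sketch; the helper name format_duration_ns is hypothetical and is not part of BanyanDB or bydbctl.

```python
def format_duration_ns(duration_ns: int) -> str:
    """Render a nanosecond duration in the largest unit that keeps the value >= 1."""
    if duration_ns < 0:
        raise ValueError("duration must be non-negative")
    # Units ordered from largest to smallest, each expressed in nanoseconds.
    for unit, size in (("s", 1_000_000_000), ("ms", 1_000_000), ("us", 1_000)):
        if duration_ns >= size:
            return f"{duration_ns / size:.2f}{unit}"
    return f"{duration_ns}ns"


print(format_duration_ns(47_000_000))  # 47.00ms
```

The two-decimal precision is an arbitrary choice for readability; keep the raw integer when computing ratios between spans.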
Vectorized Measure Query Tracing
Vectorized distributed measure queries use the existing pkg/query.Tracer and common.v1.Trace tree. Tracing is opt-in through QueryRequest.trace; trace-off raw wire responses remain raw frame bytes.
Span shape
Aggregation path:
distributed-{nodeID}
├── broadcast-agg
│ └── data-{nodeID}
│ ├── scan
│ ├── groupby-agg-map
│ └── frame-encode
├── reduce-raw-frames
├── apply-top-to-reduce optional
└── build-iterator
Rows path:
distributed-{nodeID}
├── broadcast-rows
│ └── data-{nodeID}
│ ├── scan
│ ├── groupby-first optional
│ ├── top optional
│ ├── limit optional
│ └── frame-encode
├── merge-distributed-rows
├── apply-groupby-first-to-rows optional
├── apply-batch-top-to-rows optional
└── build-iterator
Multi-group rows path:
distributed-{nodeID}
├── build-multi-group-schema
├── broadcast-per-group-{group}
├── merge-distributed-rows-multi
└── build-iterator
Vocabulary
Trace tag keys are exported from pkg/query/tracelabels and should be used instead of string literals.
| Constant | Key |
|---|---|
| TagRowsIn | rows_in |
| TagRowsOut | rows_out |
| TagBatchesIn | batches_in |
| TagBatchesOut | batches_out |
| TagGroupsIn | groups_in |
| TagGroupsOut | groups_out |
| TagBytesIn | bytes_in |
| TagBytesOut | bytes_out |
| TagDroppedRows | dropped_rows |
| TagDropReason | drop_reason |
| TagMode | mode |
| TagSchemaCols | schema_cols |
| TagSchemaDegraded | schema_degraded |
| TagSchemaDegradedTags | schema_degraded_tags |
| TagSchemaDegradedFields | schema_degraded_fields |
| TagTypeDivergences | type_divergences |
| TagNodeCount | node_count |
| TagNodeErrors | node_errors |
| TagRespCount | resp_count |
| TagRounds | rounds |
| TagSize | size |
| TagResponseCount | response_count |
| TagResponseDataPointCount | response_data_point_count |
| TagLimitN | limit_n |
| TagLimitOffset | limit_offset |
| TagTopN | top_n |
| TagTopAsc | top_asc |
| TagCalibratedPerNodeLimit | calibrated_per_node_limit |
| TagMemoryChargedBytes | memory_charged_bytes |
| TagDedupKeysSeen | dedup_keys_seen |
| TagDedupCollisions | dedup_collisions |
| TagBlocksSkipped | blocks_skipped |
| TagTimeFilterReason | time_filter_reason |
| TagDecodeNS | decode_ns |
| TagDecodeNSTotal | decode_ns_total |
| TagDecodeNSP50 | decode_ns_p50 |
| TagDecodeNSP99 | decode_ns_p99 |
| TagDecodeNSMax | decode_ns_max |
| TagFramesIn | frames_in |
| TagSourcesIn | sources_in |
| TagGroupName | group_name |
| TagFramesTotal | frames_total |
| TagFramesEmittedIndividually | frames_emitted_individually |
| TagFramesSkipped | frames_skipped |
| TagMergeHeapPops | merge_heap_pops |
| TagCoercedColumns | coerced_columns |
| TagHiddenTagsStripped | hidden_tags_stripped |
| TagHiddenFieldStripped | hidden_field_stripped |
| TagOrderByColIdx | orderby_col_idx |
| TagOrderByFamily | orderby_family |
| TagOrderByTag | orderby_tag |
| TagDesc | desc |
| TagIndexMode | index_mode |
| TagAggFunc | agg_func |
| TagAggPartialKind | agg_partial_kind |
| TagAggValuePath | agg_value_path |
| TagFrameDecoderVersion | frame_decoder_version |
| TagBroadcastTimeoutMS | broadcast_timeout_ms |
| TagAggregatedDataNodeSpans | aggregated_data_node_spans |
| TagNodesWithErrors | nodes_with_errors |
| TagNodesWithZeroRows | nodes_with_zero_rows |
| TagTotalRowsAcrossNodes | total_rows_across_nodes |
| TagTotalBytesAcrossNodes | total_bytes_across_nodes |
| TagNodeLatencyNSP50 | node_latency_ns_p50 |
| TagNodeLatencyNSP95 | node_latency_ns_p95 |
| TagNodeLatencyNSP99 | node_latency_ns_p99 |
| TagNodeLatencyNSMin | node_latency_ns_min |
| TagNodeLatencyNSMax | node_latency_ns_max |
| TagPlan | plan |
| TagRequest | request |
| TagNodeSelectors | node_selectors |
| TagTimeRange | time_range |
| TagRespKind | resp_kind |
| TagFrameBytesTotal | frame_bytes_total |
| TagErrorMsg | error_msg |
| TagIgnoredChildSpans | ignored_child_spans |
Empty-result diagnosis runbook
- Check response_data_point_count on the user-visible response trace/root context.
- Walk from broadcast-* into data-node children and find the first span whose rows_out is unexpectedly low.
- Use drop_reason to distinguish top, limit, dedup, groupby-first, time-filter, and index-skip.
- For aggregation value issues, inspect agg_value_path on groupby-agg-map or reduce-raw-frames.
- For fanout issues, compare node_count, response_count, node_errors, and summary latency tags.
Examples
Each example below shows the trace tree returned in QueryResponse.trace for a representative query. A parenthesized value such as (123ms) is the span Duration, rounded. Tag values are written as key=value. Only the tags relevant to the walk are shown; the actual response contains the full vocabulary.
Enabling trace
Set trace=true on the request:
bydbctl measure query -f - <<'EOF'
groups: ["sw_metric"]
name: "service_cpm_minute"
tagProjection:
  tagFamilies:
    - name: default
      tags: [entity_id]
fieldProjection:
  names: [value]
groupBy:
  tagProjection:
    tagFamilies:
      - name: default
        tags: [entity_id]
  fieldName: value
agg:
  function: AGGREGATION_FUNCTION_SUM
  fieldName: value
timeRange:
  begin: 2026-05-24T00:00:00Z
  end: 2026-05-24T00:05:00Z
trace: true
EOF
The response carries trace: { spans: [ ... ] } alongside dataPoints.
Example 1 — Healthy aggregation across 3 data nodes
Query: SUM(value) GROUP BY entity_id over 5 minutes, 3 data nodes.
distributed-liaison-0 (47ms) plan="… AggSum on value" node_selectors="{sw_metric:[n1 n2 n3]}"
├── broadcast-agg (38ms) node_count=3 response_count=3 broadcast_timeout_ms=15000 frame_bytes_total=12480
│ ├── data-n1 (28ms) resp_kind=raw-frame bytes_out=4160
│ │ ├── scan (18ms) rows_out=8400 batches_out=66 schema_cols=4
│ │ ├── groupby-agg-map (7ms) mode=map rows_in=8400 rows_out=42 groups_out=42 agg_value_path=typed
│ │ └── frame-encode (2ms) rows_out=42 bytes_out=4160 schema_cols=4
│ ├── data-n2 (31ms) …
│ └── data-n3 (26ms) …
├── reduce-raw-frames (5ms) frames_in=3 rows_out=51 groups_out=51 agg_value_path=typed
├── apply-top-to-reduce (skipped — no Top in request)
└── build-iterator (1ms)
Reading it: the bottom-up rows_in/rows_out chain shows where rows were consumed and where groups were folded; agg_value_path=typed on both the data-node groupby-agg-map and liaison reduce-raw-frames confirms the native numeric value column was used (no FieldValue fallback).
Example 2 — Empty result via Top.N truncation (Q1 case 2)
Same query plus top { number: 1 } and a tight time range that produced very few rows.
distributed-liaison-0 (12ms) response_data_point_count=0 ← unexpected
├── broadcast-rows (8ms) node_count=3 response_count=3
│ ├── data-n1 (4ms)
│ │ ├── scan (3ms) rows_out=2 batches_out=1
│ │ ├── top (0ms) top_n=1 top_asc=false rows_in=2 rows_out=1 dropped_rows=1 drop_reason=top
│ │ └── frame-encode rows_out=1
│ ├── data-n2 (3ms) scan rows_out=0 (time range filter)
│ └── data-n3 (3ms) scan rows_out=0
├── merge-distributed-rows rows_out=1
├── apply-batch-top-to-rows top_n=1 rows_in=1 rows_out=1
└── build-iterator (1ms)
Diagnosis: response_data_point_count=0 is the user-visible surprise, although in this trace the liaison merge still returned 1 row. A genuinely empty result occurs when top_n and filtering together drop every row; look for any span with drop_reason=top or drop_reason=limit where dropped_rows is large.
Example 3 — Aggregation passthrough fallback (Q1 case 4)
Same SUM query, but one data node’s measure schema still emits the field as a *modelv1.FieldValue passthrough wrapper (mid-rollout).
distributed-liaison-0
├── broadcast-agg
│ ├── data-n1
│ │ └── groupby-agg-map agg_value_path=fieldvalue-fallback ← look here
│ ├── data-n2
│ │ └── groupby-agg-map agg_value_path=typed
│ └── data-n3
│ └── groupby-agg-map agg_value_path=typed
└── reduce-raw-frames agg_value_path=fieldvalue-fallback ← worst-of mode
Action: agg_value_path=fieldvalue-fallback means the operator resolved the agg value via the FieldValue passthrough column instead of a native numeric column. This is correct but slower; it usually points to a tag-type migration in flight on node n1. If the result also looks numerically off, query the suspect node directly via bydbctl measure query --node n1 ... to confirm.
Example 4 — Schema divergence in multi-group (Q1 case 5)
The query spans two groups whose service_name tag has a different type in each group mid-migration.
distributed-liaison-0 response_data_point_count=0
├── build-multi-group-schema (3ms) groups_in=2 schema_cols=7 schema_degraded=true
│ schema_degraded_tags=["service_name"]
│ type_divergences="service_name:Int@sw_metric→TagValue@traffic"
├── broadcast-per-group-sw_metric
├── broadcast-per-group-traffic
└── merge-distributed-rows-multi sources_in=6 rows_out=0 ← drop here
Action: schema_degraded=true on build-multi-group-schema is the early-warning. type_divergences lists the exact (column → type@group) tuples. The merge then sees rows it can’t unify and produces 0. Resolution: complete the tag-type migration so both groups carry the same type for service_name.
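When scripting over traces, a type_divergences entry can be split into its parts. The Python sketch below assumes the value has exactly the shape shown in this example (column:type@group→type@group); how several entries are delimited is not stated here, so only a single entry is handled. The names Divergence and parse_divergence are hypothetical.

```python
from typing import NamedTuple


class Divergence(NamedTuple):
    column: str
    left_type: str
    left_group: str
    right_type: str
    right_group: str


def parse_divergence(entry: str) -> Divergence:
    """Split one 'column:type@group→type@group' entry into named parts."""
    column, colon, sides = entry.partition(":")
    left, arrow, right = sides.partition("→")
    if not colon or not arrow:
        raise ValueError(f"unexpected type_divergences entry: {entry!r}")
    left_type, _, left_group = left.partition("@")
    right_type, _, right_group = right.partition("@")
    return Divergence(column, left_type, left_group, right_type, right_group)


d = parse_divergence("service_name:Int@sw_metric→TagValue@traffic")
print(f"{d.column}: {d.left_type} in {d.left_group}, {d.right_type} in {d.right_group}")
```

For the entry above this reports that service_name is Int in sw_metric and TagValue in traffic, which are the two groups to reconcile.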
Example 5 — Fanout asymmetry with > 20 data nodes (Q1 case 7)
Same query over a 25-node fanout. 3 nodes returned errors; 5 returned zero rows.
distributed-liaison-0
└── broadcast-agg node_count=25 response_count=22 broadcast_timeout_ms=15000
├── data-n3 error_msg="dial tcp: …"
├── data-n7 error_msg="context deadline exceeded"
├── data-n12 error_msg="dial tcp: …"
├── data-n4 rows_out=0
├── data-n5 rows_out=0
├── … (14 more individual children, ordered by descending latency)
└── data-summary
aggregated_data_node_spans=6
nodes_with_errors=0 nodes_with_zero_rows=0
total_rows_across_nodes=18420 total_bytes_across_nodes=921600
node_latency_ns_p50=42_000_000 node_latency_ns_p95=58_000_000 node_latency_ns_p99=61_000_000
node_latency_ns_min=11_000_000 node_latency_ns_max=63_000_000
Reading it: node_count - response_count = 3 indicates that three nodes never responded (the three with error_msg). The fanout summary span covers the 6 lower-latency healthy nodes that did not get individual slots; their aggregate is in total_rows_across_nodes and the latency percentiles. If nodes_with_zero_rows on data-summary is non-zero, a healthy-looking node was actually empty, which is a likely indicator of stage / segment misrouting.
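The fanout arithmetic can be checked mechanically. This Python sketch assumes the broadcast span has been decoded into a dict with message, tags (string values), and children; that layout and the helper name fanout_report are assumptions for illustration only.

```python
def fanout_report(broadcast: dict) -> dict:
    """Compare node_count/response_count against the per-node child spans."""
    tags = broadcast["tags"]
    missing = int(tags["node_count"]) - int(tags["response_count"])
    errored, empty = [], []
    for child in broadcast.get("children", []):
        child_tags = child.get("tags", {})
        if "error_msg" in child_tags:
            errored.append(child["message"])
        elif child_tags.get("rows_out") == "0":
            empty.append(child["message"])
    return {
        "missing_responses": missing,
        "errored_nodes": errored,
        "zero_row_nodes": empty,
        # True when every missing response is explained by a visible error span.
        "errors_explain_gap": missing == len(errored),
    }


# Hand-built subset of the 25-node example; values copied from the tree above.
broadcast = {
    "message": "broadcast-agg",
    "tags": {"node_count": "25", "response_count": "22"},
    "children": [
        {"message": "data-n3", "tags": {"error_msg": "dial tcp: …"}},
        {"message": "data-n7", "tags": {"error_msg": "context deadline exceeded"}},
        {"message": "data-n12", "tags": {"error_msg": "dial tcp: …"}},
        {"message": "data-n4", "tags": {"rows_out": "0"}},
        {"message": "data-n5", "tags": {"rows_out": "0"}},
    ],
}

report = fanout_report(broadcast)
print(report["missing_responses"], report["errored_nodes"], report["zero_row_nodes"])
```

If errors_explain_gap is false, some failing nodes may have been folded into the summary span, so nodes_with_errors on data-summary is the next tag to check.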
Example 6 — Performance bottleneck identification (Q2)
Same query, all nodes healthy, but the user reports slow query.
distributed-liaison-0 (812ms) ← total
├── broadcast-agg (47ms)
│ └── data-n1 / n2 / n3 (peak 31ms)
├── reduce-raw-frames (740ms) ← dominates 16× over peers
│ frames_in=3 rows_out=87432 groups_out=87432
│ agg_value_path=typed
│ memory_charged_bytes=210000000
├── apply-top-to-reduce (12ms)
└── build-iterator (1ms)
Diagnosis: the bottleneck is the span with the largest Duration — here reduce-raw-frames at 740ms. Tags reveal high groups_out=87432 and memory_charged_bytes=210MB, pointing at a high-cardinality GROUP BY. Fix by tightening the time range, adding a more selective filter, or capping with Top.N.
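Finding the dominant span is a simple ranking of siblings by duration. The Python sketch below assumes each decoded span is a dict with message, duration (in nanoseconds), and children; the layout and the helper name rank_children are hypothetical.

```python
def rank_children(span: dict) -> list:
    """Return the direct children of a span, slowest first."""
    return sorted(span.get("children", []),
                  key=lambda child: child["duration"], reverse=True)


# Durations from the example above, converted from milliseconds to nanoseconds.
root = {
    "message": "distributed-liaison-0",
    "duration": 812_000_000,
    "children": [
        {"message": "broadcast-agg", "duration": 47_000_000},
        {"message": "reduce-raw-frames", "duration": 740_000_000},
        {"message": "apply-top-to-reduce", "duration": 12_000_000},
        {"message": "build-iterator", "duration": 1_000_000},
    ],
}

ranked = rank_children(root)
slowest, runner_up = ranked[0], ranked[1]
share = slowest["duration"] / root["duration"]      # fraction of the parent span
ratio = slowest["duration"] / runner_up["duration"]  # versus the next-slowest sibling
print(f"{slowest['message']}: {share:.0%} of parent, {ratio:.1f}x the next child")
# reduce-raw-frames: 91% of parent, 15.7x the next child
```

The 15.7x ratio against broadcast-agg is what the tree annotation rounds to 16×. Applying the same ranking recursively to the slowest child narrows the search one level at a time.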
Example 7 — Decode-frame summary at high frame count
When the liaison processes > 19 frames (e.g. a wide fanout returning many partials), per-frame decode spans are summarized:
reduce-raw-frames (95ms) frames_in=100
└── decode-frame-summary
frames_total=100 frames_emitted_individually=0 frames_skipped=100
decode_ns_total=84_000_000 decode_ns_p50=820_000 decode_ns_p99=1_280_000 decode_ns_max=1_910_000
The summary fully represents every frame’s decode duration. decode_ns_p99 ≫ p50 would point at a single slow decode worth investigating; here the distribution is tight, so no individual frame's decode stands out.
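The tightness of the decode distribution can be expressed as a couple of ratios. In this Python sketch the tag values are taken from the example above as strings; the helper name decode_stats and the threshold of 10 used to flag skew are assumptions, not values defined by BanyanDB.

```python
def decode_stats(tags: dict) -> dict:
    """Derive mean decode time and tail ratios from decode-frame-summary tags."""
    frames = int(tags["frames_total"])
    total = int(tags["decode_ns_total"])
    p50 = int(tags["decode_ns_p50"])
    p99 = int(tags["decode_ns_p99"])
    slowest = int(tags["decode_ns_max"])
    return {
        "mean_ns": total / frames,
        "p99_over_p50": p99 / p50,
        "max_over_p50": slowest / p50,
    }


summary_tags = {
    "frames_total": "100",
    "decode_ns_total": "84000000",
    "decode_ns_p50": "820000",
    "decode_ns_p99": "1280000",
    "decode_ns_max": "1910000",
}

stats = decode_stats(summary_tags)
SKEW_THRESHOLD = 10  # hypothetical cut-off for "p99 far above p50"
skewed = stats["p99_over_p50"] > SKEW_THRESHOLD
print(f"mean={stats['mean_ns']:.0f}ns p99/p50={stats['p99_over_p50']:.2f} skewed={skewed}")
# mean=840000ns p99/p50=1.56 skewed=False
```

A mean of 840000 ns sits close to the p50 of 820000 ns, which is consistent with a tight distribution and no dominant outlier.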
Wire contract
Under raw wire mode, trace-off responses are still opaque raw frame bytes beginning with RawFrameMagicLeadingByte. Trace-on responses use the existing measure.v1.InternalQueryResponse envelope with raw_frame_body and trace populated. Proto data_points responses under raw mode are rejected loudly by the vectorized collector.