DSL Debug API
Status: shipped. Sample-based runtime debugger for the three DSLs that drive SkyWalking analysis: MAL (meter analysis language), LAL (log analysis language), and OAL (observability analysis language). Design: SWIP-13.
Per-DSL operator references
Each DSL has its own probe surface, payload shape, and rule-key conventions — pick the page that matches the rule you’re debugging:
- MAL — meter rules under
otel-rules,log-mal-rules,telegraf-rules. Each captured record is oneSampleFamilywalking through the rule end-to-end (filter → chain ops → meterEmit). Sample payloads carry the completeSampleFamily(every sample’s name + labels + value + timestamp). - OAL — per-metric dispatcher capture under
catalog=oal. Each record is oneISourcewalking through (source entry → filter clauses → aggregation function → emit). Source samples carry the richServiceRelation-style payload (sourceServiceName, destServiceName, layers, latency, status, detectPoint, …). - LAL — log analysis under
catalog=lal. Each record is one log walking through (text → parser → extractor statements → sink).granularity=statementemits one sample per extractor statement;granularity=block(default) collapses extractor into one sample.
Kept-only capture for MAL (filter is the relevance gate)
MAL is the only DSL where rejected executions are dropped from
records[]. The reason is tag-cardinality noise: MAL filters
discriminate on tag values (service_name == 'foo',
kind == 'http_server_request_duration') and in a multi-tenant or
multi-component flow, a single rule may reject 99% of the traffic
routed to it. If every rejection landed as a record, the default
recordCap (32) would fill with noise within seconds and the capture
would never reach a row demonstrating the rule’s actual processing.
So for MAL the contract is: every record represents one
SampleFamily that passed the rule’s filter clause and walked through
to meterEmit. The first sample is the input that the filter
accepted; subsequent samples follow the chain ops; the last is the
emit.
OAL captures both kept and rejected executions. OAL filters are
deterministic discriminators (CLIENT-vs-SERVER, layer matchers,
status / latency predicates), not high-cardinality tag noise — seeing
the rejected-source samples (continueOn=false) is the filter doing
its job in plain view, useful for verifying partition logic.
LAL has no analogue (abort() is a per-statement short-circuit,
not a filter); aborted logs publish their accumulated samples with
continueOn=false on the abort point so operators can see where the
rule gave up.
For MAL operators debugging “why isn’t my data being aggregated?”: the runtime-rule list endpoint surfaces the rule’s filter source; combine with knowledge of the input tags to identify the mismatch. The debug API’s job is to show what the rule did with traffic it accepted.
Common ground
What the three DSLs share:
- One REST surface: routes mount under
/dsl-debugging/*on theadmin-serverHTTP host (default port17128). The shape of session lifecycle (install, poll, stop) is identical across DSLs — only the rule-key tuple and payload schema differ. - One enable contract: requires
SW_ADMIN_SERVER=defaultandSW_DSL_DEBUGGING=default. Probe insertion is gated by the boot-time switchSW_DSL_DEBUGGING_INJECTION_ENABLED(defaulttrue— once the module is enabled, probes fire and sessions record samples). Set this tofalseonly if the REST surface is wanted but probe overhead is not; flipping requires an OAP restart. - One cluster contract: install / collect / stop broadcast to every
reachable peer over the admin-internal gRPC bus (default port
17129, owned byadmin-server) — NOT the public agent / cluster gRPC port (core.gRPCPort, default11800). Privileged admin RPCs stay off the agent network; operators bind17129to a private peer-to-peer interface only. Each peer’s slice is self-contained — no L2 join. - One retention contract: per-session
recordCap(default1000, hard cap10000) andretentionMillis(default 5 minutes, hard cap 1 hour). Requests outside these bounds return400 invalid_limits. A missedstopself-cleans on retention timeout. - LB-safe routing: any node can serve any verb. POST mints the
sessionIdon the receiving node, broadcasts install to all peers, and reports per-peer install state inpeers[]. GET/stop on a different node aggregate the cluster slices for the same id. So an L7 LB in front of the admin port (default port17128) can route any request to any OAP, and the cluster fan-out hides which node actually owns the rule. - One timestamp per slice: the GET response envelope carries a
single
capturedAt(millis) stamped at GET time. Per-record stamps were redundant — probes fire at clock-tick speed and intra-slice ordering is preserved by array order. sourceTextis the verbatim ANTLR slice: pulled viactx.getStart().getInputStream().getText(Interval.of(start, stop))at parse time, so whitespace and identifier spelling are byte-identical to the original.yaml/.oal. This applies to MAL chain segments, OAL filter clauses, OAL aggregation calls, and LAL extractor statements. Operators can grep the capturedsourceTextagainst the source file directly.
Each DSL’s page has its own walkthrough — start there for a copy-paste example targeting that DSL’s rules.
Session lifecycle
| Verb | Path | Purpose |
|---|---|---|
| POST | /dsl-debugging/session?catalog=&name=&ruleName=&clientId=&granularity= |
Start a debug capture session. |
| GET | /dsl-debugging/session/{id} |
Snapshot the captured records. |
| POST | /dsl-debugging/session/{id}/stop |
Stop a session (idempotent). |
| GET | /dsl-debugging/sessions |
List active sessions on this node. |
| GET | /dsl-debugging/status |
Module posture (phase, injectionEnabled). |
granularity is LAL-only; MAL/OAL recorders ignore it. Body fields
(recordCap, retentionMillis, granularity) are optional; defaults from
the table above apply.
OAL rule picker (read-only)
The DSL debug API also exposes a read-only listing of the OAL rules loaded by an OAP node, used by the debugger UI’s rule picker:
| Verb | Path | Purpose |
|---|---|---|
| GET | /runtime/oal/files |
List the OAL files this OAP loaded at boot, with their sources and registered rules. |
| GET | /runtime/oal/files/{name} |
Raw OAL content (UTF-8 text, classpath-bound). |
| GET | /runtime/oal/rules/{source} |
Rules registered against a single source (debug-key resolver). |
These do not require injectionEnabled=true — they are pure metadata.
Failure modes
Common across DSLs. Per-DSL pages list any DSL-specific failures.
| Response | Meaning |
|---|---|
400 invalid_catalog |
Wire catalog doesn’t match any known DSL. |
400 missing_param |
name, ruleName, or clientId query param missing. clientId is required so the receiving node can clean up the prior session bound to it before installing a new one. |
400 invalid_limits |
recordCap / retentionMillis outside the per-session bounds (see above). |
429 too_many_sessions |
Active-session ceiling on the receiving node (200) is full. Stop a session or wait for retention. |
404 rule_not_found |
No live DSL artifact for (catalog, name, ruleName) on any reachable OAP node — every node and peer reported NOT_LOCAL. The 404 body carries peers[] so operators can see which OAP rejected what. |
404 session_not_found |
Session id unknown to every reachable OAP node — never created, retention timed out, or already stopped. |
503 injection_disabled |
injectionEnabled=false; capture is permanently off until OAP restarts with the flag on. |
500 registry_misconfigured |
A recorder factory wiring bug — file an issue. |
Configuration reference
See Configuration vocabulary for the full table. The DSL-debug-relevant keys:
| Key | Default | Purpose |
|---|---|---|
SW_ADMIN_SERVER |
(empty) | Enables the shared admin HTTP host. Must be set. |
SW_DSL_DEBUGGING |
(empty) | Enables the DSL debug API. |
SW_DSL_DEBUGGING_INJECTION_ENABLED |
true |
Boot-time codegen switch. Set false to disable probes. |
SECURITY: capture payloads include sensitive operational state (raw log bodies, parsed maps, MAL builder state, OAL source events). Treat the admin port as authenticated infrastructure — see Admin API readme — Security Notice.