DSL Debug API — MAL

Status: shipped. Operator reference for the MAL slice of the DSL Debug API. Design: SWIP-13. Index of related pages: DSL Debug API overview.

What it captures

A MAL session attaches to one metric rule. Every scrape window that survives the rule’s file-level filter produces one record in the response; within a record, each probe stage the expression executes appends one sample. The wire shape is:

nodes[]
  records[]
    startedAtMs                  — record boundary timestamp (ms)
    dsl                          — verbatim per-rule DSL text
    rule                         — rule envelope:
      metricPrefix
      name                       — per-rule name (no prefix)
      filter                     — file-level filter closure body, if any
      exp                        — `exp:` body verbatim
      expSuffix                  — file-level expSuffix verbatim, if any
    samples[]
      type                       — input | filter | function | output
      sourceText                 — verbatim DSL fragment for this probe
      continueOn                 — true (MAL captures kept-only; see overview)
      payload                    — SampleFamily.toJson() at this probe stage
      sourceLine                 — omitted for MAL (no per-line mapping)

Sample types and the probes that emit them:

type Probe Fired when
filter captureFilter The file-level filter: closure runs over the input samples (kept-only).
input captureInput The metric reference at the head of the expression resolves a SampleFamily.
function captureStage An in-expression chain op runs (sum, tagEqual, service, etc.).
function captureDownsample A downsampling op runs (e.g. rate("PT1M")).
output captureMeterEmit The metric is emitted to the persistence pipeline (terminal).

sample.sourceText is the verbatim ANTLR slice of the chain segment from the original exp: body — operators can grep the captured text against the source byte-for-byte. There is no leading . (the dot is part of the chain context, not the MethodCallContext slice).

sample.payload is the structured SampleFamily.toJson() at that probe stage — every sample’s name, label set, value, and timestamp is present; truncated at maxSamplesPerCapture (default 64) with a +N more summary.

When no session is bound, the codegen-emitted probe call sites are single volatile-bool reads that JIT eliminates after warm-up — idle cost is effectively free.

Enabling

Two selectors must be enabled — the shared admin HTTP host (admin-server) and the DSL-debug feature on top of it:

SW_ADMIN_SERVER=default
SW_DSL_DEBUGGING=default

injectionEnabled is a boot-time codegen switch, default true once the dsl-debugging module is enabled — the MAL generator emits per-rule GateHolder fields and probe call sites, so debug sessions actually capture samples. Set false only if the REST surface is wanted but no codegen-side probe overhead is acceptable; with false the MAL bytecode is byte-identical to a build without SWIP-13, and POST /dsl-debugging/session returns 503 injection_disabled. Flipping the flag requires an OAP restart:

SW_DSL_DEBUGGING_INJECTION_ENABLED=false   # default is true; set false to disable probes

SECURITY: capture payloads include MAL builder state and sample-family contents. Treat the admin port as authenticated infrastructure — see Admin API readme — Security Notice.

Picking the rule key

A session targets one MAL metric rule. The key tuple is (catalog, name, ruleName):

Field Source
catalog One of otel-rules, log-mal-rules, telegraf-rules — the directory the rule file lives in
name The rule file name, without .yaml
ruleName The full metric name (metricPrefix + _ + per-rule name)

Example — the shipped otel-rules/vm.yaml declares a metric prefix vm and per-rule name cpu_total_percentage. The full metric name is vm_cpu_total_percentage. The session install call:

POST /dsl-debugging/session?catalog=otel-rules&name=vm&ruleName=vm_cpu_total_percentage

To list the metrics a runtime-rule MAL file exposes, query GET /runtime/rule/list and pull the ruleNames associated with the catalog/name pair (the runtime-rule receiver records every rule’s metric catalog).

End-to-end example

The example uses a runtime-rule-applied MAL rule with a top-level filter clause so all probe stages (filterinputfunctionoutput) appear in the captures.

1. Apply the rule

# /tmp/mal-with-filter.yaml
filter: "{ tags -> tags.service_name == 'my-svc' }"
metricPrefix: e2e_demo
expSuffix: service(['service_name'], Layer.GENERAL)
metricsRules:
  - name: filtered_requests
    exp: e2e_demo_request_count_total.sum(['service_name'])
curl -s -X POST -H 'Content-Type: text/plain' \
     --data-binary '@/tmp/mal-with-filter.yaml' \
     'http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=mal-with-filter'

2. Open a debug session

curl -s -X POST \
     'http://OAP:17128/dsl-debugging/session?catalog=otel-rules&name=mal-with-filter&ruleName=e2e_demo_filtered_requests&clientId=alice'

3. Drive ingest, then poll

curl -s 'http://OAP:17128/dsl-debugging/session/SESSION_ID'

A trimmed slice (one record = one scrape window):

{
  "sessionId": "76b3266a-...",
  "capturedAt": 1777967923700,
  "ruleKey": { "catalog": "otel-rules", "name": "mal-with-filter",
               "ruleName": "e2e_demo_filtered_requests" },
  "nodes": [{
    "nodeId": "0.0.0.0_11800",
    "status": "ok",
    "records": [{
      "startedAtMs": 1777967921000,
      "dsl": "(e2e_demo_request_count_total.sum(['service_name'])).service(['service_name'], Layer.GENERAL)",
      "rule": {
        "metricPrefix": "e2e_demo",
        "name": "filtered_requests",
        "filter": "{ tags -> tags.service_name == 'my-svc' }",
        "exp": "e2e_demo_request_count_total.sum(['service_name'])",
        "expSuffix": "service(['service_name'], Layer.GENERAL)"
      },
      "samples": [
        { "type": "filter",
          "sourceText": "{ tags -> tags.service_name == 'my-svc' }",
          "continueOn": true,
          "payload": {
            "families": 1,
            "items": [ /* one entry per surviving SampleFamily  name, samples count, items[] */ ]
          } },
        { "type": "input",
          "sourceText": "e2e_demo_request_count_total",
          "continueOn": true,
          "payload": { /* head SampleFamily  name, samples, items[] */ } },
        { "type": "function",
          "sourceText": "sum(['service_name'])",
          "continueOn": true,
          "payload": { /* SampleFamily after sum */ } },
        { "type": "output",
          "sourceText": "e2e_demo_filtered_requests",
          "continueOn": true,
          "payload": { /* terminal meter sample  metric, entity, value, timeBucket */ } }
      ]
    }]
  }]
}

sample.sourceText is the verbatim ANTLR slice — match it against the exp: body byte-for-byte. The record-level rule envelope echoes the structured rule config so operators don’t have to re-resolve the file.

4. Stop

curl -s -X POST 'http://OAP:17128/dsl-debugging/session/SESSION_ID/stop'

Cluster behaviour

  • Install broadcasts to every reachable peer; each peer binds its own recorder on its own holder so the slice reflects local L1 parsing.
  • Collect broadcasts and concatenates per-node slices into nodes[]; unreachable peers appear as status: "unreachable" rather than being omitted.
  • Stop broadcasts; missed acks fall out via per-node retention timeout (default 5 minutes).

No cross-node merge — each slice is self-contained.

Failure modes

Response Meaning
400 invalid_catalog The wire catalog is not one of the MAL catalogs.
400 missing_param name or ruleName is missing.
404 rule_not_found No live MAL artifact for the tuple on this node — rule never loaded, was inactivated, or this node hasn’t compiled it yet.
503 injection_disabled injectionEnabled=false. Restart with the flag on to debug.
500 registry_misconfigured A recorder factory wiring bug — file an issue.

Limits

Field Default Purpose
recordCap 1000 Max records before the recorder marks itself captured and refuses appends.
retentionMillis 300000 (5m) Wall-clock retention; the session is reaped after the deadline whether or not it was explicitly stopped.

Override per-session in the POST /dsl-debugging/session body:

{ "recordCap": 200, "retentionMillis": 600000 }

See also