SWIP-13 Live Debugger for MAL / LAL / OAL
Status: implemented in 10.5.0. Operator reference: DSL Debug API.
Motivation
SkyWalking has three DSLs that turn ingested telemetry into L1 outputs (metrics,
logs, records ready for downstream aggregation/persistence): MAL (meter analyzer),
LAL (log analyzer), and OAL (observability analysis). All three are compiled at
OAP boot or via runtime-rule hot-update; the operator authoring the rule then has to
infer correctness from downstream effects — “did the metric appear in the dashboard?”,
“did the log get classified into the right layer?”, “why is endpoint_cpm reading zero
on this endpoint?”.
This SWIP is scoped to data-parsing debugging — we capture how each DSL transforms
its input into the L1 output that the rule emits. We do not follow the data past
L1: there is no L2 aggregation, no cross-node merge, no storage round-trip in scope.
The boundary is the store / emit stage at the tail of each DSL’s pipeline. End-to-end
debugging (L1 → L2 → query → UI) is out of scope.
When a rule misbehaves the operator’s only tools today are:
- read the YAML /
.oalsource, - guess which clause is wrong,
- enable DEBUG logs and grep the OAP log for hints,
- adjust + redeploy + wait.
This is acceptable for static rules edited once a release. With runtime-rule hot-update (PR #13851) shipping per-file edits in seconds, the iteration loop becomes the main cost. We need a debugger that shows what the live ingest actually does at every clause of the rule.
This SWIP proposes a passive sampling debugger that captures the actual data flowing through MAL / LAL / OAL pipelines on demand — from the raw input the DSL receives, through every clause, to the L1 output the rule emits — returns a step-by-step visualization payload to the UI, and costs effectively nothing when not in use.
Architecture Graph
┌─────────── operator UI / swctl / curl ───────────┐
│ POST /dsl-debugging/session { clientId, ... } │
│ GET /dsl-debugging/session/{id} → payload │
└──────────────────────────┬───────────────────────┘
│ admin port (default 17128)
┌────────────────────────────────────────┼──────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌────────────────────────┐ bus ┌────────────────────────┐ bus ┌────────────────────────┐
│ OAP node A (entry) │ ─────────► │ OAP node B (peer) │ ─────────► │ OAP node C (peer) │
│ │ │ │ │ │
│ ┌──────────────────┐ │ │ ┌──────────────────┐ │ │ ┌──────────────────┐ │
│ │ admin-server │ │ │ │ admin-server │ │ │ │ admin-server │ │
│ │ (Armeria HTTP) │ │ │ │ (Armeria HTTP) │ │ │ │ (Armeria HTTP) │ │
│ │ /runtime/rule/* │ │ │ │ /runtime/rule/* │ │ │ │ /runtime/rule/* │ │
│ │ /dsl-debugging/*│ │ │ │ /dsl-debugging/*│ │ │ │ /dsl-debugging/*│ │
│ └──┬─────────┬─────┘ │ │ └──┬─────────┬─────┘ │ │ └──┬─────────┬─────┘ │
│ │ │ │ │ │ │ │ │ │ │ │
│ ▼ ▼ │ │ ▼ ▼ │ │ ▼ ▼ │
│ runtime- dsl- │ │ runtime- dsl- │ │ runtime- dsl- │
│ rule debugging │ │ rule debugging │ │ rule debugging │
│ plugin plugin │ │ plugin plugin │ │ plugin plugin │
│ │ │ │ │ │ │ │ │
│ ▼ │ │ ▼ │ │ ▼ │
│ GateHolder per rule │ │ GateHolder per rule │ │ GateHolder per rule │
│ ─ gate (volatile) │ │ ─ gate (volatile) │ │ ─ gate (volatile) │
│ ─ recorders[] (CoW) │ │ ─ recorders[] (CoW) │ │ ─ recorders[] (CoW) │
│ │ │ │ │ │ │ │ │
│ ┌───────────▼──────┐ │ │ ┌───────────▼──────┐ │ │ ┌───────────▼──────┐ │
│ │ generated MAL / │ │ │ │ generated MAL / │ │ │ │ generated MAL / │ │
│ │ LAL / OAL byte- │ │ │ │ LAL / OAL byte- │ │ │ │ LAL / OAL byte- │ │
│ │ code calls │ │ │ │ code calls │ │ │ │ code calls │ │
│ │ {MAL,LAL,OAL}- │ │ │ │ {MAL,LAL,OAL}- │ │ │ │ {MAL,LAL,OAL}- │ │
│ │ Debug.capture* │ │ │ │ Debug.capture* │ │ │ │ Debug.capture* │ │
│ └───────────┬──────┘ │ │ └───────────┬──────┘ │ │ └───────────┬──────┘ │
│ │ │ │ │ │ │ │ │
│ ▼ │ │ ▼ │ │ ▼ │
│ per-session │ │ per-session │ │ per-session │
│ JSON record list │ │ JSON record list │ │ JSON record list │
└────────────────────────┘ └────────────────────────┘ └────────────────────────┘
▲
│ collect partials over admin-internal gRPC bus on GET
│
operator's GET request (UI / curl)
A new oap-server/server-admin/ parent groups the admin-side modules.
Three sibling modules under it cooperate, plus server-core provides
the classpath contract:
server-admin/admin-server— a new shared module providing one Armeria HTTP server, one admin port (default17128), and a route-registration SPI. Hosts every admin REST surface (write/dangerous APIs that operators run from a jumphost or CI). Replaces the runtime-rule plugin’s own embedded HTTP server.server-admin/runtime-rule— the existing runtime-rule plugin, moved fromserver-receiver-plugin/skywalking-runtime-rule-receiver-plugin/intoserver-admin/as a sibling ofadmin-server. URL paths and YAML block name stay the same; only the Maven module location changes. Registers its REST handlers onadmin-server.server-core(existing module, additive only) — gains the cross-DSL primitives shared by every generated rule class:GateHolder, the markerDebugRecorderinterface, and the OAL-specificOALDebugstatic probes +OALDebugRecorderinterface (OAL’s runtime types —ISource,Metrics— are already in server-core, so its probe surface can live here). MAL- and LAL-specific probes can NOT live here because their signatures take types owned by the analyzer modules (SampleFamily,Sample,ExecutionContext); putting them in server-core would reverse the existing analyzer→server-core dependency direction. They live in their own DSL’s analyzer module instead.analyzer/meter-analyzer(existing module, additive only) — gains theMALDebugstatic probe class and theMALDebugRecorderinterface, both typed toSampleFamily,Sample, andMeterEntity. MAL-generated bytecode referencesMALDebugdirectly; the analyzer module is already on the classpath wherever MAL classes are loaded.analyzer/log-analyzer(existing module, additive only) — gains theLALDebugstatic probe class and theLALDebugRecorderinterface, both typed toExecutionContextandLALOutputBuilder. LAL-generated bytecode referencesLALDebugdirectly.server-admin/dsl-debugging— a new feature module that provides the runtime implementation behind the per-DSL recorder interfaces. It depends onserver-core,analyzer/meter-analyzer, andanalyzer/log-analyzer(this is the natural feature → analyzer → core direction). Holds:MALDebugRecorderImpl/LALDebugRecorderImpl/OALDebugRecorderImpl(one per DSL, each with hand-written JSON serializers for that DSL’s artifacts), the per-node session registry (Map<SessionId, DebugSession>+Map<ClientId, SessionId>), the install / uninstall paths that mutate the holders’ recorder arrays, the admin-internal gRPC fan-out RPCs, and the REST handler. Registers its routes on theadmin-serverHTTP host and its cluster gRPC service on the admin-internal gRPC bus (both owned byadmin-server).
Capture lives inside the DSL-generated bytecode (one extra
gate-guarded static call per stage boundary). The cross-DSL primitives
(GateHolder, RuleKey, the DebugRecorder marker interface) live
in server-core. Each DSL’s probe class lives in that DSL’s home
module: MALDebug in analyzer/meter-analyzer, LALDebug in
analyzer/log-analyzer, OALDebug in server-core (OAL runtime
types are already there). Recorder implementations and the session
registry live in dsl-debugging. Generated MAL bytecode references
MALDebug from its own module, generated LAL bytecode references
LALDebug from its own module, and generated OAL bytecode references
OALDebug from server-core — every invokestatic resolves through
the existing analyzer → server-core dependency direction; no reverse
deps are introduced.
Cluster fan-out rides the admin-internal gRPC bus (the same
transport runtime-rule registers its RuntimeRuleClusterServiceImpl
on); dsl-debugging ships its own gRPC service implementation
(DSLDebuggingClusterServiceImpl) for its four RPCs
(InstallDebugSession / CollectDebugSamples / StopDebugSession /
StopByClientId).
Runtime-rule’s RuntimeRuleClusterServiceImpl stays exactly where it is
— the two services are independent, register independently, and either
module can be enabled or disabled without affecting the other. Both ride
the admin-internal gRPC bus owned by admin-server (default port
17129) — a dedicated transport separate from the public agent /
cluster gRPC port (core.gRPCPort, default 11800) so privileged admin
RPCs stay out of the agent network’s blast radius. Peer discovery is
handled by AdminClusterChannelManager (admin-server module service),
which dials each peer’s admin-internal port via the cluster module’s
existing peer registry.
The status-query-plugin’s existing /debugging/* routes (config dump,
MQE step-through, trace step-through) stay where they are on the query
port — those are read-only query-time inspection helpers and have a
different audience from admin write APIs.
Proposed Changes
1. The data contract returned to the UI
The Claude Design handoff (Runtime Rule Admin canvas) pinned three shapes the
UI expects. The backend is the source of truth — UI is a strict visualizer.
The capture target is shaped per DSL, not generic — each DSL has its own workflow with its own terminal artifact handed to L1, and the data contract mirrors those terminal artifacts. L1 / L2 / cross-node aggregation / persistence are explicitly out of scope. The debugger always stops before the L1 hand-off — at the point where the rule has produced its final L1-bound object but hasn’t yet entered the streaming/aggregation kernel.
Kept-only capture for MAL (the filter is high-cardinality)
MAL is the only DSL where rejected executions are dropped from
records[]. MAL filters discriminate on tag values (service_name,
kind, component) and in a multi-tenant or multi-component flow a
rule routinely rejects 99% of the traffic routed to it by metric
name. Publishing every rejected execution would burn recordCap
(default 1000) on noise within seconds and never reach a row that
demonstrates the rule’s actual processing. So for MAL the contract
is: every record represents one SampleFamily that passed the
rule’s filter and walked through to meterEmit. Implementation:
MALDebugRecorderImpl.appendFilter(kept=false) calls
discardCurrentExecution(); the in-flight execution is silently
dropped.
OAL captures both kept and rejected executions. OAL filters are
deterministic discriminators (CLIENT-vs-SERVER, layer matchers,
status / latency predicates) — not high-cardinality tag noise.
Rejected-source samples (continueOn=false) are the filter doing its
job in plain view, useful for verifying that the partition logic
fires the way the operator expects. Implementation: each
appendFilter call adds a sample regardless of kept; on
kept=false the recorder also calls publishCurrentExecution()
because the FTL’s return ends the per-source pipeline at that
point, so the next source’s appendSource mustn’t bleed into this
record.
LAL has no analogue (abort() isn’t a filter — it’s a
per-statement short-circuit); aborted logs publish their accumulated
samples with continueOn=false on the abort point so operators can
see where the rule gave up.
MAL — two-half pipeline: SampleFamily transforms → MeterSystem build
MAL is the only DSL whose terminal artifact is not a SampleFamily.
expression.run(samples) returns a final SampleFamily, but the rule’s L1-bound
output is what Analyzer.doAnalysis builds after that — one
AcceptableValue<Long | DataTable | BucketedValues | PercentileArgument> per
MeterEntity, populated via meterSystem.buildMetrics(...).accept(...). That
AcceptableValue is what meterSystem.doStreamingCalculation(...) consumes;
that call is L1 entry, so capture stops one line before it.
Wire shape
The collect response is a hierarchy of nodes[].records[].samples[]. One
record equals one full execution of the rule’s pipeline (one
SampleFamily for MAL, one ISource for OAL, one log for LAL); samples
within a record are the probe firings in execution order.
Per-record fields (rule-pipeline execution):
| Field | Meaning |
|---|---|
startedAtMs |
Wall-clock millis when the record opened (the first probe of the execution). |
dsl |
Verbatim per-rule DSL text. Stamped per record so a session that spans a hot-update can render the right rule version per record. |
rule |
Structured rule envelope: MAL carries {metricPrefix, name, filter, exp, expSuffix}; LAL/OAL carry {ruleName, sourceLine}. Lets operators read the rule’s intent without re-resolving the source file. |
samples[] |
Probe-emitted samples in execution order. |
Per-sample fields:
| Field | Meaning |
|---|---|
type |
input / filter / function / aggregation / output — the probe class. |
sourceText |
The verbatim ANTLR slice of the DSL fragment that produced this sample (chain ops, filter clauses, aggregation calls, LAL statements). Lets operators grep against the source byte-for-byte. |
continueOn |
true when the pipeline continued past this step; false on rejected filter branches (OAL) or abort() paths (LAL). MAL is kept-only — no false rows. |
payload |
DSL-specific structured snapshot — see per-DSL contract below. |
sourceLine |
1-based line number in the source .yaml / .oal / LAL DSL block. Omitted when 0 (e.g. label-only probes — see “Line-number granularity in MAL” below). |
Per-DSL probe → sample-type mapping, in pipeline order:
| Catalog | Probe (call site) | type |
Payload |
|---|---|---|---|
| MAL | captureFilter (file-level) |
filter |
SampleFamily.toJson() after the file-level closure (kept-only: rejected branches are dropped). |
| MAL | captureInput |
input |
head SampleFamily of the exp: body. |
| MAL | captureStage × N |
function |
SampleFamily after each chain op (sum, tagEqual, service, …); sourceText is the verbatim segment without the leading dot. |
| MAL | captureDownsample |
function |
SampleFamily after downsampling; sourceText is the function call (e.g. rate("PT1M")). |
| MAL | captureMeterEmit (terminal) |
output |
terminal meter sample with metric name + entity + value + timeBucket. |
| OAL | captureSource |
input |
ISource.toJson() — full source columns under fields{} (e.g. ServiceRelation source attributes). |
| OAL | captureFilter |
filter |
post-filter source view; both kept (continueOn=true) and rejected (continueOn=false) branches are emitted (filters are deterministic discriminators). |
| OAL | captureAggregation |
aggregation |
Metrics.toJson() after the aggregation function (cpm(), percentile2(10), …). |
| OAL | captureEmit (terminal) |
output |
Metrics.toJson() with count / total / value / timeBucket from the metric family’s appendDebugFields hook. |
| LAL | appendText (entry) |
input |
raw input only (LogData rendered via LogDataDebugDump); fires before LALOutputBuilder.init(). |
| LAL | appendParser / appendExtractor |
function |
block-mode synopses — LALOutputBuilder.outputToJson() snapshot of the builder so far. |
| LAL | appendLine |
function |
statement-mode probe (one per extractor statement); sourceText carries the verbatim statement; payload is the builder snapshot. |
| LAL | appendOutputRecord (terminal) |
output |
final builder snapshot — DB-bound row with merged tags[] (each carries `status: original |
| LAL | appendOutputMetric (terminal) |
output |
the SampleFamily handed off to MAL. |
LAL input/output payload split. The first sample on every LAL record
carries only the raw input; all subsequent samples carry only the
output (LALOutputBuilder.outputToJson()). The framework renders raw
input directly off ctx.input() on the input probe — the
LALOutputBuilder is not consulted there because init() runs at the
sink stage. This keeps the wire small (no redundant input on every
probe) and avoids the stale-cache empty-input bug that would arise if
the dispatch went through a not-yet-initialised builder.
The terminal output capture is the boundary; the debugger does not
follow the persisted entity further into L1. For MAL meter emits, the
typed T (one of Long / DataTable / BucketedValues /
PercentileArgument) is exposed in the payload so the UI can render
the correct shape per metric type.
Downsampling — explicit vs. default. A rule’s downsampling: field is
optional; when the operator omits it MAL fills in a default based on the
metric type (typically AVG for gauges, SUM for counters; see the MAL
compiler’s metric-type resolution rules). The UI must distinguish the two
cases so operators don’t read a defaulted AVG as a value they declared.
The MalDownsample capture carries an origin field — "explicit" when
the YAML’s downsampling: is present, "default" when MAL filled it in
— and the UI renders the row with a “default” badge in the latter case
(e.g., downsampling: AVG (default)).
The UI renders the per-rule debug payload as a table whose left column is
sourceText (with sourceLine as a small line-number gutter) and right
column is the stage’s result snapshot. Operators read the rule top-to-bottom
in the source they wrote, with the output of each line beside it — no
translating from synthetic stage ids back to source.
Line-number granularity in MAL. The MAL exp: field in YAML is one
scalar string — often line-folded across two or three physical lines. ANTLR
parses the whole expression as a single source unit, so every chain stage
emitted from that expression carries the same sourceLine (the YAML
line where the exp: key sits). This is honest rather than synthetic;
attempting to back-compute per-stage line numbers from sub-positions inside
a folded scalar would be brittle and misleading. Per-stage discrimination
comes from sourceText — the verbatim fragment is unique per stage even
when lines collide. The UI line-gutter shows the same number twice when
that’s the truth, and operators recognise the fragments anyway because
they wrote them.
LAL is different — its DSL has one statement per line in the YAML body
(json {}, service parsed.x as String, tag 'k': v, etc.), so each
captured LAL statement has a distinct sourceLine. OAL captures per
clause with each clause’s actual ANTLR-derived line; same-line clauses
share a number, distinct-line clauses each get their own. sourceText
carries the fragment-level detail regardless. (See the OAL data
contract below for the full per-clause table.)
File-level filter fires before expression.run(). The probe ordering
within one ingest pass is:
MALDebug.captureFilter (file `filter:` block — once, on the whole input map)
MALDebug.captureInput (per metric reference inside expression.run — already filtered)
MALDebug.captureStage × N (each chain stage)
MALDebug.captureScope (after the expSuffix scope-binding stage)
MALDebug.captureDownsample (rule-level metadata, fires once)
MALDebug.captureMeterBuild (per MeterEntity, in Analyzer post-chain)
MALDebug.captureMeterEmit (per MeterEntity, immediately before doStreamingCalculation)
MALDebug.captureInput therefore reports the post-filter sample count, not
the raw count the receiver delivered — the file-level filter has already
trimmed the input map by the time the generated run() reads it.
LAL — typed output, not “unified log”
LAL’s terminal artifact is the typed output builder the rule produces.
The codegen calls ctx.setOutput(new <OutputType>()) at the start of
execute(); the extractor populates fields via typed setters; then
RecordSinkListener.parse() calls builder.init(metadata, input, moduleManager) to fill in standard fields, and build() calls
builder.complete(sourceReceiver). The L1 boundary is complete(...);
capture is the populated builder just before that call.
OutputType is per-rule and varies — LogBuilder, EnvoyAccessLogBuilder,
DatabaseSlowStatementBuilder, SampledTraceBuilder, custom subclasses
contributed by LALSourceTypeProvider SPI. The UI must show the actual type
name and its full populated state, not flatten everything to “LogData”.
A LAL rule may also emit metrics into MAL via MetricExtractor.submitMetrics (ctx, sampleBuilder). The Sample produced is wrapped into a SampleFamily
and either appended to ctx.metricsContainer() (when called inside a MAL
fanout) or fed to provider.getMetricConverts().forEach(it -> it.toMeter(...)).
This second emission is independent of the typed record output — some rules
emit both, some only one, some neither.
Per-record blocks (operator toggles per-block sampling; off blocks return
null cells):
| Block | Captured artifact |
|---|---|
text |
raw log body string + body-type (json / yaml / text / none) |
parser |
parsed map / typed proto object — the structure parsed.* resolves against |
extractor |
populated typed output builder (snapshot mid-flight) + extracted tags + resolved def variables |
sink |
branch taken (enforcer / sampler / rateLimit) + kept / dropped + reason |
output_record |
the typed output builder right before complete() — includes outputType class name + every populated field |
output_metric |
the SampleFamily (≥ 1 Sample) about to be handed to MAL — only present when the rule has a metrics{} block |
output_record and output_metric are independent — each is null if the
rule didn’t produce that output (output_record is null on a sink-dropped
record; output_metric is null for any rule without metrics{}).
output_metric does not chain into a separate MAL debug session.
Per-record envelope:
{ id, ts, svc, body_type, text, parsed, extracted, def_vars,
sink: { branch, kept, note },
output_record: { outputType, builderState } | null,
output_metric: { sampleFamily } | null }
Granularity — block vs. statement (opt-in)
The block view above (5 fixed blocks + 2 outputs) answers most operator questions: did the parser succeed → did the extractor populate fields → did the sink keep the record. For “step through every DSL line in my extractor” investigations, sessions can opt into statement-level capture via a request-time flag.
| Granularity | Capture points per log record | Record-cap math |
|---|---|---|
block (default) |
5 fixed blocks + 2 output emit points = 7 cells per log record | 8 records × 7 = 56 cells worst case |
statement (opt-in) |
7 block points plus one cell per meaningful DSL statement (each service/tag/def/if/metrics line) |
8 records × (7 + N statements) cells worst case |
statement mode burns the per-session record cap faster — a rule with 12
extractor statements + sink branches produces ≈ 19 cells per log record
versus 7 in block mode. The
per-session record cap
treats each cell uniformly, so an operator selecting statement should
lower recordCap in the request body to stay within the session’s
worst-case heap budget. The session payload reports
granularity: "block" | "statement" so the UI knows which renderer to
use.
Statement-level capture maps each cell to its source DSL line via
sourceLine carried through LALScriptModel’s AST nodes (see
LAL hooks) — the UI
fetches the rule YAML once at session start (via /runtime/rule/list)
and reads line sourceLine directly. No sourceText is duplicated in
the captured payload. LAL’s one-statement-per-line convention guarantees
the line is unique per captured statement, so a line number alone is
enough — unlike MAL chain stages and OAL clauses, which often share a
single line and therefore must carry their verbatim fragment as
sourceText.
This granularity flag applies to LAL only. MAL is already
statement-level for free — every method-chain stage is one DSL statement
and the chain capture (MALDebug.captureStage) records it. OAL is single-rule
already (do<Metric>(source) is one logical statement), so the
distinction does not apply.
OAL — Source row → typed Metrics object before in()
OAL is dispatcher-style: each do<MetricsName>(source) method takes one
Source row, optionally filters it, builds a <MetricsName>Metrics object,
copies fields from the source, calls the entry method (combine /
accept / add depending on the function class), and finally calls
MetricsStreamProcessor.getInstance().in(metrics). That in(...) call is
the L1 boundary; capture stops one line before it.
Operator selects one OAL line
(e.g. endpoint_cpm = from(Endpoint.*).cpm();) and configures
(maxRows, windowSec). The picker is fed by the
OAL read-only management API
introduced in this SWIP — without it the UI has no way to discover what
OAL rules exist at runtime.
OAL probe → sample-type mapping (one source event = one record; samples flow source → filter → aggregation → output in execution order):
| Probe (call site) | type |
Origin | Captured artifact |
|---|---|---|---|
captureSource |
input |
OAL clause | ISource.toJson() — the inbound source row (concrete subtype: Endpoint, Service, ServiceRelation, …) under payload.fields{}. sourceText = from(...) clause. |
captureFilter |
filter |
OAL clause (× N) | post-filter source view; both kept (continueOn=true) and rejected (continueOn=false) branches are emitted. One sample per filter clause in the rule. |
captureAggregation |
aggregation |
OAL clause | Metrics.toJson() after the aggregation function. sourceText = the verbatim clause (cpm(), percentile2(10), apdex(name, status)). |
captureEmit (terminal) |
output |
compiler-implicit | Metrics.toJson() of the L1-ready metric (count / total / value / timeBucket from Metrics.appendDebugFields); fired immediately before MetricsStreamProcessor.in(metrics). |
Per-clause is the natural OAL granularity, not an opt-in. OAL is one
logical rule per line in the .oal file, but each rule is composed of
distinct clauses (from(Source.*), optionally .filter(...), ending in
.<aggregationFn>(...)). Every user-written clause becomes its own captured
row carrying its verbatim OAL fragment as sourceText — same principle as
MAL chain stages, just naturally bounded (≤ 5 stages per rule). There is no
granularity: "block" | "statement" flag for OAL because the only useful
view is per-clause; there’s nothing finer to opt into.
Some rules write the whole expression on one physical line
(endpoint_cpm = from(Endpoint.*).cpm();); others split it across lines
for readability. Per-clause sourceLine reports each clause’s actual
position from ANTLR’s source location — operators see distinct lines when
they wrote them on distinct lines, and the same line repeated when they
wrote the rule on one line. sourceText discriminates clauses regardless.
The output sample is the terminal capture (compiler-emitted, no
sourceText — render with an (implicit) badge in the UI). The
aggregation function (CPMFunction, LongAvgFunction,
PercentileFunction, etc.) determines what fields the Metrics
instance carries at emit time. Metrics.appendDebugFields(JsonObject)
is the per-family hook (CPMMetrics emits count / total / value,
PercentileMetrics emits its bucket array, etc.); the UI sees the
concrete class name plus the populated state. The debugger does not
follow the metric into the cross-node L2 / persistence path; cross-node
aggregation is out of scope.
For the per-DSL examples and full payload key reference see the operator pages: MAL, LAL, OAL.
2. Capture mechanism — per-rule-instance GateHolder (no ThreadLocal, no shared registry)
Captures live entirely in the rule’s runtime state, not in a per-thread
context. Each compiled rule class generates its own GateHolder
instance at construction time, with contentHash set as a constructor
argument and made immutable for that holder’s lifetime. Probes embedded
in the generated bytecode read the holder directly:
public final class GateHolder {
// Hot-path field: read by every probe call site. Public + volatile so
// the generated bytecode reads it directly with one volatile load and
// the JIT can hoist the load when the method is hot.
public volatile boolean gate;
// Immutable for this holder's lifetime — set by constructor at codegen,
// never overwritten. Hot-update creates a brand-new holder with the
// new content's hash; the old holder retains its own hash until GC.
private final String contentHash;
// CoW snapshot of currently-bound recorders; mutated only on session
// install / uninstall via the synchronized methods below. Public +
// volatile so generated probe code reads it with one volatile load —
// symmetric with `gate`. Readers MUST treat the array as immutable;
// writers always replace the reference, never mutate the array in place.
public volatile DebugRecorder[] recorders = EMPTY;
private static final DebugRecorder[] EMPTY = new DebugRecorder[0];
public GateHolder(String contentHash) { this.contentHash = contentHash; }
public String getContentHash() { return contentHash; }
/** Called by `DebugSessionRegistry.install(...)`. */
public synchronized void addRecorder(DebugRecorder r) {
DebugRecorder[] old = recorders;
DebugRecorder[] neu = Arrays.copyOf(old, old.length + 1);
neu[old.length] = r;
recorders = neu;
if (old.length == 0) gate = true; // 0→1 transition flips the gate
}
/** Called by `DebugSessionRegistry.uninstall(...)`. */
public synchronized void removeRecorder(DebugRecorder r) {
DebugRecorder[] old = recorders;
DebugRecorder[] neu = removeFromArray(old, r);
recorders = neu;
if (neu.length == 0) gate = false; // 1→0 transition flips the gate
}
}
contentHash is final because the holder belongs to one specific
generated class instance. There is no scenario in which the same
holder serves two different rule contents: hot-update produces a new
class, the new class’s constructor produces a new holder with the new
hash, and the old holder continues to exist (with the old hash) for as
long as the old class instance is referenced by in-flight ingest. Each
captured record stamps record.contentHash = holder.getContentHash()
at append time and inherently picks up the right value because the
probe fires on the holder owned by whichever class the receiver thread
is running.
// idiom emitted by codegen at every probe call site:
if (DEBUG.gate) { // public volatile field — direct read
MALDebug.captureStage(DEBUG, rule, sourceLine, sourceText, family);
}
DEBUG is not static final for MAL and LAL — those DSLs each
have a singleton runtime instance per rule (MalExpression,
LalExpression), so the holder lives as a regular instance field on
that singleton:
// inside a generated MAL rule class:
public final GateHolder DEBUG = new GateHolder("7c3a91…");
// inside a generated LAL rule class:
public final GateHolder DEBUG = new GateHolder("9b12a0…");
For OAL, the per-metric runtime artifact (<MetricName>Metrics) is
constructed per source row — not a singleton. Instead, the
dispatcher singleton (e.g., EndpointDispatcher) holds per-metric
holder instance fields, one per metric the dispatcher routes to:
// inside a generated OAL dispatcher (singleton, held by SourceReceiverImpl):
public final GateHolder gate_endpoint_cpm = new GateHolder("8a21f0…");
public final GateHolder gate_endpoint_resp_time = new GateHolder("8a21f0…");
// ... one per metric in this scope's `do<Metric>` method set
private void do_endpoint_cpm(Endpoint source) {
if (gate_endpoint_cpm.gate) {
OALDebug.captureSource(gate_endpoint_cpm, "endpoint_cpm", ...);
}
// ... existing build / aggregation / emit ...
}
(All metrics in one OAL file share the file’s contentHash, so the
constructor argument is the same for every per-metric holder created
from that file.)
The install / uninstall paths call addRecorder(...) /
removeRecorder(...) directly — no reflection. The methods are
synchronized to serialise the array copy-on-write against concurrent
mutations; the receiver hot-path read takes no lock (it just reads the
volatile array reference). The gate is auto-managed inside
addRecorder / removeRecorder on 0↔1 transitions, so the install
path doesn’t have to remember to flip it separately.
No shared Map<RuleKey, GateHolder> registry. Sessions find the
live holder by asking the rule’s existing analyzer / dispatcher
infrastructure for the live runtime artifact (the MalExpression /
LalExpression / dispatcher singleton) and then calling
.debugHolder(...) on it — see
Install / uninstall.
This is what makes content-hash-overwrite-on-hot-update structurally
impossible: there is no shared map entry to overwrite, no swap window,
no race. Old captures continue on the old holder until the old class
goes out of scope (typically a few hundred ms during drain); new
captures fire on the new holder. Records carry whichever
contentHash was on the holder they were stamped from — the UI sees a
clean version boundary in the session payload.
When no session is bound to the rule, gate=false and JIT eliminates the
call site entirely — ~1 ns idle cost (the volatile load + branch). When
sessions are bound, the probe iterates the (small) recorders array and
each recorder appends to its own session’s payload. Multiple sessions
on the same rule are first-class: each is one entry in the array, all
fed by the same probe call.
The receiver path is unchanged. There is no ThreadLocal, no
analyzer-side wrapper, no per-thread context to install or propagate.
Sessions register themselves on the rule’s holder; receiver threads
running the rule’s generated code read the holder directly. This sits
inside the project’s “no ThreadLocal side-channels” rule documented in
CLAUDE.md — the holder is rule-scoped state, not a thread-local
side-channel.
3. Capture API and per-DSL hooks
This section nails down (a) the static surface generated code calls, (b) the recorder interface that owns per-session state, (c) the receiver-side scope wrapper that installs / uninstalls a recorder for one ingest pass, and (d) exactly what the MAL / LAL / OAL generators emit.
3.1 Per-DSL probe classes — MALDebug, LALDebug, OALDebug
The probe surface splits into three DSL-specific classes, each living
in the same module that owns the DSL’s runtime types. There is no
single unified DSLDebug class — generated bytecode for each DSL
calls into its own DSL’s probe class. This is forced by Java’s
classpath direction: SampleFamily and Sample are owned by
analyzer/meter-analyzer; ExecutionContext is owned by
analyzer/log-analyzer; both modules depend on server-core, not the
other way around.
| Probe class | Module | Generated bytecode origin | Imports analyzer types |
|---|---|---|---|
MALDebug |
analyzer/meter-analyzer |
MAL classes (compiled by this module) | SampleFamily, Sample (own module) |
LALDebug |
analyzer/log-analyzer |
LAL classes (compiled by this module) | ExecutionContext (own module) |
OALDebug |
server-core |
OAL classes (compiled by oal-rt, deps server-core) |
none — ISource, Metrics are in server-core |
Each probe takes the per-rule GateHolder (server-core) as its first
argument and fans the capture out to every recorder registered on the
holder. No ThreadLocal, no per-thread context propagation. The
generated bytecode wraps each call site with the gate check
(if (DEBUG.gate)) so the call is only emitted once a session is active
for this rule.
The probe surface is intentionally not generic. Each DSL has its own
workflow with its own typed L1-bound artifact, and we want generated code
to call methods whose types match the artifact rather than boxing
everything through a StageData wrapper.
MALDebug (in analyzer/meter-analyzer, package org.apache.skywalking.oap.meter.analyzer.v2.dsldebug)
public final class MALDebug {
private MALDebug() {}
// ── Part A: chain probes (emitted from MALMethodChainCodegen / MALExprCodegen)
public static void captureInput(GateHolder h, String rule, int sourceLine,
String metricRef, SampleFamily f) { fanOut(h, r -> ((MALDebugRecorder) r).appendInput(rule, sourceLine, metricRef, f)); }
public static void captureFilter(GateHolder h, String rule, int sourceLine,
String filterExpr, SampleFamily f) { fanOut(h, r -> ((MALDebugRecorder) r).appendFilter(rule, sourceLine, filterExpr, f)); }
public static void captureStage(GateHolder h, String rule, int sourceLine,
String sourceText, SampleFamily f) { fanOut(h, r -> ((MALDebugRecorder) r).appendStage(rule, sourceLine, sourceText, f)); }
public static void captureScope(GateHolder h, String rule, int sourceLine, String scopeExpr,
SampleFamily f, Map<MeterEntity,Sample[]> meterSamples) { fanOut(h, r -> ((MALDebugRecorder) r).appendScope(rule, sourceLine, scopeExpr, f, meterSamples)); }
public static void captureDownsample(GateHolder h, String rule, String function,
String origin, SampleFamily f) { fanOut(h, r -> ((MALDebugRecorder) r).appendDownsample(rule, function, origin, f)); }
// ── Part B: post-chain probes (emitted from Analyzer.doAnalysis, hand-written)
public static void captureMeterBuild(GateHolder h, String rule, MeterEntity entity,
String metricName, String valueType,
AcceptableValue<?> v) { fanOut(h, r -> ((MALDebugRecorder) r).appendMeterBuild(rule, entity, metricName, valueType, v)); }
public static void captureMeterEmit(GateHolder h, String rule, MeterEntity entity,
String metricName, AcceptableValue<?> v,
long timeBucket) { fanOut(h, r -> ((MALDebugRecorder) r).appendMeterEmit(rule, entity, metricName, v, timeBucket)); }
private static void fanOut(GateHolder h, java.util.function.Consumer<DebugRecorder> body) {
DebugRecorder[] rs = h.recorders;
for (int i = 0; i < rs.length; i++) {
DebugRecorder r = rs[i];
if (!r.isCaptured()) body.accept(r);
}
}
}
LALDebug (in analyzer/log-analyzer, package org.apache.skywalking.oap.log.analyzer.v2.dsldebug)
public final class LALDebug {
private LALDebug() {}
public static void captureText(GateHolder h, String rule, ExecutionContext ctx) { fanOut(h, r -> ((LALDebugRecorder) r).appendText(rule, ctx)); }
public static void captureParser(GateHolder h, String rule, ExecutionContext ctx) { fanOut(h, r -> ((LALDebugRecorder) r).appendParser(rule, ctx)); }
public static void captureExtractor(GateHolder h, String rule, ExecutionContext ctx) { fanOut(h, r -> ((LALDebugRecorder) r).appendExtractor(rule, ctx)); }
public static void captureSink(GateHolder h, String rule, ExecutionContext ctx,
String branch, boolean kept) { fanOut(h, r -> ((LALDebugRecorder) r).appendSink(rule, ctx, branch, kept)); }
public static void captureLine(GateHolder h, String rule, int sourceLine,
ExecutionContext ctx) { fanOut(h, r -> ((LALDebugRecorder) r).appendLine(rule, sourceLine, ctx)); }
public static void captureOutputRecord(GateHolder h, String rule, ExecutionContext ctx,
LALOutputBuilder builder) { fanOut(h, r -> ((LALDebugRecorder) r).appendOutputRecord(rule, ctx, builder)); }
public static void captureOutputMetric(GateHolder h, String rule, ExecutionContext ctx,
SampleFamily handedToMal) { fanOut(h, r -> ((LALDebugRecorder) r).appendOutputMetric(rule, ctx, handedToMal)); }
private static void fanOut(GateHolder h, java.util.function.Consumer<DebugRecorder> body) {
DebugRecorder[] rs = h.recorders;
for (int i = 0; i < rs.length; i++) {
DebugRecorder r = rs[i];
if (!r.isCaptured()) body.accept(r);
}
}
}
captureOutputMetric takes SampleFamily directly:
analyzer/log-analyzer already depends on analyzer/meter-analyzer
(see oap-server/analyzer/log-analyzer/pom.xml), so this is the
existing dep direction, not a new one. The whole probe API stays
typed end-to-end with no Object leaks.
OALDebug (in server-core, package org.apache.skywalking.oap.server.core.dsldebug)
OAL’s runtime types (ISource, Metrics) are already owned by
server-core, so OAL’s probe class can live there too:
public final class OALDebug {
private OALDebug() {}
public static void captureSource(GateHolder h, String rule, int sourceLine,
String fromClauseText, ISource src) { fanOut(h, r -> ((OALDebugRecorder) r).appendSource(rule, sourceLine, fromClauseText, src)); }
public static void captureFilter(GateHolder h, String rule, int sourceLine,
String filterClauseText, ISource src,
String left, String op, String right, boolean kept) { fanOut(h, r -> ((OALDebugRecorder) r).appendFilter(rule, sourceLine, filterClauseText, src, left, op, right, kept)); }
public static void captureBuild(GateHolder h, String rule, Metrics built) { fanOut(h, r -> ((OALDebugRecorder) r).appendBuild(rule, built)); }
public static void captureAggregation(GateHolder h, String rule, int sourceLine,
String aggregationClauseText, Metrics afterEntryFn) { fanOut(h, r -> ((OALDebugRecorder) r).appendAggregation(rule, sourceLine, aggregationClauseText, afterEntryFn)); }
public static void captureEmit(GateHolder h, String rule, Metrics readyForL1) { fanOut(h, r -> ((OALDebugRecorder) r).appendEmit(rule, readyForL1)); }
private static void fanOut(GateHolder h, java.util.function.Consumer<DebugRecorder> body) {
DebugRecorder[] rs = h.recorders;
for (int i = 0; i < rs.length; i++) {
DebugRecorder r = rs[i];
if (!r.isCaptured()) body.accept(r);
}
}
}
These methods are the only symbols generated bytecode references —
generators never see DebugRecorder or the ThreadLocal. That keeps the
generated class identical across builds whether or not a session is active.
The probe set is intentionally closed. New DSL stages introduced later
(say, OAL adds a new dispatcher stage) get a new typed probe; we do not
push it through a generic capture(stageId, Object) because that loses
type information at the recorder and forces every appendXxx to deal
with instanceof switching. Typed probes keep the recorder narrow and
the JSON renderer per-artifact compact.
3.2 DebugRecorder — marker interface + per-DSL extensions
DebugRecorder is split the same way as the probe classes: a marker
interface in server-core (so GateHolder.recorders[] can be typed
without dragging in analyzer modules), plus three extension interfaces
in their respective DSL modules carrying the typed appendXxx methods.
DebugRecorder marker (in server-core)
package org.apache.skywalking.oap.server.core.dsldebug;
public interface DebugRecorder {
String sessionId();
String catalog();
String name();
String ruleName();
/** Narrow gate — called by `DebugSessionRegistry` before bind. */
boolean matches(String catalog, String name, String ruleName);
/** Becomes true once recordCap is hit; further appends are no-ops. */
boolean isCaptured();
}
This is the only type GateHolder.recorders[] is parameterised on. It
contains no append* methods — those live on the per-DSL extensions
below. Probes downcast to the right extension interface based on which
DSL’s class is calling.
MALDebugRecorder (in analyzer/meter-analyzer, extends DebugRecorder)
package org.apache.skywalking.oap.meter.analyzer.v2.dsldebug;
public interface MALDebugRecorder extends DebugRecorder {
void appendInput(String rule, int sourceLine, String metricRef, SampleFamily f);
void appendFilter(String rule, int sourceLine, String filterExpr, SampleFamily f);
void appendStage(String rule, int sourceLine, String sourceText, SampleFamily f);
void appendScope(String rule, int sourceLine, String scopeExpr,
SampleFamily f, Map<MeterEntity, Sample[]> meterSamples);
void appendDownsample(String rule, String function, String origin, SampleFamily f);
void appendMeterBuild(String rule, MeterEntity entity, String metricName,
String valueType, AcceptableValue<?> v);
void appendMeterEmit(String rule, MeterEntity entity, String metricName,
AcceptableValue<?> v, long timeBucket);
}
LALDebugRecorder (in analyzer/log-analyzer, extends DebugRecorder)
package org.apache.skywalking.oap.log.analyzer.v2.dsldebug;
public interface LALDebugRecorder extends DebugRecorder {
void appendText(String rule, ExecutionContext ctx);
void appendParser(String rule, ExecutionContext ctx);
void appendExtractor(String rule, ExecutionContext ctx);
void appendSink(String rule, ExecutionContext ctx, String branch, boolean kept);
void appendLine(String rule, int sourceLine, ExecutionContext ctx);
void appendOutputRecord(String rule, ExecutionContext ctx, LALOutputBuilder builder);
void appendOutputMetric(String rule, ExecutionContext ctx, SampleFamily handedToMal);
}
OALDebugRecorder (in server-core, extends DebugRecorder)
package org.apache.skywalking.oap.server.core.dsldebug;
public interface OALDebugRecorder extends DebugRecorder {
void appendSource(String rule, int sourceLine, String fromClauseText, ISource src);
void appendFilter(String rule, int sourceLine, String filterClauseText,
ISource src, String left, String op, String right, boolean kept);
void appendBuild(String rule, Metrics built);
void appendAggregation(String rule, int sourceLine, String aggregationClauseText,
Metrics afterEntryFn);
void appendEmit(String rule, Metrics readyForL1);
}
Implementation: one impl class per DSL in dsl-debugging
Each session targets one rule which is one DSL. The session’s recorder implements only the corresponding extension interface:
MALDebugRecorderImpl implements MALDebugRecorder— for sessions onotel-rules/log-mal-rules.LALDebugRecorderImpl implements LALDebugRecorder— for sessions onlal.OALDebugRecorderImpl implements OALDebugRecorder— for sessions onoal.
server-admin/dsl-debugging depends on server-core,
analyzer/meter-analyzer, and analyzer/log-analyzer, so it can
construct any of the three impls. The instanceof downcast inside the
probe is a compile-time-trivial check on a tagged type — JIT specialises
it after warm-up.
DebugSession (in dsl-debugging) is the recorder factory plus the
Map<SessionId, DebugSession> and Map<ClientId, SessionId>
book-keeping from
Session lifecycle, storage, and memory control.
Serialization happens at capture time inside each appendXxx
implementation — payload JSON is built once per probe so polls during
the session return identical bytes every time.
3.3 Install / uninstall — direct GateHolder mutation, no per-thread context
There is no analyzer-side wrapper. Sessions are installed and removed
by mutating the per-rule GateHolder from the session-management code
in dsl-debugging; receiver threads observe the new state via the
volatile recorders field on the next probe. The receiver code path
itself is unchanged from today.
Rule identity is a typed object, not a slash-encoded string.
A small immutable class lives in server-core and is the single
identifier the install/uninstall path uses — every method takes
RuleKey, never an encoded String:
public record RuleKey(Catalog catalog, String name, String ruleName) {
public enum Catalog { OTEL_RULES, LOG_MAL_RULES, LAL, OAL }
}
RuleKey.equals/hashCode are auto-derived from the three fields; no
delimiter, no escaping, no parsing. Each DSL’s runtime artifact
exposes its own RuleKey ruleKey() method (filled in at codegen
time), so the install path can inspect the live artifact’s identity
without re-parsing strings.
Each DebugSession stores the exact GateHolder it bound into.
That lets uninstall remove the recorder from the original holder
even if the rule has since been hot-updated and the live artifact now
references a different holder. Without this, install would bind to V1
and uninstall would call removeRecorder on V2 (the new live holder),
leaving V1’s recorder array and gate stuck on until the old class is
garbage-collected.
public final class DebugSession {
final SessionId sessionId;
final ClientId clientId;
final RuleKey ruleKey;
final GateHolder boundHolder; // ← captured at install time, never updated
// ... payload, lifecycle state ...
}
// inside DebugSessionRegistry.install(key, clientId, sessionSpec):
GateHolder h = liveHolderFor(key); // current live holder (see lookup below)
DebugRecorder recorder = newRecorderFor(key, sessionSpec);
DebugSession session = new DebugSession(sessionId, clientId, key, h, ...);
sessions.put(sessionId, session);
clientIdIndex.put(clientId, sessionId);
h.addRecorder(recorder); // synchronized; auto-flips gate on 0→1
// inside DebugSessionRegistry.uninstall(sessionId):
DebugSession session = sessions.remove(sessionId);
clientIdIndex.remove(session.clientId, sessionId);
session.boundHolder.removeRecorder(session.recorder); // ← original holder
Hot-update interactions stay correct:
- Install runs against the live holder of the moment.
session.boundHolderis pinned to that instance. - A subsequent runtime-rule swap creates a new live holder; the registry’s
next
liveHolderFor(key)returns the new one. New session installs bind to the new holder. - The old session’s
uninstall(manual stop / retention / replaced-by-client) walks throughsession.boundHolder— i.e., the V1 holder — so V1’s recorder array shrinks correctly and V1’s gate flips back tofalsewhen its last recorder leaves. The new live holder is untouched. - The old V1 class’s drainage continues hitting V1’s holder; once V1’s
recorder array is empty its gate is
falseand probes are JIT-eliminated again on V1 too.
The liveHolderFor(key) lookup goes through each DSL’s existing
runtime registry — no new shared Map<RuleKey, GateHolder>:
private GateHolder liveHolderFor(RuleKey key) {
switch (key.catalog()) {
case OTEL_RULES:
case LOG_MAL_RULES:
// The meter-analyzer Provider already maintains the live
// MalExpression instance per rule, replaced on hot-update.
return malProvider.liveExpression(key).debugHolder();
case LAL:
// The log-analyzer Provider already maintains live LalExpression
// instances per rule, replaced on hot-update.
return lalProvider.liveExpression(key).debugHolder();
case OAL:
// The dispatcher is held by SourceReceiverImpl; per-metric holder
// lives on it as `gate_<ruleName>` instance field.
return dispatcherManager.dispatcherFor(key)
.debugHolder(key.ruleName());
}
throw new IllegalArgumentException(key.toString());
}
No reflection. addRecorder / removeRecorder are typed methods on
the GateHolder instance reachable through each DSL’s existing live-
artifact registry — direct method calls. Hot-update for MAL or LAL
replaces the live MalExpression / LalExpression instance with a
new one carrying its own fresh GateHolder; pre-update sessions stay
on the old holder (which the old instance still references) until
their session window ends or the old class drains; post-update
sessions install on the new holder. There is no shared registry to
overwrite mid-flight.
recorders is a copy-on-write snapshot — readers iterate whatever
snapshot was visible when they read the volatile reference; writers
publish a new snapshot atomically. The per-rule lock serialises
concurrent install/uninstall against each other (rare events; one
operator-driven action at a time per rule). The receiver hot path takes
no lock and incurs no TL access.
Why no analyzer-side wrapper:
- The receiver doesn’t need to know which sessions match the rule — the holder is the per-rule binding. Probes consult it directly.
- No indirection to add at every analyzer call site (no diff per analyzer beyond the codegen change inside the rule’s own class).
- Multi-session support is built into the recorder array: install appends, uninstall removes; the probe iterates whatever’s there. No composite-recorder type, no per-thread state to install or propagate.
3.4 Generator hook insertion — concrete code diffs
MAL — two halves: chain codegen + Analyzer hand-off to MeterSystem
The MAL pipeline splits cleanly across two files. The chain (filter / closure /
scope / tag / downsampling stages) is emitted by MALMethodChainCodegen; the
post-chain MeterSystem build + emit lives in Analyzer.doAnalysis. The
debugger hooks both halves.
Chain codegen (source). The current loop emits one
_var = _var.method(args); line per stage. The hook is one extra line per
stage:
// before:
for (final MethodCall mc : chain) {
if (mc.isExtension()) { emitExtensionCall(sb, var, mc); continue; }
sb.append(" ").append(var).append(" = ")
.append(var).append('.').append(mc.getName()).append('(');
emitBuiltinArgs(sb, var, mc);
sb.append(");\n");
}
// after:
for (final MethodCall mc : chain) {
if (mc.isExtension()) {
emitExtensionCall(sb, var, mc);
emitMalCapture(sb, var, mc, /*op=*/mc.getNamespace() + "::" + mc.getName());
continue;
}
sb.append(" ").append(var).append(" = ")
.append(var).append('.').append(mc.getName()).append('(');
emitBuiltinArgs(sb, var, mc);
sb.append(");\n");
emitMalCapture(sb, var, mc, mc.getName());
}
private static void emitMalCapture(StringBuilder sb, String var, MethodCall mc) {
// Emit a literal-string call. `mc.toMalSource()` rebuilds the verbatim
// MAL fragment from the AST: e.g. "tagEqual('region', 'us-east-1')",
// ".sum(['svc', 'instance'])", "myext::scale(2.0)". The compiled string
// literal goes into the constant pool once per generated class.
sb.append(" ").append(DSL_DEBUG_FQN)
.append(".MALDebug.captureStage(\"").append(currentRuleName).append("\", ")
.append(mc.getLine()).append(", \"")
.append(escapeJava(mc.toMalSource())).append("\", ")
.append(var).append(");\n");
}
MALDebug.captureInput is emitted once at the top of run() (in
MALExprCodegen.generate*, where _var first reads from samples.get(...))
with sourceLine = the line where the metric reference appeared and
metricRef = the reference text (e.g. node_memory_MemAvailable_bytes).
Filter (file-level, applied in Analyzer before expression.run()), tag,
binary operations (* 100, / otherMetric, etc., emitted by
MALExprCodegen as their own _var = _var.multiply(...) / _var.divide(...)
lines), and built-in chain stages each get a MALDebug.captureStage(...) line
carrying their MAL fragment. Scope binding emits MALDebug.captureScope(...) with
the expSuffix text (e.g. instance(['host'], Layer.VM)) plus the freshly
populated Map<MeterEntity, Sample[]>. Closure companion classes
(MALClosureCodegen) need no changes — the chain stage that owns the
closure (tag({...}), forEach({...})) already captures it at its method
call line; the closure’s own line shows up as the sourceText of that
parent stage.
File-level filter, also in Analyzer.doAnalysis (source).
The filter: block at the top of the YAML file ({ tags -> tags.job == 'node-exporter' }) is applied to the input map before expression.run()
fires. One probe captures the filtered SampleFamily with the filter’s verbatim
text:
// existing:
if (filterExpression != null) {
input = filterExpression.filter(input);
if (input.isEmpty()) return;
}
// after — every probe takes the rule's GateHolder as its first argument:
GateHolder DEBUG = expression.debugHolder(); // resolved once per ingest pass
if (filterExpression != null) {
input = filterExpression.filter(input);
if (DEBUG.gate) {
MALDebug.captureFilter(DEBUG, ruleName, fileFilterLine,
fileFilterSource /* "{ tags -> tags.job == 'node-exporter' }" */,
SampleFamily.flatten(input) /* lightweight view across the input map */);
}
if (input.isEmpty()) return;
}
fileFilterLine and fileFilterSource are stored on the rule definition
when the YAML is parsed; they’re per-rule constants, captured once per
ingest pass.
Post-chain hand-off in Analyzer.doAnalysis. The existing code iterates
meterSamples and calls meterSystem.buildMetrics(...).accept(...) followed
by send(...) (which calls meterSystem.doStreamingCalculation(...)). Two
probes wrap the build / emit boundary:
// existing:
meterSamples.forEach((meterEntity, ss) -> {
generateTraffic(meterEntity);
switch (metricType) {
case single:
AcceptableValue<Long> sv = meterSystem.buildMetrics(metricName, Long.class);
sv.accept(meterEntity, getValue(ss[0]));
send(sv, ss[0].getTimestamp()); // ← doStreamingCalculation = L1 entry
break;
// ... labeled / histogram / histogramPercentile branches
}
});
// after — every probe takes the rule's GateHolder as its first argument:
final GateHolder DEBUG = expression.debugHolder();
meterSamples.forEach((meterEntity, ss) -> {
generateTraffic(meterEntity);
switch (metricType) {
case single:
AcceptableValue<Long> sv = meterSystem.buildMetrics(metricName, Long.class);
sv.accept(meterEntity, getValue(ss[0]));
if (DEBUG.gate) {
MALDebug.captureMeterBuild(DEBUG, ruleName, meterEntity, metricName, "Long", sv);
MALDebug.captureMeterEmit(DEBUG, ruleName, meterEntity, metricName, sv,
TimeBucket.getMinuteTimeBucket(ss[0].getTimestamp()));
}
send(sv, ss[0].getTimestamp());
break;
// ... same shape for labeled (DataTable), histogram (BucketedValues),
// histogramPercentile (PercentileArgument)
}
});
Both probes fire before send(...). Note that send(...) itself
mutates v by setting v.timeBucket immediately before
doStreamingCalculation, so at the moment MALDebug.captureMeterEmit runs
v.timeBucket is still 0. The probe takes the time-bucket as a separate
parameter, computed from the same ss[0].getTimestamp() that send will
use; the recorder serializes (v fields, computed timeBucket) so the UI
sees the exact (value, timeBucket) pair that’s about to be streamed,
without needing the post-mutation snapshot. The debugger does not follow
the value past doStreamingCalculation.
LAL — LALClassGenerator.generateExecuteMethod + Analyzer-side terminal
LAL’s execute(filterSpec, ctx) already passes ctx explicitly — no
ThreadLocal needed. The codegen places per-block probes inside execute()
(source); the terminal typed-output capture lives in
RecordSinkListener.parse so it fires once init(metadata, ...) has
populated standard fields and right before the listener’s build()
delegates to complete(sourceReceiver).
// generated execute() body — hooks are gate-guarded and pass `this.DEBUG`:
public final GateHolder DEBUG = new GateHolder("...sha...");
public void execute(FilterSpec filterSpec, ExecutionContext ctx) {
LalRuntimeHelper h = new LalRuntimeHelper(ctx);
h.ctx().setOutput(new <OutputType>());
if (DEBUG.gate) LALDebug.captureText(DEBUG, ruleName, ctx); // >>> raw body
filterSpec.json(ctx);
if (DEBUG.gate) LALDebug.captureParser(DEBUG, ruleName, ctx); // >>> parsed map
if (!ctx.shouldAbort()) {
_extractor(filterSpec.extractor(), h);
if (DEBUG.gate) LALDebug.captureExtractor(DEBUG, ruleName, ctx); // >>> typed builder mid-flight
}
filterSpec.sink(ctx);
if (DEBUG.gate) LALDebug.captureSink(DEBUG, ruleName, ctx, // >>> kept / dropped + branch
h.lastSinkBranch(),
!h.lastSinkDropped());
}
h.lastSinkBranch() / h.lastSinkDropped() are tiny additions to
LalRuntimeHelper that capture which sink branch (enforcer / sampler /
rateLimit / pass-through) ran last and whether it dropped the record;
they’re populated by the existing sink spec methods on the same hot path.
Terminal record output is captured one frame up the stack in
RecordSinkListener (source) — i.e., not in generated bytecode
but in the analyzer wrapper that drives it:
// before:
public LogSinkListener parse(LogMetadata md, Object input, ExecutionContext ctx) {
if (ctx == null || !(ctx.output() instanceof LALOutputBuilder)) return this;
builder = ctx.outputAsBuilder();
builder.init(md, input, moduleManager);
return this;
}
// after:
public LogSinkListener parse(LogMetadata md, Object input, ExecutionContext ctx) {
if (ctx == null || !(ctx.output() instanceof LALOutputBuilder)) return this;
builder = ctx.outputAsBuilder();
builder.init(md, input, moduleManager);
GateHolder DEBUG = ctx.lalExpression().debugHolder(); // resolved per call
if (DEBUG.gate) {
LALDebug.captureOutputRecord(DEBUG, ruleName, ctx, builder); // >>> typed builder, before complete()
}
return this;
}
That captures the typed builder (LogBuilder, EnvoyAccessLogBuilder,
DatabaseSlowStatementBuilder, SampledTraceBuilder, custom subclasses) with
all standard fields populated. The recorder serializes the class name plus
every populated bean property — the UI shows the operator the actual record
shape that’s about to be persisted.
Metric output (only present for rules with a metrics{} block) fires from
MetricExtractor.submitMetrics (source) right after
SampleFamilyBuilder produces the family that’s about to be handed to MAL:
// inside submitMetrics(...):
final Sample sample = builder.build();
final SampleFamily sampleFamily = SampleFamilyBuilder.newBuilder(sample).build();
GateHolder DEBUG = ctx.lalExpression().debugHolder();
if (DEBUG.gate) {
LALDebug.captureOutputMetric(DEBUG, ruleName, ctx, sampleFamily); // >>> handed to MAL
}
// ... existing dispatch into MetricConverts ...
The capture fires once whether the family is appended to ctx.metricsContainer()
(MAL fanout case) or fed straight to provider.getMetricConverts().forEach(...).
The debugger does not chain into the MAL pipeline that consumes this
family — that boundary is the L1-bound hand-off from LAL’s perspective, even
though MAL would itself become L1-bound only after its own meterSystem
build/emit. From this LAL session’s view, MAL is downstream and out of scope.
The two helpers LALBlockCodegen and LALDefCodegen are unchanged for
block-level capture — block content emission is the same; only the
orchestration in generateExecuteMethod adds the capture lines.
Statement-level capture — always emitted, recorder-side filter.
LAL codegen unconditionally emits one
LALDebug.captureLine(rule, sourceLine, ctx) call after every
meaningful AST statement (OutputFieldAssignment, TagAssignment,
DefStatement, IfBlock predicate evaluations, MetricsInline
blocks). The probes are present in every compiled LAL class regardless
of how the operator will later debug the rule. There is no
recompile-on-session-install path — the rule’s class is generated
once at boot or runtime-rule apply time, with all probe call sites in
place, and stays that class until a structural runtime-rule update
replaces it.
The session’s granularity field is purely a recorder-side filter:
| Granularity | Recorder behaviour |
|---|---|
block (default) |
appendLalLine(...) returns immediately — statement records are dropped. Only block-level appends accumulate. |
statement |
appendLalLine(...) serializes and appends the record. Block-level appends accumulate alongside. |
Idle-path cost remains zero — DEBUG.gate=false eliminates all probe
call sites, statement-level included. Active-path cost in block mode is
slightly higher than the previous design (~7 ns per statement probe to
do the holder load + recorder iteration + recorder’s
“granularity != statement → return” check), but the absolute cost is
bounded: a typical LAL extractor block has 10-15 statements, so
~100-150 ns of additional cost per log record on a block-mode session.
Well inside the active-path budget. The
performance ladder updates
to reflect this; idle and capturing rows are unchanged.
AST plumbing required (one-time, not per session). Add a
int sourceLine field to every relevant statement node in
LALScriptModel. The ANTLR listener already has
ctx.getStart().getLine() per parsed statement; thread it into the
AST. The codegen reads it when emitting the LALDebug.captureLine(...) call.
The bytecode LineNumberTable already produced by
LALClassGenerator.addLineNumberTable tracks generated-Java lines
and is unrelated — sourceLine here is the original DSL line in the
YAML so the UI can highlight the corresponding source line side-by-side
with the captured value.
Concurrency — multiple sessions of any granularity mix on the same
rule: each registers as one entry on the rule’s GateHolder.recorders
array (see Performance plan).
Each statement probe iterates the array; recorders running in block
mode drop the record; recorders running in statement mode keep it.
No cross-session interference; no class swap; no DSLClassLoaderManager
involvement.
OAL — FreeMarker template dispatcher/doMetrics.ftl
OAL emits one do<Metric>(source) method per metric per dispatcher class,
filter as a guard, then <Metric>Metrics metrics = new ..., source-field
copy, metrics.<entryFn>(...), then MetricsStreamProcessor.in(metrics)
(source). Hooks bracket each transition; the final probe fires
before the L1-entry in(...) call, not after:
The template uses the full per-clause probe signatures (rule, sourceLine, sourceText, …) as defined in the probe surface.
OAL AST deliverable — source-position fields don’t exist today
The current OAL model classes do not carry source positions:
oap-server/oal-rt/src/main/java/.../v2/model/SourceReference.java— has nolocation/sourceLinefield.oap-server/oal-rt/src/main/java/.../v2/model/FunctionCall.java— same; no source location.oap-server/oal-rt/src/main/java/.../v2/model/FilterExpression.java— declares aSourceLocation locationfield (line 42), butOALListenerV2.enterFilterStatement(line 99) does not populate it on construction; the field is currently always null.
This SWIP makes the source-position carrying an explicit deliverable of Phase 3 (OAL single-node):
- Add
int sourceLine(orSourceLocation) toSourceReferenceandFunctionCallmodel classes (analogous toFilterExpression’s existing field). - Update
OALListenerV2.enterSourceReference/enterFunctionCall/enterFilterStatementto populate the source position fromctx.getStart().getLine()on each AST node it constructs. - Add
String sourceText()accessors that pretty-print the verbatim OAL fragment from each node (from(Endpoint.*),.cpm(),.filter(detectPoint == DetectPoint.SERVER), etc.) — used by the FreeMarker model for${from.sourceText},${f.sourceText},${aggregationSourceText}.
These are pure model-side additions; no behavioural change to the OAL runtime, no breaking change to the parser. The SourceReference / FunctionCall changes are tracked under Phase 3 in Phased delivery.
Once the AST changes land, ${from.line}, ${aggregationLine},
${f.line} populate from MetricDefinition.SourceReference /
FilterExpression / FunctionCall nodes through the FreeMarker
model.
Each per-metric holder lives on the dispatcher singleton as
gate_<metricsName> (declared once in the dispatcher’s class body
alongside the other instance fields the template emits). The
do<MetricsName> body reads that holder and passes it as the first arg
to every OALDebug.captureXxx call:
private void do${metricsName}(${sourcePackage}${from.sourceName} source) {
if (gate_${metricsName}.gate) {
org.apache.skywalking.oap.server.core.dsldebug.OALDebug
.captureSource(gate_${metricsName}, "${metricsName}",
${from.line}, "${from.sourceText}", source);
}
<#if filters.filterExpressions??>
<#list filters.filterExpressions as f>
if (!new ${f.expressionObject}().match(${f.left}, ${f.right})) {
if (gate_${metricsName}.gate) {
org.apache.skywalking.oap.server.core.dsldebug.OALDebug
.captureFilter(gate_${metricsName}, "${metricsName}",
${f.line}, "${f.sourceText}",
source, "${f.left}", "${f.opSymbol}", "${f.right}", false);
}
return;
}
if (gate_${metricsName}.gate) {
org.apache.skywalking.oap.server.core.dsldebug.OALDebug
.captureFilter(gate_${metricsName}, "${metricsName}",
${f.line}, "${f.sourceText}",
source, "${f.left}", "${f.opSymbol}", "${f.right}", true);
}
</#list>
</#if>
${metricsClassPackage}${metricsName}Metrics metrics =
new ${metricsClassPackage}${metricsName}Metrics();
<#-- ... existing source-field copy ... -->
if (gate_${metricsName}.gate) {
org.apache.skywalking.oap.server.core.dsldebug.OALDebug
.captureBuild(gate_${metricsName}, "${metricsName}", metrics);
}
metrics.${entryMethod.methodName}(${entryMethod.argsList});
if (gate_${metricsName}.gate) {
org.apache.skywalking.oap.server.core.dsldebug.OALDebug
.captureAggregation(gate_${metricsName}, "${metricsName}",
${aggregationLine}, "${aggregationSourceText}", metrics);
org.apache.skywalking.oap.server.core.dsldebug.OALDebug
.captureEmit(gate_${metricsName}, "${metricsName}", metrics);
}
org.apache.skywalking.oap.server.core.analysis.worker
.MetricsStreamProcessor.getInstance().in(metrics);
}
OALDebug.captureEmit fires immediately before MetricsStreamProcessor.in(metrics)
— that call is L1 entry, and the debugger stops there. The dispatcher template
dispatch.ftl is unchanged: it just routes a Source row to its do<Metric>
methods; capture is per-metric, inside each method.
No receiver-side wrapping is needed for OAL — the dispatcher
singleton holds per-metric GateHolder instance fields
(gate_endpoint_cpm, gate_endpoint_resp_time, …), and each
do<Metric>(source) reads its own metric’s holder directly. Same
self-resolving probe model as MAL and LAL; the receiver code path
(SourceReceiverImpl.receive → DispatcherManager.forward →
dispatcher.dispatch(source) → the generated do<Metric>(source))
is unchanged.
3.5 Recap of what changes vs. what doesn’t
| Component | Change |
|---|---|
| MAL / LAL / OAL grammar + AST | unchanged |
| MAL / LAL / OAL generator orchestrators | one extra capture-emission helper called per stage |
| Generated bytecode | adds N static-method calls to the per-DSL probe class (MALDebug.capture* / LALDebug.capture* / OALDebug.capture*) |
| Receiver / analyzer call sites | unchanged — probes self-resolve through the per-rule GateHolder |
| New plumbing | GateHolder, DebugRecorder marker, OALDebug + OALDebugRecorder (in server-core); MALDebug + MALDebugRecorder (in analyzer/meter-analyzer); LALDebug + LALDebugRecorder (in analyzer/log-analyzer); recorder impls + DebugSessionRegistry + REST handler (in dsl-debugging); HTTP host shared via admin-server. No new shared Map<RuleKey, GateHolder> registry — sessions reach holders through each DSL’s existing live-artifact lookup. |
| Behavior when no session is installed | per-rule GateHolder (singleton) gates every probe via its gate volatile-boolean field; JIT eliminates the call when off (~1 ns idle). See Performance plan. |
Compile-failure handling is unchanged — capture hooks only fire if the rule compiled successfully. The chat clarification is honored: “if it is parsed, there is no grammar issue”.
3.6 Performance plan — idle path effectively free, active path bounded
Capture lives in the data hot path. A receiver-side LAL pipeline can run
100K log lines/sec; a busy OAL dispatcher fans dozens of Source rows per
trace. Probe overhead must be near-zero when no session is active and
strictly bounded when one is. Three layers handle this:
Layer 1 — per-rule-instance GateHolder, gate flipped through interface methods
Every generated rule artifact carries a GateHolder field constructed
fresh per class generation. The holder is bound to that specific
class instance, not shared across class versions. Concrete shape per
DSL:
- MAL / LAL:
public final GateHolder DEBUG = new GateHolder("...sha...");on the generated rule class. The class has a singleton runtime instance (MalExpression/LalExpression); the holder lives as an instance field on that singleton. - OAL: per-metric
public final GateHolder gate_<metric> = new GateHolder("...sha...");on the dispatcher singleton (one holder per metric the dispatcher routes to). The metric class itself is per-source-row, so the holder cannot live there; the dispatcher is the per-rule control point.
In both cases, hot-update replacing the rule’s runtime artifact also
replaces the holder — the new artifact’s constructor builds a brand-new
GateHolder with the new content’s hash. The old holder lingers as
long as the old artifact is referenced by drainage and session state;
its contentHash is final and never overwritten, so records captured
on the old artifact carry the right (old) hash.
The DSL codegen wraps every probe call site:
// emitted in MALMethodChainCodegen / LALClassGenerator / OAL FreeMarker:
if (DEBUG.gate) {
MALDebug.captureStage(DEBUG, ruleName, sourceLine, sourceText, _var);
}
DEBUG.gate is a public volatile boolean field on the holder, default
false. When DEBUG.gate == false (the steady state on every node not
currently debugging the rule), the JIT compiles the branch as
never-taken: the capture call site becomes effectively dead code, and
the only runtime cost per probe is one volatile-load + branch (~1 ns).
On a hot LAL rule with statement-level probes and no session running,
the per-log-line idle overhead is ~15 probes × 1 ns = 15 ns — well
under the 1% budget.
Session install flips the gate through GateHolder’s typed
interface methods — never reflection. The session code calls
holder.addRecorder(r); the holder atomically appends the recorder to
its CoW array and flips gate=true if it just transitioned 0→1:
// inside DebugSessionRegistry.install(ruleKey, recorder):
GateHolder h = liveHolderFor(key);
h.addRecorder(recorder); // synchronized; auto-flips gate on 0→1
Symmetric on uninstall: h.removeRecorder(recorder) removes the entry
and flips gate=false on 1→0. The holder owns its own state
transitions; callers don’t have to remember to flip the gate
separately.
The gate field is one writer-side write per 0↔1 transition, many
reader-side loads per probe; volatile is the correct primitive. The
recorders array is replaced via volatile write on each
addRecorder/removeRecorder, so receiver threads always see a coherent
snapshot.
Layer 2 — holder-resident recorder array, fan-out at the probe
When gate is true, the probe call site fires and reaches
the relevant per-DSL probe (MALDebug.captureXxx / LALDebug.captureXxx
/ OALDebug.captureXxx, depending on which DSL’s class is calling).
That method reads the holder’s
recorders array and iterates:
DebugRecorder[] rs = h.recorders; // one volatile load
for (int i = 0; i < rs.length; i++) {
DebugRecorder r = rs[i];
if (!r.isCaptured()) body.accept(r); // each recorder appends to its session
}
No ThreadLocal. No per-thread state to install or maintain. Multiple
sessions on the same rule sit in the array and each receives the
capture independently. The receiver thread doesn’t know or care which
sessions exist — it just iterates the holder’s snapshot.
Layer 3 — per-session isCaptured() short-circuit
Once a session hits its record cap, isCaptured() flips to true. The
fan-out loop still iterates over that recorder, but skips the
serialization. With a single capped session in a 1-recorder array, the
total cost stays in the few-ns range (gate + array read + isCaptured
check + skip).
Cost summary
| State | Cost per probe |
|---|---|
Idle — no session anywhere on this rule (gate=false) |
~1 ns (volatile load + branch; JIT eliminates the call site) |
| Active — 1 recorder, all captured (cap reached) | ~7 ns (gate + array load + isCaptured + skip) |
| Active — N recorders, all captured | ~7 ns + ~3 ns per extra recorder |
| Active — capturing (real append, single recorder) | ~500–2000 ns (JSON serialize + append) |
| Active — capturing across N recorders | N × ~500–2000 ns (independent per recorder) |
Active-path serialization uses per-DSL streaming renderers — direct
StringBuilder walks of SampleFamily / ExecutionContext /
LALOutputBuilder / Metrics rather than reflection-based JSON. CPU is
bounded by the per-record figure in the table above; memory is bounded
by the two-dimension session contract (MAX_ACTIVE_SESSIONS=200 per
node × per-session recordCap, see
Session lifecycle, storage, and memory control).
Bytecode + memory overhead when not running
| Resource | Cost |
|---|---|
| Generated bytecode per rule | ~20 bytes per probe call site (gate-load + branch + invokestatic + args) |
| Constant-pool entries | One String entry per sourceText literal; one int per sourceLine |
| Per-class field | One static-final GateHolder reference (instance carries volatile boolean gate, volatile String contentHash, volatile DebugRecorder[] recorders) |
| Session registry state | Empty when no sessions exist — no idle memory |
Across the bundled rule set (~50 MAL rules, ~14 LAL rules, ~12 OAL files × many rules each), the cumulative bytecode increase is small — single- digit MiB. The OAP image size is dominated by other concerns; this is not a constraint.
Validation gate
JMH benchmarks added under oap-server-bench:
DSLDebugIdleBench— measures probe call cost with no session installed. CI gate: regress build if mean cost exceeds 2× the no-probe baseline on the same JIT.DSLDebugActiveBench— measures probe call cost during an active session. CI gate: regress build if mean cost exceeds 5 μs per capture at the median.- Tail-latency check (
p99 < 50 μs per capture) under contended load (8 receiver threads × hot rule).
Benchmarks run on every PR that touches oap-server/server-core/.../dsldebug/,
the DSL generators, or the analyzer call sites. A fail blocks merge.
Why not a single global gate
A single global “any session active anywhere” flag would also eliminate idle cost — but it would force every receiver thread to enter the fan-out path (read the recorders array, iterate, even if empty) the moment any session on any rule exists anywhere. With per-class gates, only threads processing the specifically debugged rule pay attention; everyone else stays on the JIT-eliminated call-site path.
What this means for deployment
- Operators leaving DSL-debugging enabled in
application.ymlwith no active session pay essentially zero — confirmed by the JMH idle bench. - Operators starting a session on rule X impose cost only on the receiver threads that processed rule X’s data — and only for the duration of the session capture window plus retention.
- A misbehaving operator script that opens many sessions concurrently is
bounded by
maxActiveSessions: 200(rejected with HTTP 429 above the cap), and even at the cap the cost on any one rule’s hot path is identical to one session.
3.7 Implementation strategy — hard-coded probes, hand-written serializers, no reflection
The capture pipeline has three layers; only the call sites are auto-
emitted by codegen, the rest is hand-written Java in server-core and
dsl-debugging with no reflection on the hot path.
Layer 1 — call sites: codegen-emitted constants
Probe call sites (if (DEBUG.gate) {MAL,LAL,OAL}Debug.captureXxx(DEBUG, ...)) are
hard-coded as bytecode constants by the existing DSL codegens:
| Codegen owner | Emits |
|---|---|
MALMethodChainCodegen.emitChainStatements |
one MALDebug.captureStage(...) per chain stage; sourceText = mc.toMalSource() written into the constant pool as a String literal |
MALExprCodegen (input read) |
one MALDebug.captureInput(...) at each samples.get(...) site |
Analyzer.doAnalysis (hand-written, not codegen) |
the MALDebug.captureFilter / MALDebug.captureScope / MALDebug.captureDownsample / MALDebug.captureMeterBuild / MALDebug.captureMeterEmit calls |
LALClassGenerator.generateExecuteMethod |
per-block LALDebug.captureText / LALDebug.captureParser / LALDebug.captureExtractor / LALDebug.captureSink |
LALBlockCodegen.generate*Statement |
LALDebug.captureLine(...) after each statement (always emitted; recorder filters by granularity) |
RecordSinkListener.parse (hand-written) |
LALDebug.captureOutputRecord(...) after builder.init(...) |
MetricExtractor.submitMetrics (hand-written) |
LALDebug.captureOutputMetric(...) after SampleFamilyBuilder.build() |
dispatcher/doMetrics.ftl FreeMarker |
per-clause OAL captures with line numbers from MetricDefinition.SourceReference etc. |
Each emitted call expands to 4-8 bytecode instructions plus the constant-pool entries for the literal arguments (rule name, source line, source text). No reflection, no method-handle lookup, no allocation — the JVM’s invokestatic instruction dispatches directly to the matching per-DSL probe class.
Layer 2 — {MAL,LAL,OAL}Debug.captureXxx: hand-written probes in their DSL’s home module
Each probe is a few-line static method that fans out to the holder’s recorder array:
public static void MALDebug.captureStage(GateHolder h, String rule, int sourceLine,
String sourceText, SampleFamily f) {
DebugRecorder[] rs = h.recorders; // one volatile load
for (int i = 0; i < rs.length; i++) {
DebugRecorder r = rs[i];
if (!r.isCaptured()) r.appendMalStage(rule, sourceLine, sourceText, f);
}
}
No reflection. No ThreadLocal. No allocation on the no-active-recorder
path (the receiver hits this method only when h.gate == true, which
implies recorders.length >= 1; the array load + length check is
unconditional). When all recorders are isCaptured(), the loop runs
through them in a few ns.
Layer 3 — DebugRecorderImpl.appendXxx: hand-written per-DSL serializers in dsl-debugging
Each appendXxx method has a hand-written JSON serializer for that
specific artifact type. No ObjectMapper, no reflection, no annotation
processing — direct StringBuilder writes that know the artifact’s
shape:
@Override public void appendMalStage(String rule, int line, String text, SampleFamily f) {
if (++recordCount > recordCap) { captured = true; return; }
StringBuilder sb = new StringBuilder(256);
sb.append("{\"kind\":\"stage\",\"line\":").append(line)
.append(",\"text\":").append(jsonEscape(text))
.append(",\"samples\":[");
Sample[] samples = f.samples;
for (int i = 0; i < samples.length; i++) {
if (i > 0) sb.append(',');
appendSampleJson(sb, samples[i]); // hand-written walk of labels + value
}
sb.append("]}");
serializedRecords.add(sb.toString());
}
appendSampleJson, appendMeterEntityJson, appendBucketedValuesJson,
appendLALOutputBuilderJson (per known subclass), appendISourceJson
(per scope) are all hand-written walks of the artifact’s fields. Each
artifact type has a known shape; we don’t need to discover fields at
runtime.
For LALOutputBuilder subclasses (LogBuilder, EnvoyAccessLogBuilder,
DatabaseSlowStatementBuilder, SampledTraceBuilder, custom subclasses
contributed by LALSourceTypeProvider), the recorder dispatches via a
small Map<Class<? extends LALOutputBuilder>, BuilderSerializer>
populated at dsl-debugging’s startup. Custom builders contribute their
own serializer through the LALSourceTypeProvider SPI alongside their
existing outputType() contribution. If no serializer is registered the
recorder falls back to a generic Java-bean walk that uses
PropertyDescriptor reflection — only on this fallback path, not
hot-path.
Why no reflection on the hot path
A Jackson ObjectMapper.writeValueAsString(family) would be 5-10× slower
than the hand-written walk for SampleFamily because it would walk
Sample’s reflection metadata per call. With statement-level LAL active
on a busy rule (8 records × 15 statements × 60s window), that adds up to
visible CPU. Hand-written serializers keep us at ~500-1000 ns per
capture, which is the active-path budget validated by the JMH active
bench.
Per-rule-instance GateHolder — codegen mechanics
Every generated rule artifact gets one GateHolder field constructed
in the class’s constructor (or constructor-like initializer for
dispatcher metric fields). The holder’s contentHash is supplied as a
constructor argument and stored in a final field, so it is fixed
for that holder’s entire lifetime:
// MAL / LAL — instance field on the rule's singleton runtime class:
public final GateHolder DEBUG = new GateHolder(/* contentHash = */ "7c3a91…");
// OAL — per-metric instance field on the dispatcher singleton:
public final GateHolder gate_endpoint_cpm = new GateHolder("8a21f0…");
The contentHash literal is computed at codegen time (SHA-256 over the
content bytes the codegen consumed) and embedded directly in the
generated bytecode’s constant pool. There is no separate registry of
holders, no Map<RuleKey, GateHolder> to dedup or replace, no
computeIfAbsent semantics — each generated class instance simply
constructs its own holder once.
Session-install code finds the holder by asking each DSL’s existing
live-artifact registry (MalExpression / LalExpression /
dispatcher singleton) for the rule and calling .debugHolder(...) —
typed method, not reflection. Mutation goes through the holder’s typed
interface methods (addRecorder, removeRecorder); the gate
transitions are encapsulated inside those methods, so callers can’t
forget to flip the gate. No Class.getField(...), no
setBoolean(null, ...), no reflection anywhere on the install /
uninstall path.
Per-rule holder access on the hot path is the cheapest possible flag:
a single final reference load + one volatile boolean read + branch.
The final DEBUG reference is JIT-inlinable across the receiver’s
hot loop; the volatile read on gate is the actual gate check.
Net summary
- Codegen emits: probe call sites +
public final GateHolder DEBUG = new GateHolder("...sha...")(orgate_<metric> = new GateHolder(...)per metric, for OAL dispatchers) + theRuleKey ruleKey()accessor. Hard-coded bytecode literals; nothing computed at runtime. {MAL,LAL,OAL}Debugstatic methods: hand-written probes. No reflection.DebugRecorderImplper-DSL appenders: hand-written JSON serializers per artifact type. No reflection on the hot path; only fallback for unknown LAL output-builder subclasses.- All allocation on the active path: one
StringBuilderper capture (recycled? — no, fresh per capture; the per-capture lifetime is short and ZGC/G1 handle it efficiently).
3.8 Interaction with rule changes — content-hash stamping
Runtime-rule hot-update can replace a rule’s class while a debug session
is active. The captured sourceLine / sourceText describe the rule
content as of the class that produced them — once the class swaps, a
later record’s source positions refer to a different content. The UI
needs to know which content each record describes so it can render the
right YAML alongside.
The capture carries the rule’s content hash directly, computed by
the DSL engine at content-load time. The hash is not persisted
anywhere — not in the runtime-rule DB row, not in any registry on
disk. It lives only in memory: in the GateHolder, in the generated
class’s constant pool (as a String literal), and in the captured
records that quote it. Storing it persistently would just denormalize
content that’s already in hand whenever it matters.
Static rules loaded from disk at boot have no DB row at all — their content is read once, hashed in memory, and the hash flows into the generated class. Runtime-managed rules have a DB row carrying the rule content (status, content bytes, updateTime); the runtime-rule applier already loads that content into memory before compiling, so the hash falls out for free from the same in-memory bytes the codegen consumes. Both paths converge on the same SHA-256 from the same content with no extra DB column required.
Extended GateHolder
The GateHolder introduced in the
Performance plan
grows one field:
The GateHolder carries contentHash as a final constructor argument
(see the canonical GateHolder definition).
Codegen emits the constructor call at the rule’s instance-field initializer:
// MAL / LAL — generated rule class:
public final GateHolder DEBUG = new GateHolder(/* contentHash = */ "7c3a91…");
// OAL — generated dispatcher:
public final GateHolder gate_endpoint_cpm = new GateHolder("8a21f0…");
contentHash is computed once per class generation:
| Origin | Where the hash is computed |
|---|---|
| Static-disk rules at boot | the analyzer reads the YAML from disk; codegen hashes those bytes locally before generating the class |
| Runtime-rule managed rules | the runtime-rule applier loads content from the DB; codegen hashes those bytes locally — same code path |
Both paths feed the codegen the same content bytes; codegen runs
SHA-256 over them once and embeds the hex digest as a String literal
in the constant pool of the generated class. The DB row stores only
the content (plus status + updateTime); the hash is recomputed
whenever the content is loaded into memory, never read from a stored
column.
The runtime-rule plugin’s /runtime/rule/list response includes a
contentHash field for operator visibility — that response field is
computed on demand from the in-memory content the plugin already has,
not read from a persisted column. Same convention here: derive on
load, don’t store.
What the recorder writes
Every captured record reads DEBUG.contentHash at append time and stores
it in the per-record JSON:
{
"id": "r17",
"contentHash": "7c3a91…",
"stages": [
{"sourceLine": 7, "sourceText": "tagEqual('region', 'us-east-1')", "samples": [...]},
{"sourceLine": 7, "sourceText": "* 100", "samples": [...]}
]
}
A volatile field read is one cheap load; this is added to the active
path’s per-capture cost (already 500-2000 ns), not the idle path.
What the session payload returns
Top-level alongside nodes[], the session response includes a
ruleSnapshots map keyed by content hash:
{
"sessionId": "d-abc",
"ruleSnapshots": {
"7c3a91…": {
"capturedFirstAt": 1745329200000,
"content": "filter: ...\nexpSuffix: ...\nmetricsRules:\n - ..."
},
"b1d402…": {
"capturedFirstAt": 1745329485000,
"content": "..."
}
},
"nodes": [...]
}
Snapshots are populated by the DebugSessionRegistry the first time it
sees a record carrying a hash it hasn’t recorded yet. The content for a
snapshot comes from the rule’s source — runtime-rule’s DB row for
managed rules, the in-memory disk-read for static rules — neither of
which is on the hot path.
One hash, three places it surfaces — all the same value
The same contentHash appears in three places, computed by the same
SHA-256 over the same content bytes:
| Place | Field | Purpose |
|---|---|---|
| Each captured record in a debug-session response | record.contentHash |
Identify which rule version emitted this record |
ruleSnapshots[hash] in the same response |
the map key + .content |
Carry the rule text that matches record.contentHash |
Rule-content management APIs (existing /runtime/rule/list, new /runtime/oal/files, etc.) |
contentHash field on each rule |
Same identifier the UI uses to correlate captures with rules |
The UI’s invariant: record.contentHash == ruleSnapshots[h].contentHash == /runtime/.../list.contentHash for the same applied content. UI
correlation is “look up by SHA prefix”; no parallel identifier scheme
to map across.
How hot-update interacts
| Runtime-rule action | Holder state | Captured records |
|---|---|---|
| Filter-only update (same identity) | new <clinit> writes new contentHash; gate stays as-is |
Pre-swap records keep their original hash; post-swap records carry the new hash |
| Structural update | same — new hash on the holder | Records before vs. after the swap reference different snapshots; UI labels the boundary |
| LAL block ↔ statement recompile | content unchanged — contentHash re-computes to the same SHA |
All records carry one hash; ruleSnapshots has one entry |
| Inactivate / delete | holder removed; sessions auto-terminate | Final session payload still has all snapshots already captured |
| Static-only rule (no runtime-rule) | hash set once at boot; no further swaps | Single-hash session, single snapshot |
Why content-hash, not a counter
- Static rules have no DB-side revision number. Computing a SHA at boot is the only honest identity the recorder can stamp.
- Idempotence. A re-apply with byte-identical content produces the
same SHA → same
ruleSnapshotsentry → no spurious “rule changed” boundary in the UI when nothing changed. - Cluster agreement is guaranteed by the algorithm, not by
coordination. Hashing rules:
- Algorithm: SHA-256 over the rule’s content bytes, hex-encoded lowercase (64 chars).
- Byte source: the exact bytes the DSL engine compiled —
UTF-8-decoded YAML/
.oaltext from disk for static rules, the UTF-8 bytes from the runtime-rule DB row for managed rules. No normalization (no whitespace stripping, no line-ending rewrites, no comment removal). Two nodes that compile the same content bytes deterministically produce the same digest. - Convergence path: runtime-rule’s reconciler already converges
content cluster-wide within one tick (≤ 30 s). Within that
convergence window peers may briefly carry different content for
the same rule — and therefore different hashes. The session
payload’s
nodes[]array shows each peer’s hash explicitly so the UI sees the divergence honestly during the window; after convergence all peers’ captures stamp the same hash. - Cross-version cluster: if OAP binaries differ across nodes, their shipped static-rule bytes can differ, producing different hashes for the “same” static rule. This is the same issue runtime-rule documents as out of scope; deployment discipline is the fix.
- No counter to wrap, drift, or be surprising. The hash IS the identifier.
3.9 Permanent injection disable — boot-time bytecode opt-out
This flag is additive to the existing runtime controls — not a replacement. Three orthogonal layers govern whether captures can run, in order of escalating strength:
| Layer | Type | What it controls |
|---|---|---|
admin-server.enabled |
runtime (require restart to load) | Whether the admin HTTP server runs at all. If false, no REST endpoints exist. |
dsl-debugging.enabled |
runtime | Whether the dsl-debugging module installs. If false, sessions can’t be created. |
Per-rule GateHolder.gate |
runtime | Whether captures fire on this rule for an active session. Flipped on session install/end. |
dsl-debugging.injectionEnabled (NEW) |
boot-time codegen | Whether probe call sites are emitted into generated bytecode at all. |
The first three already exist (or are introduced earlier in this SWIP) and remain valid — they handle the normal “operator pauses or disables the feature” cases without forcing a restart or a re-codegen. The fourth is what’s new in this section: a stronger, more permanent guarantee for deployments that want no probe call sites in the bytecode at all.
Why such a flag is useful even when the runtime flags exist:
- High-assurance / regulated environments where any debug instrumentation in production-loaded bytecode is a no-go regardless of whether it would currently fire.
- Defense in depth — even an admin who somehow flipped a
GateHoldercannot trigger captures because there are no call sites to fire. - Performance-paranoid sites that want absolute zero overhead past JIT
elimination (the
~1 ns volatile loadof the runtime-gate idle path is not “absolute zero”).
A boot-time configuration flag governs this:
dsl-debugging:
default:
# When false, the DSL codegens emit MAL/LAL/OAL classes WITHOUT any
# MALDebug / LALDebug / OALDebug call sites and WITHOUT the GateHolder static
# field. Generated bytecode is byte-identical to a build without
# SWIP-13. Default is true (capability preserved). Once set false at
# boot, debug captures cannot be enabled until restart with the flag
# back to true.
injectionEnabled: ${SW_DSL_DEBUGGING_INJECTION_ENABLED:true}
What changes when injectionEnabled: false:
| Aspect | injectionEnabled: true (default) |
injectionEnabled: false |
|---|---|---|
| Generated class — chain-stage call sites | if (DEBUG.gate) MALDebug.captureStage(...) |
absent — line not emitted |
Generated class — GateHolder DEBUG static field |
present, registered via <clinit> |
absent |
live-artifact lookup (liveHolderFor(key)) |
resolves to a real holder | unused — feature isn’t installed |
DebugSessionRegistry.install(...) (mutates holder) |
normal install path | DebugSessionRegistry is unbound; install rejected with 503 |
POST /dsl-debugging/session |
accepts and starts a session | rejects with 503 injection_disabled and a body explaining why |
| Existing telemetry / data-path behaviour | identical | identical |
The flag is boot-time only, by design. Flipping it at runtime would
require regenerating every MAL/LAL/OAL class to either add or strip the
probe call sites, and the runtime-rule reload infrastructure works on
content changes, not on codegen-flag changes. Operators changing the
flag must restart OAP for it to take effect. This is enforced by the
provider: at startup the flag is read once into a final boolean injectionEnabled and threaded into every DSL generator’s GenCtx. No
hot-update path exists, no API mutates it.
Status visibility — GET /dsl-debugging/status
A new admin endpoint on admin-server exposes the current posture so
operators / CI / compliance auditors can confirm a node’s state without
parsing YAML:
| Method | Path | Effect |
|---|---|---|
| GET | /dsl-debugging/status |
Return the current injection + session-runtime state |
Response shape:
{
"adminHostEnabled": true,
"dslDebuggingEnabled": true,
"injectionEnabled": false,
"injectionEnabledSource": "SW_DSL_DEBUGGING_INJECTION_ENABLED=false (boot)",
"sessionsAcceptingNewRequests": false,
"activeSessions": 0,
"maxActiveSessions": 200,
"ruleClassesWithProbes": 0,
"ruleClassesTotal": 87
}
Field semantics:
injectionEnabled— the boot-resolved value of the flag. Always reflects what the codegen actually did.injectionEnabledSource— human-readable provenance: which env var or YAML key set the value, and that it was resolved at boot.sessionsAcceptingNewRequests—falsewhenever any of the four layers above (admin-server.enabled,dsl-debugging.enabled, all ruleGateHolderreferences, orinjectionEnabled) puts the system into a state that can’t actually capture. The flag composes the runtime posture into a single yes/no for clients.ruleClassesWithProbes/ruleClassesTotal— quick visual confirmation that the codegen actually applied the flag. WithinjectionEnabled: falsethe first count is 0; withtrueit equals the total. Disagreement signals a bug or partial reload.
GET /dsl-debugging/status is read-only, returns immediately, has no
session-cap interaction, and is safe to scrape at high frequency from
monitoring tools.
Cluster behaviour
injectionEnabled is per-node — it’s a boot-time codegen choice, not a
clustered configuration. A cluster running with mixed
injectionEnabled values has some nodes that can capture and others
that cannot. The nodes[] array in a session response surfaces this
honestly: nodes with injectionEnabled: false contribute
{status: "injection_disabled"} slices instead of records. Operators
who need uniform behaviour set the flag uniformly across the cluster
(deployment discipline; the SWIP doesn’t synchronise it).
3.10 Security — out of scope; delegated to the deployment
The capture API is read-side and can expose sensitive payloads (raw
log bodies, parsed maps, extracted fields) when the operator debugs
the LAL surface. Securing access is out of scope for this SWIP.
admin-server ships unauthenticated this iteration; the deployment is
responsible for putting it behind a sidecar / gateway / IP allow-list.
SWIP-13 does not introduce auth, redaction, or masking primitives —
those concerns live in the dedicated Admin API documentation
(see docs/en/setup/backend/admin-api/),
which carries the security notice for admin-server, the runtime-rule
update API, and the DSL debug API together. The notice is the single
canonical place operators should read before opening the admin port to
their cluster.
4. REST surfaces on the shared admin-server
admin-server is the runtime / on-demand admin surface. It hosts exactly
two concerns in this SWIP — runtime-rule (existing, migrated onto the
shared server) and DSL debugging (new). They share the same Armeria
server, the same admin port (default 17128), and the same security
posture (no auth in this release, gateway-protect, ships disabled by
default):
| Subsection | Purpose | Origin |
|---|---|---|
| 4.1 OAL read-only management API | List loaded .oal files / rules so the OAL debugger has a picker |
NEW — this SWIP |
| 4.2 DSL debug session API | Start / poll / stop debug sessions across MAL / LAL / OAL | NEW — this SWIP |
existing /runtime/rule/* (migrated) |
Runtime-rule hot-update endpoints, hosted on admin-server post-migration | MIGRATED — runtime-rule plugin’s own HTTP server is removed |
Not on admin-server (intentionally): the status-query-plugin
endpoints GET /status/config/ttl and GET /debugging/config/dump,
plus all /debugging/query/* step-through helpers, stay where they are
on the query port (default 12800). They are released APIs serving a
read-only telemetry-inspection audience and we do not move them. The
admin-server URI namespace is reserved for write/admin endpoints
(runtime-rule writes, dsl-debugging session control). Operators who
want config inspection in addition to admin operations open both ports
on their gateway, exactly as today.
4.1 OAL read-only management API — the catalog the debugger picks from
Today MAL and LAL rules are runtime-managed (the runtime-rule plugin
exposes /runtime/rule/list), but OAL rules are not — they are loaded
from oap-server/server-starter/src/main/resources/oal/*.oal at boot and
have no runtime listing API. That gap blocks the OAL debugger: the
operator has to pick “one OAL rule line” to sample, and there is currently
no way for the UI or swctl to discover what rules exist without reading
the source .oal files off disk.
This SWIP closes that gap with a read-only OAL management API. Read-only because OAL hot-update (add/update/inactivate/delete at runtime) is a much larger feature with its own classloader-isolation, alarm-reset, and cluster-coordination concerns — that is explicitly out of scope here, scoped for a future SWIP. What lands now is just the listing, sufficient to drive the OAL debugger’s rule picker.
The API lives in a new small module oal-management (or as a
sub-package inside oal-rt — the implementer picks based on dependency
hygiene), registers its handler on admin-server, and is structured so that
adding write endpoints later is purely additive.
| Method | Path | Effect |
|---|---|---|
| GET | /runtime/oal/files |
List loaded .oal files. One row per file: { name, path, ruleCount, status, contentHash } |
| GET | /runtime/oal/files/{name} |
Single file detail: { name, path, content, rules: [...] } where content is the raw .oal text and rules are the parsed rule definitions |
| GET | /runtime/oal/rules |
Flat list of every rule across every file: [{ file, ruleName, line, sourceScope, expression, function, filters }, ...] |
| GET | /runtime/oal/rules/{ruleName} |
Single rule detail: { file, ruleName, line, sourceScope, expression, function, filters, persistedMetricName } |
Response shape per rule:
{
"file": "core",
"ruleName": "endpoint_cpm",
"line": 14,
"sourceScope": "Endpoint",
"expression": "endpoint_cpm = from(Endpoint.*).cpm();",
"function": "cpm",
"filters": [],
"persistedMetricName": "endpoint_cpm",
"contentHash": "8a21f0…"
}
contentHash is the same SHA-256 the debugger stamps on captured records
(content-hash stamping) — UI
matches record.contentHash against the contentHash returned here to
fetch the right content for rendering. The hash is computed on demand
from the in-memory file content the OAL engine already holds; not stored
in the DB.
The fields come from the existing oal-rt parsing pipeline (parser →
MetricDefinition → MetricDefinitionEnricher → CodeGenModel),
extended with contentHash carried alongside on the
OALRuleRegistry.OalFileSnapshot record (defined below).
OALRuleRegistry — explicit retention contract
Today’s loader does not retain the parsed structures:
OALEngineV2.start
parses, enriches, generates classes, and lets the local OALScriptParserV2
and List<CodeGenModel> go out of scope.
OALEngineLoaderService
only stores the OALDefine set; the parsed AST is discarded after class
generation. Naively re-parsing on each REST call would work but is fragile
(two engines see different shapes if one diverges) and slow.
This SWIP introduces a small new service in server-core that the
loader populates at boot:
package org.apache.skywalking.oap.server.core.oal.rt;
public interface OALRuleRegistry extends Service {
/** Snapshot loaded at boot; updated only by future OAL hot-update. */
List<OalFileSnapshot> files();
Optional<OalFileSnapshot> file(String fileName);
Optional<OalRuleSnapshot> rule(String fileName, String ruleName);
record OalFileSnapshot(
String name, // "core"
String path, // "core.oal"
String content, // raw .oal text
String contentHash, // SHA-256 of `content` bytes — matches captures (see content-hash stamping section)
List<OalRuleSnapshot> rules,
Status status // LOADED | DISABLED | COMPILE_FAILED
) {}
record OalRuleSnapshot(
String file, String ruleName, int line,
String sourceScope, // e.g. "Endpoint"
String expression, // verbatim line(s)
String function, // "cpm" | "longAvg" | "apdex" | …
List<FilterSnapshot> filters,
String persistedMetricName,
String contentHash // copy of file's contentHash for convenience
) {}
}
OALEngineV2.start() calls registry.put(file, snapshot) after class
generation succeeds. Memory cost is bounded by the OAL rule set (~200
rules across ~12 files = a few hundred KiB of String + small objects);
retained for the process lifetime.
When OAL hot-update lands in a future SWIP, the same registry receives
updates through the same put() path — the read API doesn’t change.
The DSLDebuggingProvider’s REST handler reads from this registry; no
re-parsing on the request path.
How the debugger picks a rule
OAL debugging requires the operator to pick (file, ruleName) because OAL
captures sample source rows for one rule line, not the whole file. The
flow is:
- UI calls
GET /runtime/oal/rulesto populate the picker. - Operator picks a rule (e.g.
endpoint_cpmfromcore.oal). - UI calls
POST /dsl-debugging/sessionwithcatalog: "oal",name: "core"(the file),ruleName: "endpoint_cpm". - Backend appends the recorder to that rule’s
GateHolder.recordersarray and flips itsgateif it was the first session — only that rule’sdo<Metric>(source)probes capture.
For MAL and LAL the picker is fed by the existing
/runtime/rule/list (since those catalogs are already runtime-managed).
The OAL picker is the only one that needs this new API.
Phase-0 scope reminder
In this SWIP we ship only the GET endpoints listed above. POSTs that
would mutate OAL state are deferred to a follow-up SWIP. The path prefix
/runtime/oal/* is reserved so that future write endpoints (hot-update,
inactivate, delete) can land without a URL break.
4.2 DSL debug session API
REST routes live in the new dsl-debugging module and register on
admin-server’s Armeria server. URI prefix is /dsl-debugging/* — clean
and self-describing, no overload of the query port’s existing
/debugging/* prefix.
| Method | Path | Body (optional JSON) | Effect |
|---|---|---|---|
| POST | /dsl-debugging/session?catalog=&name=&ruleName=&clientId=&granularity= |
{ recordCap?, retentionMillis?, granularity?: "block" | "statement" } |
Start session; also stops any prior session bound to clientId; broadcast install to peers; return { sessionId, createdAt, retentionDeadline, installed: {created, total}, peers[], priorCleanup } |
| GET | /dsl-debugging/session/{id} |
— | Return aggregated payload (see data contract). 404 session_not_found if no node knows the id. |
| POST | /dsl-debugging/session/{id}/stop |
— | Force close; idempotent. |
| GET | /dsl-debugging/sessions |
— | List active sessions on this node (JSON). |
| GET | /dsl-debugging/status |
— | Module posture: {module, phase, nodeId, injectionEnabled, activeSessions}. |
clientId is a stable identifier the client mints once per debug
context. The UI decides what counts as a debug context — typically
one per browser tab when the page hosts a single debug widget, or one
per widget when the UI embeds several debug views on the same page.
The CLI mints one per invocation. The server doesn’t model the tab /
page / widget distinction; it only enforces “one clientId →
at most one active session, and a new POST replaces the prior”. See
Session lifecycle, storage, and memory control
for how it interacts with the lifecycle.
All catalogs require (catalog, name, ruleName) — the receiver returns
400 missing_param otherwise. For OAL sessions, name is the .oal
file (without extension) and ruleName is picked from /runtime/oal/rules.
For MAL sessions, name is the rule file under metricPrefix (e.g.
vm) and ruleName is the metric name within the file (e.g.
vm_cpu_total_percentage). For LAL sessions, name is the layer-keyed
rule file (e.g. default) and ruleName is the rule name within. All
three tuples are queryable via /runtime/rule/list (and OAL via
/runtime/oal/rules).
Response payload is shaped per-DSL exactly as the
data contract describes; the
receiving node merges its own slice with peer slices into one nodes[]
array under the top-level shape so the UI can show “captured 4 of 6 nodes
responded · oap-03 timed out”.
Security: admin-server inherits and centralises the security caveats already
applied to the runtime-rule REST endpoints — no auth in this release, the
host module ships disabled by default, must be gateway-protected, must
not be exposed to the public internet. Runtime-rule, OAL management, and
DSL-debugging routes are all gated by the same enable flag on admin-server.
No new auth mechanism is introduced.
admin-server — what the shared module provides
oap-server/server-admin/admin-server/ # NEW (under new server-admin/ parent)
src/main/java/.../admin/
AdminServerModule.java # ModuleDefine
AdminServerProvider.java # ModuleProvider — owns Armeria server, port, host, lifecycle
AdminRouteRegistry.java # Service — sub-modules register handler classes
AdminAuthDocumentation.java # one place to document the no-auth caveat
Sub-modules call registry.register(handlerInstance) on prepare();
admin-server aggregates them into one Armeria server bound at boot. No
sub-module owns its own HTTP server.
dsl-debugging — what the new feature module provides
oap-server/server-admin/dsl-debugging/ # NEW (sibling of admin-server under server-admin/)
src/main/java/.../dsldebugging/
DSLDebuggingModule.java # ModuleDefine
DSLDebuggingProvider.java # ModuleProvider — registers REST handler on admin-server
rest/DSLDebuggingHandler.java # Armeria @Get/@Post routes for /dsl-debugging/*
session/DebugSessionRegistry.java # Map<SessionId, DebugSession> + Map<ClientId, SessionId>
session/DebugSession.java # frozen List<String> + lifecycle states
session/DebugRecorderImpl.java # implements DebugRecorder, JSON-serialize-on-capture
bus/InstallDebugSessionRpc.java # cluster fan-out RPCs
bus/CollectDebugSamplesRpc.java
bus/StopDebugSessionRpc.java
bus/StopByClientIdRpc.java # cluster-scope clientId cleanup (load-balancer safe)
The marker DebugRecorder interface, GateHolder, RuleKey, and the
OAL probe class OALDebug live in
server-core (so the MAL/LAL/OAL generators can reference them without
pulling the whole feature module — keeps the dependency graph clean).
dsl-debugging provides the recorder implementation, registry, and REST
host registration.
runtime-rule migration — module relocation + HTTP server merge
Two moves in one release:
-
Module relocation. The existing runtime-rule plugin lives under
oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/today, but it doesn’t receive telemetry data — it serves an admin surface. Under this SWIP the module relocates tooap-server/server-admin/runtime-rule/, becoming a sibling ofadmin-serveranddsl-debuggingunder the newserver-admin/parent:oap-server/server-admin/ # NEW Maven sub-module group admin-server/ # shared Armeria HTTP server runtime-rule/ # MOVED from server-receiver-plugin/ dsl-debugging/ # NEW DSL debugging featureThe selector key in
application.ymlstays asreceiver-runtime-rule:for backward compatibility (operators upgrading don’t have to rename the YAML block); the Java package stays asorg.apache.skywalking.oap.server.receiver.runtimerule.*— only the Maven module path changes. A future cleanup can rename both, but that’s not in scope here. -
HTTP server merge. The runtime-rule plugin’s own embedded Armeria server is removed; its REST handlers register on
admin-server’s shared server. URL paths stay the same:/runtime/rule/addOrUpdate /runtime/rule/fix /runtime/rule/inactivate /runtime/rule/delete /runtime/rule/list /runtime/rule/dumpExisting CLI scripts (
runtime-rule.sh) and operator muscle memory keep working — same routes, same response codes, same body semantics, just hosted on the shared HTTP server.
Both moves happen in Phase 0 (see Phased delivery) so
server-admin/ lands as a coherent group rather than dribbling in
piece by piece.
5. Session lifecycle, storage, and memory control
Captured data is per-session, immutable once captured, and never overwritten. Operator stability comes first: once a sample lands in the session, repeated polls return the same bytes until the session is explicitly stopped or hits its timeout. There is no ring buffer, no drop-oldest — when the cap is reached, the session simply stops accepting new samples.
Lifecycle states
[absent] ──POST /dsl-debugging/session──► CAPTURING ──cap-or-window──► CAPTURED
│ │ │
│ └──manual /stop───────────┤
│ │
└──── timeout ────────────► EXPIRED ──► [absent]
▲
CAPTURED + ttl
── new POST with same clientId ⇒ prior session forced to EXPIRED ──
- CAPTURING — session is installed; capture hooks append samples. Polling is
allowed and returns the partial payload accumulated so far (
status: "capturing"). - CAPTURED — capture is complete (record cap hit, window elapsed, or operator
called
/stopmid-capture). The payload is now frozen. Polling returns the same bytes on every call (status: "captured"). - EXPIRED — the session’s timeout fired. Stored payload is dropped from memory; further polls return 404. Operator must start a fresh session.
Three events terminate a session and free its memory:
-
Manual trigger —
POST /dsl-debugging/session/{id}/stop(mid-capture or post-capture). The session moves directly to EXPIRED on the next sweep — manual stop is interpreted as “I have what I need, throw it away”. -
Implicit cleanup on a new sampling request — cluster-scope. When a
POST /dsl-debugging/sessionarrives carrying aclientId, the receiving node first broadcasts aStopByClientId(clientId)RPC to every peer (including itself) and waits for ack within the per-peer fanout deadline. Each node looks upclientIdin its own localMap<ClientId, SessionId>; if a prior session is bound, the peer terminates it locally (decrements the rule’s recorder ref-count, drops its payload). Once the broadcast settles (or the per-peer deadline expires), the receiving node allocates the new session and runs the normalInstallDebugSessionfan-out.The cluster broadcast is load-balancer safe: because every node receives
StopByClientId(clientId), it doesn’t matter which node the load balancer routes the next POST to — the previous session bound to thatclientId(wherever it lives) gets terminated before the new one is allocated. Without this, a load balancer that bounced between nodes would leave the prior session alive on its original node until retention timeout, and the active-session-cap math would silently double-count.Implementation note: the broadcast is best-effort with a short deadline (default 2 s, same as runtime-rule’s
Suspendfanout). A peer that doesn’t ack within the deadline is logged in the response’speers[]array. No durable management-storage marker is introduced — the prior session on the missed peer self-cleans up via the same retention timeout that handles forgotten sessions in general (default 5 min from creation, hard cap 1 hour). The trade is honest: under partition the active-session count on the missed peer is slightly inflated (by the prior session) until retention elapses, but the peer’s ownMAX_ACTIVE_SESSIONS = 200cap protects it from accumulating unbounded stragglers. -
Timeout — see the cap table below.
A session that has finished capturing (CAPTURED) is not auto-discarded just
because the capture window ended. It lingers until either the operator stops it
or the timeout fires, so the UI can poll without racing against a fast
disappearance.
Caps
| Cap | Default | Mechanism |
|---|---|---|
| Max active sessions per node | 200 | hard ceiling; POST /dsl-debugging/session returns 429 too_many_sessions when full |
| Records per session | default 1000, hard cap 10000 | recorder stops appending once the count is hit and moves the session to CAPTURED. Out-of-range request returns 400 invalid_limits. |
| Capture window (retention) | default 5 min, hard cap 1 hour | per-session retention timeout; sweeper drops the payload at window end |
| Capture-call cost when idle | one volatile-load + branch (gate) | JIT eliminates the call site when gate is false |
The capture surface is bounded by two dimensions only: how many
sessions can be active concurrently, and how many records each session
can retain. There is intentionally no per-session byte cap and no
per-record byte budget — operators sizing for memory look at
maxActiveSessions × recordCap and the per-DSL average payload size.
JSON serialization for the wire happens at probe time so polls return
the same bytes every time and the retention sweeper can free them on a
fixed schedule, but the byte total is reported (not enforced) so
operators can verify their own budget assumptions against real
captures.
Why 200 active sessions is the right ceiling
200 sessions is a defensible upper bound for an OAP node. With the
hard recordCap = 10000 and a typical per-record JSON payload of
~10 KiB (MAL is the largest of the three), the worst-case footprint
across all active sessions on one node is roughly:
worst-case heap ≈ MAX_ACTIVE_SESSIONS × MAX_RECORD_CAP × per-record-bytes
≈ 200 × 10 000 × 10 KiB ≈ 20 GiB (theoretical)
That’s the worst-case product; realistic usage is one to two orders of
magnitude below because most captures stop on retention (5 min default)
or operator stop long before hitting recordCap, and most records are
several KiB rather than the worst-case 10. The 200 default leaves
headroom for many concurrent operators (one debug context = one
sessionId; the UI maintains a single session per debug widget and
reuses it for polls) without letting a runaway script exhaust the
heap. Operators tuning for tighter heap budgets stop sessions sooner
or lower per-session recordCap in the request body — the per-node
ceiling itself is fixed by MAX_ACTIVE_SESSIONS.
The session-id is the unit of accounting; the clientId is the unit of deduplication. The UI is expected to:
- Mint a stable
clientIdper debug context — typically one per browser tab when the page hosts a single debug widget; one per widget when the page embeds multiple. UUID stored insessionStorage(per-tab) or in a per-widget keyed map on the page. - Send
clientIdon everyPOST /dsl-debugging/session. The server cleans up the prior session for thatclientIdbefore allocating a new one, so a forgetful operator who clicks Start sampling repeatedly does not exhaust the active-session ceiling — they hold one slot at a time. - Poll the returned
sessionIduntil the user navigates away or hits Stop. - Send
POST /dsl-debugging/session/{id}/stopon tab close (best-effortbeforeunload).
A future CLI follows the same pattern: start allocates one session under a
stable client ID, tail reuses it, stop releases it. Stale sessions that
the client forgot to stop are reaped by the retention timeout, and any new
start from the same client implicitly cleans up the prior one regardless.
Configuration block — application.yml
Two top-level blocks: admin-server (the shared HTTP server) and
dsl-debugging (the feature). admin-server is also where the existing
runtime-rule plugin reads its host/port now — the HTTP-server keys move
out of receiver-runtime-rule: and into admin-server:. The runtime-rule
plugin keeps only its non-HTTP knobs (reconciler interval, self-heal
threshold).
The structure is one shared block + two feature blocks on top of it:
admin-server: ← shared HTTP server (host, port, security posture)
↑
├─ receiver-runtime-rule: ← feature 1 (DSL hot-update). Routes register on admin-server.
└─ dsl-debugging: ← feature 2 (DSL debug sessions). Routes register on admin-server.
Either feature requires admin-server.selector=default. All three
selectors default to empty (disabled). Operators explicitly opt in
to whichever surface they want; out-of-the-box OAP starts with no
admin port, no debug feature, no runtime-rule.
The “fail fast if a feature is enabled without admin-server” guarantee
uses the OAP module system’s existing requiredModules()
mechanism — no new SWIP machinery needed:
// in DSLDebuggingProvider:
@Override public String[] requiredModules() {
return new String[] { CoreModule.NAME, AdminServerModule.NAME };
}
// in RuntimeRuleProvider (after Phase 0 migration):
@Override public String[] requiredModules() {
return new String[] { CoreModule.NAME, AdminServerModule.NAME, ... };
}
ModuleManager.bootstrap() walks every loaded provider, resolves
requiredModules() against the provider set, and fails the OAP boot
with a ModuleNotFoundException if any required module is absent. So:
| Operator config | Boot result |
|---|---|
| All three selectors empty (default) | OAP starts; no admin port; no debug; no runtime-rule. |
admin-server=default, dsl-debugging= empty |
OAP starts; admin-server runs (with no consumers); idle. |
admin-server=default, dsl-debugging=default |
OAP starts; both bind; routes register on admin-server. |
admin-server= empty, dsl-debugging=default |
Boot fails fast with ModuleNotFoundException: admin-server — operator sees the error and adds SW_ADMIN_SERVER=default. |
admin-server= empty, receiver-runtime-rule=default |
Same — boot fails fast with the same exception path. |
This matches every other receiver / admin module in OAP that depends
on shared infrastructure (every receiver depends on CoreModule the
same way; every storage-using module depends on StorageModule, etc.).
SWIP-13 is reusing the established mechanism, not introducing a new one.
# Shared HTTP server for admin / write APIs (runtime-rule, dsl-debugging).
# DISABLED BY DEFAULT. Enable when any consumer is needed.
# SECURITY NOTICE: no authentication this iteration — gateway-protect
# with IP allow-lists and never expose to the public internet.
admin-server:
selector: ${SW_ADMIN_SERVER:-}
default:
host: ${SW_ADMIN_SERVER_HOST:0.0.0.0}
port: ${SW_ADMIN_SERVER_PORT:17128}
contextPath: ${SW_ADMIN_SERVER_CONTEXT_PATH:/}
idleTimeOut: ${SW_ADMIN_SERVER_IDLE_TIMEOUT:30000}
acceptQueueSize: ${SW_ADMIN_SERVER_QUEUE_SIZE:0}
httpMaxRequestHeaderSize: ${SW_ADMIN_SERVER_HTTP_MAX_REQUEST_HEADER_SIZE:8192}
# Runtime-rule plugin — non-HTTP knobs only after SWIP-13.
# HTTP server config moves to `admin-server:` above. Enabling
# receiver-runtime-rule requires admin-server to also be enabled
# (selector=default); startup fails with a clear error otherwise.
receiver-runtime-rule:
selector: ${SW_RECEIVER_RUNTIME_RULE:-}
default:
refreshRulesPeriod: ${SW_RECEIVER_RUNTIME_RULE_REFRESH_RULES_PERIOD:30}
selfHealThresholdSeconds: ${SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS:60}
dsl-debugging:
selector: ${SW_DSL_DEBUGGING:-}
default:
# Boot-time codegen switch — enable / disable per-rule probe call sites
# in the generated MAL / LAL / OAL bytecode. Default true once the
# module is enabled; set false to keep the REST surface but emit
# bytecode byte-identical to a build without SWIP-13.
injectionEnabled: ${SW_DSL_DEBUGGING_INJECTION_ENABLED:true}
The active-session ceiling and per-session record cap are NOT operator knobs — they are hard-coded SWIP contract values:
MAX_ACTIVE_SESSIONS = 200per node —POST /dsl-debugging/sessionreturns429 too_many_sessionsonce the count is reached. The receiving node also runs the prior-session cleanup pass for the suppliedclientIdBEFORE counting toward this ceiling, so a single client repeatedly clicking Start sampling cannot itself trigger 429.MAX_RECORD_CAP = 10000per session — request bodies asking for more return400 invalid_limits. Session retention is similarly capped at 1 hour (MAX_RETENTION_MILLIS).
There is intentionally no per-session byte cap and no structural
char / sample / label sub-caps. Operators sizing for memory pressure
look at MAX_ACTIVE_SESSIONS × recordCap × per-record-payload-size.
Per-DSL average payload size is reported as totalBytes on every GET
response so operators can verify their budget against real captures.
Storage location
Per-session payloads live in a process-local Map<SessionId, DebugSession>
inside the dsl-debugging module, alongside a sibling
Map<ClientId, SessionId> index used by the implicit-cleanup path. No
persistence layer. Sessions are lost on OAP restart by design — they are
debugging snapshots, not durable artefacts. Operators who need to keep a
result snapshot the response body client-side (the API already returns the
full payload; the CLI’s tail command saves it to a file).
6. Cluster behaviour — best-effort fan-out (no cross-node merge)
OTLP / log / native-trace data lands on whichever OAP receives the push, so a single-node debug session would only catch the slice of traffic that node owns. The debugger therefore broadcasts the session install to every peer so each peer captures its own slice of L1 parsing.
Important: the fan-out is not an attempt to follow data across nodes. There is no
L2 join in scope — every peer’s slice is a self-contained “what arrived here, what
the DSL did with it, what L1 the rule emitted on this node”. The aggregator on the
receiving node concatenates per-node slices into a nodes[] array; it does not
merge or reconcile samples between nodes.
The fan-out rides the admin-internal gRPC bus owned by
admin-server (default port 17129); dsl-debugging ships
DSLDebuggingClusterServiceImpl and registers it alongside
runtime-rule’s RuntimeRuleClusterServiceImpl on the same admin-
internal port. Four RPC types: InstallDebugSession /
StopDebugSession / CollectDebugSamples / StopByClientId. Peer
discovery is handled by AdminClusterChannelManager, which dials each
peer’s admin-internal port via the cluster module’s existing peer
registry. The default per-call timeout is configurable via
admin-server.internalCommunicationTimeout (default 5000 ms).
Install (POST /dsl-debugging/session)
- Cluster-scope clientId cleanup first. Receiving node sweeps any
locally-bound prior session for the same
clientIdand broadcastsStopByClientId(clientId)to every peer with the configured admin-internal-RPC deadline. Each peer looks upclientIdin its own session map; if found, terminates the bound session locally (callssession.boundHolder.removeRecorder, drops payload, frees active-session slot). The fan-out is best-effort — a missed peer is logged inpeers[]; its stale prior session for thatclientIdself-cleans-up via retention timeout. The install does not block on slow peers. - Receiving node generates the new
sessionIdand computesretentionDeadline = createdAt + retentionMillis. - It iterates the sorted peer list (same view
AdminClusterChannelManagerexposes to the runtime-rule main-selection logic) and sendsInstallDebugSession(sessionId, ruleKey, limits)to each peer with the configured admin-internal-RPC deadline. - A peer that does not ack within the deadline is recorded as
{@code FAILED} for that peer (with the failure reason in
detail) and the fanout continues — no peer can stall the operator. - On the receiving node and on every peer that acked
INSTALLED, the session is installed by mutating the rule’sGateHolder(addRecorder(...)); receiver threads on that node read the holder directly. - The receiving node returns
200 { sessionId, createdAt, retentionDeadline, granularity, localInstalled, installed: {created, total}, peers: [{peer, nodeId, ack: INSTALLED | ALREADY_INSTALLED | NOT_LOCAL | TOO_MANY_SESSIONS | REJECTED | FAILED, detail?}], priorCleanup: {local: {nodeId, stoppedCount, stoppedSessionIds}, peers: [...]} }. Theinstalled.created/installed.totalsummary lets the operator see at a glance “session live on N of M OAPs”; per-node detail (with each node’s id and ack) stays inpeers[].priorCleanupreports every prior session stopped during the cluster-scopeclientIdsweep.
If no peer acks (full partition), the session still runs on the receiving node; the response simply reports zero healthy peers. This is the documented “best effort” contract.
Collect (GET /dsl-debugging/session/{id})
- Receiving node sends
CollectDebugSamples(sessionId)to every peer with a short deadline (default 3 s). - Each peer returns its local payload slice (or an empty slice + reason if the session never installed there).
- The receiving node merges slices into the
nodes[]array under the shape from the data contract and returns200. Peers that timed out are recorded asnodes[].status = "timeout"rather than being omitted, so the UI shows partial coverage honestly.
Stop (POST /dsl-debugging/session/{id}/stop)
Best-effort broadcast of StopDebugSession(sessionId); missing acks are logged but
do not block the response. The TTL sweeper is the backstop for a peer that missed
the stop.
Why fan-out and not a single main
Unlike runtime-rule, debug sessions are not destructive — duplicate installs are
idempotent (same sessionId ⇒ no-op), there is no DDL, and a peer that never gets
the install simply contributes nothing to the result. There is no correctness reason
to serialize through a single main, and the fan-out gives the operator visibility
into traffic on every node.
Consistency bounds
| Event | Bound |
|---|---|
| Healthy install | every reachable peer receives within fanout deadline (default 2 s) |
| Peer partitioned during install | session simply does not capture there; reported in peers[] |
| Peer crashes mid-session | session expires via retention timeout on remaining nodes; collect reports node_unreachable |
| Operator polls collect during fanout | aggregator returns partial result; UI shows progress |
Two operators install concurrently (different clientIds) |
independent sessionIds, no shared state, independent caps |
| Same client’s POSTs split across nodes by load balancer | new POST broadcasts StopByClientId(clientId) to every peer before allocating; reachable peers terminate the prior session immediately; a missed peer’s stale session falls out via retention timeout (~10 min); no durable storage marker required |
7. UI surface (recap)
The Claude Design handoff already locked the three views:
- MAL Waterfall · entity-scope selector · per-stage SampleFamily tables
- LAL records-as-columns × blocks-as-rows grid · split
output·log/output·metricrows · per-block sampling toggle - OAL rule-name picker fed by
/runtime/oal/rules·(maxRows, windowSec)▶ Start sampling button · 5-stage waterfall (source · filter · build_metrics · aggregation · emit) with source-row tables
Plus a “Live debugger” entry in the existing left-nav.
8. CLI surface
The shipping CLI for this feature is swctl (or whatever the eventual
debugging-dsl subcommand looks like once the CLI surface is finalized). No
binary script is shipped in the OAP distribution for this feature.
For e2e tests we keep a thin temporary helper (test/e2e-v2/cases/.../dsl-debug.sh
or equivalent) so the test harness can drive the API without depending on the
yet-to-land CLI. That helper is internal to the test tree, not packaged.
Operators can also drive the API directly with curl against
/dsl-debugging/* — the routes are stable and JSON in/out (see
Session API and
the walkthrough in General usage docs below).
9. What is explicitly out of scope
- L2 aggregation, cross-node merge, storage round-trip, query path. The
debugger stops at the L1 emit (
store) stage. End-to-end debugging (L1 → L2 → persistence → query → dashboard) is a separate problem and not part of this SWIP. - DSL chaining across debug sessions. A LAL rule’s
output_metricflows into MAL in production, but a debug session does not transparently follow it. The operator picks one DSL per session. - Behavioural changes to the data path. Rule grammar and semantics unchanged. DSL compilers’ data-path behavior on a non-debug ingest is bytecode-identical to today modulo one additional capture call.
- Storage. Captured samples never persist; they live in the per-session ring buffer until TTL.
- Runtime-rule hot-update. Unchanged. The debug plugin is a sibling, not a layer.
10. Phased delivery
- Phase 0 —
admin-serverextraction + read-only OAL management.- Pure refactor portion: stand up the
admin-servermodule, migratereceiver-runtime-rule’s embedded HTTP server onto it, delete the old runtime-rule HTTP server class. Endpoints and behaviour unchanged. - New surface: ship the read-only OAL management endpoints
(
GET /runtime/oal/files,GET /runtime/oal/files/{name},GET /runtime/oal/rules,GET /runtime/oal/rules/{ruleName}) onadmin-server. This is the prerequisite for the OAL debugger’s rule picker (Phase 3). - This phase is mergeable on its own — it gives operators a runtime view of OAL rules even before any DSL debugging lands — and unblocks the later phases.
- Pure refactor portion: stand up the
- Phase 1 — MAL single-node. Add
GateHolder+RuleKey+DebugRecordermarker (server-core); addMALDebug+MALDebugRecorder(analyzer/meter-analyzer);GateHolder+RuleKeyinserver-core. Stand up thedsl-debuggingmodule with single-nodePOST/GET/STOPregistered onadmin-server. Chain capture hooks inMALMethodChainCodegen.emitChainStatementsplus file-level filter and post-chainmeter_build/meter_emithooks inAnalyzer.doAnalysis. UI MAL Waterfall lights up. e2e-only helper script undertest/e2e-v2/; shipped CLI lands later viaswctl. - Phase 2 — LAL single-node. Block-level hooks in
LALClassGenerator.generateExecuteMethod; terminal-output capture inRecordSinkListener.parse; metric-output capture inMetricExtractor.submitMetrics. Statement-level capture is opt-in viagranularity: "statement"and emitsLALDebug.captureLine(...)per meaningful AST statement under a compile-time flag. UI LAL grid lights up. - Phase 3 — OAL single-node. Per-clause hooks in the OAL FreeMarker
template
dispatcher/doMetrics.ftl—source/filter[i]/build/aggregation/emit. UI OAL waterfall lights up. Picker fed by the read-only OAL management API shipped in Phase 0. - Phase 4 — Cluster fan-out.
InstallDebugSession/CollectDebugSamples/StopDebugSession/StopByClientIdon the admin-internal gRPC bus, aggregation in the receiving node,nodes[]in the response shape. - Phase 5 — UI polish + docs. End-to-end e2e test (push → debug
session → payload assertion),
docs/en/setup/backend/dsl-debug-session-api.md, screenshot walkthrough.
Phase 0 is independent. Phases 1-3 then deliver per-DSL value independently; Phase 4 unlocks cluster usage; Phase 5 lands the ops story.
Imported Dependencies libs and their licenses
None. Implementation stays inside the existing module set (Javassist for codegen
already depended on, Apache 2.0; Armeria HTTP host reused from admin-server,
Apache 2.0; admin-internal gRPC bus reused from admin-server).
Compatibility
Operator-facing change — runtime-rule HTTP keys consolidate onto
admin-serverThis SWIP introduces one operator-facing config break, deliberately scoped to the recently-shipped runtime-rule preview (PR #13851). No released telemetry/query APIs are moved or removed.
What changes What operators do Runtime-rule’s HTTP-server keys ( receiver-runtime-rule.rest*/SW_RECEIVER_RUNTIME_RULE_REST_*) are removed; their values move toadmin-server:(SW_ADMIN_SERVER_*). The runtime-rule URL paths and the port number (default 17128) are unchanged.Operators using SW_RECEIVER_RUNTIME_RULE=defaultmust also setSW_ADMIN_SERVER=default. OAP fails fast at startup if runtime-rule is enabled without admin-server.Released APIs left untouched:
GET /status/config/ttlstays on the query port (12800) viastatus-query-plugin/TTLConfigQueryHandler.GET /debugging/config/dumpand the other/debugging/query/*step-through endpoints stay on the query port viastatus-query-plugin/DebuggingHTTPHandler./runtime/rule/*URL paths and the runtime-rule admin port (17128) are preserved; only the YAML config keys feeding them change.
- Storage structure: no change. Sessions live in memory only.
- Network protocols: additive only. New
/dsl-debugging/*HTTP endpoints on the new sharedadmin-serverArmeria server (default port17128). The existing/runtime/rule/*endpoints migrate to the same server with identical paths and behaviour — operator-visible URLs are unchanged. Thestatus-query-plugin’s existing query-port/debugging/*routes (config dump, MQE step-through, trace step-through) are not moved and not affected. New admin-internal RPC typesInstallDebugSession/CollectDebugSamples/StopDebugSession/StopByClientId— peers running an older version simply do not register the handler and replyunimplemented; the fanout records this asinstall_failedand continues. - Configuration — clean migration.
admin-serveris brand new in this SWIP (no prior release; nothing to be backward-compatible with). The runtime-rule plugin’s HTTP-server keys (receiver-runtime-rule.restHost,restPort,restContextPath,restIdleTimeOut,restAcceptQueueSize,httpMaxRequestHeaderSizeand theirSW_RECEIVER_RUNTIME_RULE_REST_*env vars) are removed in this release; their equivalents move to the newadmin-server:block (SW_ADMIN_SERVER_*). The runtime-rule plugin keeps only its non-HTTP config (reconciler interval, self-heal threshold). Operators upgrading from the runtime-rule preview release must addSW_ADMIN_SERVER=default(or the YAML equivalent) alongside their existingSW_RECEIVER_RUNTIME_RULE=default. Ifreceiver-runtime-rule.selector=defaultis set withoutadmin-serverenabled, OAP fails fast at startup with an explicit error pointing operators at the new block — no silent endpoint disappearance. docs/en/changes/changes.mddocuments the rename in the upgrade notes section so the change is not surprise-discovered.- DSL behavior: bytecode-identical to today on the non-debug path.
- Wire breakage: none.
General usage docs
Two short walkthroughs — first the Phase-0 OAL discovery, then a Phase-1
MAL debug session. Final version will move to
docs/en/setup/backend/dsl-debug-session-api.md.
# 0. The endpoint lives on the shared admin-server port (default 17128) — the
# same port runtime-rule already uses. admin-server must be enabled in
# application.yml; ships disabled by default. Same security caveats as
# runtime-rule (no auth — gateway-protect, never expose publicly).
(a) Browse loaded OAL rules (Phase 0, read-only):
# List loaded .oal files.
curl -s http://localhost:17128/runtime/oal/files | jq .
# → [
# {"name":"core","path":"core.oal","ruleCount":86,"status":"LOADED"},
# {"name":"browser","path":"browser.oal","ruleCount":14,"status":"LOADED"},
# ...
# ]
# Drill into one file's rules.
curl -s http://localhost:17128/runtime/oal/files/core | jq '.rules[0:3]'
# → [
# {"ruleName":"service_resp_time","line":2,"sourceScope":"Service",
# "expression":"service_resp_time = from(Service.latency).longAvg();",
# "function":"longAvg","filters":[]},
# {"ruleName":"endpoint_cpm","line":14,"sourceScope":"Endpoint", ...},
# ...
# ]
# Or pull the flat picker list the UI uses directly.
curl -s http://localhost:17128/runtime/oal/rules | jq 'map(.ruleName)' | head
(b) Run a DSL debug session (Phase 1, MAL single-node):
CLIENT_ID=$(uuidgen) # one per debug context — for the CLI that's typically per shell invocation
# 1. Start a session for the OTEL VM rule. catalog/name/ruleName/clientId go
# in the query string; the optional JSON body tunes per-session limits.
# Any prior session bound to $CLIENT_ID is auto-cleaned up before this
# one is allocated.
curl -s -X POST \
"http://localhost:17128/dsl-debugging/session?catalog=otel-rules&name=vm&ruleName=vm_cpu_total_percentage&clientId=$CLIENT_ID" \
-H 'Content-Type: application/json' \
-d '{"recordCap":1000,"retentionMillis":300000}'
# → {"sessionId":"76b3266a-...",
# "createdAt":1777967921000,"retentionDeadline":1777968221000,
# "installed":{"created":3,"total":5},
# "peers":[{"peer":"10.0.1.12:17129","ack":"INSTALLED",...}, ...],
# "priorCleanup":{"local":{"stoppedSessionIds":["..."]},"peers":[...]}}
# 2. Poll the result while traffic flows.
curl -s http://localhost:17128/dsl-debugging/session/76b3266a-... | jq .
# → MAL-shaped payload with steps[] and per-step samples[],
# plus nodes[] showing per-peer capture coverage.
# 3. Stop early (optional — the session also auto-expires at retention TTL,
# or implicitly when the same clientId starts another session).
curl -s -X POST http://localhost:17128/dsl-debugging/session/76b3266a-.../stop
For an OAL debug session (Phase 3), step 1 is preceded by step (a) above —
the operator picks a ruleName from /runtime/oal/rules and posts
{"clientId": "...", "catalog": "oal", "name": "core", "ruleName": "endpoint_cpm", ...}.
The UI Live debugger view drives the same endpoints and renders the result with
the Waterfall / grid / waterfall views from the Runtime Rule Admin design.
Appendix — Open questions for review
- Per-rule capture cost when idle. Benchmarked target: ≤ 1 % overhead per stage
on a no-op
if (DEBUG.gate)JIT-eliminated path. To be measured underoap-server-benchduring Phase 1. - Cap behavior. Hitting the record cap or the capture window moves the
session from CAPTURING to CAPTURED. Per-field truncation does not terminate
the session — it just shortens the offending value with a
… +N chars truncatedmarker so the rest of the record remains useful. Captured samples are immutable — no ring buffer, no drop-oldest, no overwrite. The payload stays stable across polls until the operator stops the session or the retention timeout fires. UI surfaces the terminal cause (reason: "record_cap" | "window_elapsed" | "manual_stop"). Acceptable? - Reaffirming scope. This proposal is parsing/transformation debugging only — bounded by the L1 emit at the tail of each DSL. No L2, no cross-node merge, no storage / query round-trip. Anything beyond L1 (e.g. “why did this metric appear differently after L2 aggregation?”) is a separate problem and stays outside this SWIP. Confirm.