Runtime Rule Hot-Update API
The runtime rule receiver plugin lets operators add, override, inactivate, and delete MAL and LAL rule files at runtime without restarting OAP. Changes are saved in the configured storage backend (JDBC, Elasticsearch, or BanyanDB) and propagate across every node in an OAP cluster within 30 s by default.
For the consistency contract, the three workflows (boot / on-demand / periodic scan), the lifecycle, and how cluster failures are handled, see Runtime Rule Hot-Update — Architecture. This page focuses on the REST API surface.
⚠️ Security notice
The admin port has no authentication in this release. The module is therefore disabled by default; enabling it opens an HTTP endpoint that can change metric and log-processing rules while OAP is running.
Operators enabling the module MUST:
- Gateway-protect the port with an IP allow-list and separate authentication rules.
- Never expose port 17128 to the public internet.
- Bind the HTTP server to `localhost` or a private network interface if remote access is not required.
Enabling the module
Set the selector to `default` in `application.yml`, or via environment variable: `SW_RECEIVER_RUNTIME_RULE=default`.
Default port is 17128. All config knobs live in `application.yml` under the
`receiver-runtime-rule` block — host, port, periodic-scan interval, self-heal threshold.
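For reference, a hedged `application.yml` sketch — the exact key names and defaults below are illustrative assumptions; check the `application.yml` shipped with your OAP release for the authoritative spelling:

```yaml
# Illustrative sketch only — key names below are assumptions, not authoritative.
receiver-runtime-rule:
  # "-" keeps the module disabled (the shipped default);
  # set SW_RECEIVER_RUNTIME_RULE=default to enable it.
  selector: ${SW_RECEIVER_RUNTIME_RULE:-}
  default:
    host: ${SW_RECEIVER_RUNTIME_RULE_HOST:0.0.0.0}
    port: ${SW_RECEIVER_RUNTIME_RULE_PORT:17128}
```

Per the security notice above, prefer a `host` bound to `localhost` or a private interface unless the port sits behind an authenticating gateway.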
HTTP surface
`/addOrUpdate` takes the rule file as the raw request body and identifies the rule with
the `catalog` and `name` query parameters. `/inactivate` and `/delete` use the same
parameters with an empty body. There is no JSON request envelope, so shell scripts can
send a YAML file directly with `--data-binary @file.yaml`.
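A minimal shell sketch of that call shape — `OAP_HOST` is a placeholder and the `DRY_RUN` guard is our own convenience for previewing the command, not an API feature:

```shell
# Sketch: POST a rule file as the raw request body.
# OAP_HOST is a placeholder; DRY_RUN=1 prints the command instead of executing it.
push_rule() {
  catalog="$1"; name="$2"; file="$3"
  url="http://${OAP_HOST:-localhost:17128}/runtime/rule/addOrUpdate?catalog=${catalog}&name=${name}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl -X POST --data-binary @${file} \"${url}\""
  else
    curl -sf -X POST --data-binary "@${file}" "$url"
  fi
}

DRY_RUN=1 push_rule otel-rules vm vm.yaml
```

The same helper works for `/inactivate` and `/delete` by swapping the path and dropping the `--data-binary` argument.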
Content encoding
Rule content is UTF-8 YAML text. The API never base64-encodes content.
- Raw responses (`Content-Type: application/x-yaml; charset=utf-8`) and raw `/addOrUpdate` request bodies carry content byte-identically.
- JSON responses encode content as a standard JSON string — special characters are JSON-escaped (`"`, `\`, control characters become `\u00XX`, newlines become `\n`). A standard JSON parser yields the original UTF-8 YAML; no additional decode step is needed.
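The round trip can be sanity-checked offline; `resp.json` below is a hand-written fixture, and `python3` is invoked from the shell purely as a standards-compliant JSON parser:

```shell
# Sketch: confirm the JSON envelope's "content" string decodes back to YAML.
# resp.json is a hand-written fixture, not a real server response.
cd "$(mktemp -d)"
cat > resp.json <<'EOF'
{"catalog": "otel-rules", "name": "vm",
 "content": "metricPrefix: meter_vm\nmetricsRules: []\n"}
EOF
# json.load un-escapes \n into real newlines — the YAML body, byte for byte.
python3 -c 'import json, sys; sys.stdout.write(json.load(open("resp.json"))["content"])'
```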
Example JSON response from `GET /runtime/rule?catalog=otel-rules&name=vm` with
`Accept: application/json`:

```json
{
  "catalog": "otel-rules",
  "name": "vm",
  "status": "ACTIVE",
  "source": "runtime",
  "contentHash": "5f9b8d3e...",
  "updateTime": 1714102400000,
  "content": "expSuffix: instance(['host_name','host_ip'], Layer.OS_LINUX)\nmetricPrefix: meter_vm\nmetricsRules:\n  - name: cpu_total_percentage\n    exp: avg_node_cpu_utilization\n"
}
```
The `\n` in the `content` string is the JSON escape for newline. After `JSON.parse()`
the value is the YAML body that `/addOrUpdate` would accept verbatim.
`contentHash` is the SHA-256 of the UTF-8 content bytes (lowercase hex). It is identical
across raw and JSON modes; the JSON envelope’s escaping does not affect the hash. The
same hash appears on `GET /runtime/rule/list`, where `localState` shows whether this node
has already applied that stored version.
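To compare a local file against the API's `contentHash`, hash the raw bytes yourself (this sketch assumes GNU `sha256sum`; on macOS use `shasum -a 256`):

```shell
# Sketch: recompute contentHash locally — SHA-256 of the raw UTF-8 bytes,
# lowercase hex, exactly as sha256sum prints it.
cd "$(mktemp -d)"
printf 'metricPrefix: meter_vm\n' > vm.yaml
hash=$(sha256sum vm.yaml | cut -d' ' -f1)
echo "$hash"
```

A mismatch between this value and the API's `contentHash` means the stored rule differs from your local copy.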
`/addOrUpdate` decodes the request body as UTF-8 regardless of the `Content-Type` header.
Send valid UTF-8 YAML. If the decoded content cannot be parsed or compiled as a rule, the
server returns `400 compile_failed`.
Canonical routes
Write endpoints
| Method | Path | Body | Effect |
|---|---|---|---|
| POST | `/runtime/rule/addOrUpdate?catalog=&name=[&allowStorageChange=true][&force=true]` | raw rule YAML | Creates or replaces a rule. Edits that keep the same metric storage shape are applied without pausing the cluster. Edits that add, remove, or reshape metrics pause affected traffic, update and verify backend storage, save the rule, and then resume. If the posted content exactly matches the current ACTIVE rule, the server returns `no_change`; `force=true` skips that shortcut for recovery. |
| POST | `/runtime/rule/inactivate?catalog=&name=` | empty | Soft-pauses a rule. OAP stops using the rule and saves it as INACTIVE, while the backend measure and historical data remain available for reactivation. |
| POST | `/runtime/rule/delete?catalog=&name=[&mode=revertToBundled]` | empty | Removes an INACTIVE runtime row. Active rules return `409 requires_inactivate_first`. No bundled twin, default mode → drops the row; the backend measure (if any) stays as an inert artefact. Bundled twin exists, default mode → returns `409 requires_revert_to_bundled` because letting bundled silently take over the `(catalog, name)` is a meaningful state change requiring an explicit operator decision. Bundled twin exists, `?mode=revertToBundled` → schema-change path: the orchestrator rehydrates the runtime DSL locally and runs the bundled YAML through the standard apply pipeline so the runtime→bundled delta drops runtime-only metrics, registers bundled-only metrics, and reuses bundled-shared metrics at matching shape; the row is then removed and the bundled rule is the active loader on the local node. Peers converge via the gone-keys reconcile path on their next tick. No bundled twin, `?mode=revertToBundled` → returns `400 no_bundled_twin`. |
Read endpoints
| Method | Path | Effect |
|---|---|---|
| GET | `/runtime/rule?catalog=&name=` | Fetches one rule. Runtime rule first, bundled rule second, otherwise 404. Raw YAML by default; JSON envelope on `Accept: application/json`. Supports `ETag` and `If-None-Match`. |
| GET | `/runtime/rule/bundled?catalog=&withContent=false` | Returns bundled rules for one catalog as JSON. `withContent` defaults to `true`; `false` omits each YAML body. Each item includes whether an operator override exists. |
| GET | `/runtime/rule/list[?catalog=]` | Returns a single JSON envelope `{generatedAt, loaderStats, rules}` merged from stored rules and this node’s local state. Each row carries `loaderKind`, `loaderName`, `bundled`, and `bundledContentHash` so a UI can render override badges without a second roundtrip. Optional `catalog=` narrows the output; unknown values return `400 invalid_catalog`. |
| GET | `/runtime/rule/dump[/<catalog>]` | Downloads a tar.gz of stored runtime rules plus `manifest.yaml`. The server has no bulk-import endpoint; the CLI restore command replays individual `addOrUpdate` and `inactivate` calls. |
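A dump tarball can be inspected offline before restoring; the fixture below hand-builds the `runtime-rule-dump/<catalog>/<name>.yaml` layout described in the restore example on this page — treat the exact entry names as an assumption and verify against a real dump:

```shell
# Sketch: build and inspect a dump-shaped tarball offline.
# The entry layout (runtime-rule-dump/<catalog>/<name>.yaml plus manifest.yaml)
# is taken from the restore example on this page; this is a hand-built fixture.
cd "$(mktemp -d)"
mkdir -p runtime-rule-dump/otel-rules
printf 'metricPrefix: meter_vm\n' > runtime-rule-dump/otel-rules/vm.yaml
printf 'generatedAt: 1730000000000\n' > runtime-rule-dump/manifest.yaml
tar -czf dump.tar.gz runtime-rule-dump
tar -tzf dump.tar.gz | sort    # list entries without extracting
```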
/delete storage semantics — per backend
`/delete` tears down the runtime DSL on the OAP side (the `/inactivate` step that must
precede it has already unparked dispatchers, removed workers, and dropped the model from
the in-memory registry) and removes the stored rule from `/runtime/rule/list`. What
happens to the backend schema and data depends on the path taken:
| Path | Backend effect |
|---|---|
| Default mode, no bundled twin | The runtime DSL was already torn down by `/inactivate`. The backend measure (if any) is left in place as an inert artefact — no listener writes to it, but the schema and historical rows stay. This matches bundled-rule deletion semantics on disk: removing a YAML from `otel-rules/` doesn’t drop its measure either. Reclaim manually via the storage backend’s tools if you need the schema gone. |
| Default mode, bundled twin | Refused with `409 requires_revert_to_bundled`. The operator must opt in explicitly. |
| `?mode=revertToBundled`, bundled twin | Schema-change path. Bundled may have a different shape than runtime. The runtime DSL is rehydrated locally so the apply pipeline can compute the runtime→bundled delta. Metrics in the delta: • Runtime-only (in runtime, not in bundled) — dropped via the listener chain. On BanyanDB the measure/stream is dropped; on ES/JDBC `dropTable` is a documented no-op (their tables are append-only; TTL reclaims space). • Bundled-only (in bundled, not in runtime) — created. • Bundled-shared at matching shape — reused; no schema mutation. • Bundled-shared at differing shape — reshaped via the listener chain (the additive subset each backend supports online: `client.update` for BanyanDB, add-column for JDBC, mapping append for ES). |
| `?mode=revertToBundled`, no bundled twin | Refused with `400 no_bundled_twin`. |
Historical query semantics on ES and JDBC are unchanged from prior releases: tables stay
beyond `dropTable` and TTL reclaims rows.
A re-`/addOrUpdate` of the same rule (same name, scope, and downsampling) replays
schema registration. On BanyanDB this re-creates the measure; on ES/JDBC this is a
no-op against the existing index/table. In both cases new samples land alongside any
retained history.
Catalog shortcut routes
Implicit catalog in the path — useful when scripting against a single catalog:
- `/runtime/mal/otel/{addOrUpdate,inactivate,delete}` → `catalog=otel-rules`
- `/runtime/mal/log/{addOrUpdate,inactivate,delete}` → `catalog=log-mal-rules`
- `/runtime/lal/{addOrUpdate,inactivate,delete}` → `catalog=lal`

`telegraf-rules` is supported by the canonical `/runtime/rule/...` routes; it does not
currently have a shortcut route.
Valid catalogs + names
| Catalog | What it holds |
|---|---|
| `otel-rules` | OTEL MAL rule YAML files |
| `log-mal-rules` | Log-derived MAL rule YAML files |
| `telegraf-rules` | Telegraf MAL rule YAML files |
| `lal` | LAL rule YAML files |
Rule name mirrors the static filesystem layout — a relative path under the catalog root
without extension. Segments match `[A-Za-z0-9._-]+`, joined by `/`. No leading slash,
no `..`, no empty segments, no backslash. Examples: `nginx`, `aws-gateway/gateway-service`,
`k8s/node`.
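A client-side pre-check can mirror this grammar before pushing; the sketch below is deliberately conservative and rejects any `..` occurrence, which is slightly stricter than a per-segment ban:

```shell
# Sketch: validate a rule name against the documented grammar —
# segments of [A-Za-z0-9._-]+ joined by '/'; conservatively rejects
# any '..' occurrence anywhere in the name.
valid_rule_name() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*$' &&
    ! printf '%s' "$1" | grep -q '\.\.'
}

valid_rule_name 'aws-gateway/gateway-service' && echo ok
```

The server remains the authority; this only catches obvious mistakes before a round trip.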
allowStorageChange parameter
`/addOrUpdate` (and the three catalog shortcut variants) accepts an optional
`allowStorageChange` query parameter. Default is `false` when absent.
The server rejects any update that would move storage identity unless this flag is set:
- MAL: scope type change (`service(...)` → `instance(...)`), explicit downsampling function change (`.downsampling(SUM)` → `.downsampling(MAX)`), switching between single / labeled / histogram variants.
- LAL: changing `outputType` on any rule, adding or removing a rule key within a file.
These are the edits that drop an existing BanyanDB measure’s data, change how new samples are stored, or leave old rows behind on JDBC / Elasticsearch. Body, filter, and tag tweaks that preserve each metric’s storage identity are always accepted and do not reset alarm windows.
Accepted truthy forms (case-insensitive): true, 1, yes. Anything else is treated as
false.
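Mirrored client-side, the documented truthy parsing looks like:

```shell
# Sketch of the documented truthy parsing: case-insensitive true / 1 / yes;
# everything else (including "on", "TRUE "-with-spaces, empty) is false.
is_truthy() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    true|1|yes) return 0 ;;
    *)          return 1 ;;
  esac
}
```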
Recommendation — avoid storage-wipe edits in production. Passing `allowStorageChange=true` drops the existing measure’s data on BanyanDB and orphans the old rows on JDBC / Elasticsearch; any alarm rules, dashboards, and historical queries that reference the old shape will miss the pre-change window. Unless the data loss is understood and intended — typically only on a staging cluster or during a planned schema migration — leave the flag off. Prefer a rename (new metric name, new rule name) so the old data keeps accumulating until TTL and the new data starts fresh under a clean identity. Treat the flag as an explicit “I accept data loss” affirmation, not a convenience toggle.
Edge case — `/addOrUpdate` after a `/delete` of a runtime-only rule. Default `/delete` with no bundled twin leaves the backend measure in place as an inert artefact. A later `/addOrUpdate` against the same `(catalog, name)` has no `priorContent` to diff against, so the storage-change guardrail will not refuse the request even when the new content reuses the same metric names with a different shape. The apply pipeline’s listener chain may reshape the inert backend measure silently. If you suspect a stale schema from a previously-deleted rule, push the new rule with `allowStorageChange=true` so the intent is explicit; or rename the metrics in the new rule so the old schema stays inert and a fresh measure is created instead.
Recovery from a failed apply
When an `/addOrUpdate` fails during validation or apply, the node does not lose the
previous rule version. The pre-change rule keeps serving every metric that was not being
changed, and the response includes an `applyStatus` explaining the failure.
What to expect during a failure:
- The node keeps serving the prior rule for unchanged metrics. Samples continue flowing to the existing measures; dashboards and alarm rules against those metrics keep working.
- Metrics that were newly added by the failed attempt are rolled back (no orphan measures left on BanyanDB).
- Metrics in the storage-changing set — where the rule changed a metric’s function or scope and was allowed through with `allowStorageChange=true` — are lost. The old measure was removed before the new one attempted registration; a mid-flight failure leaves neither. This is the documented cost of `allowStorageChange`.
- `/runtime/rule/list` reports the rule’s `lastApplyError` so the failure is visible without tailing logs.
- Severe backend or apply failures also write an `ERROR` log line naming the catalog, rule, and reason.
- Peers either self-heal back to running on the old content (if the row was never committed) or retry the same broken content on their next periodic scan and fail the same way (if the row did advance). Either way, they never serve samples against a moved schema while the main’s apply was in flight.
Recovery flow:

1. Inspect. `curl http://OAP:17128/runtime/rule/list | jq '.rules[] | select(.lastApplyError != "")'`. Confirm which rule is degraded and read the error message.
2. Diagnose. Check the OAP log when the list output is not enough. Typical causes:
   - Rule syntax or parse error — fix the YAML and re-push via `/addOrUpdate`.
   - Storage schema moved without the guardrail — re-push with `?allowStorageChange=true&force=true` (see below), or rename the metric so the old measure keeps accumulating until TTL and new data flows under a new identity.
   - Backend unavailable during a schema update — retry once the backend is healthy; the next periodic scan will also retry without operator action.
3. Apply the fix. Two options:

   Option A: re-push via `/runtime/rule/addOrUpdate` with the recovery flags.

   ```shell
   curl -X POST --data-binary @vm-previous-known-good.yaml \
     "http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm&allowStorageChange=true&force=true"
   ```

   Two flags layer on top of the regular `addOrUpdate`:

   - `allowStorageChange=true` — accepts shape-breaking edits the guardrail would otherwise reject with 409.
   - `force=true` — bypasses the same-content `no_change` HTTP shortcut so a re-post of known-good bytes is treated as a fresh apply request. The persisted row (if any) is re-written and any peers stuck mid-Suspend are re-Resumed; schema and dispatch handlers are not rebuilt — the in-memory engine state is content-keyed, so a true no-op against a healthy node remains a no-op even with `force=true`. This is the unstick path for “the prior push raced a transient backend failure and a peer is still suspended”; if you need the engine to recompile (e.g., after manually editing a backend table), re-post the content with a single-character change first, then the real content.

   Combine both when the recovery target re-shapes the measure. Same failure modes as a normal `/addOrUpdate` — bad rule content still fails `400 compile_failed` and the prior rule keeps serving.

   Option B: manual restore from a prior `/dump` tarball. If you have a dump taken before the broken push, extract the specific file and re-post it with the recovery flags:

   ```shell
   # assuming runtime-rule-dump-2026-04-22.tar.gz was taken before the broken push.
   # Archive entries are under runtime-rule-dump/<catalog>/<name>.yaml.
   tar -xzf runtime-rule-dump-2026-04-22.tar.gz runtime-rule-dump/otel-rules/vm.yaml
   curl -X POST --data-binary @runtime-rule-dump/otel-rules/vm.yaml \
     "http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm&allowStorageChange=true&force=true"
   ```

4. Verify. Re-run `list` and confirm `lastApplyError` is cleared and `localState` is `RUNNING`. Watch the OAP log for the apply-OK confirmation.
5. (Best practice) Take a fresh `/runtime/rule/dump` immediately after a successful recovery so the new baseline is captured for any future incident.
What the recovery flags do NOT do:
- They do not roll the rule content back to a previous version automatically. Runtime-rule storage is last-write-wins; `/addOrUpdate` (with or without `force=true`) is a write path, not a rollback path. The operator supplies the content to restore.
- They do not bypass rule compile errors. If the content is syntactically invalid, the node returns `400 compile_failed` whether or not `force=true` is set. The flags accept storage-level changes the guardrail would block and re-drive a stuck same-content shortcut; they do not accept broken rule content.
Response codes
Write endpoints return JSON: `{applyStatus, catalog, name, message}`. Read endpoints use
the response formats listed above; their error responses use the same JSON shape.
Success
| Status | applyStatus | Meaning |
|---|---|---|
| 200 OK | `no_change` | content byte-identical to current row; nothing to do |
| 200 OK | `filter_only_applied` | body / filter edits applied via fast path; no backend storage change, no alarm reset |
| 200 OK | `structural_applied` | storage-changing edit applied: cluster pause, backend update and check, persist, cluster resume all succeeded |
| 200 OK | `inactivated` | row flipped to INACTIVE; backend measure and data preserved |
| 200 OK | `static_tombstoned` | `/inactivate` against a rule that exists only on disk; an INACTIVE tombstone row is now persisted |
| 200 OK | `already_inactive` | `/inactivate` against an already-inactive row; idempotent no-op |
| 200 OK | `deleted` | `/delete` of a rule with no bundled twin; row removed, backend measure left as inert artefact |
| 200 OK | `reverted_to_bundled` | `/delete?mode=revertToBundled`; runtime row removed, bundled rule installed via the apply pipeline (schema change handled by the standard delta path) |
| 200 OK | `not_found` | `/inactivate` or `/delete` against an absent rule; idempotent no-op |
| 200 OK | `filter_only_persisted` | row persisted but the in-memory swap threw on this node; converges on the next periodic scan |
Client error — caller has to act
| Status | applyStatus | Meaning |
|---|---|---|
| 400 Bad Request | `compile_failed`, `empty_body`, `invalid_*` | rule parse failure or request validation failure; row was NOT persisted |
| 400 Bad Request | `invalid_catalog`, `invalid_mode` | unknown `catalog=` or `mode=` query value |
| 400 Bad Request | `no_bundled_twin` | `/delete?mode=revertToBundled` against a rule with no bundled YAML on disk; drop the mode flag, or check that the bundled YAML exists |
| 409 Conflict | `storage_change_requires_explicit_approval` | update would move storage identity and `allowStorageChange` was not set — no cluster pause, no persist, no side effects |
| 409 Conflict | `update_in_progress` | another apply is already in flight for this rule; retry after a few seconds |
| 409 Conflict | `requires_inactivate_first` | `/delete` against an ACTIVE row; run `/inactivate` first, then `/delete` |
| 409 Conflict | `requires_revert_to_bundled` | `/delete` (default mode) against a rule with a bundled YAML twin on disk; either re-issue with `?mode=revertToBundled` to fall back to bundled, or leave the row INACTIVE |
| 409 Conflict | `delete_refused` | cross-file ownership conflict: bundled’s claims overlap another active bundle. Update or `/inactivate` the conflicting bundle(s) first |
| 503 Service Unavailable | `storage_unavailable` | storage could not be read while checking the current rule; retry when storage is healthy |
Cluster-routing errors — usually transient
| Status | applyStatus | Meaning |
|---|---|---|
| 409 Conflict | `origin_conflict` | a peer rejected the cluster pause because it was already running its own apply (split-brain); the loser aborts and resumes |
| 409 Conflict | `split_brain_detected` | this node detected a competing main during the cluster pause; aborted and broadcast a resume |
| 421 Misdirected Request | `cluster_view_split` | the receiving node’s peer list disagreed with the sender’s; refused to re-forward. Wait a few seconds for the peer list to settle, then retry |
| 502 Bad Gateway | `forward_failed` | could not reach the cluster main to forward the request; transport error detail in `message` |
Server error — apply or persist failed
| Status | applyStatus | Meaning |
|---|---|---|
| 500 Internal Server Error | `ddl_verify_failed` | backend storage was changed but the post-apply check rejected the new shape; new metrics rolled back, prior rule preserved |
| 500 Internal Server Error | `apply_failed` | server failed while applying the rule; partial changes rolled back, prior rule preserved |
| 500 Internal Server Error | `persist_failed` | row write failed; on filter-only this node still serves the pre-edit rule, on structural the local node rolled back and resumed peers |
| 500 Internal Server Error | `commit_deferred` | apply succeeded and row was persisted, but the local finishing step failed on this node. Storage is authoritative and peers will converge; this node will retry on its next periodic scan |
| 500 Internal Server Error | `teardown_deferred` | row was inactivated, but local cleanup failed; this node retries on the next periodic scan |
| 500 Internal Server Error | `revert_to_bundled_failed` | bundled apply failed during DDL or verify (typically a backend-storage issue — BanyanDB unreachable, shape rejection, or schema-barrier timeout). The orchestrator unwound the step-1 runtime install so local state matches the persisted INACTIVE row. Retry once storage recovers. |
| 500 Internal Server Error | `revert_to_bundled_precondition_failed` | revertToBundled prep step failed (no engine for catalog, MeterSystem unavailable for installRuntime). Local state is unchanged. Retry when the prerequisite recovers. |
| 500 Internal Server Error | `dao_unavailable`, `inactivate_failed`, `delete_failed`, other `*_failed` | management storage or local cleanup failed; check the `message` for the specific failure point. |
Per-node list output
`GET /runtime/rule/list` returns a single JSON envelope:

```json
{
  "generatedAt": 1730000000000,
  "loaderStats": { "active": 27, "pending": 0 },
  "rules": [
    {
      "catalog": "otel-rules",
      "name": "vm",
      "status": "ACTIVE",
      "localState": "RUNNING",
      "suspendOrigin": "NONE",
      "loaderGc": "LIVE",
      "loaderKind": "RUNTIME",
      "loaderName": "runtime-rule:otel-rules/vm@0428-153042",
      "contentHash": "7c3a…",
      "bundled": true,
      "bundledContentHash": "c3d4…",
      "updateTime": 1730000000000,
      "lastApplyError": ""
    }
  ]
}
```
Bundled-only rows (no operator override) and recently deleted rows omit fields that
do not exist in storage, such as `updateTime` and `lastApplyError`. UI clients call
`fetch().json()` once; operators can `jq '.rules[]'` for line-oriented inspection.
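For hosts without `jq`, the same triage works offline against a saved response; `list.json` below is a hand-written fixture shaped like the envelope above, and `python3` is invoked from the shell purely as a JSON parser:

```shell
# Sketch: offline triage of a saved /runtime/rule/list response without jq.
# list.json is a hand-written fixture, not a real server response.
cd "$(mktemp -d)"
cat > list.json <<'EOF'
{"generatedAt": 1730000000000,
 "loaderStats": {"active": 2, "pending": 0},
 "rules": [
   {"catalog": "otel-rules", "name": "vm", "status": "ACTIVE", "lastApplyError": ""},
   {"catalog": "lal", "name": "nginx", "status": "ACTIVE", "lastApplyError": "compile_failed"}]}
EOF
python3 - <<'EOF'
import json
envelope = json.load(open("list.json"))
for rule in envelope["rules"]:
    if rule.get("lastApplyError"):   # empty string means the last apply succeeded
        print("{}/{}: {}".format(rule["catalog"], rule["name"], rule["lastApplyError"]))
EOF
```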
Rule status by source (bundled vs. runtime)
The combination of status, loaderKind, and bundled tells you which copy of a rule
the OAP is actually serving on this node. Reading these three fields together:
| Operator action history | status | loaderKind | bundled | What is serving |
|---|---|---|---|---|
| Bundled rule shipped on disk; operator never touched it | `BUNDLED` | `NONE` | `true` | Bundled YAML, served from the OAP’s shared default classloader (registered at boot by the catalog loaders). |
| Operator pushed `/addOrUpdate` overriding a bundled rule | `ACTIVE` | `RUNTIME` | `true` | Runtime override in a per-file `runtime-rule:` loader. Compare `contentHash` with `bundledContentHash` to detect drift. |
| Operator pushed `/addOrUpdate` for a brand-new rule (no bundled twin) | `ACTIVE` | `RUNTIME` | `false` | Runtime override in a per-file `runtime-rule:` loader. No bundled fallback. |
| Operator `/inactivate`d a runtime override of a bundled rule | `INACTIVE` | `NONE` | `true` | Nothing — handlers are unregistered. The bundled rule does not auto-resurrect; to turn it back on, push `/addOrUpdate` (with the bundled YAML or your own) or call `/delete?mode=revertToBundled` (which reverts to bundled via the schema-change path). Plain `/delete` is refused with `409 requires_revert_to_bundled` to force the explicit decision. |
| Operator `/inactivate`d a bundled-only rule | `INACTIVE` | `NONE` | `true` | Nothing — same as above. The INACTIVE row is a tombstone carrying the bundled YAML at inactivate-time. |
| Operator `/inactivate`d a brand-new runtime rule | `INACTIVE` | `NONE` | `false` | Nothing — handlers gone. To turn back on: `/addOrUpdate` (with new content) or `/delete` (rule is fully gone). |
| `/delete?mode=revertToBundled` propagating after a bundled-twin row was removed | n/a (no row) | `BUNDLED` | `true` | Bundled rule, freshly compiled into a `bundled:` loader. Equivalent to a fresh boot of bundled. |
Quick decision rules for an operator reading `/list`:

- `status=BUNDLED` → comes from disk only.
- `status=ACTIVE` + `bundled=true` + `contentHash != bundledContentHash` → runtime override is modified relative to bundled. UIs typically render this as “Override (modified)”.
- `status=ACTIVE` + `bundled=true` + `contentHash == bundledContentHash` → runtime override matches bundled. UIs typically render this as “Override (matches bundled)” — common after an explicit `/addOrUpdate?source=bundled` revert.
- `status=ACTIVE` + `bundled=false` → runtime-only rule, no on-disk twin.
- `status=INACTIVE` → soft-paused. The DAO row preserves the content the operator last had; `/list` does not surface it (call `GET /runtime/rule` for the YAML).
- `loaderKind=BUNDLED` → a `bundled:` loader is currently serving (typical after `/delete?mode=revertToBundled`, where the bundled YAML was compiled into a fresh per-file loader).
- `loaderKind=NONE` → no per-file loader. For `BUNDLED` this is normal (shared default loader). For `INACTIVE` this is the rule being off.

Field by field:

- `status` — `ACTIVE` or `INACTIVE` for stored rows. `BUNDLED` and `n/a` are synthesized list values: `BUNDLED` — shipped on disk, no operator override; the healthy steady state, no runtime row exists. `n/a` — transient: a runtime row was just removed and this node hasn’t swept it yet; cleared on the next periodic scan.
- `localState` — per-node transient: `RUNNING` | `SUSPENDED` | `NOT_LOADED`. Distinct from `status`; a node mid-structural-apply is `ACTIVE` + `SUSPENDED`. After `/inactivate`, `localState` is `NOT_LOADED` regardless of whether a bundled twin exists on disk — `/inactivate` is a soft-pause that respects the operator’s “off” intent. Bundled fall-over only fires on `/delete` (default mode for a bundled-twin row) or the gone-keys reconcile path.
- `suspendOrigin` — when `localState=SUSPENDED`, who paused this node: `SELF` — this node is running its own apply. `PEER` — the cluster main paused this node for its apply. `BOTH` — should not appear under correct routing; presence signals a transient split-brain that clears via the normal handshake or the 60 s self-heal.
- `loaderGc` — diagnostic indicator showing whether the per-rule isolation has been retired for this rule and (if so) whether the JVM has reclaimed it. Operators normally don’t need to act on this; a value other than `LIVE` for an `ACTIVE` row would suggest a rule-cleanup issue worth investigating.
- `loaderKind` — origin of the per-file class loader currently serving this rule: `RUNTIME` — operator-pushed runtime override. `BUNDLED` — bundled rule serving via bundled fall-over (a runtime override was previously in place, then `/delete?mode=revertToBundled` reinstalled the bundled YAML in a fresh `bundled:` loader). `NONE` — no per-file loader (typical for bundled-only rules served from the shared default loader; also a row whose loader has been retired but not yet replaced).
- `loaderName` — formatted loader name (`<kind>:<catalog>/<rule>@<MMdd-HHmmss>`), the same string the JVM surfaces in stack traces and the loader graveyard’s INFO/WARN log lines. Empty when `loaderKind` is `NONE`.
- `contentHash` — SHA-256 of the stored content for runtime rows, or the local content for bundled-only and recently deleted rows. Matching hashes plus `localState=RUNNING` mean two nodes are serving the same content for that rule.
- `bundled` — `true` when a bundled YAML exists on disk for `(catalog, name)`. Set on every row regardless of status, so a UI can render an “Override” / “Modified from bundled” badge by comparing `contentHash` with `bundledContentHash`.
- `bundledContentHash` — SHA-256 of the bundled YAML, present only when `bundled=true`. A diff between `contentHash` and `bundledContentHash` indicates a runtime override that has drifted from the bundled rule.
- `lastApplyError` — most recent local apply error. Empty when the last apply succeeded, no attempt has been made, or the rule was inactivated (the inactive path clears stale errors so `/list` doesn’t surface an error against a rule that is already down).
- `pendingUnregister` — only set for `status=n/a` entries; the row was just deleted and teardown is scheduled for the next periodic scan.
The `loaderStats` envelope counter exposes process-wide DSL classloader bookkeeping —
`active` is the number of rules currently served by per-file loaders, `pending` is the
number of retired loaders the JVM has not yet collected. A steadily elevated `pending`
across many polls is the leak signal the OAP also surfaces as a WARN log line.
Reading a single rule’s content — GET /runtime/rule
`GET /runtime/rule?catalog=…&name=…` returns the YAML body for one rule. By default
(`source=runtime` or omitted) the runtime row wins — bundled YAML is returned only when no
runtime row exists. Pass `?source=bundled` to read the bundled YAML even when a runtime
override is in place; the response 404s with `not_found` when the rule has no bundled twin.
This makes the “compare runtime override against bundled” workflow a two-call sequence:
fetch the runtime body with the default request, then fetch the bundled body with
`?source=bundled` and diff in the editor. `POST /runtime/rule/delete` drops the runtime
override; the next `/list` will show the row served by the bundled fall-over
(`loaderKind=BUNDLED`).
Consistency model — at a glance
The full contract is in the architecture doc. The headline:
- Persist is commit. Once `/addOrUpdate` returns 200, the cluster will converge on that content.
- Last write wins. Concurrent writes to different nodes serialize on the cluster main; the second write wins. The losing operator gets `409 split_brain_detected` if the cluster detected the race; otherwise both operators see 200 and the second commit’s content is what every node ends up running.
- Bounded convergence. Healthy structural commits land cluster-wide within 30 s (one periodic scan). Aborted commits self-heal within 60 s. Filter-only edits land locally in milliseconds and on every other node within 30 s.
- No quorum, no leader election, no two-phase commit. The runtime-rule entry in storage is the single source of truth.
- Samples for an affected metric are dropped during a structural cutover. This is by design — the schema is moving and in-flight samples have no valid landing.