Cluster Status Check Sequence

The Cluster Status page (/operate/cluster, sidebar Operate → Cluster) is the operator’s single pane for “is the OAP backend healthy and configured correctly?” It runs three checks in parallel: the OAP Query port, the OAP Admin port, and a Zipkin/OTLP trace-source probe. The checks do not block each other, and the page surfaces each pane’s result independently.

Because the panes are independent, a healthy `:12800` alongside a broken `:17128` is a real and recoverable state (for example, the admin port was not exposed through the Kubernetes Service), and Horizon makes that diagnosis obvious. The Zipkin/OTLP pane is informational for the trace menu; a red dot there is not a cluster-wide outage.

Pane A — Query / GraphQL port (`:12800`)

The pane fires a single GraphQL call every 30 seconds:

query {
  version
  getTimeInfo { timezone, currentTimestamp }
  checkHealth { score, details }
}
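
As a rough illustration of how a client could issue this query and read the result, the sketch below builds the HTTP request and flattens a response into the fields the pane displays. The `/graphql` path, the helper names `build_status_request` and `parse_status`, and the sample host are assumptions made for the example, not details confirmed by this page.

```python
import json
import urllib.request

# The same query the pane sends, kept as a constant.
STATUS_QUERY = """
query {
  version
  getTimeInfo { timezone, currentTimestamp }
  checkHealth { score, details }
}
"""


def build_status_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request carrying the status query."""
    body = json.dumps({"query": STATUS_QUERY}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/graphql",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_status(payload: dict) -> dict:
    """Flatten a GraphQL response body into the fields shown in the pane."""
    data = payload["data"]
    return {
        "version": data["version"],
        "timezone": data["getTimeInfo"]["timezone"],
        "timestamp_ms": data["getTimeInfo"]["currentTimestamp"],
        "score": data["checkHealth"]["score"],
        "details": data["checkHealth"]["details"],
    }
```

Passing the request to `urllib.request.urlopen` with a timeout would send it; a connection error, timeout, or 5xx response at that point corresponds to the pane's hard-fail case.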

What the pane shows

| Field | Source | Notes |
| --- | --- | --- |
| Reachable | HTTP success of the GraphQL call | Hard fail → whole pane shows red banner. |
| Version | `version` | The OAP build string. |
| Server timezone | `getTimeInfo.timezone` | UTC offset like `+0800`. Used for time-range conversion throughout the UI. |
| Server timestamp | `getTimeInfo.currentTimestamp` | Epoch ms. UI shows skew vs browser clock if non-trivial. |
| Health score | `checkHealth.score` | `0` = OK, `>0` = degraded, `<0` = not started. |
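
To make the timezone and timestamp fields concrete, here is a minimal sketch of how an offset such as `+0800` could be turned into a time delta and how clock skew could be computed from the epoch-millisecond timestamp. The function names and the 5-second threshold are hypothetical; the page only says skew is shown when it is "non-trivial".

```python
from datetime import timedelta


def parse_utc_offset(offset: str) -> timedelta:
    """Convert an offset such as '+0800' or '-0530' into a timedelta."""
    sign = -1 if offset[0] == "-" else 1
    digits = offset.lstrip("+-")
    hours, minutes = int(digits[:2]), int(digits[2:4])
    return sign * timedelta(hours=hours, minutes=minutes)


def clock_skew_ms(server_ms: int, browser_ms: int) -> int:
    """Positive when the server clock is ahead of the local clock."""
    return server_ms - browser_ms


def skew_is_notable(skew_ms: int, threshold_ms: int = 5_000) -> bool:
    """threshold_ms is an assumed cutoff, not a value taken from Horizon."""
    return abs(skew_ms) > threshold_ms
```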

Failure modes

Hard fail (unreachable): the GraphQL endpoint refused the connection, timed out, or returned 5xx, so `reachable: false`. The whole UI shows a top-of-page “OAP unreachable” banner because query pages cannot render.
Soft fail (degraded): `score > 0`. OAP is up but degraded (storage lag, receiver backlog, internal queue depth). Shown as a yellow “degraded (score N)” chip, with details from `checkHealth.details`.
Soft fail (not started): `score < 0`. The OAP process is running but has not finished initialization. Shown as “not started”; usually transient during a rolling restart.
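
The three failure modes reduce to a small decision on reachability and the sign of the score. The sketch below is one way to express it; `classify_query_pane` and its return strings are illustrative names, not Horizon's actual code.

```python
def classify_query_pane(reachable: bool, score: int = 0) -> str:
    """Map reachability and health score onto the pane states described above."""
    if not reachable:
        # Hard fail: the score is meaningless because the call never succeeded.
        return "unreachable"
    if score > 0:
        return f"degraded (score {score})"
    if score < 0:
        return "not started"
    return "ok"
```

Reachability is checked first on purpose: a stale or default score must never mask an unreachable endpoint.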

Poll cadence

Stale-time: 20 s
Refetch interval: 30 s

Pane B — Admin host (`:17128`)

The pane fires a single admin REST call every 60 seconds:

GET <adminUrl>/debugging/config/dump

OAP returns a flat key/value map. Each required module is reported as enabled when any key with that module’s prefix appears in the dump.

What you see (per refresh)

If the admin host is unreachable, every module shows off and the whole pane goes red. If it is reachable, each module shows on or off independently, depending on whether its keys appear in the config dump:

admin-server — when off, the admin host responded but does not expose the admin selector. The dump endpoint is itself served by admin-server, so in practice this means a custom OAP build. When admin-server is off the dump is empty, so the other three modules all show off as well — one root cause, not three stacked warnings.
receiver-runtime-rule — when off, DSL Management, alarm rules, and the cluster rule matrix are disabled. Yellow badge.
dsl-debugging — when off, the Live Debugger is disabled. Yellow badge.
inspect — when off, the Inspect page is disabled. Yellow badge.
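
The prefix rule can be sketched in a few lines. This example assumes dump keys are dot-separated with the module name first (for instance `inspect.default.someKey`); that key shape, the sample keys, and the name `detect_modules` are assumptions for illustration.

```python
from typing import Dict, Optional

REQUIRED_MODULES = (
    "admin-server",
    "receiver-runtime-rule",
    "dsl-debugging",
    "inspect",
)


def detect_modules(dump: Optional[Dict[str, str]]) -> Dict[str, bool]:
    """Report each required module as on when any dump key carries its prefix.

    Pass None when the admin host is unreachable: every module is then off.
    """
    if dump is None:
        return {module: False for module in REQUIRED_MODULES}
    return {
        # Assumed key shape: "<module>.<provider>.<setting>".
        module: any(key.startswith(module + ".") for key in dump)
        for module in REQUIRED_MODULES
    }
```

An empty dump yields the same all-off result as an unreachable host, which matches the single-root-cause behavior described for `admin-server`.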

What the pane shows

| Module | Hint shown when off |
| --- | --- |
| `admin-server` | “Confirm `SW_ADMIN_SERVER=default` is set on OAP and port 17128 is exposed.” |
| `receiver-runtime-rule` | “Set `SW_RECEIVER_RUNTIME_RULE=default` on OAP to enable DSL Management.” |
| `dsl-debugging` | “Set `SW_DSL_DEBUGGING=default` on OAP to enable the Live Debugger.” |
| `inspect` | “Set `SW_INSPECT=default` on OAP to enable the Inspect page.” |

Poll cadence

Stale-time: 30 s
Refetch interval: 60 s
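
The page does not define the two cadence values, so the following sketch assumes the usual query-cache reading: data older than the stale-time is treated as stale, and the refetch interval is how often a background poll runs. The `PollCadence` type and helper names are hypothetical; the numbers are the ones listed for the Query and Admin panes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PollCadence:
    stale_time_s: float        # age after which cached data counts as stale
    refetch_interval_s: float  # period of the automatic background poll


QUERY_PANE = PollCadence(stale_time_s=20, refetch_interval_s=30)
ADMIN_PANE = PollCadence(stale_time_s=30, refetch_interval_s=60)


def is_stale(cadence: PollCadence, age_s: float) -> bool:
    """True once the cached result is at least stale_time_s old."""
    return age_s >= cadence.stale_time_s


def refetch_due(cadence: PollCadence, age_s: float) -> bool:
    """True once the last fetch is at least refetch_interval_s old."""
    return age_s >= cadence.refetch_interval_s
```

Under that reading, a Query pane result that is 25 seconds old is already stale but is not yet due for its scheduled refetch.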

Pane C — Zipkin / OTLP traces

A third pane probes OAP’s Zipkin v2 REST endpoint and reports reachability. It feeds only the Zipkin/OTLP trace menu — a red dot here is not a cluster-wide outage; the rest of the UI keeps working and only Zipkin/OTLP trace views are affected.

Reading the page during an incident

The triage flow during “Horizon shows banners I don’t understand”:

Is the Query pane green? If not, OAP itself is down or unreachable. Fix OAP first; everything else is downstream.
Is the Admin pane green? If not, expose port 17128, turn on the four selectors, or both. See the per-module hints.
Is the health score > 0? If so, OAP is up but degraded. Pull details from `checkHealth` (visible in the Query pane) and triage on the OAP side.
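
The same triage order can be written as a short function, which makes the precedence explicit: the Query pane is checked before the Admin pane, and the health score is considered last. `next_triage_step` and its messages are illustrative and not part of Horizon.

```python
def next_triage_step(query_ok: bool, admin_ok: bool, score: int) -> str:
    """Return the first action suggested by the triage flow above."""
    if not query_ok:
        # Nothing else can be trusted until the Query port answers.
        return "Fix OAP first: the Query port is down or unreachable."
    if not admin_ok:
        return "Expose port 17128 and check the four selectors."
    if score > 0:
        return "OAP is degraded: read checkHealth details and triage on OAP."
    return "No action suggested by these checks."
```

For example, `next_triage_step(True, False, 4)` points at the admin port first, even though the score is also positive, because the steps are evaluated in order.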
