SWIP-14 Inspect API for Admin Server
Motivation
When operators author OAL / MAL / LAL rules and turn them on (statically or via runtime-rule hot-update from SWIP-13), the natural follow-up questions are:
- “Which metrics did this rule actually produce?”
- “For metric service_xyz, which entities are emitting values right now?”
- “Once I have an entity, can I query it via MQE without guessing how to reconstruct the Entity shape?”
Today the only way to answer these is to dig through dashboards, write exploratory MQE expressions against guessed entity names, or read storage directly. There is no first-class “browse” interface.
The public GraphQL surface deliberately does not expose “enumerate entities with values for metric X in range R”, because that scan is unbounded and would compete with user-facing query traffic for storage resources. Operators still need it — for debugging rules, validating new monitoring features, and onboarding new agents.
This SWIP proposes an Inspect API on the existing admin-server (SWIP-13’s admin-host bundle, default port 17128) that exposes:
1. The metric catalog with type + scope + supported downsamplings.
2. For a given metric + time range, the set of entities (capped) that hold values, decoded into MQE-ready form.
The output of (2) is shaped so the operator can paste it directly into a
follow-up MQE call against the existing MetricsExpressionQuery GraphQL
mutation. This SWIP does not introduce a new MQE entry point — the public
GraphQL path remains the single MQE surface.
Scope
In scope:
- COMMON_VALUE and LABELED_VALUE metric types: the two dataTypes that MQEVisitor.visitMetric actually dispatches today.
- Service, ServiceInstance, Endpoint, ServiceRelation, ServiceInstanceRelation, EndpointRelation scopes.
- All three storage backends: BanyanDB, Elasticsearch, JDBC.
- /inspect/metrics lists every registered metric regardless of type (including HEATMAP and SAMPLED_RECORD) so operators can see the full catalog. Only /inspect/entities is restricted to MQE-dispatchable types.
- Companion: relocate + dual-bind the Status API. The existing status-query-plugin (cluster nodes / alarm runtime status / TTL config / debug query trace) is replaced by a new status feature module under server-admin/. The module registers its handlers on both the public REST HTTPHandlerRegister (preserving today’s binding, since skywalking-ui consumes these endpoints on core.restPort) and, if admin-server is enabled, the admin-server HTTPHandlerRegister. The default state remains “status on the public REST port” so the UI keeps working unchanged. No new endpoint is added: clients that need to discover the OAP REST address parse the existing GET /debugging/config/dump, which already publishes the effective configuration including REST host and port.
Out of scope (deferred):
- HISTOGRAM / HEATMAP metrics from /inspect/entities. MQE has no dispatch path for histogram dataType: MQEVisitor.visitMetric only handles COMMON_VALUE, LABELED_VALUE, and SAMPLED_RECORD; histograms flow through the legacy MetricsQuery.readHeatMap GraphQL resolver, which is being phased out. The mqeEntity block this API produces would not be usable for them. They remain visible in /inspect/metrics (with type: HEATMAP) so operators can see they exist.
- SAMPLED_RECORD metrics from /inspect/entities. These live in Stream storage (logs, trace segments, browser errors) and are addressed via IRecordsQueryDAO / trace-relative queries, not the metrics DAO. A future Inspect-Records endpoint can mirror this design once the scoping question is settled. They remain visible in /inspect/metrics.
- Process and ProcessRelation scopes. ProcessID is a SHA-256 hash with no decoder; readable rendering would require a ProcessTraffic join, which we punt on for v1.
- Firing MQE from admin-server. Operators take the inspect response and call the existing public GraphQL execExpression mutation. Keeping MQE on one surface preserves auth/quotas/observability there.
Architecture
┌── operator / swctl / curl ──┐ ┌── skywalking-ui ──┐
│ /inspect/* (admin-only) │ │ /status/* (UI) │
│ /status/* (admin mirror) │ │ │
└─────────────┬───────────────┘ └─────────┬─────────┘
│ admin port 17128 │ public REST port 12800
▼ ▼
┌─────────────────────────────────┐ ┌──────────────────────────────┐
│ admin-server (Armeria) │ │ public REST (sharing/core) │
│ ┌───────────────────────────┐ │ │ GraphQL / MQE / receivers │
│ │ inspect feature module │ │ │ │
│ │ ├── /inspect/metrics │ │ │ ┌────────────────────────┐ │
│ │ └── /inspect/entities │ │ │ │ status feature module │ │
│ └───────────────────────────┘ │ │ │ ├── /status/cluster/.. │ │
│ │ │ │ ├── /status/alarm/.. │ │
│ ┌───────────────────────────┐ │ │ │ ├── /status/config/.. │ │
│ │ status feature module │◄─┼───┼───┤ └── /debugging/... │ │
│ │ (same handler instance, │ │ │ │ │ │
│ │ dual-bound when admin │ │ │ └────────────────────────┘ │
│ │ is on; this side is │ │ │ │
│ │ silent if admin off) │ │ └──────────────────────────────┘
│ └───────────────────────────┘ │
└─────────────────────────────────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────────┐
│ ValueColumnMetadata │ │ IMetricsQueryDAO │
│ StorageModels │ │ .listEntityIdsInRange(...) │ ◄── new method
│ MetadataQueryService │ │ (BanyanDB / ES / JDBC impl) │
│ DefaultScopeDefine │ └──────────────────────────────┘
└──────────────────────────┘
│
▼
decode entity_id via IDManager.*ID.analysisId(),
enrich with layer(s) from MetadataQueryService cache,
shape into MQE-ready Entity
inspect-client flow (admin operator):
1. (one-time) GET admin:17128/debugging/config/dump → parse REST host/port
OR rely on a known deployment-time URL.
2. GET admin:17128/inspect/metrics → pick a metric
3. GET admin:17128/inspect/entities?metric=… → get entityId + mqeEntity
4. POST <REST URL>/graphql execExpression( → fire the actual query on
entity = mqeEntity, expression = …) the public GraphQL surface
ui flow (skywalking-ui, unchanged):
1. GET oap:12800/status/cluster/nodes, /status/alarm/rules, /status/config/ttl, …
(admin-server may be off — these still work via the public REST binding)
The inspect feature module is a sibling of dsl-debugging and
runtime-rule under the existing admin-host. It registers its handlers in
start() against the shared HTTPHandlerRegister. No new port, no new auth
surface.
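Step 3 of the inspect-client flow can be sketched as a small URI builder. This is only an illustration: the class name `InspectUris` and the base URL `http://oap:17128` are assumptions, and the one real subtlety it shows is that `start` and `end` contain a space that must be percent-encoded.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that builds the /inspect/entities request URI. */
final class InspectUris {
    private InspectUris() {
    }

    /** adminBase is an assumed deployment URL such as "http://oap:17128". */
    static URI entities(String adminBase, String metric, String start, String end, String step) {
        // LinkedHashMap keeps the parameters in a stable, readable order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("metric", metric);
        params.put("start", start);
        params.put("end", end);
        params.put("step", step);
        String query = params.entrySet().stream()
            .map(e -> e.getKey() + "=" + encode(e.getValue()))
            .collect(Collectors.joining("&"));
        return URI.create(adminBase + "/inspect/entities?" + query);
    }

    // URLEncoder emits '+' for spaces (form encoding); "%20" is unambiguous in a URI query.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }
}
```

The resulting URI can be passed to any HTTP client; the follow-up MQE call in step 4 still goes to the public GraphQL surface.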
API
GET /inspect/metrics
Lists all registered metrics with type, scope catalog, and the downsampling levels at which each metric is materialized.
Query parameters:
| Name | Required | Description |
|---|---|---|
| regex | no | Java regex over metric name. Default .*. |
| type | no | Filter by REGULAR_VALUE / LABELED_VALUE / HEATMAP / SAMPLED_RECORD. Repeatable. |
| mqeQueryable | no | If true, return only metrics that /inspect/entities accepts (REGULAR_VALUE, LABELED_VALUE). Default false. |
| catalog | no | Filter by catalog name (SERVICE, SERVICE_INSTANCE, ENDPOINT, SERVICE_RELATION, SERVICE_INSTANCE_RELATION, ENDPOINT_RELATION). Repeatable. |
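How the four query parameters might combine is sketched below. `MetricCatalogFilter` and `MetricRow` are hypothetical names, and the sketch assumes the regex must match the whole metric name (the proposal does not say whether a partial match suffices) and that an empty `type` or `catalog` set means "no filter".

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical filter over the metric catalog; not the actual handler code. */
final class MetricCatalogFilter {
    /** Minimal row type carrying only the fields the filters look at. */
    static final class MetricRow {
        final String name;
        final String type;
        final String catalog;

        MetricRow(String name, String type, String catalog) {
            this.name = name;
            this.type = type;
            this.catalog = catalog;
        }
    }

    // The two types /inspect/entities accepts, per the mqeQueryable parameter.
    private static final Set<String> MQE_QUERYABLE = Set.of("REGULAR_VALUE", "LABELED_VALUE");

    static List<MetricRow> filter(List<MetricRow> all, String regex, Set<String> types,
                                  Set<String> catalogs, boolean mqeQueryable) {
        Pattern pattern = Pattern.compile(regex == null ? ".*" : regex);
        return all.stream()
            .filter(m -> pattern.matcher(m.name).matches())            // assumed: full match
            .filter(m -> types.isEmpty() || types.contains(m.type))
            .filter(m -> catalogs.isEmpty() || catalogs.contains(m.catalog))
            .filter(m -> !mqeQueryable || MQE_QUERYABLE.contains(m.type))
            .collect(Collectors.toList());
    }
}
```

All filters are conjunctive in this sketch, so `mqeQueryable=true` combined with `type=HEATMAP` yields an empty list.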
Response (200 OK, application/json):
{
"metrics": [
{
"name": "service_cpm",
"type": "REGULAR_VALUE",
"catalog": "SERVICE",
"scopeId": 1,
"scope": "Service",
"valueColumnName": "value",
"downsamplings": ["MINUTE", "HOUR", "DAY"]
},
{
"name": "service_percentile",
"type": "LABELED_VALUE",
"catalog": "SERVICE",
"scopeId": 1,
"scope": "Service",
"valueColumnName": "value",
"downsamplings": ["MINUTE", "HOUR", "DAY"]
},
{
"name": "endpoint_response_time",
"type": "HEATMAP",
"catalog": "ENDPOINT",
"scopeId": 3,
"scope": "Endpoint",
"valueColumnName": "dataset",
"downsamplings": ["MINUTE", "HOUR", "DAY"]
}
]
}
Implementation:
- Source of metric list is ValueColumnMetadata.INSTANCE.getAllMetadata(), the same map already exposed via MetricsMetadataQueryService.listMetrics.
- Entries with Column.ValueDataType.NOT_VALUE are filtered out. These are persisted-but-not-queryable internal columns (topology lines, service entries) per Column.java’s contract. They never appear in /inspect/metrics.
- Entries whose scope resolves to Scope.All are filtered out as well (Scope.All is deprecated and not routable through the inspect path).
- type is mapped from Column.ValueDataType exactly as MetricsMetadataQueryService.typeOfMetrics does today.
- scope is Scope.Finder.valueOf(scopeId).name().
- downsamplings is derived once at startup by scanning StorageModels.allModels(), grouping by metric name, collecting model.getDownsampling() values into a Set. Pure metadata, no storage I/O.
- This endpoint is safe to poll; results change only when models are registered or removed (boot, runtime-rule add/delete).
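The downsamplings derivation described above amounts to a group-by over the registered models. A minimal sketch, assuming a hypothetical `ModelView` that stands in for a storage model and a local `Step` enum that stands in for the real DownSampling type:

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical index of metric name -> materialized downsamplings. */
final class DownsamplingIndex {
    /** Stand-in for the real DownSampling enum; only the steps the API accepts. */
    enum Step { MINUTE, HOUR, DAY }

    /** Assumed view of one storage model: metric name plus its downsampling. */
    static final class ModelView {
        final String metricName;
        final Step step;

        ModelView(String metricName, Step step) {
            this.metricName = metricName;
            this.step = step;
        }
    }

    /** Groups models by metric name; EnumSet keeps steps in declaration order. */
    static Map<String, Set<Step>> build(List<ModelView> models) {
        Map<String, Set<Step>> index = new TreeMap<>();
        for (ModelView m : models) {
            index.computeIfAbsent(m.metricName, k -> EnumSet.noneOf(Step.class)).add(m.step);
        }
        return index;
    }
}
```

Because the index is built from metadata only, rebuilding it on a runtime-rule add or delete costs no storage I/O.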
GET /inspect/entities
Lists entities that hold values for a metric within a time range, decoded into MQE-ready form.
Query parameters:
| Name | Required | Description |
|---|---|---|
| metric | yes | Metric name (must resolve in ValueColumnMetadata). |
| start | yes | Time-range start. Same format as MQE Duration.start: yyyy-MM-dd (DAY), yyyy-MM-dd HH (HOUR), yyyy-MM-dd HHmm (MINUTE), yyyy-MM-dd HHmmss (SECOND). |
| end | yes | Time-range end. Format mirrors start. |
| step | yes | One of MINUTE, HOUR, DAY. Must be one of the metric’s downsamplings. |
| limit | no | Server-side cap. Default 300, hard-capped at 300. |
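To make the start / end formats concrete, the sketch below turns a formatted timestamp into a numeric time bucket. It is not the Duration implementation: `TimeBuckets` is a hypothetical name, and the bucket layout (the timestamp's digits, e.g. yyyyMMddHHmm for MINUTE) is an assumption made for illustration.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

/** Hypothetical converter from the documented start/end formats to a time bucket. */
final class TimeBuckets {
    private static final DateTimeFormatter DAY_IN = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    // An hour-only value has no minute field, so default it to 0 explicitly.
    private static final DateTimeFormatter HOUR_IN = new DateTimeFormatterBuilder()
        .appendPattern("yyyy-MM-dd HH")
        .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
        .toFormatter();
    private static final DateTimeFormatter MINUTE_IN = DateTimeFormatter.ofPattern("yyyy-MM-dd HHmm");

    private TimeBuckets() {
    }

    /** step is one of "MINUTE", "HOUR", "DAY"; value must use the matching format. */
    static long toBucket(String value, String step) {
        switch (step) {
            case "DAY":
                return Long.parseLong(
                    LocalDate.parse(value, DAY_IN).format(DateTimeFormatter.ofPattern("yyyyMMdd")));
            case "HOUR":
                return Long.parseLong(
                    LocalDateTime.parse(value, HOUR_IN).format(DateTimeFormatter.ofPattern("yyyyMMddHH")));
            case "MINUTE":
                return Long.parseLong(
                    LocalDateTime.parse(value, MINUTE_IN).format(DateTimeFormatter.ofPattern("yyyyMMddHHmm")));
            default:
                throw new IllegalArgumentException("unsupported step: " + step);
        }
    }
}
```

A value whose format does not match the step (for example a MINUTE-formatted start with step=DAY) fails to parse in this sketch, which is one reasonable way to surface a 400.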
coldStage is intentionally not exposed. The inspect API targets the
operationally-active hot path; cold-stage scans bypass the same TTL window
the operator would be debugging within.
The limit is applied as LIMIT N at the storage layer — it bounds the
total rows scanned (N = 300 ≅ 10 buckets × 30 entities), not 300 entities.
Backends dedup on entity_id before returning.
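The gap between rows scanned and entities returned can be illustrated with an order-preserving dedup. The class name `EntityDedup` is hypothetical; the input stands for the entity_id column of the rows a backend returned, already ordered newest-first and capped by the row limit.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical helper: collapses scanned rows into distinct entity ids. */
final class EntityDedup {
    private EntityDedup() {
    }

    /** Keeps the first occurrence of each id, so the newest-first order survives. */
    static List<String> distinctInOrder(List<String> entityIdPerRow) {
        return new ArrayList<>(new LinkedHashSet<>(entityIdPerRow));
    }
}
```

With 30 entities each writing one row per minute over a 10-minute window, a 300-row scan collapses to 30 distinct ids, which is the arithmetic behind the 10 buckets × 30 entities figure.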
step is parsed via DownSampling.valueOf(step) after upper-casing and
must appear in the metric’s supportedDownsamplings (else 400). The
(start, end) pair is converted to time buckets through the same
Duration.getStartTimeBucket() / getEndTimeBucket() path MQE uses, so
range semantics match exactly.
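The step and limit checks can be sketched as two small validators. `InspectParams` and its local `Step` enum are stand-ins (the real code would use DownSampling), and the "step is required" and "unknown step" messages are invented for the sketch; the other two messages follow the error table in this proposal.

```java
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical request-parameter validation for /inspect/entities. */
final class InspectParams {
    /** Stand-in for the real DownSampling enum. */
    enum Step { MINUTE, HOUR, DAY }

    static final int MAX_LIMIT = 300;

    private InspectParams() {
    }

    /** Upper-cases, parses, then checks the step against the metric's downsamplings. */
    static Step parseStep(String raw, String metric, Set<Step> supported) {
        if (raw == null) {
            throw new IllegalArgumentException("step is required");   // invented message
        }
        Step step;
        try {
            step = Step.valueOf(raw.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown step: " + raw); // invented message
        }
        if (!supported.contains(step)) {
            String list = supported.stream().map(Step::name).collect(Collectors.joining(","));
            throw new IllegalArgumentException(
                "step " + step + " not supported by metric " + metric + " (" + list + ")");
        }
        return step;
    }

    /** A missing limit falls back to the cap; out-of-range values are rejected. */
    static int parseLimit(Integer raw) {
        if (raw == null) {
            return MAX_LIMIT;
        }
        if (raw < 1 || raw > MAX_LIMIT) {
            throw new IllegalArgumentException("limit must be between 1 and 300");
        }
        return raw;
    }
}
```

A handler would map each IllegalArgumentException to a 400 response with the message as the error body.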
Response (200 OK, application/json) — example: service_cpm over the
last 10 minutes, where service payment is registered under both MESH and
GENERAL layers:
{
"metric": "service_cpm",
"scope": "Service",
"step": "MINUTE",
"start": "2026-05-10 1230",
"end": "2026-05-10 1240",
"rows": [
{
"entityId": "cGF5bWVudA==.1",
"decoded": { "serviceName": "payment", "isReal": true },
"layer": "MESH",
"mqeEntity": {
"scope": "Service",
"serviceName": "payment",
"normal": true
}
},
{
"entityId": "cGF5bWVudA==.1",
"decoded": { "serviceName": "payment", "isReal": true },
"layer": "GENERAL",
"mqeEntity": {
"scope": "Service",
"serviceName": "payment",
"normal": true
}
},
{
"entityId": "Y2hlY2tvdXQ=.1",
"decoded": { "serviceName": "checkout", "isReal": true },
"layer": "GENERAL",
"mqeEntity": {
"scope": "Service",
"serviceName": "checkout",
"normal": true
}
},
{
"entityId": "bXlzcWw=.0",
"decoded": { "serviceName": "mysql", "isReal": false },
"layer": "VIRTUAL_DATABASE",
"mqeEntity": {
"scope": "Service",
"serviceName": "mysql",
"normal": false
}
}
]
}
An empty result is not an error — { "metric": …, "rows": [] } simply
means no entity wrote a value in the requested window.
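The relationship between entityId and decoded in the rows above can be reproduced with a few lines. This is a sketch of the layout as it appears in the examples (Base64 of the service name, a dot, then 1 for a real service or 0 otherwise); the actual decoding goes through IDManager, and `ServiceIdSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical decoder for the Service-scope ids shown in the example response. */
final class ServiceIdSketch {
    /** Mirrors the "decoded" block of a Service row. */
    static final class Decoded {
        final String serviceName;
        final boolean isReal;

        Decoded(String serviceName, boolean isReal) {
            this.serviceName = serviceName;
            this.isReal = isReal;
        }
    }

    private ServiceIdSketch() {
    }

    static Decoded decode(String entityId) {
        // Standard Base64 never contains '.', so the last dot separates name and flag.
        int dot = entityId.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("not a service id: " + entityId);
        }
        byte[] nameBytes = Base64.getDecoder().decode(entityId.substring(0, dot));
        String name = new String(nameBytes, StandardCharsets.UTF_8);
        boolean isReal = "1".equals(entityId.substring(dot + 1));
        return new Decoded(name, isReal);
    }
}
```

In the example rows, the isReal flag is what surfaces as normal in the mqeEntity block.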
Multi-layer rule: one logical entity emits one row per layer it is
registered under. payment appears twice above because
ServiceTraffic has two rows (payment/MESH and payment/GENERAL),
and the inspect API does not collapse them — the operator may want to
investigate one layer at a time, and Layer is meaningful for downstream
calls such as listInstances(serviceId). The mqeEntity block, however, is
identical across layer rows because MQE itself does not take a layer
parameter — including it lets the operator copy-paste without thinking.
ServiceInstance example (service_instance_cpm):
{
"rows": [
{
"entityId": "cGF5bWVudA==.1_cG9kLTAx",
"decoded": {
"serviceName": "payment",
"isReal": true,
"serviceInstanceName": "pod-01"
},
"layer": "GENERAL",
"mqeEntity": {
"scope": "ServiceInstance",
"serviceName": "payment",
"serviceInstanceName": "pod-01",
"normal": true
}
}
]
}
ServiceRelation example (service_relation_client_cpm):
{
"rows": [
{
"entityId": "Y2hlY2tvdXQ=.1-cGF5bWVudA==.1",
"decoded": {
"source": { "serviceName": "checkout", "isReal": true },
"destination": { "serviceName": "payment", "isReal": true }
},
"layer": "GENERAL",
"mqeEntity": {
"scope": "ServiceRelation",
"serviceName": "checkout",
"normal": true,
"destServiceName": "payment",
"destNormal": true
}
}
]
}
ServiceInstanceRelation example
(service_instance_relation_client_cpm, e.g., a consumer instance
calling a provider instance in the java-agent provider/consumer
showcase):
{
"rows": [
{
"entityId": "Y29uc3VtZXI=.1_cG9kLWE=-cHJvdmlkZXI=.1_cG9kLWI=",
"decoded": {
"source": {
"serviceName": "consumer",
"isReal": true,
"serviceInstanceName": "pod-a"
},
"destination": {
"serviceName": "provider",
"isReal": true,
"serviceInstanceName": "pod-b"
}
},
"layer": "GENERAL",
"mqeEntity": {
"scope": "ServiceInstanceRelation",
"serviceName": "consumer",
"normal": true,
"serviceInstanceName": "pod-a",
"destServiceName": "provider",
"destNormal": true,
"destServiceInstanceName": "pod-b"
}
}
]
}
EndpointRelation example (endpoint_relation_cpm, where consumer’s
/order calls provider’s /charge):
{
"rows": [
{
"entityId": "Y29uc3VtZXI=.1-L29yZGVy-cHJvdmlkZXI=.1-L2NoYXJnZQ==",
"decoded": {
"source": {
"serviceName": "consumer",
"isReal": true,
"endpointName": "/order"
},
"destination": {
"serviceName": "provider",
"isReal": true,
"endpointName": "/charge"
}
},
"layer": "GENERAL",
"mqeEntity": {
"scope": "EndpointRelation",
"serviceName": "consumer",
"normal": true,
"endpointName": "/order",
"destServiceName": "provider",
"destNormal": true,
"destEndpointName": "/charge"
}
}
]
}
Error responses:
| Status | Body | Cause |
|---|---|---|
| 400 | {"error":"unknown metric: foo"} | metric not in ValueColumnMetadata. |
| 400 | {"error":"step DAY not supported by metric foo (MINUTE,HOUR)"} | metric not materialized at requested downsampling. |
| 400 | {"error":"metric type HEATMAP is not MQE-queryable; /inspect/entities only accepts REGULAR_VALUE and LABELED_VALUE"} | metric type is HEATMAP (HISTOGRAM dataType). |
| 400 | {"error":"metric type SAMPLED_RECORD is out of scope for /inspect/entities"} | metric type is SAMPLED_RECORD. |
| 400 | {"error":"process scope is out of scope"} | scope is Process / ProcessRelation. |
| 400 | {"error":"limit must be between 1 and 300"} | limit out of range. |
Status API relocation
The status-query-plugin module currently registers its HTTP handlers
(AlarmStatusQueryHandler, ClusterStatusQueryHandler,
TTLConfigQueryHandler, DebuggingHTTPHandler for query trace) against
the public HTTPHandlerRegister exposed by the core / sharing-server
module — the same HTTP surface as GraphQL, on the public REST port
(core.restPort, default 12800, or the sharing-server port if
configured). skywalking-ui calls these endpoints on the public REST
port today, so the binding cannot move; it can only be supplemented.
This SWIP replaces status-query-plugin with a new feature module
status under server-admin/ that dual-binds:
- Always: registers every handler on the public REST HTTPHandlerRegister resolved through CoreModule, the same surface status-query-plugin uses today, with the same URIs and payloads. UI compatibility is preserved.
- Conditionally: if AdminServerModule is loaded (SW_ADMIN_SERVER=default), also registers every handler on the admin-server HTTPHandlerRegister. Operators that gateway-protect the admin port can drive status from the same private surface as /inspect/*, /dsl-debugging/*, /runtime/rule/*.
The module is gated by SW_STATUS=default and is enabled by default
(matching today’s status-query-plugin posture so existing UI deployments
keep working without an opt-in step). It does not require
admin-server: when admin-server is off, status still binds on the public
REST port — that is the entire point of the dual-bind design.
Routes hosted by the status feature module — every existing route
preserved verbatim from status-query-plugin. No new endpoint is added
by this SWIP; clients that need to discover the OAP REST URL parse the
existing GET /debugging/config/dump response, which already publishes
the effective configuration including REST host and port.
| URI (all GET) | Purpose |
|---|---|
| /status/cluster/nodes | OAP cluster peer list. |
| /status/alarm/rules | Registered alarm rules. |
| /status/alarm/{ruleId} | Per-rule runtime status. |
| /status/alarm/{ruleId}/{entityName} | Per-rule + per-entity runtime status. |
| /status/config/ttl | Effective TTL configuration. |
| /debugging/config/dump | Effective configuration dump (also serves as the REST-URL discovery primitive for inspect clients). |
| /debugging/query/mqe | Run an MQE expression with debug tracing. |
| /debugging/query/trace/queryBasicTraces | Trace query brief, debug-traced. |
| /debugging/query/trace/queryTrace | Trace query detail, debug-traced. |
| /debugging/query/zipkin/api/v2/traces | Zipkin compat brief, debug-traced. |
| /debugging/query/zipkin/api/v2/trace | Zipkin compat detail, debug-traced. |
| /debugging/query/topology/getGlobalTopology | |
| /debugging/query/topology/getServicesTopology | |
| /debugging/query/topology/getServiceInstanceTopology | |
| /debugging/query/topology/getEndpointDependencies | |
| /debugging/query/topology/getProcessTopology | |
| /debugging/query/log/queryLogs | |
Every route is published on both binding surfaces (public REST + admin when admin-server is enabled). URI paths, query parameters, and response payloads do not change — only the host:port the routes bind on.
Storage layer
A new abstract method on IMetricsQueryDAO:
List<String> listEntityIdsInRange(String metricName,
String valueColumnName,
Duration duration,
int limit) throws IOException;
Abstract on purpose, so that any third-party storage backend that implements
IMetricsQueryDAO MUST provide this override. A default body would
let a missing override slip through compilation and surface as a silent
“no entities” result or a 500 the first time the inspect API hits that backend.
Returns distinct entity_ids, ordered by most recent timestamp
descending, capped at limit. Each backend must enforce that ordering
explicitly — without an explicit sort directive, scoring / index-internal
order can drop a hot entity that ingested late before the cap is reached.
Implementations:
- BanyanDB (BanyanDBMetricsQueryDAO): MeasureQuery against the resolved measure (downsampling chosen via MetadataRegistry.formatName(metricName, downSampling)), timestamp range filter, tag projection limited to ENTITY_ID, value-column field projection, setOrderBy(new AbstractQuery.OrderBy(Sort.DESC)) for timestamp-DESC ordering, .limit(limit). Client-side LinkedHashSet dedup preserves the server-side ordering.
- Elasticsearch (MetricsQueryEsDAO): range filter on time_bucket, search.sort(Metrics.TIME_BUCKET, Sort.Order.DESC) for timestamp-DESC ordering, Search.size(limit). The downsampling-suffixed index name is resolved through the existing IndexController / MetricsModelInstaller path (the same name the existing readMetricsValues uses for the requested step). Client-side LinkedHashSet dedup on the returned hits.
- JDBC (JDBCMetricsQueryDAO): getTablesForRead(metricName, startTimeBucket, endTimeBucket) resolves the day-partitioned tables. Per table: SELECT entity_id, MAX(time_bucket) AS latest_time_bucket FROM <table> WHERE table_column = ? AND time_bucket BETWEEN ? AND ? GROUP BY entity_id ORDER BY latest_time_bucket DESC LIMIT ?. (PostgreSQL rejects ORDER BY time_bucket when it isn’t in a SELECT DISTINCT list, so the GROUP BY shape is the portable one across H2 / MySQL / PostgreSQL.) Per-table results are merged into a Map<entityId, maxTimeBucket> with Math::max, then sorted by the value DESC and capped at the global limit, so a multi-day range cannot fill the result with stale entities from older partitions before newer partitions are queried.
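The JDBC per-table merge described above can be sketched independently of any database. `PartitionMerge` is a hypothetical name; each input map stands for one table's GROUP BY result (entity_id to its latest time_bucket in that table). The sketch uses `Long::max` on boxed values, and the tie-break on entity id is an addition made only to keep the output deterministic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical merge of per-partition results into one newest-first, capped list. */
final class PartitionMerge {
    private PartitionMerge() {
    }

    static List<String> merge(List<Map<String, Long>> perTable, int limit) {
        // Keep, per entity, the newest time bucket seen in any partition.
        Map<String, Long> latest = new HashMap<>();
        for (Map<String, Long> table : perTable) {
            table.forEach((id, bucket) -> latest.merge(id, bucket, Long::max));
        }
        return latest.entrySet().stream()
            .sorted((a, b) -> {
                int byBucket = Long.compare(b.getValue(), a.getValue()); // newest first
                return byBucket != 0 ? byBucket : a.getKey().compareTo(b.getKey());
            })
            .limit(limit)                                                // global cap
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Applying the cap only after the global sort is what prevents an older partition's entities from crowding out newer ones.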
This shape mirrors the existing
readLabeledMetricsValuesWithoutEntity(metricName, valueColumnName, labels, duration) (limit hard-coded to 10, projects value as well) — we reuse the
range-scan and downsampling-resolution scaffolding from those
implementations.
Decode & enrich
Per row, the handler:
- Calls Scope.Finder.valueOf(scopeId) to resolve the scope from the metric’s metadata.
- Switches on the scope to call the correct IDManager.{ServiceID,ServiceInstanceID,EndpointID}.analysisId(...) / analysisRelationId(...).
- For Service-bearing rows, reads MetadataQueryService.getService(serviceId) from the in-memory cache and emits one row per layer in service.getLayers().
- Builds the mqeEntity block by inverting the decode: the same fields Entity.buildId() would consume to reconstruct the same entity_id.
If a service is missing from the metadata cache (e.g., it was just deleted
but a metric row still exists in the storage TTL window), the row is still
emitted with layer: null and the decoded names; the operator can still
call MQE.
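The one-row-per-layer expansion and the missing-from-cache fallback can be sketched together. `LayerRows` and `Row` are hypothetical names; a null layer list models a service absent from the metadata cache, and treating an empty list the same way is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical expansion of one decoded service into per-layer response rows. */
final class LayerRows {
    /** Reduced row: only the fields that differ between layer rows. */
    static final class Row {
        final String serviceName;
        final String layer; // null when the service is not in the metadata cache

        Row(String serviceName, String layer) {
            this.serviceName = serviceName;
            this.layer = layer;
        }
    }

    private LayerRows() {
    }

    static List<Row> expand(String serviceName, List<String> layers) {
        if (layers == null || layers.isEmpty()) {
            // Still emit the entity so the operator can call MQE with the decoded names.
            return Collections.singletonList(new Row(serviceName, null));
        }
        List<Row> rows = new ArrayList<>();
        for (String layer : layers) {
            rows.add(new Row(serviceName, layer));
        }
        return rows;
    }
}
```

For a service registered under MESH and GENERAL this yields two rows, matching the payment example in the response above.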
Module layout
Two new feature modules under the existing admin-server bundle:
oap-server/server-admin/inspect/
├── pom.xml
├── src/main/java/org/apache/skywalking/oap/server/admin/inspect/
│ ├── InspectModule.java # ModuleDefine
│ ├── InspectModuleProvider.java # ModuleProvider
│ ├── InspectModuleConfig.java
│ ├── handler/InspectRestHandler.java # Armeria handler
│ ├── decoder/EntityDecoder.java # scope → IDManager dispatch
│ └── response/{MetricRow, EntityRow, MqeEntity}.java
└── src/main/resources/META-INF/services/
└── org.apache.skywalking.oap.server.library.module.ModuleDefine
oap-server/server-admin/status/ ◄── relocated from
server-query-plugin/status-query-plugin
├── pom.xml
├── src/main/java/org/apache/skywalking/oap/server/admin/status/
│ ├── StatusModule.java
│ ├── StatusModuleProvider.java
│ ├── StatusModuleConfig.java
│ ├── handler/
│ │ ├── ClusterStatusHandler.java # moved
│ │ ├── AlarmStatusHandler.java # moved
│ │ ├── TTLConfigHandler.java # moved
│ │ └── DebuggingHTTPHandler.java # moved (query trace)
│ └── service/{AlarmStatusQueryService, …}.java # moved
└── src/main/resources/META-INF/services/
└── org.apache.skywalking.oap.server.library.module.ModuleDefine
requiredModules():
- inspect: CoreModule.NAME, StorageModule.NAME, AdminServerModule.NAME. Inspect is admin-only; it has no public-REST mirror and therefore does require admin-server.
- status: CoreModule.NAME only. AdminServerModule.NAME is consulted reflectively via moduleManager.has(...) for the conditional second handler registration. Declaring it as required would force every status-using deployment to also enable admin-server, which contradicts the “UI keeps working with admin off” goal.
Handler registration in status’s start():
// Always: public REST surface, where the UI calls today.
HTTPHandlerRegister publicRegister = manager.find(CoreModule.NAME)
    .provider().getService(HTTPHandlerRegister.class);
registerAll(publicRegister);

// Additionally, when admin-server is enabled.
if (manager.has(AdminServerModule.NAME)) {
    HTTPHandlerRegister adminRegister = manager.find(AdminServerModule.NAME)
        .provider().getService(HTTPHandlerRegister.class);
    registerAll(adminRegister);
}
Inspect’s start() registers only on the admin-server register
(mirroring dsl-debugging and runtime-rule). The old
status-query-plugin module is removed.
Maven module / pom changes
One module dropped, two added, three existing poms edited. Net: the status jar is only relocated under server-admin, and inspect is the one additional distribution jar.
| Pom | Edit |
|---|---|
| oap-server/server-admin/pom.xml | <modules> adds inspect and status (already lists admin-server, runtime-rule, dsl-debugging). |
| oap-server/server-query-plugin/pom.xml | <modules> drops status-query-plugin. |
| oap-server/server-starter/pom.xml | Removes <dependency>status-query-plugin</dependency>; adds <dependency>status</dependency> and <dependency>inspect</dependency> next to the existing admin-server / runtime-rule / dsl-debugging block. |
| oap-server/server-admin/inspect/pom.xml | New. parent=server-admin, packaging=jar, depends on server-core, admin-server, library-server. |
| oap-server/server-admin/status/pom.xml | New. Same parent / packaging; carries the deps the relocated handlers already used (notably library-server for the Armeria handler glue). |
| oap-server/server-query-plugin/status-query-plugin/ | Directory deleted along with the pom. Sources move under oap-server/server-admin/status/src/main/java/... (package rename org.apache.skywalking.oap.query.debug → org.apache.skywalking.oap.server.admin.status). |
Cost & safety
- /inspect/metrics is in-process metadata only: O(N) over registered metrics, no I/O.
- /inspect/entities issues exactly one storage call per request, bounded by LIMIT 300. The 300 ceiling is enforced server-side and cannot be raised via parameter. This is the same shape as today’s readLabeledMetricsValuesWithoutEntity (limit 10), just with a higher cap; no new query class is introduced.
- The endpoint is bound only on the admin port (default 17128), which is not the agent or query port; existing operator-network controls apply.
- No caching layer in v1. If poll-storms become a problem we can add a short-TTL in-process cache keyed on (metric,start,end,step,limit), but given that operators are humans and the cap is 300 rows, doing so now would be premature.
Compatibility
The dual-bind design keeps the Status API backwards compatible for the
UI and for any scripted consumer of http://oap:12800/status/....
| Change | Impact |
|---|---|
| status-query-plugin module is removed; replaced by the new status feature module under server-admin/. The selector renames from the QUERY-plugin form (SW_QUERY=…,status-query-plugin) to a top-level SW_STATUS=default (on by default). | Mild. Custom application.yml overrides that referenced status-query-plugin need to be repointed; default deployments do not need to change anything. URIs, payloads, and the public REST binding are preserved. |
| Status API docs move from docs/en/status/ → docs/en/setup/backend/admin-api/status.md (new consolidated page). The docs/en/status/ directory is removed and docs/menu.yml updated; docs/en/status/status_apis.md becomes a redirect stub for one minor cycle. | Doc move only. No URI or payload churn. |
| Additive: second handler binding on the admin port (when admin-server is enabled), /inspect/metrics, /inspect/entities, new IMetricsQueryDAO.listEntityIdsInRange method, new modules inspect and status. | No impact on existing deployments. UI continues to call status on the public REST port; admin-port consumers are opt-in via SW_ADMIN_SERVER=default. |
The admin-port binding for status is intentionally additional, not a replacement, so that admin-server may stay off (the default) without breaking the UI. A future SWIP can drop the public-REST binding once skywalking-ui has migrated to the admin-port path; until then, both surfaces co-exist.
Test plan
The inspect surface is e2e-validated against the existing java-agent
e2e (test/e2e-v2/cases/simple/...) — that suite already runs across
every supported storage backend (BanyanDB, Elasticsearch, JDBC), so
piggy-backing keeps the matrix free.
The agent showcase boots a provider and a consumer Java service; this
gives us live metrics across every scope inspect supports. The flow uses
curl + jq directly (no swctl wiring) so the test is portable across
the existing e2e runners:
1. Wait for the showcase to ingest at least one minute window.
2. (one-time) curl http://oap:17128/debugging/config/dump and parse the response to learn the OAP REST host:port the harness will fire MQE against. Existing endpoint; no new discovery primitive needed.
3. curl http://oap:17128/inspect/metrics?regex=service_cpm → assert service_cpm appears with scope=Service, type=REGULAR_VALUE, downsamplings includes MINUTE. Repeat for instance_jvm_memory_* (ServiceInstance scope) and instance_jvm_class_loaded_count (ServiceInstance, REGULAR_VALUE).
4. curl http://oap:17128/inspect/entities?metric=service_cpm&start=…&end=…&step=MINUTE → expect rows for both provider and consumer.
5. jq out one row’s mqeEntity block. POST MQE expression service_cpm against <discovered MQE URL> with the extracted mqeEntity → assert non-empty time-series result.
6. Repeat (4)–(5) with instance_jvm_memory_heap_used (ServiceInstance scope): confirm both provider and consumer instances surface, and MQE for the picked instance returns data.
7. Relation coverage: /inspect/entities?metric=service_relation_client_cpm → assert a consumer→provider row, then MQE service_relation_client_cpm against its mqeEntity returns data. Same pattern for service_instance_relation_client_cpm and endpoint_relation_cpm (when endpoint relations are present in the showcase).
8. Negative cases: HEATMAP metric (endpoint_response_time) → expect 400 from /inspect/entities; step=DAY against a metric registered only at MINUTE → expect 400 with the supported-downsamplings list.
Module-level unit tests live alongside each new source: EntityDecoder
for every supported scope (covering isReal=true and isReal=false
service IDs, multi-layer service merge, missing-from-cache fallback),
the /inspect/metrics filter (NOT_VALUE / Scope.All exclusion),
and the new DAO method per backend mocked at the storage interface.
Deliverables
| Item | Location |
|---|---|
| Module sources | oap-server/server-admin/inspect/, oap-server/server-admin/status/ |
| Storage DAO method | IMetricsQueryDAO.listEntityIdsInRange + 3 implementations |
| Operator doc | docs/en/setup/backend/admin-api/inspect.md (new), admin-api/status.md (new, consolidates the relocated docs/en/status/* pages) |
| Doc index updates | docs/en/setup/backend/admin-api/readme.md (add inspect + status sections), docs/menu.yml (drop docs/en/status/, add admin-api/{inspect,status}.md) |
| Redirect stub | docs/en/status/status_apis.md → one-cycle redirect pointing to admin-api/status.md |
| Changelog | docs/en/changes/changes-X.X.X.md — entries for: new /inspect/*, status-plugin → status-feature-module relocation + selector rename, new IMetricsQueryDAO.listEntityIdsInRange |
| E2E updates | test/e2e-v2/cases/simple/... — add an inspect verification step to the existing java-agent showcase case; runs unchanged across BanyanDB / ES / JDBC matrix |
| Unit tests | per-module under each new source root |
Implementation order
The status-relocation work and the inspect-API work share zero code paths and can be staged independently:
- Phase 1: status relocation. Create server-admin/status module, move sources, set up dual-bind, delete status-query-plugin. UI and existing operator scripts continue to work; URI paths and payloads are unchanged.
- Phase 2: inspect. Add IMetricsQueryDAO.listEntityIdsInRange across the three backends, then server-admin/inspect with handlers and decoder. Independent of Phase 1: inspect clients that need to discover the OAP REST URL parse /debugging/config/dump, an already-existing endpoint preserved by Phase 1.
Each phase is a self-contained PR.
Future work
- Inspect-Records (/inspect/records) for SAMPLED_RECORD metrics, backed by IRecordsQueryDAO.
- Inspect-Process for Process / ProcessRelation, contingent on a decoder for ProcessID (likely a ProcessTraffic join).
- Optional ?attribute=k=v filter pushdown to leverage the same attribute conditions TopNCondition accepts.
?attribute=k=vfilter pushdown to leverage the same attribute conditionsTopNConditionaccepts.