Context-Aware Panic Diagnostics Development Design
Table of Contents
- Overview
- Current Implementation Status
- Architecture
- Component Design
- Crash Artifacts
- Proxy API and Collection Semantics
- Deployment and Operations
- Testing Strategy
- Appendix
Overview
Context-aware panic diagnostics in BanyanDB and FODC provide two complementary capabilities:
- Recover selected panics with structured diagnostics and persisted artifacts.
- Surface those artifacts through the FODC agent and proxy for centralized inspection.
The implementation is no longer just a file-watching sidecar concept. The current codebase has an end-to-end flow:
- pkg/panicdiag captures panic context, breadcrumbs, stack traces, and optional state dumps.
- Recovered panics are written to an artifact directory on disk.
- The FODC agent collects crash records from both in-process panic reports and an optional watched artifact directory.
- The FODC proxy requests crash collections from connected agents over the crash diagnostics gRPC stream.
- The proxy caches deduplicated crash records and exposes them via GET /diagnostics.
This document describes the current behavior implemented in code and highlights the parts that remain operational guidance rather than shipped features.
Current Implementation Status
Implemented
- panicdiag.WithRecovery and panicdiag.GoWithRecovery recover panics, capture stack traces, and persist structured artifacts.
- panicdiag.WithBreadcrumb and panicdiag.BreadcrumbsFromContext preserve semantic execution markers.
- panicdiag.StateDumper supports bounded JSON state dumps.
- panicdiag.CrashOutputConfig.InstallGlobalCrashOutput configures the default artifact root and retention for structured panic artifacts.
- The FODC agent aggregates crash collections from:
  - An in-process panic reporter.
  - An optional filesystem watcher over the configured crash directory.
- The FODC proxy requests diagnostics from agents over StreamCrashDiagnostics.
- The proxy caches crash records keyed by agentID::artifactDir.
- GET /diagnostics returns the proxy cache and supports filtering by role and pod_name.
Not Implemented Here
- Automatic upload of crash artifacts from the proxy or agent to OAP.
- Alert fan-out such as Slack or webhook notifications.
- Proxy-side artifact completeness markers or remote retention coordination.
Those remain future integration possibilities, not current behavior.
Architecture
End-to-End Flow
Managed goroutine / instrumented code path
  │
  ├── Breadcrumbs added to context.Context
  │
  └── panicdiag.WithRecovery / GoWithRecovery
        │
        ├── Recover panic value
        ├── Capture goroutine stack
        ├── Read breadcrumbs from context
        ├── Optionally dump bounded state
        └── Write artifact directory on disk
              │
              ├── panic.json
              └── deep-dump.json (optional)

FODC agent
  │
  ├── In-process panic store
  ├── Optional filesystem watcher over crash directory
  └── MultiCollectionProvider deduplicates by artifactDir
        │
        └── Crash diagnostics gRPC stream to proxy
              │
              ├── Proxy sends RequestDiagnostics=true
              ├── Agent streams one message per collection
              └── Proxy caches records by agentID::artifactDir
                    │
                    └── HTTP GET /diagnostics
Why the Layers Are Separate
- panicdiag is responsible for local panic capture and artifact persistence.
- The FODC agent is responsible for discovering crash collections from local sources.
- The FODC proxy is responsible for requesting, caching, filtering, and serving fleet-wide diagnostics.
- Global crash output remains the last-resort path for fatal runtime crashes that bypass recovery in processes that enable it, such as the FODC agent.
Component Design
1. Recovery Runtime
Purpose: Recover managed panics and persist structured diagnostics.
Core Types
RecoveryOptions
type RecoveryOptions struct {
    Counter         meter.Counter
    Logger          *logger.Logger
    StateDumper     StateDumper
    ProcessMetadata map[string]string
    Component       string
    ArtifactRoot    string
    StateLimitBytes int64
}
StateDumper
type StateDumper interface {
    DumpState(context.Context) (any, error)
}
PanicRecord
type PanicRecord struct {
    ProcessMetadata map[string]string `json:"processMetadata,omitempty"`
    StateDump       *StateDumpStatus  `json:"stateDump,omitempty"`
    Component       string            `json:"component"`
    PanicValue      string            `json:"panicValue"`
    GoroutineStack  string            `json:"goroutineStack"`
    OccurredAt      time.Time         `json:"occurredAt"`
    Breadcrumbs     []Breadcrumb      `json:"breadcrumbs,omitempty"`
    Recovered       bool              `json:"recovered"`
}
Current Behavior
- WithRecovery recovers panics with defer and recover.
- The active context.Context is passed by pointer so breadcrumbs appended during execution are visible to recovery.
- If configured, a panic counter is incremented with a component label.
- If an artifact root is available, panic.json is written immediately.
- If a StateDumper is configured, the runtime also writes deep-dump.json.
- The in-process panic report includes breadcrumbs and state-dump status. Filesystem-backed collections derive their structured record from panic.json and list dump files separately.
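The pointer-to-context detail is easy to miss, so here is a minimal sketch of the idea. The names withRecovery, report, and stageKey are hypothetical stand-ins, not the panicdiag API; the sketch only assumes that recovery runs in a deferred function and reads the context through the pointer at panic time.

```go
package main

import (
	"context"
	"fmt"
	"runtime/debug"
)

// stageKey is a hypothetical context key used only by this sketch.
type stageKey struct{}

// report is a reduced stand-in for the documented PanicRecord.
type report struct {
	PanicValue string
	Stack      string
	Stage      string // last marker visible through the context pointer
}

// withRecovery is a hypothetical stand-in for panicdiag.WithRecovery.
// It takes *context.Context so that a context derived inside fn is
// still visible to the deferred recovery handler.
func withRecovery(ctx *context.Context, fn func()) (rep *report) {
	defer func() {
		if r := recover(); r != nil {
			stage, _ := (*ctx).Value(stageKey{}).(string)
			rep = &report{
				PanicValue: fmt.Sprint(r),
				Stack:      string(debug.Stack()),
				Stage:      stage,
			}
		}
	}()
	fn()
	return nil
}

// runExample replaces the context inside the protected function and
// then panics, returning whatever the recovery handler captured.
func runExample() *report {
	ctx := context.WithValue(context.Background(), stageKey{}, "start")
	return withRecovery(&ctx, func() {
		ctx = context.WithValue(ctx, stageKey{}, "decode-request")
		panic("boom")
	})
}

func main() {
	rep := runExample()
	fmt.Println(rep.PanicValue, rep.Stage)
}
```

Had withRecovery taken the context by value, the handler would only see the "start" marker, because the derived context is created after the call begins.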
2. Global Panic Diagnostics
Purpose: Configure structured panic artifacts for failures that pass through WithRecovery.
Current Behavior
- CrashOutputConfig.InstallGlobalCrashOutput does not create runtime-crash-<pid>.txt files.
- Installing panic diagnostics sets:
  - the default artifact root used by panicdiag
  - the default maximum number of retained artifact directories
- --max-diagnosis-memory-usage-percentage can reserve memory headroom for post-panic diagnostics.
In the current FODC code, this crash-output path is enabled by the FODC agent, not the FODC proxy.
This is the safety net for fatal runtime failures and escaped panics.
3. Diagnostic Breadcrumbs
Purpose: Preserve the semantic path that led to a panic.
Core Type
type Breadcrumb struct {
    Fields    map[string]string `json:"fields,omitempty"`
    Time      time.Time         `json:"time"`
    Stage     string            `json:"stage"`
    Component string            `json:"component,omitempty"`
}
Current Behavior
- WithBreadcrumb(ctx, stage, component, fields) returns a derived context with one additional immutable marker.
- BreadcrumbsFromContext(ctx) returns markers ordered oldest-to-newest.
- The implementation clones field maps so callers cannot mutate already-recorded breadcrumb data.
- Breadcrumbs are already used in several query and service paths, including FODC lifecycle collection and BanyanDB stream and measure query handling.
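One way to get the immutability and ordering described above is to store a copied slice in the context on every append. The sketch below is an assumption about how such helpers could look, not the panicdiag source; crumbsKey, withBreadcrumb, and breadcrumbsFromContext are local illustrative names that only mirror the documented call shape.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Breadcrumb is a local copy of the documented type, without JSON tags.
type Breadcrumb struct {
	Fields    map[string]string
	Time      time.Time
	Stage     string
	Component string
}

// crumbsKey is a hypothetical context key for this sketch.
type crumbsKey struct{}

func cloneFields(in map[string]string) map[string]string {
	if in == nil {
		return nil
	}
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// withBreadcrumb returns a derived context holding a new slice, so
// contexts created earlier never observe later markers.
func withBreadcrumb(ctx context.Context, stage, component string, fields map[string]string) context.Context {
	prev, _ := ctx.Value(crumbsKey{}).([]Breadcrumb)
	next := make([]Breadcrumb, len(prev), len(prev)+1)
	copy(next, prev)
	next = append(next, Breadcrumb{
		Fields:    cloneFields(fields), // caller mutations no longer leak in
		Time:      time.Now(),
		Stage:     stage,
		Component: component,
	})
	return context.WithValue(ctx, crumbsKey{}, next)
}

// breadcrumbsFromContext returns a copy ordered oldest-to-newest.
func breadcrumbsFromContext(ctx context.Context) []Breadcrumb {
	prev, _ := ctx.Value(crumbsKey{}).([]Breadcrumb)
	out := make([]Breadcrumb, len(prev))
	for i, b := range prev {
		b.Fields = cloneFields(b.Fields)
		out[i] = b
	}
	return out
}

// exampleStages records two markers and mutates the caller's map in between.
func exampleStages() (stages []string, group string) {
	fields := map[string]string{"group": "default"}
	ctx := withBreadcrumb(context.Background(), "parse", "measure-query", fields)
	fields["group"] = "mutated"
	ctx = withBreadcrumb(ctx, "execute", "measure-query", nil)
	crumbs := breadcrumbsFromContext(ctx)
	for _, b := range crumbs {
		stages = append(stages, b.Stage)
	}
	return stages, crumbs[0].Fields["group"]
}

func main() {
	stages, group := exampleStages()
	fmt.Println(stages, group)
}
```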
4. Deep State Serialization
Purpose: Persist bounded state that helps explain the crash input and runtime state.
Current Behavior
- StateDumper.DumpState is called only after a panic has been recovered.
- The runtime writes a bounded JSON dump.
- Serialization status is recorded in PanicRecord.StateDump.
- Serialization failure does not prevent panic recovery or artifact creation.
StateDump Status
type StateDumpStatus struct {
    Path      string `json:"path,omitempty"`
    Error     string `json:"error,omitempty"`
    Truncated bool   `json:"truncated,omitempty"`
}
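To illustrate how the status fields could be populated, the following sketch serializes a dump, enforces a byte limit, and converts every failure into a status value instead of an error. The helper dumpBounded and the byte-cut truncation strategy are assumptions for illustration; the document does not specify how the real runtime bounds the dump.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
)

// StateDumper and StateDumpStatus are local copies of the documented types.
type StateDumper interface {
	DumpState(context.Context) (any, error)
}

type StateDumpStatus struct {
	Path      string `json:"path,omitempty"`
	Error     string `json:"error,omitempty"`
	Truncated bool   `json:"truncated,omitempty"`
}

// dumpBounded is a hypothetical helper. It never returns an error:
// failures are reported through the status so that writing panic.json
// can proceed regardless.
func dumpBounded(ctx context.Context, d StateDumper, limit int64) ([]byte, StateDumpStatus) {
	state, err := d.DumpState(ctx)
	if err != nil {
		return nil, StateDumpStatus{Error: err.Error()}
	}
	data, err := json.Marshal(state)
	if err != nil {
		return nil, StateDumpStatus{Error: err.Error()}
	}
	status := StateDumpStatus{Path: "deep-dump.json"}
	if limit > 0 && int64(len(data)) > limit {
		// Assumed strategy: cut at the byte limit and flag the result.
		data = data[:limit]
		status.Truncated = true
	}
	return data, status
}

// dumperFunc adapts a plain function to the StateDumper interface.
type dumperFunc func(context.Context) (any, error)

func (f dumperFunc) DumpState(ctx context.Context) (any, error) { return f(ctx) }

func exampleTruncated() (int, bool) {
	d := dumperFunc(func(context.Context) (any, error) {
		return map[string]any{"segment": "2026-04-20", "rows": 123456}, nil
	})
	data, status := dumpBounded(context.Background(), d, 16)
	return len(data), status.Truncated
}

func exampleFailure() string {
	d := dumperFunc(func(context.Context) (any, error) {
		return nil, errors.New("state unavailable")
	})
	_, status := dumpBounded(context.Background(), d, 16)
	return status.Error
}

func main() {
	n, truncated := exampleTruncated()
	fmt.Println(n, truncated, exampleFailure())
}
```

A byte cut leaves the dump as invalid JSON, which is why the Truncated flag matters to any reader of the file under this assumption.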
5. Artifact Writer
Purpose: Write crash diagnostics into a stable per-panic directory.
Current Layout
<panic-diagnostics-dir>/
  <timestamp>-<component>-<pid>/
    panic.json
    deep-dump.json (optional)
Naming and Retention
- Artifact directories are named as <UTC timestamp>-<sanitized component>-<pid>.
- Components are sanitized for safe filesystem names.
- Retention pruning removes the oldest artifact directories first.
- Only directories containing panic.json are treated as crash artifacts for pruning.
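The naming and retention rules can be sketched as two small pure functions. The timestamp layout below is inferred from the artifact_dir value shown later in this document; the sanitization rule, the function names, and the treatment of a non-positive limit as "no pruning" are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// dirTimeLayout is inferred from the example directory name
// 20260420T095930.000000000Z-watchdog-1234 in this document.
const dirTimeLayout = "20060102T150405.000000000Z"

// sanitizeComponent applies an assumed rule: keep ASCII letters,
// digits, '-' and '_', and replace everything else with '_'.
func sanitizeComponent(c string) string {
	var b strings.Builder
	for _, r := range c {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '-', r == '_':
			b.WriteRune(r)
		default:
			b.WriteByte('_')
		}
	}
	return b.String()
}

func artifactDirName(t time.Time, component string, pid int) string {
	return fmt.Sprintf("%s-%s-%d", t.UTC().Format(dirTimeLayout), sanitizeComponent(component), pid)
}

// dirsToPrune returns the oldest names beyond maxArtifacts. The input is
// expected to hold only directories that already contain panic.json.
// Lexical order equals chronological order because the timestamp prefix
// is fixed-width UTC.
func dirsToPrune(names []string, maxArtifacts int) []string {
	if maxArtifacts <= 0 || len(names) <= maxArtifacts {
		return nil
	}
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	return sorted[:len(sorted)-maxArtifacts]
}

func exampleName() string {
	return artifactDirName(time.Date(2026, 4, 20, 9, 59, 30, 0, time.UTC), "watchdog", 1234)
}

func main() {
	fmt.Println(exampleName())
	fmt.Println(sanitizeComponent("measure/query worker"))
}
```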
6. FODC Agent Collection
Purpose: Merge locally available crash collections into one streamable view.
Sources
- InProcessPanicStore
  - receives reports directly from panicdiag through the default reporter
- DirectoryWatcher
  - scans a crash directory using filesystem notifications plus periodic rescans
  - waits until panic.json exists before storing a filesystem-backed collection
Merging
MultiCollectionProvider deduplicates by Collection.ArtifactDir so the same artifact is not reported twice when it is seen by both the in-process store and the directory watcher.
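The merge step amounts to a first-seen-wins union keyed by artifact directory. The sketch below assumes a simple provider interface and that earlier providers take precedence; neither detail is confirmed by this document, and the Collection type here is reduced to the fields the example needs.

```go
package main

import "fmt"

// Collection is a reduced stand-in for panicdiag.Collection.
type Collection struct {
	ArtifactDir string
	Source      string // illustrative only: which provider reported it
}

// provider is an assumed interface for a crash collection source.
type provider interface {
	ListCollections() []Collection
}

type staticProvider []Collection

func (s staticProvider) ListCollections() []Collection { return s }

// mergeCollections keeps the first collection seen for each ArtifactDir,
// preserving provider order.
func mergeCollections(providers ...provider) []Collection {
	seen := make(map[string]struct{})
	var out []Collection
	for _, p := range providers {
		for _, c := range p.ListCollections() {
			if _, dup := seen[c.ArtifactDir]; dup {
				continue
			}
			seen[c.ArtifactDir] = struct{}{}
			out = append(out, c)
		}
	}
	return out
}

func exampleMerge() []Collection {
	inProcess := staticProvider{{ArtifactDir: "dir-a", Source: "in-process"}}
	watcher := staticProvider{
		{ArtifactDir: "dir-a", Source: "watcher"},
		{ArtifactDir: "dir-b", Source: "watcher"},
	}
	return mergeCollections(inProcess, watcher)
}

func main() {
	for _, c := range exampleMerge() {
		fmt.Println(c.ArtifactDir, c.Source)
	}
}
```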
Collection Record
type CollectionRecord struct {
    FetchedAt      time.Time            `json:"fetchedAt"`
    SourceEndpoint string               `json:"sourceEndpoint"`
    Collection     panicdiag.Collection `json:"collection"`
}
7. FODC Proxy Aggregation
Purpose: Collect diagnostics from connected agents on demand and expose a filtered fleet view.
Current Behavior
- The proxy creates the diagnostics aggregator before the gRPC service exists.
- SetGRPCService wires the gRPC sender after service construction.
- CollectDiagnostics reads grpcService under lock.
- If grpcService is still nil during startup, the proxy logs a warning and returns the cached snapshot instead of failing.
- On each GET /diagnostics request, the proxy:
  - filters agents by role and optional pod_name
  - sends RequestDiagnostics(agentID) to each matching agent
  - waits a short fixed collection window
  - returns the current cached snapshot
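The late-wiring behavior can be sketched as follows. Only the names SetGRPCService, CollectDiagnostics, grpcService, and RequestDiagnostics come from this document; the interface, the signatures, and the string-slice cache are assumptions, and agent filtering by role and pod_name is left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// diagnosticsSender is an assumed interface for the gRPC side.
type diagnosticsSender interface {
	RequestDiagnostics(agentID string) error
}

type aggregator struct {
	mu          sync.RWMutex
	grpcService diagnosticsSender
	cached      []string      // stand-in for cached crash records
	window      time.Duration // short fixed collection window
}

func (a *aggregator) SetGRPCService(s diagnosticsSender) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.grpcService = s
}

func (a *aggregator) snapshot() []string {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return append([]string(nil), a.cached...)
}

// CollectDiagnostics reads the sender under lock. When it has not been
// wired yet, it falls back to the cached snapshot instead of failing.
func (a *aggregator) CollectDiagnostics(agentIDs []string) []string {
	a.mu.RLock()
	svc := a.grpcService
	a.mu.RUnlock()
	if svc == nil {
		return a.snapshot() // the real proxy also logs a warning here
	}
	for _, id := range agentIDs {
		_ = svc.RequestDiagnostics(id) // request failures are non-fatal
	}
	time.Sleep(a.window) // agents stream records into the cache meanwhile
	return a.snapshot()
}

type countingSender struct{ calls int }

func (c *countingSender) RequestDiagnostics(string) error {
	c.calls++
	return nil
}

func exampleCollect() (beforeWiring, afterWiring, calls int) {
	agg := &aggregator{cached: []string{"agent-1::dir-a"}, window: 10 * time.Millisecond}
	beforeWiring = len(agg.CollectDiagnostics([]string{"agent-1"}))
	s := &countingSender{}
	agg.SetGRPCService(s)
	afterWiring = len(agg.CollectDiagnostics([]string{"agent-1", "agent-2"}))
	return beforeWiring, afterWiring, s.calls
}

func main() {
	fmt.Println(exampleCollect())
}
```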
Cache Semantics
- Each incoming StreamCrashDiagnosticsRequest represents one artifact.
- The proxy cache key is agentID::artifactDir.
- Repeated sends update the cached record in place.
- Removing an agent removes all cached records for that agent.
- Request failures are non-fatal and are logged as agent capability or stream-availability issues.
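These cache rules map naturally onto a mutex-guarded map. The sketch below is illustrative: the type and method names are hypothetical, and the record is reduced to three fields.

```go
package main

import (
	"fmt"
	"sync"
)

// crashRecord is a reduced stand-in for a cached proxy record.
type crashRecord struct {
	AgentID     string
	ArtifactDir string
	PanicValue  string
}

type diagnosticsCache struct {
	mu      sync.Mutex
	records map[string]crashRecord
}

func newDiagnosticsCache() *diagnosticsCache {
	return &diagnosticsCache{records: make(map[string]crashRecord)}
}

// cacheKey builds the documented agentID::artifactDir key.
func cacheKey(agentID, artifactDir string) string {
	return agentID + "::" + artifactDir
}

// upsert stores one artifact; a repeated send overwrites the old entry.
func (c *diagnosticsCache) upsert(r crashRecord) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[cacheKey(r.AgentID, r.ArtifactDir)] = r
}

// removeAgent drops every record that belongs to the given agent.
func (c *diagnosticsCache) removeAgent(agentID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, r := range c.records {
		if r.AgentID == agentID {
			delete(c.records, k)
		}
	}
}

func (c *diagnosticsCache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.records)
}

func (c *diagnosticsCache) get(agentID, artifactDir string) (crashRecord, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	r, ok := c.records[cacheKey(agentID, artifactDir)]
	return r, ok
}

func exampleCache() (afterUpserts, afterRemove int, value string) {
	c := newDiagnosticsCache()
	c.upsert(crashRecord{AgentID: "agent-1", ArtifactDir: "dir-a", PanicValue: "first"})
	c.upsert(crashRecord{AgentID: "agent-1", ArtifactDir: "dir-a", PanicValue: "second"})
	c.upsert(crashRecord{AgentID: "agent-2", ArtifactDir: "dir-a", PanicValue: "other"})
	r, _ := c.get("agent-1", "dir-a")
	afterUpserts = c.size()
	c.removeAgent("agent-1")
	return afterUpserts, c.size(), r.PanicValue
}

func main() {
	fmt.Println(exampleCache())
}
```

Matching on the stored AgentID rather than on a key prefix avoids accidental removals if an agent ID ever contains the separator.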
Crash Artifacts
Persisted panic.json Example
BanyanDB panic recovered
OccurredAt: 2026-04-01T10:11:12Z
Component: measure-query-worker
Recovered: true
Panic: runtime error: index out of range [7] with length 4
Stack:
goroutine 9123 [running]:
...
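The listing above shows the record fields in a readable key-value form. As a rough illustration of how the same values would look when encoded through the JSON tags declared on PanicRecord, the sketch below marshals a reduced local copy of that struct. It omits the optional fields, and the exact bytes written by the real artifact writer may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// PanicRecord is a reduced local copy of the documented struct,
// keeping only the non-optional fields and their JSON tags.
type PanicRecord struct {
	Component      string    `json:"component"`
	PanicValue     string    `json:"panicValue"`
	GoroutineStack string    `json:"goroutineStack"`
	OccurredAt     time.Time `json:"occurredAt"`
	Recovered      bool      `json:"recovered"`
}

// exampleJSON encodes the values from the example above.
func exampleJSON() (string, error) {
	rec := PanicRecord{
		Component:      "measure-query-worker",
		PanicValue:     "runtime error: index out of range [7] with length 4",
		GoroutineStack: "goroutine 9123 [running]:\n...",
		OccurredAt:     time.Date(2026, 4, 1, 10, 11, 12, 0, time.UTC),
		Recovered:      true,
	}
	out, err := json.MarshalIndent(rec, "", "  ")
	return string(out), err
}

// roundTrip decodes the encoded example back into a record.
func roundTrip() (PanicRecord, error) {
	var rec PanicRecord
	s, err := exampleJSON()
	if err != nil {
		return rec, err
	}
	err = json.Unmarshal([]byte(s), &rec)
	return rec, err
}

func main() {
	s, err := exampleJSON()
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s)
}
```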
Files in a Collection
- panic.json is the required structured summary written by the artifact writer. Filesystem collection parses its core fields into panic_record.
- deep-dump.json is optional.
Completeness Rules
- panic.json is required for a directory to be recognized as a crash collection by panicdiag.ListCollections.
- The agent-side DirectoryWatcher treats panic.json as the required file for a complete artifact.
- Directories without panic.json are ignored until a later scan observes the summary file.
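A scan that applies these completeness rules could look like the sketch below. The function names are hypothetical, and the real watcher combines filesystem notifications with periodic rescans, which this one-shot scan does not model.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isCompleteArtifact reports whether dir contains a regular panic.json.
func isCompleteArtifact(dir string) bool {
	info, err := os.Stat(filepath.Join(dir, "panic.json"))
	return err == nil && info.Mode().IsRegular()
}

// completeArtifacts lists the subdirectories of root that are complete.
// Incomplete directories are skipped so a later scan can pick them up.
func completeArtifacts(root string) ([]string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, e := range entries {
		if e.IsDir() && isCompleteArtifact(filepath.Join(root, e.Name())) {
			out = append(out, e.Name())
		}
	}
	return out, nil
}

// exampleScan builds a temporary crash directory with one complete and
// one incomplete artifact and returns the scan result.
func exampleScan() ([]string, error) {
	root, err := os.MkdirTemp("", "crash-*")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	files := map[string]string{
		"a-complete":   "panic.json",
		"b-incomplete": "deep-dump.json",
	}
	for dir, file := range files {
		if err := os.Mkdir(filepath.Join(root, dir), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(filepath.Join(root, dir, file), []byte("{}"), 0o644); err != nil {
			return nil, err
		}
	}
	return completeArtifacts(root)
}

func main() {
	dirs, err := exampleScan()
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Println(dirs)
}
```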
Proxy API and Collection Semantics
HTTP Endpoint
The proxy exposes:
GET /diagnostics
Supported query parameters:
- role
- pod_name
The endpoint returns an array of aggregated records.
Returned Record Shape
{
  "fetched_at": "2026-04-20T10:00:00Z",
  "panic_record": {
    "occurred_at": "2026-04-20T09:59:30Z",
    "component": "watchdog",
    "goroutine_stack": "goroutine 100 [running]:\n...",
    "panic_value": "boom",
    "recovered": true
  },
  "agent_id": "agent-1",
  "pod_name": "banyand-datanode-0",
  "role": "datanode",
  "source_endpoint": "file:///crash",
  "artifact_dir": "20260420T095930.000000000Z-watchdog-1234",
  "files": [
    "panic.json",
    "deep-dump.json"
  ]
}
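For callers of the endpoint, a client-side sketch may help. It builds the query URL from the two supported parameters and decodes an array of records shaped like the example above. The struct definitions are written from that example rather than taken from the proxy source, and the base address proxy.example:8080 is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"time"
)

// Types below mirror the example record shape; they are assumptions.
type panicRecord struct {
	OccurredAt     time.Time `json:"occurred_at"`
	Component      string    `json:"component"`
	GoroutineStack string    `json:"goroutine_stack"`
	PanicValue     string    `json:"panic_value"`
	Recovered      bool      `json:"recovered"`
}

type diagnosticsRecord struct {
	FetchedAt      time.Time   `json:"fetched_at"`
	PanicRecord    panicRecord `json:"panic_record"`
	AgentID        string      `json:"agent_id"`
	PodName        string      `json:"pod_name"`
	Role           string      `json:"role"`
	SourceEndpoint string      `json:"source_endpoint"`
	ArtifactDir    string      `json:"artifact_dir"`
	Files          []string    `json:"files"`
}

// diagnosticsURL adds only the filters that are set.
func diagnosticsURL(base, role, podName string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/diagnostics"
	q := url.Values{}
	if role != "" {
		q.Set("role", role)
	}
	if podName != "" {
		q.Set("pod_name", podName)
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// decodeRecords parses the array returned by the endpoint.
func decodeRecords(body []byte) ([]diagnosticsRecord, error) {
	var out []diagnosticsRecord
	err := json.Unmarshal(body, &out)
	return out, err
}

const sampleBody = `[{
  "fetched_at": "2026-04-20T10:00:00Z",
  "panic_record": {"occurred_at": "2026-04-20T09:59:30Z", "component": "watchdog",
    "goroutine_stack": "goroutine 100 [running]:\n...", "panic_value": "boom", "recovered": true},
  "agent_id": "agent-1", "pod_name": "banyand-datanode-0", "role": "datanode",
  "source_endpoint": "file:///crash",
  "artifact_dir": "20260420T095930.000000000Z-watchdog-1234",
  "files": ["panic.json", "deep-dump.json"]
}]`

func exampleURL() string {
	u, _ := diagnosticsURL("http://proxy.example:8080", "datanode", "banyand-datanode-0")
	return u
}

func exampleDecode() (role, panicValue string, files int, err error) {
	recs, err := decodeRecords([]byte(sampleBody))
	if err != nil || len(recs) == 0 {
		return "", "", 0, err
	}
	return recs[0].Role, recs[0].PanicRecord.PanicValue, len(recs[0].Files), nil
}

func main() {
	fmt.Println(exampleURL())
	fmt.Println(exampleDecode())
}
```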
Request/Response Model
- The proxy does not continuously mirror all agent crash collections.
- Diagnostics are refreshed on demand when the HTTP endpoint is called.
- Agents respond by listing their currently known collections and streaming them one-by-one to the proxy.
- The proxy then serves a cached snapshot after waiting for a short response window.
This model avoids requiring a batch-end marker in the current proto while still giving the HTTP caller a coherent fleet snapshot.
Deployment and Operations
Shared Directory Model
The local crash artifact directory is still important, but it now primarily feeds the FODC agent rather than a separate sidecar uploader.
+---------------------------+        +---------------------------+
| BanyanDB / FODC process   |        | FODC agent                |
|                           |        |                           |
| panicdiag writes artifacts| -----> | In-process store          |
|                           | local  | Directory watcher         |
|                           |  dir   | gRPC crash stream client  |
+---------------------------+        +-------------+-------------+
                                                   |
                                                   v
                                       +-----------------------+
                                       | FODC proxy            |
                                       | diagnostics cache     |
                                       | GET /diagnostics      |
                                       +-----------------------+
Relevant Flags
From the current implementation:
- --panic-diagnostics-enabled
- --panic-diagnostics-dir
- --panic-diagnostics-max-artifacts
- --max-diagnosis-memory-usage-percentage
These flags are part of the FODC agent’s CLI surface. Agent crash collection also depends on the configured watched crash source directory when filesystem-backed collection is enabled.
Operational Notes
- The artifact directory should be writable by the process generating artifacts.
- Retention is directory-count based, not total-byte based.
- GOMEMLIMIT headroom helps the process finish diagnostic work under memory pressure.
- Incomplete artifacts may appear transiently while files are still being written.
- During proxy startup, /diagnostics can return a cached snapshot even before the gRPC service has been wired into the aggregator.
Testing Strategy
Unit Tests
- WithRecovery captures panic value, stack trace, breadcrumbs, and optional state dump status.
- Breadcrumb helpers preserve ordering and clone field maps.
- Artifact writing creates panic.json.
- Panic diagnostics installation does not create runtime crash text files.
- Directory watching detects artifact directories once panic.json is present.
- Proxy aggregation returns a cached snapshot when the gRPC service is unset.
Integration Tests
- State dump files are surfaced in collection Files.
- Breadcrumbs written during recovered panics remain available through the in-process panic report.
- Incomplete artifact directories are ignored until panic.json appears.
- Proxy diagnostics requests collect records from connected agents and expose them through GET /diagnostics.
Appendix
Layer Summary
| Layer | Name | Current Outcome |
|---|---|---|
| 0 | Recovery Runtime | Recover managed panics and persist panic.json |
| 1 | Global Panic Diagnostics | Configure artifact root, retention, and memory headroom |
| 2 | Breadcrumbs | Preserve semantic execution history in the in-process panic report |
| 3 | State Dump | Persist bounded JSON state snapshots |
| 4 | Agent Collection | Merge in-process and filesystem-backed crash collections |
| 5 | Proxy Aggregation | Request, cache, filter, and serve fleet-wide diagnostics |
Expected Operator Value
- Faster root-cause analysis because crash records include stack traces, breadcrumbs, and optional state dumps.
- Less filesystem noise because runtime crash text files are not created for normal recovered panic diagnostics.
- Fleet-level visibility because the proxy can serve crash diagnostics aggregated from connected agents through a single endpoint.