Release Apache SkyWalking BanyanDB 0.10.3

SkyWalking BanyanDB 0.10.3 is released. Go to downloads page to find release tars.

Bug Fixes

  • Persist segment end time in per-segment metadata so boundaries don’t shift across restarts or config changes.
  • Fix flaky on-disk integration tests caused by Ginkgo v2 random container shuffling closing gRPC connections prematurely.
  • ui: fix query editor refresh/reset behavior and BydbQL keyword highlighting.
  • Fix flaky file_snapshot subtest in measure/stream/trace by waiting until every introduced mem part has been flushed to disk, instead of only checking the latest snapshot creator.
  • Fix flaky TestCollectWithPartialClosedSegments by raising SegmentIdleTimeout so wall-clock variance on slow CI does not mark still-open segments as idle.
  • Fix FODC lifecycle cache poisoning where transient InspectAll failures were cached for 10 minutes and masked liaison recovery; raise FODC agent and proxy timeouts from 10s to 40s.
  • Fix FODC /cluster/lifecycle dropping zero-valued group fields (e.g. replicas=0, close=false) under encoding/json + omitempty; switch to protojson so all fields are emitted (nil nested messages serialize as null).
  • Fix trace block_writer panic on out-of-order timestamps within the same traceID, which dropped one trace-write batch per panic in multi-agent SkyWalking deployments. Spans of a single trace originate from independently-clocked services, and trace storage is organized by traceID rather than timestamp, so per-traceID timestamp monotonicity is not a writer invariant.
  • Fix nil-pointer panic on cold-tier data nodes when FODC InspectAll raced with idle-segment cleanup.
  • Add GroupLifecycleInfo.errors to surface per-group collection failures from FODC InspectAll instead of silently dropping the affected node entry.
  • Fix CollectDataInfo and CollectLiaisonInfo not handling CATALOG_PROPERTY groups.
  • Close BanyanDB merge write-path durability gap that allowed torn parts to be created by a crash between data write and metadata commit. Metadata files (metadata.json for trace/measure/stream, manifest.json for sidx, plus traceID.filter and tag.type) now go through a new WriteAtomic (write-tmp + fsync + rename + fsync-dir) sequence; data writers (seqWriter.Close, localFileSystem.Write) now propagate fdatasync errors instead of silently dropping them. mustOpenFilePart / mustOpenPart in each engine cleans up safe post-rename .tmp leftovers on open. (#13862, root cause for #13861)
  • Fix lifecycle migration where the receiving node could create segments shorter than the configured SegmentInterval.
  • Fail fast on incompatible storage version at boot. Previously the server would start in a degraded SERVING state with affected groups un-loaded because the property schema-registry retry loop swallowed the version-incompatibility panic. Compatible versions are listed in banyand/internal/storage/versions.yml.
  • Release bluge index writers on segment rotation so analysisWorker pools sized from GOMAXPROCS don’t accumulate across rotations. Two layered defects kept the existing idle-segment reclaim path from running: segmentIdleTimeout defaulted to 0 (which disabled the 10-minute reclaim ticker), and incRef refreshed lastAccessed on every rotation tick so closeIdleSegments never observed an idle segment. Defaults to time.Hour, moves the lastAccessed bump to real read/write call sites, and rewrites closeIdleSegments to take its own CAS-bumped snapshot so a concurrent reopen cannot have its only ref dropped under the reclaimer (apache/skywalking#13874).
  • Fix incorrect counts and missing trace fields in the lifecycle migration report.
  • Fix trace query identity-tag projection: when trace_id/span_id are explicitly projected, reconstruct them from span identity at response build time instead of requesting them as stored tags, and preserve tag order with null-filled per-span value alignment in the distributed trace result iterator.
  • Fix FODC proxy corrupting Prometheus metric types. The agent dropped the # TYPE line while parsing banyandb /metrics, the StreamMetrics proto carried no type field, and the proxy guessed the type from a name-suffix heuristic — downgrading counters to gauge, mislabeling _count-suffixed counters as histograms, and splitting summaries into two conflicting # TYPE lines. Capture the type with the Prometheus expfmt parser, store it in the flight recorder, thread it through a new Metric.type enum over gRPC, and emit the real type from the proxy; pre-upgrade (untyped) samples fold into the matching typed family so a mixed-version rollout never emits two conflicting # TYPE lines for one metric.
  • Trace storage metrics now expose the storage sub-scope, matching the stream_storage_* naming. The StorageMetricsFactory for trace switched from the root trace scope to trace.storage, so per-segment inverted-index metrics (inverted_index_total_updates, inverted_index_total_doc_count, inverted_index_total_term_searchers_started) are now emitted as banyandb_trace_storage_* instead of banyandb_trace_*, aligning the dashboard query names. Other trace metrics (trace_tst_*, trace_scheduler_*) are unchanged.
  • Fix FODC agent labeling metrics with node_role="ROLE_UNSPECIFIED". The agent resolved the node role exactly once at startup via a single GetCurrentNode poll whose endpoint retries spanned only ~1s; when the sibling lifecycle/banyandb gRPC server was not yet listening (connect: cannot assign requested address) the role fell back to ROLE_UNSPECIFIED permanently, so most nodes never reported their real ROLE_DATA/ROLE_LIAISON. Retry the initial node-role resolution with exponential backoff until a non-unspecified role is obtained or a 25s budget elapses.

Chores

  • Regenerate expired TLS test certificate with 100-year validity.
  • Set Ginkgo --repeat to 0 in the flaky-test workflow so the hourly run completes within the 50-minute timeout.