Troubleshooting Crash Issues
If BanyanDB processes crash or encounter file corruption, follow these steps to diagnose and recover from the issue.
Collect Crash Diagnostics (FODC)
If the cluster runs the FODC agent and proxy, panics and crash artifacts are captured automatically. Query the proxy’s GET /diagnostics endpoint to retrieve aggregated crash records across all nodes, optionally filtered by role or pod_name. The records comprise the structured panic.json (component, panic value, goroutine stack) plus any deep-dump artifacts. See Proxy APIs and CLI Flags for the request/response schema. This is usually the fastest way to recover a panic’s goroutine stack and identify the component that failed before digging into individual node logs.
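As an illustration of the query described above, here is a minimal Python sketch. It assumes that role and pod_name are passed as URL query parameters and that the proxy answers with a JSON body. Neither assumption is confirmed here, so check Proxy APIs and CLI Flags for the real schema. The proxy address in the usage note is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_diagnostics_url(proxy_base_url, role=None, pod_name=None):
    """Build the GET /diagnostics URL, adding only the filters that are set.

    Assumption: the filters are plain query parameters named role and pod_name.
    """
    filters = {k: v for k, v in (("role", role), ("pod_name", pod_name)) if v}
    url = proxy_base_url.rstrip("/") + "/diagnostics"
    return url + ("?" + urlencode(filters) if filters else "")


def fetch_diagnostics(proxy_base_url, role=None, pod_name=None, timeout=10):
    """Fetch aggregated crash records. Assumption: the response body is JSON."""
    url = build_diagnostics_url(proxy_base_url, role, pod_name)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, fetch_diagnostics("http://fodc-proxy.example:8080", role="data") would request crash records for one role only, where the host and port stand in for your own proxy address.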
Remove Corrupted Standalone Metadata
If the BanyanDB standalone process crashes because of corrupted metadata, remove the corrupted metadata:
- Shut Down BanyanDB:
- Before making any changes to the data files, ensure that BanyanDB is not running. This prevents further corruption and ensures data integrity.
- Send SIGTERM or SIGINT signals to the BanyanDB process to gracefully shut it down.
- Locate the Metadata File:
- Schema metadata is stored by the property-based schema server embedded in every data node.
- Navigate to the directory pointed to by --schema-server-root-path (default /tmp).
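The shutdown step above can be scripted. The following is a minimal Python sketch for POSIX systems that sends SIGTERM (or SIGINT) and then waits until the process is gone before any files are touched. The helper names, timeout, and polling interval are illustrative choices, not BanyanDB tooling, and the PID of the BanyanDB process has to be looked up separately.

```python
import os
import signal
import time


def request_shutdown(pid, sig=signal.SIGTERM):
    """Ask the process to shut down gracefully (SIGTERM or SIGINT)."""
    os.kill(pid, sig)


def is_running(pid):
    """Return True while a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid, timeout=30.0, poll_interval=0.5):
    """Poll until the process is gone; return False if it is still alive at timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_running(pid):
            return True
        time.sleep(poll_interval)
    return not is_running(pid)
```

Only continue with file changes once wait_for_exit returns True. If it returns False, the process is still running and the data files should be left alone.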
Remove Corrupted Stream, Measure or Trace Data
The logs may indicate that the crash was caused by corrupted data. In such cases, it is essential to remove the corrupted data to restore the integrity of the database. Follow these steps to safely remove corrupted data from BanyanDB:
- Identify the Corrupted Data: Monitor the BanyanDB logs for error messages indicating data corruption. The corrupted file named in the logs is located in a part directory. Remove the whole part directory rather than a single file.
- Shut Down BanyanDB:
- Before making any changes to the data files, ensure that BanyanDB is not running. This prevents further corruption and ensures data integrity.
- Send SIGTERM or SIGINT signals to the BanyanDB process to gracefully shut it down.
- Locate the Snapshot File:
- In each shard of the TSDB (Time Series Database), there is a snapshot file that lists all live part directories.
- Navigate to the directory where BanyanDB stores its data. This is typically specified in the flags
- Remove the Corrupted File:
- Identify the corrupted part within the snapshot directory.
- Remove the part’s record from the snapshot file.
- Clean Up Part:
- Remove the corrupted part directory from the disk.
- Restart BanyanDB:
- Once the corrupted part is removed and the metadata is cleaned up, restart BanyanDB to apply the changes.
- Verify the Integrity:
- After restarting, monitor the BanyanDB logs to ensure that the corruption issues have been resolved.
- Run any necessary integrity checks or queries to verify that the database is functioning correctly.
- Prevent Future Corruption:
- Monitor system resources and ensure that the hardware and storage systems are functioning correctly.
- Keep BanyanDB and its dependencies updated to the latest versions to benefit from bug fixes and improvements.
By following these steps, you can safely remove corrupted data from BanyanDB and ensure the continued integrity and performance of your database.
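The two removal steps (dropping the part’s record from the snapshot file, then deleting the part directory) can be sketched in Python as follows. This sketch rests on an unverified assumption: that the snapshot file is a JSON array of part directory names stored next to those directories in the shard. Inspect a real snapshot file before relying on it. The function name, the backup directory, and the file names in the usage note are hypothetical. Run it only while BanyanDB is stopped.

```python
import json
import shutil
from pathlib import Path


def drop_part(shard_dir, snapshot_name, part_name, backup_dir):
    """Remove one part from the snapshot file, then delete its directory.

    Assumption (unverified): the snapshot is a JSON array of part names.
    """
    shard = Path(shard_dir)
    snapshot = shard / snapshot_name
    parts = json.loads(snapshot.read_text())
    if part_name not in parts:
        raise ValueError(f"{part_name} is not listed in {snapshot}")

    # Keep a copy outside the shard so the edit can be rolled back.
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    shutil.copy2(snapshot, backup / snapshot_name)

    # Step 1: remove the part's record from the snapshot file.
    remaining = [p for p in parts if p != part_name]
    snapshot.write_text(json.dumps(remaining))

    # Step 2: remove the whole part directory, not a single file inside it.
    part_dir = shard / part_name
    if part_dir.is_dir():
        shutil.rmtree(part_dir)
    return remaining
```

A call might look like drop_part(shard_path, "0000000000000003.snp", "0000000000000002", "/var/backups/banyandb"), where every argument is a placeholder for values taken from your own data directory and logs. The backup is written before anything is changed, so the original snapshot can be restored if the restart does not go as expected.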