Node Discovery
Introduction
Node discovery enables BanyanDB nodes to locate and communicate with each other in a distributed cluster. The service discovery mechanism is primarily applied in two scenarios:
- Request Routing: When liaison nodes need to send requests to data nodes for query/write execution.
- Data Migration: When the lifecycle service queries the node list to perform shard migration and rebalancing.
BanyanDB supports two discovery mechanisms to accommodate different deployment environments:
- Etcd-based Discovery: Traditional distributed consensus approach suitable for VM deployments and multi-cloud scenarios.
- DNS-based Discovery: Cloud-native solution leveraging Kubernetes service discovery infrastructure.
This document provides guidance on configuring and operating both discovery modes.
Etcd-Based Discovery
Overview
Etcd-based discovery uses a distributed key-value store to maintain cluster membership. Each node registers itself in etcd with a time-limited lease and continuously renews the lease through heartbeat mechanisms.
Configuration Flags
The following flags control etcd-based node discovery:
# Basic configuration
--node-discovery-mode=etcd # Enable etcd mode (default)
--namespace=banyandb # Namespace in etcd (default: banyandb)
--etcd-endpoints=http://localhost:2379 # Comma-separated etcd endpoints
# Authentication
--etcd-username=myuser # Etcd username (optional)
--etcd-password=mypass # Etcd password (optional)
# TLS configuration
--etcd-tls-ca-file=/path/to/ca.crt # Trusted CA certificate
--etcd-tls-cert-file=/path/to/client.crt # Client certificate
--etcd-tls-key-file=/path/to/client.key # Client private key
# Timeouts and intervals
--node-registry-timeout=2m # Node registration timeout (default: 2m)
--etcd-full-sync-interval=30m # Full state sync interval (default: 30m)
Configuration Examples
Data Node with TLS:
banyand data \
--node-discovery-mode=etcd \
--etcd-endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
--etcd-tls-ca-file=/etc/banyandb/certs/etcd-ca.crt \
--etcd-tls-cert-file=/etc/banyandb/certs/etcd-client.crt \
--etcd-tls-key-file=/etc/banyandb/certs/etcd-client.key \
--namespace=production-banyandb
Data Node with Basic Authentication:
banyand data \
--node-discovery-mode=etcd \
--etcd-endpoints=http://etcd-1:2379,http://etcd-2:2379 \
--etcd-username=banyandb-client \
--etcd-password=${ETCD_PASSWORD} \
--namespace=production-banyandb
Etcd Key Structure
BanyanDB stores node registrations using the following pattern:
/{namespace}/nodes/{node-id}
Example: /banyandb/nodes/data-node-1:17912
Key Components:
- Namespace: Configurable via
--namespaceflag (default:banyandb). Allows multiple BanyanDB clusters to coexist in the same etcd instance - Node ID: Format
{hostname}:{port}or{ip}:{port}, uniquely identifying each node
The value stored is a Protocol Buffer serialized databasev1.Node message containing node metadata, addresses, roles, and labels.
Node Lifecycle
Registration
When a node starts, it performs the following sequence:
- Lease Creation: Creates a 5-second TTL lease with etcd, automatically renewed through keep-alive heartbeat.
- Key-Value Registration: Uses etcd transaction to ensure the key doesn’t already exist, preventing registration conflicts.
- Keep-Alive Loop: Background goroutine continuously sends heartbeat with automatic reconnection and exponential backoff on network failures.
Health Monitoring
Node health is implicitly monitored through the lease mechanism:
- Nodes with expired leases (>5 seconds without keep-alive) are automatically removed by etcd.
- Presence in etcd indicates the node is alive.
- Dead nodes disappear within 5 seconds of failure.
Event Watching
BanyanDB implements real-time cluster membership tracking:
- Watches etcd prefix
/nodes/for PUT (add/update) and DELETE events. - Revision-based streaming resumes from last known revision after reconnection.
- Periodic full sync every 30 minutes (randomized) to detect missed events.
Deregistration
Graceful Shutdown:
- Close signal triggers lease revocation.
- Etcd automatically deletes the node’s key.
- Watch streams notify all other nodes immediately.
Crash Recovery:
- Lease expires after 5 seconds without renewal.
- Etcd automatically removes the key.
- No manual intervention required for cleanup.
DNS-Based Discovery
Overview
DNS-based discovery provides a cloud-native alternative leveraging Kubernetes' built-in service discovery infrastructure. This eliminates the operational complexity of managing a separate etcd cluster.
How it Works with Kubernetes
- Create a Kubernetes headless service.
- Kubernetes automatically creates DNS SRV records:
_<port-name>._<proto>.<service-name>.<namespace>.svc.cluster.local. - BanyanDB queries these SRV records to discover pod IPs and ports.
- Connects to each endpoint via gRPC to fetch node metadata.
Note: No external dependencies beyond DNS infrastructure.
Configuration Flags
# Mode selection
--node-discovery-mode=dns # Enable DNS mode
# DNS SRV addresses (comma-separated)
--node-discovery-dns-srv-addresses=_grpc._tcp.banyandb-data.default.svc.cluster.local
# Query intervals
--node-discovery-dns-fetch-init-interval=5s # Query interval during init (default: 5s)
--node-discovery-dns-fetch-init-duration=5m # Initialization phase duration (default: 5m)
--node-discovery-dns-fetch-interval=15s # Query interval after init (default: 15s)
# gRPC settings
--node-discovery-grpc-timeout=5s # Timeout for metadata fetch (default: 5s)
# TLS configuration
--node-discovery-dns-tls=true # Enable TLS for DNS discovery (default: false)
--node-discovery-dns-ca-certs=/path/to/ca.crt # CA certificates (comma-separated)
Configuration Examples
Single Zone Discovery:
banyand data \
--node-discovery-mode=dns \
--node-discovery-dns-srv-addresses=_grpc._tcp.banyandb-data.default.svc.cluster.local \
--node-discovery-dns-fetch-interval=10s \
--node-discovery-grpc-timeout=5s
Multi-Zone Discovery:
banyand query \
--node-discovery-mode=dns \
--node-discovery-dns-srv-addresses=_grpc._tcp.data.zone1.svc.local,_grpc._tcp.data.zone2.svc.local \
--node-discovery-dns-fetch-interval=10s
Discovery Process
The DNS discovery operates continuously in the background:
-
Query DNS SRV Records (every 5s during init, then every 15s)
- Queries each configured SRV address using
net.DefaultResolver.LookupSRV() - Deduplicates addresses across multiple SRV queries
- Queries each configured SRV address using
-
Fallback on DNS Failure
- Uses cache from previous successful query
- Allows cluster to survive temporary DNS outages
-
Update Node Cache
- For each new address: connects via gRPC, fetches node metadata, notifies handlers.
- For removed addresses: deletes from cache, notifies handlers.
Kubernetes DNS TTL
Kubernetes DNS implements Time-To-Live (TTL) caching which affects discovery responsiveness:
- Default TTL: 30 seconds (CoreDNS default since Kubernetes 1.15)
- Implications: Pod IP changes may not be visible immediately. BanyanDB polls every 15 seconds, potentially using cached results for 2-3 cycles
Customizing TTL:
To reduce discovery latency, edit the CoreDNS ConfigMap:
kubectl edit -n kube-system configmap/coredns
# Modify cache plugin TTL:
# cache 5
TLS Configuration
DNS-based discovery supports per-SRV-address TLS configuration for different roles.
Configuration Rules:
- Number of CA certificate paths must match number of SRV addresses.
caCertPaths[i]is used for connections to addresses fromsrvAddresses[i].- Multiple SRV addresses can share the same certificate file.
- If multiple SRV addresses resolve to the same IP:port, certificate from first SRV address is used.
TLS Examples:
banyand data \
--node-discovery-mode=dns \
--node-discovery-dns-tls=true \
--node-discovery-dns-srv-addresses=_grpc._tcp.data.namespace.svc.local,_grpc._tcp.liaison.namespace.svc.local \
--node-discovery-dns-ca-certs=/etc/banyandb/certs/data-ca.crt,/etc/banyandb/certs/liaison-ca.crt
Certificate Auto-Reload:
BanyanDB monitors certificate files using fsnotify and automatically reloads them when changed:
- Hot reload without process restart
- 500ms debouncing for rapid file modifications
- SHA-256 validation to detect actual content changes
This enables zero-downtime certificate rotation.
Kubernetes Deployment
Headless Service
Create a headless service for automatic DNS SRV record generation:
apiVersion: v1
kind: Service
metadata:
name: banyandb-data
namespace: default
labels:
app: banyandb
component: data
spec:
clusterIP: None # Headless service
selector:
app: banyandb
component: data
ports:
- name: grpc
port: 17912
protocol: TCP
targetPort: grpc
This creates DNS SRV record: _grpc._tcp.banyandb-data.default.svc.cluster.local
Choosing a Discovery Mode
Etcd Mode - Best For
- Traditional VM/bare-metal deployments
- Multi-cloud deployments requiring global service discovery
- Environments where etcd is already deployed
- Need for strong consistency guarantees
DNS Mode - Best For
- Kubernetes-native deployments
- Simplified operations (no external etcd management)
- Cloud-native architecture alignment
- StatefulSets with stable network identities
- Rapid deployment without external dependencies