FODC Proxy Development Design

Overview

The FODC Proxy is the central control plane and data aggregator for the First Occurrence Data Collection (FODC) infrastructure. It acts as a unified gateway that aggregates observability data from multiple FODC Agents (each co-located with a BanyanDB node) and exposes ecosystem-friendly interfaces to external systems such as Prometheus and other observability platforms.

The Proxy provides:

  1. Agent Management: Registration, health monitoring, and lifecycle management of FODC Agents
  2. Metrics Aggregation: Collects and aggregates metrics from all agents with enriched metadata
  3. Cluster Topology: Maintains an up-to-date view of cluster topology, roles, and node states
  4. Configuration Collection: Aggregates and exposes node configurations for consistency verification

Responsibilities

FODC Proxy Core Responsibilities

  • Accept bi-directional gRPC connections from FODC Agents
  • Register and track agent lifecycle (online/offline, heartbeat monitoring)
  • Aggregate metrics from all agents with node metadata enrichment
  • Maintain cluster topology view based on agent registrations
  • Collect and expose node configurations for audit and consistency checks
  • Expose unified REST/Prometheus-style interfaces for external consumption
  • Provide proxy-level metrics (health, agent count, RPC latency, etc.)

Component Design

1. Proxy Components

1.1 Agent Registry Component

Purpose: Manages the lifecycle and state of all connected FODC Agents

Core Responsibilities
  • Agent Registration: Accepts agent registration requests via gRPC
  • Health Monitoring: Tracks agent heartbeat and connection status
  • State Management: Maintains agent state (online/offline, last heartbeat time)
  • Topology Building: Aggregates agent registrations into cluster topology view
  • Connection Management: Handles connection failures, reconnections, and cleanup
Core Types

AgentInfo

type AgentInfo struct {
	AgentID           string                    // Unique agent identifier (UUID, used as registry key)
	AgentIdentity     AgentIdentity              // Agent identity (IP+port+role+labels)
	NodeRole          databasev1.Role          // Node role (liaison, datanode-hot, etc.)
	PrimaryAddress    Address                  // Primary agent gRPC address
	SecondaryAddresses map[string]Address      // Secondary addresses with names (e.g., "metrics": Address, "gossip": Address)
	Labels            map[string]string        // Node labels/metadata
	RegisteredAt      time.Time                // Registration timestamp
	LastHeartbeat     time.Time               // Last heartbeat timestamp
	Status            AgentStatus               // Current agent status
}

type Address struct {
	IP   string
	Port int
}

type AgentStatus string

const (
	AgentStatusOnline  AgentStatus = "online"
	AgentStatusOffline AgentStatus = "unconnected"
)

AgentRegistry

type AgentRegistry struct {
	agents    map[string]*AgentInfo      // Map from unique agent ID to agent info
	mu        sync.RWMutex               // Protects agents map
	logger    *logger.Logger
	heartbeatTimeout time.Duration       // Timeout for considering agent offline
}

type AgentIdentity struct {
	IP     string                 // Primary IP address
	Port   int                    // Primary port
	Role   databasev1.Role       // Node role
	Labels map[string]string      // Node labels (used for lookup and identification)
}
Key Functions

RegisterAgent(ctx context.Context, info *AgentInfo) (string, error)

  • Registers a new agent or updates existing agent information
  • Generates unique agent ID (UUID) for the agent
  • Creates AgentIdentity from primary IP + port + role + labels and stores in AgentInfo
  • Uses unique agent ID as the map key
  • Validates primary address and role
  • Updates topology view
  • Returns agent ID and error if registration fails

UnregisterAgent(agentID string) error

  • Removes agent from registry using unique agent ID
  • Cleans up associated resources
  • Updates topology view
  • Called in the following scenarios:
    • When agent’s registration stream closes (connection lost)
    • When agent’s all streams are closed and connection is terminated
    • When agent has been offline longer than --agent-cleanup-timeout (detected by heartbeat health check)
    • During graceful shutdown or manual agent removal
    • When agent explicitly requests unregistration via stream

UpdateHeartbeat(agentID string) error

  • Updates last heartbeat timestamp for agent using unique agent ID
  • Marks agent as online if it was offline
  • Returns error if agent not found

GetAgent(ip string, port int, role databasev1.Role, labels map[string]string) (*AgentInfo, error)

  • Retrieves agent information by primary IP + port + role + labels
  • Creates AgentIdentity from the provided parameters
  • Searches through agents map to find matching AgentIdentity
  • Returns error if agent not found

GetAgentByID(agentID string) (*AgentInfo, error)

  • Retrieves agent information by unique agent ID
  • Direct lookup in agents map
  • Returns error if agent not found

ListAgents() []*AgentInfo

  • Returns list of all registered agents
  • Thread-safe read operation

ListAgentsByRole(role databasev1.Role) []*AgentInfo

  • Returns agents filtered by role
  • Useful for role-specific operations

CheckAgentHealth() error

  • Periodically checks agent health based on heartbeat timeout
  • Marks agents as offline if heartbeat timeout exceeded
  • Continuously runs heartbeat health checks regardless of cleanup timeout
  • Unregisters agents that have been offline longer than --agent-cleanup-timeout
  • Agents that cannot maintain connection will be removed after the cleanup timeout period
  • Returns aggregated health status
Configuration Flags

--agent-heartbeat-timeout

  • Type: duration
  • Default: 30s
  • Description: Timeout for considering an agent offline if no heartbeat received

--max-agents

  • Type: int
  • Default: 1000
  • Description: Maximum number of agents that can be registered

--agent-cleanup-timeout

  • Type: duration
  • Default: 5m
  • Minimum: Must be greater than --agent-heartbeat-timeout
  • Description: Timeout for automatically unregistering agents that have been offline. Agents that cannot maintain connection will be removed after being offline longer than this timeout. The heartbeat health check continues running regardless of this timeout. This timeout must be greater than --agent-heartbeat-timeout to allow for proper health checking.

1.2 gRPC Server Component

Purpose: Handles bi-directional gRPC communication with FODC Agents

Core Responsibilities
  • Agent Connection Handling: Accepts and manages gRPC connections from agents
  • Streaming Support: Supports bi-directional streaming for metrics
  • Protocol Implementation: Implements FODC gRPC service protocol
  • Connection Lifecycle: Manages connection establishment, maintenance, and cleanup
Core Types

FODCService (gRPC Service Implementation)

type FODCService struct {
	registry        *AgentRegistry
	metricsAggregator *MetricsAggregator
	logger          *logger.Logger
}

// Example gRPC service methods (to be defined in proto)
// RegisterAgent(stream RegisterAgentRequest) (stream RegisterAgentResponse) error
// StreamMetrics(stream StreamMetricsRequest) (stream StreamMetricsResponse) error

AgentConnection

type AgentConnection struct {
	AgentID     string                // Unique agent ID for registry lookup
	Identity    AgentIdentity         // Agent identity (IP+port+role+labels)
	Stream      grpc.ServerStream
	Context     context.Context
	Cancel      context.CancelFunc
	LastActivity time.Time
}
Key Functions

RegisterAgent(stream FODCService_RegisterAgentServer) error

  • Handles bi-directional agent registration stream
  • Receives registration requests from agent (includes primary address, role, labels)
  • Creates AgentIdentity from primary IP + port + role + labels
  • Validates registration information
  • Registers agent with AgentRegistry (generates unique agent ID)
  • Stores AgentIdentity in AgentInfo for lookup and identification purposes
  • Sends registration responses
  • Maintains stream for heartbeat and re-registration

StreamMetrics(stream FODCService_StreamMetricsServer) error

  • Handles bi-directional metrics streaming
  • Sends metrics requests from Proxy to agent (on-demand collection)
  • Receives metrics from agent at Proxy in response to requests
  • Proxy initiates by sending a metrics request via StreamMetricsResponse when an external client queries metrics
  • Agent responds with StreamMetricsRequest containing the collected metrics
  • Manages stream lifecycle
Connection Lifecycle Management

Stream Closure Handling

  • When a stream closes (due to network error, agent shutdown, or timeout), the gRPC server should:
    1. Detect stream closure via context cancellation or stream error
    2. Extract unique agent ID from the connection
    3. Check if this is the last active stream for the agent
    4. If all streams are closed, call AgentRegistry.UnregisterAgent(agentID) to clean up
    5. Update topology to reflect agent offline status

Graceful vs. Ungraceful Disconnection

  • Graceful: Agent sends explicit disconnect message before closing stream → immediate unregistration
  • Ungraceful: Stream closes unexpectedly → unregistration happens after detecting all streams closed
  • Heartbeat Timeout: Agent marked offline by CheckAgentHealth() → unregistration occurs after being offline longer than --agent-cleanup-timeout. Heartbeat health checks continue running regardless.
Configuration Flags

--grpc-listen-addr

  • Type: string
  • Default: :17900
  • Description: gRPC server address where the Proxy listens for agent connections

--grpc-max-msg-size

  • Type: int
  • Default: 4194304 (4MB)
  • Description: Maximum message size for gRPC messages

1.3 Metrics Aggregator Component

Purpose: Aggregates and enriches metrics from all agents

Core Responsibilities
  • On-Demand Metrics Collection: Requests metrics from agents via gRPC streams when external clients query metrics
  • Metrics Request Coordination: Coordinates metrics requests to multiple agents concurrently
  • Metadata Enrichment: Adds node metadata (role, agent ID, labels) to metrics
  • Normalization: Normalizes metric formats and labels across agents
  • Time Window Filtering: Filters metrics by time window when requested by external clients (agents filter based on Flight Recorder data)
Core Types

AggregatedMetric

type AggregatedMetric struct {
	Name        string                 // Metric name
	Labels      map[string]string      // Metric labels (including node metadata)
	Value       float64                // Metric value
	Timestamp   time.Time              // Metric timestamp
	AgentID     string                 // Source agent ID
	NodeRole    databasev1.Role       // Source node role
	Description string                 // Metric description/HELP text
}

MetricsAggregator

type MetricsAggregator struct {
	registry      *AgentRegistry
	grpcService   *FODCService        // For requesting metrics from agents
	mu            sync.RWMutex
	logger        *logger.Logger
}

type MetricsFilter struct {
	AgentIDs []string              // Filter by specific agent IDs (empty = all agents)
	Role    databasev1.Role       // Filter by node role (optional)
	StartTime *time.Time          // Start time for time window (optional)
	EndTime   *time.Time          // End time for time window (optional)
}
Key Functions

CollectMetricsFromAgents(ctx context.Context, filter *MetricsFilter) ([]*AggregatedMetric, error)

  • Requests metrics from all agents (or filtered agents) when external client queries
  • Sends metrics request via StreamMetrics() to each agent with time window filter (if specified)
  • Agents retrieve metrics from Flight Recorder (in-memory storage) filtered by time window and respond
  • Waits for StreamMetricsRequest responses from agents (with timeout)
  • Identifies agent connection from stream context and looks up unique agent ID
  • Retrieves AgentInfo using agent ID
  • Enriches metrics with node metadata (IP, port, role, labels) from AgentInfo
  • Combines metrics from all agents into a single list
  • Returns aggregated metrics (not stored, only returned)
  • Returns error if collection fails

GetLatestMetrics(ctx context.Context) ([]*AggregatedMetric, error)

  • Triggers on-demand collection from all agents
  • Calls CollectMetricsFromAgents() with no time filter to get current metrics
  • Returns latest metrics from all agents
  • Used for /metrics-windows endpoint without time parameters
  • Returns error if collection fails

GetMetricsWindow(ctx context.Context, startTime, endTime time.Time, filter *MetricsFilter) ([]*AggregatedMetric, error)

  • Triggers on-demand collection from all agents
  • Calls CollectMetricsFromAgents() with time window filter
  • Agents filter metrics from Flight Recorder by the specified time range
  • Returns metrics within specified time range
  • Used for /metrics-windows endpoint with time parameters
  • Returns error if collection fails
Configuration Flags

Note: No configuration flags needed for MetricsAggregator since metrics are collected on-demand and not stored.

1.4 HTTP/REST API Server Component

Purpose: Exposes REST and Prometheus-style endpoints for external consumption

Core Responsibilities
  • REST API: Implements REST endpoints for cluster topology and configuration
  • Prometheus Integration: Exposes Prometheus-compatible metrics endpoints
  • Request Handling: Handles HTTP requests and routes to appropriate handlers
  • Response Formatting: Formats responses in appropriate formats (JSON, Prometheus text)
API Endpoints

GET /metrics

  • Returns latest metrics from all agents (on-demand collection, not stored in Proxy)
  • Includes node metadata (role, agent ID, labels)
  • Format: Prometheus text format
  • Query parameters:
    • role (optional): Filter by role (liaison, datanode-hot, etc.)
    • address (optional): Filter by ip or ip:port
  • Used by: Prometheus scrapers, monitoring systems

GET /metrics-windows

  • Returns metrics from all agents (on-demand collection, not stored in Proxy)
  • Includes node metadata (role, agent ID, labels)
  • Format: JSON
  • Query parameters:
    • start_time: Start time for time window (optional) - filters metrics by start time (agents filter from Flight Recorder)
    • end_time: End time for time window (optional) - filters metrics by end time (agents filter from Flight Recorder)
    • role: Filter by node role (optional)
    • address (optional): Filter by ip or ip:port

GET /cluster

  • Returns cluster topology and node status
  • Format: JSON
  • Response includes:
    • List of all registered nodes
    • Agent identity (agent ID, address)
    • Node role
    • Node labels
    • Agent status (online/offline, last heartbeat)
    • Node relationships
  • Note: Node information is obtained from agent registration data stored in AgentRegistry

GET /cluster/config

  • Returns node configurations
  • Format: JSON
  • Query parameters:
    • agent_id: Filter by agent ID (optional)
    • role: Filter by node role (optional)

GET /health

  • Health check endpoint
  • Format: JSON
  • Returns proxy health status
Core Types

APIServer

type APIServer struct {
	metricsAggregator   *MetricsAggregator
	server              *http.Server
	logger              *logger.Logger
}
Key Functions

Start(listenAddr string) error

  • Starts HTTP server
  • Registers all API endpoints
  • Returns error if start fails

Stop() error

  • Gracefully stops HTTP server
  • Waits for in-flight requests
  • Returns error if stop fails
Configuration Flags

--http-listen-addr

  • Type: string
  • Default: :17901
  • Description: HTTP server listen address

--http-read-timeout

  • Type: duration
  • Default: 10s
  • Description: HTTP read timeout

--http-write-timeout

  • Type: duration
  • Default: 10s
  • Description: HTTP write timeout

2. Agent Components

Purpose: Components that run within FODC Agents to communicate with the Proxy

2.1 Proxy Client Component

Purpose: Manages connection and communication with the FODC Proxy

Core Responsibilities
  • Connection Management: Establishes and maintains gRPC connection to Proxy
  • Registration: Registers agent with Proxy on startup
  • Heartbeat Management: Sends periodic heartbeats to maintain connection
  • Stream Management: Manages bi-directional streams for metrics
  • Reconnection Logic: Handles connection failures and automatic reconnection
Core Types

ProxyClient

type ProxyClient struct {
	proxyAddr       string
	nodeIP          string
	nodePort        int
	nodeRole        databasev1.Role
	labels          map[string]string
	conn            *grpc.ClientConn
	client          fodcv1.FODCServiceClient
	heartbeatTicker *time.Ticker
	mu              sync.RWMutex
	logger          *logger.Logger
}

type MetricsRequestFilter struct {
	StartTime *time.Time  // Start time for time window (optional)
	EndTime   *time.Time  // End time for time window (optional)
}

Note: The ProxyClient integrates with Flight Recorder, which continuously collects and stores metrics in-memory. When the Proxy requests metrics, the agent retrieves them from Flight Recorder rather than collecting them on-demand.

Key Functions

Connect(ctx context.Context) error

  • Establishes gRPC connection to Proxy
  • Returns error if connection fails

StartRegistrationStream(ctx context.Context) error

  • Establishes bi-directional registration stream with Proxy
  • Sends initial RegisterAgentRequest with primary address (IP + port), role, and labels
  • Receives RegisterAgentResponse with heartbeat interval
  • Maintains stream for periodic heartbeat and re-registration
  • Returns error if stream establishment fails

StartMetricsStream(ctx context.Context) error

  • Establishes bi-directional metrics stream with Proxy
  • Maintains stream for on-demand metrics requests
  • Listens for metrics request commands from Proxy
  • When request received, retrieves metrics from Flight Recorder and sends response
  • Returns error if stream fails

RetrieveAndSendMetrics(ctx context.Context, filter *MetricsRequestFilter) error

  • Retrieves metrics from Flight Recorder when requested by Proxy
  • Queries Flight Recorder (in-memory storage) for metrics
  • Flight Recorder contains continuously collected metrics from:
    • BanyanDB /metrics endpoint scraping
    • KTM/OS telemetry collection
    • Other observability sources
  • Filters metrics by time window if specified in request
  • Packages metrics in StreamMetricsRequest (includes metrics and timestamp)
  • Sends metrics to Proxy via StreamMetrics() stream
  • Returns error if retrieval or send fails

SendHeartbeat(ctx context.Context) error

  • Sends heartbeat to Proxy
  • Updates connection status
  • Returns error if heartbeat fails

Disconnect() error

  • Closes connection to Proxy
  • Cleans up resources
  • Returns error if disconnect fails
Configuration Flags

--proxy-addr

  • Type: string
  • Default: localhost:17900
  • Description: FODC Proxy gRPC address

--node-ip

  • Type: string
  • Default: (required, no default)
  • Description: IP address for this BanyanDB node’s primary gRPC address. Used as part of AgentIdentity for agent identification.

--node-port

  • Type: int
  • Default: (required, no default)
  • Description: Port number for this BanyanDB node’s primary gRPC address. Used as part of AgentIdentity for agent identification.

--node-role

  • Type: string
  • Default: (required, no default)
  • Description: Role of this BanyanDB node. Valid values: liaison, datanode-hot, datanode-warm, datanode-cold, etc. Must match the node’s actual role in the cluster.

--node-labels

  • Type: string (comma-separated key=value pairs)
  • Default: (optional)
  • Description: Labels/metadata for this node. Format: key1=value1,key2=value2. Examples: zone=us-west-1,env=production. Used for filtering and grouping nodes in the Proxy.

--heartbeat-interval

  • Type: duration
  • Default: 10s
  • Description: Interval for sending heartbeats to Proxy. Note: The Proxy may override this value in RegisterAgentResponse.

--reconnect-interval

  • Type: duration
  • Default: 5s
  • Description: Interval for reconnection attempts when connection to Proxy is lost

API Design

gRPC API

Service Definition


syntax = "proto3";

package banyandb.fodc.v1;

import "google/protobuf/timestamp.proto";

option go_package = "github.com/apache/skywalking-banyandb/api/proto/banyandb/fodc/v1";

service FODCService {
  // Bi-directional stream for agent registration and heartbeat
  rpc RegisterAgent(stream RegisterAgentRequest) returns (stream RegisterAgentResponse);
  
  // Bi-directional stream for metrics
  // Agent sends StreamMetricsRequest (metrics data), Proxy sends StreamMetricsResponse (metrics requests)
  rpc StreamMetrics(stream StreamMetricsRequest) returns (stream StreamMetricsResponse);
}

message RegisterAgentRequest {
  string node_role = 1;
  map<string, string> labels = 2;
  Address primary_address = 3;
  map<string, Address> secondary_addresses = 4;
}

message Address {
  string ip = 1;
  int32 port = 2;
}

message RegisterAgentResponse {
  bool success = 1;
  string message = 2;
  int64 heartbeat_interval_seconds = 3;
}

message StreamMetricsRequest {
  repeated Metric metrics = 1;
  google.protobuf.Timestamp timestamp = 2;
}

message Metric {
  string name = 1;
  map<string, string> labels = 2;
  double value = 3;
  string description = 4;
}

message StreamMetricsResponse {
  google.protobuf.Timestamp start_time = 1;  // Optional start time for time window
  google.protobuf.Timestamp end_time = 2;    // Optional end time for time window
}

REST API

Endpoints

GET /metrics

  • Description: Returns latest metrics from all agents (on-demand collection, not stored in Proxy)
  • Query Parameters:
    • address (optional): Filter by ip or ip:port
    • role (optional): Filter by role (liaison, datanode-hot, etc.)
  • Response: Prometheus text format with node metadata labels
  • Note: Metrics are collected on-demand from agents' Flight Recorder. No metrics are stored in the Proxy.
  • Example:
# HELP banyandb_stream_tst_inverted_index_total_doc_count Total document count
# TYPE banyandb_stream_tst_inverted_index_total_doc_count gauge
banyandb_stream_tst_inverted_index_total_doc_count{index="test",agent_id="agent-uuid-123",node_role="datanode-hot"} 12345

GET /metrics-windows

  • Description: Returns metrics from all agents (on-demand collection, not stored in Proxy)
  • Query Parameters:
    • start_time (optional): RFC3339 timestamp - filters metrics by start time (agents filter from Flight Recorder)
    • end_time (optional): RFC3339 timestamp - filters metrics by end time (agents filter from Flight Recorder)
    • address (optional): Filter by ip or ip:port
    • role (optional): Filter by role (liaison, datanode-hot, etc.)
  • Response: JSON array of metrics, each with time series data for line charts
  • Note: Metrics are collected on-demand from agents' Flight Recorder. No metrics are stored in the Proxy.
  • Format: Each metric is grouped by name and labels, with a data array containing timestamp-value pairs for charting
  • Example:
[
  {
    "name": "banyandb_stream_tst_inverted_index_total_doc_count",
    "description": "Total document count",
    "labels": {
      "index": "test",
      "node_role": "datanode-hot"
    },
    "agent_id": "agent-uuid-123",
    "ip": "192.168.1.1",
    "port": 17900,
    "data": [
      {
        "timestamp": "2024-01-01T11:00:00Z",
        "value": 12345
      },
      {
        "timestamp": "2024-01-01T11:05:00Z",
        "value": 12350
      },
      {
        "timestamp": "2024-01-01T12:00:00Z",
        "value": 12355
      }
    ]
  },
  {
    "name": "http_requests_total",
    "description": "Total requests",
    "labels": {
      "index": "test",
      "node_role": "datanode-hot"
    },
    "agent_id": "agent-uuid-123",
    "ip": "192.168.1.1",
    "port": 17900,
    "data": [
      {
        "timestamp": "2024-01-01T11:00:00Z",
        "value": 300
      },
      {
        "timestamp": "2024-01-01T11:05:00Z",
        "value": 200
      },
      {
        "timestamp": "2024-01-01T12:00:00Z",
        "value": 250
      }
    ]
  },
  {
    "name": "banyandb_stream_tst_inverted_index_total_doc_count",
    "description": "Total document count",
    "labels": {
      "index": "test",
      "node_role": "liaison"
    },
    "agent_id": "agent-uuid-456",
    "ip": "192.168.1.2",
    "port": 17900,
    "data": [
      {
        "timestamp": "2024-01-01T11:00:00Z",
        "value": 54321
      },
      {
        "timestamp": "2024-01-01T11:05:00Z",
        "value": 54330
      },
      {
        "timestamp": "2024-01-01T12:00:00Z",
        "value": 54340
      }
    ]
  }
]

GET /cluster

  • Description: Returns cluster topology and node information
  • Response: JSON
  • Example:
{
  "nodes": [
    {
      "node_id": "node1",
      "node_role": "liaison",
      "primary_address": {
        "ip": "192.168.1.1",
        "port": 17900
      },
      "secondary_addresses": {
        "metrics": {
          "ip": "192.168.1.1",
          "port": 17901
        },
        "admin": {
          "ip": "192.168.1.1",
          "port": 17902
        }
      },
      "labels": {
        "zone": "us-west-1"
      },
      "status": "online",
      "last_heartbeat": "2024-01-01T12:00:00Z",
    },
    {
      "node_id": "node2",
      "node_role": "data-hot",
      "primary_address": {
        "ip": "192.168.1.1",
        "port": 17903
      },
      "secondary_addresses": {
        "metrics": {
          "ip": "192.168.1.1",
          "port": 17904
        },
        "admin": {
          "ip": "192.168.1.1",
          "port": 17905
        }
      },
      "labels": {
        "zone": "us-west-1"
      },
      "status": "online",
      "last_heartbeat": "2024-01-01T12:00:00Z",
    }
  ],
  "calls": [
    {
      "id": "call1",
      "source": "node1",
      "target": "node2",
    },
  ]
  "updated_at": "2024-01-01T12:00:00Z"
}

GET /cluster/config

  • Description: Returns node configurations
  • Query Parameters:
    • address (optional): Filter by ip or ip:port
    • role (optional): Filter by role
  • Response: JSON
  • Example:
{
  "configurations": [
    {
      "agent_id": "agent-uuid-123",
      "node_role": "liaison",
      "startup_params": {
        "--listen-addr": ":17900",
        "--data-path": "/data"
      },
      "runtime_config": {
        "max_connections": 1000
      },
      "collected_at": "2024-01-01T12:00:00Z"
    }
  ]
}

GET /health

  • Description: Health check endpoint
  • Response: JSON
  • Example:
{
  "status": "healthy",
  "agents_online": 5,
  "agents_total": 5,
  "uptime_seconds": 3600
}

Data Flow

Agent Registration Flow

1. FODC Agent Starts
   - Reads node information from configuration flags:
     * --node-ip: Node IP
     * --node-port: Node port
     * --node-role: Node role (liaison, datanode-hot, etc.)
     * --node-labels: Optional node labels/metadata
   ↓
2. Agent Connects to Proxy via gRPC
   - Uses --proxy-addr flag to determine Proxy address
   ↓
3. Agent Establishes RegisterAgent() Bi-directional Stream
   - Opens stream to Proxy
   ↓
4. Agent Sends RegisterAgentRequest
   - Sends primary address (IP + port), role, and labels from configuration flags
   ↓
5. Proxy Validates Registration
   - Creates AgentIdentity from primary IP + port + role + labels
   - Generates unique agent ID (UUID) for the agent
   - Validates role and primary address
   - Note: AgentIdentity may be repetitious, so unique agent ID is used as registry key
   ↓
6. Proxy Registers Agent
   - Creates AgentInfo in AgentRegistry
   - Updates ClusterTopology
   ↓
7. Proxy Sends RegisterAgentResponse via Stream
   - Heartbeat interval
   ↓
8. Agent Maintains Registration Stream
   - Sends periodic heartbeats via RegisterAgentRequest
   - Receives RegisterAgentResponse updates
   - Stream remains open for re-registration if needed
   ↓
9. Agent Establishes Metrics Stream
   - StreamMetrics() for metrics collection

Metrics Collection Flow

1. External Client Requests Metrics
   - GET /metrics (latest metrics from all agents)
   - GET /metrics-windows (agent metrics with time window)
   ↓
2. APIServer Receives Request
   - Routes to metrics handler
   - Extracts query parameters (start_time, end_time, agent_id, role)
   ↓
3. Proxy Requests Metrics from All Agents
   - For each registered agent, sends metrics request via StreamMetrics()
   - For `/metrics`: requests latest metrics (no time window filter)
   - For `/metrics-windows`: includes time window filter (start_time, end_time) if specified
   - Uses bi-directional stream to request current metrics
   - May filter by agent_id or role if specified in query
   ↓
4. Agents Retrieve Metrics from Flight Recorder
   - Each agent receives metrics request via StreamMetrics() with time window filter (if specified)
   - Agent retrieves metrics from Flight Recorder (in-memory storage)
   - Flight Recorder continuously collects and stores metrics from:
     * BanyanDB /metrics endpoint scraping
     * KTM/OS telemetry collection
   - Agent filters metrics by time window if specified in StreamMetricsResponse
   - Packages filtered metrics in StreamMetricsRequest
   ↓
5. Agents Send Metrics to Proxy
   - Each agent sends StreamMetricsRequest via StreamMetrics()
   - Includes metrics and timestamp
   ↓
6. Proxy Receives Metrics from All Agents
   - Collects StreamMetricsRequest from each agent
   - Waits for responses from all requested agents (with timeout)
   ↓
7. Proxy Enriches Metrics
   - Identifies agent connection from stream context
   - Looks up unique agent ID from agent connection
   - Retrieves AgentInfo using agent ID
   - Adds node metadata (IP, port, role, labels) from AgentInfo to each metric
   - Normalizes metric format across agents
   ↓
8. MetricsAggregator Aggregates Metrics
   - Combines metrics from all agents into a single list
   - Filters by time window if specified (agents already filtered by time window from Flight Recorder)
   - Associates metrics with agent using unique agent ID from stream context
   - Does not store metrics (on-demand collection only)
   ↓
9. Proxy Formats and Returns Response
   - Converts aggregated metrics to Prometheus text format
   - Includes node metadata labels
   - Returns HTTP 200 response to external client

Metrics Query Flow

Note: This flow is integrated with Metrics Collection Flow. When an external request comes in, the Proxy requests metrics from all agents. Agents retrieve metrics from Flight Recorder (in-memory storage) which continuously collects metrics.

For /metrics endpoint (Latest metrics from all agents):

1. External Client Requests Latest Metrics
   - GET /metrics?role=Y (optional filters)
   ↓
2. Triggers Metrics Collection Flow
   - Proxy requests latest metrics from all agents (or filtered agents)
   - No time window filter (returns latest metrics)
   - See "Metrics Collection Flow" above for detailed steps
   ↓
3. Response Sent to Client
   - HTTP 200 with Prometheus text format
   - Includes aggregated latest metrics from all agents with node metadata

For /metrics-windows endpoint (Agent metrics):

1. External Client Requests Agent Metrics
   - GET /metrics-windows?start_time=X&end_time=Y&role=W
   ↓
2. Triggers Metrics Collection Flow
   - Proxy requests metrics from all agents (or filtered agents)
   - See "Metrics Collection Flow" above for detailed steps
   ↓
3. Response Sent to Client
   - HTTP 200 with Prometheus text format
   - Includes aggregated metrics from all agents with node metadata

Testing Strategy

Unit Testing

Agent Registry Package

  • Test RegisterAgent() with valid and invalid inputs
  • Test UnregisterAgent() cleanup behavior
  • Test UpdateHeartbeat() timestamp updates
  • Test GetAgent() and ListAgents() retrieval
  • Test ListAgentsByRole() filtering
  • Test CheckAgentHealth() timeout detection
  • Test concurrent registration/unregistration
  • Test error handling for duplicate registrations

gRPC Service Package

  • Test RegisterAgent() bi-directional streaming
  • Test StreamMetrics() bi-directional streaming
  • Test connection lifecycle management
  • Test error handling for invalid requests
  • Test stream reconnection behavior
  • Test concurrent stream handling

Metrics Aggregator Package

  • Test CollectMetricsFromAgents() on-demand collection
  • Test GetLatestMetrics() without time filter
  • Test GetMetricsWindow() with time filter
  • Test metrics enrichment with node metadata
  • Test concurrent metrics collection from multiple agents
  • Test time window filtering (agents filter from Flight Recorder)
  • Test node and role filtering

HTTP/REST API Package

  • Test all REST endpoints (GET/POST handlers)
  • Test request validation
  • Test response formatting (JSON, Prometheus text)
  • Test query parameter parsing
  • Test error handling and status codes
  • Test concurrent request handling
  • Test rate limiting (if implemented)

Integration Testing

Test Case 1: Agent Registration and Metrics Flow

  • Start Proxy server
  • Start mock Agent
  • Agent registers with Proxy
  • Query /metrics-windows endpoint (triggers metrics request)
  • Verify Proxy requests metrics from agent via StreamMetrics()
  • Verify agent retrieves metrics from Flight Recorder and sends response
  • Verify metrics appear in /metrics-windows endpoint response
  • Verify agent appears in /cluster endpoint
  • Stop agent and verify offline status

Test Case 2: Metrics Time Window

  • Start Proxy and multiple Agents
  • Query /metrics-windows with time range (triggers metrics request)
  • Verify Proxy requests metrics from all agents
  • Verify agents retrieve metrics from Flight Recorder and send responses
  • Verify metrics are filtered by time window if specified
  • Test edge cases (empty window, full window)

Test Case 3: Agent Reconnection

  • Start Proxy and Agent
  • Agent registers with Proxy
  • Query /metrics-windows endpoint (triggers metrics collection)
  • Simulate agent disconnection
  • Verify agent marked as offline
  • Query /metrics-windows endpoint again
  • Verify Proxy handles missing agent gracefully
  • Agent reconnects and re-registers
  • Verify agent marked as online
  • Query /metrics-windows endpoint again
  • Verify metrics collection resumes successfully

Test Case 4: Multiple Agents and Roles

  • Start Proxy
  • Start multiple Agents with different roles
  • Verify topology shows all nodes correctly
  • Verify metrics aggregation includes all nodes
  • Test role-based filtering in APIs

E2E Testing

Test Case 1: Full FODC Proxy Workflow

  • Deploy Proxy in test environment
  • Deploy multiple Agents (liaison, datanode-hot, datanode-warm)
  • Agents register with Proxy
  • External Prometheus scraper queries /metrics-windows endpoint
  • Verify Proxy requests metrics from all agents
  • Verify agents retrieve metrics from Flight Recorder and respond
  • Verify metrics include node metadata
  • Verify cluster topology API

Test Case 2: High Availability and Scalability

  • Deploy Proxy with multiple agent connections (100+)
  • Verify Proxy handles concurrent connections
  • Verify on-demand metrics collection performance
  • Verify Proxy does not store metrics (memory efficient)
  • Test agent reconnection under load

Test Case 3: Failure Scenarios

  • Test Proxy behavior when agents disconnect
  • Test Proxy behavior with invalid agent data
  • Test Proxy behavior with network partitions
  • Test Proxy recovery after failures
  • Verify data consistency

Test Case 4: Prometheus Integration

  • Deploy Proxy with Prometheus
  • Configure Prometheus to scrape /metrics and /metrics-windows
  • Verify metrics appear in Prometheus
  • Verify metric labels and metadata
  • Test Prometheus query performance