Kubernetes Network Monitoring

SkyWalking leverages the network profiling feature of SkyWalking Rover to measure network performance for particular pods on demand, including metrics for L4 (TCP) and L7 (HTTP) traffic and the raw data of HTTP requests and responses. Under the hood, SkyWalking Rover uses eBPF technology to convert socket data into metrics.

Data flow

  1. The SkyWalking OAP server identifies which specific K8s pods need their network monitored.
  2. SkyWalking Rover receives tasks from the SkyWalking OAP server, executes them, converts the network data into metrics, and sends the metrics to the backend service.
  3. The SkyWalking OAP server accesses the K8s API server to fetch meta info and parses the expressions with MAL to aggregate the metrics.

Setup

  1. Set up SkyWalking Rover.
  2. Enable the network profiling MAL file in the OAP server.
agent-analyzer:
  selector: ${SW_AGENT_ANALYZER:default}
  default:
    meterAnalyzerActiveFiles: ${SW_METER_ANALYZER_ACTIVE_FILES:network-profiling}
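
The values in the snippet above use the ${NAME:default} placeholder form: an environment variable name followed by a fallback value. As a rough illustration of that resolution rule, and not the OAP server's actual loader, a minimal Python sketch might look like this. The resolve helper and its env parameter are hypothetical names introduced only for this example.

```python
import os
import re

# Matches placeholders of the form ${NAME:default}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve(value, env=None):
    """Replace each ${NAME:default} with env[NAME], or with the default if NAME is unset.

    `env` is a hypothetical parameter for testing; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


if __name__ == "__main__":
    # Resolves against the real environment; falls back to the default if the variable is unset.
    print(resolve("${SW_METER_ANALYZER_ACTIVE_FILES:network-profiling}"))
```

Under this reading, leaving SW_METER_ANALYZER_ACTIVE_FILES unset is enough to activate the network-profiling file, and exporting the variable overrides the default.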

Sampling config

Note the precondition: the HTTP request must carry a trace header in either SkyWalking (sw8 header) or Zipkin (b3 header(s)) format.
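
To make the precondition concrete, the following Python sketch checks whether a request carries one of the expected trace headers. It is an illustration only, not SkyWalking Rover's implementation. It assumes header names are compared case-insensitively and that the Zipkin multi-header form is recognized by the presence of both X-B3-TraceId and X-B3-SpanId; the function name has_trace_header is hypothetical.

```python
def has_trace_header(headers):
    """Return True if the headers include a SkyWalking (sw8) or Zipkin (b3) trace context.

    `headers` is a mapping of header name to value. Names are compared
    case-insensitively, which is an assumption of this sketch.
    """
    names = {name.lower() for name in headers}
    if "sw8" in names:
        return True  # SkyWalking cross-process propagation header
    if "b3" in names:
        return True  # Zipkin single-header form
    # Zipkin multi-header form: both trace ID and span ID must be present.
    return {"x-b3-traceid", "x-b3-spanid"} <= names
```

A request without any of these headers would not be eligible for sampling under this precondition, regardless of the rules configured.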

The sampling configuration defines the sampling boundaries for HTTP traffic. When an HTTP call is sampled, SkyWalking Rover can collect the raw data of the HTTP request/response and upload it as a span attached event.

The sampling config contains multiple rules, and each rule has the following settings:

  1. URI Regex: The pattern matched against the URI of each HTTP request. All requests match if the URI regex is not set.
  2. Minimal Request Duration (ms): Sample the HTTP requests whose latency exceeds this threshold.
  3. Sample HTTP requests and responses with tracing when the response code is between 400 and 499: This is OFF by default.
  4. Sample HTTP requests and responses with tracing when the response code is between 500 and 599: This is ON by default.
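
The four settings above can be read as a small decision procedure. The Python sketch below is one possible interpretation, not SkyWalking Rover's code. The SamplingRule class, its field names, and the sample_reasons function are hypothetical. It also assumes that the URI regex is applied as a search rather than a full match, and that a request is sampled when any enabled condition holds.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SamplingRule:
    # Hypothetical field names mirroring the four settings described above.
    uri_regex: Optional[str] = None        # unset: match all requests
    min_duration_ms: Optional[int] = None  # unset: no slow-request sampling
    when_4xx: bool = False                 # OFF by default
    when_5xx: bool = True                  # ON by default


def sample_reasons(rule: SamplingRule, uri: str, duration_ms: float, status: int) -> List[str]:
    """Return the reasons a request would be sampled under `rule` (empty list: not sampled)."""
    if rule.uri_regex is not None and re.search(rule.uri_regex, uri) is None:
        return []  # the rule does not apply to this URI
    reasons = []
    if rule.min_duration_ms is not None and duration_ms > rule.min_duration_ms:
        reasons.append("slow")
    if rule.when_4xx and 400 <= status <= 499:
        reasons.append("4xx")
    if rule.when_5xx and 500 <= status <= 599:
        reasons.append("5xx")
    return reasons
```

For example, with a rule restricted to URIs matching ^/api/ and a 500 ms threshold, an 800 ms request to /api/users returning 200 would be sampled as slow, while the same request to /health would not be considered by that rule at all.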

Supported metrics

After the SkyWalking OAP server receives the metrics from SkyWalking Rover, it can analyze the following data:

  1. Topology: Based on the process and peer address, the following topology data is supported:
    1. Relation: The relationships between local processes, or between a local process and external pods or services.
    2. SSL: Whether the socket reads or writes packets over SSL.
    3. Protocol: The protocols used to read or write data.
  2. TCP socket read and write metrics, including the following types:
    1. Call Per Minute: The number of socket reads or writes.
    2. Bytes: The packet size of the socket data.
    3. Execute Time: The execution time of the socket reads or writes.
    4. Connect: The count and execution time of socket connect/accept operations with the peer address.
    5. Close: The count and execution time of socket close operations.
    6. RTT: The round-trip time (RTT) of socket communication with the peer address.
  3. Exception data for communication between the local process and the peer address, including the following types:
    1. Retransmit: The number of retransmitted TCP packets.
    2. Drop: The number of dropped TCP packets.
  4. HTTP/1.x request/response related metrics, including the following types:
    1. Request CPM: The number of requests per minute.
    2. Response CPM: The number of responses per minute, with status code.
    3. Request Package Size: The size (KB) of the request package.
    4. Response Package Size: The size (KB) of the response package.
    5. Client Side Response Duration: The time (ms) the client takes to receive the response.
    6. Server Side Response Duration: The time (ms) the server takes to send the response.
  5. Sampled HTTP requests with traces, including the following types:
    1. Slow traces: Traces with a long duration.
    2. Traces from HTTP Code in [400, 500) (ms): Traces whose response status code is in [400, 500).
    3. Traces from HTTP Code in [500, 600) (ms): Traces whose response status code is in [500, 600).
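
As an illustration of what a calls-per-minute metric such as Response CPM measures, the Python sketch below counts responses per minute and status code. It is not how the OAP server or MAL computes the metric. The response_cpm function and the event format (an epoch timestamp in seconds paired with a status code) are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def response_cpm(events: Iterable[Tuple[float, int]]) -> Dict[Tuple[int, int], int]:
    """Count responses per (minute bucket, status code).

    Each event is (timestamp_in_seconds, status_code). The minute bucket is
    the timestamp divided by 60 and rounded down.
    """
    counts = Counter()
    for timestamp, status in events:
        minute = int(timestamp // 60)
        counts[(minute, status)] += 1
    return dict(counts)
```

Each key in the result identifies one minute and one status code, so a dashboard could plot one series per status code over time.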