Profiling
The profiling module profiles the processes discovered by Service Discovery and sends the snapshots to the backend server.
Configuration
Name | Default | Environment Key | Description |
---|---|---|---|
profiling.active | true | ROVER_PROFILING_ACTIVE | Whether process profiling is active. |
profiling.check_interval | 10s | ROVER_PROFILING_CHECK_INTERVAL | The interval for checking profiling tasks. |
profiling.flush_interval | 5s | ROVER_PROFILING_FLUSH_INTERVAL | The interval for combining existing profiling data and reporting it to the backend. |
profiling.task.on_cpu.dump_period | 9ms | ROVER_PROFILING_TASK_ON_CPU_DUMP_PERIOD | The period of the profiling stack dump. |
profiling.task.network.report_interval | 2s | ROVER_PROFILING_TASK_NETWORK_TOPOLOGY_REPORT_INTERVAL | The interval for sending metrics to the backend. |
profiling.task.network.meter_prefix | rover_net_p | ROVER_PROFILING_TASK_NETWORK_TOPOLOGY_METER_PREFIX | The prefix of network profiling metrics name. |
profiling.task.network.protocol_analyze.per_cpu_buffer | 400KB | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_PER_CPU_BUFFER | The size of socket data buffer on each CPU. |
profiling.task.network.protocol_analyze.parallels | 2 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_PARALLELS | The number of parallel protocol analyzers. |
profiling.task.network.protocol_analyze.queue_size | 5000 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_QUEUE_SIZE | The queue size of each parallel analyzer. |
profiling.task.network.protocol_analyze.sampling.http.default_request_encoding | UTF-8 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_SAMPLING_HTTP_DEFAULT_REQUEST_ENCODING | The default body encoding when sampling the request. |
profiling.task.network.protocol_analyze.sampling.http.default_response_encoding | UTF-8 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_SAMPLING_HTTP_DEFAULT_RESPONSE_ENCODING | The default body encoding when sampling the response. |
profiling.continuous.meter_prefix | rover_con_p | ROVER_PROFILING_CONTINUOUS_METER_PREFIX | The continuous related meters prefix name. |
profiling.continuous.fetch_interval | 1s | ROVER_PROFILING_CONTINUOUS_FETCH_INTERVAL | The interval for fetching metrics from the system, such as process CPU, system load, etc. |
profiling.continuous.check_interval | 5s | ROVER_PROFILING_CONTINUOUS_CHECK_INTERVAL | The interval for checking whether metrics have reached their thresholds. |
profiling.continuous.trigger.execute_duration | 10m | ROVER_PROFILING_CONTINUOUS_TRIGGER_EXECUTE_DURATION | The duration of the profiling task. |
profiling.continuous.trigger.silence_duration | 20m | ROVER_PROFILING_CONTINUOUS_TRIGGER_SILENCE_DURATION | The minimum duration between executions of the same profiling task. |
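The dotted option names above map onto a nested configuration file. The sketch below is illustrative only (a subset of the options, assuming a YAML layout; the exact file structure depends on your Rover distribution):

```yaml
profiling:
  active: true
  check_interval: 10s
  flush_interval: 5s
  task:
    on_cpu:
      dump_period: 9ms
    network:
      report_interval: 2s
      meter_prefix: rover_net_p
  continuous:
    fetch_interval: 1s
    check_interval: 5s
    trigger:
      execute_duration: 10m
      silence_duration: 20m
```

Each option can also be overridden by its corresponding environment key, e.g. `ROVER_PROFILING_ACTIVE`.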
Prepare service
Before profiling your service, please make sure the symbol data is still present in the binary file, so the stack symbols can be resolved. This can be checked in the following ways:
- objdump: `objdump --syms path/to/service`
- readelf: `readelf --syms path/to/service`
Profiling Type
All profiling tasks use the official Linux perf events interface, together with kprobe or uprobe, to open a perf event and attach the eBPF program that dumps stacks.
On CPU
The on-CPU profiling task uses the `PERF_COUNT_SW_CPU_CLOCK` perf event to profile the process based on the CPU clock.
Off CPU
The off-CPU profiling task attaches a kprobe to the `finish_task_switch` kernel function to profile the process.
Network
The network profiling task intercepts I/O-related syscalls and attaches uprobes in the process to identify network traffic and generate metrics.
The following protocols are supported for analysis, whether carried over the OpenSSL library, BoringSSL library, GoTLS, NodeTLS, or plaintext:
- HTTP/1.x
- HTTP/2
- MySQL
- CQL (the Cassandra Query Language)
- MongoDB
- Kafka
- DNS
Collecting data
Network profiling sends metrics and logs to the backend service.
Data Type
The network profiling has customized the following two types of metrics to represent the network data:
- Counter: Records the total amount of data in a certain period of time. Each counter contains the following data:
  - Count: The count of the executions.
  - Bytes: The package size of the executions.
  - Exe Time: The consumed time (nanoseconds) of the executions.
- Histogram: Records the distribution of the data in buckets.
- TopN: Records the highest-latency data in a certain period of time.
Labels
Each metric contains the following labels to identify the process relationship:
Name | Type | Description |
---|---|---|
client_process_id or server_process_id | string | The ID of the current process, which is determined by the role of the current process in the connection as server or client. |
client_local or server_local | boolean | Whether the remote process is a local process. |
client_address or server_address | string | The remote process address, e.g. `IP:port`. |
side | enum | The current process is either "client" or "server" in this connection. |
protocol | string | The protocol identified from the package data content. |
is_ssl | bool | Whether the current connection uses SSL. |
Layer-4 Data
Based on the above two data types, the following metrics are provided.
Name | Type | Unit | Description |
---|---|---|---|
write | Counter | nanosecond | The socket write counter |
read | Counter | nanosecond | The socket read counter |
write RTT | Counter | microsecond | The socket write RTT counter |
connect | Counter | nanosecond | The socket connect/accept with other server/client counter |
close | Counter | nanosecond | The socket close counter |
retransmit | Counter | nanosecond | The socket retransmit package counter |
drop | Counter | nanosecond | The socket drop package counter |
write RTT | Histogram | microsecond | The socket write RTT execute time histogram |
write execute time | Histogram | nanosecond | The socket write data execute time histogram |
read execute time | Histogram | nanosecond | The socket read data execute time histogram |
connect execute time | Histogram | nanosecond | The socket connect/accept with other server/client execute time histogram |
close execute time | Histogram | nanosecond | The socket close execute time histogram |
HTTP/1.x Data
Metrics
Name | Type | Unit | Description |
---|---|---|---|
http1_request_cpm | Counter | count | The HTTP request counter |
http1_response_status_cpm | Counter | count | The count of each HTTP response code |
http1_request_package_size | Histogram | Byte size | The request package size |
http1_response_package_size | Histogram | Byte size | The response package size |
http1_client_duration | Histogram | millisecond | The duration of single HTTP response on the client side |
http1_server_duration | Histogram | millisecond | The duration of single HTTP response on the server side |
Logs
Name | Type | Unit | Description |
---|---|---|---|
slow_traces | TopN | millisecond | The Top N slow trace(id)s |
status_4xx | TopN | millisecond | The Top N trace(id)s with response status in 400-499 |
status_5xx | TopN | millisecond | The Top N trace(id)s with response status in 500-599 |
Span Attached Event
Name | Description |
---|---|
HTTP Request Sampling | Complete information about the HTTP request, it’s only reported when it matches slow/4xx/5xx traces. |
HTTP Response Sampling | Complete information about the HTTP response, it’s only reported when it matches slow/4xx/5xx traces. |
Syscall xxx | The network-related syscalls invoked by the process. It's only reported when it matches slow/4xx/5xx traces. |
Continuous Profiling
The continuous profiling feature monitors low-power target process information, including process CPU usage and network requests, based on configuration passed from the backend. When a threshold is met, it automatically initiates a profiling task (on/off CPU, network) to provide a more detailed analysis.
Monitor Type
System Load
Monitors the average system load over the last minute, which is equivalent to the first value of the load average in the `uptime` command.
Process CPU
The target process utilizes a certain percentage of the CPU on the current host.
Process Thread Count
The real-time number of threads in the target process.
Network
Network monitoring uses eBPF technology to collect real-time performance data of the current process responding to requests. Requests sent upstream are not monitored by the system.
Currently, network monitoring supports parsing of the HTTP/1.x protocol and supports the following types of monitoring:
- Error Rate: The percentage of network requests considered erroneous, such as HTTP status codes within the range of [500, 600).
- Avg Response Time: The average response time (ms) for a specified URI.
Metrics
Rover periodically sends the collected monitoring data to the backend using the Native Meter Protocol.
Name | Unit | Description |
---|---|---|
process_cpu | (0-100)% | The CPU usage percentage |
process_thread_count | count | The thread count of the process |
system_load | count | The average system load for the last minute; each process has the same value |
http_error_rate | (0-100)% | The network request error rate percentage |
http_avg_response_time | ms | The network average response duration |