Profiling is an on-demand diagnostic method for locating service bottlenecks. The following typical scenarios are well suited to profiling with various profiling tools:
1. Some methods slow down the API performance.
2. Too many threads and/or high-frequency I/O per OS process reduce CPU efficiency.
3. A massive number of RPC requests congests the network and slows down responses.
4. Unexpected network requests are caused by security issues or bugs in the code.
In the SkyWalking landscape, we provide three ways to support profiling at a reasonable resource cost:
- In-process profiling is bundled with the auto-instrument agents.
- Out-of-process profiling is powered by the eBPF agent.
- Continuous profiling is powered by the eBPF agent.
In-process profiling is primarily provided by auto-instrument agents in VM-based runtimes. This feature resolves issue <1> by periodically capturing snapshots of the thread stacks. The OAP aggregates the thread stacks per RPC request and, based on the continuous snapshots, provides a hierarchy graph that indicates the slow methods.
The snapshot period is usually 10 to 100 milliseconds. A shorter period is not recommended, because each capture usually causes a classic stop-the-world pause in the VM, which impacts the performance of the whole process.
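The aggregation idea can be illustrated with a minimal sketch. It is not the OAP implementation: the `StackNode` structure, the `aggregate` and `render` helpers, the method names, and the 10 ms period are all assumptions made for illustration. Each snapshot that contains a frame is treated as roughly one period of time spent in that method.

```python
from dataclasses import dataclass, field

SNAPSHOT_PERIOD_MS = 10  # assumed period, at the low end of the 10-100 ms range


@dataclass
class StackNode:
    """One method in the aggregated call tree (hypothetical structure)."""
    method: str
    samples: int = 0
    children: dict = field(default_factory=dict)

    def estimated_ms(self) -> int:
        # Every snapshot containing this frame counts as about one period.
        return self.samples * SNAPSHOT_PERIOD_MS


def aggregate(snapshots):
    """Merge stack snapshots (outermost frame first) into a single tree."""
    root = StackNode("<root>")
    for stack in snapshots:
        root.samples += 1
        node = root
        for method in stack:
            node = node.children.setdefault(method, StackNode(method))
            node.samples += 1
    return root


def render(node, depth=0):
    """Return the tree as indented lines, most-sampled children first."""
    lines = [f"{'  ' * depth}{node.method}: ~{node.estimated_ms()} ms "
             f"({node.samples} samples)"]
    for child in sorted(node.children.values(), key=lambda n: -n.samples):
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical snapshots captured for one RPC request.
snapshots = [
    ["handle_request", "load_user", "query_db"],
    ["handle_request", "load_user", "query_db"],
    ["handle_request", "load_user", "query_db"],
    ["handle_request", "render_page"],
]
tree = aggregate(snapshots)
print("\n".join(render(tree)))
```

In this made-up data, `query_db` appears in three of four snapshots, so it surfaces as the likely slow method under `load_user`.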
Learn more tech details from the post, Use Profiling to Fix the Blind Spot of Distributed Tracing.
For now, Java and Python agents support this.
Out-of-process profiling leverages eBPF, a technology with origins in the Linux kernel that provides a way to extend the capabilities of the kernel safely and efficiently.
On-CPU profiling is suitable for analyzing thread stacks when service CPU usage is high.
The more often a thread stack is dumped, the more CPU resources that thread stack occupies.
This is similar to in-process profiling in resolving issue <1>, but it runs out-of-process and is based on Linux eBPF. It is also intended for languages without a VM mechanism, such as C/C++ and Rust, which are therefore not supported by in-process agents. Golang is a special case: it exposes the metadata of the VM for eBPF, so it can be profiled.
Off-CPU profiling is suitable for performance issues that are not caused by high CPU usage but may still occur under high CPU load. This profiling aims to resolve issue <2>.
- When there are too many threads in one service, using off-CPU profiling could reveal which threads spend more time context switching.
- Code that relies heavily on disk I/O or remote service performance can slow down the whole process.
Off-CPU profiling provides two perspectives:
- Thread switch count: The number of times a thread switches context. When the thread returns to the CPU, it completes one context switch. A thread stack with a higher switch count spends more time context switching.
- Thread switch duration: The time it takes a thread to switch the context. A thread stack with a higher switch duration spends more time off-CPU.
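The two perspectives can rank the same stacks differently. The following minimal sketch assumes a hypothetical event format of `(stack, switched_out_ns, switched_in_ns)`. The `aggregate_off_cpu` function and the sample data are illustrative only and do not reflect how the eBPF agent reports data.

```python
from collections import defaultdict


def aggregate_off_cpu(events):
    """Sum switch count and off-CPU time per thread stack.

    events: iterable of (stack, switched_out_ns, switched_in_ns), a
    hypothetical format where the thread leaves the CPU at switched_out_ns
    and returns at switched_in_ns.
    """
    stats = defaultdict(lambda: {"switch_count": 0, "switch_duration_ns": 0})
    for stack, out_ns, in_ns in events:
        entry = stats[tuple(stack)]
        entry["switch_count"] += 1                  # one completed switch
        entry["switch_duration_ns"] += in_ns - out_ns
    return dict(stats)


# Made-up events: frequent short waits on disk, one long wait on a remote call.
events = [
    (["main", "read_file"], 1_000, 6_000),
    (["main", "read_file"], 8_000, 9_000),
    (["main", "call_remote"], 2_000, 52_000),
]
stats = aggregate_off_cpu(events)

top_by_count = max(stats, key=lambda s: stats[s]["switch_count"])
top_by_duration = max(stats, key=lambda s: stats[s]["switch_duration_ns"])
print("most switches:   ", " > ".join(top_by_count))
print("longest off-CPU: ", " > ".join(top_by_duration))
```

Here the stack ending in `read_file` leads by switch count, while the stack ending in `call_remote` leads by switch duration, which is why both views are worth checking.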
Learn more technical details about On-CPU and Off-CPU profiling from the post, Pinpoint Service Mesh Critical Performance Impact by using eBPF.
Network profiling captures network packets to analyze traffic at L4 (TCP) and L7 (HTTP) and to recognize network traffic from a specific process or a k8s pod. This traffic analysis helps locate the root causes of issues <3> and <4>.
Network profiling provides:
- Network topology with process identification.
- TCP traffic metrics with TLS status.
- HTTP traffic metrics.
- Sampled raw HTTP request/response data within the tracing context.
- Time costs of local I/O on the OS, such as the time Linux spends processing an HTTP request/response.
Learn more tech details from the post, Diagnose Service Mesh Network Performance with eBPF
Continuous profiling monitors the system, processes, and network, and automatically initiates profiling tasks when conditions meet the configured thresholds and time windows.
Continuous profiling periodically collects the following types of performance metrics for processes and systems:
- System Load: Monitor current system load value.
- Process CPU: Monitor process CPU usage percent, value in [0-100].
- Process Thread Count: Monitor process thread count.
- HTTP Error Rate: Monitor the percentage of the process's HTTP (/1.x) responses that are errors (response status >= 500), value in [0-100].
- HTTP Avg Response Time: Monitor the process's HTTP (/1.x) response duration (ms).
When the collected metric data matches the configured threshold, the following types of profiling tasks can be triggered:
- On CPU Profiling: Perform eBPF On CPU Profiling on processes that meet the threshold.
- Off CPU Profiling: Perform eBPF Off CPU Profiling on processes that meet the threshold.
- Network Profiling: Perform eBPF Network Profiling on all processes within the same instance as the processes that meet the threshold.
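The threshold-and-window check can be sketched as follows. This is a minimal illustration under stated assumptions, not SkyWalking's policy format: the `ThresholdRule` class, its `threshold`, `window_size`, and `min_matches` parameters, and the `http_error_rate` helper are hypothetical names. The rule fires once enough of the most recent samples reach the threshold.

```python
from collections import deque


class ThresholdRule:
    """Fire when at least min_matches of the last window_size samples
    reach the threshold (hypothetical rule shape)."""

    def __init__(self, threshold, window_size, min_matches):
        self.threshold = threshold
        self.min_matches = min_matches
        # deque with maxlen drops the oldest sample automatically.
        self.window = deque(maxlen=window_size)

    def observe(self, value):
        """Record one sample and return True if a task should be triggered."""
        self.window.append(value >= self.threshold)
        return sum(self.window) >= self.min_matches


def http_error_rate(status_codes):
    """Percentage of responses with status >= 500, in [0, 100]."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return 100.0 * errors / len(status_codes)


# Made-up process CPU samples (percent), one per collection period.
rule = ThresholdRule(threshold=75.0, window_size=5, min_matches=3)
cpu_samples = [40.0, 80.0, 90.0, 60.0, 85.0, 30.0]
decisions = [rule.observe(value) for value in cpu_samples]
print(decisions)
print(http_error_rate([200, 200, 500, 503]))
```

Requiring several matches within a window, rather than a single spike, is one way to avoid starting a profiling task on a momentary fluctuation.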