Monitoring LLM Applications with SkyWalking 10.4: Insights into Performance and Cost
The Problem: As Applications “Consume” LLMs, Monitoring Leaves a Blind Spot
With the deep penetration of Generative AI (GenAI) into enterprise workflows, developers face a challenging paradox: while powerful LLM capabilities are easily integrated via Spring AI or OpenAI SDKs, the actual performance and reliability of these calls remain largely invisible.
1. The “Black Box” of Cost and Performance: Is the Expensive Model Worth It?
Facing high LLM bills, organizations often only see a total sum paid to a provider, but cannot calculate the “ROI” within the application.
- Blind Upgrades: You might switch to a premium flagship model for a better experience. But in your specific business scenario, does paying several times more per token actually yield lower latency or a faster TTFT (Time to First Token)?
- Lack of Real-World Benchmarks: Official benchmarks mean little without your real-world business requests. You need to know which model achieves the perfect balance between “Token/Cost Consumption” and “Response Speed” under your actual prompt lengths and concurrency levels.
2. The Vanishing “Golden Timeout”
Many teams set timeouts for LLM calls arbitrarily (e.g., 30s or 60s).
- Too Short: During peak periods or long-text generation, requests are frequently interrupted, causing business failure rates to soar.
- Too Long: If a provider hangs, requests pile up in memory, blocking execution threads and potentially leading to the collapse of the entire Java application or microservice cluster. Only by mastering the P99/P95 Latency can you set rational timeout policies based on data rather than intuition.
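The "golden timeout" above can be derived from data rather than intuition. Below is a minimal sketch, assuming latency samples (in milliseconds) have been exported from your traces; the nearest-rank percentile method and the 1.5x safety margin are illustrative choices, not SkyWalking defaults:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical latency samples (ms) collected from LLM call traces.
latencies = [1200, 1500, 1800, 2100, 2500, 3200, 4100, 5600, 9800, 14000]

p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

# One common heuristic: timeout = P99 plus a safety margin,
# so that only genuine outliers are interrupted.
timeout_ms = p99 * 1.5
```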
3. The Overlooked Experience Killer: TTFT
In GenAI scenarios, a user’s perception of speed depends less on the total duration of the conversation and more on “when the first word appears.”
- A streaming response with a 10s total duration but a 500ms TTFT feels instantaneous.
- A non-streaming response with a 5s total duration but a 4s TTFT feels “frozen.”
If your observability system only tracks total latency, you miss the core UX metric that explains why users complain about “AI slowness.”
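To make the TTFT concept concrete, here is a minimal client-side sketch, assuming the streaming response is exposed as a plain Python iterable of tokens; the `measure_ttft` helper is hypothetical, not a SkyWalking or provider API:

```python
import time

def measure_ttft(stream):
    """Measure Time to First Token (TTFT) and total duration of a
    streaming response. `stream` is any iterable yielding tokens."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            # The moment the first token arrives is what the user feels.
            ttft = time.perf_counter() - start
        tokens.append(token)
    total = time.perf_counter() - start
    return ttft, total, "".join(tokens)
```

In production, the SkyWalking agent records this measurement for you; the point of the sketch is only that TTFT and total duration are two distinct timers over the same stream.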
SkyWalking 10.4: A “Digital Dashboard” From the Application Perspective
The Virtual GenAI capability introduced in Apache SkyWalking 10.4 fills this “observability vacuum.” It avoids reliance on external gateways by using application-side probes (like the Java Agent) to collect the most authentic data from the client’s perspective.
- Precise Latency Distribution: Multi-dimensional metrics (P50, P90, P99) help visualize LLM fluctuations to inform dynamic timeout strategies.
- Core UX Metric — TTFT Monitoring: Native support for first-token latency in streaming calls.
- Multi-dimensional Model Profiling: Aligns token usage, estimated cost, and performance across Providers and Models, helping you choose the most cost-effective solution for your specific needs.
Virtual GenAI Observability
Virtual GenAI represents Generative AI service nodes detected by probe plugins. All performance metrics are based on the GenAI Client Perspective.
For instance, the Spring AI plugin in the Java Agent detects the response latency of a Chat Completion request. SkyWalking then visualizes these in the dashboard:
- Traffic & Success Rate (CPM & SLA)
- Latency & TTFT
- Token Usage (Input/Output)
- Estimated Cost
Screenshots:

How It Works
When the SkyWalking Java Agent or OTLP probes intercept calls to mainstream AI frameworks (e.g., Spring AI, OpenAI SDK), they report Trace data to the SkyWalking OAP. The OAP aggregates and computes this data to generate performance metrics for both Providers and Models, which are then rendered in the built-in Virtual-GenAI dashboards.
Installation & Configuration
Requirements
- SkyWalking Java Agent: >= 9.7
- SkyWalking OAP: >= 10.4
Semantic Conventions & Compatibility
SkyWalking Virtual GenAI follows OpenTelemetry GenAI Semantic Conventions. OAP identifies GenAI-related Spans based on:
SkyWalking Java Agent
- Spans must be of type Exit, have the SpanLayer attribute set to GENAI, and contain the gen_ai.response.model tag.
OTLP / Zipkin Probes
- Spans must contain the `gen_ai.response.model` tag.
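The identification rules above can be summarized as a simple predicate. The sketch below uses a plain dict as a stand-in for a reported span; the `source`, `type`, `layer`, and `tags` keys are an illustrative representation, not the actual OAP data model:

```python
def is_genai_span(span):
    """Decide whether a span qualifies as a GenAI call, mirroring the
    rules above. `span` is a plain dict standing in for a reported span."""
    tags = span.get("tags", {})
    if span.get("source") == "skywalking-java-agent":
        # Java Agent spans: Exit type + GENAI layer + model tag required.
        return (
            span.get("type") == "Exit"
            and span.get("layer") == "GENAI"
            and "gen_ai.response.model" in tags
        )
    # OTLP / Zipkin probes: the model tag alone is sufficient.
    return "gen_ai.response.model" in tags
```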
For details, refer to the E2E configurations:
GenAI Estimated Cost Configuration
Overview
SkyWalking provides a built-in GenAI Billing Configuration File.
This file defines how SkyWalking maps model names from Trace data to their corresponding providers and estimates the token cost for each LLM call. The estimated cost is displayed in the SkyWalking UI alongside trace and metric data, helping users intuitively understand the financial impact of their GenAI usage.
Important: The pricing in this file is intended for cost estimation only and must not be treated as actual billing or invoice amounts. Users are advised to regularly verify the latest rates on the providers’ official pricing pages.
Configuration Structure
Top-level Fields
| Field | Type | Description |
|---|---|---|
| `last-updated` | date | The last update date of the pricing data. All prices are based on public billing standards announced by providers prior to this date. |
| `providers` | list | List of GenAI provider definitions. Each entry contains matching rules and specific model pricing information. |
Provider Definition
Each entry under providers defines a GenAI provider:
```yaml
providers:
  - provider: <provider-name>
    prefix-match:
      - <prefix-1>
      - <prefix-2>
    models:
      - name: <model-name>
        aliases: [<alias-1>, <alias-2>]
        input-estimated-cost-per-m: <cost>
        output-estimated-cost-per-m: <cost>
```
| Field | Type | Required | Description |
|---|---|---|---|
| `provider` | string | Yes | The provider identifier (e.g., `openai`, `anthropic`, `gemini`). It is displayed as the Virtual GenAI service name in SkyWalking. |
| `prefix-match` | list[string] | Yes | A list of prefixes used to match model names to this provider. If a model name in the Trace data starts with any of these prefixes, it is mapped to this provider. |
| `models` | list[model] | No | A list of model definitions containing pricing information. If omitted, the system can still identify the provider but will not perform cost estimation. |
Model Definition
Each entry under models defines the pricing for a specific model:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | The standard model name used for matching. |
| `aliases` | list[string] | No | Alternative names that should resolve to the same billing entry. This is useful when providers use different naming conventions (see the “Model Aliases” section). |
| `input-estimated-cost-per-m` | float | No | Estimated cost per 1,000,000 (one million) input (Prompt) tokens. The default unit is USD. |
| `output-estimated-cost-per-m` | float | No | Estimated cost per 1,000,000 (one million) output (Completion) tokens. The default unit is USD. |
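The per-million-token fields translate into a straightforward cost formula. A minimal sketch, using hypothetical token counts and rates:

```python
def estimated_cost(input_tokens, output_tokens,
                   input_cost_per_m, output_cost_per_m):
    """Estimate the cost (USD) of a single LLM call from the
    per-million-token pricing fields defined above."""
    return (input_tokens / 1_000_000 * input_cost_per_m
            + output_tokens / 1_000_000 * output_cost_per_m)

# Example: 12,000 prompt tokens and 800 completion tokens at
# $2.5 / $10.0 per million tokens (illustrative rates).
cost = estimated_cost(12_000, 800, 2.5, 10.0)
# 12,000/1M * 2.5 = 0.03; 800/1M * 10.0 = 0.008 → 0.038 USD
```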
Model Matching Mechanism
Provider-Level Prefix Matching
When SkyWalking receives a Trace containing a GenAI call, it determines the Provider based on the following priority order:
1. `gen_ai.provider.name` tag: This tag is checked first. It follows the latest OpenTelemetry GenAI semantic conventions.
2. `gen_ai.system` tag: If the above tag is missing, the system falls back to this legacy tag. Note: this tag is only parsed when processing OTLP or Zipkin format data, primarily for compatibility with older libraries such as the Python auto-instrumentation.
3. Prefix matching: If neither of the above tags exists, SkyWalking reads the `prefix-match` rules defined in `gen-ai-config.yml` and attempts to identify the provider by matching the model name.
```yaml
- provider: openai
  prefix-match:
    - gpt
```
Any model name starting with gpt (such as gpt-4o, gpt-4.1-mini, or gpt-5-nano) will be mapped to the openai provider. A single provider can have multiple prefixes:
```yaml
- provider: tencent
  prefix-match:
    - hunyuan
    - Tencent
```
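The prefix rules above amount to a first-match lookup over per-provider prefix lists. A minimal sketch, with the rules table hard-coded for illustration:

```python
# Illustrative subset of the prefix-match rules shown above.
PROVIDER_PREFIXES = {
    "openai": ["gpt"],
    "tencent": ["hunyuan", "Tencent"],
}

def match_provider(model_name):
    """Map a model name to a provider via prefix matching.
    Returns None when no prefix matches."""
    for provider, prefixes in PROVIDER_PREFIXES.items():
        if any(model_name.startswith(p) for p in prefixes):
            return provider
    return None
```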
Model-Level Longest-Prefix Matching
Once the provider is determined, SkyWalking uses a Trie-based longest-prefix matching algorithm to find the best billing entry. This is crucial because model names returned in provider API responses often include version numbers or timestamps, differing from the base model name in the config. Example OpenAI config:
```yaml
models:
  - name: gpt-4o
    input-estimated-cost-per-m: 2.5
    output-estimated-cost-per-m: 10.0
  - name: gpt-4o-mini
    input-estimated-cost-per-m: 0.15
    output-estimated-cost-per-m: 0.6
```
Matching behavior:
| Model Name in Trace | Matched Configuration Entry | Reason |
|---|---|---|
| `gpt-4o` | `gpt-4o` | Exact match |
| `gpt-4o-2024-08-06` | `gpt-4o` | Longest prefix is `gpt-4o` |
| `gpt-4o-mini` | `gpt-4o-mini` | Exact match (the longer prefix `gpt-4o-mini` takes priority over `gpt-4o`) |
| `gpt-4o-mini-2024-07-18` | `gpt-4o-mini` | Longest prefix is `gpt-4o-mini` |
This mechanism ensures versioned API model names map to the correct pricing tier without requiring exact full names in the configuration file.
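The longest-prefix rule can be sketched as follows. SkyWalking's OAP uses a Trie for efficiency; the linear scan below is an equivalent (if slower) illustration over the same example entries:

```python
def match_pricing_entry(model_name, entries):
    """Longest-prefix match of a trace model name against configured
    pricing entries (a linear sketch of the Trie-based lookup)."""
    best = None
    for entry in entries:
        if model_name.startswith(entry["name"]):
            # Prefer the longest matching configured name.
            if best is None or len(entry["name"]) > len(best["name"]):
                best = entry
    return best

# The OpenAI entries from the example config above.
ENTRIES = [
    {"name": "gpt-4o", "input": 2.5, "output": 10.0},
    {"name": "gpt-4o-mini", "input": 0.15, "output": 0.6},
]
```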
Model Aliases
Some providers use different naming conventions across API responses and documentation. For example, Anthropic’s model might appear as claude-4-sonnet or claude-sonnet-4. The aliases field supports both formats under a single billing entry:
```yaml
- name: claude-4-sonnet
  aliases: [claude-sonnet-4]
  input-estimated-cost-per-m: 3.0
  output-estimated-cost-per-m: 15.0
```
Under this configuration, claude-4-sonnet and claude-sonnet-4 (as well as any versioned variants, such as claude-sonnet-4-20250514) will resolve to the same billing entry.
Note: Aliases also participate in longest prefix matching. Therefore, claude-sonnet-4-20250514 will match the alias claude-sonnet-4, which in turn resolves to the pricing information for claude-4-sonnet.
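Combining aliases with longest-prefix matching gives the resolution order described in the note above. A minimal sketch, with the alias and pricing maps hard-coded from the Anthropic example (the dict-based representation is illustrative, not the OAP's internal structure):

```python
# alias → canonical name, from the example config above.
ALIASES = {"claude-sonnet-4": "claude-4-sonnet"}
PRICING = {"claude-4-sonnet": {"input": 3.0, "output": 15.0}}

def resolve_pricing(model_name):
    """Resolve a (possibly versioned) model name to its billing entry:
    longest-prefix match over canonical names and aliases, then map any
    alias hit back to the canonical entry."""
    candidates = list(PRICING) + list(ALIASES)
    best = max(
        (c for c in candidates if model_name.startswith(c)),
        key=len,
        default=None,
    )
    if best is None:
        return None
    return PRICING[ALIASES.get(best, best)]
```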
Custom Configuration
Adding a New Provider
To add a provider that is not included in the default configuration:
```yaml
providers:
  # ... existing providers ...
  - provider: ollama
    prefix-match:
      - mymodel
    models:
      - name: mymodel-large
        input-estimated-cost-per-m: 1.0
        output-estimated-cost-per-m: 5.0
      - name: mymodel-small
        input-estimated-cost-per-m: 0.1
        output-estimated-cost-per-m: 0.5
```
For OTLP/Zipkin data, a dedicated estimated-cost tag has been added, so you can view the cost of each GenAI call directly in the UI.

Main Metrics
1. Provider Level
| Metric ID | Description | Meaning |
|---|---|---|
| `gen_ai_provider_cpm` | Calls Per Minute | Requests per minute (throughput) |
| `gen_ai_provider_sla` | Success Rate | Request success rate |
| `gen_ai_provider_resp_time` | Avg Response Time | Average response time |
| `gen_ai_provider_latency_percentile` | Latency Percentiles | Response time percentiles (P50, P75, P90, P95, P99) |
| `gen_ai_provider_input_tokens_sum/avg` | Input Token Usage | Total and average input token usage |
| `gen_ai_provider_output_tokens_sum/avg` | Output Token Usage | Total and average output token usage |
| `gen_ai_provider_total_estimated_cost/avg` | Estimated Cost | Total estimated cost and average cost per call |
2. Model Level
| Metric ID | Description | Meaning |
|---|---|---|
| `gen_ai_model_call_cpm` | Calls Per Minute | Requests per minute for this specific model |
| `gen_ai_model_sla` | Success Rate | Model-specific request success rate |
| `gen_ai_model_latency_avg/percentile` | Latency | Average and percentiles of model response duration |
| `gen_ai_model_ttft_avg/percentile` | TTFT | Time to First Token (streaming only) |
| `gen_ai_model_input_tokens_sum/avg` | Input Token Usage | Detailed input token consumption for the model |
| `gen_ai_model_output_tokens_sum/avg` | Output Token Usage | Detailed output token consumption for the model |
| `gen_ai_model_total_estimated_cost/avg` | Estimated Cost | Estimated total cost and average cost for the model |
Recommended Usage Scenarios
- Performance Evaluation: Use Latency and Time to First Token (TTFT) metrics to analyze model inference efficiency and the end-user interaction experience.
- Token Monitoring: Real-time monitoring of Input and Output token consumption to analyze resource utilization across different business scenarios.
- Cost Alerting: Set alert thresholds based on Estimated Cost or token consumption to promptly detect abnormal calls and prevent budget overruns.
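As a toy illustration of the cost-alerting scenario, the sketch below flags time buckets whose estimated spend exceeds a budget threshold; in practice SkyWalking evaluates such conditions server-side through its alarm rules, so the function and data here are purely hypothetical:

```python
def check_cost_alert(hourly_costs, threshold_usd):
    """Return the (hour, cost) pairs whose estimated GenAI spend
    exceeds the budget threshold."""
    return [(hour, cost) for hour, cost in hourly_costs if cost > threshold_usd]

# Hypothetical hourly estimated-cost series (USD).
breaches = check_cost_alert(
    [("09:00", 1.2), ("10:00", 7.8), ("11:00", 0.9)],
    threshold_usd=5.0,
)
```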