Events

SkyWalking already supports the three pillars of observability: logs, metrics, and traces. In practice, a production system also experiences many other events that may affect its performance, such as upgrades, reboots, and chaos testing. Although some of these events are reflected in the logs, many others are not. Hence, SkyWalking provides a more native way to collect these events. This doc details how SkyWalking collects events and what events look like in SkyWalking.

How to Report Events

The SkyWalking backend supports three protocols to collect events: gRPC, HTTP, and Kafka. Any agent or CLI that implements one of these protocols can report events to SkyWalking. Currently, the officially supported clients to report events are:

  • Java Agent Toolkit: Use the Java agent toolkit to report events from within applications.
  • SkyWalking CLI: Use the CLI to report events from the command line.
  • Kubernetes Event Exporter: Deploy an event exporter to refine and report Kubernetes events.

Event Definitions

An event contains the following fields. The event definition can be found in the protocol repo.

UUID

Unique ID of the event. Since an event may span a long period of time, the UUID is necessary to associate the start time with the end time of the same event.

Source

The source object on which the event occurs. In SkyWalking, the object is typically a service, service instance, etc.

Name

Name of the event. For example, Start, Stop, Crash, Reboot, Upgrade, etc.

Type

Type of the event. This field is intended for UI visualization: events of type Normal are considered normal operations, while events of type Error are considered unexpected operations, such as Crash events. Marking them with different colors makes them easier to identify.

Message

The detail of the event, describing why it happened. This should be a one-line message that briefly describes why the event is reported. For example, the message of an Upgrade event might be Upgrade from ${from_version} to ${to_version}. It’s NOT recommended to include detailed logs of the event, such as an exception stack trace.

Parameters

The parameters in the message field. This is a simple <string,string> map.
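As an illustration of how the Message and Parameters fields relate, the following minimal sketch fills the placeholders of a message from a parameters map. It assumes the placeholders are substituted by plain string templating; where and whether SkyWalking performs this substitution is not specified here, and render_message and the version numbers are hypothetical.

```python
from string import Template

# Illustrative values only: a one-line message template and the
# <string,string> parameters map that supplies its placeholders.
message = "Upgrade from ${from_version} to ${to_version}"
parameters = {"from_version": "8.4.0", "to_version": "8.5.0"}


def render_message(message: str, parameters: dict[str, str]) -> str:
    """Hypothetical helper: fill ${...} placeholders from the parameters map."""
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(message).safe_substitute(parameters)


rendered = render_message(message, parameters)
print(rendered)  # Upgrade from 8.4.0 to 8.5.0
```

Keeping the variable parts in the parameters map keeps the message itself short and uniform across events of the same kind.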

Start Time

The start time of the event. This field is mandatory when an event occurs.

End Time

The end time of the event. This field may be empty if the event has not ended yet; otherwise, it should be a valid timestamp after startTime.

NOTE: When reporting an event, you typically call the report function twice, both times with the same UUID: first when the event starts and again when it ends. There are also cases where you already have both the start time and the end time. For example, when exporting events from a third-party system, both times are already known, so you may simply call the report function once.
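The two reporting patterns in the note can be sketched as follows. This is not the SkyWalking client API: the Event class, the report function, and the in-memory received store are hypothetical stand-ins that only show how two reports sharing a UUID describe one event.

```python
import time
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass
class Event:
    """Hypothetical event record mirroring the fields described above."""
    uuid: str
    source: dict
    name: str
    event_type: str = "Normal"
    message: str = ""
    parameters: dict = field(default_factory=dict)
    start_time: Optional[int] = None  # epoch milliseconds
    end_time: Optional[int] = None    # empty until the event has ended


# Hypothetical in-memory stand-in for the receiving side.
received: dict[str, Event] = {}


def report(event: Event) -> None:
    """Hypothetical report function: merges reports that share a UUID."""
    known = received.get(event.uuid)
    if known is None:
        received[event.uuid] = event
        return
    if event.start_time is not None:
        known.start_time = event.start_time
    if event.end_time is not None:
        known.end_time = event.end_time


def now_ms() -> int:
    return int(time.time() * 1000)


source = {"service": "A", "instance": "a"}

# Pattern 1: report twice with the same UUID, at the start and at the end.
event_id = str(uuid4())
report(Event(uuid=event_id, source=source, name="Upgrade", start_time=now_ms()))
# ... the upgrade runs here ...
report(Event(uuid=event_id, source=source, name="Upgrade", end_time=now_ms()))

# Pattern 2: both times are already known, so a single report is enough.
report(Event(uuid=str(uuid4()), source=source, name="Reboot",
             start_time=1_700_000_000_000, end_time=1_700_000_060_000))
```

The UUID is the only link between the first and second call, which is why it must be generated once and reused.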

How to Configure Alarms for Events

Events are converted to metrics and can serve as the source to trigger alarms. For example, if a specific event occurs a certain number of times within a period, an alarm can be triggered and sent.

Every event has a default value of 1. When n events with the same name are reported, they are aggregated into value = n, as follows.

Event{name=Unhealthy, source={service=A,instance=a}, ...}
Event{name=Unhealthy, source={service=A,instance=a}, ...}
Event{name=Unhealthy, source={service=A,instance=a}, ...}
Event{name=Unhealthy, source={service=A,instance=a}, ...}
Event{name=Unhealthy, source={service=A,instance=a}, ...}
Event{name=Unhealthy, source={service=A,instance=a}, ...}

will be aggregated into

Event{name=Unhealthy, source={service=A,instance=a}, ...} <value = 6>

so you can configure the following alarm rule to trigger an alarm when the Unhealthy event occurs more than 5 times within 10 minutes.

rules:
  unhealthy_event_rule:
    metrics-name: Unhealthy
    # Health checks are usually scheduled tasks.
    # They may fail the first few times
    # and occasionally due to network jitter,
    # so please adjust the threshold to your actual situation.
    threshold: 5
    op: ">"
    period: 10
    count: 1
    message: Service instance has been unhealthy for 10 minutes

For more alarm configuration details, please refer to the alarm doc.
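The aggregation and threshold check described in this section can be sketched as below. This is a simplified model, not the alarm engine: it counts events per name and source within the period and compares the count with the threshold using ">". It ignores the count setting, and the RULE dictionary, aggregate, and should_alarm are hypothetical names.

```python
from collections import Counter
from datetime import datetime, timedelta

# Simplified stand-in for the alarm rule (the `count` setting is ignored).
RULE = {"metrics_name": "Unhealthy", "threshold": 5, "period_minutes": 10}


def aggregate(events, now, period_minutes):
    """Sum the default value (1) of each event per (name, service, instance)."""
    window_start = now - timedelta(minutes=period_minutes)
    values = Counter()
    for event in events:
        if window_start < event["time"] <= now:
            key = (event["name"], event["service"], event["instance"])
            values[key] += 1
    return values


def should_alarm(values, rule):
    """Return the keys whose aggregated value exceeds the threshold."""
    return [
        key for key, value in values.items()
        if key[0] == rule["metrics_name"] and value > rule["threshold"]
    ]


now = datetime(2024, 1, 1, 12, 0)
# Six Unhealthy events on the same instance, one per minute.
events = [
    {"name": "Unhealthy", "service": "A", "instance": "a",
     "time": now - timedelta(minutes=i)}
    for i in range(6)
]

values = aggregate(events, now, RULE["period_minutes"])
alarming = should_alarm(values, RULE)
print(values[("Unhealthy", "A", "a")])  # 6
print(alarming)  # [('Unhealthy', 'A', 'a')]
```

With only five events in the window the value equals the threshold, and because the operator is ">" no alarm is raised.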

Note that the Unhealthy event above is only for demonstration. Such events are not detected by default in SkyWalking; however, you can use the methods in How to Report Events to report them.

Known Events

| Name | Type | When | Where |
| --- | --- | --- | --- |
| Start | Normal | When your Java application starts with the SkyWalking agent installed, the Start event is created. | Reported from the SkyWalking agent. |
| Shutdown | Normal | When your Java application stops with the SkyWalking agent installed, the Shutdown event is created. | Reported from the SkyWalking agent. |
| Alarm | Error | When an alarm is triggered, the corresponding Alarm event is created. | Reported from the internal SkyWalking OAP. |

The following events are all reported by the Kubernetes Event Exporter. To see these events, please make sure you have deployed the exporter.

| Name | Type | When | Where |
| --- | --- | --- | --- |
| Killing | Normal | When the Kubernetes Pod is being killed. | Reported by the Kubernetes Event Exporter. |
| Pulling | Normal | When a Docker image is being pulled for deployment. | Reported by the Kubernetes Event Exporter. |
| Pulled | Normal | When a Docker image has been pulled for deployment. | Reported by the Kubernetes Event Exporter. |
| Created | Normal | When a container inside a Pod is created. | Reported by the Kubernetes Event Exporter. |
| Started | Normal | When a container inside a Pod is started. | Reported by the Kubernetes Event Exporter. |
| Unhealthy | Error | When the readiness probe fails. | Reported by the Kubernetes Event Exporter. |

The complete list of events can be found in the Kubernetes codebase. Please note that not all of these events are supported by the exporter for now.