Tracing

Overview

Distributed tracing allows developers to obtain visualizations of call flows in large service-oriented architectures. It can be invaluable in understanding serialization, parallelism, and sources of latency. Envoy supports three features related to system-wide tracing:

Request ID generation: Envoy will generate UUIDs when needed and populate the x-request-id HTTP header. Applications can forward the x-request-id header for unified logging as well as tracing. The behavior can be configured on a per HTTP connection manager basis using an extension.
Client trace ID joining: The x-client-trace-id header can be used to join untrusted request IDs to the trusted internal x-request-id.
External trace service integration: Envoy supports pluggable external trace visualization providers, which are divided into two subgroups:
- External tracers which are part of the Envoy code base, like Zipkin, Jaeger, Datadog, SkyWalking, AWS X-Ray, and Fluentd.
- External tracers which come as a third party plugin, like Instana.

How to initiate a trace

The HTTP connection manager that handles the request must have the tracing object set. There are several ways tracing can be initiated:

By an external client via the x-client-trace-id header.
By an internal service via the x-envoy-force-trace header.
Randomly sampled via the random_sampling runtime setting.
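
The three triggers above can be sketched as a single decision function. This is a minimal Python illustration, not Envoy's implementation: the function name trace_reason, the returned strings, and the order in which the checks run are assumptions made for the example, and sampling_percent stands in for the random_sampling setting.

```python
import random
from typing import Mapping, Optional


def trace_reason(
    headers: Mapping[str, str],
    sampling_percent: float,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return why a request would be traced, or None if it would not be.

    ASSUMPTION: the check order (client, forced, sampled) is illustrative.
    """
    rng = rng or random.Random()
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "x-client-trace-id" in lowered:
        return "client"
    if "x-envoy-force-trace" in lowered:
        return "forced"
    # rng.random() is in [0.0, 1.0): 0 percent never samples, 100 always does.
    if rng.random() * 100.0 < sampling_percent:
        return "sampled"
    return None
```

For example, trace_reason({"x-envoy-force-trace": "true"}, 0.0) returns "forced" even though random sampling is effectively off.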

Trace context propagation

Envoy provides the capability for reporting tracing information regarding communications between services in the mesh. However, to be able to correlate the pieces of tracing information generated by the various proxies within a call flow, the services must propagate certain trace context between the inbound and outbound requests.

Whichever tracing provider is being used, the service should propagate the x-request-id to enable logging across the invoked services to be correlated.

Attention

Envoy’s request ID implementation is extensible and defaults to the UuidRequestIdConfig implementation. Configuration for this extension can be provided within the HTTP connection manager field (see the documentation for that field for an example). The default implementation will modify the request ID UUID4 to pack the final trace reason into the UUID. This feature allows stable sampling across a fleet of Envoys as documented in the x-request-id header documentation. However, trace reason packing may break externally generated request IDs that must be maintained. The pack_trace_reason field can be used to disable this behavior at the expense of also disabling stable trace reason propagation and associated features within a deployment.

Attention

The sampling policy for Envoy is determined by the value of x-request-id by default. However, such a sampling policy is only valid for a fleet of Envoys. If a service proxy that is not Envoy is present in the fleet, sampling is performed without considering the policy of that proxy. For meshes consisting of multiple service proxies such as this, it is more effective to bypass Envoy’s sampling policy and sample based on the trace provider’s sampling policy. This can be achieved by setting use_request_id_for_trace_sampling to false.
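
The idea of sampling based on x-request-id can be illustrated with a short Python sketch: because the decision is a pure function of the ID, every proxy that sees the same ID reaches the same decision. The helper stable_sample and its bucketing scheme (first eight hex digits modulo 10000) are assumptions for illustration and are not presented as Envoy's exact algorithm.

```python
def stable_sample(request_id: str, percent: float) -> bool:
    """Decide deterministically whether a request ID falls in the sampled share.

    ASSUMPTION: the bucketing scheme is hypothetical; it only demonstrates
    that a decision derived from the ID is stable across proxies.
    """
    # Map the ID to a bucket in [0, 10000) using its first 8 hex digits.
    bucket = int(request_id.replace("-", "")[:8], 16) % 10000
    # percent is in [0, 100]; each bucket represents 0.01 percent.
    return bucket < percent * 100
```

Two proxies calling stable_sample with the same x-request-id and the same percentage always agree, which is the property that a non-Envoy proxy with its own policy would break.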

The tracing providers also require additional context, to enable the parent/child relationships between the spans (logical units of work) to be understood. This can be achieved by using the LightStep (via OpenTelemetry API) or Zipkin tracer directly within the service itself, to extract the trace context from the inbound request and inject it into any subsequent outbound requests. This approach would also enable the service to create additional spans, describing work being done internally within the service, that may be useful when examining the end-to-end trace.

Alternatively, the trace context can be manually propagated by the service:

When using the LightStep tracer, Envoy relies on the service to propagate the x-ot-span-context HTTP header while sending HTTP requests to other services.
When using the Zipkin tracer, Envoy relies on the service to propagate the B3 HTTP headers (x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, and x-b3-flags). The x-b3-sampled header can also be supplied by an external client to either enable or disable tracing for a particular request. In addition, the single b3 header propagation format is supported, which is a more compressed format.

The Zipkin tracer can optionally be configured to support both B3 and W3C trace context formats for improved interoperability. This is controlled by the trace_context_option configuration option. When set to USE_B3_WITH_W3C_PROPAGATION, the tracer will:
- For downstream requests: Extract trace context from B3 headers first, and fall back to W3C trace headers (traceparent and tracestate) when B3 headers are not present.
- For upstream requests: Inject both B3 and W3C trace headers to maximize compatibility.
By default this option is set to USE_B3, which maintains backward compatibility by using only B3 headers for both extraction and injection.
When using the Datadog tracer, Envoy relies on the service to propagate the Datadog-specific HTTP headers (x-datadog-trace-id, x-datadog-parent-id, x-datadog-sampling-priority).
When using the SkyWalking tracer, Envoy relies on the service to propagate the SkyWalking-specific HTTP header (sw8).
When using the AWS X-Ray tracer, Envoy relies on the service to propagate the X-Ray-specific HTTP header (x-amzn-trace-id).
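
Manual propagation amounts to copying the relevant headers from the inbound request onto every outbound request made while handling it. The following Python sketch shows one way a service might do this. The names TRACE_HEADERS and propagation_headers are hypothetical; the header list simply collects the headers named above, and a real service would usually forward only those used by its configured tracer.

```python
from typing import Dict, Mapping

# Headers named in this section, plus x-request-id for log correlation.
TRACE_HEADERS = (
    "x-request-id",
    "x-ot-span-context",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
    "x-datadog-trace-id",
    "x-datadog-parent-id",
    "x-datadog-sampling-priority",
    "sw8",
    "x-amzn-trace-id",
)


def propagation_headers(inbound: Mapping[str, str]) -> Dict[str, str]:
    """Pick out the trace context headers that should be forwarded."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in inbound.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

The returned dictionary can be merged into the headers of each outbound call, whichever HTTP client the service uses. Headers unrelated to tracing are deliberately left out.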

What data each trace contains

An end-to-end trace consists of one or more spans. A span represents a logical unit of work that has a start time and duration and can contain metadata associated with it. Each span generated by Envoy contains the following data:

Originating service cluster set via --service-cluster.
Start time and duration of the request.
Originating host set via --service-node.
Downstream cluster set via the x-envoy-downstream-service-cluster header.
HTTP request URL, method, protocol and user-agent.
Additional custom tags set via custom_tags.
Upstream cluster name, observability name, and address.
HTTP response status code.
gRPC response status and message (if available).
An error tag when the HTTP status is 5xx or the gRPC status is not “OK” and represents a server-side error. See gRPC’s documentation for more information about gRPC status codes.
Tracing system-specific metadata.
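
The error tag rule in the list above can be expressed as a small predicate. This Python sketch is illustrative only: should_tag_error is a hypothetical name, and the set of gRPC codes treated as server-side errors is an assumption for the example, since the text above does not enumerate them.

```python
from typing import FrozenSet, Optional

# ASSUMPTION: an illustrative choice of gRPC codes counted as server-side
# errors (UNKNOWN=2, DEADLINE_EXCEEDED=4, UNIMPLEMENTED=12, INTERNAL=13,
# UNAVAILABLE=14, DATA_LOSS=15). Envoy's actual set may differ.
ASSUMED_SERVER_SIDE_GRPC_CODES: FrozenSet[int] = frozenset({2, 4, 12, 13, 14, 15})


def should_tag_error(
    http_status: int,
    grpc_status: Optional[int] = None,
    server_side_codes: FrozenSet[int] = ASSUMED_SERVER_SIDE_GRPC_CODES,
) -> bool:
    """Return True when a span would carry an error tag under the rule above."""
    if 500 <= http_status <= 599:
        return True
    # gRPC status 0 is OK; None means the response carried no gRPC status.
    return (
        grpc_status is not None
        and grpc_status != 0
        and grpc_status in server_side_codes
    )
```

Under these assumptions an HTTP 200 response with gRPC status 14 (UNAVAILABLE) is tagged as an error, while gRPC status 5 (NOT_FOUND) is not.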

The span also includes a name (or operation), which by default is defined as the host of the invoked service. However, this can be customized using a config.route.v3.Decorator on the route. The name can also be overridden using the x-envoy-decorator-operation header.

Envoy automatically sends spans to tracing collectors. Depending on the tracing collector, multiple spans are stitched together using common information such as the globally unique request ID x-request-id (LightStep) or the trace ID configuration (Zipkin and Datadog). See v3 API reference for more information on how to set up tracing in Envoy.

Baggage

Baggage provides a mechanism for data to be available throughout the entirety of a trace. While metadata such as tags are usually communicated to collectors out-of-band, baggage data is injected into the actual request context and available to applications during the duration of the request. This enables metadata to transparently travel from the beginning of the request throughout your entire mesh without relying on application-specific modifications for propagation. See OpenTelemetry’s documentation for more information about baggage.

Tracing providers have varying levels of support for getting and setting baggage:

LightStep (and any OpenTelemetry-compliant tracer) can read/write baggage.
Zipkin support is not yet implemented.
X-Ray and Fluentd don’t support baggage.
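
To make "injected into the actual request context" concrete, the Python sketch below reads baggage entries from a request header. It assumes the W3C baggage header format (comma-separated key=value members with optional ;property suffixes), which is the format OpenTelemetry documents; a given tracer may carry baggage differently. The function name parse_baggage is hypothetical.

```python
from typing import Dict
from urllib.parse import unquote


def parse_baggage(header_value: str) -> Dict[str, str]:
    """Parse a W3C-style baggage header value into a key/value mapping.

    ASSUMPTION: the request carries baggage in the W3C `baggage` header.
    """
    entries: Dict[str, str] = {}
    for member in header_value.split(","):
        # Drop optional ";property" metadata that may follow the value.
        key_value = member.split(";", 1)[0]
        if "=" not in key_value:
            continue  # Skip empty or malformed members.
        key, value = key_value.split("=", 1)
        # Values are percent-encoded on the wire.
        entries[key.strip()] = unquote(value.strip())
    return entries
```

An application can call this on the inbound header to read values set earlier in the request path without any application-specific propagation logic of its own.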

Different types of span

As mentioned earlier, a trace is composed of one or more spans, which may have different types. Tracing systems such as SkyWalking, Zipkin, and OpenTelemetry, among others, offer the same or similar span types. The most common types are CLIENT and SERVER. A CLIENT type span is generated by a client for a request that is sent to a server, while a SERVER type span is generated by a server for a request that is received from a client.

A basic trace chain looks like the following snippet. Typically, the parent span of a server span should be a client span. Every hop in the chain must ensure the correctness of the span type.

-> [SERVER -> CLIENT] -> [SERVER -> CLIENT] -> ...
          App A                 App B
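
The rule that a server span's parent should be a client span can be checked mechanically. The following Python sketch models spans with a hypothetical Span dataclass and reports SERVER spans whose parent is not a CLIENT span. The types and field names are invented for illustration and do not correspond to any tracer's data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    kind: str  # "SERVER" or "CLIENT"
    parent_id: Optional[str] = None


def mistyped_server_spans(spans: List[Span]) -> List[str]:
    """Return IDs of SERVER spans whose known parent is not a CLIENT span."""
    by_id = {span.span_id: span for span in spans}
    bad: List[str] = []
    for span in spans:
        # Root spans and CLIENT spans are outside the scope of this check.
        if span.kind != "SERVER" or span.parent_id is None:
            continue
        parent = by_id.get(span.parent_id)
        # A parent missing from the batch cannot be judged, so it is skipped.
        if parent is not None and parent.kind != "CLIENT":
            bad.append(span.span_id)
    return bad
```

For the chain drawn above (App A SERVER, App A CLIENT, App B SERVER) the function returns an empty list. If App B's SERVER span pointed directly at App A's SERVER span, it would be reported.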

Different modes of Envoy

Because Envoy is widely used in service meshes as a sidecar, it is important to understand its different tracing modes.

In the first mode, Envoy is used as a sidecar. The sidecar and its associated application are treated as a single hop in the trace chain. If a tracing system with typed spans is used, the ideal trace chain might look like the following snippet.

-> [[SERVER (inbound sidecar) -> App -> CLIENT (outbound sidecar)]] -> ...
                                 App

As you can see, in the first mode, the inbound sidecar will always generate a SERVER span, and the outbound sidecar will always generate a CLIENT span. The application will not generate any spans but will only propagate the trace context.

In the second mode, Envoy is used as a gateway, or as a sidecar that is treated as a hop separate from its application in the trace chain. If a tracing system with typed spans is used, the ideal trace chain might look like the following snippet.

-> [SERVER -> CLIENT] -> [SERVER -> CLIENT] -> [SERVER -> CLIENT] -> [SERVER -> CLIENT] -> ...
        Gateway           Inbound Sidecar            App             Outbound Sidecar

As you can see, in the second mode, Envoy will generate a SERVER span for downstream requests and a CLIENT span for upstream requests. The application may also generate spans for its own work.

To enable this mode, set spawn_upstream_span to true explicitly. This tells the tracing provider to generate a CLIENT span for upstream requests and to treat Envoy as an independent hop in the trace chain.
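
The difference between the two modes can be summarized in a few lines of Python. This is a toy model of the behavior described in this section, not Envoy code: the function envoy_span_kinds and the role strings "inbound_sidecar" and "outbound_sidecar" are hypothetical names chosen for the example.

```python
from typing import List


def envoy_span_kinds(role: str, spawn_upstream_span: bool) -> List[str]:
    """Return the span types one Envoy instance contributes to a trace.

    ASSUMPTION: role names are invented; only the cases described in this
    section are modeled.
    """
    if spawn_upstream_span:
        # Second mode: Envoy is its own hop, with a SERVER span for the
        # downstream request and a CLIENT span for the upstream request.
        return ["SERVER", "CLIENT"]
    # First mode: sidecar and application together form a single hop.
    if role == "inbound_sidecar":
        return ["SERVER"]
    if role == "outbound_sidecar":
        return ["CLIENT"]
    raise ValueError(f"role {role!r} is not covered by the first mode")
```

With spawn_upstream_span left false, the inbound and outbound sidecars contribute one span each, so the pair brackets the application as a single SERVER -> CLIENT hop.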