Health checking

Active health checking can be configured on a per-upstream-cluster basis. As described in the service discovery section, active health checking and the EDS service discovery type go hand in hand. However, there are other scenarios where active health checking is desired even when using the other service discovery types. Envoy supports the following types of health checking, along with various settings (check interval, failures required before marking a host unhealthy, successes required before marking a host healthy, etc.):

HTTP: During HTTP health checking Envoy will send an HTTP request to the upstream host. By default, it expects a 200 response if the host is healthy. Expected and retriable response codes are configurable. The upstream host can return a non-expected or non-retriable status code (any non-200 code by default) if it wants to immediately notify downstream hosts to no longer forward traffic to it.
gRPC: During gRPC health checking Envoy will send a gRPC request to the upstream host. By default, it expects a 200 response if the host is healthy. gRPC health checks are configured through the grpc_health_check option.
L3/L4: During L3/L4 health checking, Envoy will send a configurable byte buffer to the upstream host. It expects the byte buffer to be echoed in the response if the host is to be considered healthy. Envoy also supports connect-only L3/L4 health checking.
Redis: Envoy will send a Redis PING command and expect a PONG response. The upstream Redis server can respond with anything other than PONG to cause an immediate active health check failure. Optionally, Envoy can perform EXISTS on a user-specified key. If the key does not exist, the health check passes. This allows the user to mark a Redis instance for maintenance by setting the specified key to any value and waiting for traffic to drain. See redis_key.
Thrift: Envoy will send a Thrift request and expect a success response. The upstream host may also respond with an exception to cause the health check to fail. See thrift.
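
As a rough illustration of the types and settings listed above, the following cluster fragments sketch an HTTP, an L3/L4, and a Redis health check. The cluster names, the /healthz path, the payload bytes, the maintenance key name, and all intervals and thresholds are assumptions chosen for the example; load_assignment and other required cluster fields are omitted.

```yaml
clusters:
- name: example_http_service        # hypothetical cluster name
  health_checks:
  - timeout: 1s                     # how long to wait for each check
    interval: 5s                    # check interval
    unhealthy_threshold: 3          # failures required before marking a host unhealthy
    healthy_threshold: 2            # successes required before marking a host healthy
    http_health_check:
      path: /healthz                # assumed health check path
      expected_statuses:            # replaces the default 200-only policy
      - start: 200
        end: 300                    # the end of the range is exclusive
- name: example_tcp_service         # hypothetical cluster name
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    tcp_health_check:
      send:
        text: "50696E67"            # hex-encoded bytes for "Ping"
      receive:
      - text: "50696E67"            # the same bytes are expected back
- name: example_redis               # hypothetical cluster name
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    custom_health_check:
      name: envoy.health_checkers.redis
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.health_checkers.redis.v3.Redis
        key: maintenance            # assumed key; EXISTS is used instead of PING
```

Leaving tcp_health_check empty (no send or receive payloads) would correspond to the connect-only variant.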

Health checks occur over the transport socket specified for the cluster. This implies that if a cluster is using a TLS-enabled transport socket, the health check will also occur over TLS. The TLS options used for health check connections can be specified, which is useful if the corresponding upstream is using ALPN-based FilterChainMatch with different protocols for health checks versus data connections.
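
A minimal sketch of health-check-specific TLS options, assuming the upstream listener selects a filter chain by ALPN. The ALPN value hc-http and the /healthz path are hypothetical; the cluster's own transport socket configuration is omitted.

```yaml
health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: /healthz            # assumed health check path
  tls_options:
    alpn_protocols:
    - hc-http                 # hypothetical ALPN value used only by health check connections
```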

Per cluster member health check config

If active health checking is configured for an upstream cluster, additional configuration can be specified for each registered member by setting the HealthCheckConfig in the Endpoint of an LbEndpoint within each LocalityLbEndpoints of a ClusterLoadAssignment.

The following example sets an alternative health check address and port for a cluster member:

load_assignment:
  endpoints:
  - lb_endpoints:
    - endpoint:
        health_check_config:
          port_value: 8080
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 80
        address:
          socket_address:
            address: localhost
            port_value: 80

Health check event logging

Envoy can optionally produce a per-health-checker log of ejection and addition events when a log file path is specified in the HealthCheck config event_log_path. The log is structured as JSON dumps of HealthCheckEvent messages.

Note: the HealthCheck config event_log_path is deprecated in favor of the HealthCheck event_logger extension. The file sink extension takes its own event_log_path, which is where the JSON dumps are written.

Event sinks belong to the envoy.health_check.event_sinks extension category; their APIs can be found in the API reference.

Envoy can be configured to log all health check failure events by setting the always_log_health_check_failures flag to true.
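
The event logging options above might be combined as in the following sketch. The sink name and type URL are written from memory of the file sink extension and should be checked against the API reference for the Envoy version in use; the log path and the /healthz path are assumptions.

```yaml
health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: /healthz                              # assumed health check path
  always_log_health_check_failures: true        # log every failure, not only state changes
  event_logger:
  - name: envoy.health_check.event_sink.file    # assumed name of the file sink extension
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.health_check.event_sinks.file.v3.HealthCheckEventFileSink
      event_log_path: /var/log/envoy/health_check_events.log   # hypothetical path
```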

Passive health checking

Envoy also supports passive health checking via outlier detection.
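
For orientation, a cluster-level outlier detection block might look like the following sketch. The cluster name and all values are illustrative assumptions, not recommendations.

```yaml
clusters:
- name: example_service         # hypothetical cluster name
  outlier_detection:
    consecutive_5xx: 5          # eject a host after this many consecutive 5xx responses
    interval: 10s               # how often ejection analysis runs
    base_ejection_time: 30s     # base duration of an ejection
    max_ejection_percent: 10    # cap on the share of hosts that can be ejected
```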

Connection pool interactions

See the connection pooling documentation for more information.

HTTP health checking filter

When an Envoy mesh is deployed with active health checking between clusters, a large amount of health checking traffic can be generated. Envoy includes an HTTP health checking filter that can be installed in a configured HTTP listener. This filter is capable of a few different modes of operation:

No pass through: In this mode, the health check request is never passed to the local service. Envoy will respond with a 200 or a 503 depending on the current draining state of the server.
No pass through, computed from upstream cluster health: In this mode, the health checking filter will return a 200 or a 503 depending on whether at least a specified percentage of the servers are available (healthy + degraded) in one or more upstream clusters. (If the Envoy server is in a draining state, though, it will respond with a 503 regardless of the upstream cluster health.)
Pass through: In this mode, Envoy will pass every health check request to the local service. The service is expected to return a 200 or a 503 depending on its health state.
Pass through with caching: In this mode, Envoy will pass health check requests to the local service, but then cache the result for some period of time. Subsequent health check requests will return the cached value up to the cache time. When the cache time is reached, the next health check request will be passed to the local service. This is the recommended mode of operation when operating a large mesh. Envoy uses persistent connections for health checking traffic and health check requests have very little cost to Envoy itself. Thus, this mode of operation yields an eventually consistent view of the health state of each upstream host without overwhelming the local service with a large number of health check requests.
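
The pass through with caching mode could be configured roughly as follows. This is a sketch that assumes health check requests arrive on the path /healthcheck; the path and cache time are example values.

```yaml
http_filters:
- name: envoy.filters.http.health_check
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
    pass_through_mode: true       # forward health check requests to the local service
    cache_time: 2.5s              # serve the cached result for this long
    headers:                      # which requests count as health checks
    - name: ":path"
      string_match:
        exact: /healthcheck       # assumed health check path
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

Setting pass_through_mode to false and dropping cache_time would correspond to the no pass through mode, and adding cluster_min_healthy_percentages would select the variant computed from upstream cluster health.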

Active health checking fast failure

When using active health checking along with passive health checking (outlier detection), it is common to use a long health checking interval to avoid a large amount of active health checking traffic. In this case, it is still useful to be able to quickly drain an upstream host when using the /healthcheck/fail admin endpoint. To support this, the router filter and the HTTP active health checker will respond to the x-envoy-immediate-health-check-fail header. If this header is set by an upstream host, Envoy will immediately mark the host as being failed for active health check and excluded from load balancing. Note that this only occurs if the host’s cluster has active health checking configured. The health checking filter will automatically set this header if Envoy has been marked as failed via the /healthcheck/fail admin endpoint.

Health check identity

Just verifying that an upstream host responds to a particular health check URL does not necessarily mean that the upstream host is valid. For example, when using eventually consistent service discovery in a cloud auto scaling or container environment, it’s possible for a host to go away and then come back with the same IP address, but as a different host type. One solution to this problem is having a different HTTP health checking URL for every service type. The downside of that approach is that overall configuration becomes more complicated as every health check URL is fully custom.

The Envoy HTTP health checker supports the service_name_matcher option. If this option is set, the health checker additionally compares the value of the x-envoy-upstream-healthchecked-cluster response header to service_name_matcher. If the values do not match, the health check does not pass. The upstream health check filter appends x-envoy-upstream-healthchecked-cluster to the response headers. The appended value is determined by the --service-cluster command line option.
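
A minimal sketch of the identity check, assuming the upstream Envoy was started with --service-cluster example-service. The service name and the /healthcheck path are hypothetical.

```yaml
health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: /healthcheck            # assumed health check path
    service_name_matcher:
      prefix: example-service     # compared against x-envoy-upstream-healthchecked-cluster
```

A prefix match is used here so that a suffixed service cluster name would still pass; an exact match is stricter.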

Degraded health

When using the HTTP health checker, an upstream host can return x-envoy-degraded to inform the health checker that the host is degraded. See the load balancing documentation for how this affects load balancing.