Health check

  • Health checking architecture overview.

  • If health checking is configured for a cluster, additional statistics are emitted. They are documented here.

config.core.v3.HealthCheck

[config.core.v3.HealthCheck proto]

{
  "timeout": "{...}",
  "interval": "{...}",
  "initial_jitter": "{...}",
  "interval_jitter": "{...}",
  "interval_jitter_percent": "...",
  "unhealthy_threshold": "{...}",
  "healthy_threshold": "{...}",
  "reuse_connection": "{...}",
  "http_health_check": "{...}",
  "tcp_health_check": "{...}",
  "grpc_health_check": "{...}",
  "custom_health_check": "{...}",
  "no_traffic_interval": "{...}",
  "no_traffic_healthy_interval": "{...}",
  "unhealthy_interval": "{...}",
  "unhealthy_edge_interval": "{...}",
  "healthy_edge_interval": "{...}",
  "event_log_path": "...",
  "always_log_health_check_failures": "...",
  "tls_options": "{...}",
  "transport_socket_match_criteria": "{...}"
}
timeout

(Duration, REQUIRED) The time to wait for a health check response. If the timeout is reached, the health check attempt is considered a failure.

interval

(Duration, REQUIRED) The interval between health checks.

initial_jitter

(Duration) An optional jitter amount in milliseconds. If specified, Envoy will start health checking after a random time in ms between 0 and initial_jitter. This only applies to the first health check.

interval_jitter

(Duration) An optional jitter amount in milliseconds. If specified, during every interval Envoy will add interval_jitter to the wait time.

interval_jitter_percent

(uint32) An optional jitter amount as a percentage of interval. If specified, during every interval Envoy will add interval * interval_jitter_percent / 100 to the wait time.

If interval_jitter and interval_jitter_percent are both set, both of them will be used to increase the wait time.
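
As a rough worked example of how the jitter fields combine, the fragment below uses illustrative values that are not from the original text. With interval set to 10s, interval_jitter_percent: 10 contributes 10s * 10 / 100 = 1s and interval_jitter contributes another 1s, so each wait can grow by as much as 2s. Whether each amount is applied in full or as a random value up to that bound is not spelled out above; the comments assume an upper bound.

```yaml
health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  initial_jitter: 2s            # first check starts after a random 0..2s delay
  interval_jitter: 1s           # assumed upper bound: up to 1s added per interval
  interval_jitter_percent: 10   # 10% of 10s, so up to 1s more per interval
  tcp_health_check: {}          # connect-only check, used here only as a placeholder
```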

unhealthy_threshold

(UInt32Value, REQUIRED) The number of unhealthy health checks required before a host is marked unhealthy. Note that for HTTP health checking, if a host responds with 503, this threshold is ignored and the host is considered unhealthy immediately.

healthy_threshold

(UInt32Value, REQUIRED) The number of healthy health checks required before a host is marked healthy. Note that during startup, only a single successful health check is required to mark a host healthy.
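
A minimal sketch of a health check that sets only the required fields might look like the following. The durations, thresholds, and path are illustrative assumptions, and the entry is shown under a cluster's health_checks list.

```yaml
health_checks:
- timeout: 1s              # an attempt with no response within 1s counts as a failure
  interval: 10s            # time between health checks
  unhealthy_threshold: 3   # 3 failed checks before the host is marked unhealthy
  healthy_threshold: 2     # 2 passed checks before the host is marked healthy
  http_health_check:
    path: /healthcheck
```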

reuse_connection

(BoolValue) Reuse health check connection between health checks. Default is true.

http_health_check

(config.core.v3.HealthCheck.HttpHealthCheck) HTTP health check.

Precisely one of http_health_check, tcp_health_check, grpc_health_check, custom_health_check must be set.

tcp_health_check

(config.core.v3.HealthCheck.TcpHealthCheck) TCP health check.

Precisely one of http_health_check, tcp_health_check, grpc_health_check, custom_health_check must be set.

grpc_health_check

(config.core.v3.HealthCheck.GrpcHealthCheck) gRPC health check.

Precisely one of http_health_check, tcp_health_check, grpc_health_check, custom_health_check must be set.

custom_health_check

(config.core.v3.HealthCheck.CustomHealthCheck) Custom health check.

Precisely one of http_health_check, tcp_health_check, grpc_health_check, custom_health_check must be set.

no_traffic_interval

(Duration) The “no traffic interval” is a special health check interval that is used when a cluster has never had traffic routed to it. This lower-frequency interval allows cluster information to be kept up to date without sending a potentially large amount of active health checking traffic for no reason. Once a cluster has been used for traffic routing, Envoy will shift back to using the standard health check interval that is defined. Note that this interval takes precedence over any other.

The default value for “no traffic interval” is 60 seconds.

no_traffic_healthy_interval

(Duration) The “no traffic healthy interval” is a special health check interval that is used for hosts that are currently passing active health checking (including new hosts) when the cluster has received no traffic.

This is useful for when we want to send frequent health checks with no_traffic_interval but then revert to lower frequency no_traffic_healthy_interval once a host in the cluster is marked as healthy.

Once a cluster has been used for traffic routing, Envoy will shift back to using the standard health check interval that is defined.

If no_traffic_healthy_interval is not set, it defaults to the no traffic interval, and that interval is used regardless of health state.
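
The two no-traffic intervals can be sketched together as follows. All values are illustrative assumptions: hosts are probed every 10s while the cluster has never received traffic, and hosts that are already passing are probed only every 60s.

```yaml
health_checks:
- timeout: 1s
  interval: 5s                       # standard interval once traffic has been routed
  unhealthy_threshold: 3
  healthy_threshold: 2
  no_traffic_interval: 10s           # cluster has never had traffic routed to it
  no_traffic_healthy_interval: 60s   # no traffic, and the host is passing health checks
  tcp_health_check: {}
```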

unhealthy_interval

(Duration) The “unhealthy interval” is a health check interval that is used for hosts that are marked as unhealthy. As soon as the host is marked as healthy, Envoy will shift back to using the standard health check interval that is defined.

The default value for “unhealthy interval” is the same as “interval”.

unhealthy_edge_interval

(Duration) The “unhealthy edge interval” is a special health check interval that is used for the first health check right after a host is marked as unhealthy. For subsequent health checks Envoy will shift back to using either “unhealthy interval” if present or the standard health check interval that is defined.

The default value for “unhealthy edge interval” is the same as “unhealthy interval”.

healthy_edge_interval

(Duration) The “healthy edge interval” is a special health check interval that is used for the first health check right after a host is marked as healthy. For subsequent health checks Envoy will shift back to using the standard health check interval that is defined.

The default value for “healthy edge interval” is the same as the standard health check interval (“interval”).
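
A sketch of how the state-dependent intervals might be combined, with illustrative values: a quick recheck right after each state change, faster probing while a host is unhealthy, and the standard interval otherwise.

```yaml
health_checks:
- timeout: 1s
  interval: 10s                 # standard interval for healthy hosts
  unhealthy_threshold: 3
  healthy_threshold: 2
  unhealthy_edge_interval: 1s   # first check right after a host is marked unhealthy
  unhealthy_interval: 3s        # subsequent checks while the host stays unhealthy
  healthy_edge_interval: 2s     # first check right after a host is marked healthy
  tcp_health_check: {}
```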

event_log_path

(string) Specifies the path to the health check event log. If empty, no event log will be written.

always_log_health_check_failures

(bool) If set to true, health check failure events will always be logged. If set to false, only the initial health check failure event will be logged. The default value is false.
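
For illustration, the two logging fields could be set as below. The log path is a hypothetical location, not one taken from the original text.

```yaml
health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  event_log_path: /var/log/envoy/health_check_events.log   # hypothetical path
  always_log_health_check_failures: true                   # log every failure, not only the first
  tcp_health_check: {}
```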

tls_options

(config.core.v3.HealthCheck.TlsOptions) This allows overriding the cluster TLS settings, just for health check connections.

transport_socket_match_criteria

(Struct) Optional key/value pairs that will be used to match a transport socket from those specified in the cluster’s transport socket matches. For example, the following match criteria:

transport_socket_match_criteria:
  useMTLS: true

will match the following cluster socket match:

transport_socket_matches:
- name: "useMTLS"
  match:
    useMTLS: true
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config: { ... } # TLS socket configuration

If this field is set, then for health checks it will supersede an entry of envoy.transport_socket in the LbEndpoint.Metadata. This allows using different transport socket capabilities for health checking versus proxying to the endpoint.

If the key/value pairs specified do not match any transport socket matches, the cluster’s transport socket will be used for health check socket configuration.

config.core.v3.HealthCheck.Payload

[config.core.v3.HealthCheck.Payload proto]

Describes the encoding of the payload bytes in the payload.

{
  "text": "..."
}
text

(string, REQUIRED) Hex encoded payload. E.g., “000000FF”.

config.core.v3.HealthCheck.HttpHealthCheck

[config.core.v3.HealthCheck.HttpHealthCheck proto]

{
  "host": "...",
  "path": "...",
  "request_headers_to_add": [],
  "request_headers_to_remove": [],
  "expected_statuses": [],
  "codec_client_type": "...",
  "service_name_matcher": "{...}"
}
host

(string) The value of the host header in the HTTP health check request. If left empty (default value), the name of the cluster this health check is associated with will be used. The host header can be customized for a specific endpoint by setting the hostname field.

path

(string, REQUIRED) Specifies the HTTP path that will be requested during health checking. For example /healthcheck.

request_headers_to_add

(config.core.v3.HeaderValueOption) Specifies a list of HTTP headers that should be added to each request that is sent to the health checked cluster. For more information, including details on header value syntax, see the documentation on custom request headers.

request_headers_to_remove

(string) Specifies a list of HTTP headers that should be removed from each request that is sent to the health checked cluster.

expected_statuses

(type.v3.Int64Range) Specifies a list of HTTP response statuses considered healthy. If provided, this replaces the default 200-only policy, so 200 must be included explicitly if it should still count as healthy. Ranges follow the half-open semantics of Int64Range. The start and end of each range are required. Only statuses in the range [100, 600) are allowed.

codec_client_type

(type.v3.CodecClientType) Use specified application protocol for health checks.

service_name_matcher

(type.matcher.v3.StringMatcher) An optional service name parameter which is used to validate the identity of the health checked cluster using a StringMatcher. See the architecture overview for more information.
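
The HTTP fields above might be combined as in the following sketch. The host name, header, and service name prefix are hypothetical. Because ranges are half-open, start: 200 with end: 300 accepts statuses 200 through 299.

```yaml
health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    host: backend.example.internal   # hypothetical; defaults to the cluster name if empty
    path: /healthcheck
    request_headers_to_add:
    - header:
        key: x-health-probe          # hypothetical header
        value: envoy
    request_headers_to_remove:
    - user-agent
    expected_statuses:
    - start: 200
      end: 300                       # half-open range: 200..299
    codec_client_type: HTTP1
    service_name_matcher:
      prefix: backend                # hypothetical service name prefix
```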

config.core.v3.HealthCheck.TcpHealthCheck

[config.core.v3.HealthCheck.TcpHealthCheck proto]

{
  "send": "{...}",
  "receive": []
}
send

(config.core.v3.HealthCheck.Payload) Empty payloads imply a connect-only health check.

receive

(config.core.v3.HealthCheck.Payload) When checking the response, “fuzzy” matching is performed such that each binary block must be found, and in the order specified, but not necessarily contiguous.
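
A sketch of a TCP health check using hex-encoded payloads follows. The byte values are illustrative assumptions. With two receive blocks, the response must contain the bytes 00 00 and, somewhere after them, the byte FF, though not necessarily adjacent.

```yaml
health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  tcp_health_check:
    send:
      text: "000000FF"   # hex-encoded bytes written after connecting
    receive:
    - text: "0000"       # must appear first in the response
    - text: "FF"         # must appear later, not necessarily contiguous
```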

config.core.v3.HealthCheck.RedisHealthCheck

[config.core.v3.HealthCheck.RedisHealthCheck proto]

{
  "key": "..."
}
key

(string) If set, optionally perform EXISTS <key> instead of PING. A return value from Redis of 0 (does not exist) is considered a passing healthcheck. A return value other than 0 is considered a failure. This allows the user to mark a Redis instance for maintenance by setting the specified key to any value and waiting for traffic to drain.

config.core.v3.HealthCheck.GrpcHealthCheck

[config.core.v3.HealthCheck.GrpcHealthCheck proto]

grpc.health.v1.Health-based healthcheck. See gRPC doc for details.

{
  "service_name": "...",
  "authority": "..."
}
service_name

(string) An optional service name parameter which will be sent to the gRPC service in the grpc.health.v1.HealthCheckRequest message. See gRPC health-checking overview for more information.

authority

(string) The value of the :authority header in the gRPC health check request. If left empty (default value), the name of the cluster this health check is associated with will be used. The authority header can be customized for a specific endpoint by setting the hostname field.
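
For illustration, a gRPC health check could be configured as below. The service name and authority values are hypothetical.

```yaml
health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  grpc_health_check:
    service_name: my.package.MyService    # hypothetical; sent in HealthCheckRequest
    authority: backend.example.internal   # hypothetical; defaults to the cluster name if empty
```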

config.core.v3.HealthCheck.CustomHealthCheck

[config.core.v3.HealthCheck.CustomHealthCheck proto]

Custom health check.

{
  "name": "...",
  "typed_config": "{...}"
}
name

(string, REQUIRED) The registered name of the custom health checker.

typed_config

(Any) A custom health checker specific configuration which depends on the custom health checker being instantiated. See envoy/config/health_checker for reference.
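
As a sketch, a custom health check that selects the Redis health checker might look like this. The "@type" URL is an assumption that depends on the Envoy version in use, and the key name is hypothetical.

```yaml
health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  custom_health_check:
    name: envoy.health_checkers.redis
    typed_config:
      # Assumed type URL; verify it against the Envoy version being deployed.
      "@type": type.googleapis.com/envoy.extensions.health_checkers.redis.v3.Redis
      key: maintenance   # hypothetical key; EXISTS returning 0 counts as passing
```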

config.core.v3.HealthCheck.TlsOptions

[config.core.v3.HealthCheck.TlsOptions proto]

Health checks occur over the transport socket specified for the cluster. This implies that if a cluster is using a TLS-enabled transport socket, the health check will also occur over TLS.

This allows overriding the cluster TLS settings, just for health check connections.

{
  "alpn_protocols": []
}
alpn_protocols

(string) Specifies the ALPN protocols for health check connections. This is useful if the corresponding upstream is using ALPN-based FilterChainMatch along with different protocols for health checks versus data connections. If empty, no ALPN protocols will be set on health check connections.
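
A minimal sketch of overriding ALPN for health check connections only is shown below. The protocol token is a hypothetical value that the upstream's filter chain match would need to recognize.

```yaml
health_checks:
- timeout: 1s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  tls_options:
    alpn_protocols:
    - hc               # hypothetical ALPN token reserved for health checks
  tcp_health_check: {}
```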

Enum config.core.v3.HealthStatus

[config.core.v3.HealthStatus proto]

Endpoint health status.

UNKNOWN

(DEFAULT) The health status is not known. This is interpreted by Envoy as HEALTHY.

HEALTHY

Healthy.

UNHEALTHY

Unhealthy.

DRAINING

Connection draining in progress. E.g., https://aws.amazon.com/blogs/aws/elb-connection-draining-remove-instances-from-service-with-care/ or https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining. This is interpreted by Envoy as UNHEALTHY.

TIMEOUT

Health check timed out. This is part of HDS and is interpreted by Envoy as UNHEALTHY.

DEGRADED

Degraded.