Outlier detection

Outlier detection and ejection is the process of dynamically determining whether some number of hosts in an upstream cluster are performing unlike the others and removing them from the healthy load balancing set. Performance might be along different axes such as consecutive failures, temporal success rate, temporal latency, etc. Outlier detection is a form of passive health checking. Envoy also supports active health checking. Passive and active health checking can be enabled together or independently, and form the basis for an overall upstream health checking solution.

Ejection algorithm

Depending on the type of outlier detection, ejection either runs inline (for example in the case of consecutive 5xx) or at a specified interval (for example in the case of periodic success rate). The ejection algorithm works as follows:

  1. A host is determined to be an outlier.
  2. If no hosts have been ejected, Envoy will eject the host immediately. Otherwise, it checks to make sure the number of ejected hosts is below the allowed threshold (specified via the outlier_detection.max_ejection_percent setting). If the number of ejected hosts is above the threshold, the host is not ejected.
  3. The host is ejected for some number of milliseconds. Ejection means that the host is marked unhealthy and will not be used during load balancing unless the load balancer is in a panic scenario. The number of milliseconds is equal to the outlier_detection.base_ejection_time_ms value multiplied by the number of times the host has been ejected. This causes hosts to get ejected for longer and longer periods if they continue to fail.
  4. An ejected host is automatically brought back into service once its ejection time has elapsed. Generally, outlier detection is used alongside active health checking for a comprehensive health checking solution.
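
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Envoy's implementation: the Host and Cluster classes, their field names, and the default values are hypothetical, and the threshold check assumes that "below the allowed threshold" is a strict comparison on the percentage of ejected hosts.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    times_ejected: int = 0      # how many times this host has been ejected so far
    ejected_until_ms: int = -1  # -1 means the host is currently in service


@dataclass
class Cluster:
    hosts: list
    base_ejection_time_ms: int = 30_000  # illustrative value, not taken from the text
    max_ejection_percent: int = 10       # illustrative value, not taken from the text

    def ejected_count(self) -> int:
        return sum(1 for h in self.hosts if h.ejected_until_ms >= 0)

    def try_eject(self, host: Host, now_ms: int) -> bool:
        """Steps 2 and 3: check the threshold, then eject for a growing duration."""
        if host.ejected_until_ms >= 0:
            return False  # assumption: an already ejected host is left alone
        ejected = self.ejected_count()
        if ejected > 0:
            # Assumption: eject only while the ejected share is strictly below the limit.
            if 100 * ejected / len(self.hosts) >= self.max_ejection_percent:
                return False
        host.times_ejected += 1
        # Ejection time is the base time multiplied by the number of ejections.
        host.ejected_until_ms = now_ms + self.base_ejection_time_ms * host.times_ejected
        return True

    def restore_expired(self, now_ms: int) -> None:
        """Step 4: bring hosts back once their ejection time has elapsed."""
        for h in self.hosts:
            if 0 <= h.ejected_until_ms <= now_ms:
                h.ejected_until_ms = -1
```

In this sketch, a host with base_ejection_time_ms set to 1000 is ejected for 1000 ms the first time and 2000 ms the second time, which mirrors the "longer and longer periods" behavior described in step 3.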

Detection types

Envoy supports the following outlier detection types:

Consecutive 5xx

If an upstream host returns some number of consecutive 5xx responses, it will be ejected. Note that in this case a 5xx means an actual 5xx response code, or an event that would cause the HTTP router to return one on the upstream’s behalf (reset, connection failure, etc.). The number of consecutive 5xx responses required for ejection is controlled by the outlier_detection.consecutive_5xx value.
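
As a rough illustration of inline detection, the following Python sketch keeps a per-host streak counter. The Consecutive5xxDetector class and its method names are hypothetical and are not part of Envoy. The sketch assumes the caller reports resets and connection failures as the 5xx code the router would return, and that the streak restarts after an ejection is signaled.

```python
class Consecutive5xxDetector:
    """Tracks consecutive 5xx responses for a single upstream host."""

    def __init__(self, consecutive_5xx: int):
        self.threshold = consecutive_5xx  # mirrors outlier_detection.consecutive_5xx
        self.streak = 0

    def on_response(self, status_code: int) -> bool:
        """Record one response; return True when the host should be ejected."""
        if 500 <= status_code <= 599:
            self.streak += 1
            if self.streak >= self.threshold:
                self.streak = 0  # assumption: counting restarts after an ejection
                return True
        else:
            # Any non-5xx response breaks the streak.
            self.streak = 0
        return False
```

Because the check runs on every response, ejection happens inline with the request that completes the streak rather than at a periodic interval.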

Consecutive Gateway Failure

If an upstream host returns some number of consecutive “gateway errors” (502, 503 or 504 status code), it will be ejected. Note that this includes events that would cause the HTTP router to return one of these status codes on the upstream’s behalf (reset, connection failure, etc.). The number of consecutive gateway failures required for ejection is controlled by the outlier_detection.consecutive_gateway_failure value.
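
The relationship between gateway failures and general 5xx responses can be sketched as two streak counters updated from the same response. This is an illustrative assumption rather than Envoy's code: the update_streaks function is hypothetical, and the sketch assumes that a non-gateway 5xx (such as 500) breaks the gateway streak because the gateway errors are then no longer consecutive.

```python
GATEWAY_ERRORS = frozenset({502, 503, 504})


def update_streaks(status_code: int, streak_5xx: int, streak_gateway: int):
    """Return the new (consecutive 5xx, consecutive gateway failure) streaks."""
    if 500 <= status_code <= 599:
        streak_5xx += 1
        if status_code in GATEWAY_ERRORS:
            streak_gateway += 1
        else:
            # Assumption: a 5xx that is not a gateway error breaks the gateway streak.
            streak_gateway = 0
    else:
        # A non-5xx response breaks both streaks.
        streak_5xx = 0
        streak_gateway = 0
    return streak_5xx, streak_gateway
```

Each streak would then be compared against its own setting, outlier_detection.consecutive_5xx or outlier_detection.consecutive_gateway_failure.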

Success Rate

Success Rate based outlier ejection aggregates success rate data from every host in a cluster and then, at given intervals, ejects hosts based on statistical outlier detection. A success rate will not be calculated for a host if its request volume over the aggregation interval is less than the outlier_detection.success_rate_request_volume value. Moreover, detection will not be performed for a cluster if the number of hosts with the minimum required request volume in an interval is less than the outlier_detection.success_rate_minimum_hosts value.
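
The two volume filters can be sketched as follows. The text above does not say which statistical test is applied, so this example assumes one common choice: a host is an outlier when its success rate falls below the cluster mean minus a multiple of the standard deviation. The success_rate_outliers function and its stdev_factor parameter are hypothetical names used only for illustration.

```python
from statistics import mean, pstdev


def success_rate_outliers(stats, request_volume, minimum_hosts, stdev_factor=1.9):
    """Return hosts whose success rate is a statistical outlier for one interval.

    stats maps host name -> (successful requests, total requests).
    stdev_factor is a hypothetical tuning knob, not a setting named in the text.
    """
    # Skip hosts below the minimum request volume (success_rate_request_volume).
    rates = {
        host: ok / total
        for host, (ok, total) in stats.items()
        if total > 0 and total >= request_volume
    }
    # Skip the whole cluster if too few hosts qualify (success_rate_minimum_hosts).
    if len(rates) < minimum_hosts:
        return []
    # Assumed test: flag hosts below mean - stdev_factor * standard deviation.
    threshold = mean(rates.values()) - stdev_factor * pstdev(rates.values())
    return sorted(host for host, rate in rates.items() if rate < threshold)
```

A function like this would be called once per aggregation interval, with the per-host counters reset afterwards.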

Ejection event logging

A log of outlier ejection events can optionally be produced by Envoy. This is extremely useful during daily operations since global stats do not provide enough information on which hosts are being ejected and for what reasons. The log is structured as protobuf-based dumps of OutlierDetectionEvent messages. Ejection event logging is configured in the Cluster manager outlier detection configuration.

Configuration reference