Load balancing

When a filter needs to acquire a connection to a host in an upstream cluster, the cluster manager uses a load balancing policy to determine which host is selected. The load balancing policies are pluggable and are specified on a per upstream cluster basis in the configuration. Note that if no active health checking policy is configured for a cluster, all upstream cluster members are considered healthy.

Supported load balancers

Weighted round robin

This is a simple policy in which each healthy upstream host is selected in round robin order. If weights are assigned to endpoints in a locality, then a weighted round robin schedule is used, where higher weighted endpoints will appear more often in the rotation to achieve the effective weighting.

Weighted least request

The least request load balancer uses different algorithms depending on whether any of the hosts have weight greater than 1.

  • all weights 1: An O(1) algorithm which selects N random healthy hosts as specified in the configuration (2 by default) and picks the host which has the fewest active requests (Research has shown that this approach is nearly as good as an O(N) full scan). This is also known as P2C (power of two choices). The P2C load balancer has the property that a host with the highest number of active requests in the cluster will never receive new requests. It will be allowed to drain until it is less than or equal to all of the other hosts.

  • all weights not 1: If any host in the cluster has a load balancing weight greater than 1, the load balancer shifts into a mode where it uses a weighted round robin schedule in which weights are dynamically adjusted based on the host’s request load at the time of selection (weight is divided by the current active request count. For example, a host with weight 2 and an active request count of 4 will have a synthetic weight of 2 / 4 = 0.5). This algorithm provides good balance at steady state but may not adapt to load imbalance as quickly. Additionally, unlike P2C, a host will never truly drain, though it will receive fewer requests over time.

    Note

    If all weights are not 1, but are the same (e.g., 42), Envoy will still use the weighted round robin schedule instead of P2C.

Ring hash

The ring/modulo hash load balancer implements consistent hashing to upstream hosts. The algorithm is based on mapping all hosts onto a circle such that the addition or removal of a host from the host set changes only affect 1/N requests. This technique is also commonly known as “ketama” hashing. A consistent hashing load balancer is only effective when protocol routing is used that specifies a value to hash on. The minimum ring size governs the replication factor for each host in the ring. For example, if the minimum ring size is 1024 and there are 16 hosts, each host will be replicated 64 times. The ring hash load balancer does not currently support weighting.

When priority based load balancing is in use, the priority level is also chosen by hash, so the endpoint selected will still be consistent when the set of backends is stable.

Note

The ring hash load balancer does not support locality weighted load balancing.

Maglev

The Maglev load balancer implements consistent hashing to upstream hosts. It uses the algorithm described in section 3.4 of this paper with a fixed table size of 65537 (see section 5.3 of the same paper). Maglev can be used as a drop in replacement for the ring hash load balancer any place in which consistent hashing is desired. Like the ring hash load balancer, a consistent hashing load balancer is only effective when protocol routing is used that specifies a value to hash on.

In general, when compared to the ring hash (“ketama”) algorithm, Maglev has substantially faster table lookup build times as well as host selection times (approximately 10x and 5x respectively when using a large ring size of 256K entries). The downside of Maglev is that it is not as stable as ring hash. More keys will move position when hosts are removed (simulations show approximately double the keys will move). With that said, for many applications including Redis, Maglev is very likely a superior drop in replacement for ring hash. The advanced reader can use this benchmark to compare ring hash versus Maglev with different parameters.

Random

The random load balancer selects a random healthy host. The random load balancer generally performs better than round robin if no health checking policy is configured. Random selection avoids bias towards the host in the set that comes after a failed host.

Original destination

This is a special purpose load balancer that can only be used with an original destination cluster. Upstream host is selected based on the downstream connection metadata, i.e., connections are opened to the same address as the destination address of the incoming connection was before the connection was redirected to Envoy. New destinations are added to the cluster by the load balancer on-demand, and the cluster periodically cleans out unused hosts from the cluster. No other load balancing policy can be used with original destination clusters.

Original destination host request header

Envoy can also pick up the original destination from a HTTP header called x-envoy-orignal-dst-host. Please note that fully resolved IP address should be passed in this header. For example if a request has to be routed to a host with IP address 10.195.16.237 at port 8888, the request header value should be set as 10.195.16.237:8888.

Overprovisioning Factor

Priority levels and localities are considered overprovisioned with this percentage. Envoy doesn’t consider a priority level or locality unhealthy until the percentage of healthy hosts multiplied by the overprovisioning factor drops below 100. The default value is 1.4, so a priority level or locality will not be considered unhealthy until the percentage of healthy endpoints goes below 72%.

Priority levels

During load balancing, Envoy will generally only consider hosts configured at the highest priority level. For each EDS LocalityLbEndpoints an optional priority may also be specified. When endpoints at the highest priority level (P=0) are healthy, all traffic will land on endpoints in that priority level. As endpoints for the highest priority level become unhealthy, traffic will begin to trickle to lower priority levels.

The system can be overprovisioned with a configurable overprovisioning factor, which currently defaults to 1.4 (this document will assume this value). If 80% of the endpoints in a priority level are healthy, that level is still considered fully healthy because 80*1.4 > 100. So, level 0 endpoints will continue to receive all traffic until less than ~71.4% of them are healthy.

The priority level logic works with integer health scores. The health score of a level is (percent of healthy hosts in the level) * (overprovisioning factor), capped at 100%. P=0 endpoints receive (level 0’s health score) percent of the traffic, with the rest flowing to P=1 (assuming P=1 is 100% healthy - more on that later). For instance, when 50% of P=0 endpoints are healthy, they will receive 50 * 1.4 = 70% of the traffic. The integer percents of traffic that each priority level receives are collectively called the system’s “priority load”. More examples (with 2 priority levels, P=1 100% healthy):

P=0 healthy endpoints Traffic to P=0 Traffic to P=1
100% 100% 0%
72% 100% 0%
71% 99% 1%
50% 70% 30%
25% 35% 65%
0% 0% 100%

Attention

In order for the load distribution algorithm and normalized total health calculation to work properly, each priority level must be able to handle (100% * overprovision factor) of the traffic: Envoy assumes a 100% healthy P=1 can take over entirely for an unhealthy P=0, etc. If P=0 has 10 hosts but P=1 only has 2 hosts, that assumption probably will not hold.

The health score represents a level’s current ability to handle traffic, after factoring in how overprovisioned the level originally was, and how many endpoints are currently unhealthy. Therefore, if the sum across all levels’ health scores is < 100, then Envoy believes there are not enough healthy endpoints to fully handle the traffic. This sum is called the “normalized total health.” When normalized total health drops below 100, traffic is distributed after normalizing the levels’ health scores to that sub-100 total. E.g. healths of {20, 30} (yielding a normalized total health of 50) would be normalized, and result in a priority load of {40%, 60%} of traffic.

P=0 healthy endpoints P=1 healthy endpoints Traffic to P=0 Traffic to P=1
100% 100% 100% 0%
72% 72% 100% 0%
71% 71% 99% 1%
50% 50% 70% 30%
25% 100% 35% 65%
25% 25% 50% 50%

As more priorities are added, each level consumes load equal to its normalized effective health, unless the healths of the levels above it sum to 100%, in which case it receives no load.

P=0 healthy endpoints P=1 healthy endpoints P=2 healthy endpoints Traffic to P=0 Traffic to P=1 Traffic to P=2
100% 100% 100% 100% 0% 0%
72% 72% 100% 100% 0% 0%
71% 71% 100% 99% 1% 0%
50% 50% 100% 70% 30% 0%
25% 100% 100% 35% 65% 0%
25% 25% 100% 35% 35% 30%
25% 25% 20% 36% 36% 28%

To sum this up in pseudo algorithms:

health(P_X) = min(100, 1.4 * 100 * healthy_P_X_backends / total_P_X_backends)
normalized_total_health = min(100, Σ(health(P_0)...health(P_X)))
priority_load(P_0) = min(100, health(P_0) / normalized_total_health)
priority_load(P_X) = min(100 - Σ(priority_load(P_0)..priority_load(P_X-1)),
                         health(P_X) / normalized_total_health)

Panic threshold

During load balancing, Envoy will generally only consider healthy hosts in an upstream cluster. However, if the percentage of healthy hosts in the cluster becomes too low, Envoy will disregard health status and balance amongst all hosts. This is known as the panic threshold. The default panic threshold is 50%. This is configurable via runtime as well as in the cluster configuration. The panic threshold is used to avoid a situation in which host failures cascade throughout the cluster as load increases.

Panic thresholds work in conjunction with priorities. If the number of healthy hosts in a given priority goes down, Envoy will try to shift some traffic to lower priorities. If it succeeds in finding enough healthy hosts in lower priorities, Envoy will disregard panic thresholds. In mathematical terms, if normalized total health across all priority levels is 100%, Envoy disregards panic thresholds and continues to distribute traffic load across priorities according to the algorithm described here. However, when normalized total health drops below 100%, Envoy assumes that there are not enough healthy hosts across all priority levels. It continues to distribute traffic load across priorities, but if a given priority level’s health is below the panic threshold, traffic will go to all hosts in that priority level regardless of their health.

The following examples explain the relationship between normalized total health and panic threshold. It is assumed that the default value of 50% is used for the panic threshold.

Assume a simple set-up with 2 priority levels, P=1 100% healthy. In this scenario normalized total health is always 100%, P=0 never enters panic mode, and Envoy is able to shift as much traffic as necessary to P=1.

P=0 healthy endpoints
Traffic
to P=0
P=0 in panic Traffic to P=1 P=1 in panic normalized total health
72% 100% NO 0% NO 100%
71% 99% NO 1% NO 100%
50% 70% NO 30% NO 100%
25% 35% NO 65% NO 100%
0% 0% NO 100% NO 100%

If P=1 becomes unhealthy, panic threshold continues to be disregarded until the sum of the health P=0 + P=1 goes below 100%. At this point Envoy starts checking panic threshold value for each priority.

P=0 healthy endpoints P=1 healthy endpoints Traffic to P=0 P=0 in panic Traffic to P=1 P=1 in panic normalized total health
72% 72% 100% NO 0% NO 100%
71% 71% 99% NO 1% NO 100%
50% 60% 50% NO 50% NO 100%
25% 100% 25% NO 75% NO 100%
25% 25% 50% YES 50% YES 70%
5% 65% 7% YES 93% NO 98%

Note that panic thresholds can be configured per-priority.

Zone aware routing

We use the following terminology:

  • Originating/Upstream cluster: Envoy routes requests from an originating cluster to an upstream cluster.
  • Local zone: The same zone that contains a subset of hosts in both the originating and upstream clusters.
  • Zone aware routing: Best effort routing of requests to an upstream cluster host in the local zone.

In deployments where hosts in originating and upstream clusters belong to different zones Envoy performs zone aware routing. There are several preconditions before zone aware routing can be performed:

  • Both originating and upstream cluster are not in panic mode.
  • Zone aware routing is enabled.
  • The originating cluster has the same number of zones as the upstream cluster.
  • The upstream cluster has enough hosts. See here for more information.

The purpose of zone aware routing is to send as much traffic to the local zone in the upstream cluster as possible while roughly maintaining the same number of requests per second across all upstream hosts (depending on load balancing policy).

Envoy tries to push as much traffic as possible to the local upstream zone as long as roughly the same number of requests per host in the upstream cluster are maintained. The decision of whether Envoy routes to the local zone or performs cross zone routing depends on the percentage of healthy hosts in the originating cluster and upstream cluster in the local zone. There are two cases with regard to percentage relations in the local zone between originating and upstream clusters:

  • The originating cluster local zone percentage is greater than the one in the upstream cluster. In this case we cannot route all requests from the local zone of the originating cluster to the local zone of the upstream cluster because that will lead to request imbalance across all upstream hosts. Instead, Envoy calculates the percentage of requests that can be routed directly to the local zone of the upstream cluster. The rest of the requests are routed cross zone. The specific zone is selected based on the residual capacity of the zone (that zone will get some local zone traffic and may have additional capacity Envoy can use for cross zone traffic).
  • The originating cluster local zone percentage is smaller than the one in upstream cluster. In this case the local zone of the upstream cluster can get all of the requests from the local zone of the originating cluster and also have some space to allow traffic from other zones in the originating cluster (if needed).

Note that when using multiple priorities, zone aware routing is currently only supported for P=0.

Locality weighted load balancing

Another approach to determining how to weight assignments across different zones and geographical locations is by using explicit weights supplied via EDS in the LocalityLbEndpoints message. This approach is mutually exclusive with the above zone aware routing, since in the case of locality aware LB, we rely on the management server to provide the locality weighting, rather than the Envoy-side heuristics used in zone aware routing.

When all endpoints are healthy, the locality is picked using a weighted round-robin schedule, where the locality weight is used for weighting. When some endpoints in a locality are unhealthy, we adjust the locality weight to reflect this. As with priority levels, we assume an over-provision factor (default value 1.4), which means we do not perform any weight adjustment when only a small number of endpoints in a locality are unhealthy.

Assume a simple set-up with 2 localities X and Y, where X has a locality weight of 1 and Y has a locality weight of 2, L=Y 100% healthy, with default overprovisioning factor 1.4.

L=X healthy endpoints Percent of traffic to L=X Percent of traffic to L=Y
100% 33% 67%
70% 33% 67%
69% 32% 68%
50% 26% 74%
25% 15% 85%
0% 0% 100%

To sum this up in pseudo algorithms:

health(L_X) = 140 * healthy_X_backends / total_X_backends
effective_weight(L_X) = locality_weight_X * min(100, health(L_X))
load to L_X = effective_weight(L_X) / Σ_c(effective_weight(L_c))

Note that the locality weighted pick takes place after the priority level is picked. The load balancer follows these steps:

  1. Pick priority level.
  2. Pick locality (as described in this section) within priority level from (1).
  3. Pick endpoint using cluster specified load balancer within locality from (2).

Locality weighted load balancing is configured by setting locality_weighted_lb_config in the cluster configuration and providing weights in LocalityLbEndpoints via load_balancing_weight.

This feature is not compatible with load balancer subsetting, since it is not straightforward to reconcile locality level weighting with sensible weights for individual subsets.

Load Balancer Subsets

Envoy may be configured to divide hosts within an upstream cluster into subsets based on metadata attached to the hosts. Routes may then specify the metadata that a host must match in order to be selected by the load balancer, with the option of falling back to a predefined set of hosts, including any host.

Subsets use the load balancer policy specified by the cluster. The original destination policy may not be used with subsets because the upstream hosts are not known in advance. Subsets are compatible with zone aware routing, but be aware that the use of subsets may easily violate the minimum hosts condition described above.

If subsets are configured and a route specifies no metadata or no subset matching the metadata exists, the subset load balancer initiates its fallback policy. The default policy is NO_ENDPOINT, in which case the request fails as if the cluster had no hosts. Conversely, the ANY_ENDPOINT fallback policy load balances across all hosts in the cluster, without regard to host metadata. Finally, the DEFAULT_SUBSET causes fallback to load balance among hosts that match a specific set of metadata.

Subsets must be predefined to allow the subset load balancer to efficiently select the correct subset of hosts. Each definition is a set of keys, which translates to zero or more subsets. Conceptually, each host that has a metadata value for all of the keys in a definition is added to a subset specific to its key-value pairs. If no host has all the keys, no subsets result from the definition. Multiple definitions may be provided, and a single host may appear in multiple subsets if it matches multiple definitions.

During routing, the route’s metadata match configuration is used to find a specific subset. If there is a subset with the exact keys and values specified by the route, the subset is used for load balancing. Otherwise, the fallback policy is used. The cluster’s subset configuration must, therefore, contain a definition that has the same keys as a given route in order for subset load balancing to occur.

This feature can only be enabled using the V2 configuration API. Furthermore, host metadata is only supported when using the EDS discovery type for clusters. Host metadata for subset load balancing must be placed under the filter name "envoy.lb". Similarly, route metadata match criteria use the "envoy.lb" filter name. Host metadata may be hierarchical (e.g., the value for a top-level key may be a structured value or list), but the subset load balancer only compares top-level keys and values. Therefore when using structured values, a route’s match criteria will only match if an identical structured value appears in the host’s metadata.

Examples

We’ll use simple metadata where all values are strings. Assume the following hosts are defined and associated with a cluster:

Host Metadata
host1 v: 1.0, stage: prod
host2 v: 1.0, stage: prod
host3 v: 1.1, stage: canary
host4 v: 1.2-pre, stage: dev

The cluster may enable subset load balancing like this:

---
name: cluster-name
type: EDS
eds_cluster_config:
  eds_config:
    path: '.../eds.conf'
connect_timeout:
  seconds: 10
lb_policy: LEAST_REQUEST
lb_subset_config:
  fallback_policy: DEFAULT_SUBSET
  default_subset:
    stage: prod
  subset_selectors:
  - keys:
    - v
    - stage
  - keys:
    - stage

The following table describes some routes and the result of their application to the cluster. Typically the match criteria would be used with routes matching specific aspects of the request, such as the path or header information.

Match Criteria Balances Over Reason
stage: canary host3 Subset of hosts selected
v: 1.2-pre, stage: dev host4 Subset of hosts selected
v: 1.0 host1, host2 Fallback: No subset selector for “v” alone
other: x host1, host2 Fallback: No subset selector for “other”
(none) host1, host2 Fallback: No subset requested

Metadata match criteria may also be specified on a route’s weighted clusters. Metadata match criteria from the selected weighted cluster are merged with and override the criteria from the route:

Route Match Criteria Weighted Cluster Match Criteria Final Match Criteria
stage: canary stage: prod stage: prod
v: 1.0 stage: prod v: 1.0, stage: prod
v: 1.0, stage: prod stage: canary v: 1.0, stage: canary
v: 1.0, stage: prod v: 1.1, stage: canary v: 1.1, stage: canary
(none) v: 1.0 v: 1.0
v: 1.0 (none) v: 1.0

Example Host With Metadata

An EDS LbEndpoint with host metadata:

---
endpoint:
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 8888
metadata:
  filter_metadata:
    envoy.lb:
      version: '1.0'
      stage: 'prod'

Example Route With Metadata Criteria

An RDS Route with metadata match criteria:

---
match:
  prefix: /
route:
  cluster: cluster-name
  metadata_match:
    filter_metadata:
      envoy.lb:
        version: '1.0'
        stage: 'prod'