Load-aware locality load balancing

Attention

This extension is work-in-progress and is not yet implemented.

The load-aware locality LB policy (envoy.load_balancing_policies.load_aware_locality) is a locality-picking load balancer designed for deployments where incoming load is not evenly distributed across zones, causing some localities to run hotter than others. It uses per-endpoint utilization from ORCA reports to weight each locality by its available headroom, preferring the local zone when load is balanced and spilling to remote zones as the local zone heats up.

Choosing this policy

Use this policy when upstream endpoints report ORCA utilization and Envoy should make cross-zone routing decisions from observed backend load. It is most useful when traffic should stay local while zones are similarly loaded, then spill toward remote localities with more available headroom as the local zone’s load rises.

Envoy offers three locality-selection strategies. The right choice depends on whether ORCA reporting is available, whether the control plane supplies locality weights, and whether the deployment must react to runtime load imbalance.

  • Pick this policy when upstream endpoints emit ORCA utilization and routing should react to runtime load imbalance from the data plane, with no control-plane involvement in locality weighting.

  • Pick zone-aware routing when only local-zone preference is needed and traffic is already balanced by other means. It has no ORCA dependency and is simpler to operate, but it only applies at priority 0 and does not react to backend load.

  • Pick WrrLocality when the control plane owns locality weights and should compute them centrally via EDS. Weights are static between updates and do not react to runtime load.

  • For deterministic routing (session affinity, consistent hashing), use ring hash or Maglev. They can also be configured as the endpoint-picking child policy of this policy, but see Caveats for the resulting behavior.

Architecture

The policy operates at two levels: locality picking (this policy, by ORCA-derived headroom) and endpoint picking (a configurable child policy). The split lets you pair load-aware locality selection with whatever endpoint-picking strategy fits your workload.

Request path:

Incoming request
  |
  +-- 1. Priority selection  (standard healthy/degraded priority load)
  |
  +-- 2. Locality selection  (this policy: weighted random by ORCA headroom)
  |
  +-- 3. Endpoint selection  (child LB)
  |
  v
Chosen upstream host

Implementation model

The policy is implemented as a ThreadAwareLoadBalancer:

  • A main-thread timer recomputes per-locality weights from ORCA data and publishes an immutable snapshot to worker threads via a thread-local slot. The snapshot carries a generation counter; workers rebuild per-locality child LB instances when membership changes bump the generation.

  • Worker threads read the latest snapshot lock-free on the request path, pick a locality, and delegate endpoint selection to the child LB for that locality.

  • ORCA reports flow through per-host HostLbPolicyData slots shared with other ORCA consumers; see ORCA data flow below for coexistence details.

  • When enable_oob_load_report is set, a cluster-level OOB manager runs on the main-thread dispatcher and owns one ORCA gRPC streaming session per host, reacting to membership updates to add and remove sessions. It decodes reports into the same shared report handler that backs the in-band path, so workers see OOB and in-band samples through the same HostLbPolicyData slots.

ORCA data flow

Upstream endpoints must report ORCA utilization. The policy supports both in-band and out-of-band reporting modes:

  • In-band (default). ORCA reports returned on the response headers or trailers of upstream responses. Sample rate is tied to the request rate to each host, so probing (remote_probe_fraction) is required to keep remote-locality data fresh.

  • Out-of-band (OOB). When enable_oob_load_report is set, the policy opens a per-host ORCA gRPC stream and the endpoint pushes reports every oob_reporting_period independent of request traffic. OOB reuses the same central ORCA client as CSWRR: a cluster-level OOB manager owns one streaming session per host, reacts to membership changes to add and remove sessions, and feeds every decoded report into the shared ORCA report handler. Because OOB decouples sample rate from request rate, remote_probe_fraction may safely be set to 0.

Either way reports land in the same per-host HostLbPolicyData slots, so weight computation is identical regardless of reporting mode.

Pairing this policy with CSWRR as endpoint_picking_policy yields two-level ORCA-aware balancing: locality selection by aggregate headroom, endpoint selection by per-endpoint capacity. Each consumer attaches independent HostLbPolicyData entries, so the two policies do not interfere.

Utilization is derived from each host’s ORCA report using the same extraction as CSWRR (precedence may be flipped by the envoy.reloadable_features.orca_weight_manager_use_named_metrics_first runtime feature). By default:

  1. application_utilization – value in [0, 1], used when reported and greater than 0.

  2. Named metrics via metric_names_for_computing_utilization – max of present values, used when application_utilization is not reported.

  3. cpu_utilization – final fallback.

Weight computation

On each weight_update_period tick, the main thread recomputes per- locality routing weights in five stages:

  1. Filter and average. Drop hosts whose last ORCA report is older than weight_expiration_period and average utilization across the remaining hosts in each locality. The locality’s EWMA state continues unchanged over the remaining reporters – there is no synthetic reset. If every host in a locality is stale, the locality is marked stale and falls back to host-count weighting in stage 3.

  2. Smooth. Apply EWMA smoothing per locality. The first sample for a locality is applied raw (no blending) so the policy begins differentiating within a single tick after cold start; subsequent samples blend with the prior smoothed value. At startup with no ORCA data every locality defaults to utilization 0 (full headroom), so weights reduce to host counts – equivalent to round-robin locality selection until the first reports arrive.

  3. Headroom weight. Compute each locality’s base weight as host_count * (1 - smoothed_util) – capacity-weighted headroom. Stale localities fall back to host_count so traffic keeps flowing without artificially boosting them.

  4. Local preference. If the local locality’s smoothed utilization is at most utilization_variance_threshold above the host-count-weighted remote average, snap to all-local routing. One-sided: if the local locality is less loaded than the remote localities, all-local routing always applies regardless of gap size.

  5. Probe floor. Enforce remote_probe_fraction by taking a slice of local weight and redistributing it across remote localities in proportion to host count – not headroom – so all remotes are sampled fairly. The amount taken from local is capped at the local weight itself.

Worked example

Three localities, default variance threshold 0.1:

  • A (local): 10 hosts, utilization 0.7

  • B (remote): 10 hosts, utilization 0.3

  • C (remote): 10 hosts, utilization 0.4

Host-count-weighted remote average: (0.3*10 + 0.4*10) / 20 = 0.35. Local (0.7) exceeds 0.35 + 0.1 = 0.45, so spillover is active.

Headroom weights: A=3, B=7, C=6, total=16. Traffic split: A ~19%, B ~44%, C ~37% – traffic flows from the hot local zone toward localities with more headroom.

If load rebalances and all localities converge to ~0.45, local is within threshold and the policy snaps to 100% local (minus the 3% remote probe). Asymmetric host counts shift the weighted average accordingly: a larger remote locality pulls the average toward its own utilization.

Pseudocode

The pseudocode below specifies the exact semantics of the five stages above:

# Per-tick smoothing factor (consistent settling regardless of tick rate)
alpha = 1 - exp(-weight_update_period / smoothing_time_constant)

# Per-host sample validity filter (excludes hosts whose last ORCA report
# is older than weight_expiration_period; if disabled, all hosts qualify)
valid(h)         = (now - last_report_time(h)) <= weight_expiration_period
valid_hosts(L)   = { h in hosts(L) : valid(h) }

# Per-locality utilization (EWMA smoothed; first sample applied raw so
# the policy reacts within one tick instead of waiting ~5 time constants
# to converge from the cold-start prior of 0)
if valid_hosts(L) is empty:
    smoothed_util(L) = prev_smoothed_util(L)   # carry prior value
    stale(L) = true
else:
    raw_util(L) = avg over h in valid_hosts(L) of util(h)
    if no prior smoothed_util(L):              # first sample for L
        smoothed_util(L) = raw_util(L)
    else:
        smoothed_util(L) = alpha * raw_util(L)
                           + (1 - alpha) * prev_smoothed_util(L)
    stale(L) = false

# Base headroom weight; stale localities use host_count baseline
if stale(L):
    base_weight(L) = host_count(L)
else:
    headroom(L)    = max(0, 1 - smoothed_util(L))
    base_weight(L) = host_count(L) * headroom(L)

total_base_weight  = sum(base_weight(L_i) for all L_i)
remote_host_count  = sum(host_count(R_i) for all remote R_i)

# All-overloaded fallback
if total_base_weight == 0:
    adjusted_weight(L_i) = host_count(L_i)
else:
    adjusted_weight(L_i) = base_weight(L_i)

    if local exists and remote_host_count > 0:
        # Local preference (one-sided: local must not be too far ABOVE remote)
        remote_weighted_avg = sum(smoothed_util(R_i) * host_count(R_i))
                              / remote_host_count
        if smoothed_util(local) <= remote_weighted_avg + utilization_variance_threshold:
            adjusted_weight(local) = total_base_weight
            adjusted_weight(R_i) = 0


        # Remote probe enforcement. Conserve total weight: take only as
        # much from local as it actually has, and redistribute exactly
        # that amount across remotes.
        total_adjusted_weight = sum(adjusted_weight(L_i) for all L_i)
        remote_weight = sum(adjusted_weight(R_i) for all remote R_i)
        remote_share = remote_weight / total_adjusted_weight
        if remote_share < remote_probe_fraction:
            deficit          = remote_probe_fraction * total_adjusted_weight - remote_weight
            take_from_local  = min(deficit, adjusted_weight(local))
            adjusted_weight(local) -= take_from_local
            for each remote R_i:
                adjusted_weight(R_i) += take_from_local * host_count(R_i) / remote_host_count

routing_share(L) = adjusted_weight(L) / sum(adjusted_weight(L_i) for all L_i)

Host-count proportional probe redistribution is intentional: when probing for fresh data, the policy samples remote localities fairly rather than biasing toward localities whose current (possibly stale) utilization happens to look lower.

Example configuration

Minimal configuration with round robin endpoint picking:

load_balancing_policy:
  policies:
  - typed_extension_config:
      name: envoy.load_balancing_policies.load_aware_locality
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.load_balancing_policies.load_aware_locality.v3.LoadAwareLocality
        endpoint_picking_policy:
          policies:
          - typed_extension_config:
              name: envoy.load_balancing_policies.round_robin
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.load_balancing_policies.round_robin.v3.RoundRobin

Configuration parameters

Parameter

Default

Description

endpoint_picking_policy

(required)

Child LB policy for selecting an endpoint within the chosen locality. Any LB policy may be configured here, including ring hash and Maglev, though policies that build cluster-wide structures will operate over only the chosen locality’s host slice. See Caveats.

weight_update_period

1 s

How often locality weights are recomputed from ORCA data. Must be at least 100 ms.

metric_names_for_computing_utilization

(unset)

Named ORCA metrics used to compute utilization when application_utilization is not reported. The max of matching values is taken. Map entries use <map_field_name>.<map_key> (e.g. named_metrics.foo). See Weight computation for precedence.

utilization_variance_threshold

0.1

When the local locality’s utilization exceeds the host-count-weighted remote average by no more than this threshold, all traffic routes locally. One-sided check: if the local locality is less loaded than the remote localities, all-local routing always applies. Range: [0, 1].

smoothing_time_constant

5 s

EWMA time constant for per-locality utilization smoothing. The per-tick smoothing factor is derived as alpha = 1 - exp(-weight_update_period / smoothing_time_constant), so settling time is independent of the configured tick rate. Larger values produce more stable weights; smaller values react faster. Must be greater than 0 s.

remote_probe_fraction

0.03

Minimum fraction of traffic sent to non-local localities to keep ORCA data fresh in all-local mode. The deficit is redistributed proportionally to host count. Set to 0 to disable (safe only with out-of-band ORCA reporting or when cross-zone traffic must be strictly avoided). Range: [0, 1). See Caveats for scaling notes.

weight_expiration_period

3 minutes

Per-host sample validity window. Hosts that have not reported within this duration are excluded from their locality’s utilization aggregation. The locality’s EWMA continues over the remaining reporting hosts; if every host in a locality is stale, the locality falls back to host-count-proportional weighting. Tune higher to tolerate longer reporting gaps; tune lower to prune draining backends faster. Set to 0 s to disable expiration.

enable_oob_load_report

(unset / false)

Enables out-of-band (OOB) ORCA utilization reporting. When set, the policy opens a per-host ORCA gRPC stream and the endpoint pushes reports on its own schedule rather than piggybacking on responses. When unset, only in-band reports on response headers/trailers are consumed. See ORCA data flow.

oob_reporting_period

10 s

Requested load-reporting interval, used only when enable_oob_load_report is true. The upstream may report less frequently than requested.

Priority support

The policy respects Envoy’s priority levels. Priority selection happens first via the standard healthy/degraded priority load calculation; locality selection then applies within the chosen priority. Unlike zone-aware routing (priority 0 only), this policy applies at all priority levels.

Three independent weight sets are maintained per priority:

  • Healthy – common case, healthy hosts only.

  • Degraded – when Envoy selects degraded hosts.

  • All-host – when the priority is in panic mode.

Each set tracks its own per-locality utilization average and headroom weight, computed from the same per-host ORCA data in a single tick pass.

Caveats and known limitations

  • Probing is required with in-band reporting. In the default in-band mode, a locality only produces fresh ORCA samples when it receives traffic, so remote_probe_fraction must stay above 0 to keep remote localities reporting. Set it to 0 only when enable_oob_load_report is on (OOB streams report independently of traffic) or when cross-zone traffic must be strictly avoided.

  • Hash-based child policies. Ring hash and Maglev build their hash structures from the host set they are given. With this policy, that set is the chosen locality’s hosts, not the full cluster. The same request hash will not necessarily map to the same endpoint cluster-wide, so consistency guarantees apply only within a locality.

  • Probe-fraction scaling. remote_probe_fraction is a global value divided across all remote localities, then again across each locality’s hosts. The per-host probe rate is therefore approximately Total RPS * remote_probe_fraction / (N remotes * Hosts/locality), and the expected interval between consecutive probes to a given host is the reciprocal. When that interval exceeds weight_expiration_period, hosts are likely to go stale between probes and the locality falls back to host-count weighting – defeating the load-awareness this policy provides.

    Approximate sample intervals per host at the default remote_probe_fraction of 0.03:

    Total RPS

    N remotes

    Hosts/locality

    Sample interval/host

    1000

    3

    10

    ~1 s

    1000

    100

    10

    ~33 s

    100

    100

    10

    ~5.5 min

    The top row is comfortably faster than the 3-minute default expiration. The middle row is still safe but leaves less headroom. In the bottom row, samples expire before the next probe arrives, so remote localities will alternate between fresh data and host-count fallback every few ticks. To avoid this, either reduce locality count, raise remote_probe_fraction, raise weight_expiration_period to tolerate longer gaps, or enable OOB ORCA reporting via enable_oob_load_report (which decouples sample rate from request rate entirely).

  • Variance-threshold oscillation. Workloads sitting near the utilization_variance_threshold boundary can theoretically oscillate between snap-to-local and spillover modes across consecutive ticks. EWMA smoothing dampens this in practice; tune smoothing_time_constant higher if oscillation is observed.

  • Subsetting. Load balancer subsetting partitions hosts orthogonally to locality boundaries. The policy will operate over the post-subset host slice; per-locality weights are computed over whatever hosts remain after subsetting filters them. This is rarely the behavior subset users expect.

Statistics

The policy emits stats under cluster.<cluster_name>.load_aware_locality.*:

Counter

Increments when

recompute_total

Per main-thread tick that recomputes weights.

all_overloaded_total

Per tick where every locality’s headroom was 0 (fallback to host-count weighting).

local_preferred_total

Per tick where the variance-threshold check snapped routing to 100% local.

probe_active_total

Per tick where remote_probe_fraction redistribution kicked in.

stale_locality_total

Incremented once per stale locality per tick (a locality whose hosts were all stale and fell back to host-count baseline). A 5-locality cluster with 2 stale localities adds 2 each tick.

Migrating from zone-aware routing? The closest counter mappings:

Zone-aware counter

Load-aware-locality equivalent

lb_zone_routing_all_directly

load_aware_locality.local_preferred_total

lb_recalculate_zone_structures

load_aware_locality.recompute_total