What are best practices for benchmarking Envoy?

There is no single QPS, latency or throughput overhead that can characterize a network proxy such as Envoy. Instead, any measurements need to be contextually aware, ensuring an apples-to-apples comparison with other systems by configuring and load testing Envoy appropriately. As a result, we can’t provide a canonical benchmark configuration, but instead offer the following guidance:

A release Envoy binary should be used. If building, please ensure that -c opt is used on the Bazel command line. When consuming Envoy point releases, make sure you are using the latest point release; given the pace of Envoy development it’s not reasonable to pick older versions when making a statement about Envoy performance. Similarly, if working on a main build, please perform due diligence and ensure no regressions or performance improvements have landed proximal to your benchmark work and that your are close to HEAD.
The --concurrency Envoy CLI flag should be unset (providing one worker thread per logical core on your machine) or set to match the number of cores/threads made available to other network proxies in your comparison.
Disable circuit breaking. A common issue during benchmarking is that Envoy’s default circuit breaker limits are low, leading to connection and request queuing.
Disable generate_request_id.
Disable dynamic_stats. If you are measuring the overhead vs. a direct connection, you might want to consider disabling all stats via reject_all.
Ensure that the networking and HTTP filter chains are reflective of comparable features in the systems that Envoy is being compared with.
Ensure that TLS settings (if any) are realistic and that consistent cyphers are used in any comparison. Session reuse may have a significant impact on results and should be tracked via listener SSL stats.
Ensure that HTTP/2 settings, in particular those that affect flow control and stream concurrency, are consistent in any comparison. Ideally taking into account BDP and network link latencies when optimizing any HTTP/2 settings.
Verify in the listener and cluster stats that the number of streams, connections and errors matches what is expected in any given experiment.
Make sure you are aware of how connections created by your load generator are distributed across Envoy worker threads. This is especially important for benchmarks that use low connection counts and perfect keep-alive. You should be aware that Envoy will allocate all streams for a given connection to a single worker thread. This means, for example, that if you have 72 logical cores and worker threads, but only a single HTTP/2 connection from your load generator, then only 1 worker thread will be active.
Make sure request-release timing expectations line up with what is intended. Some load generators produce naturally jittery and/or batchy timings. This might end up being an unintended dominant factor in certain tests.
The specifics of how your load generator reuses connections is an important factor (e.g. MRU, random, LRU, etc.) as this impacts work distribution.
If you’re trying to measure small (say < 1ms) latencies, make sure the measurement tool and environment have the required sensitivity and the noise floor is sufficiently low.
Be critical of your bootstrap or xDS configuration. Ideally every line has a motivation and is necessary for the benchmark under consideration.
Consider using Nighthawk as your load generator and measurement tool. We are committed to building out benchmarking and latency measurement best practices in this tool.
Examine perf profiles of Envoy during the benchmark run, e.g. with flame graphs. Verify that Envoy is spending its time doing the expected essential work under test, rather than some unrelated or tangential work.
Familiarize yourself with latency measurement best practices. In particular, never measure latency at max load, this is not generally meaningful or reflecting of real system performance; aim to measure below the knee of the QPS-latency curve. Prefer open vs. closed loop load generators.
Avoid benchmarking crimes.