engnotes.dev
Tail Latency & System Behavior · Part 3 of 4

Hedged Requests & Speculative Execution

A deterministic Java hedging simulation where a p95 threshold cuts p99 from 200 ms to 43 ms for 3.7% extra load, while a p99 threshold improves nothing and burns the most wasted work. Hedging is a latency technique you buy with capacity.

Jagdish Salgotra
2026-06-14·7 min read·~1,200 words

Code repository: production-systems-labs

Production System Labs · Series: Tail Latency & System Behavior, Post 3. Runnable Java experiments on the failure patterns that show up under real load. Deterministic outputs, checked-in CSVs, reproducible on any machine.


The slow path you cannot debug

A service is mostly fast and occasionally, unpredictably, slow. The median is fine, p95 is fine, and then a thin slice of requests falls into a slow path and sits there while everything else sails through. You cannot find a bug to fix because there is no bug. The work is just sometimes slow, and the slowness is not correlated with anything you can act on in the moment.

This is the baseline the lab starts from, and it is the same downstream sampler from Post 1: most requests land near 20 ms, and a small fraction take the slow path. The deterministic baseline snapshot in golden/post3/post3-baseline.csv shows exactly that personality, calm in the middle and ugly in the deep tail:

text
elapsed_s  p50_ms  p95_ms  p99_ms  p999_ms  throughput_rps
1          20.0    29.0    34.0    1532.0   1000.0
2          20.0    28.0    36.0     972.0   1000.0
3          20.0    28.0   200.0     653.0   1000.0
4          20.0    29.0   200.0     919.0   1000.0
5          20.0    28.0    34.0    1383.0   1000.0

The throughput_rps column is the deterministic harness request rate, not a measured closed-loop throughput claim. HedgingScenario.java sets it from concurrency * 100, so this run records 10 * 100 = 1000 synthetic requests per second.

Hedging is the obvious move here. If a request has not answered by some deadline, send a second copy and take whichever comes back first. The slow path on one attempt is usually independent of the slow path on a retry, so a duplicate often finishes while the original is still stuck.
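A rough way to see why independence matters, assuming each attempt falls into the slow path with probability $p$ independently of the other. This is an idealization: the symbol $p$ and the 1% figure are illustrative, not lab output.

```latex
\[
P(\text{both attempts slow}) = p \cdot p = p^{2},
\qquad
p = 0.01 \;\Rightarrow\; p^{2} = 0.0001 .
\]
```

Under that assumption a hedged request is stuck only when both copies are slow, so a 1-in-100 slow path becomes roughly 1-in-10,000. When the slow paths are correlated, the multiplication no longer holds and the benefit shrinks.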

The technique is real and it works. The interesting part of this post is not that it works. It is what it costs, and the two ways teams get the cost wrong.

The cost table is the reason this experiment exists. Hedging changes the tail by spending duplicate work, so the duplicate work has to be measured next to the tail improvement.

I once watched a team flip hedging on fleet-wide to chase a p99 problem and quietly add a large fraction of duplicate downstream load in the process, because nobody put a number on the extra work before shipping it.


Too early floods, too late moves nothing

The two failure modes mirror each other: hedge too early and you flood your own dependencies with duplicate work for very little extra tail improvement; hedge too late and the few hedges you do send go to requests that are already deep in the slow path, so each one wastes more work and the tail does not move at all.

The lab makes both ends concrete. HedgingScenario.java sweeps the hedge threshold across p90, p95, and p99 of the baseline distribution and records the trade-off in golden/post3/post3-cost-table.csv:

text
threshold  threshold_ms  baseline_p99_ms  hedged_p99_ms  p99_improvement_pct  extra_load_pct  hedged_requests  total_requests  wasted_work_ms  wasted_completions
p90        27            200.0            41.0           79.5                 7.5             377              5000            3412            0
p95        29            200.0            43.0           78.5                 3.7             186              5000            2987            0
p99        200           200.0            200.0          0.0                  0.6             31               5000            6845            0

The last row deserves a careful read, because it is the one that surprises people. Setting the hedge delay at the p99 of the distribution (200 ms) fires the second attempt only 31 times out of 5000, which looks cheap at 0.6% extra load. But it improves p99 by 0.0%, and it burns 6845 ms of wasted work, more than either of the earlier, more aggressive thresholds. That is not a contradiction. By the time you wait 200 ms to hedge, the only requests that trip the hedge are the genuinely deep-tail ones, the second attempt has to run a full duplicate from scratch, and it rarely beats the original by enough to matter. You paid the most per hedge for the least benefit.

The p90 row is the opposite mistake in milder form. A 27 ms threshold fires 377 times, costs 7.5% extra load, and cuts p99 to 41 ms. But you are now hedging a meaningful slice of normal traffic, spending real downstream capacity to shave a tail that the p95 threshold shaves almost as well (43 ms) for about half the load (3.7%).
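The "most per hedge" claim can be checked by dividing the cost table's own columns. A small sketch: the per-hedge ratio is a derived figure, not a column the lab emits, and the class and method names are illustrative.

```java
import java.util.Locale;

public class HedgeCostPerHedge {
    /** Average wasted work per hedged request, in milliseconds. */
    public static double wastedMsPerHedge(long wastedWorkMs, int hedgedRequests) {
        return (double) wastedWorkMs / hedgedRequests;
    }

    /** Extra load as a percentage of all requests. */
    public static double extraLoadPct(int hedgedRequests, int totalRequests) {
        return 100.0 * hedgedRequests / totalRequests;
    }

    public static void main(String[] args) {
        // values copied from the cost table: threshold, hedged_requests, wasted_work_ms
        String[] names = {"p90", "p95", "p99"};
        int[] hedged = {377, 186, 31};
        long[] wasted = {3412, 2987, 6845};
        int total = 5000;
        for (int i = 0; i < names.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                    "%s: %.1f%% extra load, %.1f ms wasted per hedge",
                    names[i], extraLoadPct(hedged[i], total),
                    wastedMsPerHedge(wasted[i], hedged[i])));
        }
    }
}
```

That works out to roughly 9 ms, 16 ms, and 221 ms of wasted work per hedge for p90, p95, and p99. The p99 threshold sends the fewest duplicates, and each one is by far the most expensive.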


A hedge is not "always send two"

The mechanism is small enough to read, and it is where the cost discipline has to live. A hedge is not "always send two." It is "send one, and only send the second if the first is late." The primitive in HedgedRequest.java makes that conditional explicit:

java
public HedgedRequestResult execute(long primaryLatencyMs, long secondaryLatencyMs, long hedgeDelayMs) {
    // fast path: the primary answers within the hedge delay, so no duplicate is sent
    if (primaryLatencyMs <= hedgeDelayMs) {
        sleepIfLive(primaryLatencyMs);
        return new HedgedRequestResult(primaryLatencyMs, false, 1, 0, 0, false);
    }
    // primary is past the threshold: launch the duplicate and race them
    ...
}
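The race branch is elided above. As a sketch of what the caller would observe, assume the common model in which the duplicate starts once the hedge delay has passed and the first attempt to finish wins. That model, and the class and method names here, are assumptions for illustration, not the lab's code.

```java
public class HedgedLatencyModel {
    /**
     * Latency the caller observes under the assumed model: the duplicate is
     * launched only after the primary has run past hedgeDelayMs, and the
     * first attempt to finish wins.
     */
    public static long observedLatencyMs(long primaryMs, long secondaryMs, long hedgeDelayMs) {
        if (primaryMs <= hedgeDelayMs) {
            return primaryMs; // primary answered in time, no duplicate sent
        }
        // the duplicate starts at hedgeDelayMs, so it finishes at hedgeDelayMs + secondaryMs
        return Math.min(primaryMs, hedgeDelayMs + secondaryMs);
    }

    public static void main(String[] args) {
        System.out.println(observedLatencyMs(20, 20, 29));   // 20: fast primary, no hedge
        System.out.println(observedLatencyMs(1500, 20, 29)); // 49: duplicate wins
        System.out.println(observedLatencyMs(200, 900, 29)); // 200: duplicate is slower
    }
}
```

The second call is the case hedging exists for: the request pays the hedge delay plus one normal attempt instead of the full slow path.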

The race itself is a structured-concurrency operation, not a pile of futures and manual cancellation. The wrapper in ScopedRunner.java forks both attempts and returns the first success, and the scope cancels the loser on the way out:

java
public static <T> T hedge(Callable<T> primary, Callable<T> secondary, WastedWorkCounter wasted) {
    // one scope owns both attempts; closing it cancels whichever is still running
    try (var scope = StructuredTaskScope.open(StructuredTaskScope.Joiner.<T>anySuccessfulResultOrThrow())) {
        Subtask<T> primaryTask = scope.fork(primary);
        Subtask<T> secondaryTask = scope.fork(secondary);
        // returns as soon as either attempt succeeds
        T result = scope.join();
        // both attempts completed: one finished result is discarded, so count it as wasted
        if (primaryTask.state() == Subtask.State.SUCCESS
                && secondaryTask.state() == Subtask.State.SUCCESS) {
            wasted.increment();
        }
        return result;
    }
    ...
}

anySuccessfulResultOrThrow is the whole point: the join returns as soon as either attempt succeeds, and the scope tears the other one down. The wasted-work counter is there precisely because eager cancellation is best-effort. Sometimes the second attempt has already finished by the time the first one wins, and that completed-but-discarded work is load you paid for and threw away.

When the threshold is set well, that cost is small and the payoff is large. The p95 threshold run in golden/post3/post3-hedged-p95.csv shows the deep tail collapsing while the middle of the distribution does not move at all:

text
elapsed_s  p50_ms  p95_ms  p99_ms  p999_ms  throughput_rps
1          20.0    29.0    34.0    54.0     1000.0
2          20.0    28.0    36.0    55.0     1000.0
3          20.0    28.0    45.0    55.0     1000.0
4          20.0    29.0    47.0    55.0     1000.0
5          20.0    28.0    34.0    58.0     1000.0

The p99.9 column is the clean comparison against the baseline above. Baseline p99.9 swings between 653 ms and 1532 ms. With a p95 hedge it sits between 54 ms and 58 ms. The slow path did not get faster; the request stopped waiting for it.


Price the duplicate before you ship it

Hedging is a latency technique you buy with capacity, so you do not get to turn it on without pricing it.

The hedge delay should come from the distribution you actually have, not a round number. A threshold near p95 in this lab bought a 78.5% p99 improvement for 3.7% extra load; the same idea at p90 cost 7.5% for barely more benefit, and at p99 it cost the most wasted work for none.
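Deriving the delay from observed latencies can be as small as a percentile lookup. A minimal sketch, assuming the nearest-rank definition; the lab's own percentile method is not shown in this post, and the class name and sample values are hypothetical.

```java
import java.util.Arrays;

public class HedgeThreshold {
    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    public static long percentile(long[] latenciesMs, double pct) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // hypothetical samples, not lab data: 19 fast requests and one slow one
        long[] samples = new long[20];
        Arrays.fill(samples, 20);
        samples[19] = 900;
        System.out.println(percentile(samples, 95)); // 20
        System.out.println(percentile(samples, 99)); // 900
    }
}
```

With so few samples the p99 is simply the slowest request, which is one reason to compute the threshold over a window large enough to populate the tail.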

Extra load and wasted work belong next to tail improvement as first-class metrics. A p99 number that improves while duplicate load doubles is not a win; it is a transfer of pain to the dependency. A duplicate-rate cap also belongs in the design, because a downstream slowdown is exactly when hedging fires most and exactly when the dependency can least afford a second copy of everything.
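One way a duplicate-rate cap could look is a budget that refuses a hedge once duplicates exceed a fixed percentage of requests. This is a sketch under that assumption: HedgeBudget is a hypothetical class, not part of the lab, and it counts over the whole process lifetime rather than a time window.

```java
import java.util.concurrent.atomic.AtomicLong;

public class HedgeBudget {
    private final int maxHedgePercent; // e.g. 5 = at most 5 duplicates per 100 requests
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong hedges = new AtomicLong();

    public HedgeBudget(int maxHedgePercent) {
        this.maxHedgePercent = maxHedgePercent;
    }

    /** Call once for every primary request sent. */
    public void recordRequest() {
        requests.incrementAndGet();
    }

    /** Returns true and spends budget only if one more hedge stays within the cap. */
    public boolean tryAcquireHedge() {
        while (true) {
            long current = hedges.get();
            // integer comparison of (current + 1) / requests against maxHedgePercent / 100
            if ((current + 1) * 100 > (long) maxHedgePercent * requests.get()) {
                return false; // over budget: let the primary run alone
            }
            if (hedges.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    public static void main(String[] args) {
        HedgeBudget budget = new HedgeBudget(5);
        int granted = 0;
        for (int i = 0; i < 100; i++) {
            budget.recordRequest();
            if (budget.tryAcquireHedge()) { // worst case: every request wants a hedge
                granted++;
            }
        }
        System.out.println(granted); // 5
    }
}
```

Even when every request asks for a duplicate, as it would during a downstream slowdown, the dependency sees at most 5% extra load. A production version would more likely use a sliding window or token bucket so the cap tracks recent traffic.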

Non-idempotent work is the wrong place for this primitive, and correlated slow paths are not much better. If the second attempt hits the same saturated shard, capacity has been spent to wait for the same stall twice.

Hedging treats variance, not saturation. If the tail comes from a queue that is over its service rate, a second request joins the same queue and makes it worse. Post 2 is the warning label for that case: fix the saturation first, then hedge the residual variance.

Hedging earns its place when slow is rare, independent, and expensive to wait on. The moment it stops being rare, the cost model in the table above is the thing that tells you to back off.

The command that regenerates the numbers used here is:

bash
./gradlew :latency-lab:runHedgedRequests \
  -Pargs="--deterministic --duration 5s --concurrency 10 --output-dir ./results/hedged-requests --snapshot-interval 1s --hedge-threshold p95"

The generated artifacts used in this post are:

text
golden/post3/post3-baseline.csv
golden/post3/post3-hedged-p95.csv
golden/post3/post3-cost-table.csv
golden/post3/post3-latency-comparison.png
golden/post3/post3-cost-table.png

GoldenOutputTest.java runs this deterministic Article 3 path and compares post3-baseline.csv, post3-hedged-p95.csv, and post3-cost-table.csv against the checked-in golden files. A fresh run for this post wrote results/article-grounding/post3/manifest.json; all three Article 3 CSV artifacts had golden_match: true, with SHA-256 values matching their golden SHA-256 values.

References used for the underlying hedging ideas, not for the generated lab numbers:

  • Jeffrey Dean and Luiz André Barroso, The Tail at Scale.
  • OpenJDK, JEP 505: Structured Concurrency (Fifth Preview).