Tail Latency & System Behavior · Part 1 of 6

Why Average Latency Lies

A deterministic Java simulation where baseline p99=34ms becomes fan-out p99=597ms without changing the downstream latency sampler. Average latency is structurally incapable of catching that tail pain.

Jagdish Salgotra

2026-05-31·9 min read·~1,300 words

Next · Part 2 →Queueing Theory for Engineers

Code repositoryproduction-systems-labs

#tail-latency-system-behavior

Written by

Jagdish Salgotra

Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.

all posts

Keep reading · rest of the series

Was this article helpful? or email →

anonymous · no account needed

Tail Latency & System Behavior · Part 1 of 6

Why Average Latency Lies

Jagdish Salgotra

2026-05-31·9 min read·~1,300 words

Production System Labs - Series Tail Latency System Behavior, Runnable Java experiments on the failure patterns that show up under real load. Deterministic outputs, checked-in CSVs, reproducible on any machine.

The average hides the failure

I have lost real debugging time to this pattern.

Not because the graphs looked dramatic. They looked boring. p50 was low, average latency was stable, host graphs were calm, and that made everyone suspicious of the wrong thing while a few users were still stuck inside requests that felt broken.

So this lab makes that specific failure small enough to inspect.

The numbers are synthetic by design: deterministic output from the checked-in Java simulation and golden CSVs under golden/post1/, not production telemetry. The pattern is the real part: one mostly fast downstream path, a small slow tail, and a top-level endpoint that waits for several downstream calls before it answers.

Run the baseline and final snapshot: p50 20 ms, p99 34 ms. The middle looks fine. But p99.9 is already at 1532 ms in the first second of the run. The tail is already there before fan-out enters the picture.

Now run the fan-out version with the exact same downstream sampler, except every top-level request waits for the slowest of five downstream calls. p50 only moves to 26 ms, but p99 jumps to 597 ms.

Nothing in the downstream service got slower. The only difference was the request shape.

That single change is the whole lesson. Once a user request waits on multiple branches, they do not feel the median child call. They feel the slowest one in that moment, and the request boundary now has its own distribution.

This is the quiet failure mode the lab isolates:

A mostly fast latency distribution.
A small but real slow tail.
A top-level request that waits for several downstream calls.
A dashboard habit that only watches the middle.

I left out the usual production chaos on purpose: GC pauses, retry storms, noisy neighbors, bad deploys, connection pool weirdness. Add all of that and the first lesson gets easier to dodge. Composition by itself is already enough to turn a small downstream tail into something users notice.

The baseline chart is intentionally plain. p50 sits near the normal path. The tail is visible only if you read p99 and p99.9.

The fan-out chart is the same lab after changing the request shape. Five branches. One top-level response. The max branch wins.

Fan-out turns the same downstream distribution into top-level tail amplification

The mean throws away the part we need

Average latency is useful for some questions. It helps with broad capacity accounting. It can tell you something about total work. It is not useless.

It is just the wrong first number for this failure.

text

mean latency = sum(all request latencies) / request count

That formula spreads the bad requests across the good ones. Users do not experience that blend. They experience the one request they are waiting on.

The lab sampler in LatencyInjector.java is small enough to read without guessing what it does. Most requests land near 20 ms: normal mean 20 ms, stddev 5 ms. One percent take the slow path.

java

if (random.nextDouble() < TAIL_PROBABILITY) {
    double tailSample = -TAIL_MEAN_MS * Math.log(1.0 - random.nextDouble());
    long clamped = Math.round(tailSample);
    return Math.max(TAIL_MIN_MS, Math.min(TAIL_MAX_MS, clamped));
} else {
    double normalSample = random.nextGaussian() * NORMAL_STDDEV_MS + NORMAL_MEAN_MS;
    long clamped = Math.round(normalSample);
    return Math.max(NORMAL_MIN_MS, Math.min(NORMAL_MAX_MS, clamped));
}

It is a controlled input: fast path, rare slow path, repeatable output.

The baseline CSV is the part I would look at first:

text

elapsed_s  p50_ms  p95_ms  p99_ms  p999_ms  throughput_rps
1          20.0    29.0    34.0    1532.0   1000.0
2          20.0    28.0    36.0     972.0   1000.0
3          20.0    28.0   200.0     653.0   1000.0
4          20.0    29.0   200.0     919.0   1000.0
5          20.0    28.0    34.0    1383.0   1000.0

The middle is stable. The tail is not gone. The throughput_rps column is the injected open-loop arrival rate, not a measured closed-loop figure; queueing and backpressure are not modeled in this experiment and are covered in a later post.

At 1000 rps, p99.9 is not trivia. It is roughly one request per second at or beyond that boundary. If that request is login, checkout, payment, or search, someone is waiting for it.

The fan-out math is not complicated:

text

P(at least one of 5 branches is slow) = 1 - P(all 5 branches are fast)
                                      = 1 - (0.99)^5
                                      = 4.9%

A one-percent slow path in one dependency becomes close to a five-percent chance that the top-level request sees at least one slow branch. This is why "each dependency is within target" can be true while the composed endpoint is still bad.

The code in TailAmplificationScenario.java is just that math in request form:

java

List<Callable<Long>> downstream = new ArrayList<>(FAN_OUT_DEGREE);
for (int j = 0; j < FAN_OUT_DEGREE; j++) {
    downstream.add(() -> {
        long latencyMs = downstreamInjector.sampleLatencyMs();
        Thread.sleep(latencyMs);
        return latencyMs;
    });
}

List<Long> results = ScopedRunner.fanOut(downstream);
long maxLatency = results.stream().mapToLong(Long::longValue).max().orElse(0L);
histogram.recordLatency(maxLatency);

No mystery. No clever failure. The request records the max downstream latency.

The fan-out CSV shows what that does:

text

elapsed_s  p50_ms  p95_ms  p99_ms  p999_ms  throughput_rps
1          26.0   200.0    808.0   1599.0   1000.0
2          26.0    35.0    731.0   1722.0   1000.0
3          26.0    35.0    929.0   2459.0   1000.0
4          26.0   200.0    587.0   1922.0   1000.0
5          26.0    34.0    597.0   1849.0   1000.0

p50 moved by 6 ms. p99 moved by 563 ms.

That gap is where average latency gets people in trouble.

The top-level path owns the tail

The comparison chart is the simplest artifact in this post. The normal p99 and fan-out p99 come from the same deterministic lab run. They split when the request starts waiting for the slowest branch.

Normal p99 versus fan-out p99 shows the amplification directly

The data here is generated lab data. That is the point. The post uses checked-in outputs from:

text

golden/post1/post1-baseline.csv
golden/post1/post1-tail-amplification.csv

GoldenOutputTest.java runs the deterministic Article 1 path and compares those generated CSVs against the checked-in golden files. A fresh run for this post wrote results/article-grounding/post1/manifest.json; both Article 1 CSV artifacts had golden_match: true, with SHA-256 values matching their golden SHA-256 values.

The deterministic path records 1000 rps in both runs, so the article is not making a claim about how fast this laptop happened to schedule virtual threads. It is showing how the distribution changes when a request waits for five branches.

This is also why per-service dashboards are not enough.

A dependency dashboard can tell you each child service looks reasonable in isolation. A trace can explain one slow request after you find it. Neither replaces the metric you need to catch this early: the top-level percentile distribution for the composed endpoint.

If the endpoint is GET /checkout, measure GET /checkout.

Not just pricing.

Not just inventory.

Not just payment.

The endpoint owns the user-visible tail because the endpoint owns the wait.

HdrHistogram is the recording tool in this repo, but the tool is secondary to the boundary. The wrapper is small:

java

public void recordLatency(long latencyMs) {
    recorder.recordValue(Math.min(latencyMs, MAX_LATENCY_MS));
}

public Histogram intervalHistogram() {
    return recorder.getIntervalHistogram();
}

HdrHistogram will not save a bad test harness. Post 4 gets into coordinated omission for that reason. For this post, the requirement is simpler: record every top-level request at the boundary you care about, and keep the tail instead of folding it into a mean.

This lab does not include retries, hedging, cancellation, deadline propagation, queueing, or coordinated omission. Those are later posts. Pulling them in here would make the first post feel more complete and less useful, which is usually how a simple measurement lesson gets turned back into fog.

The first post only needs one point: the tail exists, and fan-out moves that tail into the user-facing path.

Measure the boundary before it hurts

For user-facing paths, p50 is the common path. p99 is regular tail pain. p99.9 is where "rare" stops being rare at high volume.

text

tail requests per second = 1000 * (1 - 0.999)
                          = 1 request per second

One request per second is not an edge case. It is a steady leak.

The review rule I would enforce is narrow. Measure latency at the top-level request boundary and emit p50, p95, p99, and p99.9 from that same boundary. Count fan-out on critical paths. Keep average latency as a support metric, not the incident trigger. And keep deterministic lab outputs tied to checked-in CSVs so the article numbers are reproducible, not claimed.

That average-latency point matters because average latency will keep looking reasonable longer than you want it to. It is a lagging explanation for this class of failure, not the alarm.

Do the measurement before the incident. Adding a histogram later is still better than guessing, but then you are trying to reason from data you did not collect. If the path matters, record the distribution from the first version.

The command that regenerates the numbers used here is:

bash

./gradlew :latency-lab:runTailLatency \
  -Pargs="--deterministic --duration 5s --concurrency 10 --output-dir ./results/tail-latency --snapshot-interval 1s"

The generated artifacts used in this post are:

text

golden/post1/post1-baseline.csv
golden/post1/post1-tail-amplification.csv
golden/post1/post1-baseline.png
golden/post1/post1-tail-amplification.png
golden/post1/post1-comparison.png

Resources

Jeffrey Dean and Luiz André Barroso, The Tail at Scale.
Gil Tene, HdrHistogram.

Next · Part 2 →Queueing Theory for Engineers

Code repositoryproduction-systems-labs

#tail-latency-system-behavior

Written by

Jagdish Salgotra

Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.

all posts

Keep reading · rest of the series

Was this article helpful? or email →

anonymous · no account needed

The average hides the failure

I have lost real debugging time to this pattern.

So this lab makes that specific failure small enough to inspect.

Nothing in the downstream service got slower. The only difference was the request shape.

This is the quiet failure mode the lab isolates:

A mostly fast latency distribution.
A small but real slow tail.
A top-level request that waits for several downstream calls.
A dashboard habit that only watches the middle.

The baseline chart is intentionally plain. p50 sits near the normal path. The tail is visible only if you read p99 and p99.9.

The fan-out chart is the same lab after changing the request shape. Five branches. One top-level response. The max branch wins.

The mean throws away the part we need

Average latency is useful for some questions. It helps with broad capacity accounting. It can tell you something about total work. It is not useless.

It is just the wrong first number for this failure.

text

mean latency = sum(all request latencies) / request count

That formula spreads the bad requests across the good ones. Users do not experience that blend. They experience the one request they are waiting on.

The lab sampler in LatencyInjector.java is small enough to read without guessing what it does. Most requests land near 20 ms: normal mean 20 ms, stddev 5 ms. One percent take the slow path.

java

if (random.nextDouble() < TAIL_PROBABILITY) {
    double tailSample = -TAIL_MEAN_MS * Math.log(1.0 - random.nextDouble());
    long clamped = Math.round(tailSample);
    return Math.max(TAIL_MIN_MS, Math.min(TAIL_MAX_MS, clamped));
} else {
    double normalSample = random.nextGaussian() * NORMAL_STDDEV_MS + NORMAL_MEAN_MS;
    long clamped = Math.round(normalSample);
    return Math.max(NORMAL_MIN_MS, Math.min(NORMAL_MAX_MS, clamped));
}

It is a controlled input: fast path, rare slow path, repeatable output.

The baseline CSV is the part I would look at first:

text

elapsed_s  p50_ms  p95_ms  p99_ms  p999_ms  throughput_rps
1          20.0    29.0    34.0    1532.0   1000.0
2          20.0    28.0    36.0     972.0   1000.0
3          20.0    28.0   200.0     653.0   1000.0
4          20.0    29.0   200.0     919.0   1000.0
5          20.0    28.0    34.0    1383.0   1000.0

At 1000 rps, p99.9 is not trivia. It is roughly one request per second at or beyond that boundary. If that request is login, checkout, payment, or search, someone is waiting for it.

The fan-out math is not complicated:

text

P(at least one of 5 branches is slow) = 1 - P(all 5 branches are fast)
                                      = 1 - (0.99)^5
                                      = 4.9%

The code in TailAmplificationScenario.java is just that math in request form:

java

List<Callable<Long>> downstream = new ArrayList<>(FAN_OUT_DEGREE);
for (int j = 0; j < FAN_OUT_DEGREE; j++) {
    downstream.add(() -> {
        long latencyMs = downstreamInjector.sampleLatencyMs();
        Thread.sleep(latencyMs);
        return latencyMs;
    });
}

List<Long> results = ScopedRunner.fanOut(downstream);
long maxLatency = results.stream().mapToLong(Long::longValue).max().orElse(0L);
histogram.recordLatency(maxLatency);

No mystery. No clever failure. The request records the max downstream latency.

The fan-out CSV shows what that does:

text

elapsed_s  p50_ms  p95_ms  p99_ms  p999_ms  throughput_rps
1          26.0   200.0    808.0   1599.0   1000.0
2          26.0    35.0    731.0   1722.0   1000.0
3          26.0    35.0    929.0   2459.0   1000.0
4          26.0   200.0    587.0   1922.0   1000.0
5          26.0    34.0    597.0   1849.0   1000.0

p50 moved by 6 ms. p99 moved by 563 ms.

That gap is where average latency gets people in trouble.

The top-level path owns the tail

The comparison chart is the simplest artifact in this post. The normal p99 and fan-out p99 come from the same deterministic lab run. They split when the request starts waiting for the slowest branch.

The data here is generated lab data. That is the point. The post uses checked-in outputs from:

text

golden/post1/post1-baseline.csv
golden/post1/post1-tail-amplification.csv

This is also why per-service dashboards are not enough.

If the endpoint is GET /checkout, measure GET /checkout.

Not just pricing.

Not just inventory.

Not just payment.

The endpoint owns the user-visible tail because the endpoint owns the wait.

HdrHistogram is the recording tool in this repo, but the tool is secondary to the boundary. The wrapper is small:

java

public void recordLatency(long latencyMs) {
    recorder.recordValue(Math.min(latencyMs, MAX_LATENCY_MS));
}

public Histogram intervalHistogram() {
    return recorder.getIntervalHistogram();
}

The first post only needs one point: the tail exists, and fan-out moves that tail into the user-facing path.

Measure the boundary before it hurts

For user-facing paths, p50 is the common path. p99 is regular tail pain. p99.9 is where "rare" stops being rare at high volume.

text

tail requests per second = 1000 * (1 - 0.999)
                          = 1 request per second

One request per second is not an edge case. It is a steady leak.

That average-latency point matters because average latency will keep looking reasonable longer than you want it to. It is a lagging explanation for this class of failure, not the alarm.

The command that regenerates the numbers used here is:

bash

./gradlew :latency-lab:runTailLatency \
  -Pargs="--deterministic --duration 5s --concurrency 10 --output-dir ./results/tail-latency --snapshot-interval 1s"

The generated artifacts used in this post are:

text

golden/post1/post1-baseline.csv
golden/post1/post1-tail-amplification.csv
golden/post1/post1-baseline.png
golden/post1/post1-tail-amplification.png
golden/post1/post1-comparison.png

Resources

Jeffrey Dean and Luiz André Barroso, The Tail at Scale.
Gil Tene, HdrHistogram.

Why Average Latency Lies

Series navigation

Jagdish Salgotra

Keep reading · rest of the series

Why Average Latency Lies

The average hides the failure

The mean throws away the part we need

The top-level path owns the tail

Measure the boundary before it hurts

Resources

Series navigation

Jagdish Salgotra

Keep reading · rest of the series

The average hides the failure

The mean throws away the part we need

The top-level path owns the tail

Measure the boundary before it hurts

Resources