engnotes.dev
Tail Latency & System Behavior · Part 2 of 4

Queueing Theory for Engineers

A deterministic Java queueing simulation where p99 stays at 10ms through rho=1.00, then jumps to 109ms at rho=1.05 and 605ms at rho=1.30 while service time stays fixed.

Jagdish Salgotra
2026-06-07 · 10 min read · ~1,300 words

Series navigation

Previous: Part 1, Why Average Latency Lies
Next: Part 3, Hedged Requests & Speculative Execution
Code repository: production-systems-labs

Production System Labs, Tail Latency & System Behavior series: runnable Java experiments on the failure patterns that show up under real load. Deterministic outputs, checked-in CSVs, reproducible on any machine.


CPU was not the thing users waited on

The painful queueing incidents I remember did not start with a host that looked obviously dead. They started with a service that still had enough CPU to make everyone hesitate, enough memory to avoid suspicion, and enough successful requests to keep the first few dashboards looking respectable while users were already stuck behind some smaller resource that did not have the same headroom.

This post is not a transcript from one of those incidents. The data is deterministic synthetic lab output from golden/post2/post2-saturation.csv, generated by QueueSaturationMain.java, and the point of making it synthetic is to remove every distraction except the one production shape that matters here: fixed service time, rising arrivals, and a bottleneck that crosses its service rate before the host graph looks dramatic.

The lab fixes service time at 10 ms and increases arrival rate. Nothing else gets slower.

The deterministic sweep stays flat through rho=1.00. Then p99 jumps to 109 ms at rho=1.05, 208 ms at rho=1.10, 407 ms at rho=1.20, and 605 ms at rho=1.30.

That jump keeps biting production systems. CPU looks fine. Memory looks fine. Error rate is quiet. Users are still waiting, because they are not waiting on "the service" in the abstract. They are waiting on a worker, a pool, a shard, a queue, a downstream limit, or some other bottleneck with its own service rate.

I initially expected this lab to show a softer knee. It does not. The deterministic run is blunt: flat, flat, flat, then waiting time jumps. That bluntness is useful because a smooth graph gives you too much room to negotiate with a limit that is not actually negotiating back.

The server did not become slower at rho=1.05. The worker still takes exactly 10 ms per request. The extra latency is pure waiting.
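A back-of-envelope derivation shows why the jump lands where it does. This is a sketch under stated assumptions, not the lab's own math: one worker, evenly spaced arrivals starting at t = 0, and a 2-second run (the --duration value in the regenerate command). The symbols n (zero-based request index), S (fixed service time), a_n, f_n, and W_n are introduced here for illustration.

```latex
% Assumptions: one worker, fixed service time S = 10 ms,
% arrivals evenly spaced 1/\lambda apart, request index n = 0, 1, 2, ...
\begin{align*}
a_n &= \frac{n}{\lambda}
    && \text{arrival time of request } n \\
f_n &= (n + 1)\,S
    && \text{finish time when the worker never idles } (\rho > 1) \\
W_n &= f_n - a_n = S + n\,S\left(1 - \frac{1}{\rho}\right)
    && \text{using } \rho = \lambda S \\
W_{209} &= 10 + 209 \cdot 10 \cdot \left(1 - \frac{1}{1.05}\right) \approx 109.5 \text{ ms}
    && \rho = 1.05,\ 210 \text{ requests in 2 s}
\end{align*}
```

For rho at or below 1 the worker is always free on arrival, so W_n stays at S. Above 1, the wait grows linearly with n, which means the reported tail depends on how long the run lasts: a longer run at the same rho would show a larger number.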

That distinction matters. If service time grows, optimize the work. If queue wait grows, the bottleneck is saturated, and optimizing random code paths is theater.

Post 1 showed why the average hides user-visible tail pain. Post 2 shows one way that pain gets created: arrivals start beating completions.


CPU was the wrong graph

The trap is measuring the graph that is easy to get and pretending it is the graph that explains the outage.

CPU is convenient. Heap is convenient. Host-level dashboards are convenient. Requests do not care. They wait behind the resource that is currently limiting progress.

For this lab, the model is stripped down to the part that matters:

text
rho = lambda / mu

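lambda is the arrival rate into the bottleneck. mu is the service rate out of it. When arrivals exceed completions, the backlog grows for as long as the overload lasts. In this lab, one worker at 10 ms per request gives mu = 100 rps, so rho=1.05 is 105 arrivals per second.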

The classic M/M/1 formula is not the exact model here, because this lab uses deterministic service time. Still, the shape is the part worth keeping:

text
L = rho / (1 - rho)

The denominator is where the pain comes from. As rho gets close to 1, each extra point of utilization costs more than the last one.

text
rho    L requests in system    multiplier vs rho=0.50
0.50   1.0                     1x
0.70   2.3                     2.3x
0.80   4.0                     4x
0.90   9.0                     9x
0.95   19.0                    19x
0.99   99.0                    99x

Do not read that table as a precise production model. Read it as a warning against linear thinking. Going from 70% to 80% is not the same kind of move as going from 20% to 30%.
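The table can be regenerated with a few lines of Java. This is a minimal sketch of the M/M/1 shape only; the class and method names are illustrative and do not come from the lab repository.

```java
import java.util.Locale;

// Sketch only: reproduces the M/M/1 shape table.
// Mm1Shape and requestsInSystem are illustrative names, not lab code.
final class Mm1Shape {

    // Mean number of requests in an M/M/1 system: L = rho / (1 - rho).
    // Only defined for 0 <= rho < 1; at or above 1 there is no steady state.
    static double requestsInSystem(double rho) {
        if (rho < 0.0 || rho >= 1.0) {
            throw new IllegalArgumentException("rho must be in [0, 1): " + rho);
        }
        return rho / (1.0 - rho);
    }

    public static void main(String[] args) {
        double[] rhos = {0.50, 0.70, 0.80, 0.90, 0.95, 0.99};
        double baseline = requestsInSystem(0.50);
        for (double rho : rhos) {
            double l = requestsInSystem(rho);
            // Columns: rho, L, multiplier relative to rho = 0.50.
            System.out.printf(Locale.ROOT, "%.2f  L=%.1f  %.1fx%n", rho, l, l / baseline);
        }
    }
}
```

The guard on rho is deliberate: the formula has no meaning at or past saturation, which is exactly the region the lab explores with a simulation instead.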

The deterministic lab removes service-time variance so the cliff is easy to reproduce. Real systems are usually less kind. Retries, lock contention, slow downstream calls, batch jobs, and deploy noise all need headroom. If the queue is already close to the edge, variance spends that headroom for you.

The saturation CSV is the source for the post:

text
rho   target_rps  actual_rps  p50_ms  p99_ms  p999_ms  mean_sojourn_ms  avg_queue_depth
0.10  10.0        10.0        10.0    10.0    10.0     10.0             0.00
0.20  20.0        20.0        10.0    10.0    10.0     10.0             0.00
0.30  30.0        30.0        10.0    10.0    10.0     10.0             0.00
0.40  40.0        40.0        10.0    10.0    10.0     10.0             0.00
0.50  50.0        50.0        10.0    10.0    10.0     10.0             0.00
0.60  60.0        60.0        10.0    10.0    10.0     10.0             0.00
0.70  70.0        70.0        10.0    10.0    10.0     10.0             0.00
0.80  80.0        80.0        10.0    10.0    10.0     10.0             0.00
0.90  90.0        90.0        10.0    10.0    10.0     10.0             0.00
0.95  95.0        95.0        10.0    10.0    10.0     10.0             0.00
1.00  100.0       100.0       10.0    10.0    10.0     10.0             0.00
1.05  105.0       105.0       85.0    109.0   110.0    59.8             4.46
1.10  110.0       110.0       160.0   208.0   209.0    109.5            9.50
1.15  115.0       115.0       234.0   307.0   309.0    159.3            14.42
1.20  120.0       120.0       308.0   407.0   408.0    209.2            19.34
1.25  125.0       125.0       384.0   506.0   508.0    259.0            24.30
1.30  130.0       130.0       458.0   605.0   608.0    308.8            29.35

The dangerous row is rho=1.00, not because it looks bad, but because it does not. p99 is still 10 ms. A shallow load test can stop there and declare victory right before the bill arrives.

The throughput chart is not the scary one in this deterministic run. Actual RPS follows target RPS. The cost is showing up as time-in-system.

That is the part worth spelling out: throughput can look obedient while users are already paying in latency.

The simulator in QueueSimulator.java makes the wait visible without involving wall-clock scheduling:

java
double arrivalMs = request * interArrivalMs;
int workerIndex = nextWorker(workerAvailableAtMs);
double serviceStartMs = Math.max(arrivalMs, workerAvailableAtMs[workerIndex]);
double finishMs = serviceStartMs + serviceTimeMs;
workerAvailableAtMs[workerIndex] = finishMs;
long sojournMs = Math.max(serviceTimeMs, Math.round(finishMs - arrivalMs));

If the worker is still busy, the request starts later. Service time did not change. Sojourn time did.

That is the queue.
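The same idea fits in a self-contained sketch. It is not the lab's QueueSimulator: it assumes one worker, evenly spaced arrivals starting at t = 0, and a request count of rate times duration, and every name in it is illustrative. It also skips the lab's percentile logic, so it should not be expected to reproduce every CSV column.

```java
import java.util.Arrays;
import java.util.Locale;

// Sketch only: a single-worker, fixed-service-time queue driven by evenly
// spaced arrivals. Names are illustrative, not the lab's QueueSimulator API.
final class SaturationSketch {

    // Returns the sojourn time (ms) of every request, in arrival order.
    static double[] sojournTimesMs(double arrivalsPerSecond, double serviceTimeMs, double durationSeconds) {
        // Assumption: the run issues rate * duration requests, starting at t = 0.
        int requests = (int) Math.round(arrivalsPerSecond * durationSeconds);
        double interArrivalMs = 1000.0 / arrivalsPerSecond;
        double workerAvailableAtMs = 0.0;
        double[] sojourn = new double[requests];
        for (int n = 0; n < requests; n++) {
            double arrivalMs = n * interArrivalMs;
            // The request waits only if the worker is still busy with earlier work.
            double serviceStartMs = Math.max(arrivalMs, workerAvailableAtMs);
            double finishMs = serviceStartMs + serviceTimeMs;
            workerAvailableAtMs = finishMs;
            sojourn[n] = finishMs - arrivalMs;
        }
        return sojourn;
    }

    public static void main(String[] args) {
        for (double rps : new double[] {90.0, 100.0, 105.0, 130.0}) {
            double[] sojourn = sojournTimesMs(rps, 10.0, 2.0);
            double mean = Arrays.stream(sojourn).average().orElse(0.0);
            double max = Arrays.stream(sojourn).max().orElse(0.0);
            System.out.printf(Locale.ROOT, "rps=%.0f mean=%.1f ms max=%.1f ms%n", rps, mean, max);
        }
    }
}
```

Under these assumptions the 100 rps run stays at 10 ms for every request, while the 105 rps run averages close to 59.8 ms and its last request takes about 110 ms. Nothing in the loop touches serviceTimeMs; the growth comes entirely from serviceStartMs drifting away from arrivalMs.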


Little's Law will not clean up bad boundaries

Little's Law is the check I want when the dashboards disagree:

text
L = lambda * W

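L is the average number of requests in the system. lambda is the long-run arrival rate, which equals completed throughput when the system is in steady state. W is the average time a request spends in the system.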

Use it to catch mismatched metrics. Do not use it to pretend mismatched metrics are clean.

At rho=1.20, the deterministic run reports:

text
lambda = 120 rps
W = 209.2 ms = 0.2092 s

L_computed = 120 * 0.2092
           = 25.1 requests

L_measured = 19.34 requests

Those values are not identical. They should not be presented as if they are.

The run is short. The queue is still building. The CSV's measured value is modeled queued work, while Little's Law L is work in the system, including work currently in service.

The useful part is the direction and order of magnitude. Both numbers say the system moved from no waiting to real backlog. The gap says the metrics are not measuring exactly the same thing.

The gap is not a flaw in the lesson. It is the lesson.

If latency is measured at the HTTP edge, throughput after an async handoff, and queue depth inside one worker pool, Little's Law will not line up cleanly. That does not make the law useless. It means the measurement boundary is wrong.
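The same comparison can be run across several overloaded rows of the saturation CSV. This is a small sketch, not LittlesLawCalculator.java; the class and method names are illustrative, and the input rows are copied from the CSV in this post.

```java
import java.util.Locale;

// Sketch only: compares Little's Law L against the avg_queue_depth column
// using rows copied from the saturation CSV. Names are illustrative.
final class LittleCheck {

    // L = lambda * W, with lambda in requests/second and W in milliseconds.
    static double computedL(double throughputRps, double meanSojournMs) {
        return throughputRps * (meanSojournMs / 1000.0);
    }

    public static void main(String[] args) {
        // Each row: {actual_rps, mean_sojourn_ms, avg_queue_depth}
        double[][] rows = {
            {105.0,  59.8,  4.46},
            {110.0, 109.5,  9.50},
            {120.0, 209.2, 19.34},
            {130.0, 308.8, 29.35},
        };
        for (double[] row : rows) {
            double l = computedL(row[0], row[1]);
            System.out.printf(Locale.ROOT,
                    "rps=%.0f computed L=%.1f measured depth=%.2f gap=%.1f%n",
                    row[0], l, row[2], l - row[2]);
        }
    }
}
```

Both columns rise together as rho increases, and the gap between them widens rather than closing. That is what to expect when two metrics share a direction but not a boundary.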

For this lab, the names are worth being precise about:

text
littles_law_computed_l = throughput * mean_sojourn_time
avg_queue_depth        = modeled queued-ahead work

LittlesLawCalculator.java computes the first value from actual throughput and mean sojourn time. QueueSimulator.java computes the second value from modeled queued-ahead work.

They should move together during overload. They are not identical quantities. If we wanted an exact comparison, the simulator should emit average work-in-system including in-service work.

That part still stings a little. The chart is useful, but the naming is not as clean as it should be. Better to admit the mess plainly than make the formula look tidier than the data.


Plan around the queue, not the host graph

Capacity planning starts at the bottleneck resource.

CPU headroom only matters when CPU is what requests wait behind. If requests are waiting on a worker pool, a database connection pool, a shard, or a downstream limit, CPU headroom is just scenery.

For a critical path, four measurements need the same boundary: arrival rate into the bottleneck, completion rate out of it, queue depth at that bottleneck, and sojourn time from enqueue to completion.

If those numbers do not roughly agree through L = lambda * W, stop polishing dashboards and fix the measurement boundary.

Do not run critical queues at the edge and call it efficiency. That is how ordinary variance turns into a page.

A 70% utilization target is not a law. The right headroom depends on variance, retry behavior, downstream coupling, release patterns, and business tolerance for failure. But the general rule is simple enough: the closer a critical queue runs to saturation, the more future traffic spikes get paid for as latency.

Make overload explicit. Use a bounded queue, a rate limiter, a load shedder, or a backpressure signal. Unlimited waiting is not reliability. It is just delayed failure with worse user experience.
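One way to make overload explicit is a bounded queue whose admission call fails fast instead of blocking. The sketch below uses java.util.concurrent.ArrayBlockingQueue and its non-blocking offer; BoundedAdmission and its method names are hypothetical and are not part of the lab code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a bounded admission queue that rejects instead of waiting.
// BoundedAdmission and its method names are hypothetical, not from the lab.
final class BoundedAdmission<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong shed = new AtomicLong();

    BoundedAdmission(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking: returns false (and counts a shed request) when the queue
    // is full, so the caller can fail fast instead of waiting.
    boolean tryAdmit(T request) {
        boolean accepted = queue.offer(request);
        if (!accepted) {
            shed.incrementAndGet();
        }
        return accepted;
    }

    // Worker side: returns the next queued request, or null if none is waiting.
    T poll() {
        return queue.poll();
    }

    // Number of requests rejected so far; worth exporting as a metric.
    long shedCount() {
        return shed.get();
    }
}
```

With one worker and a fixed 10 ms service time, a capacity of 5 caps the queue wait at roughly 50 ms for any admitted request. The capacity is therefore a latency budget, not just a memory limit.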

Also separate service time from queue wait in your metrics. If p99 is climbing but service time is flat, the system is not "slow" in the normal sense. It is saturated somewhere. That single observation changes what you should do next:

text
service time high  -> optimize work
queue wait high    -> add capacity, reduce arrivals, or shed load
retry rate high    -> stop pouring traffic into a resource already at the edge
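Separating the two numbers usually comes down to three timestamps taken at the same bottleneck: enqueue, start, and finish. The wrapper below is a sketch under that assumption. TimedTask and Recorder are hypothetical names, and it assumes the task object is created at the moment it is submitted.

```java
// Sketch only: splits a task's time at a bottleneck into queue wait and
// service time. TimedTask and Recorder are hypothetical names, not lab code.
final class TimedTask implements Runnable {

    // Receives both durations so they can go to separate histograms.
    interface Recorder {
        void record(long queueWaitNanos, long serviceNanos);
    }

    private final Runnable work;
    private final Recorder recorder;
    // Assumption: the task is constructed right before it is enqueued.
    private final long enqueuedNanos = System.nanoTime();

    TimedTask(Runnable work, Recorder recorder) {
        this.work = work;
        this.recorder = recorder;
    }

    @Override
    public void run() {
        long startedNanos = System.nanoTime(); // the worker picked the task up
        try {
            work.run();
        } finally {
            long finishedNanos = System.nanoTime();
            // Queue wait: enqueue -> start. Service time: start -> finish.
            recorder.record(startedNanos - enqueuedNanos, finishedNanos - startedNanos);
        }
    }
}
```

Submitting work as `executor.execute(new TimedTask(work, recorder))` keeps both measurements on the same boundary as the pool itself. If the recorded service time stays flat while queue wait climbs, the first row of the table above does not apply and the second one does.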

Post 5 dives into backpressure and load shedding. This post is the reason those things exist in the first place.

The command that regenerates the numbers used here is:

bash
./gradlew :latency-lab:runQueueSaturation \
  -Pargs="--deterministic --duration 2s --concurrency 1 --snapshot-interval 1s --output-dir ./results/queue-saturation"

The generated artifacts used in this post are:

text
golden/post2/post2-saturation.csv
golden/post2/post2-latency.png
golden/post2/post2-throughput.png
golden/post2/post2-littles-law.png

GoldenOutputTest.java runs this deterministic Article 2 path and compares post2-saturation.csv against the checked-in golden file. A fresh run for this post wrote results/article-grounding/post2/manifest.json; the Article 2 CSV artifact had golden_match: true, with its SHA-256 value matching the golden SHA-256 value.

References used for the underlying queueing ideas, not for the generated lab numbers:

  • John D. C. Little, "A Proof for the Queuing Formula: L = lambda W," Operations Research, 1961.
  • Mor Harchol-Balter, Performance Modeling and Design of Computer Systems, Cambridge University Press, 2013.