Written by
Jagdish Salgotra
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
A deterministic Java queueing simulation where p99 stays at 10ms through rho=1.00, then jumps to 109ms at rho=1.05 and 605ms at rho=1.30 while service time stays fixed.
Production System Labs - Series Tail Latency System Behavior, Runnable Java experiments on the failure patterns that show up under real load. Deterministic outputs, checked-in CSVs, reproducible on any machine.
The painful queueing incidents I remember did not start with a host that looked obviously dead. They started with a service that still had enough CPU to make everyone hesitate, enough memory to avoid suspicion, and enough successful requests to keep the first few dashboards looking respectable while users were already stuck behind some smaller resource that did not have the same headroom.
This post is not a transcript from one of those incidents. The data is deterministic synthetic lab output from golden/post2/post2-saturation.csv, generated by QueueSaturationMain.java, and the point of making it synthetic is to remove every distraction except the one production shape that matters here: fixed service time, rising arrivals, and a bottleneck that crosses its service rate before the host graph looks dramatic.
The lab fixes service time at 10 ms and increases arrival rate. Nothing else gets slower.
The deterministic sweep stays flat through rho=1.00. Then p99 jumps to 109 ms at rho=1.05, 208 ms at rho=1.10, 407 ms at rho=1.20, and 605 ms at rho=1.30.
That jump keeps biting production systems. CPU looks fine. Memory looks fine. Error rate is quiet. Users are still waiting, because they are not waiting on "the service" in the abstract. They are waiting on a worker, a pool, a shard, a queue, a downstream limit, or some other bottleneck with its own service rate.
I initially expected this lab to show a softer knee. It does not. The deterministic run is blunt: flat, flat, flat, then waiting time jumps. That bluntness is useful because a smooth graph gives you too much room to negotiate with a limit that is not actually negotiating back.
The server did not become slower at rho=1.05. The worker still takes exactly 10 ms per request. The extra latency is pure waiting.
That distinction matters. If service time grows, optimize the work. If queue wait grows, the bottleneck is saturated, and optimizing random code paths is theater.
Post 1 showed why the average hides user-visible tail pain. Post 2 shows one way that pain gets created: arrivals start beating completions.
The trap is measuring the graph that is easy to get and pretending it is the graph that explains the outage.
CPU is convenient. Heap is convenient. Host-level dashboards are convenient. Requests do not care. They wait behind the resource that is currently limiting progress.
For this lab, the model is stripped down to the part that matters:
rho = lambda / mu
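lambda is the arrival rate into the bottleneck. mu is the service rate out of it. When arrivals outpace completions (rho above 1), the backlog grows for as long as the overload lasts. At exactly rho = 1 this deterministic lab still keeps up, but only because nothing in it varies.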
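To tie rho back to the lab's numbers, here is a minimal sketch of the ratio for a bottleneck made of identical workers with a fixed service time. The class and method names are hypothetical and not taken from the lab code; the only inputs borrowed from the post are the 10 ms service time and the single worker.

```java
/** Sketch: utilization of a bottleneck with identical workers and a fixed service time. */
final class Utilization {

    /** mu: how many requests per second the bottleneck can complete. */
    static double serviceRatePerSecond(int workers, double serviceTimeMs) {
        return workers * 1000.0 / serviceTimeMs;
    }

    /** rho = lambda / mu. */
    static double rho(double arrivalRatePerSecond, int workers, double serviceTimeMs) {
        return arrivalRatePerSecond / serviceRatePerSecond(workers, serviceTimeMs);
    }

    public static void main(String[] args) {
        // One worker at 10 ms per request can complete 100 requests per second,
        // so 105 arrivals per second is already past the service rate.
        System.out.println("mu  = " + serviceRatePerSecond(1, 10.0));
        System.out.println("rho = " + rho(105.0, 1, 10.0));
    }
}
```

The point of writing it down is that rho belongs to the bottleneck, not the host: the same arrival rate gives a different rho for every pool, shard, or downstream limit it passes through.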
The classic M/M/1 formula is not the exact model here, because this lab uses deterministic service time. Still, the shape is the part worth keeping:
L = rho / (1 - rho)
The denominator is where the pain comes from. As rho gets close to 1, each extra point of utilization costs more than the last one.
| rho | L (requests in system) | multiplier vs rho=0.50 |
| --- | --- | --- |
| 0.50 | 1.0 | 1x |
| 0.70 | 2.3 | 2.3x |
| 0.80 | 4.0 | 4x |
| 0.90 | 9.0 | 9x |
| 0.95 | 19.0 | 19x |
| 0.99 | 99.0 | 99x |
Do not read that table as a precise production model. Read it as a warning against linear thinking. Going from 70% to 80% is not the same kind of move as going from 20% to 30%.
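The table values come straight from L = rho / (1 - rho). A small sketch that recomputes them, assuming nothing beyond the formula itself (the class name is hypothetical, and this is not part of the lab):

```java
/** Sketch: the M/M/1 shape used for the warning table, not the lab's deterministic model. */
final class Mm1Shape {

    /** Mean number of requests in an M/M/1 system. Only defined for 0 <= rho < 1. */
    static double meanInSystem(double rho) {
        if (rho < 0.0 || rho >= 1.0) {
            throw new IllegalArgumentException("M/M/1 has no steady state for rho = " + rho);
        }
        return rho / (1.0 - rho);
    }

    public static void main(String[] args) {
        double baseline = meanInSystem(0.50);
        for (double rho : new double[] {0.50, 0.70, 0.80, 0.90, 0.95, 0.99}) {
            double l = meanInSystem(rho);
            // The multiplier is relative to rho = 0.50, where L is exactly 1.
            System.out.printf("rho=%.2f  L=%.1f  multiplier=%.1f%n", rho, l, l / baseline);
        }
    }
}
```

The guard on rho >= 1 matters: the formula says nothing about the overloaded region, which is exactly where the lab's interesting rows live.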
The deterministic lab removes service-time variance so the cliff is easy to reproduce. Real systems are usually less kind. Retries, lock contention, slow downstream calls, batch jobs, and deploy noise all need headroom. If the queue is already close to the edge, variance spends that headroom for you.
The saturation CSV is the source for the post:
| rho | target_rps | actual_rps | p50_ms | p99_ms | p999_ms | mean_sojourn_ms | avg_queue_depth |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.10 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.20 | 20.0 | 20.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.30 | 30.0 | 30.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.40 | 40.0 | 40.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.50 | 50.0 | 50.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.60 | 60.0 | 60.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.70 | 70.0 | 70.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.80 | 80.0 | 80.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.90 | 90.0 | 90.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 0.95 | 95.0 | 95.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 1.00 | 100.0 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 0.00 |
| 1.05 | 105.0 | 105.0 | 85.0 | 109.0 | 110.0 | 59.8 | 4.46 |
| 1.10 | 110.0 | 110.0 | 160.0 | 208.0 | 209.0 | 109.5 | 9.50 |
| 1.15 | 115.0 | 115.0 | 234.0 | 307.0 | 309.0 | 159.3 | 14.42 |
| 1.20 | 120.0 | 120.0 | 308.0 | 407.0 | 408.0 | 209.2 | 19.34 |
| 1.25 | 125.0 | 125.0 | 384.0 | 506.0 | 508.0 | 259.0 | 24.30 |
| 1.30 | 130.0 | 130.0 | 458.0 | 605.0 | 608.0 | 308.8 | 29.35 |
The dangerous row is rho=1.00, not because it looks bad, but because it does not. p99 is still 10 ms. A shallow load test can stop there and declare victory right before the bill arrives.
The throughput chart is not the scary one in this deterministic run. Actual RPS follows target RPS. The cost is showing up as time-in-system.
That is the part worth spelling out: throughput can look obedient while users are already paying in latency.
The simulator in QueueSimulator.java makes the wait visible without involving wall-clock scheduling:
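// Arrival time on the simulated clock; no wall-clock scheduling is involved.
double arrivalMs = request * interArrivalMs;
int workerIndex = nextWorker(workerAvailableAtMs);
// Service starts once the request has arrived and the chosen worker is free, whichever is later.
double serviceStartMs = Math.max(arrivalMs, workerAvailableAtMs[workerIndex]);
double finishMs = serviceStartMs + serviceTimeMs;
// The worker stays busy until this request finishes.
workerAvailableAtMs[workerIndex] = finishMs;
// Sojourn time is queue wait plus service time, never reported below the service time.
long sojournMs = Math.max(serviceTimeMs, Math.round(finishMs - arrivalMs));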
If the worker is still busy, the request starts later. Service time did not change. Sojourn time did.
That is the queue.
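For readers who want to poke at the mechanism without the repository, here is a self-contained single-worker sketch of the same idea. It is not QueueSimulator.java: the class name, the evenly spaced arrivals, and the rule that the request count is rate times duration are my assumptions, and it reports unrounded sojourn times rather than the lab's percentiles.

```java
import java.util.Arrays;

/** Sketch: one worker, fixed service time, evenly spaced arrivals, simulated clock only. */
final class SingleWorkerSketch {

    /** Returns the sojourn time in ms of every request, in arrival order. */
    static double[] sojournTimesMs(double arrivalRatePerSecond, double serviceTimeMs,
                                   double durationSeconds) {
        // Assumption: the run admits rate * duration requests, then drains.
        int requests = (int) Math.round(arrivalRatePerSecond * durationSeconds);
        double interArrivalMs = 1000.0 / arrivalRatePerSecond;
        double workerAvailableAtMs = 0.0;
        double[] sojournMs = new double[requests];
        for (int request = 0; request < requests; request++) {
            double arrivalMs = request * interArrivalMs;
            // If the worker is still busy, the request starts late. That delay is the queue.
            double serviceStartMs = Math.max(arrivalMs, workerAvailableAtMs);
            double finishMs = serviceStartMs + serviceTimeMs;
            workerAvailableAtMs = finishMs;
            sojournMs[request] = finishMs - arrivalMs;
        }
        return sojournMs;
    }

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(0.0);
    }

    static double max(double[] values) {
        return Arrays.stream(values).max().orElse(0.0);
    }

    public static void main(String[] args) {
        for (double rps : new double[] {90.0, 100.0, 105.0, 120.0}) {
            double[] sojourn = sojournTimesMs(rps, 10.0, 2.0);
            System.out.println("rps=" + rps + " mean_ms=" + mean(sojourn) + " max_ms=" + max(sojourn));
        }
    }
}
```

Under those assumptions the sketch stays at 10 ms through 100 rps, and at 120 rps over 2 seconds the mean sojourn lands near 209 ms, in line with the mean_sojourn_ms column. The service time in the loop never changes; only the start time does.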
Little's Law is the check I want when the dashboards disagree:
L = lambda * W
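L is the average number of requests in the system. lambda is completed throughput, which matches the arrival rate only while the system keeps up. W is the average time a request spends in the system, waiting plus service.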
Use it to catch mismatched metrics. Do not use it to pretend mismatched metrics are clean.
At rho=1.20, the deterministic run reports:
lambda = 120 rps
W = 209.2 ms = 0.2092 s
L_computed = 120 * 0.2092
= 25.1 requests
L_measured = 19.34 requests
Those values are not identical. They should not be presented as if they are.
The run is short. The queue is still building. The CSV's measured value is modeled queued work, while Little's Law L is work in the system, including work currently in service.
The useful part is the direction and order of magnitude. Both numbers say the system moved from no waiting to real backlog. The gap says the metrics are not measuring exactly the same thing.
The gap is not a flaw in the lesson. It is the lesson.
If latency is measured at the HTTP edge, throughput after an async handoff, and queue depth inside one worker pool, Little's Law will not line up cleanly. That does not make the law useless. It means the measurement boundary is wrong.
For this lab, the names are worth being precise about:
littles_law_computed_l = throughput * mean_sojourn_time
avg_queue_depth = modeled queued-ahead work
LittlesLawCalculator.java computes the first value from actual throughput and mean sojourn time. QueueSimulator.java computes the second value from modeled queued-ahead work.
They should move together during overload. They are not identical quantities. If we wanted an exact comparison, the simulator should emit average work-in-system including in-service work.
That part still stings a little. The chart is useful, but the naming is not as clean as it should be. Better to admit the mess plainly than make the formula look tidier than the data.
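One way the cleaner metric could look: a sketch that time-averages the number of requests in the system, queued or in service, over an observation window. This is not something the lab emits today. The class name, the method, and the choice to clip each request to the window are my assumptions.

```java
/** Sketch: time-averaged work-in-system, counting queued and in-service requests. */
final class WorkInSystemSketch {

    /**
     * Average number of requests present during [0, windowMs].
     * Each request contributes the part of its arrival-to-finish interval
     * that falls inside the window. Arrival times are assumed to be >= 0.
     */
    static double averageInSystem(double[] arrivalMs, double[] finishMs, double windowMs) {
        double requestMsInsideWindow = 0.0;
        for (int i = 0; i < arrivalMs.length; i++) {
            double from = Math.min(arrivalMs[i], windowMs);
            double to = Math.min(finishMs[i], windowMs);
            requestMsInsideWindow += Math.max(0.0, to - from);
        }
        return requestMsInsideWindow / windowMs;
    }

    public static void main(String[] args) {
        // Two requests: the first is in the system for 10 ms, the second for 15 ms.
        double[] arrivals = {0.0, 5.0};
        double[] finishes = {10.0, 20.0};
        // 25 request-milliseconds spread over a 20 ms window.
        System.out.println(averageInSystem(arrivals, finishes, 20.0));
    }
}
```

A value computed this way shares its boundary with mean sojourn time, so it is the one that can fairly be held up against throughput times W.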
Capacity planning starts at the bottleneck resource.
CPU headroom only matters when CPU is what requests wait behind. If requests are waiting on a worker pool, a database connection pool, a shard, or a downstream limit, CPU headroom is just scenery.
For a critical path, four measurements need the same boundary: arrival rate into the bottleneck, completion rate out of it, queue depth at that bottleneck, and sojourn time from enqueue to completion.
If those numbers do not roughly agree through L = lambda * W, stop polishing dashboards and fix the measurement boundary.
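A rough version of that check, as a sketch. The class, the method names, and the tolerance values are hypothetical; pick a tolerance that fits how noisy your measurements are. The example numbers are the rho=1.20 row from the lab.

```java
/** Sketch: does measured work-in-system roughly agree with lambda * W? */
final class LittlesLawCheck {

    /** L = lambda * W, with W converted from milliseconds to seconds. */
    static double computedL(double throughputPerSecond, double meanSojournMs) {
        return throughputPerSecond * (meanSojournMs / 1000.0);
    }

    /** True when the two values differ by at most the given fraction of the larger one. */
    static boolean roughlyAgrees(double measuredL, double computedL, double relativeTolerance) {
        double scale = Math.max(Math.abs(measuredL), Math.abs(computedL));
        if (scale == 0.0) {
            return true; // nothing in the system by either measure
        }
        return Math.abs(measuredL - computedL) / scale <= relativeTolerance;
    }

    public static void main(String[] args) {
        double computed = computedL(120.0, 209.2); // about 25.1 requests
        double measured = 19.34;                   // avg_queue_depth from the CSV
        System.out.println("computed L = " + computed);
        System.out.println("within 10%? " + roughlyAgrees(measured, computed, 0.10));
        System.out.println("within 25%? " + roughlyAgrees(measured, computed, 0.25));
    }
}
```

With the lab's numbers the gap is about 23% of the larger value, so a 10% tolerance flags it and a 25% tolerance does not. A failed check is a prompt to look at where each metric is measured, not proof that any one of them is wrong.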
Do not run critical queues at the edge and call it efficiency. That is how ordinary variance turns into a page.
70% is not a law. The right headroom depends on variance, retry behavior, downstream coupling, release patterns, and business tolerance for failure. But the general rule is simple enough: the closer a critical queue runs to saturation, the more future traffic spikes get paid for as latency.
Make overload explicit. Use a bounded queue, a rate limiter, a load shedder, or a backpressure signal. Unlimited waiting is not reliability. It is just delayed failure with worse user experience.
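The smallest version of "make overload explicit" is a bounded queue that refuses work instead of holding it forever. A minimal sketch using the JDK's ArrayBlockingQueue, whose offer call returns false immediately when the queue is full. The wrapper class, its method names, and the capacity are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: admit work into a bounded queue, shed it when the queue is full. */
final class BoundedAdmission {
    private final BlockingQueue<String> waiting;
    private final AtomicLong shed = new AtomicLong();

    BoundedAdmission(int capacity) {
        this.waiting = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false right away when the queue is full, so the caller can fail fast. */
    boolean tryAdmit(String requestId) {
        boolean accepted = waiting.offer(requestId); // non-blocking, unlike put()
        if (!accepted) {
            shed.incrementAndGet(); // count rejections so overload shows up as a metric
        }
        return accepted;
    }

    long shedCount() {
        return shed.get();
    }

    int depth() {
        return waiting.size();
    }

    public static void main(String[] args) {
        BoundedAdmission admission = new BoundedAdmission(2);
        System.out.println(admission.tryAdmit("a")); // fits
        System.out.println(admission.tryAdmit("b")); // fits
        System.out.println(admission.tryAdmit("c")); // queue is full, so this one is shed
        System.out.println("shed = " + admission.shedCount());
    }
}
```

The capacity is the design decision. It caps the worst-case queue wait at roughly capacity times service time per worker, which turns an unbounded latency problem into a visible rejection rate.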
Also separate service time from queue wait in your metrics. If p99 is climbing but service time is flat, the system is not "slow" in the normal sense. It is saturated somewhere. That single observation changes what you should do next:
service time high -> optimize work
queue wait high -> add capacity, reduce arrivals, or shed load
retry rate high -> stop pouring traffic into a resource already at the edge
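Splitting the two only takes three timestamps per request, all taken at the same bottleneck. A minimal sketch, with a hypothetical class name and made-up timestamps:

```java
/** Sketch: split one request's time in the system into queue wait and service time. */
final class LatencySplit {
    final double queueWaitMs;
    final double serviceMs;

    LatencySplit(double enqueuedAtMs, double serviceStartMs, double finishedAtMs) {
        this.queueWaitMs = serviceStartMs - enqueuedAtMs;
        this.serviceMs = finishedAtMs - serviceStartMs;
    }

    /** Total time in the system, from enqueue to completion. */
    double sojournMs() {
        return queueWaitMs + serviceMs;
    }

    /** True when waiting, not work, accounts for most of the time in the system. */
    boolean dominatedByQueueWait() {
        return queueWaitMs > serviceMs;
    }

    public static void main(String[] args) {
        // Hypothetical request: enqueued at 0 ms, picked up at 99 ms, done at 109 ms.
        LatencySplit split = new LatencySplit(0.0, 99.0, 109.0);
        System.out.println("queue wait ms = " + split.queueWaitMs);
        System.out.println("service ms    = " + split.serviceMs);
        System.out.println("queue-bound?    " + split.dominatedByQueueWait());
    }
}
```

Recording the two parts as separate histograms is what lets a flat service-time p99 sit next to a climbing sojourn p99 on the same dashboard.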
Post 5 dives into backpressure and load shedding. This post is the reason those things exist in the first place.
The command that regenerates the numbers used here is:
./gradlew :latency-lab:runQueueSaturation \
-Pargs="--deterministic --duration 2s --concurrency 1 --snapshot-interval 1s --output-dir ./results/queue-saturation"
The generated artifacts used in this post are:
golden/post2/post2-saturation.csv
golden/post2/post2-latency.png
golden/post2/post2-throughput.png
golden/post2/post2-littles-law.png
GoldenOutputTest.java runs this deterministic Article 2 path and compares post2-saturation.csv against the checked-in golden file. A fresh run for this post wrote results/article-grounding/post2/manifest.json; the Article 2 CSV artifact had golden_match: true, with its SHA-256 value matching the golden SHA-256 value.
References used for the underlying queueing ideas, not for the generated lab numbers: