Series navigation
Written by
Jagdish Salgotra
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
A deterministic Java measurement harness where a closed-loop benchmark reports a whole-run p99 of 10ms for a service that froze for 500ms, while open-loop scheduling and HdrHistogram's coordinated-omission correction both report 460ms from the same samples.
Written by
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
Production System Labs - Series 1, Post 4 Runnable Java experiments on the failure patterns that show up under real load. Deterministic outputs, checked-in CSVs, reproducible on any machine.
A service freezes for half a second. Maybe a stop-the-world GC pause, a lock held too long, a synchronous flush, a downstream that went away and came back. For that half second nothing completes. Then it recovers and serves normally again.
Now ask the load test what the p99 was, and it tells you the pause never happened.
This is the failure the lab isolates, and it is a measurement failure, not a system failure. The service really did stall. The benchmark really did miss it. The numbers are deterministic synthetic output from golden/post4/, generated by CoordinatedOmissionDemo.java with a fixed 10 ms base latency and a single injected pause partway through the run, so the only variable is how you measure.
I have shipped a load test that reported a single-digit-millisecond p99 for a service I already knew froze for half a second, and I trusted that number long enough to be embarrassed by it later.
The whole-run summary is the cleanest way to see the lie. Three series, same service, same pause, three different stories in golden/post4/post4-summary.csv:
series total_requests whole_run_p99_ms
closed-loop-raw 450 10.0
closed-loop-corrected 450 460.0
open-loop 500 460.0
The closed-loop raw run reports a whole-run p99 of 10 ms for a service that stopped responding for 500 ms. That is the coordinated omission problem in one row. The corrected run and the open-loop run, measuring the same pause, both report a whole-run p99 of 460 ms.
The reason is in how a closed-loop test issues work. It sends a request, waits for the response, then sends the next one. When the service stalls, the test stalls with it. It is not sitting outside the service generating load at a steady rate; it is politely waiting in line with everyone else. So during a 500 ms freeze it records exactly one slow request, the one that was unlucky enough to be in flight, and then resumes. One slow sample out of 450 lands far past the 99th percentile, so the whole-run p99 is still 10 ms.
A real user population does not wait its turn. New requests keep arriving during the freeze whether or not the previous ones finished. The closed-loop test silently omitted all the requests that should have arrived during the stall, and it omitted them in a way that is perfectly correlated with the slowness, which is the worst possible bias for a tail metric.
Per-interval, the stall is not completely invisible in the raw run. golden/post4/post4-closed-loop-raw.csv shows the third second:
elapsed_s p50_ms p95_ms p99_ms p999_ms throughput_rps total_requests
3 10.0 10.0 510.0 510.0 51.0 251
The interval p99 does spike to 510 ms that second, but look at the throughput next to it: it collapsed from 100 rps to 51 rps. The test simply stopped producing load while the service was frozen. And a benchmark that reports one number for the run, the whole-run p99, reports 10 ms. The spike exists in a column nobody puts on the slide.
The two scheduling loops are nearly identical, and the one-line difference between them is the entire lesson.
The closed-loop loop in CoordinatedOmissionDemo.java advances its clock by however long the request actually took, so a slow request pushes the next arrival later. When the service is frozen, no new arrivals are generated:
while (nowMs < durationMs) {
long latencyMs = model.latencyForArrival(nowMs);
intervals.get(intervalIndex(nowMs, snapshotIntervalMs, intervals.size())).recordValue(latencyMs);
wholeRun.recordValue(latencyMs);
totalRequests++;
nowMs += latencyMs; // a slow response delays the next request
}
The open-loop loop advances its clock by a fixed inter-arrival interval, derived from the target rate, regardless of how long any request takes. Arrivals during the freeze still happen and still get recorded:
for (long arrivalMs = 0L; arrivalMs < durationMs; arrivalMs += expectedIntervalMs) {
long latencyMs = model.latencyForArrival(arrivalMs);
intervals.get(intervalIndex(arrivalMs, snapshotIntervalMs, intervals.size())).recordValue(latencyMs);
wholeRun.recordValue(latencyMs);
totalRequests++;
}
nowMs += latencyMs versus arrivalMs += expectedIntervalMs. That is the difference between a benchmark that hides pauses and one that exposes them. It also explains the request counts in the summary: the open-loop run records 500 requests, the closed-loop run only 450, and the missing 50 are exactly the arrivals the closed-loop test swallowed during the freeze.
The pause itself is not magic. PauseServiceModel.java returns baseline latency outside the pause window and wait-until-pause-end latency inside it. In the fresh deterministic run for this post, the benchmark output reported an expected interval of 10 ms, a 10 ms baseline service latency, and a pause window from 2500 ms through 3000 ms.
If you are stuck with closed-loop data, HdrHistogram can reconstruct the omitted samples, as long as you tell it the interval at which requests were supposed to arrive:
Histogram corrected = rawIntervals.get(i).copyCorrectedForCoordinatedOmission(expectedIntervalMs);
Given an expected interval of 10 ms and an observed 500 ms stall, the correction synthesizes the roughly fifty requests that should have arrived during the freeze, each with the latency it would have experienced. That is how the corrected whole-run p99 climbs from 10 ms to 460 ms and lines up with the open-loop run. The correction is a repair for data you already mis-collected; it is not a substitute for generating load correctly in the first place.
Before you trust a latency number, ask how the load was generated, because the generator decides whether the tail is real or theater.
Open-loop load is the right default for tails that matter: a fixed arrival schedule, independent of response time. A tool that backs off when the service slows down is measuring its own politeness, not your service. Closed-loop data can still be useful, but the expected interval has to be real and the coordinated-omission correction has to be explicit; in this lab, the corrected p99 was 46x the raw whole-run p99 from the closed-loop samples.
A whole-run percentile that looks too good next to a throughput dip deserves suspicion. A p99 of 10 ms beside a second where throughput fell from 100 rps to 51 rps is not a fast service. It is a missing measurement. Throughput belongs next to latency because the collapse is the tell; the latency column alone would have let the stall hide.
The harness is part of the system under test. A measurement tool that omits the worst moments is not neutral. It is actively lying in the direction that feels best.
The uncomfortable part is that the wrong number is the reassuring one. Coordinated omission does not make your service look randomly off; it makes it look better, precisely when it is at its worst.
The command that regenerates the numbers used here is:
./gradlew :latency-lab:runCoordinatedOmission \
-Pargs="--deterministic --duration 5s --concurrency 10 --output-dir ./results/coordinated-omission --snapshot-interval 1s --measurement-mode both"
The generated artifacts used in this post are:
golden/post4/post4-closed-loop-raw.csv
golden/post4/post4-closed-loop-corrected.csv
golden/post4/post4-open-loop.csv
golden/post4/post4-summary.csv
golden/post4/post4-p99-comparison.png
golden/post4/post4-throughput.png
GoldenOutputTest.java runs this deterministic Article 4 path and compares the four generated CSVs against the checked-in golden files. A fresh run for this post wrote results/article-grounding/post4/manifest.json; all four Article 4 CSV artifacts had golden_match: true, with SHA-256 values matching their golden SHA-256 values.
References used for the underlying measurement ideas, not for the generated lab numbers: