Series navigation
Written by
Jagdish Salgotra
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
A deterministic Java comparison of five admission strategies under identical 2x overload. Reject-fast policies (token bucket, rate limiter) hold p99 at 10ms; a bounded queue accepts fewer requests at a 500ms p99, because under sustained overload a buffer is the worst of both.
Written by
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
Production System Labs - Series 1, Post 5 Runnable Java experiments on the failure patterns that show up under real load. Deterministic outputs, checked-in CSVs, reproducible on any machine.
More work is arriving than the service can finish. Not briefly, not a spike that drains in a second, but a sustained period where the offered rate is above the service rate. Post 2 showed what that does to latency when nothing pushes back: the queue grows without bound and waiting time climbs until requests are answered long after anyone cares.
Backpressure is the decision you make instead of letting that happen. BackpressureScenario.java offers the same overload to five admission strategies: 1000 incoming requests over 5s, an incoming rate of 200 rps, and a target admit rate of 100 rps. The summary in golden/post5/post5-backpressure-summary.csv records what each strategy does:
strategy incoming accepted rejected acceptance_pct p50_ms p99_ms max_buffered
token-bucket 1000 599 401 59.9 10.0 10.0 0
bounded-queue 1000 549 451 54.9 500.0 500.0 50
leaky-bucket 1000 524 476 52.4 250.0 250.0 25
load-shedder 1000 500 500 50.0 50.0 50.0 5
resilience4j-rate-limiter 1000 500 500 50.0 10.0 10.0 0
The overload is identical in every row. What changes is where the pain goes.
We once had an inbound queue with no upper bound, and when traffic spiked it did not fall over loudly; it just kept accepting work until every request being served was already minutes old and nobody on the other end was still waiting for the answer.
The instinct, the one that built that unbounded queue, is to accept everything and let the system "catch up." Backpressure is the admission that there is no catching up while the overload lasts. The excess demand has to become something, and you only get to choose what.
Read the table as a set of choices, not a ranking. The token bucket and the rate limiter keep latency flat at 10 ms p99 because they reject the moment they are out of capacity; accepted requests are served immediately, and the cost lands entirely on the rejected 40 to 50%. The bounded queue makes the opposite bet: instead of rejecting at the capacity line, it buffers up to 50 requests and serves them late. Its p99 is 500 ms, which is exactly a full queue of 50 requests each taking 10 ms to serve. The leaky bucket is the same idea with a shallower buffer, 25 deep for a 250 ms p99.
Here is the part that should correct the intuition. The buffer did not even buy more accepted work for all that latency. The bounded queue accepted 549 requests, fewer than the token bucket's 599, because under a sustained overload the buffer fills up and then rejects anyway. It paid 500 ms of latency on everything it accepted and still turned away more requests than the policy that stayed at 10 ms. If your 200 ms SLO is the thing you are protecting, that 500 ms p99 did not protect it; it converted rejection into deadline misses, which are often worse, because the client already gave up and the work you did was wasted.
That is the lesson a queue does not advertise: buffering only helps when overload is transient, a burst it can absorb and drain after the spike passes. Against a sustained overload like this one, a buffer is the worst of both worlds, rejecting like the others and making everything it accepts slow.
The first second tells the rest of the story. From golden/post5/post5-backpressure-snapshots.csv:
strategy elapsed_s accepted rejected throughput_rps p50_ms p99_ms max_buffered
token-bucket 1 199 1 199.0 10.0 10.0 0
bounded-queue 1 149 51 149.0 380.0 500.0 50
The token bucket lets a burst of 199 through in the first second because it starts full, then settles to the refill rate. The bounded queue admits 149, but its p50 is already 380 ms and its buffer is already full, because every admission past the service rate goes to the back of a line.
Build the primitives by hand before reaching for a library, because the behavior in that table is not magic, it is about ten lines of arithmetic each.
A token bucket holds a balance that refills at a fixed rate and caps at a capacity. Admission spends a token; if the balance is below one, the request is rejected immediately. The capacity is the burst allowance, which is why the first second let 199 through. TokenBucket.java is the hand-rolled version used in this run:
public BackpressureDecision admit(long arrivalMs) {
refill(arrivalMs);
if (tokens < 1.0) {
return BackpressureDecision.rejected();
}
tokens -= 1.0;
return BackpressureDecision.accepted(serviceTimeMs);
}
private void refill(long nowMs) {
long elapsedMs = Math.max(0L, nowMs - lastRefillMs);
tokens = Math.min(capacity, tokens + (elapsedMs * refillPerMs));
lastRefillMs = nowMs;
}
Rejection is instant, so accepted requests never inherit a queue. That is the whole reason its p99 stays at the base 10 ms.
A bounded queue is the other shape. It tracks the completion times of admitted work, drops anything already finished, and rejects only when the buffer is full. Crucially, an admitted request starts after everything ahead of it, so it carries the accumulated wait. BoundedQueue.java makes that wait explicit:
public BackpressureDecision admit(long arrivalMs) {
removeCompleted(arrivalMs);
if (completionTimes.size() >= capacity) {
return BackpressureDecision.rejected();
}
long startMs = completionTimes.isEmpty()
? arrivalMs
: Math.max(arrivalMs, completionTimes.peekLast());
long completionMs = startMs + serviceTimeMs;
completionTimes.addLast(completionMs);
return BackpressureDecision.accepted(completionMs - arrivalMs);
}
completionMs - arrivalMs is the latency the caller sees, and it grows with queue depth. That single line is the difference between the token bucket's flat 10 ms and the bounded queue's 500 ms. The load shedder is a stricter variant of the same idea, capping in-flight work at a small number (here 5, giving a 50 ms p99), which keeps a tiny buffer for smoothing without ever letting latency run away.
Once the mechanics are clear, the library has nothing to teach you about behavior, only about robustness. The Resilience4jRateLimiterController.java row behaves like the hand-rolled token bucket without the burst tuning, accepting 500 at a flat 10 ms p99. Reaching for it is the right move in production, but only once you can predict, from the table above, what it will do under overload, and what it will do to the requests it turns away.
![Latency per strategy: flat for reject-fast policies, elevated for buffering policies]
Backpressure does not reduce load. It chooses who absorbs the overload and in what currency, and that choice should be deliberate.
Everything that can queue needs a bound. An unbounded buffer is not resilience; it is a delayed, worse outage where every request is stale on arrival. The overload currency also needs a decision before the incident. Reject-fast policies such as token buckets and rate limiters spend overload as rejections and keep accepted latency flat. Buffering policies such as bounded queues and leaky buckets spend overload as latency, and only earn that latency back if the overload is a transient burst they can drain. Against the sustained overload here, the buffering strategies rejected more and ran slower. There is no option that spends nothing.
A buffer that pushes accepted latency past your SLO has converted rejection into deadline misses. Queue size should come from the latency budget, not from available memory. Load shedding close to the edge keeps rejected requests cheap, and the load shedder held latency at 50 ms precisely because it never let work pile up.
The primitive is worth building once by hand so the behavior is predictable, then the battle-tested library can own the production hardening. Resilience4j is the right production choice; understanding the token bucket is what lets you configure it without guessing.
The hard part is not implementing any of these. It is accepting, before traffic forces the issue, that some requests will not be served, and choosing which ones on purpose rather than letting an unbounded queue decide by drowning everyone equally.
The command that regenerates the numbers used here is:
./gradlew :latency-lab:runBackpressure \
-Pargs="--deterministic --duration 5s --concurrency 10 --output-dir ./results/backpressure --snapshot-interval 1s --backpressure-strategy all"
The generated artifacts used in this post are:
golden/post5/post5-backpressure-summary.csv
golden/post5/post5-backpressure-snapshots.csv
golden/post5/post5-acceptance.png
golden/post5/post5-latency.png
GoldenOutputTest.java runs this deterministic Article 5 path and compares post5-backpressure-summary.csv and post5-backpressure-snapshots.csv against the checked-in golden files. A fresh run for this post wrote results/article-grounding/post5/manifest.json; both Article 5 CSV artifacts had golden_match: true, with SHA-256 values matching their golden SHA-256 values.
References used for the underlying backpressure ideas, not for the generated lab numbers: