engnotes.dev
NotebookTopicsAbout

Subscribe

One email when a new post goes up. Nothing else.

one per post · no tracking · also on RSS

Site

  • Notebook
  • Topics
  • About
  • Contact

Topics

Project Loom9Structured Concurrency9Tail Latency & System Behavior5

Elsewhere

  • GitHub
  • X
  • LinkedIn
  • Email
engnotes.dev© 2026 Jagdish Salgotra · written on personal time. not on employer time.
PrivacyTermsCookies
blog/tail-latency-system-behavior/part 5
Tail Latency & System Behavior · Part 5 of 5

Backpressure Design Patterns

A deterministic Java comparison of five admission strategies under identical 2x overload. Reject-fast policies (token bucket, rate limiter) hold p99 at 10ms; a bounded queue accepts fewer requests at a 500ms p99, because under sustained overload a buffer is the worst of both.

J
Jagdish Salgotra
2026-06-28·7 min read·~1,200 words

Series navigation

← Previous · Part 4The Coordinated Omission Problem
Code repositoryproduction-systems-labs
#tail-latency-system-behavior
share
J

Written by

Jagdish Salgotra

Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.

all posts

Keep reading · rest of the series

  • 2026-05-319 min read
    Part 1
    Why Average Latency Lies
  • 2026-06-0710 min read
    Part 2
    Queueing Theory for Engineers
  • 2026-06-147 min read
    Part 3
    Hedged Requests & Speculative Execution
  • 2026-06-216 min read
    Part 4
    The Coordinated Omission Problem
Was this article helpful? or email →
anonymous · no account needed

On this page

Reading progress

0 min of 7 · ~7 left

Ask the post

Any answer points back at the paragraph it came from.

Production System Labs - Series 1, Post 5 Runnable Java experiments on the failure patterns that show up under real load. Deterministic outputs, checked-in CSVs, reproducible on any machine.


More is arriving than can ever finish

More work is arriving than the service can finish. Not briefly, not a spike that drains in a second, but a sustained period where the offered rate is above the service rate. Post 2 showed what that does to latency when nothing pushes back: the queue grows without bound and waiting time climbs until requests are answered long after anyone cares.

Backpressure is the decision you make instead of letting that happen. BackpressureScenario.java offers the same overload to five admission strategies: 1000 incoming requests over 5s, an incoming rate of 200 rps, and a target admit rate of 100 rps. The summary in golden/post5/post5-backpressure-summary.csv records what each strategy does:

text
strategy                    incoming  accepted  rejected  acceptance_pct  p50_ms  p99_ms  max_buffered
token-bucket                1000      599       401       59.9            10.0    10.0    0
bounded-queue               1000      549       451       54.9            500.0   500.0   50
leaky-bucket                1000      524       476       52.4            250.0   250.0   25
load-shedder                1000      500       500       50.0            50.0    50.0    5
resilience4j-rate-limiter   1000      500       500       50.0            10.0    10.0    0

The overload is identical in every row. What changes is where the pain goes.

We once had an inbound queue with no upper bound, and when traffic spiked it did not fall over loudly; it just kept accepting work until every request being served was already minutes old and nobody on the other end was still waiting for the answer.


A queue under sustained overload is the worst of both

The instinct, the one that built that unbounded queue, is to accept everything and let the system "catch up." Backpressure is the admission that there is no catching up while the overload lasts. The excess demand has to become something, and you only get to choose what.

Read the table as a set of choices, not a ranking. The token bucket and the rate limiter keep latency flat at 10 ms p99 because they reject the moment they are out of capacity; accepted requests are served immediately, and the cost lands entirely on the rejected 40 to 50%. The bounded queue makes the opposite bet: instead of rejecting at the capacity line, it buffers up to 50 requests and serves them late. Its p99 is 500 ms, which is exactly a full queue of 50 requests each taking 10 ms to serve. The leaky bucket is the same idea with a shallower buffer, 25 deep for a 250 ms p99.

Here is the part that should correct the intuition. The buffer did not even buy more accepted work for all that latency. The bounded queue accepted 549 requests, fewer than the token bucket's 599, because under a sustained overload the buffer fills up and then rejects anyway. It paid 500 ms of latency on everything it accepted and still turned away more requests than the policy that stayed at 10 ms. If your 200 ms SLO is the thing you are protecting, that 500 ms p99 did not protect it; it converted rejection into deadline misses, which are often worse, because the client already gave up and the work you did was wasted.

That is the lesson a queue does not advertise: buffering only helps when overload is transient, a burst it can absorb and drain after the spike passes. Against a sustained overload like this one, a buffer is the worst of both worlds, rejecting like the others and making everything it accepts slow.

The first second tells the rest of the story. From golden/post5/post5-backpressure-snapshots.csv:

text
strategy        elapsed_s  accepted  rejected  throughput_rps  p50_ms  p99_ms  max_buffered
token-bucket    1          199       1         199.0           10.0    10.0    0
bounded-queue   1          149       51        149.0           380.0   500.0   50

The token bucket lets a burst of 199 through in the first second because it starts full, then settles to the refill rate. The bounded queue admits 149, but its p50 is already 380 ms and its buffer is already full, because every admission past the service rate goes to the back of a line.


Ten lines of arithmetic, not magic

Build the primitives by hand before reaching for a library, because the behavior in that table is not magic, it is about ten lines of arithmetic each.

A token bucket holds a balance that refills at a fixed rate and caps at a capacity. Admission spends a token; if the balance is below one, the request is rejected immediately. The capacity is the burst allowance, which is why the first second let 199 through. TokenBucket.java is the hand-rolled version used in this run:

java
public BackpressureDecision admit(long arrivalMs) {
    refill(arrivalMs);
    if (tokens < 1.0) {
        return BackpressureDecision.rejected();
    }
    tokens -= 1.0;
    return BackpressureDecision.accepted(serviceTimeMs);
}

private void refill(long nowMs) {
    long elapsedMs = Math.max(0L, nowMs - lastRefillMs);
    tokens = Math.min(capacity, tokens + (elapsedMs * refillPerMs));
    lastRefillMs = nowMs;
}

Rejection is instant, so accepted requests never inherit a queue. That is the whole reason its p99 stays at the base 10 ms.

A bounded queue is the other shape. It tracks the completion times of admitted work, drops anything already finished, and rejects only when the buffer is full. Crucially, an admitted request starts after everything ahead of it, so it carries the accumulated wait. BoundedQueue.java makes that wait explicit:

java
public BackpressureDecision admit(long arrivalMs) {
    removeCompleted(arrivalMs);
    if (completionTimes.size() >= capacity) {
        return BackpressureDecision.rejected();
    }
    long startMs = completionTimes.isEmpty()
            ? arrivalMs
            : Math.max(arrivalMs, completionTimes.peekLast());
    long completionMs = startMs + serviceTimeMs;
    completionTimes.addLast(completionMs);
    return BackpressureDecision.accepted(completionMs - arrivalMs);
}

completionMs - arrivalMs is the latency the caller sees, and it grows with queue depth. That single line is the difference between the token bucket's flat 10 ms and the bounded queue's 500 ms. The load shedder is a stricter variant of the same idea, capping in-flight work at a small number (here 5, giving a 50 ms p99), which keeps a tiny buffer for smoothing without ever letting latency run away.

Once the mechanics are clear, the library has nothing to teach you about behavior, only about robustness. The Resilience4jRateLimiterController.java row behaves like the hand-rolled token bucket without the burst tuning, accepting 500 at a flat 10 ms p99. Reaching for it is the right move in production, but only once you can predict, from the table above, what it will do under overload, and what it will do to the requests it turns away.

![Latency per strategy: flat for reject-fast policies, elevated for buffering policies]


Choose who absorbs the overload, on purpose

Backpressure does not reduce load. It chooses who absorbs the overload and in what currency, and that choice should be deliberate.

Everything that can queue needs a bound. An unbounded buffer is not resilience; it is a delayed, worse outage where every request is stale on arrival. The overload currency also needs a decision before the incident. Reject-fast policies such as token buckets and rate limiters spend overload as rejections and keep accepted latency flat. Buffering policies such as bounded queues and leaky buckets spend overload as latency, and only earn that latency back if the overload is a transient burst they can drain. Against the sustained overload here, the buffering strategies rejected more and ran slower. There is no option that spends nothing.

A buffer that pushes accepted latency past your SLO has converted rejection into deadline misses. Queue size should come from the latency budget, not from available memory. Load shedding close to the edge keeps rejected requests cheap, and the load shedder held latency at 50 ms precisely because it never let work pile up.

The primitive is worth building once by hand so the behavior is predictable, then the battle-tested library can own the production hardening. Resilience4j is the right production choice; understanding the token bucket is what lets you configure it without guessing.

The hard part is not implementing any of these. It is accepting, before traffic forces the issue, that some requests will not be served, and choosing which ones on purpose rather than letting an unbounded queue decide by drowning everyone equally.

The command that regenerates the numbers used here is:

bash
./gradlew :latency-lab:runBackpressure \
  -Pargs="--deterministic --duration 5s --concurrency 10 --output-dir ./results/backpressure --snapshot-interval 1s --backpressure-strategy all"

The generated artifacts used in this post are:

text
golden/post5/post5-backpressure-summary.csv
golden/post5/post5-backpressure-snapshots.csv
golden/post5/post5-acceptance.png
golden/post5/post5-latency.png

GoldenOutputTest.java runs this deterministic Article 5 path and compares post5-backpressure-summary.csv and post5-backpressure-snapshots.csv against the checked-in golden files. A fresh run for this post wrote results/article-grounding/post5/manifest.json; both Article 5 CSV artifacts had golden_match: true, with SHA-256 values matching their golden SHA-256 values.

References used for the underlying backpressure ideas, not for the generated lab numbers:

  • Google SRE, Handling Overload.
  • Resilience4j, Resilience4j documentation.