Series navigation
Written by
Jagdish Salgotra
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
Structured concurrency patterns are worth the complexity only when the cancellation policy is decided before the first fork, not after the first timeout. The three we run: aggregation on quorum, bulkheading per tenant, and a deadline shape that protects against one slow upstream.
Written by
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
Java version note Code here targets Java 21 preview and uses
ShutdownOnSuccessandShutdownOnFailuredirectly. On Java 25 this will not compile. The mental models transfer cleanly. Only the syntax changes. Part 9 covers the migration.
The failures that force you to reach for these patterns show up the same way: p50 looks healthy while p99 is painful. Every dependency claims it is mostly fine in isolation. One slow optional call punishes the whole request.
These patterns are worth the complexity only when you can answer four questions before forking anything: which result counts, how long you wait, what a degraded response looks like, and how much extra load you will accept downstream. If you cannot answer those four, the structured concurrency machinery just hides the problem one layer deeper.
ShutdownOnSuccess gives you first-success semantics. The first fork to return a result completes the scope. The rest are interrupted.
Use this when multiple idempotent read sources are genuinely equivalent — replicated services or redundant endpoints where any successful answer is acceptable. Read-replica fan-out for a search or catalog query is the typical case: three replicas hold the same data, the fastest healthy one wins, and the others are cancelled before they finish work nobody will read.
static void runWithShutdownOnSuccess() throws Exception {
try (var scope = new StructuredTaskScope.ShutdownOnSuccess<String>()) {
scope.fork(() -> slowService("Service-A", 1000));
scope.fork(() -> slowService("Service-B", 500));
scope.fork(() -> slowService("Service-C", 200));
scope.join();
String result = scope.result();
logger.info("First successful result: {}", result);
}
}
Do not use this for non-idempotent operations or sources with different semantics. First-success joiners hide slower failures by design. Monitor failure rates per source separately or silent degradation will look like normal variance.
Emit results as they land and bound the wait. Slow tasks no longer punish fast ones.
Use this for responses with optional sections: recommendations, enrichment panels, diagnostics, any page where partial data is still useful and the missing pieces are visible to callers.
A product or dashboard page that fans out to half a dozen widget services with a hard 500ms budget is the right fit. Fast widgets render. Slow ones surface as loading or get dropped. The page never blocks on the slowest dependency.
public ProgressiveSummary<T> executeWithProgressCallback(
List<Callable<T>> tasks,
ProgressCallback<T> callback,
Duration maxDuration) throws InterruptedException {
long startTime = System.currentTimeMillis();
List<T> results = new ArrayList<>(Collections.nCopies(tasks.size(), null));
List<Exception> errors = new ArrayList<>();
boolean[] completed = new boolean[tasks.size()];
int totalCompleted = 0;
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
List<StructuredTaskScope.Subtask<T>> subtasks = new ArrayList<>();
for (Callable<T> task : tasks) {
subtasks.add(scope.fork(task));
}
Instant maxTime = Instant.now().plus(maxDuration);
while (totalCompleted < tasks.size() && Instant.now().isBefore(maxTime)) {
try {
scope.joinUntil(Instant.now().plusMillis(50));
} catch (TimeoutException e) {
// 50ms polling tick elapsed before all tasks finished. Loop again.
}
for (int i = 0; i < subtasks.size(); i++) {
if (!completed[i]) {
var subtask = subtasks.get(i);
switch (subtask.state()) {
case SUCCESS:
try {
T result = subtask.get();
results.set(i, result);
completed[i] = true;
totalCompleted++;
callback.onProgress(i, result);
} catch (Exception e) {
errors.add(new RuntimeException("Task " + i + " failed", e));
completed[i] = true;
totalCompleted++;
}
break;
case FAILED:
errors.add(new RuntimeException("Task " + i + " failed"));
completed[i] = true;
totalCompleted++;
break;
case UNAVAILABLE:
// Still running. Check again next pass.
break;
}
}
}
}
if (Instant.now().isAfter(maxTime)) {
scope.shutdown();
}
return new ProgressiveSummary<>(
results.stream().filter(Objects::nonNull).collect(toList()),
errors,
totalCompleted,
tasks.size(),
System.currentTimeMillis() - startTime,
Instant.now().isAfter(maxTime)
);
} catch (Exception e) {
throw new RuntimeException("Unexpected timeout in progressive results", e);
}
}
The 50ms polling tick is deliberate. It gives completed subtasks a short window to be picked up without spinning. Tune it to your latency budget. Slow optional sections drop out cleanly instead of holding up the whole response.
Do not use this when every result is required for correctness. That is Pattern 4 territory.
The hedge fires only when the primary is slow, capping tail latency without doubling load.
Use this when one logical read has rare but painful tail spikes. The delay is the control. Without it, hedging becomes an accidental load multiplier.
Primary-key lookup against a sharded store, a cached profile fetch, an authoritative pricing call — these are the right fit. Healthy at p50, occasionally five to ten times longer for one specific request, and that long tail dominates user-visible latency.
The shape is simple. Fork a primary at t=0. Fork a hedge that sleeps for hedgeAfterMillis before doing real work. Race them under ShutdownOnSuccess. The key detail is what happens when the primary returns first: the still-sleeping hedge must be interrupted before it issues a duplicate request. On Java 21 this relies on Thread.sleep honouring interruption inside the hedge fork. It works. Java 25 makes the cancellation semantics cleaner via the Joiner API. Part 9 covers that diff.
Tune hedgeAfterMillis to your p95-to-p99 envelope. Watch the hedge-fire rate to confirm the delay is keeping duplicate load capped. If the hedge fires on more than a small percentage of requests, the delay is too short or the primary has a real problem worth investigating directly.
Do not hedge without a delay budget and load monitoring in place.
Use this when the fallback is part of the product contract, not a way to mask a broken primary path.
Search falling back to a cached top-N when the live index is unavailable. A recommendations API falling back to a generic list when the personalised model times out. The user gets a clearly degraded but coherent response. Your dashboards can count fallback hits as a real signal instead of burying them in a retry counter.
The structure matters here. The fallback runs outside the scope, sequentially, only after the primary has failed. Folding it into the same scope turns this into Pattern 3, removes the "primary failed first" trigger, and creates exactly the duplicate load this pattern was designed to avoid.
public <T> T runWithFallback(Callable<T> primary, Callable<T> fallback) throws Exception {
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
var primaryFuture = scope.fork(primary);
scope.join();
try {
scope.throwIfFailed();
return primaryFuture.get();
} catch (Exception e) {
logger.warn("Primary failed, using fallback: {}", e.getMessage());
return fallback.call();
}
}
}
Make sure callers can tell when they got the fallback path. A silent fallback is just a slow bug.
| If you have | Use | Do not use |
|---|---|---|
| Equivalent idempotent read sources | Pattern 1: First successful result | Non-idempotent writes or sources with different semantics |
| Optional results with a deadline | Pattern 2: Bounded partial results | Workflows where every result is required |
| One logical read with painful p99 | Pattern 3: Hedged read with delay | Hedging without a delay budget or load monitoring |
| Primary path plus cheap degraded response | Pattern 4: Controlled degradation contract | Expensive, stale, or misleading fallback paths |