Structured Concurrency · Part 6 of 9

Composing resilience policies as separable layers

Each resilience policy stays a separable layer that fires visibly in logs, or you rebuild the complexity you removed.

Jagdish Salgotra

2026-04-26·10 min read·~1,500 words

All code in this series targets Java 21 preview APIs. Part 9 covers migration to Java 25.

One policy is easy to explain

A timeout by itself is easy to explain. A fallback by itself is easy to explain. A retry by itself is easy to explain. A circuit breaker by itself is easy to explain.

The trouble starts when one request needs several of them.

Should the retry happen inside the timeout budget, or should each retry get its own budget? Should fallback run after all retries fail, or should it race the primary path? Should the breaker count fallback failures? Should a timeout count as a breaker failure? If a child task succeeds while another task is retrying, does the parent still wait?

Those are not framework questions. They are policy questions.

Structured concurrency gives you a local ownership boundary for related work. It does not tell you how to stack timeout, retry, fallback, breaker, and degradation behavior. If those policies are hidden inside nested lambdas, correctness becomes a scavenger hunt.

Figure: Policy composition. Deeply nested operators that hide each policy, versus a layered stack where scope lifecycle, timeout, retry, fallback, and admission are each a separable layer.

This is still learning material. Structured concurrency is a preview API, and the examples are here to make policy shape visible rather than to claim production behavior. The snippets in this post use Java 21 preview syntax, which is maintained separately on the feature/java-21 branch. The main branch now builds with OpenJDK 25.0.2 and uses the Java 25 preview structured-concurrency API, and the measurements below were generated from that Java 25 code.

Keep ownership separate from policy

The cleanest pattern in ScopedRequestHandler.java is not a clever algorithm. It is the separation between "run this work inside a scope" and "apply a policy around that work."

In Java 21 preview syntax, the basic scoped helper looks like this:

java

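public <T> T runInScope(Callable<T> task) throws Exception {
    // The scope owns the forked task; leaving the try block ends that ownership.
    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        Subtask<T> result = scope.fork(task);

        scope.join();            // wait for the forked task to finish
        scope.throwIfFailed();   // rethrow a task failure, wrapped in ExecutionException

        return result.get();     // safe here: the task completed successfully
    }
}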

That helper owns a very small responsibility. It forks one task, joins the scope, propagates failure, and returns the result.

The retry helper in the repository builds on that shape:

java

public <T> T runWithRetry(Callable<T> task, int maxRetries, Duration retryDelay) throws Exception {
    Exception lastException = null;

    for (int attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            return runInScope(task);   // each attempt opens and closes its own scope
        } catch (Exception failure) {
            // This also catches InterruptedException from join(),
            // so an interrupt is retried like any other failure.
            lastException = failure;
            if (attempt < maxRetries) {
                Thread.sleep(retryDelay.toMillis());   // no delay after the final attempt
            }
        }
    }

    throw new RuntimeException("All " + maxRetries + " attempts failed", lastException);
}

The retry loop does not fork unowned background work. Each attempt gets its own scoped call. That makes the attempt boundary visible.

This is also where the policy choice becomes visible: each retry attempt is scoped, and the delay sits between attempts. Despite its name, maxRetries counts total attempts, so a value of 3 means one initial call plus two retries. The delay is acceptable in this learning code because requests run on virtual threads, so Thread.sleep(...) parks the virtual thread rather than blocking a platform thread. If the same policy ran on platform threads, the delay would deserve closer scrutiny.

Retry evidence from the service

The GET /retry/operation?op=important-task endpoint in StructuredMicroservice.java uses BusinessService.runRetryableOperation():

java

return scopedHandler.runWithRetry(
    () -> unstableExternalService(operation),
    3,
    Duration.ofMillis(500)
);

The simulated external service sleeps for 100ms and then fails 60% of the time.

A 10-call sequence showed success on each of the three attempts, plus the exhausted case:

text

retry-01 Error: All 3 attempts failed status=500
retry-02 External-important-task Duration: 1317ms status=200
retry-03 Error: All 3 attempts failed status=500
retry-04 External-important-task Duration: 105ms status=200
retry-05 External-important-task Duration: 1324ms status=200
retry-06 External-important-task Duration: 106ms status=200
retry-07 External-important-task Duration: 105ms status=200
retry-08 Error: All 3 attempts failed status=500
retry-09 External-important-task Duration: 711ms status=200
retry-10 Error: All 3 attempts failed status=500

The durations explain the policy. A first-attempt success lands around 105ms. A second-attempt success lands around 700ms: a failed 100ms attempt, a 500ms sleep, and a successful 100ms attempt. A third-attempt success lands around 1.3s: three 100ms attempts plus two 500ms delays.
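The arithmetic behind those latencies can be written down directly. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes a flat 100ms per attempt, a 500ms delay, and independent failures at the 60% rate described for the simulated service.

```java
/** Illustrative arithmetic for the retry policy above. All names here are hypothetical. */
final class RetryLatencyModel {
    static final long ATTEMPT_MS = 100;      // simulated service sleep, from the post
    static final long DELAY_MS = 500;        // retry delay passed to runWithRetry
    static final double FAILURE_RATE = 0.6;  // assumed independent per attempt

    /** Latency when the given attempt (1-based) is the one that succeeds. */
    static long latencyWhenAttemptWins(int attempt) {
        return attempt * ATTEMPT_MS + (attempt - 1) * DELAY_MS;
    }

    /** Probability that the first success happens on the given attempt. */
    static double probabilityAttemptWins(int attempt) {
        return Math.pow(FAILURE_RATE, attempt - 1) * (1 - FAILURE_RATE);
    }

    /** Probability that every one of maxAttempts attempts fails. */
    static double probabilityExhausted(int maxAttempts) {
        return Math.pow(FAILURE_RATE, maxAttempts);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.printf("attempt %d wins: ~%dms, p=%.3f%n", attempt,
                latencyWhenAttemptWins(attempt), probabilityAttemptWins(attempt));
        }
        System.out.printf("all 3 attempts fail: p=%.3f%n", probabilityExhausted(3));
    }
}
```

Under those assumptions the model gives 100ms, 700ms, and 1300ms for the three success paths, and about a 21.6% chance of exhausting all three attempts. The log above shows four exhausted calls out of ten, which is higher, but a ten-call sample is too small to read much into.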

That is the kind of evidence a composed policy needs. "Retry works" is not enough. You need to see which attempt won, how much delay the policy added, and what happens when the attempt budget is exhausted.
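One way to collect that evidence is to have the retry helper return what the policy did alongside the value. This is a minimal sketch, not the repository's helper: ObservableRetry and Outcome are hypothetical names, and the task is called directly where the repository code would call runInScope(task), so the example compiles without preview flags.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a retry helper that reports its policy path. All names here are hypothetical. */
final class ObservableRetry {

    /** What the policy did, not just what the task returned. value is null when exhausted. */
    record Outcome<T>(T value, int attemptsUsed, Duration delayPaid, boolean exhausted) {}

    static <T> Outcome<T> run(Callable<T> task, int maxAttempts, Duration delay)
            throws InterruptedException {
        Duration delayPaid = Duration.ZERO;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Assumption: direct call; the repository helper would use runInScope(task).
                return new Outcome<>(task.call(), attempt, delayPaid, false);
            } catch (InterruptedException interrupted) {
                throw interrupted;   // interruption is cancellation, not a retryable failure
            } catch (Exception failure) {
                if (attempt < maxAttempts) {
                    Thread.sleep(delay.toMillis());
                    delayPaid = delayPaid.plus(delay);
                }
            }
        }
        return new Outcome<>(null, maxAttempts, delayPaid, true);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Fails twice, then succeeds on the third attempt.
        Outcome<String> outcome = ObservableRetry.<String>run(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated failure");
            }
            return "External-important-task";
        }, 3, Duration.ofMillis(50));
        System.out.println(outcome);
    }
}
```

A test can then assert on attemptsUsed and delayPaid directly instead of inferring the winning attempt from response time.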

Fallback is a separate response path

Fallback should not be invisible.

The fallback helper in ScopedRequestHandler.java keeps fallback outside the primary scope:

java

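public <T> T runWithFallback(Callable<T> primary, Callable<T> fallback) throws Exception {
    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        Subtask<T> primaryResult = scope.fork(primary);

        scope.join();
        scope.throwIfFailed();

        return primaryResult.get();   // primary succeeded: the fallback is never called
    } catch (Exception primaryFailure) {
        // The fallback runs on the calling thread, outside any scope.
        // primaryFailure is dropped rather than logged, and an
        // InterruptedException from join() also lands here.
        return fallback.call();
    }
}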

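The primary path is scoped. If it succeeds, the fallback is not called. If it fails, the fallback runs only after the primary scope has closed, because the catch block of a try-with-resources statement runs after its resources are closed.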

That sequencing matters because fallback is not just another sibling. If you run primary and fallback every time, you double load by default. Sometimes a race is exactly what you want, but that should be a separate policy with a separate name.

Figure: Retry and fallback as a structured policy path. Ad-hoc nested try/catch with no clear boundary, versus a flow where each attempt is scoped and success, retry, and fallback are named, observable exits.

The endpoint GET /data/with-fallback?key=userdata has a 100ms primary database path that fails randomly and a 150ms secondary path.

One local request took the fallback path:

text

Secondary-userdata
Duration: 262ms

That timing matches the code: roughly 100ms for the failed primary plus 150ms for secondary, with request overhead.

The focused load check mixed primary successes and fallback responses:

text

wrk -t4 -c40 -d10s http://localhost:8085/data/with-fallback?key=userdata
Latency avg: 146.76ms
Requests/sec: 269.54
Requests: 2,719

The high standard deviation in that run is expected. The endpoint has two visible paths: around 100ms when primary succeeds, and around 250ms when fallback runs.

The important point is not that fallback is "fast." The important point is that fallback has a measurable shape and a separate response contract.
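A separate response contract can be as small as a result type that names which path answered. The sketch below is an assumption about how that might look, not code from the repository: NamedFallback, Source, and Response are hypothetical names, and the primary is called directly rather than inside a scope so the example runs without preview flags.

```java
import java.util.concurrent.Callable;

/** Sketch of a fallback helper whose response names its path. All names here are hypothetical. */
final class NamedFallback {

    enum Source { PRIMARY, FALLBACK }

    /** primaryFailure is null when the primary path answered. */
    record Response<T>(T value, Source source, Exception primaryFailure) {}

    static <T> Response<T> call(Callable<T> primary, Callable<T> fallback) throws Exception {
        T value;
        try {
            value = primary.call();
        } catch (InterruptedException interrupted) {
            throw interrupted;   // cancellation should not trigger the fallback
        } catch (Exception primaryFailure) {
            // Sequential, not a race: the fallback starts only after the primary has failed.
            return new Response<>(fallback.call(), Source.FALLBACK, primaryFailure);
        }
        return new Response<>(value, Source.PRIMARY, null);
    }

    public static void main(String[] args) throws Exception {
        Response<String> response = NamedFallback.<String>call(
            () -> { throw new IllegalStateException("primary database unavailable"); },
            () -> "Secondary-userdata");
        System.out.println(response.source() + " -> " + response.value());
    }
}
```

Keeping the primary failure in the response means a log line or metric can record why the degraded path was taken, which the bare helper cannot do.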

Circuit breakers are admission policy

The circuit breaker helper answers a different question from retry or fallback:

Should this call be attempted at all?

java

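public <T> T runWithCircuitBreaker(Callable<T> task, CircuitBreakerConfig config) throws Exception {
    // Admission check first: while the breaker is open, no scope is opened and nothing is forked.
    if (config.isOpen()) {
        throw new RuntimeException("Circuit breaker is OPEN - failing fast");
    }

    try {
        T result = runInScope(task);
        config.onSuccess();   // only admitted calls update breaker state
        return result;
    } catch (Exception failure) {
        config.onFailure();   // recorded against the failure threshold
        throw failure;
    }
}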

The breaker check happens before the scoped call. That is the correct boundary. Once the breaker is open, the downstream call should not be forked and then discarded. It should not exist.

The endpoint GET /protected/service?req=test-request uses a threshold of three failures and a 30-second open interval.

One 15-call sequence showed the state transition:

text

protected-01 success Duration: 106ms
protected-02 success Duration: 102ms
protected-03 success Duration: 105ms
protected-04 success Duration: 105ms
protected-05 success Duration: 105ms
protected-06 success Duration: 105ms
protected-07 success Duration: 103ms
protected-08 success Duration: 106ms
protected-09 failure: Unreliable service failed
protected-10 failure: Unreliable service failed
protected-11 failure: Unreliable service failed
protected-12 open: Circuit breaker is OPEN - failing fast
protected-13 open: Circuit breaker is OPEN - failing fast
protected-14 open: Circuit breaker is OPEN - failing fast
protected-15 open: Circuit breaker is OPEN - failing fast

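That sequence matters more than any single response. The early successes reset the breaker. Calls 9 through 11 failed consecutively and reached the threshold of three. Calls 12 through 15 were rejected by admission policy before the unreliable service was called. The sequence ends before the 30-second open interval elapses, so it does not show recovery.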

Composed resilience code needs this kind of sequence test. A single protected-service request cannot tell you whether the breaker counts consecutive failures, resets on success, or rejects before downstream work is attempted.
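A sequence test of that kind is easier to write when time is passed in rather than read from the system clock. The sketch below is a stand-in for illustration, not the repository's CircuitBreakerConfig: SequenceBreaker, replay, and the one-call-per-second clock are all assumptions, and the class is single-threaded on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal consecutive-failure breaker for sequence tests. Hypothetical; not thread-safe. */
final class SequenceBreaker {
    private final int threshold;
    private final long openMillis;
    private int consecutiveFailures;
    private long openedAt = -1;   // -1 means the breaker has not tripped

    SequenceBreaker(int threshold, long openMillis) {
        this.threshold = threshold;
        this.openMillis = openMillis;
    }

    /** Admission check: true until the open interval has elapsed. */
    boolean isOpen(long nowMillis) {
        return openedAt >= 0 && nowMillis - openedAt < openMillis;
    }

    /** Returns "success", "failure", or "open" so a test can assert on the whole sequence. */
    String call(Callable<String> task, long nowMillis) {
        if (isOpen(nowMillis)) {
            return "open";             // rejected before the task is ever invoked
        }
        try {
            task.call();
            consecutiveFailures = 0;   // any success resets the count
            openedAt = -1;
            return "success";
        } catch (Exception failure) {
            if (++consecutiveFailures >= threshold) {
                openedAt = nowMillis;
            }
            return "failure";
        }
    }

    /** Downstream succeeds for the first healthyCalls calls, then always fails. */
    static List<String> replay(SequenceBreaker breaker, int healthyCalls, int totalCalls,
                               AtomicInteger downstreamCalls) {
        List<String> observed = new ArrayList<>();
        for (int i = 0; i < totalCalls; i++) {
            final boolean healthy = i < healthyCalls;
            long now = i * 1_000L;     // one call per simulated second
            observed.add(breaker.call(() -> {
                downstreamCalls.incrementAndGet();
                if (!healthy) {
                    throw new IllegalStateException("Unreliable service failed");
                }
                return "ok";
            }, now));
        }
        return observed;
    }

    public static void main(String[] args) {
        AtomicInteger downstreamCalls = new AtomicInteger();
        System.out.println(replay(new SequenceBreaker(3, 30_000), 8, 15, downstreamCalls));
        System.out.println("downstream invocations: " + downstreamCalls.get());
    }
}
```

Replaying eight healthy calls followed by failures reproduces the shape of the log above: eight successes, three failures, then open rejections, with the downstream invoked only for the eleven admitted calls. Because the clock is a parameter, the same test can also step past the open interval to check recovery.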

Timeout placement changes the meaning of the policy

The checked-in timeout helper is useful because it shows a mistake clearly.

java

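public <T> T runInScopeWithTimeout(Callable<T> task, Duration timeout) throws Exception {
    Instant deadline = Instant.now().plus(timeout);

    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        Subtask<T> result = scope.fork(task);

        // The mistake: join() has no deadline, so it waits for as long as the task runs.
        scope.join();

        // Reached only after the task has finished: this reports lateness, it does not prevent it.
        if (Instant.now().isAfter(deadline)) {
            throw new TimeoutException("Operation exceeded timeout: " + timeout);
        }

        // Without throwIfFailed(), a failed task surfaces here as IllegalStateException.
        return result.get();
    }
}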

The deadline is checked after join(). That means it detects lateness after the task finishes. It does not enforce the deadline while waiting.

The endpoint GET /timed/operation?op=slow-task uses a two-second budget and a 1500ms simulated service, so it succeeds:

text

Slow-slow-task
Duration: 1503ms

The load check stayed at the same shape:

text

wrk -t2 -c20 -d10s http://localhost:8085/timed/operation?op=slow-task
Latency avg: 1.51s
Requests/sec: 11.90
Requests: 120

This endpoint is not demonstrating timeout failure. It is demonstrating a timed wrapper around a task that finishes inside the budget.

That distinction matters when policies compose. If a retry loop wraps this helper, the timeout budget applies per attempt. If this helper wraps the whole retry loop, the timeout budget applies to the entire operation. If the code checks the deadline only after join(), neither policy stops waiting at the deadline.

Figure: Timeout placement changes the meaning of the budget. Timeout inside retry gives each attempt its own fresh 2s deadline, versus timeout around retry giving one 2s budget for the whole operation that can stop a later retry mid-flight.
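The two placements give very different worst-case waits, which a few lines of arithmetic make concrete. This sketch assumes the timeout is actually enforced at the wait boundary, and TimeoutBudget and its method names are hypothetical.

```java
import java.time.Duration;

/** Worst-case arithmetic for the two timeout placements. All names here are hypothetical. */
final class TimeoutBudget {

    /** Timeout inside retry: every attempt gets a fresh budget, with a delay between attempts. */
    static Duration worstCasePerAttempt(Duration timeout, int maxAttempts, Duration retryDelay) {
        return timeout.multipliedBy(maxAttempts)
                      .plus(retryDelay.multipliedBy(Math.max(0, maxAttempts - 1)));
    }

    /** Timeout around retry: one budget covers every attempt and every delay. */
    static Duration worstCaseWholeOperation(Duration timeout) {
        return timeout;
    }

    /**
     * How many complete attempts fit into one shared budget.
     * n attempts cost n * attemptTime + (n - 1) * retryDelay, so one delay is added
     * to the budget before dividing. Assumes attemptTime + retryDelay is positive.
     */
    static int attemptsThatFit(Duration budget, Duration attemptTime, Duration retryDelay) {
        long cycleMillis = attemptTime.plus(retryDelay).toMillis();
        return (int) (budget.plus(retryDelay).toMillis() / cycleMillis);
    }

    public static void main(String[] args) {
        Duration budget = Duration.ofSeconds(2);
        Duration delay = Duration.ofMillis(500);
        System.out.println("per attempt:     " + worstCasePerAttempt(budget, 3, delay));
        System.out.println("whole operation: " + worstCaseWholeOperation(budget));
        System.out.println("1500ms attempts that fit: "
            + attemptsThatFit(budget, Duration.ofMillis(1500), delay));
    }
}
```

With the numbers used in this post (a 2s budget, three attempts, a 500ms delay), the per-attempt placement can hold a caller for up to 7s, while the wrapping placement stays at 2s and leaves room for only one complete 1500ms attempt.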

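For Java 21 preview code that needs deadline enforcement at the wait boundary, the relevant primitive is joinUntil(deadline), which takes an Instant and throws TimeoutException if the deadline passes before the wait completes.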
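The difference between checking a deadline after the wait and bounding the wait itself can be shown without preview flags. The sketch below is an analog, not the structured-concurrency API: it uses a plain ExecutorService and Future.get with a timeout, and EnforcedDeadline and callWithin are hypothetical names.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Stable-API analog of a deadline enforced at the wait boundary. Names are hypothetical. */
final class EnforcedDeadline {

    static <T> T callWithin(Callable<T> task, Duration timeout) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            try {
                // The wait itself is bounded: get() gives up once the timeout elapses.
                return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException late) {
                future.cancel(true);   // interrupt the task so it does not outlive the caller
                throw late;
            } catch (ExecutionException failed) {
                // Surface the task's own exception where possible.
                Throwable cause = failed.getCause();
                throw cause instanceof Exception ? (Exception) cause : failed;
            }
        } finally {
            executor.shutdownNow();    // nothing started here is left running
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        try {
            EnforcedDeadline.<String>callWithin(() -> {
                Thread.sleep(1_500);   // same shape as the 1500ms simulated service
                return "Slow-slow-task";
            }, Duration.ofMillis(200));
        } catch (TimeoutException expected) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("gave up after ~" + elapsedMs + "ms");
        }
    }
}
```

With a 200ms budget against a 1500ms task, the caller stops waiting near the budget instead of near the task duration, which is the behavior the post-join() check cannot provide.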

Multi-stage workflows need one owner per stage

The order-processing endpoint is the best composition example in the clean service.

BusinessService.processOrder() has two stages:

java

var validationResult = scopedHandler.runInParallel(
    () -> validatePayment(orderId),
    () -> validateInventory(orderId),
    () -> validateShipping(orderId)
);

if (allValidationsPassed(validationResult)) {
    return scopedHandler.runInParallel(
        () -> chargePayment(orderId),
        () -> reserveInventory(orderId),
        () -> scheduleShipping(orderId)
    ).toString();
}

The validation stage has 100ms, 80ms, and 60ms branches. The fulfillment stage has 200ms, 150ms, and 100ms branches. The stages are sequential because fulfillment depends on validation, but each stage uses a local scope for its own sibling work.

One request returned:

text

TripleResult[result1=Payment-Charged-ORD-123, result2=Inventory-Reserved-ORD-123, result3=Shipping-Scheduled-ORD-123]
Duration: 312ms

The load check was:

text

wrk -t4 -c40 -d10s http://localhost:8085/order/process?orderId=ORD-123
Latency avg: 309.09ms
Requests/sec: 126.85
Requests: 1,280

The timing matches the policy shape. The first stage tracks the slowest validation branch, around 100ms. The second stage tracks the slowest fulfillment branch, around 200ms. The total lands near 300ms because the two stages are intentionally ordered.

That is the difference between structured composition and "just parallelize everything." The code keeps the business dependency visible.
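The stage boundary can be sketched with stable APIs as well. This is an illustration under assumptions, not the repository's runInParallel: StagedWorkflow, runStage, and process are hypothetical names, each stage needs at least one task, and unlike ShutdownOnFailure, invokeAll does not cancel siblings when one task fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Two sequential stages, each running its own tasks in parallel. Names are hypothetical. */
final class StagedWorkflow {

    /** Runs one stage's tasks in parallel and returns only after every task has finished. */
    static <T> List<T> runStage(List<Callable<T>> tasks) throws Exception {
        ExecutorService stage = Executors.newFixedThreadPool(tasks.size());
        try {
            List<T> results = new ArrayList<>();
            for (Future<T> done : stage.invokeAll(tasks)) {   // blocks until all tasks complete
                results.add(done.get());                       // a task failure rethrows here
            }
            return results;
        } finally {
            stage.shutdown();   // the stage owns its threads and releases them on exit
        }
    }

    /** Stage two is submitted only if every stage-one check passed. */
    static String process(List<Callable<Boolean>> validations,
                          List<Callable<String>> fulfillment) throws Exception {
        List<Boolean> checks = runStage(validations);
        if (checks.contains(Boolean.FALSE)) {
            return "stage two skipped";
        }
        return String.join(", ", runStage(fulfillment));
    }

    public static void main(String[] args) throws Exception {
        Callable<Boolean> payment = () -> true;
        Callable<Boolean> inventory = () -> true;
        Callable<String> charge = () -> "Payment-Charged";
        Callable<String> reserve = () -> "Inventory-Reserved";
        System.out.println(process(List.of(payment, inventory), List.of(charge, reserve)));
    }
}
```

The dependency between stages is an ordinary if statement between two owned blocks of parallel work, so a failed validation means the fulfillment tasks are never submitted.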

Aggregation is the baseline

The service aggregation endpoint is the simplest comparison point.

It calls four services:

Branch	Simulated delay
auth	100ms
user	150ms
notification	80ms
analytics	200ms

One request returned:

text

Aggregated 4 services in 205ms: [Auth-OK, User-OK, Notification-OK, Analytics-OK]
Outer request duration: 210ms

The load check matched the 200ms slowest branch:

text

wrk -t4 -c40 -d10s http://localhost:8085/services/aggregate
Latency avg: 205.55ms
Requests/sec: 194.25
Requests: 1,960

This baseline matters because every extra policy should explain what it adds. Retry adds attempts and delay. Fallback adds a degraded response path. A breaker adds admission state. A staged workflow adds ordering. If a policy does not change behavior in a visible way, it may only be adding complexity.
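The load numbers in this post can also be cross-checked against each other. For a closed-loop client such as wrk, where each connection sends its next request only after the previous one returns, throughput should be close to connections divided by average latency. The sketch below applies that rule to the figures quoted above; ClosedLoopCheck and its methods are hypothetical names.

```java
/** Cross-check of the quoted wrk figures using connections / latency. Names are hypothetical. */
final class ClosedLoopCheck {

    /** Expected throughput for a closed-loop client with the given concurrency. */
    static double expectedRequestsPerSecond(int connections, double avgLatencyMillis) {
        return connections / (avgLatencyMillis / 1000.0);
    }

    static void report(String endpoint, int connections, double avgLatencyMillis, double measured) {
        System.out.println(endpoint + ": expected ~"
            + Math.round(expectedRequestsPerSecond(connections, avgLatencyMillis))
            + " req/s, measured " + measured);
    }

    public static void main(String[] args) {
        // Connections, latencies, and measured rates are the wrk figures quoted in the post.
        report("/data/with-fallback", 40, 146.76, 269.54);
        report("/order/process", 40, 309.09, 126.85);
        report("/services/aggregate", 40, 205.55, 194.25);
    }
}
```

The expected rates come out at roughly 273, 129, and 195 requests per second, each within a few percent of the measured value, which suggests latency rather than server capacity set the throughput in these runs. The timed endpoint deviates more (20 connections at 1.51s predicts about 13 requests per second against a measured 11.90), plausibly because a 10-second run fits only six 1.5-second rounds per connection, and 20 × 6 is exactly the 120 requests reported.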

What to test

Composed policies need tests that follow the policy path, not just the final HTTP status. Retry tests should identify whether success happened on the first, second, or third attempt, and exhausted retry should have its own assertion. Fallback tests should prove the fallback ran after primary failure, not alongside a successful primary. Circuit-breaker tests should use sequences long enough to show reset, threshold, open rejection, and recovery timing.

Timeout tests need to check where the budget is enforced. A deadline checked after join() is not the same as a deadline that stops waiting. For multi-stage workflows, test stage boundaries: validation should finish before fulfillment starts, and failure in validation should prevent fulfillment from being forked.

The most useful test output names the policy path. "200 OK" is not enough. "Primary succeeded," "fallback used," "retry exhausted," "breaker open," and "stage two skipped" are the observations that keep composed concurrency code understandable.

What comes next

Part 7 moves from small helper policies to larger service patterns. The same rule carries forward: keep ownership local, keep policy names honest, and make the measured behavior match the story.

The companion repository at github.com/salgotraja/project-loom has runnable examples for every post in this series. The README covers build setup and the scripts used to reproduce the checks.
