Project Loom · Part 6 of 9

Performance Deep Dive

Benchmark numbers for virtual threads versus reactive code disagree wildly because the workload dictates the answer, not the model: I/O-bound code usually benefits, CPU-bound rarely does, and non-blocking moves work without removing it.

Jagdish Salgotra

2025-08-10·12 min read·~1,600 words


Code repository: project-loom

#project-loom

Written by

Jagdish Salgotra

Distributed systems, cloud-native architecture, and the JVM. Mostly shipping, occasionally reading.


Note: This series uses Java 21 as the baseline. Virtual threads are stable in Java 21 (JEP 444). Structured concurrency snippets in this part (StructuredTaskScope, JEP 453) use preview APIs and require --enable-preview.

Performance claims become useful when the benchmark shape is visible. A throughput number without the service delay, connection count, pool size, and error count is just a number. Virtual threads make waiting cheaper, but they do not make CPU work disappear, fix downstream limits, or turn a toy benchmark into a capacity plan.

This article is learning material. The main branch now builds with OpenJDK 25.0.2 and uses the Java 25 preview structured-concurrency API; the Java 21 version is maintained separately on the feature/java-21 branch. The measurements below were generated from the current checked-in Java 25 code. Virtual threads have been final since Java 21; the preview flag is still used in this repository because other examples in the same build use preview structured-concurrency APIs.

The useful question for this part is not "are virtual threads faster?" The useful question is narrower: what kind of work was measured, what limit should the code predict, and did the run behave close to that prediction?

What I ran

The measurements below were generated with OpenJDK 25.0.2 and Maven 3.9.12:

```bash
mvn clean compile -DskipTests
mvn dependency:build-classpath -Dmdep.outputFile=cp.txt
```

The build succeeded and compiled 35 source files.

Then I ran:

```bash
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.reactive.VirtualThreadReactiveIntegration
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.PlatformThreadPoolServer
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.VirtualThreadPoolServer
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.microservices.VirtualThreadMicroservice
./scripts/run-thread-optimized.sh
```

For the pool-server comparison, both servers bind to port 8080, so I ran them one at a time. For the richer microservice checks, I ran focused wrk commands against ports 8080 and 8086.

One script issue surfaced during the pass. run-thread-optimized.sh failed when it tried to build its own classpath, so I updated it to use the same cp.txt classpath file as the rest of the guide. I made the same change to run-memory-optimized.sh, because it used the same pattern.

A pool limit is visible under I/O wait

The smallest comparison in the repository is the cleanest one. PlatformThreadPoolServer.java and VirtualThreadPoolServer.java both expose GET /api and sleep for 200ms per request. The platform-thread version uses a fixed pool of 20 threads. The virtual-thread version uses a virtual-thread-per-task executor.
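The same contrast can be reproduced in-process, without HTTP. The sketch below is not the repository's server code: the class and method names are mine, and it only assumes Java 21 or later. It submits 200 tasks that each sleep for 200ms, first to a fixed pool of 20 platform threads and then to a virtual-thread-per-task executor.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: names are illustrative, not taken from the repository.
final class PoolVersusVirtualSketch {

    /** Submits sleeping tasks and returns the wall-clock time in ms until all finish. */
    static long runSleepingTasks(ExecutorService executor, int tasks, Duration sleep) {
        long start = System.nanoTime();
        try (executor) { // close() waits for every submitted task to complete
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    Thread.sleep(sleep); // stands in for the simulated 200ms wait
                    return null;
                });
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Duration sleep = Duration.ofMillis(200);
        long pooled = runSleepingTasks(Executors.newFixedThreadPool(20), 200, sleep);
        long virtual = runSleepingTasks(Executors.newVirtualThreadPerTaskExecutor(), 200, sleep);
        System.out.println("fixed pool of 20: " + pooled + " ms");
        System.out.println("virtual threads:  " + virtual + " ms");
    }
}
```

On a typical machine I would expect the fixed pool to need roughly ten waves of 200ms, while the virtual-thread run stays near a single 200ms wait. The exact figures depend on the machine, so treat them as a shape, not a measurement.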

The benchmark used 200 concurrent connections for 10 seconds:

```bash
wrk -t4 -c200 -d10s http://localhost:8080/api
```

The platform-thread pool result:

```text
Latency     1.05s   540.55ms   1.91s    60.00%
980 requests in 10.09s
Socket errors: connect 0, read 0, write 0, timeout 780
Requests/sec:     97.10
```

The virtual-thread result:

```text
Latency   204.67ms    1.78ms 209.50ms   69.19%
9654 requests in 10.08s
Requests/sec:    957.42
```

The math explains the result. A 20-thread pool with each worker sleeping for 200ms can complete about 20 / 0.2, or 100 requests per second. The measured platform-thread run completed 97.10 requests per second. It also reported 780 socket timeouts: with 200 clients competing for 20 workers, a queued request can wait close to two seconds, which is wrk's default timeout.

The virtual-thread server had the same 200ms simulated wait, but each request got its own virtual thread. With 200 clients and a 200ms sleep, the rough ceiling is 200 / 0.2, or 1,000 requests per second. The measured run completed 957.42 requests per second with average latency close to the sleep time.

That is the core virtual-thread performance story. The request did not become faster than 200ms. The service could keep more waiting requests in flight without needing 200 platform threads.
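The two ceilings quoted above come from the same relationship, which is Little's Law rearranged for throughput. The symbols are my notation, not the repository's: N is the number of requests that can be waiting at once (20 pool workers in one case, 200 client connections in the other) and W is the time each request holds its slot (the 200ms sleep).

```latex
\[
X_{\max} \approx \frac{N}{W}
\]
\[
\text{fixed pool: } \frac{20}{0.2\ \text{s}} = 100\ \text{req/s}
\qquad
\text{virtual threads: } \frac{200}{0.2\ \text{s}} = 1000\ \text{req/s}
\]
```

The measured 97.10 and 957.42 requests per second sit a few percent under those ceilings, which is what connection setup and scheduling overhead would suggest.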

The 8080 service shows three different shapes

The port 8080 VirtualThreadMicroservice has three basic endpoints that are useful for performance discussion:

| Endpoint | Code shape |
| --- | --- |
| `/compute` | sum primes up to 50,000 |
| `/block` | sleep for 300ms |
| `/file` | read a 10,000-line local file |

Single requests returned:

| Endpoint | Direct result |
| --- | --- |
| `/compute` | status 200 in 0.028252s, handler duration 2ms |
| `/block` | status 200 in 0.304610s, handler duration 302ms |
| `/file` | status 200 in 0.014830s, handler duration 13ms |

Then I ran each endpoint with 200 connections for 10 seconds:

| Endpoint | Load | Average latency | Requests/sec | Total requests |
| --- | --- | --- | --- | --- |
| `/compute` | `wrk -t4 -c200 -d10s` | 5.12ms | 39,631.59 | 397,797 |
| `/block` | `wrk -t4 -c200 -d10s` | 304.78ms | 634.56 | 6,400 |
| `/file` | `wrk -t4 -c200 -d10s` | 29.65ms | 8,304.23 | 83,144 |

The /block endpoint is the most trustworthy teaching number in this set. It has a fixed 300ms delay, and with 200 clients the rough ceiling is 200 / 0.3, or about 667 requests per second. The measured result was 634.56 requests per second with average latency of 304.78ms.

The /compute endpoint is different. It is short CPU work on this machine, so the measured throughput is high, but virtual threads are not the reason the prime loop is fast. The loop is small. If the CPU work becomes large enough to saturate the machine, virtual threads cannot add CPU cores.

The /file endpoint is also local to this machine. It reads a generated local file, not a network filesystem or object store. Treat that number as a local file-path measurement, not a general storage benchmark.

The final /metrics snapshot from the same 8080 service reported:

```text
Active Requests: 0
Total Requests: 487936
Average Response Time: 4.65ms
CPU Usage: 0.00%
Memory Usage: 381.45MB / 1384.00MB
JVM Uptime: 78 seconds
Thread Type: Virtual Threads
```

That aggregate average is not a headline latency number. It blends hundreds of thousands of very fast /compute requests with slower /block requests. For article evidence, the per-endpoint wrk rows are more useful.

CPU work still needs a CPU policy

The port 8086 service exists to make the CPU/I/O split visible. The server itself accepts requests on virtual threads, but /compute-optimized sends CPU work to a bounded platform-thread executor:

```java
// Bounded pool for CPU-bound work: one platform thread per available processor.
private static final ExecutorService cpuIntensiveExecutor =
    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
```

The direct checks returned:

| Endpoint | Direct result |
| --- | --- |
| `/compute-optimized` | status 200 in 0.024890s, handler duration 4ms |
| `/io-optimized` | status 200 in 0.312777s, handler duration 312ms |
| `/mixed-workload` | status 200 in 0.207103s, handler duration 205ms |

The focused load tests returned:

| Endpoint | Load | Average latency | Requests/sec | Total requests |
| --- | --- | --- | --- | --- |
| `/compute-optimized` | `wrk -t4 -c40 -d10s` | 2.75ms | 14,780.50 | 147,958 |
| `/io-optimized` | `wrk -t4 -c200 -d10s` | 312.71ms | 634.14 | 6,402 |
| `/mixed-workload` | `wrk -t4 -c80 -d10s` | 205.43ms | 388.44 | 3,920 |

The I/O endpoint again tracks the fixed wait. It sleeps for 300ms and reads a local file. With 200 clients, the rough ceiling is again about 667 requests per second; the measured result was 634.14.

The mixed endpoint runs a small CPU branch and a 200ms I/O branch. With 80 clients and a 200ms wait, the rough ceiling is 80 / 0.2, or 400 requests per second. The measured result was 388.44. That is the result you want from a performance article: the benchmark follows the code.

The final thread stats were:

```text
Active Virtual Threads: 0
Active Platform Threads: 0
Total Requests: 158579
Available Processors: 14
Average Response Time: 20ms
```

Again, the aggregate average is mixed across endpoint types. The useful part is the available-processor count and the fact that the service routes CPU work through a bounded executor rather than treating virtual threads as a CPU-scaling tool.
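To see how much an aggregate hides, it helps to rebuild one from the per-endpoint rows. The sketch below is my own illustration, not code from the repository: the record and method names are hypothetical, and the inputs are the three wrk rows for port 8086. wrk reports client-side latency while the service reports its own figure, so the two are only roughly comparable.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: rebuilds a request-weighted average from per-endpoint rows.
final class BlendedAverageSketch {

    record EndpointRun(String path, long requests, double avgLatencyMs) {}

    /** Request-weighted mean latency across all endpoint runs. */
    static double blendedAverageMs(List<EndpointRun> runs) {
        double weightedSum = 0.0;
        long totalRequests = 0;
        for (EndpointRun run : runs) {
            weightedSum += run.requests() * run.avgLatencyMs();
            totalRequests += run.requests();
        }
        return totalRequests == 0 ? 0.0 : weightedSum / totalRequests;
    }

    public static void main(String[] args) {
        List<EndpointRun> runs = List.of(
                new EndpointRun("/compute-optimized", 147_958, 2.75),
                new EndpointRun("/io-optimized", 6_402, 312.71),
                new EndpointRun("/mixed-workload", 3_920, 205.43));
        // Prints a blended value of about 20.3 ms for these rows.
        System.out.printf(Locale.ROOT, "blended average: %.1f ms%n", blendedAverageMs(runs));
    }
}
```

The blended value lands near the reported 20ms, yet no endpoint actually behaves like a 20ms route: one is under 3ms and the other two are above 200ms. That is why the per-endpoint rows carry the lesson.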

The reactive integration demo is not a reactive benchmark

VirtualThreadReactiveIntegration.java is useful, but it is not a controlled benchmark of Reactor, WebFlux, Vert.x, or another reactive runtime.

The performance section in that class compares three ways to process 1,000 items with a 10ms sleep per item:

```text
Traditional blocking: 11883ms
Virtual threads: 23ms (516.65x faster)
Structured concurrency: 18ms (660.17x faster)
Results: Traditional=1000, Virtual=1000, Structured=1000
```

This is a wait-parallelization demo. The serial loop takes close to 1000 * 10ms, plus overhead. The virtual-thread and structured versions start the waiting work concurrently, so they complete much closer to the single-item wait time.

That does not mean structured concurrency is 660x faster than a reactive framework. It means this specific serial waiting loop was the wrong baseline for concurrent waiting work. The evidence in this article is intentionally limited to fixed service delays, checked-in endpoints, local wrk output, and metrics from the same service process.
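The shape of that demo can be sketched in a few lines. This is not the code from VirtualThreadReactiveIntegration.java: the class and method names are mine, it leaves out the structured-concurrency variant because that API is still in preview, and it only assumes Java 21 or later.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of a wait-parallelization comparison.
final class WaitParallelizationSketch {

    /** Simulates one item whose only cost is waiting. */
    static void processItem(Duration wait) {
        try {
            Thread.sleep(wait);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Processes items one after another, so the waits add up. */
    static long serialMillis(int items, Duration wait) {
        long start = System.nanoTime();
        for (int i = 0; i < items; i++) {
            processItem(wait);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Starts every item on its own virtual thread, so the waits overlap. */
    static long virtualThreadMillis(int items, Duration wait) {
        long start = System.nanoTime();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < items; i++) {
                executor.submit(() -> processItem(wait));
            }
        } // close() blocks until all items have finished
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Duration wait = Duration.ofMillis(10);
        // The serial run takes at least 1,000 * 10ms, so expect ten seconds or more.
        System.out.println("serial:          " + serialMillis(1_000, wait) + " ms");
        System.out.println("virtual threads: " + virtualThreadMillis(1_000, wait) + " ms");
    }
}
```

The large ratio between the two numbers measures how much waiting was serialized in the baseline. It says nothing about how either approach compares with a reactive runtime.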

How to test performance claims

Start with the code path, not the conclusion. If a route sleeps for 300ms and the benchmark uses 200 clients, predict roughly 667 requests per second before you run it. If the result is close, the benchmark is probably measuring the thing you think it is measuring. If the result is far away, inspect the queue, pool, timeout, downstream limit, or benchmark command before writing the lesson.
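The predict-then-compare habit is easy to turn into a small check. The sketch below is illustrative only: the names are hypothetical, the 10% tolerance is my assumption and not a rule from the repository, and the measured values are the wrk results quoted in this article.

```java
import java.util.Locale;

// Hypothetical sketch: compares measured throughput with a fixed-wait prediction.
final class PredictionCheckSketch {

    /** Rough ceiling: requests that can wait at once divided by the fixed wait. */
    static double predictedRps(int inFlightLimit, double waitSeconds) {
        return inFlightLimit / waitSeconds;
    }

    /** True when the measured rate is within the given fraction of the prediction. */
    static boolean closeToPrediction(double measuredRps, double predictedRps, double tolerance) {
        return Math.abs(measuredRps - predictedRps) <= tolerance * predictedRps;
    }

    static void report(String label, int inFlightLimit, double waitSeconds, double measuredRps) {
        double predicted = predictedRps(inFlightLimit, waitSeconds);
        String verdict = closeToPrediction(measuredRps, predicted, 0.10) ? "close" : "investigate";
        System.out.printf(Locale.ROOT, "%-20s predicted %7.1f  measured %7.2f  %s%n",
                label, predicted, measuredRps, verdict);
    }

    public static void main(String[] args) {
        // The in-flight limit is the pool size for the platform server
        // and the wrk connection count for the virtual-thread endpoints.
        report("platform pool /api", 20, 0.2, 97.10);
        report("virtual /api", 200, 0.2, 957.42);
        report("/block", 200, 0.3, 634.56);
        report("/mixed-workload", 80, 0.2, 388.44);
    }
}
```

All four rows from this article fall within the assumed tolerance. A row that does not is the cue to look at queues, pools, and timeouts before drawing a conclusion.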

Keep CPU and waiting work separate. Virtual threads help most when tasks are parked waiting for I/O, timers, locks, or network calls. CPU-heavy work still needs a bounded policy tied to available processors. The 8086 service demonstrates that shape by keeping request handling simple while sending CPU work through a fixed platform-thread executor.
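A minimal sketch of that split, assuming Java 21 or later, looks like this. It is not the 8086 service: the class and method names are hypothetical, the prime loop is a stand-in for whatever CPU work a route does, and the daemon thread factory is only there so the example exits cleanly.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: virtual threads handle requests, a bounded pool runs CPU work.
final class CpuOffloadSketch {

    // One platform thread per available processor; daemon so the JVM can exit.
    static final ExecutorService CPU_POOL = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors(),
            runnable -> {
                Thread worker = new Thread(runnable, "cpu-worker");
                worker.setDaemon(true);
                return worker;
            });

    /** Stand-in CPU task: sums every prime less than or equal to the limit. */
    static long sumPrimesUpTo(int limit) {
        long sum = 0;
        for (int n = 2; n <= limit; n++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= n; d++) {
                if (n % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                sum += n;
            }
        }
        return sum;
    }

    /** Called on a virtual thread; it parks while the bounded pool does the work. */
    static long handleComputeRequest(int limit) {
        Future<Long> result = CPU_POOL.submit(() -> sumPrimesUpTo(limit));
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for CPU work", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("CPU task failed", e.getCause());
        }
    }

    public static void main(String[] args) throws Exception {
        try (ExecutorService requests = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<Long> response = requests.submit(() -> handleComputeRequest(50_000));
            System.out.println("sum of primes up to 50,000 = " + response.get());
        }
    }
}
```

However many virtual threads are accepting requests, at most availableProcessors() CPU tasks run at once and the rest queue in the pool. That queue is the CPU policy; the virtual threads only make waiting on it cheap.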

Treat aggregate service metrics as supporting evidence, not the main result. A single average response time across /compute, /block, and /file hides the route-specific behavior. Publish per-endpoint latency, throughput, total requests, error counts, and the exact load settings.

What this does not prove

This article does not prove that virtual threads are universally faster than platform threads. It proves that, in these checked-in examples, a platform pool with 20 workers queues heavily under 200 waiting clients, while virtual threads keep the same 200ms wait close to 200ms.

This article also does not prove that virtual threads beat reactive systems in general. The repository contains a reactive integration demo, not a production-grade reactive service benchmark. If you need backpressure-heavy event pipelines, benchmark that actual design.

Finally, this article does not estimate cloud cost. Cost depends on request mix, downstream limits, JVM settings, instance type, redundancy, autoscaling policy, and failure targets. A cost table without those inputs would be decoration.

What comes next

Part 7 moves from benchmark shape to operational visibility: monitoring, debugging, JFR, pinning detection, and the checks that help explain virtual-thread behavior when the local benchmark is no longer enough.

Resources

Complete Code: PlatformThreadMicroservice.java - Platform thread implementation
Virtual Thread Benchmarks: VirtualThreadFlood.java - Memory and performance comparisons
Performance Tests: Run WrkBenchmarkRunner to benchmark /compute, /block, and /file endpoints
Profiling Guide: Use JFR with -XX:StartFlightRecording for detailed analysis
Official Documentation: JEP 444: Virtual Threads


