Written by
Jagdish Salgotra
Distributed systems, cloud-native architecture, and the JVM. mostly shipping, occasionally reading.
Benchmark numbers for virtual threads versus reactive code disagree wildly because the workload dictates the answer, not the model: I/O-bound code usually benefits, CPU-bound rarely does, and non-blocking moves work without removing it.
Note: This series uses Java 21 as the baseline. Virtual threads are stable in Java 21 (JEP 444). Structured concurrency snippets in this part (StructuredTaskScope, JEP 453) use preview APIs and require --enable-preview.
Performance claims become useful when the benchmark shape is visible. A throughput number without the service delay, connection count, pool size, and error count is just a number. Virtual threads make waiting cheaper, but they do not make CPU work disappear, fix downstream limits, or turn a toy benchmark into a capacity plan.
This article is learning material. The main branch now builds with OpenJDK 25.0.2 and uses the Java 25 preview structured-concurrency API; the Java 21 version is maintained separately on the feature/java-21 branch. The measurements below were generated from the current checked-in Java 25 code. Virtual threads are final in Java 21; the build still passes the preview flag because other examples in the same build use preview structured-concurrency APIs.
The useful question for this part is not "are virtual threads faster?" The useful question is narrower: what kind of work was measured, what limit should the code predict, and did the run behave close to that prediction?
The measurements below were generated with OpenJDK 25.0.2 and Maven 3.9.12:
mvn clean compile -DskipTests
mvn dependency:build-classpath -Dmdep.outputFile=cp.txt
The build succeeded and compiled 35 source files.
Then I ran:
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.reactive.VirtualThreadReactiveIntegration
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.PlatformThreadPoolServer
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.VirtualThreadPoolServer
java --enable-preview -cp "$(cat cp.txt):target/classes" app.js.microservices.VirtualThreadMicroservice
./scripts/run-thread-optimized.sh
For the pool-server comparison, both servers bind to port 8080, so I ran them one at a time. For the richer microservice checks, I ran focused wrk commands against ports 8080 and 8086.
One script issue surfaced during the pass. run-thread-optimized.sh failed when it tried to build its own classpath, so I updated it to use the same cp.txt classpath file as the rest of the guide. I made the same change to run-memory-optimized.sh, because it used the same pattern.
The smallest comparison in the repository is the cleanest one. PlatformThreadPoolServer.java and VirtualThreadPoolServer.java both expose GET /api and sleep for 200ms per request. The platform-thread version uses a fixed pool of 20 threads. The virtual-thread version uses a virtual-thread-per-task executor.
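The difference between the two executor shapes can be sketched without an HTTP server. The class below is a hypothetical stand-in, not the repository's PlatformThreadPoolServer or VirtualThreadPoolServer: it submits sleeping tasks to a fixed pool and to a virtual-thread-per-task executor and times each run. The class name, method name, and task counts are assumptions chosen to mirror the 20-worker, 200ms setup described above.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolShapeSketch {

    // Submits `tasks` sleeping jobs, waits for all of them, and returns elapsed milliseconds.
    static long runSleepingTasks(ExecutorService executor, int tasks, Duration wait) {
        long start = System.nanoTime();
        try (executor) { // close() blocks until every submitted task has finished
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    try {
                        Thread.sleep(wait); // stands in for the 200ms simulated request work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int tasks = 200;
        Duration wait = Duration.ofMillis(200);

        // 20 workers must take the 200 tasks in batches; later tasks queue behind earlier ones.
        long fixedMs = runSleepingTasks(Executors.newFixedThreadPool(20), tasks, wait);

        // One virtual thread per task, so all 200 waits overlap.
        long virtualMs = runSleepingTasks(Executors.newVirtualThreadPerTaskExecutor(), tasks, wait);

        System.out.println("fixed pool of 20: " + fixedMs + " ms");
        System.out.println("virtual threads:  " + virtualMs + " ms");
    }
}
```

With these settings the fixed pool needs roughly ten batches of 200ms, while the virtual-thread run should finish close to a single 200ms wait. Exact timings depend on the machine.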
The benchmark used 200 concurrent connections for 10 seconds:
wrk -t4 -c200 -d10s http://localhost:8080/api
The platform-thread pool result:
Latency 1.05s 540.55ms 1.91s 60.00%
980 requests in 10.09s
Socket errors: connect 0, read 0, write 0, timeout 780
Requests/sec: 97.10
The virtual-thread result:
Latency 204.67ms 1.78ms 209.50ms 69.19%
9654 requests in 10.08s
Requests/sec: 957.42
The math explains the result. A 20-thread pool with each worker sleeping for 200ms can complete about 20 / 0.2, or 100 requests per second. The measured platform-thread run completed 97.10 requests per second and reported 780 socket timeouts because 200 clients were competing for 20 workers.
The virtual-thread server had the same 200ms simulated wait, but each request got its own virtual thread. With 200 clients and a 200ms sleep, the rough ceiling is 200 / 0.2, or 1,000 requests per second. The measured run completed 957.42 requests per second with average latency close to the sleep time.
That is the core virtual-thread performance story. The request did not become faster than 200ms. The service could keep more waiting requests in flight without needing 200 platform threads.
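The ceiling arithmetic used above is small enough to write down as code. This is a minimal sketch with hypothetical names; it assumes every in-flight request spends its whole time in a fixed wait, which is true of these sleep-based endpoints but not of real services.

```java
import java.time.Duration;

public class ThroughputCeiling {

    // Rough ceiling: each in-flight slot completes one request per wait interval.
    static double requestsPerSecond(int inFlight, Duration wait) {
        return inFlight * 1_000_000_000.0 / wait.toNanos();
    }

    public static void main(String[] args) {
        Duration sleep = Duration.ofMillis(200);
        // 20 pool workers cap the platform-thread server regardless of client count.
        System.out.printf("20 workers, 200ms wait:  %.0f req/s%n", requestsPerSecond(20, sleep));
        // With one virtual thread per request, the 200 client connections are the cap.
        System.out.printf("200 clients, 200ms wait: %.0f req/s%n", requestsPerSecond(200, sleep));
    }
}
```

The in-flight count is whichever limit is smaller: the worker pool for the platform-thread server, the client connection count for the virtual-thread server.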
The port 8080 VirtualThreadMicroservice has three basic endpoints that are useful for performance discussion:
| Endpoint | Code shape |
|---|---|
| /compute | sum primes up to 50,000 |
| /block | sleep for 300ms |
| /file | read a 10,000-line local file |
Single requests returned:
| Endpoint | Direct result |
|---|---|
| /compute | status 200 in 0.028252s, handler duration 2ms |
| /block | status 200 in 0.304610s, handler duration 302ms |
| /file | status 200 in 0.014830s, handler duration 13ms |
Then I ran each endpoint with 200 connections for 10 seconds:
| Endpoint | Load | Average latency | Requests/sec | Total requests |
|---|---|---|---|---|
| /compute | wrk -t4 -c200 -d10s | 5.12ms | 39,631.59 | 397,797 |
| /block | wrk -t4 -c200 -d10s | 304.78ms | 634.56 | 6,400 |
| /file | wrk -t4 -c200 -d10s | 29.65ms | 8,304.23 | 83,144 |
The /block endpoint is the most trustworthy teaching number in this set. It has a fixed 300ms delay, and with 200 clients the rough ceiling is 200 / 0.3, or about 667 requests per second. The measured result was 634.56 requests per second with average latency of 304.78ms.
The /compute endpoint is different. It is short CPU work on this machine, so the measured throughput is high, but virtual threads are not the reason the prime loop is fast. The loop is small. If the CPU work becomes large enough to saturate the machine, virtual threads cannot add CPU cores.
The /file endpoint is also local to this machine. It reads a generated local file, not a network filesystem or object store. Treat that number as a local file-path measurement, not a general storage benchmark.
The final /metrics snapshot from the same 8080 service reported:
Active Requests: 0
Total Requests: 487936
Average Response Time: 4.65ms
CPU Usage: 0.00%
Memory Usage: 381.45MB / 1384.00MB
JVM Uptime: 78 seconds
Thread Type: Virtual Threads
That aggregate average is not a headline latency number. It blends hundreds of thousands of very fast /compute requests with slower /block requests. For article evidence, the per-endpoint wrk rows are more useful.
The port 8086 service exists to make the CPU/I/O split visible. The server itself accepts requests on virtual threads, but /compute-optimized sends CPU work to a bounded platform-thread executor:
private static final ExecutorService cpuIntensiveExecutor =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
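A minimal sketch of how a request handler might use such an executor follows. Only the executor declaration comes from the snippet above; the class name, sumPrimes, computeOnBoundedPool, and shutdown are hypothetical names for illustration, and the repository's actual handler may be structured differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CpuOffloadSketch {

    // Bounded platform-thread pool sized to the core count, as in the snippet above.
    private static final ExecutorService cpuIntensiveExecutor =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Hypothetical CPU task: sum of all primes <= limit, by trial division.
    static long sumPrimes(int limit) {
        long sum = 0;
        for (int n = 2; n <= limit; n++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= n; d++) {
                if (n % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                sum += n;
            }
        }
        return sum;
    }

    // The calling (virtual) thread parks in get() while a pool thread does the CPU work.
    static long computeOnBoundedPool(int limit) {
        Callable<Long> task = () -> sumPrimes(limit);
        try {
            return cpuIntensiveExecutor.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for CPU work", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("CPU task failed", e.getCause());
        }
    }

    static void shutdown() {
        cpuIntensiveExecutor.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        // A virtual thread plays the role of the request handler.
        Thread handler = Thread.ofVirtual().start(
                () -> System.out.println("sum of primes up to 50,000: " + computeOnBoundedPool(50_000)));
        handler.join();
        shutdown();
    }
}
```

The point of the shape is that the number of concurrent CPU tasks is capped by the pool size, however many virtual threads are waiting on results.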
The direct checks returned:
| Endpoint | Direct result |
|---|---|
| /compute-optimized | status 200 in 0.024890s, handler duration 4ms |
| /io-optimized | status 200 in 0.312777s, handler duration 312ms |
| /mixed-workload | status 200 in 0.207103s, handler duration 205ms |
The focused load tests returned:
| Endpoint | Load | Average latency | Requests/sec | Total requests |
|---|---|---|---|---|
| /compute-optimized | wrk -t4 -c40 -d10s | 2.75ms | 14,780.50 | 147,958 |
| /io-optimized | wrk -t4 -c200 -d10s | 312.71ms | 634.14 | 6,402 |
| /mixed-workload | wrk -t4 -c80 -d10s | 205.43ms | 388.44 | 3,920 |
The I/O endpoint again tracks the fixed wait. It sleeps for 300ms and reads a local file. With 200 clients, the rough ceiling is again about 667 requests per second; the measured result was 634.14.
The mixed endpoint runs a small CPU branch and a 200ms I/O branch. With 80 clients and a 200ms wait, the rough ceiling is 80 / 0.2, or 400 requests per second. The measured result was 388.44. That is the result you want from a performance article: the benchmark follows the code.
The final thread stats were:
Active Virtual Threads: 0
Active Platform Threads: 0
Total Requests: 158579
Available Processors: 14
Average Response Time: 20ms
Again, the aggregate average is mixed across endpoint types. The useful part is the available-processor count and the fact that the service routes CPU work through a bounded executor rather than treating virtual threads as a CPU-scaling tool.
VirtualThreadReactiveIntegration.java is useful, but it is not a controlled benchmark of Reactor, WebFlux, Vert.x, or another reactive runtime.
The performance section in that class compares three ways to process 1,000 items with a 10ms sleep per item:
Traditional blocking: 11883ms
Virtual threads: 23ms (516.65x faster)
Structured concurrency: 18ms (660.17x faster)
Results: Traditional=1000, Virtual=1000, Structured=1000
This is a wait-parallelization demo. The serial loop takes close to 1000 * 10ms, plus overhead. The virtual-thread and structured versions start the waiting work concurrently, so they complete much closer to the single-item wait time.
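The shape of that comparison can be reproduced in a few lines. This sketch is not the code in VirtualThreadReactiveIntegration.java; the class and method names are hypothetical, and it leaves out the structured-concurrency variant because that API is still in preview and differs between Java 21 and Java 25.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WaitParallelizationSketch {

    // Stands in for the per-item wait (10ms in the demo above).
    static void simulatedWait(Duration wait) {
        try {
            Thread.sleep(wait);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Serial baseline: the waits add up, so elapsed time is about items * wait.
    static long serialMillis(int items, Duration wait) {
        long start = System.nanoTime();
        for (int i = 0; i < items; i++) {
            simulatedWait(wait);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // One virtual thread per item: the waits overlap, so elapsed time is near one wait.
    static long virtualThreadMillis(int items, Duration wait) {
        long start = System.nanoTime();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < items; i++) {
                executor.execute(() -> simulatedWait(wait));
            }
        } // close() waits for all items to finish
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Duration wait = Duration.ofMillis(10);
        System.out.println("serial:          " + serialMillis(1_000, wait) + " ms");
        System.out.println("virtual threads: " + virtualThreadMillis(1_000, wait) + " ms");
    }
}
```

Both versions perform the same total amount of waiting; only the overlap changes, which is why the speedup ratio says little about any framework.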
That does not mean structured concurrency is 660x faster than a reactive framework. It means this specific serial waiting loop was the wrong baseline for concurrent waiting work. The evidence in this article is intentionally limited to fixed service delays, checked-in endpoints, local wrk output, and metrics from the same service process.
Start with the code path, not the conclusion. If a route sleeps for 300ms and the benchmark uses 200 clients, predict roughly 667 requests per second before you run it. If the result is close, the benchmark is probably measuring the thing you think it is measuring. If the result is far away, inspect the queue, pool, timeout, downstream limit, or benchmark command before writing the lesson.
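The predict-then-compare step can be made explicit. The sketch below uses measured and predicted figures from this article; the class name, method names, and the 10% tolerance are assumptions for illustration, not a recommended threshold.

```java
public class PredictionCheck {

    // Fraction of the predicted ceiling that the run actually achieved.
    static double achievedFraction(double measuredRps, double predictedRps) {
        return measuredRps / predictedRps;
    }

    // True when the measured rate is within `tolerance` (for example 0.10) of the prediction.
    static boolean closeToPrediction(double measuredRps, double predictedRps, double tolerance) {
        return Math.abs(1.0 - achievedFraction(measuredRps, predictedRps)) <= tolerance;
    }

    public static void main(String[] args) {
        double tolerance = 0.10; // hypothetical threshold

        // Measured requests/sec and predicted ceiling for runs reported in this article.
        double[][] runs = {
            {97.10, 20 / 0.2},    // platform pool, 20 workers, 200ms sleep
            {957.42, 200 / 0.2},  // virtual threads, 200 clients, 200ms sleep
            {634.56, 200 / 0.3},  // /block, 200 clients, 300ms sleep
            {388.44, 80 / 0.2},   // /mixed-workload, 80 clients, 200ms wait
        };
        for (double[] run : runs) {
            System.out.printf("measured %.2f vs predicted %.0f: close=%b%n",
                    run[0], run[1], closeToPrediction(run[0], run[1], tolerance));
        }
    }
}
```

A run that fails this check is not necessarily wrong, but it is a prompt to inspect queues, pools, and timeouts before drawing a conclusion.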
Keep CPU and waiting work separate. Virtual threads help most when tasks are parked waiting for I/O, timers, locks, or network calls. CPU-heavy work still needs a bounded policy tied to available processors. The 8086 service demonstrates that shape by keeping request handling simple while sending CPU work through a fixed platform-thread executor.
Treat aggregate service metrics as supporting evidence, not the main result. A single average response time across /compute, /block, and /file hides the route-specific behavior. Publish per-endpoint latency, throughput, total requests, error counts, and the exact load settings.
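One way to keep route-level numbers available is to record durations per route rather than in a single counter. The class below is a hypothetical sketch, not the repository's /metrics implementation, and the sample durations in main are illustrative values rather than measurements.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class RouteStats {

    // Request count and total handler time for one route.
    private static final class Totals {
        final LongAdder count = new LongAdder();
        final LongAdder micros = new LongAdder();
    }

    private final Map<String, Totals> byRoute = new ConcurrentHashMap<>();

    // Safe to call from many request threads at once.
    public void record(String route, long durationMicros) {
        Totals totals = byRoute.computeIfAbsent(route, r -> new Totals());
        totals.count.increment();
        totals.micros.add(durationMicros);
    }

    // Average handler time for a single route, in milliseconds.
    public double averageMillis(String route) {
        Totals totals = byRoute.get(route);
        if (totals == null || totals.count.sum() == 0) {
            return 0.0;
        }
        return totals.micros.sum() / 1000.0 / totals.count.sum();
    }

    // The blended figure that a single service-wide average would report.
    public double blendedAverageMillis() {
        long count = 0;
        long micros = 0;
        for (Totals totals : byRoute.values()) {
            count += totals.count.sum();
            micros += totals.micros.sum();
        }
        return count == 0 ? 0.0 : micros / 1000.0 / count;
    }

    public static void main(String[] args) {
        RouteStats stats = new RouteStats();
        for (int i = 0; i < 9; i++) {
            stats.record("/compute", 2_000); // illustrative 2ms samples
        }
        stats.record("/block", 302_000);     // illustrative 302ms sample
        System.out.println("/compute avg ms: " + stats.averageMillis("/compute"));
        System.out.println("/block avg ms:   " + stats.averageMillis("/block"));
        System.out.println("blended avg ms:  " + stats.blendedAverageMillis());
    }
}
```

The blended value sits between the two routes and describes neither, which is the reason to publish the per-route rows.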
This article does not prove that virtual threads are universally faster than platform threads. It proves that, in these checked-in examples, a platform pool with 20 workers queues heavily under 200 waiting clients, while virtual threads keep the same 200ms wait close to 200ms.
This article also does not prove that virtual threads beat reactive systems in general. The repository contains a reactive integration demo, not a production-grade reactive service benchmark. If you need backpressure-heavy event pipelines, benchmark that actual design.
Finally, this article does not estimate cloud cost. Cost depends on request mix, downstream limits, JVM settings, instance type, redundancy, autoscaling policy, and failure targets. A cost table without those inputs would be decoration.
Part 7 moves from benchmark shape to operational visibility: monitoring, debugging, JFR, pinning detection, and the checks that help explain virtual-thread behavior when the local benchmark is no longer enough.
WrkBenchmarkRunner to benchmark /compute, /block, and /file endpoints
-XX:+FlightRecorder for detailed analysis