This document covers how to achieve maximum performance in benchmarks, what the current competitive numbers look like, and how each optimization layer contributes.
# 1. Release build with full native optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release
# 2. Linux: enable io_uring backend (single biggest win after hardware tuning)
RUSTFLAGS="-C target-cpu=native" cargo build --release --features io-uring
# 3. Run with worker count pinned to physical cores (not hyperthreads)
CHOPIN_WORKERS=$(nproc --all) ./target/release/my_app
# 4. Tune the OS (see OS Tuning section below)
| Framework | Endpoint | Relative Throughput | Avg Latency |
|---|---|---|---|
| Chopin | /json |
100% | 686 µs |
| Chopin | /plaintext |
100% | 700 µs |
| Actix Web | /json |
91% | 812 µs |
| Axum | /json |
84% | 945 µs |
| Hyper | /json |
73% | 1,810 µs |
| Framework | Avg Latency C16 | Throughput | vs Chopin |
|---|---|---|---|
| Chopin (epoll) | 2.48 ms | 22.9M req/s | baseline |
| Axum (Tokio) | 1.39 ms | 48.6M req/s | 2.1× faster |
| Hyper | 6.77 ms | — | 2.7× slower |
Note: The C16 Linux gap vs Axum is a known scheduling overhead issue addressed by the io_uring backend (see below).
Enable with the io-uring feature flag. At runtime, Chopin automatically switches every worker to a completion-based event loop instead of the epoll readiness loop:
# Cargo.toml
[dependencies]
chopin-core = { version = "0.5.21", features = ["io-uring"] }
Or at build time:
cargo build --release --features chopin-core/io-uring
What changes:
accept + read + write + writev + close are all submitted as SQEs (no per-op syscalls)io_uring_enter callExpected gains (vs epoll baseline):
| Concurrency | epoll latency | io_uring latency | Improvement |
|---|---|---|---|
| C16 | 2.48 ms | ~1.3–1.7 ms | ~35–45% |
| C32 | 7.01 ms | ~3.5–4.5 ms | ~35–50% |
| C64 | 24.94 ms | ~12–16 ms | ~35–50% |
For maximum throughput on dedicated benchmark hosts, enable SQPOLL in worker_uring.rs by changing the setup flags:
// In worker_uring.rs: change setup_flags to include SQPOLL
let setup_flags = IORING_SETUP_SQPOLL
| IORING_SETUP_SINGLE_ISSUER
| IORING_SETUP_COOP_TASKRUN;
With SQPOLL, the kernel spawns a polling thread that continuously watches the SQ ring. Submissions never need io_uring_enter — zero syscalls in steady state.
Requires:
CAP_SYS_NICEor running as root for the kernel poll thread. Addulimit -l unlimitedfor locked memory.
target-cpu=native — SIMD-accelerated date formattingformat_http_date() has an AVX2 fast path for formatting the HTTP Date header. This is only enabled when compiled with native tuning:
RUSTFLAGS="-C target-cpu=native" cargo build --release
Impact: ~2–5 µs savings per response (the date formatting syscall + AVX2 path together).
Chopin defaults to num_cpus::get() workers (logical cores including hyperthreads). For I/O-bound workloads, physical core count often performs better:
// In main.rs — pin to physical core count
Chopin::new()
.mount_all_routes()
.serve("0.0.0.0:3000")
.unwrap();
Or set explicitly:
use chopin_core::Server;
Server::bind("0.0.0.0:3000")
.workers(num_cpus::get_physical()) // physical cores only
.serve(router)
.unwrap();
The default slab is 10,000 connections per worker. For TFB-style high-concurrency tests, increase it:
// worker.rs: change the slab size
let mut slab = ConnectionSlab::new(50_000);
Each Conn is ~25 KB (8 KB read buf + 16 KB write buf + metadata), so 50k slots ≈ 1.25 GB per worker. Size appropriately for your RAM.
Apply these before running any benchmark:
# Increase file descriptor limits
ulimit -n 1048576
# TCP socket tuning
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sudo sysctl -w net.core.netdev_max_backlog=65535
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
sudo sysctl -w net.ipv4.tcp_fin_timeout=10
# io_uring: allow locked memory for SQPOLL
ulimit -l unlimited
# CPU governor: performance mode (avoids frequency scaling)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# NUMA: bind to a single NUMA node (if applicable)
numactl --cpunodebind=0 --membind=0 ./target/release/my_app
# Standard TFB-style: 8 threads, 256 connections, 15 seconds
wrk -t8 -c256 -d15s http://localhost:3000/plaintext
# Higher concurrency
wrk -t16 -c512 -d15s http://localhost:3000/json
# Measure latency at a fixed rate
wrk2 -t8 -c256 -d15s -R 100000 http://localhost:3000/plaintext
hey -n 1000000 -c 256 http://localhost:3000/plaintext
┌──────────────────────────────────────────────────────┐
│ Per-Worker Thread (pinned to CPU core) │
│ │
│ SO_REUSEPORT listener ─► io_uring ring │
│ │ │
│ multi-shot accept ◄───────┘ │
│ │ │
│ ConnectionSlab (O(1) alloc, no locks) │
│ │ │
│ Zero-alloc parser ──► Radix router │
│ │ │
│ Handler ──► Raw byte serializer │
│ │ │
│ writev (headers + body, 1 syscall) │
│ │ │
│ CQE completion ──► next SQE submission │
└──────────────────────────────────────────────────────┘
Key principles:
Body::Static / Body::Bytes use writev zero-copy)Body::File uses kernel sendfile)| Feature | Linux | macOS |
|---|---|---|
| epoll (default) | ✅ | — |
| kqueue (default) | — | ✅ |
io_uring (io-uring feature) |
✅ kernel ≥5.11 | — |
| TCP_DEFER_ACCEPT | ✅ | — |
| TCP_FASTOPEN | ✅ | ✅ |
| sendfile | ✅ | ✅ |
| SO_REUSEPORT | ✅ | ✅ |
| Scenario | Framework | Relative Throughput | Avg Latency | Max Latency |
|---|---|---|---|---|
| Plain Text | Chopin | 100% | 1.25 ms | 37.83 ms |
| Hyper | 88% | 1.45 ms | 68.27 ms | |
| JSON | Chopin | 100% | 0.98 ms | 25.05 ms |
| Hyper | 74% | 1.45 ms | 67.39 ms |
wrk -t10 -c200 -d10s