Linux I/O: From Syscall to Disk Platter
Every time your application calls write(), an intricate machinery of kernel subsystems springs into action. Understanding this machinery is the difference between an application that handles 10K IOPS and one that handles 500K. Let's trace the journey of a single I/O request from userspace to hardware.
The I/O Stack at a Glance
The Linux I/O stack is a layered architecture. Each layer adds functionality — and latency:
| Layer | Component | Typical Latency |
|---|---|---|
| Userspace | Application write() | ~0 (just a syscall) |
| VFS | Virtual File System | 50–200 ns |
| Page Cache | Buffered I/O layer | 100–500 ns |
| File System | ext4, XFS, btrfs | 1–10 μs |
| Block Layer | I/O scheduler, merging | 1–5 μs |
| Device Driver | NVMe, SCSI, virtio | 5–50 μs |
| Hardware | SSD / HDD | 10 μs – 10 ms |
Key insight: For a typical NVMe SSD, the kernel overhead often exceeds the actual hardware latency. This is why technologies like io_uring were created — to minimize kernel overhead.
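To get a feel for that overhead on your own machine, a rough micro-benchmark helps. The sketch below is my own illustration, not part of any benchmark suite: it times a million tiny write() calls to /dev/null, so it captures only syscall and VFS cost (no page cache, no file system, no device). Absolute numbers vary widely with CPU, kernel version, and mitigation settings.
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
int main(void) {
    const int iterations = 1000000;
    char byte = 'x';
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0) return 1;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++)
        write(fd, &byte, 1);    // pure syscall + VFS path, no device I/O
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("avg per write(): %.0f ns\n", ns / iterations);
    close(fd);
    return 0;
}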
Buffered vs Direct I/O
Buffered I/O (the default)
When you call write() without O_DIRECT, the data goes into the page cache — a region of memory managed by the kernel. The actual disk write happens later, asynchronously.
#include <fcntl.h>
#include <unistd.h>
int main(void) {
int fd = open("/tmp/test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd < 0) return 1;
const char *data = "Hello, page cache!\n";
write(fd, data, 19); // Goes to page cache, NOT disk
// Data is in memory. Kernel will flush later.
// Use fsync() to force it to disk:
fsync(fd);
close(fd);
return 0;
}
The page cache is incredibly effective for most workloads:
- Read-heavy workloads get automatic caching
- Write-heavy workloads benefit from write coalescing
- Sequential access triggers kernel read-ahead
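You can also hint the kernel about your access pattern instead of relying purely on its heuristics. Here is a minimal sketch using posix_fadvise(), with a placeholder file path: it requests aggressive read-ahead before a sequential scan and tells the kernel the cached pages can be dropped afterwards.
#include <fcntl.h>
#include <unistd.h>
int main(void) {
    int fd = open("/tmp/big_file.dat", O_RDONLY);   /* placeholder path */
    if (fd < 0) return 1;
    /* Hint: we will read this file sequentially, so the kernel may
       enlarge the read-ahead window for it. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    char buf[4096];
    while (read(fd, buf, sizeof(buf)) > 0) {
        /* process the data ... */
    }
    /* Hint: we are done with this data; the kernel may drop the cached
       pages rather than evicting something more useful later. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}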
Direct I/O
Databases often bypass the page cache using O_DIRECT. MySQL's InnoDB engine is the classic example; PostgreSQL, by contrast, still relies on the OS page cache for its data files. Why bypass it at all? Because a database has its own buffer pool that understands the workload's access patterns better than the kernel's general-purpose heuristics can:
int fd = open("/data/db.ibd", O_RDWR | O_DIRECT);
When to use Direct I/O:
- You have your own caching layer (databases)
- You need predictable latency (no page cache eviction surprises)
- You're doing large sequential I/O (streaming, backups)
When NOT to use Direct I/O:
- General application code (page cache is almost always better)
- Small random reads (page cache prefetching helps enormously)
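One practical wrinkle with O_DIRECT: the buffer address, file offset, and transfer size must all be aligned, typically to the device's logical block size (4096 bytes is a safe assumption on most modern hardware). A minimal sketch follows, with a placeholder path; note that some file systems (tmpfs in particular) do not support O_DIRECT at all.
#define _GNU_SOURCE          /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void) {
    const size_t alignment = 4096;   /* assume 4 KiB logical blocks */
    const size_t size = 4096;
    /* O_DIRECT needs an aligned buffer; plain malloc() gives no such guarantee. */
    void *buf;
    if (posix_memalign(&buf, alignment, size) != 0) return 1;
    memset(buf, 'A', size);
    /* Placeholder path; must live on a file system that supports O_DIRECT. */
    int fd = open("/data/direct_test.dat",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { free(buf); return 1; }
    /* Bypasses the page cache; a misaligned buffer, offset, or size
       usually fails with EINVAL. */
    if (write(fd, buf, size) < 0) { /* handle error */ }
    close(fd);
    free(buf);
    return 0;
}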
The io_uring Revolution
Traditional I/O in Linux uses read() / write() syscalls — each one requires a round trip between userspace and the kernel. For high-IOPS workloads, the per-call overhead of those transitions dominates the cost.
io_uring (introduced in Linux 5.1) solves this with two shared ring buffers:
- Submission Queue (SQ): Userspace writes I/O requests here
- Completion Queue (CQ): Kernel writes completed results here
#include <liburing.h>
#include <fcntl.h>
#include <string.h>
int main(void) {
struct io_uring ring;
if (io_uring_queue_init(32, &ring, 0) < 0) return 1;
int fd = open("/tmp/io_uring_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd < 0) return 1;
// Prepare a write request
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
char buf[] = "Hello from io_uring!\n";
io_uring_prep_write(sqe, fd, buf, strlen(buf), 0);
io_uring_sqe_set_data(sqe, (void *)42); // user data tag
// Submit — may not even need a syscall in SQPOLL mode!
io_uring_submit(&ring);
// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
if (cqe->res < 0) {
// handle error
}
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring);
close(fd);
return 0;
}
Performance comparison
Here's a real benchmark from a 4-core VM with an NVMe SSD, doing 4KB random reads:
| Method | IOPS | Avg Latency | CPU Usage |
|---|---|---|---|
| pread() sync | 120K | 33 μs | 95% |
| aio_read() (libaio) | 310K | 12 μs | 80% |
| io_uring | 480K | 8 μs | 60% |
| io_uring + SQPOLL | 520K | 7 μs | 45% * |
* SQPOLL uses a dedicated kernel thread, so CPU usage shifts from userspace to kernel.
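For completeness, here is roughly what the SQPOLL setup looks like with liburing: pass IORING_SETUP_SQPOLL through io_uring_queue_init_params() and choose an idle timeout for the kernel polling thread. This is a sketch under a few assumptions: the file path is a placeholder, and older kernels required elevated privileges (and, for a while, registered files) to use SQPOLL, so check what your kernel expects.
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void) {
    struct io_uring ring;
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    /* Ask for a kernel thread that polls the submission queue; it goes
       idle after 2000 ms without new submissions. */
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;
    int ret = io_uring_queue_init_params(32, &ring, &params);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }
    int fd = open("/tmp/sqpoll_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* placeholder */
    if (fd < 0) return 1;
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    char buf[] = "written via SQPOLL\n";
    io_uring_prep_write(sqe, fd, buf, strlen(buf), 0);
    /* With SQPOLL, liburing only issues io_uring_enter() if the polling
       thread has gone to sleep and needs a wakeup. */
    io_uring_submit(&ring);
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}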
Tracing I/O with BPF
When debugging I/O performance, you need visibility into what's happening at each layer. BPF tools give you that:
biolatency — Block I/O latency histogram
# Install bcc tools
sudo apt install bpfcc-tools
# Run biolatency
sudo biolatency-bpfcc -D 5
Custom BPF program to trace block I/O latency
#!/usr/bin/env python3
from bcc import BPF
prog = r"""
#include <linux/blk-mq.h>
BPF_HASH(start, struct request *);
BPF_HISTOGRAM(dist);
int trace_start(struct pt_regs *ctx, struct request *req) {
u64 ts = bpf_ktime_get_ns();
start.update(&req, &ts);
return 0;
}
int trace_done(struct pt_regs *ctx, struct request *req) {
u64 *tsp = start.lookup(&req);
if (tsp == 0) return 0;
u64 delta = bpf_ktime_get_ns() - *tsp;
dist.increment(bpf_log2l(delta / 1000));
start.delete(&req);
return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_complete_request", fn_name="trace_done")
print("Tracing block I/O... Ctrl+C to stop.")
try:
b.trace_print()
except KeyboardInterrupt:
b["dist"].print_log2_hist("usecs")
I/O Scheduler Tuning
Understanding the schedulers
Modern Linux ships several multi-queue block I/O schedulers; these three are the ones you will choose between most often:
- none — No reordering. Best for NVMe SSDs with internal parallelism.
- mq-deadline — Ensures requests are served within a deadline. Good for mixed read/write workloads.
- bfq (Budget Fair Queueing) — Provides fairness between processes. Good for desktops.
# Check current scheduler
cat /sys/block/nvme0n1/queue/scheduler
# Output: [none] mq-deadline kyber bfq
# Change scheduler
echo "mq-deadline" | sudo tee /sys/block/nvme0n1/queue/scheduler
Queue depth tuning
# Check current queue depth
cat /sys/block/nvme0n1/queue/nr_requests
# Default: 1024
# For latency-sensitive workloads, reduce it
echo 64 | sudo tee /sys/block/nvme0n1/queue/nr_requests
Rule of thumb: Higher queue depth = higher throughput but higher tail latency. For databases, a queue depth of 32–64 often gives the best latency/throughput trade-off.
File System Considerations
ext4 tuning for write-heavy workloads
# Mount options for maximum write throughput
mount -o noatime,nodiratime,data=writeback,barrier=0 /dev/nvme0n1p1 /data
# WARNING: barrier=0 disables write barriers — data may be lost on power failure!
# Only use this for ephemeral/reproducible data (caches, temp files).
XFS for large files
XFS excels at handling large files and high-concurrency workloads:
# Create XFS with optimal settings for NVMe
mkfs.xfs -f -d su=4k,sw=1 -l size=256m /dev/nvme0n1p1
# Mount with optimal options
mount -o noatime,logbufs=8,logbsize=256k /dev/nvme0n1p1 /data
Comparing file system performance
A quick fio benchmark on the same NVMe SSD:
fio --name=randwrite --ioengine=io_uring --rw=randwrite \
--bs=4k --numjobs=4 --size=4G --runtime=60 \
--group_reporting --filename=/data/fio_test
Results (4KB random write, 4 jobs, 60 seconds):
| File System | IOPS | Bandwidth | Avg Latency |
|---|---|---|---|
| ext4 (defaults) | 285K | 1.11 GB/s | 55 μs |
| ext4 (tuned) | 340K | 1.33 GB/s | 46 μs |
| XFS (defaults) | 310K | 1.21 GB/s | 51 μs |
| XFS (tuned) | 355K | 1.39 GB/s | 44 μs |
Practical Checklist
When investigating I/O performance, follow this checklist:
- Identify the bottleneck layer — Use biolatency and iostat to determine if the problem is in the kernel or the hardware
- Check the I/O scheduler — NVMe drives should almost always use none
- Verify queue depth — Too high causes latency spikes; too low wastes throughput
- Examine page cache behavior — Use vmstat and sar -B to check page cache hit rates
- Consider io_uring — If your workload is IOPS-bound and you're on Linux 5.1+
- Profile with BPF — Write custom probes to trace exactly where time is spent
- Test file system options — noatime, write barriers, journal mode all matter
- Measure, don't guess — Always benchmark with your actual workload pattern
Further Reading
- Linux Kernel Documentation: Block Layer
- io_uring: Efficient I/O with io_uring (PDF) — Jens Axboe's original paper
- Brendan Gregg: BPF Performance Tools
- fio documentation — The standard I/O benchmarking tool
- LWN.net: An introduction to io_uring
The Linux I/O stack is deep, but you don't need to understand every layer to be effective. Start by measuring where your time is spent, then work your way down from the top. More often than not, the fix is simpler than you think — a mount option, a scheduler change, or switching to io_uring.
Your disk is faster than you think. It's the path to the disk that's slow.