Most performance engineers have been there: perf top shows a flat profile, the CPU looks busy, latency is still bad, and you're staring at a flame graph that tells you nothing useful. The application isn't spinning — it's waiting. And your profiler has no idea.
This is the core problem addressed in a paper from OSDI '24, and it's one of those rare cases where reading a systems paper genuinely changed how I think about instrumentation. The work introduces Blocked Samples, a lightweight kernel-level sampling primitive that bridges the gap between On-CPU and Off-CPU analysis — and then builds two practical tools on top of it.
The Fundamental Gap in Linux Profiling
Linux perf is brilliant for CPU-bound work. Its task-clock software event fires on a fixed interval, captures the instruction pointer and call chain, and you get a statistically sound picture of where your cycles are going. But the moment a thread blocks — on a read(), a mutex, a futex — the timer stops. The thread disappears from perf's view entirely until it's rescheduled.
This creates an invisible population of samples. A thread that spends 80% of its wall-clock time waiting for a slow fsync() looks, to perf, like it barely exists.
The workarounds are unsatisfying. You can use perf sched or eBPF-based tools to trace scheduler events, but these give you raw tracing data, not aggregated profiles. You can instrument your code with clock_gettime() around suspect calls. You can build custom eBPF programs that hook sched_switch. All of this requires knowing roughly where to look — which defeats the purpose of profiling.
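The clock_gettime() workaround looks roughly like this. A minimal Python sketch of the pattern (the same idea applies in C); `timed_blocking` and the wrapped call are illustrative, not from any real tool:

```python
import time

def timed_blocking(fn, *args):
    """Wrap a suspected blocking call and measure its wall-clock cost.

    CLOCK_MONOTONIC keeps the measurement immune to wall-clock jumps.
    """
    start = time.clock_gettime(time.CLOCK_MONOTONIC)
    result = fn(*args)
    elapsed = time.clock_gettime(time.CLOCK_MONOTONIC) - start
    print(f"{fn.__name__}: {elapsed * 1e3:.2f} ms wall time (on-CPU + blocked)")
    return result

# You have to wrap the call you already suspect -- which is exactly
# the problem: you must know where to look before you can measure.
timed_blocking(time.sleep, 0.05)
```

Note that this measures total wall time, not blocked time specifically, and it only covers the call sites you thought to wrap.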
The paper's authors call this the three-way failure of existing tools: they give you a partial view (On-CPU only), they lack precise code-level context for Off-CPU events, and — critically — they can't answer the causality question: if I fix this bottleneck, how much faster does my program actually run?
What Blocked Samples Actually Do
The key insight is that the OS already knows exactly when a thread blocks and for how long. Every scheduler context switch records timestamps. The information is there — it just wasn't being fed back into the sampling stream.
Blocked Samples work by hooking three points in the Linux kernel scheduler: schedule-out (when a thread gives up the CPU), wake-up (when something signals the thread is ready), and schedule-in (when the thread resumes). At each hook point, timestamps are recorded. When a thread finally gets scheduled back in, the runtime calculates two intervals:
- T_blocked: time spent actually blocked (waiting for I/O, a lock, etc.)
- T_sched: time spent in the run queue, waiting for a CPU to become available
If this combined off-CPU span crosses one or more of perf's regular sampling ticks, synthetic "blocked samples" are emitted at the cadence of the original sampling frequency, keeping them statistically comparable with On-CPU samples.
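The bookkeeping can be modeled in a few lines. This is a simplified Python sketch of the idea, not the kernel implementation; the timestamps (nanoseconds, as the scheduler would see them) and the sampling period are illustrative:

```python
def blocked_samples(t_sched_out, t_wakeup, t_sched_in, sample_period_ns):
    """Model the bookkeeping across the three scheduler hooks.

    t_sched_out: thread gave up the CPU
    t_wakeup:    thread was marked runnable again
    t_sched_in:  thread actually got a CPU back
    """
    t_blocked = t_wakeup - t_sched_out   # waiting on I/O, a lock, ...
    t_sched = t_sched_in - t_wakeup      # runnable, but no CPU free
    # One logical sample per sampling tick the off-CPU span covers, so
    # blocked samples stay statistically comparable to on-CPU samples.
    n_samples = (t_blocked + t_sched) // sample_period_ns
    return t_blocked, t_sched, n_samples

# Blocks at t=0, wakes at 300 ms, scheduled back in at 350 ms; 1 kHz sampling:
print(blocked_samples(0, 300_000_000, 350_000_000, 1_000_000))
# -> (300000000, 50000000, 350)
```

A block shorter than one sampling period yields zero samples, which is exactly how short on-CPU bursts are treated by a sampling profiler.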
Each blocked sample carries four pieces of information that make it genuinely useful:
Instruction Pointer and call chain — the last user-space IP before the thread blocked, plus the full kernel call chain. This is what existing Off-CPU tools have historically lacked. You don't just know that something blocked; you know which line of code triggered the block, and you can see whether it went through vfs_read, blkdev_submit_bio, io_schedule, or somewhere else entirely.
Subclass — a coarse categorization of why the thread blocked. The paper uses four: I/O waiting, synchronization (locks/futexes), CPU scheduling contention, and everything else (sleeps, timers). This single field turns out to carry a lot of diagnostic weight.
Weight — a deduplication mechanism. A thread blocked on an fsync that takes 300ms would, naively, generate thousands of identical samples. Instead, the system emits a single physical sample with a weight encoding how many logical samples it represents. This keeps overhead low without losing fidelity.
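The deduplication can be sketched like this (illustrative Python, not the paper's code): rather than emitting one record per tick, the span's tick count becomes the weight of a single record, and the profile aggregates by weight:

```python
from collections import Counter

def record_block(profile, call_chain, subclass, span_ns, period_ns=1_000_000):
    """Fold an off-CPU span into the profile as one weighted sample."""
    weight = span_ns // period_ns   # logical samples this span represents
    if weight:
        profile[(call_chain, subclass)] += weight

profile = Counter()
# One 300 ms fsync wait: 300 logical samples, one weighted entry.
record_block(profile, ("fsync", "vfs_fsync"), "I", 300_000_000)
# Three short 2 ms lock waits on the same call chain accumulate.
for _ in range(3):
    record_block(profile, ("Lookup", "futex_wait"), "L", 2_000_000)
print(profile)
```

The report then ranks call chains by summed weight, so a single long block and many short ones compare fairly.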
The overhead numbers are compelling: an average of 1.6% across the paper's benchmarks. That is low enough to be production-safe.
The Two Tools: bperf and BCOZ
bperf
bperf is a direct extension of Linux perf, designed to feel familiar to anyone already using perf stat or perf report. The core addition is that its output now includes blocked samples alongside CPU samples, tagged by subclass:
# Samples tagged [I] = I/O block, [L] = Lock, [S] = Scheduling wait
35.12% rocksdb [I] GetDataBlockFromCache
BlockFetcher::ReadBlockContents
BlockBasedTable::NewDataBlockIterator
21.43% rocksdb [L] LRUCacheShard::Lookup
ShardedCache::Lookup
BlockBasedTable::GetFromBlockCache

The bracketed subclass tags give you immediate signal. If you're seeing [I] on code you expected to be cache-resident, something's wrong with your caching layer. If you're seeing [L] deep in a hot path, you have lock contention hiding behind what looks like normal execution.
For each blocked sample, bperf reports both the last user-space IP and the last kernel-space IP before the block. The kernel IP is often where the real story is — it tells you whether an apparent lock wait is really a futex, whether an I/O wait is going through the page cache, whether a "sleep" is actually spinning in a kernel retry loop.
BCOZ
COZ is a causal profiler originally published at SOSP '15. Its central idea is elegant: instead of measuring how fast code is, simulate how fast your program would be if a particular section of code were faster. It does this with "virtual speedup" — when the target code runs, artificially slow down all other concurrent threads by the same proportion. If the target is truly on the critical path, the overall program speeds up proportionally. If other work absorbs the slack, you see diminishing returns.
BCOZ extends this to Off-CPU events. The critical addition is subclass-level virtual speedup: instead of targeting a specific code location, you can target a class of blocking events. You can ask: "What would happen to my overall throughput if all I/O waits in this program were 50% shorter?" This models hardware upgrade scenarios — NVMe vs. SATA, faster memory — without you having to actually buy new hardware.
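As a back-of-the-envelope illustration of what subclass-level speedup estimates (a toy critical-path model, not BCOZ's actual delay-injection mechanism; the thread timelines are made up): shrink every segment of one blocking subclass and see how overall runtime responds.

```python
def predicted_runtime(threads, subclass, speedup):
    """Toy model: shrink every segment tagged `subclass` by `speedup`.

    threads: list of [(subclass_tag, duration), ...] per thread,
             assumed fully parallel (runtime = slowest thread).
    """
    def thread_time(segs):
        return sum(d * (1 - speedup) if tag == subclass else d
                   for tag, d in segs)
    return max(thread_time(t) for t in threads)

# Thread 0 is I/O-heavy and on the critical path; thread 1 is compute-bound.
threads = [[("I", 60), ("C", 20)],   # 80 time units total
           [("C", 50)]]              # 50 time units total
print(predicted_runtime(threads, "I", 0.0))   # 80: baseline
print(predicted_runtime(threads, "I", 0.5))   # 50.0: halving I/O waits helps
print(predicted_runtime(threads, "L", 0.5))   # 80: no lock waits, no benefit
```

The real mechanism is subtler because it must create the illusion of speedup in a live, concurrently executing program, but the question it answers is the same: how much of a given class of blocking is actually on the critical path?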
The implementation complexity here is non-trivial. Standard COZ injects delays into threads to create the virtual speedup illusion. But Off-CPU events involve synchronization primitives — if thread A is waiting for thread B to release a lock, naively delaying thread B propagates the delay to thread A in ways that corrupt the measurement. The paper handles this by injecting delays before wake-up operations, ensuring the causal chain is accounted for correctly.
Three Cases from RocksDB
The paper validates all of this against RocksDB, which is a good choice: it's complex, widely deployed, and exhibits both lock contention and I/O-heavy behavior depending on workload.
Case 1: The lock hiding behind I/O. On a read-heavy workload with a large block cache, an existing Off-CPU tool (based on wait-for graphs) identified the bottleneck as threads waiting on hardware interrupts — suggesting that faster SSDs would help. bperf told a different story: the dominant [L]-tagged samples pointed to LRUCacheShard::Lookup, deep in the block cache. BCOZ predicted that optimizing GetDataBlockFromCache would yield ~50% speedup. The fix — sharding the cache to reduce lock contention — cut lock wait time by 97% and improved throughput 3.4x. No new hardware required.
Case 2: Knowing which I/O to fix. On a random-read workload with a small cache, the profile was full of [I] samples — obviously I/O-bound. But which I/O? RocksDB issues reads for filter blocks (Bloom filters), index blocks, and data blocks. bperf distinguished these in the call chain, and BCOZ identified filter block reads as the highest-leverage target. Converting filter and index block reads from synchronous to asynchronous I/O dropped blocking I/O events by 74% and improved throughput 1.8x.
Case 3: Scheduling contention. With the NAS Parallel Benchmark integer sort running 32 threads on fewer cores, the CPU was the real bottleneck — but through scheduler contention rather than raw compute. Original COZ saw no meaningful speedup signal from any code location. BCOZ's [S]-subclass virtual speedup correctly identified scheduling contention and predicted linear improvement from adding cores, which matched the actual measurement.
What This Approach Can't (Yet) Do
The paper is refreshingly honest about current limitations. The biggest one: the I/O subsystem is still a black box. When a blocked sample fires with subclass [I], you know a thread was waiting for I/O. You don't know why the I/O was slow — whether the device was busy, whether an SSD's garbage collection kicked in, whether you hit a queue depth limit, whether the kernel's I/O scheduler made a bad decision.
The authors note this as explicit future work: integrating device-internal event information (NVMe command completion, GC pauses, etc.) into the blocked sample stream. This would close the loop between application-level call stacks and device-level behavior — a genuinely hard problem given the diversity of storage hardware.
The subclass taxonomy is also coarse. Four categories (I/O, lock, scheduling, other) cover most cases, but distinguishing between a futex-based mutex and an RCU read lock, or between a page fault stall and a network socket wait, would require finer granularity. The current implementation groups these under "other", which limits its usefulness for network-heavy or memory-pressure scenarios.
What To Do Differently
The immediate takeaway isn't "go use bperf in production today" (though the 1.6% overhead claim is encouraging). It's more about how you frame the problem when something is slow.
When you see high latency without proportional CPU usage, the gap is Off-CPU time — and you need tooling that can see it. perf's task-clock profiling isn't wrong; it's just not looking at the right population of events. Adding even rough Off-CPU visibility — whether through bperf, async-profiler's wall-clock mode, or custom eBPF programs that hook sched_switch — fundamentally changes the diagnostic conversation.
The causality point matters too. Identifying that your program spends 40% of wall time on I/O is useful. Knowing that eliminating this specific call path's I/O would speed up the whole program by 30% — while eliminating that other call path's I/O would buy you almost nothing because it's off the critical path — is what actually guides optimization priorities. BCOZ's virtual speedup mechanism, extended to hardware upgrade scenarios, is a practical way to answer "is this worth doing" before you write a line of code or sign a PO.
The deeper architectural insight is that the scheduler is an untapped telemetry source. Every context switch is a structured event with timestamps, thread identities, and wake-up reasons. Most of that data evaporates. Blocked Samples is, at its core, a proposal to treat scheduler events as first-class profiling data — on equal footing with PMU counters and tracepoints. That framing feels right, and I expect we'll see more tools built around it.
This post is based on "Blocked Samples: Unified On-/Off-CPU Profiling for Causal Performance Analysis", presented at OSDI '24.