Beyond the Flame Graph

How wPerf Finds the Bottlenecks Your Profiler Can't See


There's a specific kind of frustration that every performance engineer has felt. You're staring at a flame graph, CPU utilization is sitting at 35%, your latency is through the roof — and the profiler shows nothing. No hot functions. No obvious culprits. Just a system that's inexplicably slow.

That's an Off-CPU problem, and traditional profilers are completely blind to it.

When a thread blocks on a mutex, waits for a disk write to commit, or sleeps on a condition variable, CPU profilers record... nothing. The thread isn't on the CPU, so the sampler never fires. The bottleneck is real and often severe, but it's invisible to the tooling most of us reach for first.

This is what motivated wPerf, a research project out of OSU Systems Lab that was presented at OSDI '18. The paper doesn't just add another Off-CPU profiler to the ecosystem — it introduces a fundamentally different way of reasoning about waiting, one grounded in graph theory rather than duration statistics.

The Problem with "Who Waited Longest"

The existing approach to Off-CPU analysis, popularized by Brendan Gregg's Off-CPU flame graphs and eBPF-based tools like offcputime, works by recording how long each thread spends blocked and attributing that time to the blocking stack. It answers: which thread waited the most?

This is useful, but it has a critical flaw: long waits are not the same as global bottlenecks.

Imagine a background compaction thread in a storage engine sleeping for 200ms waiting for a flush to complete. That's a long wait. It'll show up prominently in any duration-based profiler. But if that thread has no relationship to your request-handling path, optimizing it moves nothing meaningful.

Meanwhile, a 50-microsecond lock acquisition on your critical write path — taken thousands of times per second by every worker thread — is generating almost no per-event duration signal, but it's quietly throttling your entire throughput. Duration-based analysis will rank it below the compaction thread noise.

wPerf's insight is that we need to answer a different question: which waiting events are causally connected to the overall throughput of the system?

Wait-for Graphs and the Knot Theorem

wPerf models a running system as a directed graph G=(V,E). Each vertex represents either a thread or an I/O device (treated as a pseudo-thread). Each directed edge u→v means "thread u is currently waiting on v." The weight of an edge represents the cumulative wait time over the observation window.
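As a concrete illustration, that structure fits in a few lines of Python. The class name, thread names, and millisecond weights below are illustrative, not wPerf's internal representation:

```python
from collections import defaultdict

# Toy wait-for graph: vertices are thread or device names (devices are
# pseudo-threads), and each edge weight is the cumulative wait time in ms
# over the observation window.
class WaitForGraph:
    def __init__(self):
        self.weight = defaultdict(float)  # (waiter, target) -> cumulative ms

    def add_wait(self, waiter, target, ms):
        self.weight[(waiter, target)] += ms

    def edges(self):
        return dict(self.weight)

g = WaitForGraph()
g.add_wait("worker-1", "log-writer", 0.05)  # a 50 us lock wait
g.add_wait("worker-1", "log-writer", 0.05)  # taken again: weights accumulate
g.add_wait("log-writer", "disk0", 12.0)     # device as a pseudo-thread
```

The accumulation matters: thousands of tiny waits on one edge add up to a large weight, which is exactly the signal duration-based tools dilute across individual events.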

This structure captures what duration-based tools miss: the topology of waiting. It's not just who waited, but who caused whom to wait, and how that causality chains through the system.

The key theoretical contribution is what the paper calls the Knot theorem. In graph theory, a Knot is a set of vertices in which every vertex can reach every other, and no vertex has an edge leaving the set — a closed waiting loop. The theorem states: if a wait-for graph contains any Knot composed of worker threads, that Knot contains at least one bottleneck limiting the application's maximum throughput.

The intuition is clean. If a set of threads are waiting only on each other and never on anything external, they're stuck in a closed system. No optimization outside that Knot can break the cycle. To improve throughput, you must break at least one edge inside every Knot.

To find Knots efficiently, wPerf decomposes the wait-for graph into Strongly Connected Components (SCCs) using Tarjan's algorithm. Any SCC with no outgoing edges in the condensed DAG is a Knot. This immediately narrows the search space from hundreds of threads to a handful of structural bottlenecks.
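A minimal sketch of that decomposition, using a hand-rolled Kosaraju SCC pass rather than NetworkX or Tarjan's algorithm (any SCC decomposition finds the same components); the three-thread graph at the bottom is made up for illustration:

```python
# Knot detection sketch: compute SCCs of the wait-for graph, then keep
# the SCCs with no outgoing edges in the condensation (the Knots).
def sccs(graph):
    # graph: dict node -> set of successor nodes (u -> v means u waits on v)
    order, seen = [], set()
    for root in graph:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:                      # iterative DFS, post-order
            node, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(graph.get(v, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    rev = {}                              # reversed edges for the second pass
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, set()).add(u)
    comp, comps = {}, []
    for u in reversed(order):
        if u in comp:
            continue
        comp[u], group, work = len(comps), set(), [u]
        while work:
            x = work.pop()
            group.add(x)
            for y in rev.get(x, ()):
                if y not in comp:
                    comp[y] = len(comps)
                    work.append(y)
        comps.append(group)
    return comps, comp

def knots(graph):
    comps, comp = sccs(graph)
    sinks = set(range(len(comps)))
    for u, vs in graph.items():
        for v in vs:
            if comp[u] != comp[v]:
                sinks.discard(comp[u])    # edge leaves the SCC: not a Knot
    return [comps[i] for i in sinks]

# A and B wait on each other (closed loop); C waits on A but nothing
# inside the loop ever waits on C, so C is outside the Knot.
wf = {"A": {"B"}, "B": {"A"}, "C": {"A"}}
# knots(wf) -> [{"A", "B"}]
```

Note how C's long wait on A contributes nothing to the Knot: optimizing C cannot improve throughput, which is the theorem's point.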

Cascaded Redistribution: Tracing Causality Through Chains

Identifying Knots gets you to the right neighborhood, but you still need to know which specific wait inside the Knot to fix. This is where it gets more interesting.

Consider a chain: thread A waits on thread B, which is blocked waiting for a disk write. The raw graph has edges A→B and B→Disk. If you only look at edge weights, B carries the weight of A's waiting but is itself a victim. Optimizing B's code accomplishes nothing if the real bottleneck is disk throughput.

wPerf solves this with cascaded redistribution. The algorithm propagates wait weight along dependency chains: if A waits for B, and during that same interval B is waiting for C, then A's wait is redistributed — partially or fully — onto the B→C edge. The weight "flows" toward the root cause.

If a node is actively executing on CPU while being waited on, it absorbs the weight — it's genuinely doing work that's taking time. If it's also blocked, the weight keeps flowing downstream. The final accumulated weight on each edge reflects its true causal contribution to the waiting of everything upstream.
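A toy version of the idea, assuming simplified interval records rather than the paper's actual bookkeeping: any portion of a wait that overlaps the target's own downstream blocking is forwarded one hop, and whatever remains is absorbed as genuine on-CPU work.

```python
# Toy cascaded redistribution over (waiter, target, (start, end)) records.
# One-hop forwarding only; the real algorithm cascades through whole chains.
def overlap(a, b):
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def redistribute(waits):
    edge_weight = {}
    for waiter, target, iv in waits:
        remaining = iv[1] - iv[0]
        # Time the target itself spent blocked is forwarded downstream.
        for w2, t2, iv2 in waits:
            if w2 == target:
                shared = overlap(iv, iv2)
                if shared > 0:
                    edge_weight[(w2, t2)] = edge_weight.get((w2, t2), 0.0) + shared
                    remaining -= shared
        # Whatever is left, the target was running: it absorbs the weight.
        edge_weight[(waiter, target)] = edge_weight.get((waiter, target), 0.0) + remaining
    return edge_weight

# A waits on B for [0, 10]; B spends [2, 10] of that blocked on the disk.
waits = [("A", "B", (0.0, 10.0)),
         ("B", "disk", (2.0, 10.0))]
# redistribute(waits) -> {("B", "disk"): 16.0, ("A", "B"): 2.0}
```

The B→disk edge ends up carrying both B's own 8 units of waiting and the 8 units forwarded from A, while A→B keeps only the 2 units during which B was actually executing: the weight has flowed toward the root cause.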

After redistribution, wPerf applies edge trimming — dropping edges below a significance threshold (roughly 1% of total time) — and re-runs Knot identification. This iterative refinement compresses a complex dependency network with hundreds of threads down to a tight loop of a few nodes that are genuinely causing each other's delays.
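The trimming pass itself is simple. This sketch assumes an edge-weight dictionary like the one above and a hypothetical `total_time` observation window:

```python
# Drop edges contributing under ~1% of the observation window; Knot
# identification is then re-run on the smaller graph.
def trim(edge_weight, total_time, threshold=0.01):
    cutoff = threshold * total_time
    return {e: w for e, w in edge_weight.items() if w >= cutoff}

weights = {("worker", "log-writer"): 5.0, ("gc", "heap-monitor"): 0.2}
# trim(weights, total_time=100.0) keeps only the ("worker", "log-writer") edge
```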

Implementation: KProbes, perf, and a Python Graph Engine

The engineering implementation is worth understanding because it shapes what wPerf can and can't observe.

The kernel module uses KProbe hooks on two scheduler functions:

# The two critical hook points
sched_switch   # captures who left the CPU and why (blocked vs preempted)
sched_wakeup   # captures who woke whom — this builds the directed edges

sched_wakeup is the critical one. Every time a thread wakes another, wPerf records the waker's TID alongside the wakee's. This is what allows it to construct directed edges rather than just recording durations. eBPF's offcputime typically only captures the blocking stack — it knows who was sleeping but not who kicked it awake.

The recorder layer (record.sh) coordinates additional data collection:

# Stack collection for attribution
perf record -g --call-graph dwarf -p $PID

# I/O saturation detection
iostat -x 1
ifstat 1

# Thread name resolution for readability
jstack $PID  # for JVM workloads

The analyzer is Python-based, using NetworkX for graph operations. It ingests the raw wait/wakeup event stream, performs timestamp matching to reconstruct wait intervals, runs the redistribution algorithm, and exports the resulting Knots and weighted edges as CSV files with full call stacks attached.
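The matching step can be sketched as a single pass over a time-sorted stream. The event tuple format here is an assumption for illustration, not wPerf's binary log layout:

```python
# Reconstruct wait intervals from a time-sorted scheduler event stream.
# Each event is (ts, kind, tid, waker): "sleep" means tid blocked,
# "wakeup" means waker woke tid. The waker field is what sched_wakeup
# hooking provides and what lets us build directed edges.
def build_intervals(events):
    sleeping = {}    # tid -> timestamp when it blocked
    intervals = []   # (waiter_tid, waker_tid, start, end)
    for ts, kind, tid, waker in events:
        if kind == "sleep":
            sleeping[tid] = ts
        elif kind == "wakeup" and tid in sleeping:
            intervals.append((tid, waker, sleeping.pop(tid), ts))
    return intervals

events = [(100, "sleep", 7, None),
          (105, "sleep", 9, None),
          (130, "wakeup", 7, 9),   # thread 9 wakes thread 7
          (150, "wakeup", 9, 3)]   # thread 3 wakes thread 9
# build_intervals(events) -> [(7, 9, 100, 130), (9, 3, 105, 150)]
```

Each resulting interval becomes a weighted edge (waiter → waker) in the wait-for graph, feeding the redistribution step.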

For user-space synchronization that doesn't cross the kernel boundary — spinlocks, RDMA operations, custom lock-free structures — wPerf provides an annotation API. C programs call uds_add() at initialization and log custom wait/wakeup events around their synchronization points. Java workloads use a JNI wrapper. It's inelegant but necessary: modern high-performance systems deliberately bypass the kernel, and a profiler that can only see kernel-boundary events will miss a growing fraction of real bottlenecks.

What It Found in the Wild

The case studies are where the methodology proves its value.

ZooKeeper was the flagship example. Under a read-heavy workload, CPU utilization sat around 30-40% — clearly not the bottleneck. wPerf's graph revealed a Knot between the SendThread and SyncThread, with an I/O pseudo-thread representing the transaction log. The cascade analysis showed that the fundamental issue was batch size: ZooKeeper's design limited outstanding requests, which reduced the number of writes batched per fsync call, which made synchronization overhead dominate. After increasing concurrency limits and tuning the batching algorithm, throughput improved by up to 4.83x with latency dropping over 50%.

MySQL surfaced a subtler issue. wPerf identified an SCC containing both worker threads and background flush threads. The redistribution analysis converged on a specific lock: "waiting for empty flushing slot." MySQL defaulted to 8 flush slots for single-page operations — a configuration that silently serialized concurrent dirty-page eviction under high write load. The fix was straightforward (increase the slot count or buffer size), but the bug was essentially invisible to any tool that only looked at lock contention duration without tracing the resource pool structure.

HDFS demonstrated the distributed case. EditLog sync latency in the Namenode propagated through processing threads to DataNode clients — a dependency chain that crossed RPC boundaries. The cascade algorithm correctly stripped out network latency contributions and isolated metadata flush as the root cause.

Overhead and Tradeoffs

wPerf reports approximately 5.1% runtime overhead in its evaluations, which is competitive for a tool providing full causal graph analysis. For comparison:

  • perf with CPU sampling: under 1%, but completely blind to Off-CPU time
  • eBPF offcputime: 10-40% under high syscall rates
  • ftrace function tracing: 298-841% — fine for development, unusable in production

The gap between wPerf and eBPF tools at high syscall rates is worth noting. eBPF programs execute synchronously on every event; in a system making millions of context switches per second, that adds up. wPerf's kernel module uses an optimized binary log format with deferred flushing, which smooths out the overhead profile.

The honest limitation is the annotation requirement for user-space synchronization. If you're profiling a system that does significant coordination below the kernel boundary — and many high-performance databases, messaging systems, and network stacks do — you need to instrument those paths manually. That's not a fatal flaw, but it is engineering work.

What This Changes for Practitioners

The practical shift wPerf demands is conceptual before it's tooling. Duration-based profiling asks "what took the longest?" Graph-based analysis asks "what is structurally blocking progress?"

These are different questions, and they lead to different diagnoses. A thread that waited for 500ms on a lock is a fact. A Knot containing that thread, where cascade analysis shows the weight ultimately accumulating on a 5-slot resource pool shared by 32 threads, is an actionable finding.

A few things to take from this:

When you see high latency with low CPU utilization, you're in Off-CPU territory. offcputime or wPerf-style analysis is the right tool, not a CPU flame graph.

When offcputime gives you a stack but the fix isn't obvious, you're probably looking at a victim thread rather than a root cause. Tracing the wakeup chain — who woke the thread, who woke the waker — gets you to the real bottleneck.

wPerf's Knot theorem gives you a formal way to evaluate whether a proposed optimization will actually help. If the bottleneck isn't inside a Knot, fixing it won't improve throughput. This is a useful sanity check before spending engineering time on a change.

The tooling itself is available at OSUSysLab/wPerf on GitHub. It requires kernel module compilation and some integration work, so it's not a one-command drop-in. But the methodology — building wait-for graphs, identifying SCCs, applying cascade redistribution — is sound enough that even implementing parts of it in your own tracing infrastructure, using eBPF and a graph library, would yield better Off-CPU analysis than duration aggregation alone.

The flame graph was a genuine breakthrough for CPU profiling. For Off-CPU analysis, we've been coasting on the duration-based equivalent for too long. Graph topology is the next step.


wPerf was presented at OSDI '18. The full paper is available at the USENIX site.
