I've spent a good chunk of the past week reading through the opentelemetry-ebpf-profiler codebase, and I keep coming back to the same thought: this is one of the most carefully engineered pieces of systems software I've read in years. The problem it solves — continuous profiling across mixed-runtime workloads, with no frame pointers, no debug symbols, zero process restarts, kernel 5.4+ — is genuinely hard. The solutions it found are worth studying even if you never touch a profiler.
Let me walk you through what I found.
The Core Insight: Compile the DWARF Away
If you've tried to implement stack unwinding without frame pointers, you've probably stared at .eh_frame for a while and felt vaguely hopeless. DWARF CFI is a stack machine with arbitrary expressions. It's designed for correctness, not speed. Running a DWARF interpreter in a hot interrupt path would be insane.
The otel-ebpf-profiler team's answer is elegant: don't interpret DWARF at runtime. Instead, run the DWARF virtual machine once, ahead of time, in userspace, and compile its output into a compressed lookup table that eBPF can query in microseconds.
The result is a two-phase architecture:
- Userspace Host Agent (Go): Parses `.eh_frame`, `.debug_frame`, and `.gopclntab` from ELF files, runs the DWARF CFA interpreter, and distills the output into compact "stack delta" tables. These get loaded into eBPF maps.
- eBPF (C): On each perf event, binary-searches the pre-loaded maps to find the stack delta for the current PC, then uses it to unwind one frame. Tail-calls itself. Repeats up to 29 times.
The entry point for the userspace extraction is stackdeltaextraction.go:184, function extractFile(). It tries data sources in a specific order:
```go
// stackdeltaextraction.go:203-218
if err = ee.parseGoPclntab(); err != nil { ... }
if err = ee.parseEHFrame(); err != nil { ... }
if err = ee.parseDebugFrame(elfFile); err != nil { ... }
```

Go binaries first, then `.eh_frame`, then `.debug_frame` as a last resort. When a Go binary has both `.gopclntab` and `.eh_frame` (it usually does; the C runtime parts have `.eh_frame`), an `extractionFilter` selects `.gopclntab`-covered ranges first and fills gaps from `.eh_frame`. Those gaps are the PLT stubs, C library calls, etc.
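The gap-filling step is plain interval subtraction. Here's a minimal Go sketch of the idea; the `rng` type and `fillGaps` function are illustrative names, not the actual `extractionFilter` API:

```go
package main

import (
	"fmt"
	"sort"
)

// rng is a half-open [Start, End) address range.
type rng struct{ Start, End uint64 }

// fillGaps keeps only the parts of each candidate range that are not
// already covered. It assumes the covered ranges do not overlap each other.
func fillGaps(covered, candidates []rng) []rng {
	sort.Slice(covered, func(i, j int) bool { return covered[i].Start < covered[j].Start })
	var out []rng
	for _, c := range candidates {
		pos := c.Start
		for _, g := range covered {
			if g.End <= pos || g.Start >= c.End {
				continue // no overlap with this candidate
			}
			if g.Start > pos {
				out = append(out, rng{pos, g.Start}) // uncovered gap before g
			}
			if g.End > pos {
				pos = g.End // skip past the covered part
			}
		}
		if pos < c.End {
			out = append(out, rng{pos, c.End}) // trailing uncovered tail
		}
	}
	return out
}

func main() {
	gopclntab := []rng{{0x1000, 0x8000}} // ranges covered by .gopclntab
	ehframe := []rng{{0x0800, 0x9000}}   // ranges described by .eh_frame
	fmt.Println(fillGaps(gopclntab, ehframe))
}
```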
The DWARF VM Implementation
elfehframe.go implements a full DWARF CFA interpreter. state.step() at line 605 handles about 30 DWARF opcodes. The interesting part isn't the common cases — cfaDefCfa, cfaOffset, etc. are straightforward. It's how they deal with cfaDefCfaExpression and cfaExpression, the opcodes that emit arbitrary DWARF stack programs.
They don't evaluate arbitrary expressions. Instead, they pattern-match against a small set of known shapes (lines 492–532):

- The GCC-generated PLT trampoline pattern
- `REG + offset`
- `*(REG + offset)`
- `*(REG1 + 8*REG2 + offset) + postOffset` (this one showed up in OpenSSL hand-written assembly)
That last pattern is particularly telling. Somebody actually went and decompiled OpenSSL's assembly to figure out what DWARF expression GCC was emitting for it, then added a special case. That's the kind of attention to detail that makes this project interesting.
If an expression doesn't match any known pattern, it falls back to "unknown", which means the unwinder gives up at that frame. Acceptable. Full correctness would require shipping a DWARF stack machine into the kernel, and nobody wants that.
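To make the "known shapes" idea concrete, here's a hedged Go sketch of what matching one such shape, `*(REG + offset)`, could look like. The opcode values (`DW_OP_breg0` = 0x70, `DW_OP_deref` = 0x06) come from the DWARF specification; the matcher itself is illustrative and not the profiler's code:

```go
package main

import "fmt"

const (
	dwOpBreg0 = 0x70 // DW_OP_breg0 .. DW_OP_breg31 occupy 0x70..0x8f
	dwOpDeref = 0x06 // DW_OP_deref
)

// sleb128 decodes a signed LEB128 value, returning it and bytes consumed
// (0 consumed means the input was truncated).
func sleb128(b []byte) (int64, int) {
	var result int64
	var shift uint
	for i, c := range b {
		result |= int64(c&0x7f) << shift
		shift += 7
		if c&0x80 == 0 {
			if shift < 64 && c&0x40 != 0 {
				result |= -1 << shift // sign-extend
			}
			return result, i + 1
		}
	}
	return 0, 0
}

// matchDerefBregOffset recognizes the shape *(REG + offset):
// exactly one DW_OP_bregN (with its SLEB128 offset) followed by DW_OP_deref.
func matchDerefBregOffset(expr []byte) (int, int64, bool) {
	if len(expr) < 3 || expr[0] < dwOpBreg0 || expr[0] > dwOpBreg0+31 {
		return 0, 0, false
	}
	off, n := sleb128(expr[1:])
	if n == 0 || len(expr) != 1+n+1 || expr[1+n] != dwOpDeref {
		return 0, 0, false
	}
	return int(expr[0] - dwOpBreg0), off, true
}

func main() {
	// DW_OP_breg6 (rbp on x86_64), offset -16, DW_OP_deref → *(rbp - 16)
	reg, off, ok := matchDerefBregOffset([]byte{0x76, 0x70, 0x06})
	fmt.Println(reg, off, ok)
}
```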
The Stack Delta Encoding
After all the DWARF interpretation, what do you actually store in the eBPF maps? Here's the representation (types.h:872-875):

```c
typedef struct StackDelta {
  u16 addrLow;    // Low 16 bits of address (offset within 64KB page)
  u16 unwindInfo; // Index into global UnwindInfo array, or special command flags
} StackDelta;
```

Four bytes per entry. The page-level granularity is the key: you only store the low 16 bits of the address, since the high bits are captured by the map key (fileID + page number). Then `unwindInfo` is an index into a global deduplication table:
```c
typedef struct UnwindInfo {
  u8  flags;       // UNWIND_FLAG_*
  u8  baseReg;     // CFA base register (RBP, RSP, RAX, ...)
  u8  auxBaseReg;  // Register for FP/return address recovery
  u8  mergeOpcode; // Two-adjacent-delta merge optimization
  s32 param;       // CFA expression parameter (usually the stack offset)
  s32 auxParam;    // FP/RA expression parameter
} UnwindInfo;
```

Twelve bytes. The `unwind_info_array` is global across all executables. The observation: the set of unique CFA expressions in a typical system isn't that large. They measured it: /usr/bin/* and /usr/lib/*.so on a desktop system produce about 9,700 unique UnwindInfo entries. The map ceiling is 16,384. So even with significant binary diversity, you're unlikely to hit the limit in practice.
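The addressing scheme splits a file-relative PC at the 64KB boundary; a trivial sketch of the decomposition:

```go
package main

import "fmt"

const pageBits = 16 // 64 KiB stack-delta pages

// splitPC decomposes a file-relative PC the way the delta tables key it:
// the high bits select the 64 KiB page (and go into the map key), so only
// the low 16 bits need to be stored per StackDelta entry.
func splitPC(pc uint64) (page uint64, addrLow uint16) {
	return pc >> pageBits, uint16(pc & 0xFFFF)
}

func main() {
	page, low := splitPC(0x7f3a_12c4)
	fmt.Printf("page=%#x addrLow=%#x\n", page, low)
}
```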
The mergeOpcode field is a micro-optimization I didn't expect. Adjacent stack delta entries that represent a push/pop pair get merged into a single record using a 1-bit sign + 7-bit address threshold encoding. It reduces table size and, more importantly, reduces the number of eBPF map entries the binary search has to traverse.
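As a sketch of what a 1-bit-sign plus 7-bit-distance field might look like (the exact bit assignment here is my assumption, not the profiler's layout):

```go
package main

import "fmt"

// packMerge packs "two adjacent deltas that differ by a push/pop" into
// one byte: bit 7 carries the adjustment sign, bits 0-6 the address
// distance to where the second delta takes effect.
// NOTE: illustrative bit layout, not the profiler's actual encoding.
func packMerge(negative bool, dist uint8) uint8 {
	op := dist & 0x7f
	if negative {
		op |= 0x80
	}
	return op
}

func unpackMerge(op uint8) (negative bool, dist uint8) {
	return op&0x80 != 0, op & 0x7f
}

func main() {
	op := packMerge(true, 5) // second delta 5 bytes later, stack shrinks
	neg, dist := unpackMerge(op)
	fmt.Println(op, neg, dist)
}
```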
The Three-Level Map Hierarchy
Finding the right StackDelta for a given (PID, virtual address) requires three levels of lookup:
```mermaid
flowchart TD
    A["(PID, VA) from eBPF interrupt"] --> B["Level 1: pid_page_to_mapping_info (LPM_TRIE)<br/>Maps (PID, page) → fileID + bias + unwinder"]
    B --> C["Level 2: stack_delta_page_to_info (HASH_MAP)<br/>Maps (fileID, 64KB page number) → range in inner array"]
    C --> D["Level 3: exe_id_to_N_stack_deltas (HASH_OF_MAPS, 16 buckets)<br/>Binary search within inner array"]
    D --> E["UnwindInfo index → unwind_info_array lookup"]
```

The Level 1 LPM Trie is particularly neat. Rather than inserting one entry per 64KB page of a memory mapping, the Go side uses a "rightmost set bit" splitting algorithm (lpm/lpm.go) to decompose [start, end) into the minimum number of power-of-two-aligned LPM prefixes. One mmap() region becomes a handful of trie entries, not thousands.
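The splitting algorithm is short enough to sketch in full. This is my reconstruction of the idea described above, not a copy of lpm/lpm.go:

```go
package main

import "fmt"

type prefix struct {
	Base uint64
	Bits int // prefix length in bits, out of 64
}

// splitRange decomposes [start, end) into the minimum number of
// power-of-two-aligned blocks, each expressible as one LPM trie prefix.
// The largest block that can begin at `start` is limited both by start's
// alignment (its rightmost set bit) and by the bytes remaining.
func splitRange(start, end uint64) []prefix {
	var out []prefix
	for start < end {
		size := start & -start // largest power of two dividing start
		if start == 0 {
			size = 1 << 63
		}
		for size > end-start { // don't overshoot the end of the range
			size >>= 1
		}
		// a block of `size` bytes fixes the top 64-log2(size) bits
		bits := 64
		for s := size; s > 1; s >>= 1 {
			bits--
		}
		out = append(out, prefix{start, bits})
		start += size
	}
	return out
}

func main() {
	// one mmap region becomes a handful of prefixes, not thousands of pages
	for _, p := range splitRange(0x7f0000003000, 0x7f0000011000) {
		fmt.Printf("base=%#x /%d\n", p.Base, p.Bits)
	}
}
```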
The Level 3 bucketing — 16 separate inner maps sized from 2^8 to 2^23 — solves a real eBPF problem. Inner maps in HASH_OF_MAPS must all have the same size at creation time. If you size for the worst case, small executables waste enormous amounts of memory. By creating 16 map sizes and assigning each executable to the smallest bucket that fits, memory usage scales with actual delta count rather than worst-case delta count. The selection logic is in native_stack_trace.ebpf.c:28-63.
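The bucket-selection rule reduces to "smallest power of two that fits the delta count, clamped to the 16 sizes". A hedged sketch; the exponent range 8..23 comes from the text above, and the real logic lives in native_stack_trace.ebpf.c:

```go
package main

import "fmt"

// bucketFor returns the exponent of the smallest power-of-two inner map
// (between 2^8 and 2^23) that can hold n stack deltas.
func bucketFor(n uint32) (exp int, ok bool) {
	for exp = 8; exp <= 23; exp++ {
		if n <= 1<<exp {
			return exp, true
		}
	}
	return 0, false // too many deltas for any bucket
}

func main() {
	for _, n := range []uint32{100, 40_000, 9_000_000} {
		exp, ok := bucketFor(n)
		fmt.Println(n, exp, ok)
	}
}
```

A small executable with 100 deltas lands in the 2^8 bucket instead of a worst-case-sized map, which is the whole point of the scheme.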
The binary search itself (get_stack_delta() at line 168) uses the classic for (i = 0; i < 16; i++) pattern rather than a while loop. This isn't a style choice — it's required by the eBPF verifier, which needs statically bounded loops. The verifier also forces several null checks on values that logically can't be null (like get_per_cpu_record() results), but without them the program won't load.
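Here's what that verifier-friendly shape looks like, sketched in Go for readability (the real thing is eBPF C; this only mirrors the bounded-loop structure, since 16 halvings cover an inner array of up to 2^16 entries):

```go
package main

import "fmt"

// findDelta returns the index of the last entry with addrLow <= target,
// using a constant-bound for loop instead of `for lo < hi`, the way the
// eBPF verifier requires. Returns -1 if no entry qualifies.
func findDelta(addrLows []uint16, target uint16) int {
	lo, hi := 0, len(addrLows)
	found := -1
	for i := 0; i < 16; i++ { // statically bounded for the verifier
		if lo >= hi {
			break
		}
		mid := (lo + hi) / 2
		if addrLows[mid] <= target {
			found = mid
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	return found
}

func main() {
	deltas := []uint16{0x0000, 0x0010, 0x0125, 0x0aa0}
	fmt.Println(findDelta(deltas, 0x0200)) // delta covering PC offset 0x200
}
```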
The Tail Call Architecture
Each eBPF program can only be so large. Complex unwinding logic for 11+ language runtimes can't fit in one program. The solution is tail call chains via bpf_tail_call(), managed through perf_progs (a BPF_PROG_ARRAY map).
Each PROG_UNWIND_NATIVE invocation handles exactly 5 frames, then tail-calls either itself (more native frames), an interpreter-specific unwinder (the PC has moved into a JVM, Python, or other managed-runtime frame), or PROG_UNWIND_STOP. The logic for deciding which tail call to make is get_next_unwinder() in tracemgmt.h.
The interpreter detection is worth looking at. Before any tail call, the code checks whether the current PC falls within a known interpreter's main loop range by looking it up in interpreter_offsets (a hash map keyed by PC range). This is how the profiler knows to switch from native unwinding to, say, JVM frame walking — without any cooperation from the JVM process.
Near the tail call limit (29/32), the system detects it's running low and jumps to PROG_UNWIND_STOP with whatever frames it's collected so far. You get a truncated but valid stack trace rather than garbage. That's the right behavior.
Signal Frames and the vDSO Problem
Two corner cases that would break naive unwinding: signal frames and the vDSO.
Signal frames are detected either by checking the S augmentation character in the CIE (part of the DWARF metadata), or by pattern-matching the PC against the address of rt_sigreturn. When a signal frame is detected, the unwinder reads the rt_sigframe struct from the stack — at a fixed offset of 40 bytes on x86_64 — to recover 18 registers and resume unwinding from the interrupted context.
The vDSO is messier. The ARM64 vDSO doesn't have correct .eh_frame data. Rather than skipping it or hoping for frame pointers, synthdeltas.go literally disassembles the vDSO at profiler startup: it scans for STP/LDP instruction pairs, identifies frame setup/teardown patterns, and synthesizes fake stack delta entries from that analysis. It's the kind of thing you do when you've committed to actually working on ARM64 servers.
Container Support: Leveraging /proc
The container story is surprisingly clean. There's no "container mode" — the profiler runs in the host PID namespace, which means:
- eBPF returns host PIDs (not container-relative PIDs). `bpf_get_current_pid_tgid()` always gives you the host view.
- `/proc/<HOST_PID>/` gives you everything about the container process without any namespace translation.
getMappingFile() in process/process.go:401 has a three-tier fallback for accessing the actual binary files:
- `/proc/<PID>/map_files/<addr>`: works even for files that have been deleted after being `dlopen()`'d (common in container image updates)
- `/proc/<PID>/root/<path>`: follows the process's mount namespace root, piercing through overlay filesystems
- `/proc/<PID>/task/<TID>/root`: fallback when the main thread has exited but other threads are still running
The cache key for parsed ELF files is (device, inode) — so two containers sharing a base image layer with identical shared libraries only cause one ELF parse. That's the elfInfoCache in processmanager/processinfo.go:186.
What you don't get: Kubernetes pod metadata. No labels, no namespace names. The profiler knows the container ID (extracted from /proc/<PID>/cgroup via regex matching against Docker/containerd/CRI-O formats) and exports it as container.id, but correlating that with K8s objects is left to the backend.
Symbol Resolution: The Deliberate Split
For native C/C++/Rust binaries, the profiler doesn't resolve function names. It can't — the agent doesn't have access to debuginfo, and downloading it at runtime would add latency and complexity. Instead, it ships the ELF virtual address plus the FileID and GNU Build ID to the backend. The backend can then look up the debug info (via debuginfod or its own symbol index) asynchronously.
The FileID scheme (libpf/fileid.go:114-170) is SHA-256 over the first 4KiB + last 4KiB + length of the file. Notably, this is entirely independent of any symbol or debug information — it's purely a content fingerprint of the ELF structure. The code has a rather emphatic comment about this:
> ANY CHANGE IN BEHAVIOR CAN EASILY BREAK OUR INFRASTRUCTURE, POSSIBLY MAKING THE ENTIRETY OF THE DEBUG INDEX OR FRAME METADATA WORTHLESS
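For intuition, here's a hedged reconstruction of the scheme in Go. The hash inputs (first 4 KiB, last 4 KiB, file length) are stated above; the length encoding, the small-file overlap behavior, and the truncation to 128 bits are my assumptions, so treat libpf/fileid.go as the only authoritative version:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// fileIDOf sketches the content fingerprint: SHA-256 over the first
// 4 KiB, the last 4 KiB, and the file length. For files shorter than
// 8 KiB the head and tail overlap here; whether the real implementation
// does the same is an assumption.
func fileIDOf(data []byte) [16]byte {
	h := sha256.New()
	head, tail := 4096, 4096
	if len(data) < head {
		head = len(data)
	}
	if len(data) < tail {
		tail = len(data)
	}
	h.Write(data[:head])
	h.Write(data[len(data)-tail:])
	var sz [8]byte
	binary.BigEndian.PutUint64(sz[:], uint64(len(data))) // byte order assumed
	h.Write(sz[:])
	var id [16]byte
	copy(id[:], h.Sum(nil)) // truncation to 128 bits assumed
	return id
}

func main() {
	fmt.Printf("%x\n", fileIDOf([]byte("not a real ELF, just demo bytes")))
}
```

Note the consequence of the design: for a large file, changing bytes in the middle leaves the FileID unchanged, while touching the head, tail, or length changes it.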
Go binaries are the exception. .gopclntab is in a PT_LOAD segment, so it survives stripping. Even a fully stripped Go binary retains full function name, source file, and line number information. The fallback search for gopclntab magic bytes across all read-only PT_LOAD segments (elfgopclntab.go:304-346) means it finds the table even in PIE binaries where the section table has been stripped.
For interpreted runtimes — JVM, CPython, V8, Ruby, PHP, Erlang, .NET — the profiler reads directly from process memory via process_vm_readv, following the interpreter's own data structures to reconstruct symbolic frames. This is the Loader/Data/Instance abstraction in interpreter/types.go:
```go
type Loader func(ebpf EbpfHandler, info *LoaderInfo) (Data, error)
type Data interface { Attach(...) (Instance, error); Unload(...) }
type Instance interface { Symbolize(...) error; Detach(...) error }
```

Detection happens by looking at the process's dynsym for well-known symbols. libjvm.so presence means HotSpot. v8dbg_* dynamic symbols mean V8. Each interpreter gets its own Instance per process, and they share Data structures (the parsed offsets for a given JVM version, for instance) across all processes running that same JVM binary.
The Coredump Testing Trick
This is my favorite engineering decision in the whole codebase.
eBPF programs are famously hard to test. You can't just run them in a unit test — you need a kernel, you need to trigger the right perf events, you need to introspect the map state. The team's solution: compile the same eBPF C code into a userspace CGO binary using a TESTING_COREDUMP preprocessor flag.
```c
#if defined(TESTING_COREDUMP)
static inline long bpf_probe_read_user(void *buf, u32 sz, const void *ptr) {
  long __bpf_probe_read_user(u64, void *, u32, const void *);
  return __bpf_probe_read_user(__cgo_ctx->id, buf, sz, ptr);
}
#endif
```

Every BPF helper gets a Go implementation that operates on actual coredump data. The eBPF unwinding logic (the binary search, the frame recovery, everything) runs in a normal test process against real coredump files. If it produces the right stack trace from a coredump, it will produce the right stack trace at runtime.
This is the kind of testability infrastructure that separates serious projects from toys.
The Known Warts
No honest analysis skips the rough edges.
The new-process race: When a new process starts, the first perf event that hits it triggers a PID report and returns immediately; the trace is discarded. Userspace processes the PID event, loads the ELF, populates the maps, and only then do subsequent samples work. Short-lived processes that die before a second sample are invisible. This is acknowledged in tracemgmt.h with "technically this is not SMP safe", a comment that understates it.
The unbuffered channel: traceCh in controller.go:171 is unbuffered. The goroutine that reads from the perf buffer (G4) is directly coupled to the goroutine that processes traces (G5). Under high off-CPU event rates, G5 becomes a bottleneck and backs up into G4. The code at events.go:240-241 acknowledges this explicitly.
The 16384 UnwindInfo limit: It's a constant, not a configuration parameter. On a system with a lot of unique binary variants — think heavily patched containerized services — you could hit it. There's no graceful degradation path documented for when you do.
The technical debt: I counted four FIXME comments in the .NET implementation alone (dotnet/method.go:202, dotnet/instance.go:479), plus a TEMPORARY HACK in interpreter_dispatcher.ebpf.c:280-295 for filtering malformed single-frame traces. These aren't dealbreakers but they're worth knowing about if you're running this in production.
The verifier tax: A significant fraction of the eBPF C code exists purely to satisfy the verifier's static analysis, not because it's logically necessary. Every loop is bounded by a constant. Every map lookup result is null-checked even when it can't be null. The binary search that would naturally be while (lo < hi) is instead for (int i = 0; i < 16; i++). This makes the code harder to read and reason about, which is an ironic cost for safety guarantees.
What You Should Steal
If you're building any kind of userspace-kernel data pipeline over eBPF, a few patterns here are worth copying directly:
Pre-compile the expensive stuff. DWARF interpretation is the extreme case, but the principle applies broadly: do the heavy computation at load time, store the results in eBPF maps in a form the kernel program can consume cheaply.
Use LPM tries for address range queries. One trie entry per naturally-aligned power-of-two block covers arbitrary ranges with minimal entries. The lpm/lpm.go splitting algorithm is clean and portable.
Bucket your inner maps. If you're using HASH_OF_MAPS and the inner map sizes vary significantly, 16 buckets spanning powers of two is much more memory-efficient than one size fits all.
Test with coredumps. The TESTING_COREDUMP pattern — compiling the same eBPF C code as CGO with stub helpers — is genuinely brilliant. It lets you test the actual kernel-side logic with real data without needing a kernel.
Use BPF_NOEXIST for event deduplication. The inhibit_events latch in tracemgmt.h:58-85 is a clean pattern for "send this event at most once" without any explicit locking.
The thing that keeps striking me about this codebase is how many decisions were clearly made by people who had already tried the naive approach and hit its limits. The bucketed maps, the coredump testing, the DWARF pre-compilation, the three-tier /proc fallback — none of these are the first thing you'd reach for. They're the things you arrive at after the first thing didn't work well enough. That's the signature of a codebase that's been through real production.