
Annotate Everything, Pay Nothing

A Deep Dive into Intel ITT

The first time I read the Intel ITT documentation, I assumed I was misunderstanding something. The claim was that you could compile instrumentation calls into production binaries — every hot path, every tight loop — and that when no profiling tool was attached, the overhead would be negligible. Not "low". Not "amortised over time". Negligible, by design, at every call site.

That's a bold claim. Most instrumentation stories go the other way: you add markers for debugging, they slow things down, so you ifdef them out for release and then wonder why your release builds behave differently from your profiling builds. ITT's answer to this is an architectural decision, not a performance trick, and it's worth understanding from the ground up.


The Two-Layer Design

ITT separates the instrumentation surface from the collection mechanism at the binary level.

Your application links against libittnotify — a static library that ships with the Intel oneAPI toolkits and is also available standalone via the ittapi repository. This library contains no actual collection logic. What it contains is a global function pointer table. Every ITT API call in your code — __itt_task_begin, __itt_domain_create, __itt_pause, all of them — expands to an indirect call through this table.

At process start, every entry in the table is null, and the calling macro checks the pointer before making the call. A call with no collector attached is therefore exactly what it sounds like: a pointer load, a test, and a skipped call — nothing else. That's the entirety of the overhead when no collector is present.

When a profiling tool like VTune attaches (or is already running when the process starts), it loads a collector dynamic library into the process. That collector calls __itt_api_init, which populates the function pointer table with real implementations. From that point forward, every ITT call routes to the collector. The application binary is identical in both cases.

This is the inversion I found surprising: the decision to pay instrumentation overhead belongs to the profiler operator, not the developer who wrote the code. You can ship an ITT-annotated binary to production and pay ~2–5 ns per call (one pointer dereference and an indirect branch on x86-64). When your on-call engineer wants to profile a live instance, they attach VTune without recompiling anything.

The implication for production deployability is real. No separate "profiling build". No ifdef forests. The annotation is always there; the collector is optional.


The API Surface

ITT has two distinct API families that serve different purposes and are often confused.

The ITT Notification API is for annotating known code. You own the source; you add markers. The JIT Profiling API is for dynamic code — methods compiled at runtime by a JIT compiler or interpreter. They share the zero-overhead-when-idle guarantee but have very different shapes.

Domains, Tasks, and String Handles

The core ITT model revolves around three types:

  • A Domain is a named namespace for your annotations. Think of it as a category that groups related tasks. VTune can filter by domain.
  • A String Handle is an interned string reference. Creating it is moderately expensive (heap allocation + internal registration); using it is free. You create handles once, store them globally, and pass them to task calls.
  • A Task marks a region of work on a single thread. You call begin before the work and end after.

In C:

#include <ittnotify.h>

// Initialise once — not in hot paths
static __itt_domain*        g_domain    = NULL;
static __itt_string_handle* g_draw_mesh = NULL;

void init_itt(void) {
    g_domain    = __itt_domain_create("com.myengine.render");
    g_draw_mesh = __itt_string_handle_create("RenderSystem::DrawMesh");
}

// In the hot path
void draw_mesh(Mesh* m) {
    __itt_task_begin(g_domain, __itt_null, __itt_null, g_draw_mesh);
    // ... actual work ...
    __itt_task_end(g_domain);
}

The __itt_null arguments are the task ID and parent task ID. Passing __itt_null for the parent tells the collector to infer parentage from the call stack. You can assign explicit IDs if you need to express logical parent-child relationships that don't follow the physical call stack — useful for work-stealing thread pools where tasks migrate between threads.

The Rust ittapi crate wraps this cleanly:

use ittapi::{Domain, StringHandle, Task};

// Once at startup (top-level `let` shown for brevity; see the
// OnceLock pattern later for a form that compiles as written)
let domain    = Domain::new("com.myengine.render");
let draw_mesh = StringHandle::new("RenderSystem::DrawMesh");

// In the hot path — RAII guard, ends task on drop
fn draw_mesh_fn(mesh: &Mesh) {
    let _task = Task::begin(&domain, &draw_mesh);
    // ... actual work ...
    // task ends here automatically
}

The RAII design is not cosmetic. The C API has no protection against a task_begin without a matching task_end — a panic or early return in the middle of a region will leave the task open, producing malformed timelines in the profiler. The Rust crate makes this impossible: the Task struct calls __itt_task_end in its Drop implementation. I've caught real bugs this way during development, where an error path bypassed cleanup code.

Frames, Counters, and Markers

Beyond tasks, ITT provides three other useful primitives:

Frames annotate application-level cycles — game loops, render frames, request/response cycles. Where tasks measure individual operations, frames measure the wall-clock budget of a full iteration. VTune has a dedicated frame duration view that plots frame time distributions:

__itt_frame_begin_v3(g_domain, NULL);
// ... process one frame ...
__itt_frame_end_v3(g_domain, NULL);

Counters track scalar metrics over time — queue depths, cache hit rates, object pool utilisation. They appear in VTune's custom metrics timeline:

static __itt_counter g_queue_depth = NULL;

// At init time:
g_queue_depth = __itt_counter_create("WorkQueue::depth", "com.myengine.threading");

// Later, wherever the metric changes:
uint64_t depth = /* current queue depth */ 0;
__itt_counter_set_value(g_queue_depth, &depth);  // read as unsigned 64-bit by default

Markers are instantaneous events — a one-time annotation with no duration:

__itt_marker(g_domain, __itt_null, my_handle, __itt_scope_global);

Useful for noting when a configuration reload fired, when a GC cycle started, or when a network partition was detected.

Pause and Resume

This one earns its place in every VTune workflow I use:

__itt_pause();   // Stop collecting
// ... warm-up code, initialisation, parts you don't care about ...
__itt_resume();  // Start collecting

__itt_pause and __itt_resume signal the collector to suppress sample collection. The CPU keeps running; nothing about execution changes. What changes is whether VTune records the samples. When you're profiling a server that takes 10 seconds to initialise before entering steady-state operation, this eliminates 10 seconds of initialisation noise from your profile. The resulting report is clean: every sample is from the code path you actually care about.


The JIT Profiling API

If you're writing a JIT compiler, an interpreter, or anything that generates machine code at runtime, the JIT Profiling API is the mechanism by which profilers can attach symbol names to your dynamically generated code.

Without it, profiler samples that land in JIT-compiled regions appear as unlabelled addresses. With it, they appear as named methods with source-level line attribution.

The core call:

#include <jitprofiling.h>

// Before registering a JIT-compiled method, check if anyone is listening
if (iJIT_IsProfilingActive() != iJIT_NOTHING_RUNNING) {
    iJIT_Method_Load method = {0};
    method.method_id          = assign_unique_id();   // caller owns this
    method.method_name        = "MyJIT::fibonacci";
    method.method_load_address = code_ptr;
    method.method_size         = code_size;
    method.source_file_name    = "fibonacci.myl";
    method.line_number_size    = line_table_count;
    method.line_number_table   = line_table_ptr;

    iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, &method);
}

The iJIT_IsProfilingActive() gate is important. Building an iJIT_Method_Load struct requires allocating and filling name strings, constructing a line number table, and calling into the notification infrastructure. When nothing is profiling, that's wasted work — so when iJIT_NOTHING_RUNNING is returned, the fast path is to skip the entire block.

The method_id uniqueness requirement is the most common source of bugs in JIT integrations. The API provides no ID allocator — you own it. In a compiler with concurrent compilation threads, ID assignment requires synchronisation. An atomic counter works fine in practice:

#include <stdatomic.h>

static _Atomic unsigned int g_next_method_id = 1;

unsigned int assign_unique_id(void) {
    return atomic_fetch_add_explicit(
        &g_next_method_id, 1, memory_order_relaxed
    );
}

When a tiered JIT re-compiles a method (e.g., after collecting type feedback), you must send iJVM_EVENT_TYPE_METHOD_UNLOAD_START with the old ID before registering the new version. Re-using an ID without an unload event produces undefined collector behaviour — in practice, VTune will merge samples from different compilations of the same method, making it impossible to compare them.

Implementing the line number table is non-trivial but worth it. Without it, profilers can attribute samples to a method name but not to a source line. For any compiler targeting a non-trivial language, line-level attribution is the difference between "the JIT'd code is hot" and "line 47 in fibonacci.myl accounts for 83% of samples".


What "Zero Overhead" Actually Means

The documentation says negligible overhead when no collector is attached. Let me be more precise about what that means at the instruction level.

Each ITT call compiles to something like:

mov  rax, QWORD PTR [rip + __itt_task_begin_ptr]
test rax, rax
je   skip
call rax
skip:

That's a load from the function pointer table, a null check, and a conditional indirect call. On a modern out-of-order CPU with a warm L1 cache, this is roughly 3–5 ns — a few cycles for the load and test, plus branch prediction overhead if the branch is poorly predicted (it won't be: with no collector attached the pointer is null on every call, so the skip branch is always taken and predicts perfectly).

For most code — functions that do microseconds or more of work — this is genuinely negligible. For code that's called in loops running at sub-100 ns per iteration, 5 ns per ITT call starts to add up. I've seen ITT overhead become measurable in loops running at 20 MHz+ call rates. The right response is not to abandon ITT but to hoist the task boundary to a coarser granularity: annotate the batch, not each element.

When a collector is attached, overhead depends entirely on what the collector does. VTune's production collector is reasonably optimised, but it does take a lock, record a timestamp, and write to a ring buffer per task event. For a well-instrumented application with moderate task granularity (functions taking microseconds to milliseconds), this is invisible. For extremely fine-grained annotation of tight loops, you'll see collector overhead — which is a signal to coarsen your instrumentation, not to remove it.


Comparison with Alternatives

No instrumentation API exists in isolation. Here's where ITT sits relative to the other options:

Linux perf software events (`PERF_COUNT_SW_DUMMY`)

perf_event_open with PERF_COUNT_SW_DUMMY gives you a placeholder software event: it counts nothing on its own, but its file descriptor delimits a collection region that perf can enable and disable. The interface is lower-level than ITT: you call ioctl(fd, PERF_EVENT_IOC_ENABLE) and ioctl(fd, PERF_EVENT_IOC_DISABLE) to bracket the regions you care about. It integrates with the kernel's perf subsystem, meaning you get hardware PMU correlation for free.

The limitation: no semantic naming. You annotate regions by file descriptor, not by human-readable domain/task names. Tooling support outside of perf record/perf report is thin. For understanding what a region does, you're on your own.

ITT wins on semantics and ecosystem integration. Perf wins on portability (any Linux kernel ≥ 3.x, no Intel-specific headers required) and kernel-level correlation.

LTTng Userspace Tracing (UST)

LTTng UST is a mature, high-throughput tracing framework for Linux. It uses shared memory ring buffers between the traced application and the collection daemon, which gives it lower per-event overhead than an attached ITT collector at high event rates — on the order of a hundred nanoseconds or less per event in typical configurations.

LTTng supports CTF (Common Trace Format), meaning traces can be consumed by tools like Babeltrace, TraceCompass, and Perfetto. The ecosystem is broader than VTune.

The cost: LTTng requires a running daemon (lttng-sessiond), session configuration before tracing starts, and a non-trivial setup for per-probe type declarations. ITT's "link the library and call the API" story is much simpler to adopt incrementally. For production server tracing at high event rates, I'd consider LTTng. For developer-workflow profiling where VTune is already in the picture, ITT is less friction.

ETW (Event Tracing for Windows)

ETW is the Windows analogue and is, in many ways, more capable than ITT for system-level tracing — it integrates with kernel events, supports hardware PMU data, and WPA (Windows Performance Analyzer) is an excellent analysis tool. If your application is Windows-first, ETW is the native choice.

ITT is cross-platform (Linux, Windows, macOS); ETW is Windows-only. On Linux, ETW is not an option. On Windows, VTune can consume both ETW and ITT events simultaneously, which is genuinely useful.

Tracy and Superluminal

Tracy is a sampling + instrumentation profiler with its own marker API. You instrument with ZoneScoped macros (RAII, like the Rust ittapi crate's Task), and Tracy records high-resolution timelines with memory profiling, mutex contention tracking, and GPU zone support.

The important distinction: Tracy is a complete profiling system, not just an instrumentation API. It ships its own collection server, its own GUI, and its own wire protocol. This makes it more self-contained and more portable than ITT. It does not require VTune.

The trade-off is that Tracy's collector is always-on in profiling builds — it continuously records to a memory-mapped ring buffer and streams to the capture server. This is great for realtime visualisation but means you can't ship Tracy instrumentation in production binaries with zero overhead. ITT's null-stub design is fundamentally better for production code that you want instrumented at all times.

Superluminal has a similar story: excellent Windows profiler, its own marker API, but tied to its own tool ecosystem.

My current practice: ITT annotations for long-running production services where the zero-overhead property matters. Tracy for game-engine and GPU workloads during development, where the richer visualisation and realtime streaming are more useful than production deployability.


Practical Patterns

The handle pre-creation rule

This is the single most important operational detail and the most commonly violated. __itt_domain_create and __itt_string_handle_create are not free. They allocate memory, register the handle internally (with a lock), and may trigger lazy DLL loading on first call. Calling them in a hot path is a performance bug.

The right pattern:

// Module-level or thread-local statics, initialised at startup
static __itt_domain*        s_domain   = NULL;
static __itt_string_handle* s_compress = NULL;

// Call this exactly once, during application initialisation
void audio_system_init(void) {
    s_domain   = __itt_domain_create("com.myengine.audio");
    s_compress = __itt_string_handle_create("AudioSystem::CompressBlock");
}

// In the hot path — handles already exist, just use them
void compress_audio_block(const float* samples, size_t count) {
    __itt_task_begin(s_domain, __itt_null, __itt_null, s_compress);
    // ...
    __itt_task_end(s_domain);
}

In Rust, the once_cell or std::sync::OnceLock pattern works well:

use std::sync::OnceLock;
use ittapi::{Domain, StringHandle};

static DOMAIN: OnceLock<Domain>       = OnceLock::new();
static COMPRESS: OnceLock<StringHandle> = OnceLock::new();

fn init() {
    DOMAIN.get_or_init(|| Domain::new("com.myengine.audio"));
    COMPRESS.get_or_init(|| StringHandle::new("AudioSystem::CompressBlock"));
}

Naming tasks with enough context

__itt_string_handle_create("loop") is useless. VTune groups timeline entries by string handle. If you have a single "loop" handle used everywhere, your timeline collapses all loops into one indistinguishable mass.

Good names encode the subsystem and the specific operation:

"RenderSystem::DrawMesh"
"PhysicsEngine::BroadPhase::SAP"
"NetworkLayer::DeserialisePacket[TCP]"
"AudioMixer::Resample[48kHz→44.1kHz]"

When I'm annotating a codebase for the first time, I use the format SubSystem::Method[distinguishing_parameter]. The square-bracket qualifier is informal — ITT treats it as an opaque string — but it makes profiles much easier to read when the same method is called with meaningfully different arguments.

Using pause/resume to focus on steady-state

A server application might spend 15 seconds on startup before entering its event loop. A game engine spends significant time on asset loading. You almost never care about these phases when profiling steady-state performance.

int main(int argc, char** argv) {
    __itt_pause();             // Suppress collection during init

    load_config(argv[1]);
    init_thread_pool(8);
    warmup_jit_cache();

    __itt_resume();            // Now start collecting

    run_event_loop();          // This is what we care about
    return 0;
}

The resulting VTune report starts clean, with no samples from the init path inflating hotspot counts.

The Reference Collector for headless environments

The ITT reference collector — a sample collector library that ships with the ittapi sources, loaded by pointing the INTEL_LIBITTNOTIFY64 environment variable at it — records ITT events to a trace file without requiring VTune. This makes it useful for CI performance regression testing: instrument your application, run it under the reference collector, capture the trace, parse task durations, and assert that no critical path has regressed beyond a threshold.

This is not a complete analysis workflow — you won't get flame graphs or source attribution from it. But for automated regression detection on a build server where installing VTune is impractical, it fills a real gap.

Annotating JIT compilers — the complete pattern

For a JIT compiler or interpreter, the full integration pattern is:

  1. On process start, check iJIT_IsProfilingActive(). If nothing is profiling, skip all notification infrastructure.
  2. After each successful compilation, call iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, ...) with a fully populated iJIT_Method_Load including the line number table.
  3. On re-compilation (tiered JIT, deoptimisation + recompile), send iJVM_EVENT_TYPE_METHOD_UNLOAD_START with the old ID, then send a new METHOD_LOAD_FINISHED event with a new ID.
  4. On process shutdown or module unload, send iJVM_EVENT_TYPE_METHOD_UNLOAD_START for all registered methods if cleanup is required.

Step 3 is where most JIT integrations have bugs. Tiered compilation is common, and the unload/reload dance is easy to forget until you see VTune reporting impossible hotspot distributions in re-JIT'd methods.


When Not to Use ITT

ITT is an emission-only instrumentation API. It has no in-process query capability — you cannot ask "how long did the last DrawMesh task take?" from within your application. All analysis lives in the collector. If you need in-process metrics (for adaptive algorithms, circuit breakers, or self-monitoring logic), ITT is the wrong tool. Reach for std::chrono, RDTSC wrappers, or a metrics library.

ITT requires you to identify the code you want to annotate. When you don't know where the hotspot is, sampling profilers — perf record, VTune's sampling mode, async-profiler — are more appropriate. Annotate after you've identified the interesting regions, not before.

For kernel-level tracing — understanding syscall latency, scheduler interference, interrupt storms, block I/O patterns — ITT doesn't help. You're in userspace; the kernel is opaque to ITT. eBPF, perf events, and ftrace are the right tools for kernel-side analysis. I frequently use ITT and perf sched together: ITT to understand what the application thinks it's doing, perf sched to understand what the kernel did to the threads during that time.

The static library adds approximately 100–200 KB to your binary. For most server applications this is irrelevant. For embedded targets with tight flash budgets, it may not be.

Finally: ITT is Intel tooling. The JIT Profiling API has broader cross-vendor adoption (perf and async-profiler both support it), but the domain/task/frame/counter surface is primarily consumed by VTune, Advisor, and Inspector. If your team uses none of those tools, the semantic annotations won't produce output anywhere. On such stacks, Tracy, LTTng, or ETW (Windows) are more appropriate choices.


The architectural insight that makes ITT worth understanding is the function pointer indirection: defer the cost decision to the operator, not the developer. Every other instrumentation design I've encountered either strips annotations for release or pays overhead always. ITT's answer — link it everywhere, pay only when something is listening — is the right trade-off for production code where you want profiling to be an operational decision, not a build-time one.

The Rust ittapi crate makes this accessible without the rough edges of the C API. The RAII task guard is not just ergonomics; it's correctness. In a codebase where panics, early returns, and error propagation can interrupt any region, a C-style begin/end pair is a reliability hazard. The Drop implementation removes that category of bug entirely.

Annotate early, name precisely, pre-create handles, and gate JIT notifications behind iJIT_IsProfilingActive. The profile you can take in production is worth more than the perfect profile you can only take in a lab.


Adrian is a performance engineer writing at overtone.dev.
