A Practical Beginner's Guide to Intel Processor Trace


Greetings! If you're reading this tutorial, chances are you have some interest in Intel Processor Trace (PT) and how it can improve your debugging or program analysis capabilities, but aren't sure how to get started. This is understandable: there aren't many practical tutorials on Intel PT, and the documentation that is publicly available is scattered across Intel specifications, Linux documentation, and nuggets of text in various smaller projects. The existing text targets a vast range of audiences, from low-level driver developers to experienced software engineers. Most, if not all, of it is not newcomer-friendly, and it's easy to spend hours sifting through it only to find your basic questions unanswered.

With this in mind, I've written this tutorial to fulfill two goals. The first is to get newcomers up and running with Intel PT in as few steps as possible. This is a hands-on tutorial for people who want to get hacking quickly. In the process, I also hope to show you what makes Intel PT so powerful and why it's my go-to technology for any project involving dynamic program analysis. My second goal is to point you to the various sources of more technical information, with a preface of what you'll find at each destination, so you can focus your exploration on your own needs and goals.

So with all that clarified, let's dive in.

Background: What is Intel PT?

If you stumbled upon this tutorial by chance and have no clue what Intel PT is, let's start with some background information and a bit of history. Readers who already know about Intel PT or don't care for a history lesson should skip to the next section.

Intel PT is one of a long line of hardware features developed and released by Intel to help developers gain better insights into how their software runs on Intel processors for debugging and performance profiling. If you're using an Intel processor built within the past decade, chances are it includes Intel PT.

Intel PT is a spiritual successor to technologies like hardware performance counters. Performance counters are an excellent solution if you want to know statistical information about the performance behaviors of a particular piece of code, like how often it encounters cache misses, but performance counters can't tell you much at a per-instruction granularity.

Intel PT aims to address this shortcoming by providing execution traces as opposed to just a set of counter registers. Each trace is a stream of packets representing different events, such as whether a branch was taken, how many processor cycles have elapsed, and so forth. These packets are flushed directly into physical memory, asynchronously, bypassing all caches, so that traces can be recorded with minimal impact on the target program.

The original intended purpose of Intel PT was to enable advanced debugging and profiling of performance-sensitive code. Intel PT yields precise information about executed instruction sequences, timing, and even energy consumption, to accurately record software behaviors without the overhead or artifacts that instrumentation code would introduce, such as cache interference. However, while Intel PT was originally intended for debugging and profiling, it is also an extremely powerful tool for guiding program analysis, which is why I use it extensively in my security research projects. Unlike any other technique currently available, Intel PT enables me to transparently observe feasible paths and timing information for real-world executions with only about a 2% performance impact on the target program. It can also successfully yield traces for many tricky programs that break instrumentation tools like Intel Pin or DynamoRIO. This makes it my go-to technology for collecting data to fuel my security solutions.

Getting Started: Linux Perf

The fastest way to start using Intel PT is to set up a Linux computer with an Intel CPU and install Perf. Note that at the time of writing, Intel PT does not play nicely with virtual machines or containers, so you'll need a computer that runs Linux as its native operating system. I'm going to walk through the steps using a Debian system, but the same steps should work for Ubuntu or (with a bit of tweaking) any other Linux system.

Step 1: Install Perf. Most Linux distributions include Perf as a package that can be installed via the default package manager. For Debian, you can install it by opening a terminal and running:

sudo apt install linux-perf

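Be aware that Perf relies on the perf_events support built into the kernel rather than on a separately installed driver, so a reboot is normally unnecessary. If Perf complains that it does not match your running kernel (for example, after a kernel upgrade), reboot into the current kernel before proceeding.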

Step 2: Verify Intel PT is available. Once Perf is installed, we should verify that it can access Intel PT. We can check this with the perf list command:

$ perf list | grep intel_pt
  intel_pt//                                         [Kernel PMU event]

If the above command prints no lines, then it means either your CPU doesn't have Intel PT or it provides a very old version that lacks all the features required to work with Perf.
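If you'd like a second opinion that doesn't depend on Perf, here is a minimal Python sketch that looks at what the kernel itself reports. It assumes two things that are not part of the steps above: that the CPU feature flag appears as intel_pt in /proc/cpuinfo, and that a registered Intel PT PMU shows up as the directory /sys/bus/event_source/devices/intel_pt. The function names are my own and purely illustrative.

```python
from pathlib import Path

# Assumed locations: CPU feature flags in /proc/cpuinfo, and one sysfs
# directory per PMU that the kernel has registered.
CPUINFO = Path("/proc/cpuinfo")
PMU_DIR = Path("/sys/bus/event_source/devices/intel_pt")


def cpu_has_intel_pt(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in the cpuinfo text lists intel_pt."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "intel_pt" in value.split():
            return True
    return False


def diagnose(cpuinfo_text: str, pmu_present: bool) -> str:
    """Summarize the two checks as a one-line hint."""
    if pmu_present:
        return "intel_pt PMU is registered; perf should be able to use it"
    if cpu_has_intel_pt(cpuinfo_text):
        return "CPU reports intel_pt but the kernel has not registered the PMU"
    return "CPU does not report intel_pt"


if __name__ == "__main__":
    text = CPUINFO.read_text() if CPUINFO.exists() else ""
    print(diagnose(text, PMU_DIR.is_dir()))
```

The middle case roughly corresponds to the "very old version" situation described above, although a kernel built without Intel PT support would look the same.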

Step 3: Adjust paranoid mode. By default, Perf only allows root to use Intel PT because it reveals a lot of information about traced programs. If you want to allow any user to access Intel PT, you can temporarily disable this restriction by running the command:

echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

If you want to permanently disable it, you can add the following line to /etc/sysctl.conf:

kernel.perf_event_paranoid=-1

If you decide to leave this setting at its default level, be aware that you'll need to run the subsequent commands in this tutorial as root (or use sudo).
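To see which situation you're in before recording, you can read the setting back. The following is a small sketch, assuming only the /proc path used in Step 3; the helper names are hypothetical, and it deliberately treats anything other than a negative value as "use root", which matches the advice in this tutorial rather than every nuance of the kernel's levels.

```python
from pathlib import Path

# Same file that Step 3 writes to.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def parse_paranoid(text: str) -> int:
    """Convert the file contents (for example '-1\\n') to an integer level."""
    return int(text.strip())


def unrestricted(level: int) -> bool:
    """True when the level is negative; Step 3 writes -1."""
    return level < 0


if __name__ == "__main__":
    if PARANOID_PATH.exists():
        level = parse_paranoid(PARANOID_PATH.read_text())
        if unrestricted(level):
            note = "restriction disabled for all users"
        else:
            note = "run the perf commands as root or with sudo"
        print(f"perf_event_paranoid={level}: {note}")
```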

Step 4: Recording traces. We can record traces using the perf record command. For this tutorial, I'll use /bin/ls as my target program. Let's start with the most basic tracing:

$ perf record -e intel_pt//u -- /bin/ls
perf.data
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data ]

In this example, perf record tells Perf that I want to record something, and -e intel_pt//u selects what to record. The intel_pt part requests an Intel PT trace, the empty space between the two slashes means I'm using the default Intel PT settings (more on this later), and the trailing u restricts recording to the user space portion of the program's execution (Intel PT can also record kernel execution). Everything after the -- is the command that will be executed and traced.

You'll notice that Perf prints some extra messages after the program finishes executing with useful information like how big the final trace is. Be aware that it is possible for a program to generate too much data for Intel PT and Perf to flush to storage in time, in which case the trace will contain holes. Perf's messages will warn you if this occurs.

If everything ran successfully, you should now have a perf.data file in your current working directory. This file contains the Intel PT trace along with some sideband data recorded by Perf's driver. This sideband records things like where objects were mapped into memory and when context switches between tasks occurred, which is needed in order to recover the executed instruction sequence and untangle threads if your target program runs on multiple CPU cores simultaneously.

Step 5: Decoding and recovery. In Intel PT terminology, we first have to decode the perf.data file to extract the trace of Intel PT events, and then combine that with the recorded sideband to recover the instruction execution sequence. Fortunately, Perf comes with scripts that can do this all for us in one step.

The simplest (and in my opinion, most useful) command is the following:

perf script --insn-trace

Note that by default, perf script tries to read ./perf.data. If your Perf output file is in another location or has a different name, you can specify its path using the -i flag.

This script recovers and prints the exact sequence of instructions our target program executed. Specifically, each line has the following format:

    ls         832828     [027]      8105202.171292271:     7fa2165d4c22    do_system+0x372 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)    insn: 31 c0
^Executable^ ^Task ID^ ^CPU Core #^ ^Estimated Timestamp^ ^Virtual Address^ ^Symbol Offset^              ^Object^                    ^Instruction Bytes^

With this information, we now know the exact order in which instructions were executed, where those instructions resided in memory, how to map those instructions back to the original program and library files, and more. We can then use this information to, for example, build control flow graphs and other representations for program analysis.
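As a rough illustration of that last point, here is a Python sketch that parses lines in the format shown above and extracts control flow edges. The regular expression is inferred from the single sample line in this tutorial, so treat the field layout as an assumption: other Perf versions or output options may print different columns. The edge rule is also my own simplification: within one task, if the next instruction does not start where the previous one ended, I count it as a control transfer.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional, Set, Tuple


class Insn(NamedTuple):
    comm: str      # executable name
    tid: int       # task ID
    cpu: int       # CPU core number
    time_ns: int   # estimated timestamp in nanoseconds
    addr: int      # virtual address of the instruction
    symbol: str    # symbol plus offset
    dso: str       # object the address maps to
    raw: bytes     # instruction bytes


# Layout assumed from the sample line in this tutorial.
LINE_RE = re.compile(
    r"^\s*(?P<comm>.+?)\s+(?P<tid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<sec>\d+)\.(?P<frac>\d+):\s+(?P<addr>[0-9a-f]+)\s+"
    r"(?P<symbol>.+?)\s+\((?P<dso>[^)]*)\)\s+"
    r"insn:\s*(?P<raw>[0-9a-f ]*?)\s*$"
)


def parse_line(line: str) -> Optional[Insn]:
    """Parse one line of instruction trace output, or return None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Keep the timestamp as an integer to avoid floating point rounding.
    time_ns = int(m["sec"]) * 1_000_000_000 + int(m["frac"].ljust(9, "0")[:9])
    return Insn(
        comm=m["comm"], tid=int(m["tid"]), cpu=int(m["cpu"]), time_ns=time_ns,
        addr=int(m["addr"], 16), symbol=m["symbol"], dso=m["dso"],
        raw=bytes.fromhex(m["raw"]),
    )


def control_flow_edges(insns: Iterable[Insn]) -> Set[Tuple[int, int]]:
    """Collect (source, destination) pairs where execution did not fall through."""
    last: Dict[int, Insn] = {}  # most recent instruction seen per task
    edges: Set[Tuple[int, int]] = set()
    for cur in insns:
        prev = last.get(cur.tid)
        if prev is not None and prev.raw and cur.addr != prev.addr + len(prev.raw):
            edges.add((prev.addr, cur.addr))
        last[cur.tid] = cur
    return edges


SAMPLE = (
    "    ls         832828     [027]      8105202.171292271:     7fa2165d4c22"
    "    do_system+0x372 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)    insn: 31 c0"
)

if __name__ == "__main__":
    rec = parse_line(SAMPLE)
    print(hex(rec.addr), rec.symbol, rec.raw.hex())
```

Tracking the previous instruction per task ID keeps threads from being stitched together, which matters once the target runs on several cores.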

If this is too granular for your desired use case, there are also other options that process the trace into coarser representations. For example, --call-trace will print only the function calls (along with their call depth), which you can use to extract a call graph for the execution. You can access the full manual for perf script by running man perf-script.

Additional Noteworthy Tricks

Disassembling instructions. You may have noticed that our instruction trace prints the raw bytes of the instructions instead of printing them in nice human-readable assembly. This is because in order to disassemble them, Perf needs access to a disassembler. Specifically, Perf integrates with Xed, which is Intel's official disassembler.

Unfortunately, the default APT repositories for Debian don't include a Xed package, so we'll need to compile and install Xed manually. Here's one way I've found to do this:

git clone https://github.com/intelxed/xed.git xed
git clone https://github.com/intelxed/mbuild.git mbuild
cd xed
./mfile.py examples
sudo cp ./obj/wkit/bin/xed /usr/local/bin/xed

Now that we've installed Xed, we can use the --xed flag with perf script:

perf script --insn-trace --xed

And now the instruction bytes are replaced with human-readable assembly.
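Instruction traces get large quickly, so if you want to post-process them in your own code it helps to read the output line by line instead of loading it all at once. Below is a minimal Python sketch of that idea. It only uses the flags already introduced in this tutorial (--insn-trace, --xed, and -i); the function names and the trace.data file name are hypothetical.

```python
import shlex
import subprocess
from typing import Iterator, List, Optional


def script_command(data_path: Optional[str] = None, xed: bool = False) -> List[str]:
    """Build the perf script command line used in this tutorial."""
    cmd = ["perf", "script", "--insn-trace"]
    if xed:
        cmd.append("--xed")
    if data_path is not None:
        cmd += ["-i", data_path]
    return cmd


def stream_lines(cmd: List[str]) -> Iterator[str]:
    """Run a command and yield its standard output one line at a time."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


if __name__ == "__main__":
    # Only prints the command; pass it to stream_lines() to actually run perf.
    print(shlex.join(script_command("trace.data", xed=True)))
```

On a machine with a recorded trace, you would iterate over stream_lines(script_command()) and handle each line as it arrives.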

Adjusting timestamp accuracy. As I pointed out in Step 5, part of the output from perf script --insn-trace includes an estimated timestamp of when each instruction executed. I emphasize the word estimated because the accuracy of this timestamp depends on how frequently we configure Intel PT to record timing packets.

To keep this tutorial concise, I won't go into the details of what the different timing packets are and their trade-offs, but I'll show you how to adjust them as an example of how to change Intel PT settings in perf record. We can specify any non-default Intel PT settings like so:

$ perf record -e intel_pt/cyc=1,cyc_thresh=0/u -- /bin/ls
perf.data
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.074 MB perf.data ]

In this case, I've added cyc=1, which enables the generation of CYC timing packets by Intel PT, and cyc_thresh=0, which tells Intel PT to output these packets as frequently as possible. If you look closely, you'll notice the trace has grown from 0.043 MB to 0.074 MB, roughly 1.7 times its original size, so be mindful of this when deciding how accurate you need the timestamps to be.
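If you end up experimenting with several settings, it can be convenient to build the event string programmatically. Here's a small Python sketch that assembles the intel_pt/.../u syntax from a dictionary of settings. The helper names are my own, and the code only formats strings: it does not check that a setting name is one Intel PT actually supports.

```python
from typing import Dict, List


def intel_pt_event(terms: Dict[str, int], user_only: bool = True) -> str:
    """Format settings as intel_pt/name=value,.../ plus the optional u modifier."""
    config = ",".join(f"{name}={value}" for name, value in terms.items())
    modifier = "u" if user_only else ""
    return f"intel_pt/{config}/{modifier}"


def record_command(target: List[str], terms: Dict[str, int]) -> List[str]:
    """Build the perf record command line for a target program."""
    return ["perf", "record", "-e", intel_pt_event(terms), "--", *target]


if __name__ == "__main__":
    print(" ".join(record_command(["/bin/ls"], {"cyc": 1, "cyc_thresh": 0})))
```

An empty dictionary produces intel_pt//u, the default configuration from Step 4.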

If we now rerun perf script --insn-trace, we'll see that the estimated timestamp updates more frequently. However, it still isn't updated after every instruction. This is due to a limitation of Intel PT, but the specifics are outside the scope of this tutorial.

One final useful bit of information regarding timing: if we add -F+ipc to our perf script command, it will periodically print fields like IPC: 4.14 (29/7). IPC stands for instructions per cycle, and in this example it says that the previous 29 instructions executed in 7 clock cycles (29 / 7 ≈ 4.14). This gives us a bit more insight into how the timestamps are changing.
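To make the arithmetic concrete, here is a short Python sketch that pulls an IPC field out of a line of text and recomputes the ratio. The pattern is based solely on the example field above, so treat the exact format as an assumption; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches fields shaped like "IPC: 4.14 (29/7)".
IPC_RE = re.compile(r"IPC:\s*(?P<ipc>\d+\.\d+)\s*\((?P<insns>\d+)/(?P<cycles>\d+)\)")


def parse_ipc(text: str) -> Optional[Tuple[float, int, int]]:
    """Return (reported IPC, instruction count, cycle count), or None."""
    m = IPC_RE.search(text)
    if m is None:
        return None
    return float(m["ipc"]), int(m["insns"]), int(m["cycles"])


if __name__ == "__main__":
    reported, insns, cycles = parse_ipc("IPC: 4.14 (29/7)")
    # 29 instructions over 7 cycles, rounded to two decimal places.
    print(f"reported={reported} recomputed={insns / cycles:.2f}")
```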

Further Reading

This concludes the basics of how to use Intel PT via Perf. From here, there are several documents you can read to gain more technical information:

If you want to know more about the Intel PT specification, including how to program Intel PT with a driver, what all the packets do, and how to perform instruction recovery, check out the Intel 64 and IA-32 Architectures Software Developer's Manual (SDM). The chapters get shifted around as the manual is updated, but at the time of writing the chapter on Intel PT is located in Volume 3, Chapter 33.

If for some reason you do not want to rely on Perf to decode traces or you want to integrate decoding and recovery directly into your own program, libipt is a good starting point that offers standalone tools and a library for processing Intel PT data.

If you would like to learn more about Perf's Intel PT features, including all the other available command line arguments and other use cases like tracing kernel executions or KVM, see Linux's documentation.