Intel PT Data at Rest: A Compression Experiment


Full Disclosure: I am a researcher in Georgia Tech's ISTC-ARSA, which is funded by Intel. Although I reference two publications that share Xinyang Ge and Weidong Cui as authors, I am neither associated with them nor Microsoft Research at the time of writing.

Intel Processor Trace (PT) is a powerful hardware feature for recording the behavior of CPUs. With it, developers and researchers can monitor the control-flow path taken by threads, hardware interrupts, and more, all with cycle-accurate timing. However, this rich stream of data comes at the cost of size. Depending on what PT is configured to trace, it can output hundreds of megabytes of data per second per core. PT does take steps to save bandwidth by only recording changes in control-flow, excluding redundant high-order bits in target addresses, and compressing returns leading to predictable locations. However, despite this compression, the volume of data is still massive.

As a consequence, much of the work published so far handles tracing in one of two ways. One option is to consume the trace as it is generated. This works as long as the consumer can keep up with the producer, which is the case in the control-flow integrity (CFI) system Griffin. The other common approach is to configure PT to write in a circular buffer. This option is suitable for crash dump analysis systems like Snorlax, which only need a fixed size window into a thread's past.

However, while some applications are feasible using the two previous methods, there are still situations were it is desirable to store the entire trace for postmortem analysis. If nothing else, it is useful for repeatable experiments. With this in mind, I performed a naive experiment last night to explore if more can be done to compress PT traces when the data is at rest. Based on the observations that the compression PT applies is highly localized (i.e. a target address verses the previously recorded target address and a return verses the previously recorded call) and that programs often execute repetitive loops, I hypothesized that even a general purpose compression algorithm should be able to compress traces with a good ratio.

Procedure

The overall idea for the experiment is very simple: gather some PT traces, compress them with a commonly used algorithm, and compare the sizes. For a subject I used the simple HTTP server that comes with Python 2.7 to host a copy of this blog. For each trial I had a crawler request pages from the server for a set duration. Once the time expired, I terminated the server and crawler and stopped the tracing. I then compressed the trace using the GNU/Linux utility gzip, which uses Lempel-Ziv coding. I also fed it through a disassembler that matches the PT packets to the binary's static code to produce a linear sequence of instructions. From this I counted the number of unique basic blocks executed during the trace to serve as a rough proxy for code coverage. To summarize the procedure:

  1. Configure and enable PT tracing.
  2. Start the Python HTTP server.
  3. Start the crawler.
  4. Wait for a specified duration.
  5. Terminate the crawler and server.
  6. Stop PT tracing.
  7. Compress the resulting trace and count the number of unique basic blocks executed.

Results

Figure 1

Comparing the original size of the PT trace to the size after compression produces the above graph. Both plots best match linear regressions and are increasing over time. However, the size of the compressed traces increases at a slower rate than the uncompressed traces, meaning these two plots are diverging as time increases.

Another observation to note is the large volume of trace data produced during the server's startup. This explains why even the shortest trial produced a 1GB trace. For the same reason, counting the number of unique basic blocks turned out to not be useful. The number of new basic blocks executed while serving requests was small.

Figure 2

The next graph shows the relationship between the compressed and uncompressed sizes as a space savings percentage. The plot best fits a linear regression and shows the savings decreasing over time. This is likely due to the design of the underlying compression algorithm, which is intended for general use and does not take into consideration the unique characteristics of PT traces.

To summarize, this experiment shows that more can be done to compress PT traces for storage at rest.

Discussion

It is understandable that the compression used by PT would produce small space savings compared to general compression algorithms given the limitations of hardware memory and Intel's very strict performance overhead requirements. In practice, PT produces an overhead of less than 4% in the worst case, and less than 2% on average. These numbers are based on my own observations and the results published by other researchers. In short, PT has very few clock cycles and very little space available for performing compression.

Another factor that deserves consideration is compression's impact on processing time. For systems that consume PT traces on the fly, the largest source of performance overhead is not PT tracing itself but rather the time spent buffering and consuming it. In CFI, for example, the PT trace has to be matched with the executed code in order to reconstruct control-flow. This is why the authors of Griffin report a 11.9% overhead on the SPECint benchmark despite the 4% overhead of PT itself. Adding better space saving compression could increase this overhead further.

That said, for storing PT traces at rest, more can be done to better conserve space.