Intel Processor Trace, execvp, and ptrace


Lately, I've been playing around with Intel Processor Trace (PT), an x86 hardware feature that records a process's complete control flow. As part of my research, I've been developing my own Linux driver and user program to control PT.

Tracing is configured through a handful of model-specific registers (MSRs) in the Intel CPU. One useful configuration supported by PT is CR3 filtering. For readers less familiar with the x86 architecture: while a user process is executing, the CPU's CR3 register holds the physical address of that process's top-level page table. Since every process has its own page table, each process also has a CR3 value distinct from that of every other live process. By configuring PT to use a CR3 filter, tracing can be limited to a single process.
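To make the CR3 filter concrete, here is a minimal sketch of the two register values involved. The MSR addresses and bit positions are what I understand the Intel SDM to specify and should be checked against the manual. The helper names pt_cr3_match_value and pt_ctl_value_user_cr3 are made up for illustration, and a real driver would also have to program an output buffer before setting TraceEn.

```c
#include <stdint.h>

/* Assumed from the Intel SDM; verify before use. */
#define MSR_IA32_RTIT_CTL        0x570u
#define MSR_IA32_RTIT_CR3_MATCH  0x572u

#define RTIT_CTL_TRACEEN    (1ull << 0)   /* enable tracing */
#define RTIT_CTL_USER       (1ull << 3)   /* trace user mode (CPL > 0) */
#define RTIT_CTL_CR3FILTER  (1ull << 7)   /* only trace when CR3 matches */
#define RTIT_CTL_TOPA       (1ull << 8)   /* use table-of-physical-addresses output */
#define RTIT_CTL_BRANCHEN   (1ull << 13)  /* emit control flow packets */

/* Hypothetical helper: value to write into IA32_RTIT_CR3_MATCH.
 * The low 5 bits of that MSR are reserved, so they are cleared here. */
uint64_t pt_cr3_match_value(uint64_t cr3_phys)
{
    return cr3_phys & ~0x1full;
}

/* Hypothetical helper: an IA32_RTIT_CTL value that traces user-mode
 * control flow of the single address space selected by the CR3 filter. */
uint64_t pt_ctl_value_user_cr3(void)
{
    return RTIT_CTL_TRACEEN | RTIT_CTL_USER | RTIT_CTL_CR3FILTER |
           RTIT_CTL_TOPA | RTIT_CTL_BRANCHEN;
}
```

A driver would write the first value to MSR_IA32_RTIT_CR3_MATCH and the second to MSR_IA32_RTIT_CTL using the kernel's MSR write helpers, with the control register written last because it is the one that turns tracing on.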

Early versions of my program could only trace processes that were already running. I would use the GNU debugger to start the target process and trap it at its first instruction, and then manually feed its PID to my program as an argument. The Linux driver would then convert the PID into a CR3 value by walking the process's task structure (virt_to_phys(task->mm->pgd)) and write that address into the IA32_RTIT_CR3_MATCH MSR. Needless to say, manually starting and trapping the target process got very tiring after repeated traces.

To simplify tracing, I wanted my program to take the file path of an executable and its arguments as parameters, then start and trace the process automatically. My first attempt roughly followed this pseudocode:

pid = fork();

if (pid == 0) { // Child process

    // Wait for parent to signal that PT is ready
    execvp(target_program, args);

} else {        // Parent process

    enable_cr3_filter(pid);
    enable_pt();
    // Signal child that PT is ready
}
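One way to fill in the two "signal" comments is a pipe: the child blocks on a read until the parent writes a byte. This is only a sketch under that assumption. The names run_first_attempt and pt_setup_fn are hypothetical, and the setup callback stands in for enable_cr3_filter(pid) followed by enable_pt().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hook standing in for enable_cr3_filter(pid) + enable_pt(). */
typedef void (*pt_setup_fn)(pid_t pid);

/* Forks, configures tracing for the child's PID, then lets the child exec.
 * Returns the target's exit status, or -1 on error. */
int run_first_attempt(char *const args[], pt_setup_fn setup)
{
    int ready[2];
    if (pipe(ready) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }

    if (pid == 0) {                       /* Child process */
        char token;
        close(ready[1]);
        /* Wait for parent to signal that PT is ready */
        if (read(ready[0], &token, 1) != 1)
            _exit(126);
        close(ready[0]);
        execvp(args[0], args);
        _exit(127);                       /* Only reached if execvp fails */
    }

    /* Parent process */
    close(ready[0]);
    if (setup)
        setup(pid);
    /* Signal child that PT is ready; if this fails the child sees EOF and exits */
    (void) !write(ready[1], "x", 1);
    close(ready[1]);

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with an argument vector such as {"ls", "-l", NULL}, the function returns the exit status of ls once it finishes.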

Easy enough, right? I compiled the program, ran my first trace and got... nothing.

execvp and CR3

So what went wrong? It turns out we can demonstrate the problem with a small test. Consider this C program:

// test_1.c
#include <stdio.h>
#include <unistd.h>

void pid_to_cr3() {

    int m_pid = getpid();
    char pid_str[20];
    snprintf(pid_str, 20, "%d", m_pid);

    FILE * chardev = fopen("/dev/pid_to_cr3", "w");
    if (chardev == NULL)
        return;              // Driver not loaded or no permission
    fputs(pid_str, chardev);
    fclose(chardev);
}

int main(void) {

    char *argv[] = {"./test_2", NULL};

    pid_t pid = fork();

    if (pid == 0) {      // Child process

        pid_to_cr3();    // CR3 before exec, printed to dmesg
        execvp(argv[0], argv);
        perror("execvp"); // Only reached if execvp fails
        _exit(127);
    }

    return 0;
}

In this example, /dev/pid_to_cr3 is a simple Linux character device: a process writes a PID into it, and the driver prints the corresponding CR3 value to the kernel log:

// pid_to_cr3.c
static unsigned long pid_to_cr3(int pid) {

    struct task_struct *task;
    struct mm_struct *mm;
    void *cr3_virt;

    // pid_task() returns NULL when no task with this PID exists
    task = pid_task(find_vpid(pid), PIDTYPE_PID);

    if (task == NULL)
        return 0;

    mm = task->mm;

    // mm can be NULL in cases such as kthreads, in which case we want the active_mm
    if (mm == NULL)
        mm = task->active_mm;

    if (mm == NULL)
        return 0;

    cr3_virt = (void *) mm->pgd;
    return virt_to_phys(cr3_virt);
}

After test_1 passes its PID to /dev/pid_to_cr3, it uses execvp to replace its memory with a new program, test_2. This program simply passes its PID to /dev/pid_to_cr3 as well:

// test_2.c
#include <stdio.h>
#include <unistd.h>

void pid_to_cr3() {

    int m_pid = getpid();
    char pid_str[20];
    snprintf(pid_str, 20, "%d", m_pid);

    FILE * chardev = fopen("/dev/pid_to_cr3", "w");
    if (chardev == NULL)
        return;              // Driver not loaded or no permission
    fputs(pid_str, chardev);
    fclose(chardev);
}

int main(void) {

    pid_to_cr3();    // CR3 after exec, printed to dmesg
    return 0;
}

If we compile these source files and execute test_1, we expect the PID to be the same before and after the call to execvp, because execvp replaces the calling process's program image rather than creating a new process. But what happens to the CR3 value? As it turns out:

[ 1757.437572] PID 17319 = CR3 18759503872
[ 1757.438414] PID 17319 = CR3 18826612736

Rather than rewriting the caller's existing page table when execvp is called, the Linux kernel allocates and populates an entirely new one! Since our original PT program read the CR3 before the execvp, the filter matched the old page table and the trace did not include any of the target program's execution.

ptrace

So how do we get the CR3 value after the child calls execvp? We can't simply have the parent signal the child, as in the first attempt, because any code we put in the child is replaced when execvp is called. The solution lies in a Linux system call named ptrace. By calling ptrace(PTRACE_TRACEME, ...) before execvp, the child asks to be traced by its parent. When the execvp succeeds, the kernel stops the child with a SIGTRAP before the new program begins executing and notifies the parent. The parent can pick up this stop with waitpid(), do whatever it needs to do, and then resume the child. The code looks something like this:

pid = fork();

if (pid == 0) { // Child process

    ptrace(PTRACE_TRACEME, 0, NULL, NULL);
    execvp(args[0], args);

} else {        // Parent process

    waitpid(pid, NULL, 0); // Wait for child to complete execvp()

    enable_cr3_filter(pid);
    enable_pt();

    ptrace(PTRACE_DETACH, pid, 0, 0); // Resume child

}
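A slightly fuller sketch of the same idea, with the error handling the pseudocode leaves out, might look like the following. The names launch_with_setup and pt_setup_fn are hypothetical, and the setup callback stands in for enable_cr3_filter(pid) followed by enable_pt().

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hook standing in for enable_cr3_filter(pid) + enable_pt(). */
typedef void (*pt_setup_fn)(pid_t pid);

/* Starts args[0], stops it right after exec, runs setup, then detaches.
 * Returns the target's exit status, or -1 on error. */
int launch_with_setup(char *const args[], pt_setup_fn setup)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                       /* Child process */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        execvp(args[0], args);
        _exit(127);                       /* Only reached if execvp fails */
    }

    /* Parent process: wait for the child to stop or die */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    /* If exec failed, the child exited instead of stopping */
    if (!WIFSTOPPED(status))
        return -1;

    /* A successful exec under PTRACE_TRACEME stops the child with SIGTRAP */
    if (WSTOPSIG(status) != SIGTRAP) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    /* The child now has its post-exec page table, so its CR3 is final */
    if (setup)
        setup(pid);

    if (ptrace(PTRACE_DETACH, pid, NULL, NULL) == -1)   /* Resume child */
        return -1;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Checking for WIFSTOPPED before running the setup hook matters here: if execvp fails, the child exits without ever stopping, and there is nothing left to trace.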

Note that the parent detaches from the child as soon as PT is configured. Detaching resumes the child, which then runs normally. If we wanted to keep monitoring the child (for example, to detect fork() or clone()), we could instead stay attached, request those events with PTRACE_SETOPTIONS, and resume the child with PTRACE_CONT.
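For the "keep monitoring" variant, a rough sketch could stay attached and count the fork, vfork, and clone events that ptrace reports. The function name count_new_tasks is hypothetical. The sketch assumes the caller has no other children (it waits on any child), and it swallows every SIGSTOP and SIGTRAP, which is a simplification a real tracer would need to refine.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs args[0] under ptrace and keeps tracing it and its descendants.
 * Returns the number of fork/vfork/clone events seen, or -1 on error. */
int count_new_tasks(char *const args[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                       /* Child process */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        execvp(args[0], args);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;

    /* Request a stop on fork/vfork/clone; new tasks are traced automatically.
     * This is the point where PT would be configured for the first process. */
    long opts = PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK | PTRACE_O_TRACECLONE;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *) opts) == -1 ||
        ptrace(PTRACE_CONT, pid, NULL, NULL) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    int events = 0;
    for (;;) {
        pid_t who = waitpid(-1, &status, __WALL);
        if (who == -1) {
            if (errno == EINTR)
                continue;
            break;                        /* ECHILD: every tracee has exited */
        }
        if (!WIFSTOPPED(status))
            continue;                     /* A tracee exited or was killed */

        int sig = WSTOPSIG(status);
        int event = status >> 16;         /* Non-zero for PTRACE_EVENT_* stops */

        if (sig == SIGTRAP && event != 0) {
            if (event == PTRACE_EVENT_FORK || event == PTRACE_EVENT_VFORK ||
                event == PTRACE_EVENT_CLONE)
                events++;                 /* PTRACE_GETEVENTMSG gives the new ID */
            sig = 0;
        } else if (sig == SIGTRAP || sig == SIGSTOP) {
            sig = 0;                      /* Exec trap or initial stop of a new task */
        }

        /* Resume the tracee, forwarding any signal that was really meant for it */
        ptrace(PTRACE_CONT, who, NULL, (void *) (long) sig);
    }
    return events;
}
```

Since each new process gets its own page table, a tracer built this way would have a natural point, at each reported event, to look up the new task's CR3.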

Making the above modification allows the parent to capture the correct CR3 value and get a complete PT trace.