Carter Yagemann

I'm a computer scientist and cybersecurity researcher. My interests include hacking, system design, and software engineering.

Barnum


Abstract

Barnum is an offline control flow attack detection system that applies deep learning to hardware execution traces to model a program's behavior and detect control flow anomalies. Our implementation analyzes document readers to detect exploits and ABI abuse. Recent work has proposed using deep-learning-based control flow classification to build more robust and scalable detection systems. These proposals, however, were not evaluated against different kinds of control flow attacks, programs, and adversarial perturbations.

We investigate anomaly detection approaches to improve the security coverage and scalability of control flow attack detection. Barnum is an end-to-end system consisting of three major components: 1) trace collection, 2) behavior modeling, and 3) anomaly detection via binary classification. It uses Intel® Processor Trace for low-overhead execution tracing and applies deep learning to the basic block sequences reconstructed from the trace to train a model of normal program behavior. Based on the model's path prediction accuracy, Barnum then determines a decision boundary to classify benign vs. malicious executions.
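The last step, classifying a trace by how well the model predicts its path, can be illustrated with a minimal Python sketch. This is not Barnum's actual code: the function names, the use of integer basic block IDs, and the rule of placing the boundary just below the lowest benign accuracy are all assumptions made for illustration. The procedure Barnum uses to pick its boundary is described in the paper and may differ.

```python
from typing import Sequence


def path_prediction_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of basic blocks in one trace that the model predicted correctly.

    `predicted` and `actual` are hypothetical sequences of basic block IDs.
    """
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def choose_boundary(benign_accuracies: Sequence[float], margin: float = 0.0) -> float:
    """Assumed rule: put the boundary `margin` below the lowest benign accuracy."""
    if len(benign_accuracies) == 0:
        raise ValueError("need at least one benign accuracy")
    return min(benign_accuracies) - margin


def is_anomalous(accuracy: float, boundary: float) -> bool:
    """A trace the model predicts poorly (below the boundary) is flagged."""
    return accuracy < boundary
```

With this rule, every benign trace used to choose the boundary sits at or above it, so those traces are never flagged. Whether malicious traces fall below it depends on how far an attack pushes execution away from the paths the model has learned.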

We evaluate Barnum against 8 families of attacks on Adobe Acrobat Reader and 9 on Microsoft Word, both running on Windows 7. Both readers are complex programs with over 50 dynamically linked libraries, just-in-time compiled code, and frequent network I/O. Barnum achieves a 0% false positive rate and a 2.4% false negative rate on a dataset of 1,250 benign and 1,639 malicious PDFs. Barnum is also robust against evasion techniques: it successfully detects 500 adversarially perturbed PDFs.

Source Code & Documentation

Barnum Tracer - Collects PT traces from a KVM hypervisor.

Barnum Learner - Processes the traces and builds models for classification.

Data & Models

Barnum can analyze the control flow of several kinds of programs for anomalies. The following links are to data and models for classifying PDF and Microsoft Word documents. These samples were traced on a Windows 7 virtual machine. The traced applications were Adobe Acrobat Reader 9.3 and Microsoft Word 2010.

PDF

Traces - Everything needed to reproduce the PDF malware evaluation from the paper. (Size: 76 GB)

Pre-trained Model - This model is smaller than the one used in the paper, so its accuracy will be lower. (Size: 2.4 MB)

DOCX

Traces - Everything needed to reproduce the DOCX malware evaluation from the paper. (Size: 49 GB)

Pre-trained Model - This model is smaller than the one used in the paper, so its accuracy will be lower. (Size: 2.4 MB)

Malware Hashes

Unfortunately, we cannot release the original document malware used in our evaluation. Below are links to lists of hashes to help other researchers construct similar datasets. The paper also includes a table showing the families these samples represent.

PDF Malware Hashes

DOCX Malware Hashes
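
For researchers assembling a similar dataset from their own corpus, the following Python sketch shows one way to select the local files whose digests appear in a hash list. It assumes the list is plain text with one hexadecimal digest per line and that the digests are SHA-256. Neither is confirmed here, so check the actual files and pass a different `algorithm` if needed. All function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Set


def load_hashes(lines: Iterable[str]) -> Set[str]:
    """Normalize a hash list: strip whitespace, lowercase, skip blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Hex digest of a file, read in chunks so large samples fit in memory.

    The default algorithm is an assumption; use whatever the lists contain.
    """
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matching_files(directory: Path, wanted: Set[str], algorithm: str = "sha256") -> List[Path]:
    """Return files under `directory` whose digest appears in `wanted`."""
    return sorted(
        path
        for path in directory.rglob("*")
        if path.is_file() and file_digest(path, algorithm) in wanted
    )
```

Malware samples should only be handled in an isolated analysis environment. This sketch reads file contents and never executes them.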