Malware Has a Color - Carter Yagemann

Feb 21, 2019
Security
#malware

In an upcoming paper I plan to present some preliminary work in applying machine learning to program control flows to detect anomalies. Specifically, my coauthors and I demonstrate how to use this to analyze document malware with promising accuracy. In previous posts, I've detailed the threat malicious documents pose to users and shared some insights into why this problem remains prevalent. For this post, I want to switch gears and share a fun technique I use to help understand what the control flow anomaly detector is really seeing. Put playfully, I'm going to demonstrate how to spot malware by the color it makes. Enjoy.

Without going into too many details about the system (I'll be sure to make a post about where to find the paper and code when it becomes available), at a high level we collect and use execution traces of a target program (e.g. Acrobat Reader) opening benign documents to create a path prediction model. We then use the model on an unlabeled trace with the intuition being that any occurring anomalies will cause a sudden drop in prediction accuracy. There's a little more to it than that, but this is the gist of how the system operates.

One challenge with such a system is the traces are massive. Even if we narrow our focus to only indirect control flow transfers (e.g. ret, icall, and ijmp), that's still thousands to millions of events per minute of real time execution. This makes diagnosing whether there was a bug or whether the malware simply didn't detonate a challenge.

One option is to manually analyze the malware sample, typically by running the virtual machine in another framework. This is time consuming though, which is why I've come up with a niftier solution that involves visualizing the trace in conjunction with the model's output. This is how I discovered that malware has a color.

Visualization Technique

First, I convert each target address into a color by hashing it. For simplicity, I take the first three bytes of a md5 hash to get a RGB tuple. The reason I use a secure hashing algorithm instead of a simple checksum is so nearby addresses will create very different colors, creating contrast. Here's an example of what a benign trace looks like:

Benign Trace

We can clearly see patterns, but there's no clear indicator of what makes one normal and another anomalous. Let's compare with a family of PDF malware that the system is good at detecting: pdfka. Here's one of the traces:

PDFKA Color Trace

Look at that streak of dark blue! What's going on there? Is this an exploit or simply a pattern our previous example didn't capture? To find out, I create a second image where each target is a white pixel if the model predicted it correctly and black if the prediction was wrong. Here's the result for the same trace:

PDFKA Model Output

Looks like there's a streak of incorrect predictions that lines up with the dark blue, but let's confirm it by subtracting the two images. This will cause areas of accurate prediction (white in the second image) to become black while areas of incorrect predictions will keep their color from the first image. Here's the result:

PDFKA Subtraction

Sure enough, that blue streak is an anomaly produced by pdfka. In fact, if we were to visualize all the pdfka samples in our paper's evaluation dataset, we would find they all contain a blue streak. What the blue is really visualizing is an exploit (CVE-2010-0188) being carried out against the TIFF parser library in AcroRd32.dll. Therefore, we can say the color of pdfka is dark blue!

Other Examples

To demonstrate the value of subtracting, consider this visualization of opening a Microsoft Word document:

MS Word Trace

You may be tempted to conclude this is a red malware (since coming up with this technique, I've been referring to malware by their colors instead of family names for fun), but it's actually benign. We can see this in the subtraction:

MS Word Subtraction

No colors means no anomalies. Now let's see a trace of hancitor:

Hancitor

As you can see, it's green malware. Meanwhile thus is blue-purple:

Thus

I think that's enough examples to make my point. I hope you've been convinced that malware has a color. Thanks for reading and happy hacking!