Microsoft And Intel Team Up To Visualize Malware With AI For A Better Detection System

Malware comes in many strains from many different families, which makes identifying and classifying them difficult.

This is why researchers from the Microsoft Threat Protection Intelligence Team have teamed up with researchers from Intel Labs to create STAtic Malware-as-Image Network Analysis (STAMINA), a project that takes a novel approach to detecting and classifying malware by visualizing it.

The technique works by converting malware samples into grayscale images so that a deep learning system can study them.

During the first part of their collaboration, the researchers applied Intel's previous work on deep transfer learning for static malware classification to a real-world dataset from Microsoft, to better understand the practical value of approaching malware classification as a computer vision task.

The dataset was compiled from Windows Defender installations.

According to Microsoft's blog post, the goals are to:

Leverage deep learning with high accuracy and low false positives in order to avoid time-consuming manual feature engineering.
Optimize deep learning techniques in terms of model size and leveraging platform hardware capabilities to optimize execution of deep-learning malware detection approaches.

Microsoft and Intel researchers used 60% of the known malware samples to train the original deep neural network (DNN), 20% to validate it, and the remaining 20% for the actual testing.
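A 60/20/20 split like the one described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_samples`, the fixed random seed, and the use of integers as stand-ins for samples are assumptions made here, not details from the research.

```python
import random

def split_samples(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle samples and split them into train/validation/test lists."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 20%)
    return train, val, test

# Integers stand in for sample identifiers (hypothetical data).
train, val, test = split_samples(range(1000))
print(len(train), len(val), len(test))
```

Keeping the test portion untouched until the end is what makes the reported results a fair estimate of how the model behaves on files it has never seen.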

The project rests on the observation that malware binaries can be rendered as grayscale images, and that those images contain textural and structural patterns that can be used to classify the binaries effectively.

Finally, the binaries can be sorted into their respective threat families.

Static analysis comes into play because it remains an important building block for AI-driven malware detection.

It's this static analysis that produces the metadata about a file.

Using this metadata, machine learning classifiers on the client and in the cloud can analyze files to determine whether or not they are malicious. It's through this static analysis that most threats can be caught before they can even run.

For more complex threats, however, detection has to go a bit beyond that.

Here, dynamic analysis and behavior analysis build on static analysis to provide more features and more comprehensive detection. Improving static analysis at scale therefore strengthens everything built on top of it.

To make this happen, the researchers used their knowledge of computer vision to build an enhanced static malware detection framework that leverages deep transfer learning to train directly on portable executable (PE) binaries represented as images.

The first step is image conversion, a pre-processing stage.

This step prepares the binaries by converting them into two-dimensional images through pixel conversion, reshaping, and resizing. Each binary is first converted into a one-dimensional pixel stream by assigning each byte a value between 0 and 255, corresponding to pixel intensity.

Each pixel stream is then transformed into a two-dimensional image, with the file size determining its width and height.
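The conversion described above can be approximated with a short Python sketch. Several things here are assumptions rather than details from the research: the function names `bytes_to_image` and `resize_nearest` are made up for illustration, the width is chosen as the integer square root of the file size (the article only says the file size determines the dimensions), the last row is zero-padded, and the resize uses simple nearest-neighbor sampling.

```python
import math

def bytes_to_image(data, width=None):
    """Turn raw bytes into a 2D grid of grayscale intensities (0-255)."""
    pixels = list(data)  # iterating over bytes yields integers in 0..255
    if not pixels:
        raise ValueError("cannot convert an empty file")
    if width is None:
        # Assumed rule: pick a width that gives a roughly square image.
        width = max(1, math.isqrt(len(pixels)))
    height = math.ceil(len(pixels) / width)
    # Zero-pad so the final row is complete (an assumption of this sketch).
    pixels += [0] * (width * height - len(pixels))
    return [pixels[r * width:(r + 1) * width] for r in range(height)]

def resize_nearest(image, new_width, new_height):
    """Resize a 2D grid with nearest-neighbor sampling."""
    old_height, old_width = len(image), len(image[0])
    return [
        [image[y * old_height // new_height][x * old_width // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]

# Made-up bytes stand in for the contents of a PE file.
sample = bytes(range(256)) * 4 + b"MZ"
image = bytes_to_image(sample)
thumb = resize_nearest(image, 8, 8)
print(len(image), len(image[0]), len(thumb), len(thumb[0]))
```

Resizing every image to the same dimensions matters because image classifiers generally expect fixed-size input, regardless of how large the original file was.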

The second step, transfer learning, is a technique for overcoming the isolated learning paradigm by reusing knowledge acquired for one task to solve related ones. It is meant to reduce training time by bypassing the need to search for optimized parameters and architectures.

Finally, the third step is evaluation.

STAMINA's transfer learning. (Credit: Microsoft/Intel)

"The joint research showed that applying STAMINA to real-world hold-out test data set achieved a recall of 87.05% at 0.1% false positive rate, and 99.66% recall and 99.07% accuracy at 2.58% false positive rate overall," Microsoft said, suggesting that so far, STAMINA has proven mostly effective.
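The quoted figures pair each recall value with a false positive rate because the two move together as the detection threshold changes. The sketch below shows how recall, false positive rate, and accuracy are computed from labeled predictions. The function name `detection_metrics`, the labels, and the scores are all invented for illustration and have no connection to the STAMINA results.

```python
def detection_metrics(labels, scores, threshold):
    """Return (recall, false_positive_rate, accuracy) at a score threshold.

    labels: 1 for malware, 0 for benign. scores: model outputs in [0, 1].
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1  # malware correctly flagged
        elif label == 1:
            fn += 1  # malware missed
        elif predicted == 1:
            fp += 1  # benign file wrongly flagged
        else:
            tn += 1  # benign file correctly passed
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return recall, fpr, accuracy

# Toy data: five malicious and five benign files with made-up scores.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.75, 0.35):
    recall, fpr, accuracy = detection_metrics(labels, scores, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, "
          f"fpr={fpr:.2f}, accuracy={accuracy:.2f}")
```

On this toy data, the strict threshold produces no false positives but misses two of the five malicious files, while the looser threshold catches all five at the cost of one false alarm. The same trade-off is why the research reports one recall figure at a very low false positive rate and a higher one at a looser rate.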

It does have some drawbacks, though.

STAMINA can go in-depth into samples and extract additional signals that might not be captured in the metadata. For larger applications, however, it becomes less effective because of the limitations of converting billions of pixels into JPEG images and then resizing them.

"In such cases, metadata-based methods show advantages over our research," the researchers noted.

However, with further research and tweaks, STAMINA could be very useful.

Most malware detection systems rely on extracting binary signatures or fingerprints. But the sheer number of signatures can make traditional classification methods impractical. STAMINA could help anti-malware tools keep up with that growth, reducing the chances of security threats slipping past defenses.

The results and further technical details of the research are laid out in the team's paper, provided by Intel.

Published:

13/05/2020