CrowdSource: Automated Inference of High Level Malware Functionality from Low-Level Symbols Using a Crowd Trained Machine Learning Model

Joshua Saxe; Rafael Turner; Kristina Blokhin

arXiv:1605.08642·cs.CR·May 30, 2016

TL;DR

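CrowdSource is a statistical natural language processing system that infers high-level malware functionality from printable strings extracted from malware binaries. It learns the mapping from millions of StackExchange technical documents and detects at least 14 capabilities in unpacked binaries with an average per-capability f-score of 0.86, at a rate of tens of thousands of binaries per day on commodity hardware.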

Contribution

This work introduces a statistical NLP approach that maps low-level strings found in malware binaries to high-level functionality, using a model trained on crowd-authored technical question and answer documents from StackExchange.

Findings

01

Detects at least 14 high-level malware capabilities in unpacked binaries with an average per-capability f-score of 0.86

02

Processes tens of thousands of binaries per day on commodity hardware

03

Achieves both results using printable character strings extracted from the binaries
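
The 0.86 figure is an average of per-capability f-scores, so it may help to see how such an average is computed. The sketch below assumes the balanced F1 form of the f-score (the page does not state which variant is used), and the confusion counts and capability names are invented for illustration; they are not results from the paper.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f_score(counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of the per-capability f-scores."""
    return sum(f_score(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts per capability, for illustration only.
example_counts = {
    "network": (8, 2, 2),
    "registry": (5, 0, 5),
}
print(average_f_score(example_counts))
```

Averaging per capability gives every capability equal weight, so a rare capability with a poor score pulls the average down as much as a common one.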

Abstract

In this paper we introduce CrowdSource, a statistical natural language processing system designed to make rapid inferences about malware functionality based on printable character strings extracted from malware binaries. CrowdSource "learns" a mapping between low-level language and high-level software functionality by leveraging millions of web technical documents from StackExchange, a popular network of technical question and answer sites, using this mapping to infer malware capabilities. This paper describes our approach and provides an evaluation of its accuracy and performance, demonstrating that it can detect at least 14 high-level malware capabilities in unpacked malware binaries with an average per-capability f-score of 0.86 and at a rate of tens of thousands of binaries per day on commodity hardware.
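
The pipeline described in the abstract can be sketched in two steps: pull printable strings out of a binary, then score capabilities from the symbols those strings contain. This is a minimal sketch and not the authors' implementation. The weight table `TOY_WEIGHTS`, the summing rule, the threshold, and the function names are all assumptions made for illustration; in the paper the symbol-to-capability associations are learned from StackExchange documents.

```python
import re
from collections import defaultdict


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]


# Hypothetical symbol-to-capability weights, invented for this example.
TOY_WEIGHTS = {
    "RegSetValueExA": {"registry": 0.9},
    "InternetOpenUrlA": {"network": 0.8},
    "GetAsyncKeyState": {"keylogging": 0.7},
    "socket": {"network": 0.5},
}


def score_capabilities(strings, weights=TOY_WEIGHTS) -> dict[str, float]:
    """Sum the weights of every known symbol found in the strings."""
    scores = defaultdict(float)
    for s in strings:
        for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", s):
            for capability, weight in weights.get(token, {}).items():
                scores[capability] += weight
    return dict(scores)


def infer(data: bytes, threshold: float = 0.6) -> list[str]:
    """Report capabilities whose summed score reaches the threshold."""
    scores = score_capabilities(extract_strings(data))
    return sorted(c for c, s in scores.items() if s >= threshold)


sample = b"\x00\x01RegSetValueExA\x00\xffab\x00InternetOpenUrlA\x00socket\x00\x02"
print(infer(sample))
```

The string extraction step resembles the Unix `strings` utility. It only finds meaningful symbols when the binary is unpacked, which matches the abstract's restriction of its results to unpacked malware.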
