Latent Semantic Structure in Malicious Programs
John Musgrave, Temesguen Messay-Kebede, David Kapp, Anca Ralescu

TL;DR
This paper applies Latent Semantic Analysis to the binaries of malicious programs to uncover their underlying semantic structure, yielding a more abstract representation of program composition and a finer-grained view of program structure and similarity.
Contribution
The paper introduces a novel application of Latent Semantic Analysis to binary analysis, revealing semantic topics and structure within malicious programs.
Findings
Semantic topics effectively characterize program components
Spatial representation improves program structure resolution
Similarity metrics aid in comparing malicious binaries
Abstract
Latent Semantic Analysis is a method of matrix decomposition used for discovering topics and topic weights in natural language documents. This study uses Latent Semantic Analysis to analyze the composition of binaries of malicious programs. Applying it to the term frequency vectors of a program yields a set of topics, each topic being a composition of terms. The vectors and topics were evaluated quantitatively using a spatial representation. This semantic analysis provides a more abstract representation of the program than the term frequency analysis from which it is derived. We use a metric space to represent a program as a collection of vectors, and a distance metric to evaluate their similarity within a topic. The segmentation of the vectors in this dataset provides increased resolution into the program structure.
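The pipeline the abstract describes (term frequency vectors, decomposition into topics, and a distance metric in the resulting space) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the opcode sequences in SEGMENTS are hypothetical, the helper names are invented, and power iteration with deflation stands in for a full singular value decomposition so that the example needs only the Python standard library.

```python
import math
from collections import Counter

# Hypothetical opcode sequences standing in for segments of disassembled
# binaries; these are illustrative, not data from the paper.
SEGMENTS = {
    "seg_a": ["mov", "mov", "push", "call", "mov", "pop"],
    "seg_b": ["mov", "push", "call", "call", "pop", "ret"],
    "seg_c": ["xor", "xor", "add", "xor", "jmp", "add"],
    "seg_d": ["xor", "add", "add", "jmp", "jmp", "xor"],
}


def term_frequency_matrix(segments):
    """Return (vocabulary, rows); each row holds one segment's term counts."""
    vocab = sorted({tok for toks in segments.values() for tok in toks})
    rows = []
    for toks in segments.values():
        counts = Counter(toks)
        rows.append([float(counts[term]) for term in vocab])
    return vocab, rows


def top_topics(rows, k, iters=300):
    """Approximate the top-k right singular vectors of the term frequency
    matrix A by power iteration with deflation on the Gram matrix A^T A.
    Assumes k does not exceed the rank of A."""
    n = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    topics = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        eigenvalue = sum(
            v[i] * sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)
        )
        topics.append(v)
        # Deflate: remove the found component so the next pass finds the next topic.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= eigenvalue * v[i] * v[j]
    return topics


def project(row, topics):
    """Coordinates of one term frequency vector in topic space."""
    return [sum(a * b for a, b in zip(row, topic)) for topic in topics]


def euclidean(p, q):
    """Euclidean distance between two points in topic space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    vocab, rows = term_frequency_matrix(SEGMENTS)
    topics = top_topics(rows, k=2)
    coords = {name: project(row, topics) for name, row in zip(SEGMENTS, rows)}
    for index, topic in enumerate(topics):
        weights = sorted(zip(vocab, topic), key=lambda pair: -abs(pair[1]))[:3]
        print(f"topic {index}:", [(term, round(w, 3)) for term, w in weights])
    print("d(seg_a, seg_b) =", round(euclidean(coords["seg_a"], coords["seg_b"]), 3))
    print("d(seg_a, seg_c) =", round(euclidean(coords["seg_a"], coords["seg_c"]), 3))
```

In this toy setup, each topic is a unit vector of term weights, and segments that share vocabulary land close together in topic space while segments with disjoint vocabulary land far apart. A production analysis would more likely use a library SVD routine and a weighting scheme chosen for the dataset.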
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Spam and Phishing Detection
