A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks
Beatrice Casey, Joanna C. S. Santos, George Perry

TL;DR
This survey reviews current machine learning approaches for cybersecurity tasks involving source code, highlighting prevalent representations like ASTs and graphs, popular tasks like vulnerability detection, and common models such as sequence-based methods and SVMs.
Contribution
It provides a comprehensive overview of existing source code representations and modeling techniques used in cybersecurity, identifying trends and gaps in the field.
Findings
Graph-based representations are the most popular category of representation.
Tokenizers and ASTs are the two most popular representations.
Vulnerability detection is the leading cybersecurity task.
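To make the representations named in these findings concrete, the following minimal sketch derives a token sequence and an AST from the same code snippet, using only Python's standard `tokenize` and `ast` modules. It is an illustration under stated assumptions, not a pipeline taken from the survey: the snippet `SNIPPET` and the helper names `token_sequence`, `ast_node_types`, and `ast_edges` are hypothetical, and the surveyed approaches cover many languages and tools beyond Python.

```python
import ast
import io
import tokenize

# Hypothetical snippet to represent; not taken from the survey.
SNIPPET = "def check(pw):\n    return pw == 'admin'\n"

# Token types that only describe layout, not the code's content.
LAYOUT_TOKENS = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def token_sequence(source):
    """Return the lexical tokens of `source` as (type name, text) pairs."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT_TOKENS:
            continue
        tokens.append((tokenize.tok_name[tok.type], tok.string))
    return tokens


def ast_node_types(source):
    """Return the type name of every node in the AST of `source`."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def ast_edges(source):
    """Return parent-to-child edges of the AST as (parent type, child type) pairs.

    Listing the edges explicitly gives a simple graph-style view of the tree.
    """
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


if __name__ == "__main__":
    print(token_sequence(SNIPPET))
    print(ast_node_types(SNIPPET))
    print(ast_edges(SNIPPET))
```

The token sequence is flat and keeps only lexical order, which suits sequence-based models. The AST keeps the syntactic nesting, and its edge list is one simple way to hand that structure to a graph-based model.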
Abstract
Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what is not there yet. This article presents a study of these existing machine learning based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and tokenizers and Abstract Syntax Trees (ASTs) are the two most popular…
Taxonomy
Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Network Security and Intrusion Detection
