On the Role of Pre-trained Embeddings in Binary Code Analysis
Alwin Maier, Felix Weissberg, Konrad Rieck

TL;DR
This paper critically evaluates pre-trained embeddings in binary code analysis and finds that end-to-end learning often outperforms them when ample labeled data is available, which challenges their assumed necessity.
Contribution
The paper systematically compares recent embeddings for assembly code with end-to-end learning across multiple tasks and provides guidelines on when each approach is preferable.
Findings
End-to-end learning often outperforms pre-trained embeddings with sufficient labeled data.
Differences between embeddings are minimal when data is abundant.
Guidelines are provided for choosing between embeddings and end-to-end learning.
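
To make the distinction in these findings concrete, here is a minimal Python sketch of the two training regimes. It is not the paper's experimental setup: the `train` function, the toy instruction sequences, and the labels (1 for "function start", 0 otherwise) are all hypothetical, and randomly initialized vectors stand in for what would normally come from pre-training. With `train_embeddings=False` the embedding table stays frozen and only the classifier is fitted. With `train_embeddings=True` the table is updated together with the classifier from labeled data, which is the end-to-end case.

```python
import math
import random


def train(samples, vocab, dim=4, train_embeddings=True, epochs=200, lr=0.5, seed=0):
    """Logistic classifier over the mean token embedding of a sequence.

    train_embeddings=False: embedding table is frozen (stand-in for a
    pre-trained embedding); only the classifier weights are learned.
    train_embeddings=True: table and classifier are learned jointly
    (end-to-end learning).
    """
    rng = random.Random(seed)
    # Assumption: random vectors stand in for pre-trained ones.
    emb = {tok: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for tok in vocab}
    initial = {tok: vec[:] for tok, vec in emb.items()}
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for tokens, label in samples:
            n = len(tokens)
            # Sequence representation: mean of the token vectors.
            x = [sum(emb[t][i] for t in tokens) / n for i in range(dim)]
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label  # gradient of the log-loss with respect to z
            if train_embeddings:
                # Backpropagate into the embedding table as well.
                for t in tokens:
                    for i in range(dim):
                        emb[t][i] -= lr * g * w[i] / n
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return emb, initial, w, b


# Hypothetical toy data: (instruction mnemonics, is_function_start).
SAMPLES = [
    (["push", "mov", "sub"], 1),
    (["push", "mov", "call"], 1),
    (["add", "mov", "jmp"], 0),
    (["cmp", "jne", "mov"], 0),
]
VOCAB = sorted({tok for tokens, _ in SAMPLES for tok in tokens})

frozen_emb, frozen_init, _, _ = train(SAMPLES, VOCAB, train_embeddings=False)
e2e_emb, e2e_init, _, _ = train(SAMPLES, VOCAB, train_embeddings=True)

print(frozen_emb == frozen_init)  # True: frozen table is untouched
print(e2e_emb == e2e_init)        # False: end-to-end training moved the vectors
```

The only difference between the two runs is whether the labeled examples are allowed to reshape the token vectors. That is the choice the guidelines address.
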
Abstract
Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five…
