On the Role of Pre-trained Embeddings in Binary Code Analysis

Alwin Maier; Felix Weissberg; Konrad Rieck

arXiv:2502.08682·cs.LG·February 14, 2025

TL;DR

This paper critically evaluates pre-trained embeddings in binary code analysis and finds that end-to-end learning often outperforms them when ample labeled data is available, which challenges the assumption that embeddings are necessary.

Contribution

The paper systematically compares recent embeddings for assembly code with end-to-end learning across multiple tasks and provides guidelines on when each approach is preferable.

Findings

01. End-to-end learning often outperforms pre-trained embeddings with sufficient labeled data.

02. Differences between embeddings are minimal when data is abundant.

03. Guidelines are provided for choosing between embeddings and end-to-end learning.

Abstract

Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five…
