ViC: Virtual Compiler Is All You Need For Assembly Code Search
Zeyu Gao, Hao Wang, Yuanda Wang, Chao Zhang

TL;DR
This paper introduces ViC, a virtual compiler based on a pre-trained LLM, which enables assembly code generation from source code across multiple languages, significantly improving assembly code search performance.
Contribution
The paper presents ViC, a novel virtual compiler that emulates compilation for any language, facilitating large-scale dataset creation and enhancing assembly code search accuracy.
Findings
Achieved 26% improvement over baseline in assembly code search.
Successfully trained ViC on 20 billion tokens from Ubuntu packages.
Enabled cross-language virtual compilation without real compilers.
Abstract
Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By leveraging Ubuntu packages to compile a dataset of 20 billion tokens, we continue pre-training CodeLlama as a Virtual Compiler (ViC), capable of compiling any source code of any language to assembly code. This approach allows for virtual compilation across a wide range of programming languages without the need for a real compiler, preserving semantic equivalency and expanding the possibilities for assembly code dataset construction. Furthermore, we use ViC to construct a sufficiently large…
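To make the search task concrete, here is a minimal sketch of ranking assembly functions against a natural-language query by embedding similarity. It is not the paper's method: the `embed` function is a toy bag-of-tokens stand-in for a learned encoder, and the function names and assembly snippets in `corpus` are hypothetical. A real system would need a trained model to bridge natural language and assembly; this toy version only matches on literal tokens such as called symbol names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder (assumption): bag-of-tokens counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, corpus: dict, top_k: int = 3) -> list:
    """Return the top_k (function name, score) pairs, highest score first."""
    q = embed(query)
    scored = [(name, cosine(q, embed(asm))) for name, asm in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical disassembled functions, keyed by placeholder names.
corpus = {
    "sub_401000": "push rbp\nmov rbp, rsp\ncall strlen\npop rbp\nret",
    "sub_401020": "push rbp\nmov rbp, rsp\ncall memcpy\npop rbp\nret",
    "sub_401040": "xor eax, eax\nret",
}

results = search("function that calls memcpy", corpus, top_k=2)
for name, score in results:
    print(f"{name}\t{score:.3f}")
```

In this sketch only `sub_401020` shares a token (`memcpy`) with the query, so it ranks first; swapping `embed` for a trained text and assembly encoder would keep the same ranking logic.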
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
