Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning
Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, Xiangyu Zhang, Petr Babkin

TL;DR
Nova introduces a hierarchical attention and contrastive learning approach to enhance generative language models for assembly code, significantly improving performance in binary decompilation and code similarity detection tasks.
Contribution
The paper presents Nova, a novel generative LLM for assembly code that incorporates hierarchical attention and contrastive learning to address assembly-specific challenges.
Findings
Outperforms existing techniques on binary decompilation by up to 14.84 to 21.58 absolute percentage points in Pass@1 and Pass@10.
Achieves up to 6.17% higher Recall@1 in binary code similarity detection.
Demonstrates strong capabilities in assembly generation and understanding tasks.
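For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Pass@k and Recall@1 are commonly computed. It assumes the widely used unbiased Pass@k estimator (n samples per problem, c of them correct) and a simple similarity-matrix form of Recall@1; this page does not state which exact evaluation code the paper uses, and the function names `pass_at_k` and `recall_at_1` are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem.

    n: number of generated samples, c: how many of them pass the tests,
    k: budget of samples considered. Assumed estimator: 1 - C(n-c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def recall_at_1(similarity: list[list[float]], true_index: list[int]) -> float:
    """Fraction of queries whose top-ranked candidate is the true match.

    similarity[i][j] is the score between query i and candidate j;
    true_index[i] is the position of the correct candidate for query i.
    """
    hits = 0
    for row, target in zip(similarity, true_index):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == target)
    return hits / len(true_index)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(recall_at_1([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 1, 1]))
```

In a similarity-detection setting, each query would be an assembly function and the candidates a pool of functions, one of which is compiled from the same source.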
Abstract
Binary code analysis is the foundation of crucial tasks in the security domain; thus building effective binary analysis techniques is more important than ever. Although large language models (LLMs) have brought impressive improvements to source code tasks, they do not directly generalize to assembly code because of two challenges unique to assembly: (1) its low information density and (2) the diverse optimizations applied to it. To overcome these challenges, this work proposes a hierarchical attention mechanism that builds attention summaries to capture the semantics more effectively and designs contrastive learning objectives to train LLMs to learn assembly optimization. Equipped with these techniques, this work develops Nova, a generative LLM for assembly code. Nova outperforms existing techniques on binary code decompilation by up to 14.84 -- 21.58% (absolute percentage point…
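To make the contrastive idea concrete, here is a minimal sketch of a generic InfoNCE-style loss that pulls together embeddings of the same function compiled at two optimization levels and pushes apart embeddings of different functions. This is a standard contrastive formulation used purely for illustration; it is not the paper's objective, whose exact form is not given on this page. The names `info_nce`, `anchors`, `positives`, and the `temperature` value are all assumptions.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are assumed to embed the SAME function
    compiled at two different optimization levels (e.g. O0 and O2);
    positives[j] for j != i act as negatives for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    o0 = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical embeddings at O0
    o2_aligned = [[1.0, 0.0], [0.0, 1.0]]  # same functions at O2, well aligned
    o2_swapped = [[0.0, 1.0], [1.0, 0.0]]  # misaligned embeddings
    print(info_nce(o0, o2_aligned), info_nce(o0, o2_swapped))
```

The loss is small when each function's two compiled variants are nearest neighbours in embedding space and large when they are not, which is the behaviour a contrastive objective of this kind rewards.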
Peer Reviews
Decision·ICLR 2025 Poster
Originality: 1. While hierarchical attention mechanisms are not new, the design of this one is innovative in that it takes into account the specific format and constraints of assembly instructions, and it accommodates regular tokens in the same sequence (e.g., natural text instructions). 2. The contrastive objective losses likewise encode a priori knowledge of the underlying data: compilation stages preserve semantics, and optimization stages are sequential.
Quality: 1. One of the three motivating cases in the introduction, malware detection, is not evaluated or considered at all in the rest of the paper. I understand the scope of the paper needs to end somewhere, but it would have strengthened the paper to include experiments on such a dataset. 2. Details are missing on how the authors ensure that the test datasets (both for decompilation and for similarity detection) do not overlap with any of the training data, including the pre-training
1. This paper is well-structured and easy to follow. Concepts such as hierarchical attention and contrastive learning are clearly explained. 2. The paper proposes a new method for encoding assembly code by using a Hierarchical Attention Mechanism to effectively capture the semantics of assembly instructions, while employing Contrastive Learning to ensure that functionally equivalent assembly code, even at different optimization levels, is represented similarly. This novel combination allows the
1. Unclear motivation for introducing several inductive biases through the Hierarchical Attention Mechanism. While the added attention-mask inductive bias shows promising results in the BCD task, its impact in the BCSD task is minimal. This discrepancy raises questions about why the inductive bias performs well in one task but fails to offer significant improvements in the other. 2. Lack of Design Discussion. The paper lacks sufficient discussion on key design components like Preceding-Instruction Attentio
1. Clear Writing and Novel Application: The paper is well-written and easy to follow. The idea of applying hierarchical attention to assembly code is interesting and novel. While hierarchical attention is commonly used in NLP tasks, applying this mechanism to assembly code is, to the best of my knowledge, unprecedented. 2. Promising Results: The evaluation results are promising. Nova demonstrates substantial improvements in both decompilation accuracy and similarity detection compared to existi
Generalizability: The model is trained exclusively on x86 assembly code, which may limit its generalizability to other assembly languages, such as ARM or MIPS. Realism of Evaluation Settings: (1) The decompilation prompt requires optimization level information, but it is unclear if this information is accessible in stripped binaries. (2) For baseline models like GPT, fine-tuning with additional data isn’t necessary, raising questions about the fairness of the comparison. If GPT were given a f
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Computational Physics and Python Applications
Methods: Contrastive Learning · Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Weight Decay · Cosine Annealing · Byte Pair Encoding · Dense Connections
