Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation
Hanzhuo Tan, Xiaolong Tian, Hanrui Qi, Jiaming Liu, Zuchen Gao, Siyi Wang, Qi Luo, Jing Li, Yuqun Zhang

TL;DR
Decompile-Bench is a large-scale, open-source dataset of two million binary-source function pairs designed to advance the development and evaluation of LLM-based binary decompilers, addressing previous limitations in benchmark quality and scale.
Contribution
It introduces the first extensive binary-source function pair dataset and a comprehensive evaluation benchmark for LLM decompilation, improving assessment accuracy and model training.
Findings
Fine-tuning with Decompile-Bench improves the re-executability rate by 20% over previous benchmarks.
The dataset contains 2 million function pairs, condensed from 100 million collected function pairs.
Evaluation metrics are thoroughly analyzed for decompiler assessment.
Abstract
Recent advances in LLM-based decompilers have proven effective at converting low-level binaries into human-readable source code. However, a comprehensive benchmark that provides large-scale binary-source function pairs, which is critical for advancing LLM decompilation technology, is still lacking. Creating accurate binary-source mappings is difficult because complex compilation settings and widespread function inlining obscure the correspondence between binaries and their original source code. Previous efforts have relied on contest-style benchmarks, on synthetic binary-source mappings that diverge significantly from real-world mappings, or on partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first…
Taxonomy
Topics: Logic, programming, and type systems · Software Engineering Research · Security and Verification in Computing
