REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)
Jun Yeon Won, Xin Jin, Shiqing Ma, Zhiqiang Lin

TL;DR
REBench is a standardized benchmark dataset for evaluating large language models (LLMs) on binary reverse engineering tasks such as name recovery and type inference. It addresses the inconsistent datasets and evaluation methods used in prior work.
Contribution
The paper introduces REBench, a unified, knowledge-base-driven benchmark dataset that enables fair and consistent evaluation of LLMs on binary reverse engineering tasks.
Findings
LLMs have significant difficulty with complex reverse engineering tasks
REBench consolidates diverse datasets into a single, comprehensive benchmark
The methodology preserves task difficulty and applicability across architectures
Abstract
Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source…
