CrackMeBench: Binary Reverse Engineering for Agents
Isaac David, Arthur Gervais

TL;DR
CrackMeBench is a new benchmark for evaluating language-model agents on binary reverse engineering tasks using CrackMe-style challenges within a controlled, reproducible environment.
Contribution
It introduces a standardized, executable benchmark of diverse CrackMe tasks, enabling systematic assessment of how AI agents reason about compiled binaries when no source code is available.
Findings
GPT-5.5 achieves 92% pass@3 on generated tasks.
Models perform worse on the harder generated tasks, which highlights how difficult those tasks are.
CrackMeBench provides detailed metrics for progress measurement.
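The pass@3 figure above can be illustrated with a short sketch. It assumes the common reading of pass@k, where a task counts as solved if at least one of its first k attempts is accepted by the oracle; the summary does not state the exact formula the paper uses. The function name `pass_at_k`, the task names, and the attempt outcomes are hypothetical and are not the paper's data.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one accepted attempt among the first k.

    `results` maps a task name to a list of per-attempt outcomes
    (True means the oracle accepted the submission). This is an assumed
    definition of pass@k, not necessarily the one used in the paper.
    """
    if not results:
        raise ValueError("no tasks to score")
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return solved / len(results)


# Made-up outcomes for four hypothetical tasks, three attempts each.
toy_results = {
    "task_a": [False, True, False],   # solved on the second attempt
    "task_b": [True, True, True],     # solved immediately
    "task_c": [False, False, False],  # never solved
    "task_d": [False, False, True],   # solved on the third attempt
}

print(pass_at_k(toy_results, k=3))  # 0.75
print(pass_at_k(toy_results, k=1))  # 0.25
```

Comparing k=1 with k=3 on the same outcomes shows how much of a reported score depends on retries rather than first-attempt success.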
Abstract
Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an…
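To make the idea of an executable oracle concrete, here is a minimal sketch of externally scoring a submission by running the target program and checking its exit code. Everything in it is an assumption made for illustration: the toy validator (a small Python script standing in for a compiled CrackMe), its rule that the character codes of the serial must sum to a multiple of 7, and the names `TOY_CRACKME`, `oracle_accepts`, and `keygen`. It does not reproduce the benchmark's actual harness or tasks.

```python
import subprocess
import sys

# Hypothetical stand-in for a CrackMe binary. It exits with status 0 when the
# serial is non-empty and its character codes sum to a multiple of 7, and
# with status 1 otherwise.
TOY_CRACKME = (
    "import sys\n"
    "s = sys.argv[1] if len(sys.argv) > 1 else ''\n"
    "sys.exit(0 if s and sum(map(ord, s)) % 7 == 0 else 1)\n"
)


def oracle_accepts(candidate: str, timeout: float = 5.0) -> bool:
    """Run the target on a candidate serial and score it by exit code only."""
    proc = subprocess.run(
        [sys.executable, "-c", TOY_CRACKME, candidate],
        capture_output=True,
        timeout=timeout,
    )
    return proc.returncode == 0


def keygen(prefix: str) -> str:
    """Key generator for the toy rule: append one character so the sum of
    character codes becomes divisible by 7."""
    base = sum(map(ord, prefix))
    # 'a'..'g' have seven consecutive codes, so exactly one of them works.
    for c in "abcdefg":
        if (base + ord(c)) % 7 == 0:
            return prefix + c
    raise AssertionError("unreachable: seven consecutive codes cover all residues")


serial = keygen("user42")
print(serial, oracle_accepts(serial))
```

Scoring by the program's own verdict, rather than by a written explanation, means a submission is credited only when the recovered validation logic is actually correct.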
