CrackMeBench: Binary Reverse Engineering for Agents

Isaac David; Arthur Gervais

arXiv:2605.10597·cs.SE·May 12, 2026


TL;DR

CrackMeBench is a new benchmark for evaluating language-model agents on binary reverse engineering tasks using CrackMe-style challenges within a controlled, reproducible environment.

Contribution

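The paper introduces a standardized, executable benchmark (v0: eight public calibration CrackMes plus twelve generated tasks built from seeded C, Rust, and Go templates) for systematically assessing how well AI agents reason about binaries when no source code is available.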

Findings

01

GPT-5.5 achieves 92% pass@3 on generated tasks.

02

Models perform worse on the harder generated tasks.

03

CrackMeBench provides detailed metrics for progress measurement.
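
For readers unfamiliar with the pass@3 figure in the findings above, here is a minimal sketch of one common way to compute an empirical pass@k: a task counts as solved if any of its first k attempts is accepted. The paper's exact definition is not given in this summary, so that reading is an assumption, and the function name `pass_at_k` and the sample outcomes are hypothetical.

```python
def pass_at_k(results, k):
    """Fraction of tasks solved within the first k attempts.

    `results` maps a task id to a list of booleans, one per attempt,
    in the order the attempts were made. This is an assumed, simple
    reading of pass@k, not necessarily the paper's scoring rule.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return solved / len(results)


# Hypothetical outcomes for four tasks with three attempts each.
example = {
    "task_a": [False, True, False],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
    "task_d": [False, False, True],
}

print(pass_at_k(example, 1))  # only task_c succeeds on the first attempt
print(pass_at_k(example, 3))  # task_a, task_c, and task_d succeed within three
```

Under this reading, pass@k can only stay flat or rise as k grows, which is why a pass@3 score should not be compared directly with single-attempt results.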

Abstract

Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an…
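
To make the task shape described in the abstract concrete, here is a toy sketch of a deterministic validation routine, a key generator that reproduces it, and an external scorer that accepts or rejects submissions. Everything in it is hypothetical: the checksum, the names `validate`, `keygen`, and `score`, and the use of a Python function in place of a compiled, symbol-poor binary are illustrative assumptions, not the benchmark's actual tasks or harness.

```python
def validate(name: str, serial: str) -> bool:
    """Toy stand-in for the hidden check inside a CrackMe binary (hypothetical)."""
    acc = 0x1337
    for byte in name.encode("utf-8"):
        acc = ((acc * 31) ^ byte) & 0xFFFF
    return serial == f"{acc:04X}"


def keygen(name: str) -> str:
    """What an agent might submit after recovering the validation logic."""
    acc = 0x1337
    for byte in name.encode("utf-8"):
        acc = ((acc * 31) ^ byte) & 0xFFFF
    return f"{acc:04X}"


def score(keygen_fn, names) -> bool:
    """External scoring: the submission passes only if the oracle accepts
    the serial produced for every test name. No explanation is graded."""
    return all(validate(name, keygen_fn(name)) for name in names)


print(score(keygen, ["alice", "bob", ""]))           # correct keygen is accepted
print(score(lambda name: "not-a-serial", ["alice"]))  # wrong keygen is rejected
```

The point of the sketch is the separation of roles: the agent only ever sees the validator as an opaque executable, and the scorer judges the submitted artifact by running it against the oracle rather than by reading a free-form write-up.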
