Idioms: Neural Decompilation With Joint Code and Type Definition Prediction

Luke Dramko; Claire Le Goues; Edward J. Schwartz

arXiv:2502.04536·cs.SE·June 18, 2025

TL;DR

This paper presents Idioms, a neural decompiler that jointly predicts decompiled code and the definitions of the user-defined types it uses. The authors report improved decompilation accuracy on existing benchmarks and on a new dataset with more realistic types, aimed at making reverse engineering of compiled software easier.

Contribution

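The work introduces Realtype, a new dataset with substantially more complicated and realistic types than existing neural decompilation benchmarks, and Idioms, a neural decompilation approach that jointly predicts code and type definitions and is reported to surpass existing models in accuracy.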

Findings

01 Achieves 54.4% accuracy on ExeBench, outperforming prior models.

02 Performs at least 95% better than prior models on the Realtype dataset.

03 Reports state-of-the-art results in neural decompilation accuracy.

Abstract

Decompilers are important tools for reverse engineers that help them analyze software at a higher level of abstraction than assembly code. Unfortunately, because compilation is lossy, deterministic decompilers produce code that is missing many of the details that make source code readable in the first place, like variable names and types. Neural decompilers, on the other hand, offer the ability to statistically fill in these details. Existing work in neural decompilation, however, suffers from substantial limitations that preclude its use on real code, such as the inability to define composite types, which is essential to fully specify function semantics. In this work, we introduce a new dataset, Realtype, that includes substantially more complicated and realistic types than existing neural decompilation benchmarks, and Idioms, a new neural decompilation approach to finetune any LLM…
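To illustrate why a composite type definition matters for specifying what a function does, here is a minimal sketch. It is not taken from the paper: the `Point` struct, its fields, and the helper `read_i32` are hypothetical names invented for this example. It shows the same memory access expressed two ways: as a raw offset read, which is roughly what remains when type information is lost, and as a named field access, which is only possible once the struct definition is available.

```python
import ctypes
import struct

# Hypothetical composite type, invented for this illustration (not from the paper).
class Point(ctypes.LittleEndianStructure):
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]

def read_i32(buf: bytes, offset: int) -> int:
    """Untyped view: read a little-endian 32-bit int at a raw byte offset,
    roughly what an expression like *(int *)(p + 4) conveys."""
    return struct.unpack_from("<i", buf, offset)[0]

# Raw memory for a Point with x=3, y=7.
raw = struct.pack("<ii", 3, 7)

# Without the type definition, only the numeric offset is known.
untyped_value = read_i32(raw, Point.y.offset)

# With the definition, the same access has a field name and a type.
typed_value = Point.from_buffer_copy(raw).y

print(Point.y.offset, untyped_value, typed_value)
```

Both reads return the same value, but only the typed form tells a reader that the function is touching the `y` field of a `Point`. Predicting the code without the accompanying definition would leave that field access underspecified.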

Code & Models

Repositories

squaresLab/idioms (official)

Datasets

ejschwartz/idioms-realtype
dataset · 43 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification