Learning to Find Usages of Library Functions in Optimized Binaries
Toufique Ahmed, Premkumar Devanbu, Anand Ashok Sawant

TL;DR
This paper presents a supervised learning method to improve the recovery of function calls in optimized binaries, enhancing decompilation accuracy especially under high optimization levels.
Contribution
The paper introduces a novel dataset creation and augmentation approach for training models to identify function calls in binaries, and integrates the resulting model with Ghidra to improve decompilation results.
Findings
Significant improvement in function call recovery accuracy.
Enhanced decompilation quality at high optimization levels.
Effective use of data augmentation and pre-training techniques.
Abstract
Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding binaries' behavior can be quite challenging, especially when compiled under higher levels of compiler optimization. These optimizations can transform comprehensible, "natural" source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools to "decompile" binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using steps such as…
