Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries
Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, Premkumar Devanbu, Arie van Deursen

TL;DR
This paper extends pre-trained source code language models to generate summaries for decompiled binaries, creating a new dataset and achieving state-of-the-art results, thus aiding reverse engineering tasks.
Contribution
The authors introduce CAPYBARA, a large dataset of decompiled functions with documentation, and adapt CodeT5 to BinT5, a model that effectively summarizes decompiled binary code.
Findings
BinT5 achieves BLEU-4 scores over 60 for source code summaries.
Model performance is robust across different dataset sizes and compiler optimizations.
Synthetic data generation and deduplication improve summarization quality.
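For readers unfamiliar with the metric, the following is a minimal sketch of a sentence-level BLEU-4 score on a 0 to 100 scale. It is not the paper's evaluation script: the whitespace tokenisation, the add-one smoothing, and the function names `ngrams` and `bleu4` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(reference, candidate):
    """Sentence-level BLEU-4 (0-100) against a single reference.

    Assumptions: whitespace tokenisation and add-one smoothing of each
    n-gram precision; the paper's exact variant may differ.
    """
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four smoothed precisions, scaled to 0-100.
    return 100 * bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    reference = "copy the source buffer into the destination buffer"
    print(bleu4(reference, reference))
    print(bleu4(reference, "copy the buffer"))
```

Under this sketch an exact match scores 100, and a shorter or partially overlapping summary scores lower, which gives a rough sense of scale for the reported figures.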
Abstract
Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, much of which is taken up by labelling the functions with semantic information. While the automated summarisation of decompiled code can help Reverse Engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components; the data and the model. We first build CAPYBARA, a…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Attention Dropout · Dropout · Dense Connections · Adafactor
