Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries

Ali Al-Kaswan; Toufique Ahmed; Maliheh Izadi; Anand Ashok Sawant; Premkumar Devanbu; Arie van Deursen

arXiv:2301.01701·cs.CR·January 16, 2023·5 cites

TL;DR

This paper extends pre-trained source code language models to generate summaries for decompiled binaries, creating a new dataset and achieving state-of-the-art results, thus aiding reverse engineering tasks.

Contribution

The authors introduce CAPYBARA, a large dataset of decompiled functions with documentation, and adapt CodeT5 to BinT5, a model that effectively summarizes decompiled binary code.

Findings

01

BinT5 achieves BLEU-4 scores over 60 for source code summaries.

02

Model performance is robust across different dataset sizes and compiler optimizations.

03

Synthetic data generation and deduplication improve summarization quality.
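
To make the BLEU-4 figure in finding 01 concrete, the following is a minimal sketch of a sentence-level BLEU-4 score on a 0-100 scale. It is an illustration only: the whitespace tokenisation, the add-one smoothing for n-grams of order 2 and above, and the function names `ngrams` and `bleu4` are assumptions made here, and the exact BLEU variant used by the authors is not stated on this page.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference, smooth=1.0):
    """Sentence-level BLEU-4 (0-100), smoothed for n >= 2 (assumed scheme)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if n > 1:
            # Additive smoothing so one missing higher-order n-gram
            # does not zero the whole score.
            matches += smooth
            total += smooth
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four n-gram precisions, scaled to 0-100.
    return 100.0 * bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    reference = "returns the sum of two integers"
    print(bleu4("returns the sum of two integers", reference))
    print(bleu4("returns the sum", reference))
```

Under this sketch, a generated summary identical to the reference scores 100, and a summary sharing no words with the reference scores 0. Scores reported in papers are typically averaged over a test set, so a single-sentence score is not directly comparable to them.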

Abstract

Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, and much of that time is taken up by labelling the functions with semantic information. While the automated summarisation of decompiled code can help reverse engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components: the data and the model. We first build CAPYBARA, a…

Code & Models

Repositories

aise-tudelft/capybara-bint5
PyTorch · Official

Datasets

AISE-TUDelft/Capybara
dataset · 77 downloads

Taxonomy

Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Attention Dropout · Dropout · Dense Connections · Adafactor