On Training a Neural Network to Explain Binaries
Alexander Interrante-Grant, Andy Davis, Heather Preslier, and Tim Leek

TL;DR
This paper explores training neural networks to generate English descriptions of binary code functionality, introduces a novel dataset evaluation method called Embedding Distance Correlation (EDC), and uses it to assess dataset quality for this task.
Contribution
It presents a new dataset evaluation technique, Embedding Distance Correlation, and applies it to assess the quality of datasets for binary code understanding with neural networks.
Findings
Judged by their EDC scores, existing datasets are often of low quality.
The EDC method reliably indicates dataset usefulness for code understanding tasks.
A new dataset of 1.1M entries, derived from a capture of Stack Overflow, was created for this research.
Abstract
In this work, we begin to investigate the possibility of training a deep neural network on the task of binary code understanding. Specifically, the network would take, as input, features derived directly from binaries and output English descriptions of functionality to aid a reverse engineer in investigating the capabilities of a piece of closed-source software, be it malicious or benign. Given recent success in applying large language models (generative AI) to the task of source code summarization, this seems a promising direction. However, in our initial survey of the available datasets, we found nothing of sufficiently high quality and volume to train these complex models. Instead, we build our own dataset derived from a capture of Stack Overflow containing 1.1M entries. A major result of our work is a novel dataset evaluation method using the correlation between two distances on…
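The abstract is cut off before it says which two distances are correlated, so the following is only an illustrative sketch and not the authors' method. It assumes that each sample has one embedding of its input (for example, features from a binary) and one embedding of its output (for example, an English description), that distance is Euclidean, and that the correlation is Pearson's. The function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(vectors):
    """Distances for every unordered pair of samples, in a fixed order."""
    return [
        euclidean(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def embedding_distance_correlation(input_embeddings, output_embeddings):
    """Assumed reading of EDC: correlate pairwise distances in two spaces.

    Sample i must have its input embedding at index i of the first list
    and its output embedding at index i of the second list.
    """
    if len(input_embeddings) != len(output_embeddings):
        raise ValueError("both embedding lists must describe the same samples")
    if len(input_embeddings) < 3:
        raise ValueError("need at least 3 samples to get more than one pair")
    return pearson(
        pairwise_distances(input_embeddings),
        pairwise_distances(output_embeddings),
    )


# Toy data (made up for illustration): the output space is the input
# space scaled by 2, so pairwise distances are perfectly correlated.
toy_inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_outputs = [[2.0 * x for x in vec] for vec in toy_inputs]
toy_score = embedding_distance_correlation(toy_inputs, toy_outputs)
```

Under these assumptions, a score near 1 would mean that samples with similar inputs also have similar descriptions, while a score near 0 would suggest the pairing carries little learnable structure.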
Taxonomy
Topics: Neural Networks and Applications
