CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation
Dylan Manuel, Paul Rad

TL;DR
CodableLLM is a Python framework that automates mapping decompiled functions to their corresponding source functions, enabling the creation of high-quality datasets for training large language models on code understanding.
Contribution
It introduces a novel automated method for aligning decompiled and source code, supporting multiple languages and improving dataset quality for code-focused LLMs.
Findings
Efficient dataset generation for code understanding models
Improved alignment accuracy between decompiled and source code
Outperforms existing tools in dataset creation efficiency
Abstract
The generation of large, high-quality datasets for code understanding and generation remains a significant challenge, particularly when aligning decompiled binaries with their original source code. To address this, we present CodableLLM, a Python framework designed to automate the creation and curation of datasets by mapping decompiled functions to their corresponding source functions. This process enhances the alignment between decompiled and source code representations, facilitating the development of large language models (LLMs) capable of understanding and generating code across multiple abstraction levels. CodableLLM supports multiple programming languages and integrates with existing decompilers and parsers to streamline dataset generation. This paper presents the design and implementation of CodableLLM, evaluates its performance in dataset creation, and compares it to existing…
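To make the core idea concrete, here is a minimal sketch of pairing decompiled functions with source functions to form dataset records. It is not CodableLLM's API: the `FunctionPair` type, the `map_functions` helper, and the sample inputs are hypothetical, and matching by symbol name is an assumption chosen for illustration, since the abstract does not state how the alignment is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionPair:
    """One dataset record: a source function and its decompiled counterpart."""
    name: str
    source: str
    decompiled: str


def map_functions(source_funcs, decompiled_funcs):
    """Pair functions that appear under the same name on both sides.

    Assumption: both inputs are dicts of {function_name: code_text}.
    Functions found on only one side are skipped.
    """
    shared = sorted(source_funcs.keys() & decompiled_funcs.keys())
    return [
        FunctionPair(name, source_funcs[name], decompiled_funcs[name])
        for name in shared
    ]


# Hypothetical inputs, written by hand in the style of parser and decompiler output.
SOURCE = {
    "add": "int add(int a, int b) { return a + b; }",
    "helper": "static void helper(void) {}",
}
DECOMPILED = {
    "add": "int add(int param_1, int param_2) { return param_2 + param_1; }",
    "FUN_00101139": "void FUN_00101139(void) { return; }",
}

pairs = map_functions(SOURCE, DECOMPILED)
print([p.name for p in pairs])
```

Only `add` is paired here, because `helper` has no same-named counterpart in the decompiled output. That illustrates the main limit of name-based matching: when symbol names are not preserved in the binary, a different alignment signal would be needed.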
