CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation

Dylan Manuel; Paul Rad

arXiv:2507.22066 · cs.SE · July 31, 2025


TL;DR

CodableLLM is a Python framework that automates mapping decompiled functions to their corresponding source functions, producing paired datasets for training large language models on code understanding and generation.

Contribution

The paper introduces a novel automated method for aligning decompiled and source code. The framework supports multiple programming languages and is intended to improve dataset quality for code-focused LLMs.

Findings

01 Efficient dataset generation for code understanding models

02 Improved alignment accuracy between decompiled and source code

03 Outperforms existing tools in dataset creation efficiency

Abstract

The generation of large, high-quality datasets for code understanding and generation remains a significant challenge, particularly when aligning decompiled binaries with their original source code. To address this, we present CodableLLM, a Python framework designed to automate the creation and curation of datasets by mapping decompiled functions to their corresponding source functions. This process enhances the alignment between decompiled and source code representations, facilitating the development of large language models (LLMs) capable of understanding and generating code across multiple abstraction levels. CodableLLM supports multiple programming languages and integrates with existing decompilers and parsers to streamline dataset generation. This paper presents the design and implementation of CodableLLM, evaluates its performance in dataset creation, and compares it to existing…
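The core idea in the abstract, pairing each decompiled function with its source function, can be illustrated with a minimal sketch. This is not CodableLLM's API: the `FunctionPair` record, the `map_functions` and `to_jsonl` helpers, and the name-based matching rule are all assumptions made for illustration, and the sample functions are invented. It assumes the decompiler and the source parser have already produced dictionaries keyed by function name.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical record type; field names are illustrative, not CodableLLM's schema.
@dataclass
class FunctionPair:
    name: str
    source: str
    decompiled: str


def normalize(name: str) -> str:
    """Strip leading underscores that some toolchains prepend to symbol names."""
    return name.lstrip("_")


def map_functions(source_funcs: dict, decompiled_funcs: dict) -> list:
    """Pair source and decompiled functions whose normalized names match.

    Assumption: symbol names survive compilation. Stripped binaries, where the
    decompiler emits placeholder names, would need a different matching signal.
    """
    decompiled_by_name = {normalize(n): body for n, body in decompiled_funcs.items()}
    pairs = []
    for name, src in source_funcs.items():
        dec = decompiled_by_name.get(normalize(name))
        if dec is not None:
            pairs.append(FunctionPair(name=name, source=src, decompiled=dec))
    return pairs


def to_jsonl(pairs: list) -> str:
    """Serialize the pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(p)) for p in pairs)


# Invented sample inputs standing in for parser and decompiler output.
source_funcs = {
    "add": "int add(int a, int b) { return a + b; }",
    "helper": "static void helper(void) {}",
}
decompiled_funcs = {
    "_add": "int _add(int param_1, int param_2) { return param_2 + param_1; }",
    "FUN_00101139": "void FUN_00101139(void) { return; }",
}

pairs = map_functions(source_funcs, decompiled_funcs)
print(to_jsonl(pairs))
```

In this sketch `helper` is left unpaired because its decompiled counterpart carries only a placeholder name, which shows why matching on names alone is a simplification.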
