ComPile: A Large IR Dataset from Production Sources

Aiden Grossman; Ludger Paehler; Konstantinos Parasyris; Tal Ben-Nun; Jacob Hegna; William Moses; Jose M Monsalve Diaz; Mircea Trofin; Johannes Doerfert

arXiv:2309.15432·cs.PL·May 1, 2024

Open Access · 2 Datasets

TL;DR

This paper introduces ComPile, a large dataset of 182 billion tokens of LLVM IR generated from real-world code across multiple languages, aiming to improve machine learning models and compiler tools by leveraging program structure.

Contribution

The paper presents a novel large-scale dataset of LLVM IR from production code, enabling better model training and compiler research by utilizing program structure.

Findings

01. The dataset contains 182 billion tokens of LLVM IR.

02. Demonstrates utility for large language model training and compiler introspection.

03. Shows promise for machine-learned compiler components.

Abstract

Code is increasingly becoming a core data modality of modern machine learning research impacting not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude, the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code does not take advantage of 70 years of programming language and compiler development by not utilizing the structure inherent to programs in the data collection. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To…

Taxonomy

Topics: Machine Learning and Data Classification · Software Engineering Research · Adversarial Robustness in Machine Learning