LLM Translation of Compiler Intermediate Representation
Andrea Valenzuela Ramirez, Cristian Gutierrez-Gomez, Marta Barroso, Dario Garcia-Gasulla, Sara Royuela

TL;DR
This paper introduces IRIS-14B, a large language model trained to translate GCC's GIMPLE IR to LLVM IR, significantly improving cross-toolchain interoperability in compiler workflows.
Contribution
IRIS-14B is presented as the first large-scale LLM trained specifically for IR-to-IR translation. It outperforms existing models and is intended to enable cross-toolchain integration.
Findings
IRIS-14B outperforms state-of-the-art models by up to 44 percentage points.
The model is trained on paired IRs from real-world C code and competitive programming problems.
The model is positioned as a building block for hybrid neuro-symbolic compiler architectures in cross-toolchain workflows.
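This summary does not describe how the paired IRs were produced, so the following is only a minimal sketch of one way to obtain a GIMPLE/LLVM IR pair from a single C file with the stock `gcc` and `clang` drivers. The helper names (`gimple_command`, `llvm_ir_command`, `pair_commands`), the output file naming, and the `-O0` optimization level are assumptions for illustration, not the authors' pipeline.

```python
import os
from pathlib import Path
from typing import List, Tuple


def gimple_command(source: Path, out: Path) -> List[str]:
    """Build a gcc command that dumps the GIMPLE form of `source` into `out`."""
    return [
        "gcc",
        "-O0",                          # assumed optimization level
        "-c", str(source),              # compile only, no linking
        f"-fdump-tree-gimple={out}",    # write the GIMPLE dump to this file
        "-o", os.devnull,               # discard the object file
    ]


def llvm_ir_command(source: Path, out: Path) -> List[str]:
    """Build a clang command that emits textual LLVM IR for `source` into `out`."""
    return [
        "clang",
        "-O0",                          # assumed optimization level
        "-S", "-emit-llvm",             # emit human-readable LLVM IR (.ll)
        str(source),
        "-o", str(out),
    ]


def pair_commands(source: Path, out_dir: Path) -> Tuple[List[str], List[str]]:
    """Return the (GIMPLE, LLVM IR) commands for one C source file."""
    stem = source.stem
    return (
        gimple_command(source, out_dir / f"{stem}.gimple"),
        llvm_ir_command(source, out_dir / f"{stem}.ll"),
    )


if __name__ == "__main__":
    gcc_cmd, clang_cmd = pair_commands(Path("example.c"), Path("pairs"))
    print(" ".join(gcc_cmd))
    print(" ".join(clang_cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where both compilers are installed. The two output files for the same source then form one training pair.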
Abstract
GCC and LLVM underpin much of modern software infrastructure, relying on distinct Intermediate Representations (IRs) to drive optimizations and code generation. However, the semantic and structural differences between these IRs create significant barriers for cross-toolchain interaction, limiting the reuse of compiler frontends, backends, and optimization pipelines across programming languages and compilation ecosystems. Traditional rule-based translators have attempted to bridge this gap, but their complexity and maintenance cost have hindered practical adoption. In this context, Large Language Models (LLMs) offer a data-driven alternative, capable of learning complex mappings between heterogeneous compiler IRs directly from sufficiently representative examples. To explore this approach, this paper presents IRIS-14B, a 14-billion-parameter…
