SuperCoder: Assembly Program Superoptimization with Large Language Models

Anjiang Wei; Tarun Suresh; Huanmi Tan; Yinglun Xu; Gagandeep Singh; Ke Wang; Alex Aiken

arXiv:2505.11480·cs.CL·February 2, 2026

TL;DR

This paper shows that large language models can serve as superoptimizers for assembly code: on a new benchmark, a model fine-tuned with reinforcement learning generates programs that run faster than gcc -O3 output.

Contribution

It introduces the first large-scale benchmark for assembly superoptimization and shows how fine-tuning LLMs with reinforcement learning significantly improves their performance.

Findings

1. Claude-opus-4 achieves a 51.5% test-passing rate and a 1.43x speedup.
2. SuperCoder attains 95.0% correctness and a 1.46x speedup after fine-tuning.
3. Reinforcement learning and iterative refinement enhance LLM superoptimization results.
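The two headline numbers in these findings, test-passing rate and average speedup, can be computed from per-program results roughly as follows. This is a minimal sketch, not the authors' evaluation harness: the `Result` record, its field names, and the convention that a failing candidate is scored as a 1.0x speedup (as if the baseline were kept) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Result:
    """Outcome for one benchmark program (hypothetical record layout)."""
    passed: bool           # candidate passed every test case
    baseline_time: float   # runtime of the baseline binary, in seconds
    candidate_time: float  # runtime of the candidate binary, in seconds


def pass_rate(results: Sequence[Result]) -> float:
    """Fraction of programs whose candidate passed all of its tests."""
    return sum(r.passed for r in results) / len(results)


def average_speedup(results: Sequence[Result]) -> float:
    """Mean of per-program speedups.

    Assumption: a failing candidate is scored as 1.0x, as if the
    baseline program were kept in its place.
    """
    def speedup(r: Result) -> float:
        if not r.passed:
            return 1.0
        return r.baseline_time / r.candidate_time

    return sum(speedup(r) for r in results) / len(results)
```

Scoring failures as 1.0x keeps an incorrect but fast program from inflating the average; other conventions (for example, averaging only over passing programs) would give different numbers.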

Abstract

Superoptimization is the task of transforming a program into a faster one while preserving its input-output behavior. In this work, we investigate whether large language models (LLMs) can serve as superoptimizers, generating assembly programs that outperform code already optimized by industry-standard compilers. We construct the first large-scale benchmark for this problem, consisting of 8,072 assembly programs averaging 130 lines, in contrast to prior datasets restricted to 2-15 straight-line, loop-free programs. We evaluate 23 LLMs on this benchmark and find that the strongest baseline, Claude-opus-4, achieves a 51.5% test-passing rate and a 1.43x average speedup over gcc -O3. To further enhance performance, we fine-tune models with reinforcement learning, optimizing a reward function that integrates correctness and performance speedup. Starting from Qwen2.5-Coder-7B-Instruct (61.4%…
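The abstract says the reward integrates correctness and speedup, but this excerpt is truncated before any formula is given. One hypothetical shape for such a reward, shown only to make the idea concrete, gives zero unless every test passes and otherwise returns the measured speedup. The function name, parameters, and the all-or-nothing gating are assumptions, not the paper's definition.

```python
def reward(tests_passed: int, tests_total: int,
           baseline_time: float, candidate_time: float) -> float:
    """Hypothetical correctness-gated speedup reward.

    Returns 0.0 if any test fails; otherwise returns
    baseline_time / candidate_time, so values above 1.0 mean the
    candidate is faster than the baseline.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if tests_passed < tests_total:
        return 0.0  # incorrect programs earn nothing, however fast
    if candidate_time <= 0:
        raise ValueError("candidate_time must be positive")
    return baseline_time / candidate_time
```

Gating on full correctness means a fast but wrong program is never rewarded; a softer variant could give partial credit for the fraction of tests passed.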

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper uses top closed-sourced (claude-opus-4/gpt-5) and open-sourced (DeepSeek-V3, Llama-4 etc.) models to study the impact of their dataset. It shows the gap between closed and open LLMs and how different adaptations can narrow it.
- The paper builds a realistic, large benchmark with ~130 LOC, and many with loops. This is rare for assembly-level work and targets an important problem.
- The analysis of learned programs provides qualitative insight into how LLMs are improving already optimi

Weaknesses

- The dataset deliberately samples programs with large -O0 -> -O3 gains; results might differ on other code distributions.
- The paper could benefit from automatically explaining why the programs are faster. Perhaps using a chain-of-thought model to optimize the program and output thinking tokens would help?

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. The topic is ambitious and relevant, connecting large language models, reinforcement learning, and compiler optimization.
2. The paper attempts to move beyond high-level code generation to a more demanding low-level optimization setting.
3. The experimental setup is relatively thorough and could stimulate future work on applying learning-based methods to compiler research.

Weaknesses

1. The task formulation is loosely defined, relying on test-based correctness rather than formal equivalence, which makes the conclusions uncertain.
2. The dataset is built from code of unclear quality and has not been validated as a standard benchmark. Its representativeness and reproducibility are questionable.
3. The reinforcement learning setup functions as a simple post-hoc reward filter rather than a meaningful sequential learning process. There is no comparison to simpler fine-tuning st

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper might be the first to evaluate LLMs on this task. (* I am not too familiar with the literature and cannot fully judge the novelty claims.)
- The reinforcement learned model matches the performance of Claude Opus 4 while having 7B parameters.

Weaknesses

- Limited scope: unlike code synthesis, where humans can review and fix the generated code, tasks like compilation and superoptimization are too hard to review manually. It is not clear what the use case of an unverified code superoptimizer may be, apart from very tight inner loops in high performance computing?
- Limited contribution: the evaluations are fairly limited (no "maximum speedup at k samples" scaling curves, no experimentation with prompts despite the analysis of failure modes).
- Li
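The "maximum speedup at k samples" curve this review asks for could be computed along the following lines. This is a sketch under stated assumptions rather than anything taken from the paper: each program has an ordered list of sampled candidates, a failed sample is recorded as `None`, the baseline (1.0x) is always available as a fallback, and the curve uses the first k samples instead of an unbiased estimator over all subsets. The function name `best_of_k_curve` is hypothetical.

```python
from typing import Optional, Sequence


def best_of_k_curve(
    samples: Sequence[Sequence[Optional[float]]],
    max_k: int,
) -> list[float]:
    """Mean best-of-k speedup across programs, for k = 1..max_k.

    samples[i][j] is the speedup of the j-th sampled candidate for
    program i, or None if that candidate failed its tests.
    """
    curve = []
    for k in range(1, max_k + 1):
        total = 0.0
        for program_samples in samples:
            valid = [s for s in program_samples[:k] if s is not None]
            # Assumption: keeping the baseline is always an option,
            # so the best achievable speedup is never below 1.0.
            total += max([1.0, *valid])
        curve.append(total / len(samples))
    return curve
```

For example, with two programs and three samples each, `best_of_k_curve([[None, 1.5, 1.2], [0.8, None, 2.0]], 3)` yields one mean speedup per value of k, and the curve is non-decreasing by construction.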

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Logic, programming, and type systems · Software Engineering Research