MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization
Jiefu Ou, Sapana Chaudhary, Kaj Bostrom, Nathaniel Weir, Shuai Zhang, Huzefa Rangwala, George Karypis

TL;DR
MaxCode introduces a reinforcement learning framework that guides large language models to generate optimized code by leveraging execution feedback, natural language critiques, and reward-based reranking, significantly improving code performance.
Contribution
The paper presents MaxCode, a novel max-reward reinforcement learning approach that unifies search methods, incorporates feedback-driven diagnostics, and enhances exploration for automated code optimization.
Findings
Achieves 20.3% speedup improvement on CUDA benchmarks.
Attains 10.1% relative ranking improvement in code optimization.
Demonstrates effectiveness across CUDA and C++ benchmarks.
Abstract
Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) writing optimized code (such as performant CUDA kernels and competition-level CPU code) is complex and requires expertise in systems, algorithms, and specific languages, and (ii) it requires interpreting performance metrics such as timing and device utilization, beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode, unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution…
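The iterative refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Candidate`, `max_reward_search`, `toy_propose`, and `toy_evaluate` are hypothetical, the LLM generator and the benchmark harness are replaced by toy functions over integers, and the greedy choice of the next state stands in for whatever value-guided expansion the paper uses. The one property it aims to show is the max-reward objective: the search returns the best candidate seen anywhere along the trajectory, not the last one.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    code: str        # candidate program text
    reward: float    # e.g. measured speedup; higher is better
    feedback: str    # observation passed to the generator on the next step


def max_reward_search(
    initial: Candidate,
    propose: Callable[[Candidate], List[str]],
    evaluate: Callable[[str], Tuple[float, str]],
    steps: int,
) -> Candidate:
    """Refine iteratively and return the best-so-far candidate (max reward)."""
    best = initial
    current = initial
    for _ in range(steps):
        children = []
        for code in propose(current):
            reward, feedback = evaluate(code)
            children.append(Candidate(code, reward, feedback))
        if not children:
            break
        # Assumption: expand greedily from the highest-reward child.
        current = max(children, key=lambda c: c.reward)
        # Max-reward bookkeeping: keep the best candidate ever observed.
        if current.reward > best.reward:
            best = current
    return best


# Toy stand-ins (hypothetical): "programs" are integers written as strings,
# and the reward peaks at 5.
def toy_propose(parent: Candidate) -> List[str]:
    n = int(parent.code)
    return [str(n - 1), str(n + 1)]


def toy_evaluate(code: str) -> Tuple[float, str]:
    n = int(code)
    reward = float(-((n - 5) ** 2))
    # A critique model would turn this raw number into a diagnostic message.
    return reward, f"raw reward: {reward}"


start_reward, start_feedback = toy_evaluate("0")
start = Candidate("0", start_reward, start_feedback)
result = max_reward_search(start, toy_propose, toy_evaluate, steps=8)
```

In this toy run the search reaches the optimum partway through and then drifts off it, yet `result` still holds the optimum because only the running maximum is returned. Swapping `toy_evaluate` for a function that also produces a natural language critique would change the observation without touching the search loop, which is the kind of modularity the abstract describes.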
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. The topic and direction are practical and promising. Code optimization is important for CUDA and C++.
2. The paper is well-written, and the motivation is clear.

1. The presentation needs to be improved, especially Figure 1. The code in the figure is not clear.
2. Commas and periods are missing in the formulas, e.g., Eq. (3), (4). In Lines 272-282, there is no Eq number.
3. The evaluation data is limited. The results are evaluated on only two benchmarks.
4. Figures 3 and 4 are not clear.
5. The proposed method is not novel. It seems to combine the search algorithm and RL.
6. The baselines are limited. Please compare with both open-source and closed-sou
- The paper tackles the problem of optimizing code efficiency via LLMs by reframing inference-time search under a unified max-reward RL perspective.
- The formulation is modular and general, making it easy to plug into existing search pipelines such as CUDA-LLM or Effi-Learner, with consistent empirical gains across both CUDA and C++ domains.
- The critique-augmented observation design is intuitive yet effective, improving exploration quality without modifying base model weights.
- The experimental

- The proposed max-reward RL formulation mainly reinterprets existing search heuristics under a unified lens. While clean and modular, it does not introduce a fundamentally new learning algorithm or search operator beyond the combination of best-so-far reward and critique-based observation.
- The section describing the generative value/reward-to-go model contradicts its reported results: the text claims underperformance on KernelBench-L1, yet Table 2 shows a clear gain. This discrepancy weakens

- Problem relevance: Inference-time optimization for LLM-generated code is practical and timely; even small speedups can be impactful in deployment.
- Conceptual unification: Recasting existing refinement methods under a max-reward RL framework offers a common lens and a value-guided expansion heuristic that can plug into multiple search variants.
- Empirical signal: On KernelBench/PIE, integrating MaxCode yields consistent but incremental improvements overall, especially with CUDA-LLM.

- The framework mainly adapts existing iteration-based methods and execution-feedback-based prompt strategies; the max-reward RL formulation appears conceptual rather than introducing new algorithms.
- Several core elements (e.g., critique usage, value estimation mechanism, Q-function applicability) are only loosely described, limiting reproducibility and technical insight.
- Performance gains are small and inconsistent, baselines lack diversity beyond execution-feedback approaches, and ablation

- **Actionable Feedback Loop:** Raw execution feedback (e.g., "20% slower than the baseline") is often unhelpful. Translating this into a diagnostic natural language critique (e.g., "probable memory bandwidth bottleneck, consider fusing operations") provides a much richer and more actionable signal for the generator LLM's iterative refinement.
- **Strong Empirical Results:** The method achieves significant relative speedup improvements over strong baselines. The ablation studies validated that t

- **Critique Model as a Black Box:** The critique model is central to the paper's positive results, but it is treated as a given. There is no analysis of the quality of the critiques, common failure modes (e.g., does it ever "hallucinate" a bottleneck?), or whether a much smaller, fine-tuned, or distilled model could serve this purpose at a fraction of the cost.
- **Unanalyzed Computational Cost:** The framework introduces significant computational overhead. At each step, it requires an inferenc
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Software Engineering Research · Advanced Neural Network Applications
