From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph

Junfeng Gong; Zhiyi Wei; Junying Chen; Cheng Liu; Huawei Li

arXiv:2510.19873·cs.LG·October 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces ReGraphT, a retrieval-augmented framework that transfers reasoning capabilities from large language models to smaller models for CUDA optimization, improving performance while maintaining privacy and efficiency.

Contribution

ReGraphT is a novel, training-free method that organizes CUDA optimization trajectories into a reasoning graph and uses Monte Carlo Graph Search to enhance small models' reasoning abilities.
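To make the idea concrete, here is a minimal sketch of merging optimization trajectories into a graph and walking it with a UCB1-style search. It is an illustration under stated assumptions, not the paper's algorithm: the class `ReasoningGraph`, the technique names, the `HYPOTHETICAL_GAIN` table, and the reward function are all invented for this example. A real reward would come from compiling and timing the generated kernel.

```python
import math
from collections import defaultdict


class ReasoningGraph:
    """Toy graph: nodes are technique names, edges are observed transitions."""

    def __init__(self):
        self.edges = defaultdict(set)    # technique -> set of next techniques
        self.visits = defaultdict(int)   # (src, dst) -> number of traversals
        self.value = defaultdict(float)  # (src, dst) -> accumulated reward

    def add_trajectory(self, steps):
        """Merge one trajectory (ordered technique names) into the graph."""
        path = ["START"] + list(steps)
        for src, dst in zip(path, path[1:]):
            self.edges[src].add(dst)

    def select(self, node, seen, c=1.4):
        """Choose the next technique with a UCB1-style score, skipping visited nodes."""
        children = [ch for ch in sorted(self.edges[node]) if ch not in seen]
        if not children:
            return None
        total = sum(self.visits[(node, ch)] for ch in children)

        def score(ch):
            n = self.visits[(node, ch)]
            if n == 0:
                return math.inf  # always try an untested edge first
            return self.value[(node, ch)] / n + c * math.sqrt(math.log(total) / n)

        return max(children, key=score)

    def search(self, reward_fn, iterations=50, max_depth=4):
        """Repeatedly walk the graph, score each path, and back up the reward."""
        best_path, best_reward = [], float("-inf")
        for _ in range(iterations):
            node, path, seen = "START", [], set()
            while len(path) < max_depth:
                nxt = self.select(node, seen)
                if nxt is None:
                    break
                path.append(nxt)
                seen.add(nxt)
                node = nxt
            reward = reward_fn(path)
            prev = "START"
            for step in path:  # backpropagate along the traversed edges
                self.visits[(prev, step)] += 1
                self.value[(prev, step)] += reward
                prev = step
            if reward > best_reward:
                best_path, best_reward = path, reward
        return best_path, best_reward


# Hypothetical per-technique gains, standing in for measured kernel speedups.
HYPOTHETICAL_GAIN = {
    "tiling": 1.5,
    "shared_memory": 2.0,
    "unrolling": 1.25,
    "coalescing": 1.25,
}


def toy_reward(path):
    """Assumed reward: product of the made-up gains along the path."""
    return math.prod(HYPOTHETICAL_GAIN[step] for step in path)


graph = ReasoningGraph()
graph.add_trajectory(["tiling", "shared_memory", "unrolling"])
graph.add_trajectory(["coalescing", "shared_memory"])
graph.add_trajectory(["tiling", "unrolling"])

best_path, best_reward = graph.search(toy_reward)
```

The sketch tracks statistics per edge rather than per node because the same technique can be reached from different predecessors, and the `seen` set keeps a walk from looping when the merged graph contains cycles.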

Findings

01

ReGraphT achieves 2.33X speedup on CUDAEval and ParEval benchmarks.

02

ReGraphT enables small models to approach LLM-level performance.

03

The framework outperforms fine-tuned HPC models and other retrieval methods.

Abstract

Despite significant evolution of CUDA programming and domain-specific libraries, effectively utilizing GPUs with massively parallel engines remains difficult. Large language models (LLMs) show strong potential in generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, and local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can achieve performance comparable to LLMs on specific tasks. While SLMs can match LLMs on domain-specific tasks, their limited reasoning abilities lead to suboptimal performance in complex CUDA generation according to our experiments. To bridge this gap, we propose ReGraphT, a…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

Strengths by dimension

Originality

* Introduces a training‑free way to transfer multi‑step optimization “know‑how” from an LLM to small code models by turning LLM‑generated optimization trajectories into a reusable CUDA Reasoning Graph and casting CUDA generation as graph traversal. This graph abstraction (nodes as optimization techniques, edges as validated transitions with examples) and the merge procedure are clearly new in the CUDA‑code LLM literature. The formal definition, Algorithm 1, a

Weaknesses

1. Positioning vs prior search‑based reasoning is underdeveloped. The paper adapts MCTS to a cyclic “reasoning graph,” but the case for novelty over existing search‑guided generation is thin. Closest neighbors include MCTS‑style reasoning for code and RAG (e.g., RethinkMCTS for code generation; MCTS‑RAG), and iterative search over program transformations in compiler auto‑scheduling (e.g., Halide, TVM). The paper cites these lines of work but does not empirically contrast against them nor arti

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Practical Relevance and Problem Significance: The paper tackles a real and important problem. Optimizing code for parallel architectures like GPUs is a critical bottleneck in high-performance computing. Making this capability accessible via smaller, locally deployable models has significant practical implications for developer productivity, code privacy, and computational cost. The training-free nature of the framework further enhances its practicality.
- Thorough and Comprehensive Evaluation:

Weaknesses

- Concern about the generalizability of the reasoning graph: While the paper demonstrates that the graph's structure converges, its generalizability to out-of-distribution problems is not fully explored. The graph is built from a dataset of 10K CUDA files filtered down significantly. It is unclear how well a single, pre-constructed graph would perform on CUDA tasks from entirely different domains (e.g., scientific simulation vs. deep learning kernels) that might require novel optimization patterns not

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The method achieves near-LLM performance.
- It allows for local deployment, though it is unclear how important this is for CUDA code generation. Local deployment would matter more for private conversations; in this context it is not really needed, so why not use a bigger model?

Weaknesses

- The code link is empty.
- The paper content leans toward systems and engineering design rather than core theoretical or algorithmic ML contributions, so it doesn't seem appropriate for ICLR.
- The reasoning transfer mechanism is algorithmic plumbing rather than a model innovation.
- It is more of an engineering project report.

Taxonomy

Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Big Data and Digital Economy