StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Shiyang Li; Zijian Zhang; Winson Chen; Yuebo Luo; Mingyi Hong; Caiwen Ding

arXiv:2603.02637·cs.MA·March 4, 2026

TL;DR

StitchCUDA is a multi-agent framework that automates end-to-end GPU program generation using rubric-based reinforcement learning, significantly improving success rates and performance over existing methods.

Contribution

The paper introduces StitchCUDA, a novel multi-agent system with rubric-based reinforcement learning for end-to-end GPU programming, addressing limitations of prior single-kernel optimization approaches.

Findings

1. Achieves a nearly 100% success rate on end-to-end GPU programming tasks

2. Provides a 1.72x speedup over the multi-agent baseline

3. Outperforms RL model baselines by 2.73x

Abstract

Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging because it depends on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise for automated GPU kernel generation, prior work mainly focuses on single-kernel optimization and does not extend to end-to-end programs, hindering practical deployment. To address this challenge, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation with three specialized agents: a Planner that orchestrates the whole system design, a Coder dedicated to implementing it step by step, and a Verifier for correctness checking and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic…
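The division of labor among the three agents can be pictured as a plan, implement, verify loop. The following is a minimal sketch under stated assumptions, not the paper's implementation: `run_pipeline`, `Report`, the retry limit, and the toy agents are all hypothetical names introduced for illustration, and the toy Verifier only checks a text property rather than running Nsys/NCU.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Report:
    """Hypothetical verifier output: a pass/fail flag plus textual feedback."""
    correct: bool
    feedback: str


def run_pipeline(
    task: str,
    planner: Callable[[str], List[str]],
    coder: Callable[[str, str, str], str],
    verifier: Callable[[str], Report],
    max_attempts: int = 3,  # assumed retry budget, not taken from the paper
) -> Tuple[str, List[Report]]:
    """Run an assumed Planner -> Coder -> Verifier loop, one plan step at a time."""
    program = ""
    reports: List[Report] = []
    for step in planner(task):
        feedback = ""
        for _ in range(max_attempts):
            # The Coder sees the current step, the accepted program so far,
            # and the Verifier's feedback from the previous failed attempt.
            candidate = coder(step, program, feedback)
            report = verifier(candidate)
            reports.append(report)
            if report.correct:
                program = candidate  # accept and move on to the next step
                break
            feedback = report.feedback
        else:
            raise RuntimeError(f"step {step!r} failed after {max_attempts} attempts")
    return program, reports


# Toy stand-ins for the three agents (illustrative only).
def toy_planner(task: str) -> List[str]:
    return [f"{task}: kernel", f"{task}: host launch"]


def toy_coder(step: str, program: str, feedback: str) -> str:
    # The first attempt omits the terminator; a retry with feedback adds it.
    line = f"// {step}" + (";" if feedback else "")
    return program + line + "\n"


def toy_verifier(candidate: str) -> Report:
    ok = all(line.endswith(";") for line in candidate.splitlines())
    return Report(ok, "" if ok else "missing ';'")


program, reports = run_pipeline("matmul", toy_planner, toy_coder, toy_verifier)
```

In this toy run each plan step fails once and succeeds on the retry, which shows how Verifier feedback flows back into the Coder before the next step begins.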

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Reinforcement Learning in Robotics