KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Kris Shengjun Dong; Sahil Modi; Dima Nikiforov; Sana Damani; Edward Lin; Siva Kumar Sastry Hari; Christos Kozyrakis

arXiv:2602.14293 · cs.LG · February 17, 2026

TL;DR

KernelBlaster is a memory-augmented in-context reinforcement learning framework that improves CUDA code optimization across GPU generations by letting LLM-based coding agents learn from past exploration, with reported speedups of up to 2.50x on KernelBench.

Contribution

The paper proposes a Memory-Augmented In-Context Reinforcement Learning (MAIC-RL) framework with a persistent knowledge base for CUDA optimization, enabling systematic exploration and transfer of knowledge across GPU architectures.
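
As a rough illustration of what a persistent, retrievable knowledge base could look like, the following Python sketch stores observed speedups per optimization and ranks them on retrieval. The class `KnowledgeBase`, its JSON file layout, the bottleneck labels, and the mean-speedup ranking are all assumptions made for illustration; the summary here does not specify how KernelBlaster structures or queries its knowledge base.

```python
import json
from collections import defaultdict
from pathlib import Path


class KnowledgeBase:
    """Hypothetical store: bottleneck label -> optimization -> observed speedups."""

    def __init__(self, path):
        self.path = Path(path)
        # records[bottleneck][optimization] is a list of observed speedups.
        self.records = defaultdict(lambda: defaultdict(list))
        # Reload earlier experience so knowledge persists across tasks.
        if self.path.exists():
            saved = json.loads(self.path.read_text())
            for bottleneck, opts in saved.items():
                for opt, speedups in opts.items():
                    self.records[bottleneck][opt].extend(speedups)

    def record(self, bottleneck, optimization, speedup):
        """Append one measured outcome and write the store back to disk."""
        self.records[bottleneck][optimization].append(float(speedup))
        self.path.write_text(json.dumps(self.records))

    def retrieve(self, bottleneck, top_k=3):
        """Return up to top_k (optimization, mean speedup) pairs, best first."""
        opts = self.records.get(bottleneck, {})
        ranked = sorted(
            ((opt, sum(s) / len(s)) for opt, s in opts.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        return ranked[:top_k]
```

Under these assumptions, recording speedups of 2.0 and 1.0 for one optimization under a given bottleneck label gives it a mean of 1.5 on retrieval. An agent could place the retrieved entries in its prompt before proposing the next kernel rewrite, which is one plausible way for accumulated experience to inform later tasks without fine-tuning the model.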

Findings

01. Achieves up to 2.50x speedup on KernelBench levels
02. Outperforms baseline by 1.43x on average
03. Provides open-source framework for reproducible CUDA optimization

Abstract

Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques