CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Han Guo; Jack Zhang; Arjun Menon; Driss Guessous; Vijay Thakkar; Yoon Kim; Tri Dao

arXiv:2605.19269 · cs.LG · May 21, 2026

TL;DR

CODA is a GPU kernel abstraction that reparameterizes many Transformer operators as GEMM-plus-epilogue programs, so they execute while the GEMM output tile is still on chip and fewer intermediate tensors travel through global memory.

Contribution

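The paper presents CODA, a GPU kernel abstraction that expresses memory-bound, non-attention Transformer computations (normalization, activations, residual updates, reductions) as GEMM-plus-epilogue programs. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives.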

Findings

01. CODA kernels achieve high performance on Transformer workloads.

02. GEMM-plus-epilogue programming covers nearly all non-attention computations.

03. The abstraction maintains the performance structure of expert-written GEMMs.

Abstract

Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling,…
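The idea of reparameterizing an operator so that it runs in a GEMM epilogue can be illustrated with a small sketch. The example below is not CODA's API or the authors' implementation; it is a pure-Python toy that assumes one well-known identity, RMSNorm(x) W = diag(1/rms(x)) · (x · diag(g) W), and uses hypothetical names (`gemm_with_epilogue`, `fused`, `unfused`). The normalization gain is folded into the weight, and the per-row scale plus a residual add are applied to each output tile before it is stored. The per-row inverse RMS is computed in a separate pass here purely for simplicity.

```python
import math
import random


def matmul(a, b):
    """Reference GEMM on lists of lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inv_rms(row, eps=1e-6):
    """1 / sqrt(mean(row^2) + eps), the per-row RMSNorm scale."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def unfused(x, gain, w, residual):
    """Three separate passes: RMSNorm, GEMM, residual add."""
    normed = [[v * inv_rms(row) * g for v, g in zip(row, gain)] for row in x]
    y = matmul(normed, w)
    return [[a + r for a, r in zip(y_row, r_row)] for y_row, r_row in zip(y, residual)]


def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled GEMM with a fixed mainloop and a caller-supplied epilogue.

    epilogue(i, j, acc) is applied to every accumulator of a tile
    before that tile is written to the output (a stand-in for "on chip").
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Mainloop: accumulate the full K reduction for this tile.
            acc = {(i, j): sum(a[i][p] * b[p][j] for p in range(k))
                   for i in rows for j in cols}
            # Epilogue: transform the tile, then store it once.
            for (i, j), value in acc.items():
                out[i][j] = epilogue(i, j, value)
    return out


def fused(x, gain, w, residual):
    """Same result as unfused(), expressed as GEMM plus epilogue."""
    # Fold the gain into the weight: diag(gain) @ w (can be done once, offline).
    w_scaled = [[g * v for v in row] for g, row in zip(gain, w)]
    # Per-row scale; computed in a separate pass in this toy version.
    scale = [inv_rms(row) for row in x]

    def epilogue(i, j, acc):
        # Row scaling (the normalization) and the residual add.
        return acc * scale[i] + residual[i][j]

    return gemm_with_epilogue(x, w_scaled, epilogue)


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


if __name__ == "__main__":
    random.seed(0)
    m, k, n = 5, 4, 3
    rand = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    x, w, res = rand(m, k), rand(k, n), rand(m, n)
    g = [random.uniform(0.5, 1.5) for _ in range(k)]
    print("max |fused - unfused| =", max_abs_diff(fused(x, g, w, res), unfused(x, g, w, res)))
```

In the unfused version the normalized activations and the GEMM output are each materialized as full intermediate matrices; in the fused version each output tile is transformed and written once, which is the kind of data-movement saving the abstract describes.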
