CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, Tri Dao

TL;DR
CODA introduces a GPU kernel abstraction that reparameterizes many Transformer operators as GEMM-plus-epilogue programs, improving efficiency by reducing memory movement.
Contribution
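The paper presents CODA, a GPU kernel abstraction that expresses memory-bound, non-attention Transformer computations (normalization, activations, residual updates, reductions) as GEMM-plus-epilogue programs, so that they execute while the GEMM output tile is still on chip instead of running as separate kernels.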
Findings
CODA kernels achieve high performance on Transformer workloads.
GEMM-plus-epilogue programming covers nearly all non-attention computations.
The abstraction maintains the performance structure of expert-written GEMMs.
Abstract
Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling,…
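To make the idea of algebraic reparameterization concrete, here is a minimal NumPy sketch of one such rewrite: an RMSNorm followed by a linear layer, expressed as a single GEMM whose epilogue rescales each output row. This is an illustration under stated assumptions, not CODA's actual API or kernels. The function names (`rmsnorm_then_linear`, `gemm_with_epilogue`, `fused_rmsnorm_linear`), the tile sizes, and the choice of RMSNorm as the example operator are all hypothetical. The Python loops only mimic the tiling of a GPU GEMM, and the sketch computes the per-row scale up front for simplicity.

```python
import numpy as np

def rmsnorm_then_linear(x, g, w, eps=1e-6):
    """Reference path: a separate normalization step followed by a GEMM."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return ((x / rms) * g) @ w

def gemm_with_epilogue(x, w_folded, row_scale, tile_m=4, tile_n=4):
    """Tiled GEMM whose epilogue scales each output tile before storing it."""
    m, n = x.shape[0], w_folded.shape[1]
    out = np.empty((m, n), dtype=x.dtype)
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            # Mainloop: plain matrix product for this output tile.
            acc = x[i:i + tile_m] @ w_folded[:, j:j + tile_n]
            # Epilogue: row-wise scaling applied while the tile is "on chip".
            acc = acc * row_scale[i:i + tile_m]
            out[i:i + tile_m, j:j + tile_n] = acc
    return out

def fused_rmsnorm_linear(x, g, w, eps=1e-6):
    """Reparameterized path: fold the gain into W, defer 1/rms to the epilogue."""
    w_folded = g[:, None] * w  # diag(g) @ W, computed once per weight update
    row_scale = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gemm_with_epilogue(x, w_folded, row_scale)
```

The identity used is that `((x / rms) * g) @ w` equals `(x @ (g[:, None] * w)) / rms`, because the per-row factor `1 / rms` commutes with the matrix product. In this rewrite the normalized activation tensor is never materialized, which is the kind of memory traffic the abstract describes as the bottleneck.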
