Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Rupanshu Soi; Rohan Yadav; Fredrik Kjolstad; Alex Aiken; Maryam Mehri Dehnavi; Michael Garland; Michael Bauer

arXiv:2512.18134 · cs.PL · December 23, 2025

TL;DR

This paper presents Twill, a system that automatically derives optimal software pipelining and warp specialization schedules for GPU programs, improving utilization of complex GPU architectures.

Contribution

The paper formulates software pipelining (SWP) and warp specialization (WS) as a joint optimization problem and implements Twill, which the authors describe as the first system to automatically generate guaranteed-optimal schedules for iterative GPU programs.

Findings

01. Twill successfully rediscovered expert-designed schedules for Flash Attention.

02. Twill guarantees optimal schedules across different GPU architectures.

03. The approach is heuristic-free and easily adaptable to new architectures.

Abstract

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization…
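
To make the idea of software pipelining concrete, the following is a minimal timing sketch, not Twill's formulation or anything taken from the paper. It assumes a loop whose iterations each perform one load followed by one compute step, that loads and computes run on separate units (loosely, a data-movement unit and a matrix unit), that each unit handles one operation at a time, and that buffering between them is unlimited. The function names and cycle counts are hypothetical and chosen only for illustration.

```python
def sequential_time(n_iters: int, load_cycles: int, compute_cycles: int) -> int:
    """Total cycles when each iteration's load and compute run back to back."""
    return n_iters * (load_cycles + compute_cycles)


def pipelined_time(n_iters: int, load_cycles: int, compute_cycles: int) -> int:
    """Total cycles when the load for a later iteration overlaps earlier computes.

    Assumption (illustrative only): loads are issued back to back on their own
    unit, and a compute starts once its data has arrived and the compute unit
    has finished the previous iteration.
    """
    load_done = 0      # cycle at which the most recent load finished
    compute_done = 0   # cycle at which the most recent compute finished
    for _ in range(n_iters):
        load_done += load_cycles
        compute_done = max(load_done, compute_done) + compute_cycles
    return compute_done


if __name__ == "__main__":
    n, load, compute = 4, 3, 5  # hypothetical cycle counts
    print("sequential:", sequential_time(n, load, compute))
    print("pipelined: ", pipelined_time(n, load, compute))
```

Under these assumptions the pipelined total is one fill-up of the shorter stage plus `n_iters` times the longer stage, so the slower unit stays busy in steady state. Real schedules must also respect buffer capacity, synchronization cost, and the assignment of work to warps, which this sketch ignores.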

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management