Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance
David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar

TL;DR
Kernel looping is an optimization technique that reduces synchronization overheads between kernel calls in AI inference on dataflow architectures, boosting token generation speed and hardware utilization.
Contribution
This paper introduces kernel looping, a specialized global optimization that eliminates synchronization costs between consecutive calls to the same kernel in modern dataflow architectures for AI inference.
Findings
Speeds up decode phase by up to 2.2× on SN40L
Achieves up to 2.5× speedup across multiple sockets
Enables over 90% of peak performance on 8 and 16 sockets
Abstract
Token generation speed is critical to power the next wave of AI inference applications. GPUs significantly underperform during token generation due to synchronization overheads at kernel boundaries, utilizing only 21% of their peak memory bandwidth. While recent dataflow architectures mitigate these overheads by enabling aggressive fusion of decoder layers into a single kernel, they too leave performance on the table due to synchronization penalties at layer boundaries. This paper presents kernel looping, a specialized global optimization technique which exploits an optimization opportunity brought by combining the unique layer-level fusion possible in modern dataflow architectures with the repeated layer structure found in language models. Kernel looping eliminates synchronization costs between consecutive calls to the same kernel by transforming these calls into a single call to a…
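The core idea in the abstract, replacing consecutive calls to the same kernel with a single call, can be illustrated with a toy model. The sketch below is not the paper's implementation: `decoder_layer`, `SyncCounter`, and both runner functions are hypothetical names, and it assumes (as the name "kernel looping" suggests) that the single call iterates over the layers internally. It only counts synchronization points; it does not model real hardware cost.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyncCounter:
    """Counts synchronization points (a stand-in for real boundary overhead)."""
    count: int = 0

    def sync(self) -> None:
        self.count += 1


def decoder_layer(x: float, weight: float) -> float:
    """Toy 'layer': hypothetical stand-in for one fused decoder-layer kernel."""
    return x * weight + 1.0


def run_per_layer_calls(x: float, weights: List[float], counter: SyncCounter) -> float:
    """Baseline: one kernel call per layer, synchronizing after every call."""
    for w in weights:
        x = decoder_layer(x, w)
        counter.sync()  # boundary between consecutive kernel calls
    return x


def run_looped_kernel(x: float, weights: List[float], counter: SyncCounter) -> float:
    """Looped variant: the loop over layers lives inside a single call."""
    for w in weights:
        x = decoder_layer(x, w)
    counter.sync()  # only one boundary, after the whole loop
    return x


weights = [0.5, 2.0, 1.5, 1.0]
base_counter, loop_counter = SyncCounter(), SyncCounter()
y_base = run_per_layer_calls(1.0, weights, base_counter)
y_loop = run_looped_kernel(1.0, weights, loop_counter)
print(y_base, y_loop, base_counter.count, loop_counter.count)
```

Both variants compute the same result; the difference in this toy model is that the number of synchronization points drops from one per layer to one per sequence of layers.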
Taxonomy
Topics: Advanced Data Compression Techniques · Neural Networks and Applications · Image Retrieval and Classification Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
