Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance

David Koeplinger; Darshan Gandhi; Pushkar Nandkar; Nathan Sheeley; Matheen Musaddiq; Leon Zhang; Reid Goodbar; Matthew Shaffer; Han Wang; Angela Wang; Mingran Wang; Raghu Prabhakar

arXiv:2410.23668 · cs.CL · November 1, 2024

TL;DR

Kernel looping is an optimization technique that removes synchronization overheads between repeated kernel calls in AI inference on dataflow architectures, raising token generation speed and hardware utilization.

Contribution

This paper introduces kernel looping, a specialized global optimization that eliminates synchronization costs between consecutive calls to the same kernel on modern dataflow architectures for AI inference.

Findings

01 Speeds up the decode phase by up to 2.2× on SN40L

02 Achieves up to 2.5× speedup across multiple sockets

03 Enables over 90% of peak performance on 8 and 16 sockets

Abstract

Token generation speed is critical to power the next wave of AI inference applications. GPUs significantly underperform during token generation due to synchronization overheads at kernel boundaries, utilizing only 21% of their peak memory bandwidth. While recent dataflow architectures mitigate these overheads by enabling aggressive fusion of decoder layers into a single kernel, they too leave performance on the table due to synchronization penalties at layer boundaries. This paper presents kernel looping, a specialized global optimization technique which exploits an optimization opportunity brought by combining the unique layer-level fusion possible in modern dataflow architectures with the repeated layer structure found in language models. Kernel looping eliminates synchronization costs between consecutive calls to the same kernel by transforming these calls into a single call to a…
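The core idea in the abstract, replacing many consecutive calls to the same kernel with a single call, can be illustrated with a toy model. The sketch below is not the paper's implementation and does not model SN40L hardware: `Device`, `launch`, `toy_layer`, and the two decode functions are hypothetical names introduced only for illustration, and each kernel launch is assumed to cost exactly one synchronization boundary.

```python
from typing import List


def toy_layer(x: float, w: float) -> float:
    """Stand-in for one fused decoder layer (hypothetical arithmetic)."""
    return x * w + 1.0


class Device:
    """Hypothetical device that counts one sync boundary per kernel launch."""

    def __init__(self) -> None:
        self.syncs = 0

    def launch(self, kernel, *args):
        result = kernel(*args)
        self.syncs += 1  # assumed: every launch ends in a synchronization
        return result


def looped_kernel(x: float, weights: List[float]) -> float:
    """Single kernel whose body iterates over all layers internally."""
    for w in weights:
        x = toy_layer(x, w)
    return x


def decode_per_layer(device: Device, x: float, weights: List[float]) -> float:
    """Baseline: one kernel call per layer, so one sync per layer."""
    for w in weights:
        x = device.launch(toy_layer, x, w)
    return x


def decode_looped(device: Device, x: float, weights: List[float]) -> float:
    """Looped variant: one kernel call for all layers, so one sync total."""
    return device.launch(looped_kernel, x, weights)


if __name__ == "__main__":
    weights = [0.5, 2.0, 1.5]
    baseline, looped = Device(), Device()
    y_base = decode_per_layer(baseline, 1.0, weights)
    y_loop = decode_looped(looped, 1.0, weights)
    print(y_base == y_loop, baseline.syncs, looped.syncs)
```

Under these assumptions both variants compute the same result, while the number of synchronization boundaries falls from one per layer to one per call. The speedups reported above depend on hardware details that this sketch does not capture.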

Taxonomy

Topics: Advanced Data Compression Techniques · Neural Networks and Applications · Image Retrieval and Classification Techniques