ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding
ChiHeng Jin, Hongche Yu, Xihui Chen

TL;DR
ClusterFusion++ is a CUDA extension that fuses the full Transformer decoder block for GPT-NeoX/Pythia decoding and adds a CUDA-Graph-compatible execution mode, improving throughput by 1.34x for Pythia-2.8B on an NVIDIA RTX 5090-class GPU.
Contribution
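It extends cluster-level fusion from the attention-side operators covered by prior work to the entire Transformer decoder block, and introduces a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead.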
Findings
Achieves a 1.34x throughput increase for Pythia-2.8B on an NVIDIA RTX 5090-class GPU, with similar gains reported for Pythia-6.9B.
Maintains high output fidelity with near-token-identical generation.
Broadens fusion scope to include all components of the Transformer decoder block.
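To put the headline number in per-token terms, the short sketch below converts a throughput speedup into the fraction of decode time saved. It assumes throughput means tokens per second at a fixed batch size, so time per token is the reciprocal of throughput; the function name `time_reduction` is illustrative and does not come from the paper.

```python
def time_reduction(speedup: float) -> float:
    """Fraction of per-token time saved for a given throughput speedup.

    Assumes time per token is the reciprocal of throughput.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# 1.34x throughput corresponds to roughly a quarter less time per token.
print(f"{time_reduction(1.34):.1%}")  # prints "25.4%"
```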
Abstract
Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B,…
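The stage order listed in the abstract can be followed in a plain, unfused reference of one decode step. This is a minimal sketch and not the authors' kernel. It assumes a single attention head, no biases, LayerNorm without learned scale or shift, rotary embedding applied to the whole head, and the parallel-residual form common in GPT-NeoX configurations. All names (`decode_block`, the weight keys) are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    # LayerNorm with identity scale/shift (assumption made for brevity).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rope(v, pos, base=10000.0):
    # Rotate-half rotary embedding over an even-length vector.
    half = len(v) // 2
    out = list(v)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = v[i] * c - v[i + half] * s
        out[i + half] = v[i + half] * c + v[i] * s
    return out


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def decode_block(x, pos, w, k_cache, v_cache):
    # LayerNorm -> QKV projection
    h = layer_norm(x)
    q, k, v = matvec(w["wq"], h), matvec(w["wk"], h), matvec(w["wv"], h)
    # RoPE on the query and the new key, then extend the KV cache
    q, k = rope(q, pos), rope(k, pos)
    k_cache.append(k)
    v_cache.append(v)
    # Decode attention: one query against every cached key/value
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, kc)) for kc in k_cache]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    ctx = [sum(e / total * vc[j] for e, vc in zip(exps, v_cache))
           for j in range(len(v))]
    # Output projection
    attn_out = matvec(w["wo"], ctx)
    # Post-attention LayerNorm -> MLP (parallel-residual assumption:
    # the MLP reads the block input, not the attention output)
    h2 = layer_norm(x)
    mlp_out = matvec(w["w2"], [gelu(u) for u in matvec(w["w1"], h2)])
    # Residual
    return [xi + a + m for xi, a, m in zip(x, attn_out, mlp_out)]
```

Every intermediate here (`h`, `q`, `ctx`, `attn_out`, and so on) is a separate list, which loosely mirrors the off-chip materialization between operators that the abstract identifies as a bottleneck. A fused kernel would aim to keep those intermediates on-chip across stages.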
