ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

ChiHeng Jin; Hongche Yu; Xihui Chen

arXiv:2604.23553 · cs.DC · April 28, 2026

TL;DR

ClusterFusion++ is a CUDA-level extension that fuses the full Transformer decoder block for GPT-NeoX/Pythia models during LLM decoding, improving throughput by 1.34x for Pythia-2.8B on an NVIDIA RTX 5090-class GPU.

Contribution

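ClusterFusion++ extends cluster-level fusion from the attention-side operators to the entire Transformer decoder block, and adds a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead.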

Findings

01

Achieves a 1.34x throughput increase for Pythia-2.8B on an NVIDIA RTX 5090-class GPU.

02

Maintains high output fidelity with near-token-identical generation.

03

Broadens fusion scope to the full Transformer decoder block: LayerNorm, QKV projection, RoPE, decode attention, output projection, Post-LN, MLP, and residual.
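
To put the 1.34x figure in latency terms, the short calculation below converts a throughput speedup into the implied reduction in time per generated token. It is a sketch that assumes throughput is measured in tokens per second at a fixed batch size, so that time per token is the reciprocal of throughput; the function name `latency_reduction` is hypothetical and not taken from the paper.

```python
def latency_reduction(speedup: float) -> float:
    """Fractional drop in per-token time implied by a throughput speedup.

    Assumes time per token is the reciprocal of throughput (fixed batch size).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# 1.34x throughput corresponds to roughly a quarter less time per token.
print(f"{latency_reduction(1.34):.1%}")  # prints "25.4%"
```

Under that assumption, a 1.34x throughput gain corresponds to each decode step taking about 74.6% of its previous time.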

Abstract

Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B,…
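
For orientation, the operator sequence named in the abstract can be written as an unfused, single-token reference step. The sketch below is plain Python, not the paper's CUDA kernel, and it makes several simplifying assumptions that are not confirmed by the source: one attention head, no biases or learned LayerNorm parameters, rotary embedding applied to the whole head dimension, and the parallel-residual layout commonly used by GPT-NeoX models (the MLP branch reads the block input rather than the attention output). All function and parameter names (`decoder_block_step`, `wq`, `w1`, and so on) are hypothetical. Each intermediate list here stands in for a tensor that an unfused implementation would write to and re-read from off-chip memory between kernels.

```python
import math


def layer_norm(x, eps=1e-5):
    # LayerNorm without learned scale/shift (omitted for brevity).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rope(v, pos, base=10000.0):
    # Rotate-half rotary position embedding over the whole vector.
    half = len(v) // 2
    out = list(v)
    for i in range(half):
        angle = pos / (base ** (2 * i / len(v)))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = v[i] * c - v[i + half] * s
        out[i + half] = v[i + half] * c + v[i] * s
    return out


def gelu(v):
    # Exact (erf-based) GELU.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def decode_attention(q, k_cache, v_cache):
    # One query attends over every cached key/value (single head).
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) / total
            for j in range(len(q))]


def decoder_block_step(x, pos, p, k_cache, v_cache):
    h = layer_norm(x)                                          # LayerNorm
    q, k, v = (matvec(p[n], h) for n in ("wq", "wk", "wv"))    # QKV
    q, k = rope(q, pos), rope(k, pos)                          # RoPE
    k_cache.append(k)
    v_cache.append(v)
    ctx = decode_attention(q, k_cache, v_cache)                # decode attention
    attn = matvec(p["wo"], ctx)                                # output projection
    h2 = layer_norm(x)              # Post-LN (assumed to read the block input)
    mlp = matvec(p["w2"], [gelu(u) for u in matvec(p["w1"], h2)])  # MLP
    return [a + b + c for a, b, c in zip(x, attn, mlp)]        # residual


# Toy usage: identity weights, three decode steps with a growing KV cache.
d = 4
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
params = {name: identity for name in ("wq", "wk", "wv", "wo", "w1", "w2")}
k_cache, v_cache = [], []
x = [0.5, -1.0, 2.0, 0.25]
for pos in range(3):
    x = decoder_block_step(x, pos, params, k_cache, v_cache)
print(len(k_cache), x)
```

Read against the abstract, a fused implementation would aim to keep values such as `h`, `q`, `ctx`, and `h2` on-chip across the whole block instead of materializing each one between separate kernel launches.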
