ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive

Xinhao Luo; Zihan Liu; Yangjie Zhou; Shihan Fang; Ziyu Huang; Yu Feng; Chen Zhang; Shixuan Sun; Zhenzhe Zheng; Jingwen Leng; Minyi Guo

arXiv:2508.18850·cs.DC·August 27, 2025

TL;DR

ClusterFusion introduces cluster-level communication primitives and a scheduling framework that widen the scope of operator fusion in LLM inference, cutting latency by keeping intermediate data on-chip instead of routing it through off-chip memory.

Contribution

The paper proposes two primitives, ClusterReduce and ClusterGather, for structured on-chip communication between thread blocks within a cluster, enabling broader operator fusion in LLM inference on modern GPU architectures such as NVIDIA Hopper.

Findings

1. Achieves 1.61x lower latency on average than state-of-the-art frameworks.

2. Enables on-chip exchange of intermediate data, reducing off-chip memory traffic.

3. Fuses multiple decoding stages into a single kernel.
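The second and third findings can be illustrated with a back-of-envelope count. The sketch below is a toy model and not the paper's cost model: it assumes a linear chain of operators in which, when unfused, each intermediate result is written to off-chip memory once and read back once by the next operator. The function names and the example sizes are hypothetical, and weights, inputs, and final outputs are ignored.

```python
def intermediate_traffic_bytes(num_ops: int, intermediate_bytes: int, fused: bool) -> int:
    """Off-chip bytes moved for intermediates in a linear chain of operators.

    Assumption: unfused, each of the (num_ops - 1) intermediates is written
    once and read once; fused, intermediates never leave the chip.
    """
    if num_ops < 1 or intermediate_bytes < 0:
        raise ValueError("need at least one operator and a non-negative size")
    if fused:
        return 0
    return 2 * (num_ops - 1) * intermediate_bytes


def kernel_launches(num_ops: int, fused: bool) -> int:
    """Unfused: one launch per operator. Fused: a single kernel."""
    if num_ops < 1:
        raise ValueError("need at least one operator")
    return 1 if fused else num_ops
```

For a hypothetical chain of 3 operators with 8 KiB intermediates, the unfused path moves 2 * 2 * 8192 = 32768 bytes of intermediates over 3 launches, while the fused path moves none over 1 launch. Real savings depend on the model and configuration.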

Abstract

Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to stay on-chip without involving off-chip…
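To make the two communication patterns concrete, here is a minimal host-side Python sketch that simulates what the primitives compute across the thread blocks of one cluster. It is not the authors' GPU implementation and does not model distributed shared memory. The function names `cluster_reduce` and `cluster_gather`, and the choice that every block ends up holding the full result, are assumptions inferred from the primitive names and the abstract.

```python
from functools import reduce
from typing import Callable, List, Sequence

Vector = List[float]


def _check(block_buffers: Sequence[Vector]) -> None:
    """Require at least one block and equal-length per-block buffers."""
    if not block_buffers:
        raise ValueError("cluster must contain at least one block")
    width = len(block_buffers[0])
    if any(len(buf) != width for buf in block_buffers):
        raise ValueError("all block buffers must have the same length")


def cluster_reduce(block_buffers: Sequence[Vector],
                   op: Callable[[float, float], float]) -> List[Vector]:
    """Element-wise reduction across blocks.

    Assumption: the result is replicated, so each block gets its own copy.
    """
    _check(block_buffers)
    # zip(*...) groups the i-th element of every block's buffer together.
    reduced = [reduce(op, column) for column in zip(*block_buffers)]
    return [list(reduced) for _ in block_buffers]


def cluster_gather(block_buffers: Sequence[Vector]) -> List[Vector]:
    """Concatenate every block's buffer in block order.

    Assumption: the gathered vector is replicated to each block.
    """
    _check(block_buffers)
    gathered = [x for buf in block_buffers for x in buf]
    return [list(gathered) for _ in block_buffers]
```

With four simulated blocks holding `[1.0, 2.0]`, `[3.0, 4.0]`, `[5.0, 6.0]`, and `[7.0, 8.0]`, `cluster_reduce` with addition leaves `[16.0, 20.0]` in every block, and `cluster_gather` leaves the eight values in block order in every block. On real hardware the exchange would happen between thread blocks through on-chip memory rather than through Python lists.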
