Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

Xinwei Qiang; Yue Guan; Zhengding Hu; Keren Zhou; Yufei Ding; Adnan Aziz

arXiv:2601.20595 · cs.DC · April 6, 2026

TL;DR

Syncopate is a compiler and runtime that automatically overlaps computation and communication at fine granularity inside a single fused GPU kernel, yielding an average 1.3× speedup (up to 4.7×) on multi-GPU workloads.

Contribution

Syncopate introduces a communication chunk abstraction that decouples communication granularity from kernel structure and backend mechanisms, together with transformations that enable fine-grained overlap.

Findings

01 Achieves an average 1.3× speedup on multi-GPU workloads.

02 Reaches up to 4.7× speedup in certain cases.

03 Enables reuse of chunk plans from existing compilers or templates.

Abstract

Communication has become a first-order bottleneck in large-scale GPU workloads, and existing distributed compilers address it mainly by overlapping whole compute and communication kernels at the stream level. This coarse granularity incurs extra kernel launches, forces device-wide synchronizations at kernel boundaries, and leaves substantial slack when the slowest tile or kernel stretches the communication tail. We present Syncopate, a compiler and runtime that enables automatic fine-grained overlap inside a single fused kernel. Syncopate introduces a communication chunk abstraction that decouples communication granularity from kernel structure and backend mechanisms, allowing chunk-level plans to be ported from existing distributed compilers, written directly by users, or instantiated from reusable templates. Given a local Triton kernel and a chunk schedule, Syncopate performs…
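The granularity argument in the abstract can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not Syncopate's compiler or runtime: the function `finish_time` and its parameters are hypothetical, tiles are assumed to compute one after another, the link is assumed to send one chunk at a time, and kernel launch and synchronization costs are ignored.

```python
def finish_time(num_tiles, chunk_tiles, compute_per_tile, comm_per_tile):
    """Toy model (hypothetical, not Syncopate's API): time at which the last
    chunk has been sent, when each chunk is sent as soon as it is computed."""
    compute_done = 0.0  # time at which the current chunk's tiles are computed
    link_free = 0.0     # time at which the link finishes the previous chunk
    for start in range(0, num_tiles, chunk_tiles):
        tiles = min(chunk_tiles, num_tiles - start)
        compute_done += tiles * compute_per_tile
        # A chunk can be sent only once it is computed and the link is idle.
        send_start = max(compute_done, link_free)
        link_free = send_start + tiles * comm_per_tile
    return link_free


# One chunk covering all tiles: communication starts only after all compute.
coarse = finish_time(num_tiles=8, chunk_tiles=8,
                     compute_per_tile=1.0, comm_per_tile=1.0)

# One tile per chunk: each send overlaps with the next tile's compute.
fine = finish_time(num_tiles=8, chunk_tiles=1,
                   compute_per_tile=1.0, comm_per_tile=1.0)

print(coarse, fine)  # prints: 16.0 9.0
```

In this model the single-chunk schedule pays for compute and communication back to back, while the per-tile schedule hides all but the last chunk's communication behind compute. Real speedups depend on factors the model leaves out, such as launch overhead, synchronization, and uneven tile times.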
