TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

Man Liu; Xingchen Liu; Xingjian Tian; Bing Lu; Shengkay Lyu; Shengquan Yin; Wenjing Huang; Zheng Wei; Hairui Zhao; Guangming Tan; Dingwen Tao

arXiv:2604.24088 · cs.DC · April 28, 2026

TL;DR

TACO is an FP8-based framework for compressing the intermediate tensors exchanged in tensor-parallel (TP) LLM training. It reduces communication overhead and improves throughput by up to 1.87× while keeping accuracy near-lossless.

Contribution

We introduce TACO, a robust FP8 compression method with adaptive scaling and fused operators, enabling efficient, high-fidelity tensor communication in large-scale LLM training.

Findings

01. Achieves up to 1.87× throughput improvement

02. Maintains near-lossless accuracy on GPT and Qwen models

03. Reduces memory traffic and kernel launch overhead through a fused compression operator

Abstract

Handling communication overhead in large-scale tensor-parallel training remains a critical challenge due to the dense, near-zero distributions of intermediate tensors, which exacerbate errors under frequent communication and introduce significant computational overhead during compression. To this end, we propose TACO (Tensor-parallel Adaptive COmmunication compression), a robust FP8-based framework for compressing TP intermediate tensors. First, we employ a data-driven reshaping strategy combined with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while its Dual-Scale Quantization mechanism ensures numerical stability throughout training. Second, we design a highly fused compression operator to reduce memory traffic and kernel launch overhead, allowing efficient overlap with communication. Finally, we integrate TACO with existing state-of-the-art methods…
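
The transform-then-quantize idea in the abstract can be illustrated with a generic sketch. This is not TACO's Adaptive Scale-Hadamard Transform or Dual-Scale Quantization, whose details are not given on this page. It assumes a fixed orthonormal Hadamard rotation over blocks of 64 values, a single per-block scale, and FP8 E4M3 rounding simulated in NumPy. All function names and the block size are hypothetical.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def round_to_e4m3(x):
    """Simulate E4M3 rounding in float64: 3 mantissa bits, saturating at +/-448."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(np.abs(x))        # |x| = m * 2**e with m in [0.5, 1)
    exp = np.maximum(e - 1, -6)       # binade exponent; -6 is the smallest normal one
    step = np.ldexp(1.0, exp - 3)     # spacing of representable values in that binade
    return np.round(x / step) * step


def compress(x, block=64):
    """Rotate each block, scale its max magnitude to 448, round to simulated FP8."""
    rotated = x.reshape(-1, block) @ hadamard(block)
    amax = np.max(np.abs(rotated), axis=1, keepdims=True)
    scale = E4M3_MAX / np.maximum(amax, 1e-12)  # guard against all-zero blocks
    return round_to_e4m3(rotated * scale), scale


def decompress(q, scale, shape, block=64):
    """Undo the scale and the rotation (the normalized matrix is its own inverse)."""
    return ((q / scale) @ hadamard(block)).reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(scale=1e-3, size=(8, 256))   # dense, near-zero values
    x[0, 0] = 1.0                               # one large outlier
    q, scale = compress(x)
    x_hat = decompress(q, scale, x.shape)
    rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
    print(f"relative reconstruction error: {rel_err:.4f}")
```

The rotation spreads an outlier's energy across its whole block, so a single scale per block wastes less of the narrow FP8 range on one extreme value. In a real TP setting the quantized payload and scales would be what is sent over the interconnect; here both stay in float64 for illustration only.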
