EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration
Ibrahim Ahmed, Clemens Schaefer, Gil Tabak, Denis Vnukov, Zenong Zhang, Felix Chern, Anatoliy Yevtushenko, Andy Davis

TL;DR
EQuARX introduces a native, efficient quantized AllReduce within XLA for TPUs, cutting inter-device communication overhead when serving large models distributed across many accelerators, with minimal accuracy loss.
Contribution
The paper presents a dynamic block-wise quantized AllReduce implemented natively in the XLA compiler for TPUs, reducing the cost of collective communication for large models distributed across many devices.
Findings
Achieves 1.8X speedup over BF16 AllReduce.
Accelerates Gemma 3 27B prefill stage by 1.25X.
Accelerates Gemma 3 12B prefill stage by 1.1X.
Abstract
While Large Language Models (LLMs) have become highly influential, their enormous scale presents significant deployment challenges. Efficiently serving these models typically requires distributing them across numerous accelerator devices, which introduces substantial performance overhead from inter-device communication (collectives). While model quantization has been widely adopted to reduce the memory and compute requirements of LLM weights and activations with minimal quality impact, applying quantization directly to collectives like AllReduce is inherently difficult due to the inter-device summation involved, which can lead to numerical instability or significant error accumulation. In this work, we present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). By using TPU-friendly quantization and deep pipelining of communication and…
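To make the idea of dynamic block-wise quantization in a summing collective concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the block size, the symmetric int8 scheme, and the helper names (quantize_blockwise, dequantize_blockwise, quantized_allreduce_sum) are all assumptions made for illustration, and the simulation ignores ring topology, pipelining, and anything TPU- or XLA-specific. It only shows why each block carries its own scale and how quantization error adds up across devices.

```python
import random

BLOCK = 4  # hypothetical block size, kept tiny for illustration


def quantize_blockwise(values, block=BLOCK):
    """Symmetric int8 quantization with one dynamically computed scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        max_abs = max(abs(v) for v in chunk)
        # An all-zero block gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in chunk)
    return codes, scales


def dequantize_blockwise(codes, scales, block=BLOCK):
    """Map int8 codes back to floats using the scale of the block they belong to."""
    return [c * scales[i // block] for i, c in enumerate(codes)]


def quantized_allreduce_sum(per_device, block=BLOCK):
    """Simulate a sum in which every device's tensor is sent as int8 codes plus scales.

    Accumulation happens in floating point after dequantization (an assumption
    of this sketch), so only the communicated payload is low precision.
    """
    total = [0.0] * len(per_device[0])
    for tensor in per_device:
        codes, scales = quantize_blockwise(tensor, block)
        received = dequantize_blockwise(codes, scales, block)
        total = [t + r for t, r in zip(total, received)]
    return total


# Small demo: four simulated devices, eight values each.
random.seed(0)
devices = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
exact = [sum(column) for column in zip(*devices)]
approx = quantized_allreduce_sum(devices)
max_err = max(abs(a - e) for a, e in zip(approx, exact))
print(f"max abs error vs. exact sum: {max_err:.5f}")
```

In this sketch each element's rounding error is at most half a quantization step (scale / 2) per device, so the worst-case error of the sum grows linearly with the number of participating devices. That growth is the error accumulation the abstract refers to, and smaller blocks tighten each scale at the cost of sending more scale values.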
