ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Adel Nabli (MLIA; Mila); Louis Fournier (MLIA); Pierre Erbacher (MLIA); Louis Serrano (MLIA); Eugene Belilovsky (Mila); Edouard Oyallon (MLIA)

arXiv:2406.02613 · cs.LG · October 15, 2025

TL;DR

ACCO is a novel distributed training algorithm for large language models that reduces communication overhead and GPU idle time by synchronizing delayed gradients, enabling faster and more scalable training across heterogeneous hardware.

Contribution

We introduce ACCO, a memory-efficient optimization algorithm that synchronizes delayed gradients during training, improving scalability and efficiency over existing methods like ZeRO-1.

Findings

01 ACCO reduces training time compared to ZeRO-1.

02 Supports heterogeneous hardware environments effectively.

03 Maintains convergence properties similar to standard distributed optimization.

Abstract

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to…
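The idea of synchronizing delayed gradients while new ones are being computed can be illustrated with a toy, single-process sketch. This is not the paper's implementation: there is no optimizer state sharding, no real collective, and the abstract's technique for correcting delayed updates is not modeled. The names `SHARDS`, `grad`, `all_reduce_mean`, and `train_delayed` are hypothetical, and a background thread stands in for the communication stream.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: two "workers", each holding (x, y) pairs with y = 3 * x.
SHARDS = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5)],
]


def grad(w, x, y):
    # Gradient of 0.5 * (w * x - y) ** 2 with respect to w.
    return (w * x - y) * x


def all_reduce_mean(worker_grads):
    # Stand-in for a gradient all-reduce: average across workers.
    return sum(worker_grads) / len(worker_grads)


def train_delayed(shards, steps=200, lr=0.1):
    w = 0.0
    pending = None  # Reduction of the previous step's gradients, still "in flight".
    with ThreadPoolExecutor(max_workers=1) as comm:
        for _ in range(steps):
            # Compute new local gradients while the previous reduction runs.
            local = [
                sum(grad(w, x, y) for x, y in shard) / len(shard)
                for shard in shards
            ]
            # Apply the delayed (one-step stale) averaged gradient once it is ready.
            if pending is not None:
                w -= lr * pending.result()
            # Launch the reduction of the fresh gradients for use in the next step.
            pending = comm.submit(all_reduce_mean, local)
        # Flush the last in-flight gradient.
        w -= lr * pending.result()
    return w


if __name__ == "__main__":
    print(train_delayed(SHARDS))
```

On this small quadratic problem the one-step-stale updates still converge toward `w = 3` with the chosen step size. In general, stale gradients can slow or destabilize training, which is the convergence issue the abstract says ACCO addresses.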

Code & Models

Repositories

edouardoyallon/acco (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science

Methods: ALIGN