Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers

Liangyu Wang; Siqi Zhang; Junjie Wang; Yiming Dong; Bo Zheng; Zihan Qiu; Shengkun Tang; Di Wang; Rui Men; Dayiheng Liu

arXiv:2602.06079 · cs.DC · February 9, 2026

TL;DR

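Canzona is a framework for running matrix-based optimizers (e.g., Shampoo, Muon, SOAP) efficiently in distributed LLM training. It is asynchronous and load-balanced, and it addresses the conflict between these optimizers' need for holistic updates and the tensor fragmentation in distributed frameworks like Megatron.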

Contribution

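Canzona decouples logical optimizer assignment from physical parameter distribution. It pairs this with an alpha-Balanced Static Partitioning strategy for Data Parallelism and an asynchronous design for Tensor Parallelism to improve optimizer efficiency in distributed LLM training.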

Findings

01. Achieves 1.57x speedup in iteration time on 256 GPUs.

02. Reduces optimizer step latency by 5.8x.

03. Maintains efficiency of parallel architectures with large models.

Abstract

The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. For Tensor Parallelism, we design an Asynchronous Compute…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Stochastic Gradient Optimization Techniques