Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping

Muru Zhang; Mayank Mishra; Zhongzhu Zhou; William Brandon; Jue Wang; Yoon Kim; Jonathan Ragan-Kelley; Shuaiwen Leon Song; Ben Athiwaratkun; Tri Dao

arXiv:2501.06589 · cs.LG · June 23, 2025

TL;DR

This paper introduces Ladder Residual, an architectural modification to residual-based models that lets inter-GPU communication overlap with computation during model-parallel inference, hiding communication latency and speeding up large-model inference.

Contribution

The paper proposes Ladder Residual, an architectural change applicable to any residual-based model. It decouples communication from computation so that the two can overlap, which accelerates large-model inference, especially under tensor parallelism.

Findings

01. A 29% inference speedup for a 70B-parameter Transformer when Ladder Residual is applied.

02. Ladder Transformers at the 1B and 3B scales perform comparably to standard Transformer baselines.

03. Converting parts of Llama-3.1 8B to Ladder Residual causes minimal accuracy loss.

Abstract

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow…
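The decoupling described in the abstract can be illustrated with a toy, single-process sketch. It assumes (this is not stated on this page) that each block reads a residual stream that lags by one update, so a block's computation never waits on the previous block's output. The function names `standard_forward` and `ladder_forward` and the scalar "blocks" are hypothetical, and no real all-reduce is performed; the `pending` variable only marks where an in-flight communication would land.

```python
from typing import Callable, Sequence

Block = Callable[[float], float]


def standard_forward(x: float, blocks: Sequence[Block]) -> float:
    """Standard residual stream: each block needs the previous block's
    output (and, under tensor parallelism, its all-reduce) before it starts."""
    for h in blocks:
        x = x + h(x)
    return x


def ladder_forward(x: float, blocks: Sequence[Block]) -> float:
    """Illustrative ladder-style stream (assumed one-update lag):
    a block runs on the stream *before* the previous block's output is
    added, so that output's communication could overlap with this compute."""
    pending = 0.0  # previous block's output, imagined as still in flight
    for h in blocks:
        out = h(x)       # compute does not depend on `pending`
        x = x + pending  # the previous block's result lands afterwards
        pending = out
    return x + pending   # add the last block's output at the end


if __name__ == "__main__":
    toy_blocks = [lambda v: 2 * v, lambda v: v + 1, lambda v: 3 * v]
    print(standard_forward(1.0, toy_blocks))
    print(ladder_forward(1.0, toy_blocks))
```

The two functions compute different outputs for the same blocks, which is why the change is architectural rather than a pure systems optimization.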

Code & Models

Repositories

mayank31398/ladder-residual-inference · PyTorch · Official

Videos

Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping · SlidesLive

Taxonomy

Methods: Attention Is All You Need · Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Multi-Head Attention