APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

Yuxiang Huang; Mingye Li; Xu Han; Chaojun Xiao; Weilin Zhao; Sun Ao; Hao Zhou; Jie Zhou; Zhiyuan Liu; Maosong Sun

arXiv:2502.12085 · cs.LG · May 27, 2025

TL;DR

APB is a framework that speeds up the prefill stage of long-context inference in large language models by passing compressed context blocks across GPUs, which reduces compute and increases parallelism while maintaining task performance.

Contribution

We propose APB, a long-context inference framework that combines multi-host approximate attention with a communication mechanism for essential key-value pairs inside a sequence parallelism framework, improving prefill speed.

Findings

01. Achieves up to 9.2x speedup over FlashAttn

02. Maintains task performance while accelerating inference

03. Supports diverse models and parallelism configurations

Abstract

While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn…
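To make the abstract's claim of "reducing compute and enhancing parallelism simultaneously" concrete, here is a minimal back-of-the-envelope sketch in Python. It is a toy cost model, not the paper's analysis or implementation. It assumes the sequence is split evenly into one block per host, and that each host attends causally within its own block plus a fixed number of compressed key-value pairs from each earlier block. The function names, the `compressed` parameter, and the example sizes are all hypothetical. Communication and compression overhead are ignored.

```python
from typing import List


def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored by causal attention over n tokens on one device."""
    return n * (n + 1) // 2


def per_host_pairs(n: int, hosts: int, compressed: int) -> List[int]:
    """Toy per-host cost when each host holds one block of the sequence.

    Assumption (illustrative only): host h runs causal attention inside its
    own block and also attends to `compressed` key-value pairs from each of
    the h earlier blocks.
    """
    if n % hosts != 0:
        raise ValueError("n must be divisible by hosts in this toy model")
    block = n // hosts
    local = block * (block + 1) // 2  # causal attention inside the block
    return [local + block * h * compressed for h in range(hosts)]


if __name__ == "__main__":
    # Arbitrary example sizes, not taken from the paper.
    n, hosts, compressed = 131072, 8, 2048
    costs = per_host_pairs(n, hosts, compressed)
    print("full attention pairs:", full_attention_pairs(n))
    print("slowest host pairs:  ", max(costs))
    print("total pairs:         ", sum(costs))
```

In this model the slowest host bounds the wall-clock cost, so comparing `max(costs)` against `full_attention_pairs(n)` shows the parallelism gain, while comparing `sum(costs)` against it shows the compute reduction from attending to compressed blocks instead of full ones.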

Code & Models

Repositories

thunlp/apb (Official)

Videos

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs · underline

Taxonomy

Topics: Context-Aware Activity Recognition Systems

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings