Communication-Efficient Multi-Device Inference Acceleration for Transformer Models
Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan

TL;DR
ASTRA is a communication-efficient multi-device inference framework for Transformer models that significantly reduces latency and bandwidth requirements while maintaining accuracy, enabling faster AI applications in bandwidth-limited environments.
Contribution
ASTRA introduces a novel integration of sequence parallelism and mixed-precision attention with compression techniques to enable efficient multi-device Transformer inference.
Findings
Achieves up to 2.64X speedup over single-device inference.
Achieves up to 15.25X speedup over existing multi-device methods.
Operates effectively under bandwidths as low as 10 Mbps.
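To give a rough sense of why a 10 Mbps link is so restrictive, the back-of-the-envelope sketch below compares the time needed to send full-precision token embeddings with the time needed to send vector-quantization codebook indices. The token count, embedding width, codebook size, and the one-index-per-token scheme are illustrative assumptions, not values or design details taken from the paper.

```python
import math

def transfer_seconds(num_bits: int, bandwidth_mbps: float) -> float:
    """Time to push num_bits over a link, with 1 Mbps = 1e6 bits per second."""
    return num_bits / (bandwidth_mbps * 1e6)

# Illustrative assumptions (hypothetical, not reported in the paper).
num_tokens = 197        # tokens one device must share with its peers
embed_dim = 768         # embedding width
codebook_size = 1024    # entries in the VQ codebook
bandwidth_mbps = 10.0   # the low-bandwidth setting named in the findings

# Full precision: every token is embed_dim 32-bit floats.
full_bits = num_tokens * embed_dim * 32

# Vector quantization (assumed: one codebook index per token).
bits_per_index = math.ceil(math.log2(codebook_size))
index_bits = num_tokens * bits_per_index

full_time = transfer_seconds(full_bits, bandwidth_mbps)
vq_time = transfer_seconds(index_bits, bandwidth_mbps)

print(f"full precision: {full_bits} bits, {full_time:.4f} s")
print(f"VQ indices:     {index_bits} bits, {vq_time:.6f} s")
print(f"payload reduction: {full_bits / index_bits:.1f}x")
```

Under these assumptions, a single exchange shrinks from roughly half a second to well under a millisecond of pure transfer time. Real savings depend on the actual quantization scheme and on protocol overhead, which this sketch ignores.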
Abstract
Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups…
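The core idea of attending to local tokens at full precision and to remote tokens through a codebook can be illustrated with a toy example. This is a minimal sketch under simplifying assumptions, not the authors' implementation: keys and values are the token vectors themselves (no learned projections), the codebook is hand-written, and the function names `quantize` and `mixed_precision_attention` are hypothetical. Noise-Augmented Quantization and Distributed Class Tokens are not modeled.

```python
import math

def quantize(vec, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_precision_attention(query, local_tokens, remote_indices, codebook):
    """Attend over full-precision local tokens plus dequantized remote tokens."""
    # Remote tokens arrive as codebook indices and are reconstructed by lookup.
    remote_tokens = [codebook[i] for i in remote_indices]
    tokens = local_tokens + remote_tokens
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, t)) / scale for t in tokens]
    weights = softmax(scores)
    # Weighted sum of the (assumed) value vectors, one output per dimension.
    return [sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(query))]

# Toy 2-D codebook, assumed to be shared by all devices.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The remote device sends only indices instead of float vectors.
remote_device_tokens = [[0.9, 0.1], [0.1, 0.8]]
remote_indices = [quantize(t, codebook) for t in remote_device_tokens]

# The local device keeps its own tokens at full precision.
local_tokens = [[0.5, 0.5], [1.0, 0.2]]
output = mixed_precision_attention(local_tokens[0], local_tokens,
                                   remote_indices, codebook)
print(remote_indices, output)
```

The trade-off this makes visible is that the local device sees only the codebook approximation of each remote token, so any accuracy loss comes from the gap between a token and its nearest codebook entry.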
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper is well-presented and easy to follow
- Communication overhead is significant in large Transformer model distributed settings
- The use of codebooks is interesting
- The models used for inference are small and it is not clear to me that these hold at scale.
1. Addresses a real bottleneck: The paper identifies and tackles a genuine problem, that communication dominates latency (58.6-93.5%) in bandwidth-constrained multi-device inference.
2. Novel compression approach: The Mixed-Precision Attention mechanism is creative, using full-precision for local tokens and VQ for remote tokens.
3. Good evaluation: Extensive experiments across multiple architectures (ViT, GPT-2), tasks (classification, language modeling), and conditions (bandwidth, device count,
1. Limited architectural types: The evaluation focuses only on ViT and GPT-2, which are relatively small and dated models. Modern applications use much larger models (e.g., LLaMA variants). The scalability claims are weakened without evidence on contemporary, production-scale models.
2. Severe zero-shot degradation: Table 3 shows large performance drops in zero-shot settings (e.g., GPT-2M perplexity increases from 43.22 to 62.29, a 44% degradation). This is a critical limitation for practical d
This paper observes a real bottleneck in multi-device Transformer inference for low-bandwidth or edge environments, which is increasingly relevant for real-time AI applications.
1. ASTRA integrates known techniques (sequence parallelism + token quantization + noise augmentation), so the main contribution is in practical integration and bandwidth optimization, not a fundamentally new inference algorithm.
2. Lacks formal characterization of attention approximation error due to vector quantization and noise injection. Consider adding error bounds or theoretical analysis of how quantization and noise affect attention computation and model accuracy.
3. Experiments assume sta
Taxonomy
Topics: Neural Networks and Applications · Power Transformer Diagnostics and Insulation
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
