Characterizing Communication Patterns in Distributed Large Language Model Inference
Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan, Dhabaleswar K. Panda

TL;DR
This paper analyzes communication patterns in distributed large language model inference, providing insights into how different parallelization strategies impact performance and offering practical guidance for optimizing deployment.
Contribution
It offers a detailed characterization of communication dynamics in distributed LLM inference, combining profiling and analytical models to inform parallelization choices.
Findings
Tensor parallelism has high network overhead but fast response times for short sequences.
Pipeline parallelism reduces data transfer but increases overall latency.
Careful tuning of combined parallelization approaches is necessary for balanced performance.
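The contrast between the first two findings can be illustrated with a back-of-the-envelope volume estimate. The sketch below is not the paper's analytical model. It assumes a Megatron-style layout with two all-reduces per transformer layer under tensor parallelism, a ring all-reduce in which each GPU sends 2(t-1)/t times the message size, and a single activation hand-off per stage boundary under pipeline parallelism. The function names and the example dimensions are hypothetical.

```python
def tp_bytes_per_gpu(batch, seq, hidden, layers, tp, dtype_bytes=2):
    """Approximate bytes each GPU sends in one forward pass with tensor parallelism.

    Assumption: 2 all-reduces per layer, each over a (batch, seq, hidden)
    activation, implemented as a ring all-reduce.
    """
    if tp <= 1:
        return 0
    msg = batch * seq * hidden * dtype_bytes   # one activation tensor
    ring_factor = 2 * (tp - 1) / tp            # per-GPU send volume of a ring all-reduce
    return int(2 * layers * msg * ring_factor)


def pp_bytes_total(batch, seq, hidden, pp, dtype_bytes=2):
    """Approximate bytes sent across all stage boundaries in one forward pass.

    Assumption: one (batch, seq, hidden) activation crosses each of the
    pp - 1 boundaries between adjacent pipeline stages.
    """
    if pp <= 1:
        return 0
    msg = batch * seq * hidden * dtype_bytes
    return (pp - 1) * msg


# Hypothetical configuration: 32 layers, hidden size 4096, fp16, 4 GPUs.
tp_volume = tp_bytes_per_gpu(batch=1, seq=128, hidden=4096, layers=32, tp=4)
pp_volume = pp_bytes_total(batch=1, seq=128, hidden=4096, pp=4)
print(f"tensor parallel:   {tp_volume / 2**20:.1f} MiB per GPU")
print(f"pipeline parallel: {pp_volume / 2**20:.1f} MiB total")
```

Under these assumptions the tensor-parallel volume grows with the number of layers while the pipeline-parallel volume grows only with the number of stage boundaries, which is consistent with the qualitative findings above. The estimate covers bytes only and says nothing about latency, where pipeline stages execute sequentially for a single request.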
Abstract
Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks enable practical deployment of these models, inter-GPU communication creates significant performance constraints that limit service quality in real-world systems. This paper investigates communication dynamics in distributed LLM serving, analyzing how various parallelization approaches coordinate data exchange between GPU workers during inference. We study dense transformer-based models as representative examples of contemporary architectures widely used in operational deployments. Our work combines detailed profiling measurements with predictive analytical models to characterize communication behavior across different parallelization configurations. Results show that tensor…
