Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

Burak Topcu; Musa Oguzhan Cim; Poovaiah Palangappa; Meena Arunachalam; Mahmut Taylan Kandemir

arXiv:2603.05692·cs.DC·March 9, 2026

Open Access

TL;DR

This paper explores how different parallelization strategies affect the performance of dense large language models, providing insights into optimizing latency and throughput for various application needs.

Contribution

It offers an empirical analysis of intra-node parallelization schemes for dense LLMs, highlighting how hybrid tensor and pipeline parallelism can balance latency and throughput.

Findings

01 Tensor Parallelism reduces latency.

02 Pipeline Parallelism increases throughput.

03 Hybrid strategies enable flexible latency-throughput tradeoffs.
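
To make the hybrid design space concrete, the following is a minimal sketch that lists the tensor-parallel and pipeline-parallel degrees that exactly use a fixed number of GPUs in one node. The function name `hybrid_configs` and the assumption that the two degrees must multiply to the GPU count are illustrative choices, not taken from the paper, and the sketch says nothing about which configuration performs best.

```python
from typing import List, Tuple


def hybrid_configs(num_gpus: int) -> List[Tuple[int, int]]:
    """Return every (tp_degree, pp_degree) pair with tp * pp == num_gpus.

    Hypothetical helper: assumes all GPUs are used and that the only two
    parallelism dimensions are tensor (tp) and pipeline (pp).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be positive")
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0  # keep only degrees that divide the GPU count
    ]


if __name__ == "__main__":
    # Example: candidate layouts for a hypothetical 8-GPU node.
    for tp, pp in hybrid_configs(8):
        print(f"TP={tp} x PP={pp}")
```

In this sketch, the pair with `pp == 1` corresponds to pure tensor parallelism and the pair with `tp == 1` to pure pipeline parallelism; the pairs in between are the hybrid options that the third finding refers to.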

Abstract

Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly expanding ecosystem, dense LLMs--those that activate all model parameters for each token generation--form the foundation for advanced expert-based variants. Dense models continue to dominate because of their strong generalization ability, scalability, ease of fine-tuning, and versatility across diverse tasks. In LLM inference systems, performance is mainly characterized by latency, response time, and throughput (i.e., tokens generated per unit of time). Latency and throughput are inherently coupled: optimizing for one often comes at the expense of the other. Moreover, batching strategies and parallelism configurations, which are essential when dense…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Machine Learning in Materials Science · Advanced Neural Network Applications