Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI
Saicharan Kolluru

TL;DR
This paper empirically compares vLLM and HuggingFace TGI for LLM inference, highlighting their performance differences in throughput, latency, and scalability across various deployment scenarios.
Contribution
It provides a comprehensive benchmarking study of vLLM and TGI, revealing their strengths and guiding system selection based on workload needs.
Findings
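vLLM achieves up to 24x higher throughput than TGI under high-concurrency workloads, which the paper attributes to its PagedAttention mechanism.
TGI has lower tail latencies in interactive single-user scenarios.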
Performance varies significantly with workload and model size.
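The findings above rest on two kinds of metric: aggregate throughput and tail latency. The following is a minimal sketch of how such metrics are commonly derived from per-request timing records. It is not the paper's benchmarking harness: the `RequestRecord` type, the function names, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request measurement (not taken from the paper)."""
    start_s: float        # time the request was sent, in seconds
    end_s: float          # time the last token was received, in seconds
    output_tokens: int    # number of tokens generated for this request


def throughput_tokens_per_s(records):
    """Total generated tokens divided by the wall-clock span of the run."""
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    return sum(r.output_tokens for r in records) / wall_s


def latency_percentile(records, pct):
    """End-to-end latency at the given percentile (nearest-rank method)."""
    latencies = sorted(r.end_s - r.start_s for r in records)
    rank = max(1, math.ceil(pct / 100 * len(latencies)))
    return latencies[rank - 1]


# Made-up sample values, used only to exercise the functions.
sample = [
    RequestRecord(0.0, 1.0, 100),
    RequestRecord(0.0, 2.0, 100),
    RequestRecord(1.0, 4.0, 200),
]
print(throughput_tokens_per_s(sample))
print(latency_percentile(sample, 50), latency_percentile(sample, 99))
```

Throughput computed this way rewards serving many concurrent requests, while a high percentile such as p99 reflects the slowest requests a user might see, which is why one system can lead on the first metric and trail on the second.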
Abstract
The deployment of Large Language Models (LLMs) in production environments requires efficient inference serving systems that balance throughput, latency, and resource utilization. This paper presents a comprehensive empirical evaluation of two prominent open-source LLM serving frameworks: vLLM and HuggingFace Text Generation Inference (TGI). We benchmark these systems across multiple dimensions including throughput performance, end-to-end latency, GPU memory utilization, and scalability characteristics using LLaMA-2 models ranging from 7B to 70B parameters. Our experiments reveal that vLLM achieves up to 24x higher throughput than TGI under high-concurrency workloads through its novel PagedAttention mechanism, while TGI demonstrates lower tail latencies for interactive single-user scenarios. We provide detailed performance profiles for different deployment scenarios and offer practical…
Taxonomy
Topics: Topic Modeling · Big Data and Digital Economy · Natural Language Processing Techniques
