The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
Matias Martinez

TL;DR
This study evaluates how hyperparameters affect the inference performance of 20 large language models using vLLM and HuggingFace pipelines, revealing irregular throughput landscapes and the benefits of hyperparameter optimization for different GPU models.
Contribution
It provides a comprehensive analysis of hyperparameter impacts on LLM inference performance and demonstrates optimization benefits across different inference engines and hardware configurations.
Findings
Throughput landscapes are irregular with distinct peaks.
Hyperparameter optimization improves throughput by 9.16% on average.
Re-tuning hyperparameters after changing the GPU model used for inference improves performance.
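To make the findings above concrete, here is a minimal sketch of an exhaustive search over a small hyperparameter grid, reporting the relative throughput gain over a default configuration. Everything in it is an assumption for illustration: the hyperparameter names (`batch_size`, `num_gpus`), the helper names `grid_search` and `toy_throughput`, and the throughput numbers are made up and are not taken from the paper. In practice `measure` would run a real benchmark against an inference engine.

```python
from itertools import product


def grid_search(measure, space, default):
    """Try every configuration in `space` and return
    (best_config, best_throughput, percent_gain_over_default)."""
    baseline = measure(default)
    best_cfg, best_tp = default, baseline
    names = list(space)
    for values in product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        tp = measure(cfg)
        if tp > best_tp:
            best_cfg, best_tp = cfg, tp
    gain = 100.0 * (best_tp - baseline) / baseline
    return best_cfg, best_tp, gain


def toy_throughput(cfg):
    """Hypothetical, irregular throughput surface in tokens/s.
    The values are invented; they only mimic a landscape with a distinct peak."""
    table = {
        (1, 1): 100.0, (1, 2): 90.0,
        (8, 1): 140.0, (8, 2): 180.0,
        (32, 1): 120.0, (32, 2): 150.0,
    }
    return table[(cfg["batch_size"], cfg["num_gpus"])]


space = {"batch_size": [1, 8, 32], "num_gpus": [1, 2]}
default = {"batch_size": 1, "num_gpus": 1}
best_cfg, best_tp, gain = grid_search(toy_throughput, space, default)
print(best_cfg, best_tp, f"{gain:.1f}%")
```

Because the toy surface is not monotonic in either hyperparameter, the best configuration cannot be found by increasing one value at a time, which is why the sketch evaluates the whole grid.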
Abstract
The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model's weights onto available resources, such as GPUs, and process queries to generate responses. The inference speed, or performance, of an LLM is critical for real-time applications, since each inference involves millions or billions of floating-point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using…
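The abstract defines throughput as tokens generated per unit of time. A minimal sketch of that measurement follows. The `measure_throughput` helper, the `generate` callable, and the fake clock are hypothetical stand-ins introduced here, not the paper's benchmarking harness or any engine's API. `generate` is assumed to return one list of generated token ids per prompt.

```python
import time


def measure_throughput(generate, prompts, clock=time.perf_counter):
    """Return generated tokens per second for one call to `generate`.

    `generate` is any callable that takes a list of prompts and returns
    one list of generated token ids per prompt (an assumption of this sketch).
    """
    start = clock()
    outputs = generate(prompts)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    total_tokens = sum(len(tokens) for tokens in outputs)
    return total_tokens / elapsed


# Deterministic demo: a fake engine that emits 5 tokens per prompt and
# a fake clock that reports 0.0 s at the start and 2.0 s at the end.
def fake_generate(prompts):
    return [[0] * 5 for _ in prompts]


ticks = iter([0.0, 2.0])
throughput = measure_throughput(
    fake_generate, ["a", "b", "c", "d"], clock=lambda: next(ticks)
)
print(throughput)  # 20 tokens / 2.0 s = 10.0 tokens/s
```

Passing the clock as a parameter keeps the example reproducible. With a real engine, the default `time.perf_counter` would be used and the measurement repeated over several runs.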
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
