The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

Matias Martinez

arXiv:2408.01050·cs.SE·August 5, 2024·2 cites

TL;DR

This study evaluates how hyperparameters affect the inference performance of 20 large language models using vLLM and HuggingFace pipelines, revealing irregular throughput landscapes and the benefits of hyperparameter optimization for different GPU models.

Contribution

It provides a comprehensive analysis of hyperparameter impacts on LLM inference performance and demonstrates optimization benefits across different inference engines and hardware configurations.

Findings

01. Throughput landscapes are irregular, with distinct peaks.

02. Hyperparameter optimization improves throughput by 9.16% on average.

03. When the GPU model changes, re-tuning the hyperparameters improves performance.

Abstract

The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model's weights onto available resources, such as GPUs, and process queries to generate responses. The inference speed, or performance, of an LLM is critical for real-time applications, because each inference involves millions or billions of floating-point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using…
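To make the throughput metric concrete, here is a minimal Python sketch, not the paper's benchmarking code. It computes throughput as tokens generated divided by elapsed time and runs an exhaustive grid search for the configuration with the highest throughput. The `batch_size` hyperparameter, the `fake_run` cost model, and all numbers in it are hypothetical stand-ins for a real inference engine call, chosen only so the landscape has a single peak.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def throughput(tokens_generated: int, elapsed_seconds: float) -> float:
    """Throughput as defined in the abstract: tokens generated per unit of time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def grid_search(
    run: Callable[[Config], Tuple[int, float]],
    grid: Dict[str, List[int]],
) -> Tuple[Config, float]:
    """Try every combination in `grid` and return the best (config, throughput)."""
    keys = list(grid)
    best_config: Config = {}
    best_tps = float("-inf")
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        tokens, seconds = run(config)
        tps = throughput(tokens, seconds)
        if tps > best_tps:
            best_config, best_tps = config, tps
    return best_config, best_tps


def fake_run(config: Config) -> Tuple[int, float]:
    """Hypothetical stand-in for an engine call; returns (tokens, seconds).

    Assumption: time grows with batch size, plus a fixed penalty above 32
    that mimics running out of memory headroom. These numbers are invented.
    """
    batch = config["batch_size"]
    tokens = batch * 100
    seconds = 1.0 + 0.05 * batch + (2.0 if batch > 32 else 0.0)
    return tokens, seconds


if __name__ == "__main__":
    best, tps = grid_search(fake_run, {"batch_size": [8, 16, 32, 64]})
    print(best, round(tps, 1))
```

In a real measurement, `fake_run` would be replaced by a timed generation call to the inference engine, and the grid would hold whichever hyperparameters that engine exposes.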

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
