A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving

Ferran Agullo; Joan Oliveras; Chen Wang; Alberto Gutierrez-Torre; Olivier Tardieu; Alaa Youssef; Jordi Torres; Josep Ll. Berral

arXiv:2508.08343 · cs.PF · November 20, 2025

TL;DR

This paper presents a data-driven machine learning method and a Digital Twin for maximizing GPU throughput in LLM-adapter serving systems without causing request starvation under GPU memory limits.

Contribution

It introduces the first Digital Twin for LLM-adapter serving, enabling efficient training data generation and accurate throughput prediction under heterogeneous workloads.

Findings

01. The Digital Twin reproduces throughput within 5.1% of real results.

02. The ML approach predicts optimal adapter configurations with at most 7.2% error.

03. The method improves GPU utilization and prevents request starvation.

Abstract

With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin…
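The selection problem described in the abstract can be illustrated with a minimal sketch: given some predictor of throughput and starvation, choose the candidate configuration with the highest predicted throughput among those that do not starve. This is not the authors' method or model. The names `Config`, `served_adapters`, `max_loaded`, `best_config`, and `toy_predict` are hypothetical, and the numbers inside `toy_predict` are made up purely so the example runs.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    # Hypothetical field names; an assumed reading of "joint configuration".
    served_adapters: int  # adapters whose traffic is routed to this GPU
    max_loaded: int       # adapters allowed in GPU memory at the same time


# A predictor maps a configuration to (throughput, starves).
Predictor = Callable[[Config], Tuple[float, bool]]


def best_config(candidates: Iterable[Config], predict: Predictor) -> Optional[Config]:
    """Return the non-starving candidate with the highest predicted throughput."""
    best, best_throughput = None, float("-inf")
    for cfg in candidates:
        throughput, starves = predict(cfg)
        if starves:
            continue  # infeasible: this configuration would starve requests
        if throughput > best_throughput:
            best, best_throughput = cfg, throughput
    return best


def toy_predict(cfg: Config) -> Tuple[float, bool]:
    """Made-up stand-in for a trained model or Digital Twin (illustrative only)."""
    offered = 10.0 * cfg.served_adapters       # assumed request load
    capacity = 400.0 - 2.0 * cfg.max_loaded    # assumed memory overhead per loaded adapter
    return min(offered, capacity), offered > capacity


candidates = [
    Config(served, loaded)
    for served, loaded in product([8, 16, 32, 64], [4, 8, 16])
    if loaded <= served
]
choice = best_config(candidates, toy_predict)
print(choice)
```

In this toy setting, every candidate with 64 served adapters is rejected as starving, so the search settles on a 32-adapter configuration. In practice `toy_predict` would be replaced by a model trained on measured or simulated serving data.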
