Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral

TL;DR
This paper introduces a data-driven pipeline that optimizes adapter placement across GPUs for distributed LLM adapter serving. It uses accurate performance predictions to find the maximum feasible throughput of each GPU, so that a workload can be served with fewer GPUs while avoiding request starvation and GPU memory errors.
Contribution
It presents a novel combination of a Digital Twin, ML performance models, and a greedy algorithm to optimize GPU utilization in LLM adapter serving systems.
Findings
Achieves a throughput estimation error below 5%.
Reduces the number of GPUs needed for target workloads.
Demonstrates substantial GPU efficiency improvements.
Abstract
Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model…
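The placement goal described in the abstract (serve every adapter with as few GPUs as possible without overloading any GPU) can be illustrated with a small greedy sketch. This is not the authors' algorithm: it is a plain first-fit-decreasing heuristic written under simplifying assumptions. The names `place_adapters`, `max_throughput`, and `max_adapters` are hypothetical. In particular, `max_throughput` stands in for the per-GPU throughput limit that the paper obtains from learned performance predictions, and `max_adapters` is a crude stand-in for the GPU memory constraint.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """One GPU with its assigned adapters and their summed request rate."""
    adapters: list[str] = field(default_factory=list)
    load: float = 0.0


def place_adapters(rates: dict[str, float], max_throughput: float,
                   max_adapters: int) -> list[Gpu]:
    """First-fit-decreasing placement of adapters onto GPUs.

    rates:          request rate per adapter (hypothetical units, e.g. req/s)
    max_throughput: assumed constant per-GPU throughput limit; in the paper
                    this limit comes from a performance model, not a constant
    max_adapters:   assumed cap on adapters per GPU, a proxy for memory limits
    """
    if max_adapters < 1:
        raise ValueError("max_adapters must be at least 1")
    gpus: list[Gpu] = []
    # Place the most demanding adapters first.
    for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
        if rate > max_throughput:
            raise ValueError(f"adapter {name!r} exceeds the per-GPU limit")
        for gpu in gpus:
            fits_load = gpu.load + rate <= max_throughput
            fits_count = len(gpu.adapters) < max_adapters
            if fits_load and fits_count:
                break  # reuse the first GPU that can take this adapter
        else:
            gpu = Gpu()  # no existing GPU fits, so open a new one
            gpus.append(gpu)
        gpu.adapters.append(name)
        gpu.load += rate
    return gpus


if __name__ == "__main__":
    demo = {"a": 6.0, "b": 5.0, "c": 4.0, "d": 3.0, "e": 2.0}
    for i, gpu in enumerate(place_adapters(demo, max_throughput=10.0, max_adapters=3)):
        print(i, gpu.adapters, gpu.load)
```

A real system would replace the constant limit with a prediction that depends on which adapters share a GPU, which is the role the abstract assigns to the performance model.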
Taxonomy
Topics: IoT and Edge/Fog Computing · Software-Defined Networks and 5G · Caching and Content Delivery
