Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Ferran Agullo; Joan Oliveras; Chen Wang; Alberto Gutierrez-Torre; Olivier Tardieu; Alaa Youssef; Jordi Torres; Josep Ll. Berral

arXiv:2602.24044·cs.DC·March 2, 2026

Open Access

TL;DR

This paper introduces a data-driven pipeline that optimizes adapter placement across GPUs for distributed LLM adapter serving. It improves resource efficiency by accurately predicting performance and maximizing per-GPU throughput while avoiding request starvation and GPU memory errors.

Contribution

The paper presents a novel combination of a Digital Twin, ML performance models, and a greedy algorithm to optimize GPU utilization in LLM adapter serving systems.

Findings

01 Estimates throughput with high fidelity, with an error below 5%.

02 Reduces the number of GPUs needed for target workloads.

03 Demonstrates substantial GPU efficiency improvements.

Abstract

Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model…
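
To make the placement problem in the abstract concrete, here is a minimal sketch of a greedy packing loop. It is not the paper's algorithm: the `Adapter` fields, the `memory_budget` parameter, the first-fit-decreasing ordering, and the `toy_model` predictor are all assumptions made for illustration. The callable `predict_max_throughput` stands in for a learned performance model that returns the maximum feasible throughput of a GPU given the adapters placed on it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Adapter:
    name: str
    demand: float  # required throughput (hypothetical unit, e.g. tokens/s)
    memory: float  # GPU memory footprint (hypothetical unit, e.g. GiB)


@dataclass
class Gpu:
    adapters: List[Adapter] = field(default_factory=list)


Predictor = Callable[[List[Adapter]], float]


def fits(gpu: Gpu, adapter: Adapter, predict: Predictor, memory_budget: float) -> bool:
    """Check that adding `adapter` keeps the GPU within memory and throughput limits."""
    candidate = gpu.adapters + [adapter]
    total_memory = sum(a.memory for a in candidate)
    total_demand = sum(a.demand for a in candidate)
    # Memory check guards against out-of-memory errors; the throughput check
    # guards against starvation (demand exceeding what the GPU can sustain).
    return total_memory <= memory_budget and total_demand <= predict(candidate)


def greedy_placement(adapters: List[Adapter], predict: Predictor,
                     memory_budget: float) -> List[Gpu]:
    """First-fit-decreasing: place the most demanding adapters first,
    opening a new GPU only when no existing GPU can host the adapter."""
    gpus: List[Gpu] = []
    for adapter in sorted(adapters, key=lambda a: a.demand, reverse=True):
        target: Optional[Gpu] = next(
            (g for g in gpus if fits(g, adapter, predict, memory_budget)), None)
        if target is None:
            target = Gpu()
            if not fits(target, adapter, predict, memory_budget):
                raise ValueError(f"adapter {adapter.name} cannot fit on an empty GPU")
            gpus.append(target)
        target.adapters.append(adapter)
    return gpus


def toy_model(placed: List[Adapter]) -> float:
    # Assumption: sustainable throughput drops as more adapters share a GPU.
    return 1000.0 - 50.0 * len(placed)


workload = [
    Adapter("a", demand=400.0, memory=2.0),
    Adapter("b", demand=300.0, memory=2.0),
    Adapter("c", demand=300.0, memory=2.0),
    Adapter("d", demand=200.0, memory=2.0),
]
placement = greedy_placement(workload, toy_model, memory_budget=16.0)
for index, gpu in enumerate(placement):
    print(index, [a.name for a in gpu.adapters])
```

In this sketch the predictor is queried for every candidate placement, so a more accurate model can be swapped in without touching the packing loop. A real system would also need to account for request arrival patterns and adapter caching, which this example ignores.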

Taxonomy

Topics: IoT and Edge/Fog Computing · Software-Defined Networks and 5G · Caching and Content Delivery