Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments

Nikoleta Iliakopoulou; Jovan Stojkovic; Chloe Alverti; Tianyin Xu; Hubertus Franke; Josep Torrellas

arXiv:2411.17741 · cs.DC · November 14, 2025

TL;DR

Chameleon is an LLM serving system for many-adapter environments that caches adapters and schedules requests with adapter awareness, reducing P99 latency by 80.7% under high load and improving throughput by 1.5x over baselines.

Contribution

It introduces adapter caching and adapter-aware scheduling techniques tailored for many-adapter LLM inference environments, addressing workload heterogeneity and scheduler inefficiencies.

Findings

01 Reduces P99 latency by 80.7% under high load

02 Improves throughput by 1.5x over baselines

03 Effectively minimizes adapter loading times and prevents head-of-line blocking

Abstract

The widespread adoption of LLMs has driven an exponential rise in their deployment, imposing substantial demands on inference clusters. These clusters must handle numerous concurrent queries for different LLM downstream tasks. To handle multi-task settings with vast LLM parameter counts, methods like Low-Rank Adaptation (LoRA) enable task-specific fine-tuning while sharing most of the base LLM across tasks. Hence, they allow concurrent task serving with minimal memory requirements. However, existing LLM serving systems face inefficiencies: they overlook workload heterogeneity, incur high link-bandwidth usage from frequent adapter loading, and suffer from head-of-line blocking in their schedulers. To address these challenges, we present Chameleon, a novel LLM serving system optimized for many-adapter environments, that relies on two core ideas: adapter caching and adapter-aware…
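To illustrate why LoRA lets many tasks share one base model with little extra memory, the following is a minimal sketch of the standard LoRA formulation, y = Wx + B(Ax), where W is the shared base weight and A, B are small per-task matrices of rank r. It is not code from the Chameleon system; the function names (`matvec`, `lora_forward`, `param_counts`) and the example dimensions are assumptions chosen for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Multiply matrix M (rows x cols) by vector v (cols)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector) -> Vector:
    """Compute y = W x + B (A x).

    W: shared base weight, shape (d_out, d_in)
    A: adapter down-projection, shape (r, d_in)
    B: adapter up-projection, shape (d_out, r)
    """
    base = matvec(W, x)               # shared across all tasks
    delta = matvec(B, matvec(A, x))   # task-specific low-rank update
    return [b + d for b, d in zip(base, delta)]


def param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (base_params, adapter_params) for one weight matrix."""
    return d_out * d_in, rank * (d_in + d_out)


# Tiny hypothetical example: 2x2 identity base weight, rank-1 adapter.
W = [[1, 0], [0, 1]]
A = [[1, 1]]      # shape (1, 2)
B = [[1], [0]]    # shape (2, 1)
y = lora_forward(W, A, B, [1, 2])

base_params, adapter_params = param_counts(4096, 4096, 8)
```

With the illustrative dimensions above (a 4096 x 4096 weight and rank 8), the adapter holds 65,536 parameters against 16,777,216 in the base matrix, a factor of 256 fewer. Each adapter is small, but a server handling many of them must still move them to the GPU on demand, which is the loading cost the abstract refers to.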

Taxonomy

Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Data Quality and Management