Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA

Allison Li; Kristjan Greenewald; Thomas Parnell; Navid Azizan

arXiv:2512.17910·cs.DC·December 23, 2025

TL;DR

This paper introduces an LLM serving engine that reuses key-value caches between a base model and its Activated LoRA (aLoRA) adapters, so that switching adapters mid-pipeline avoids most of the usual recomputation and the latency it adds.

Contribution

The work presents the first implementation of cross-model KV-cache reuse with Activated LoRA in a production inference engine (vLLM), reducing recomputation when switching adapters during multi-adapter LLM serving.

Findings

1. Up to 58x latency reduction in multi-adapter pipelines
2. Over 100x improvement in time-to-first-token
3. Benefits scale with model size and sequence length

Abstract

Modern large language model (LLM) systems increasingly rely on multi-turn pipelines that are composed of multiple task-specific adapters, yet existing serving frameworks remain inefficient, incurring substantial recomputation overhead when switching between adapters. We present the first LLM serving engine that supports cross-model prefix cache reuse between base and adapted models via Activated LoRA (aLoRA), enabling efficient and fine-grained adapter switching during inference. Our design extends the vLLM framework by introducing base-aligned block hashing and activation-aware masking within the model execution path, permitting cache reuse across models while preserving compatibility with existing serving engine optimizations. Integrated into a production-grade inference stack, this approach supports dynamic adapter activation without excessive key-value tensor recomputation.…
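
The abstract names two mechanisms, base-aligned block hashing and activation-aware masking, without showing them. The Python sketch below is one plausible reading, not the authors' code and not vLLM's API: it assumes prefix-cache blocks are identified by a chained hash over token IDs plus an optional adapter key, and that an aLoRA adapter only affects tokens at or after an activation position. Every name here (`block_hashes`, `activation_mask`, `shared_prefix_blocks`, `BLOCK_SIZE`, `"adapter-a"`) is hypothetical and chosen for illustration.

```python
import hashlib
from typing import List, Optional

BLOCK_SIZE = 4  # hypothetical block size; real engines typically use larger blocks


def block_hashes(
    token_ids: List[int],
    adapter_id: Optional[str] = None,
    activation_start: Optional[int] = None,
    block_size: int = BLOCK_SIZE,
) -> List[str]:
    """Chain-hash every full block of tokens.

    Assumption: a block that ends at or before `activation_start` is hashed
    without the adapter key, so its hash matches the base model's hash for
    the same tokens. Blocks that overlap the activated region carry the key.
    """
    hashes: List[str] = []
    parent = b""
    for b in range(len(token_ids) // block_size):
        start, end = b * block_size, (b + 1) * block_size
        block = tuple(token_ids[start:end])
        before_activation = activation_start is not None and end <= activation_start
        key = None if (adapter_id is None or before_activation) else adapter_id
        digest = hashlib.sha256(parent + repr((block, key)).encode()).digest()
        hashes.append(digest.hex())
        parent = digest  # each block hash depends on all earlier blocks
    return hashes


def activation_mask(num_tokens: int, activation_start: int) -> List[bool]:
    """True where the adapter's low-rank update would be applied (assumed rule)."""
    return [i >= activation_start for i in range(num_tokens)]


def shared_prefix_blocks(a: List[str], b: List[str]) -> int:
    """Count leading blocks whose hashes match, i.e. blocks reusable from cache."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


prompt = list(range(10))                          # tokens seen by the base model
request = prompt + [100, 101, 102, 103, 104, 105]  # adapter invoked at position 10

base = block_hashes(prompt)
alora = block_hashes(request, adapter_id="adapter-a", activation_start=10)
plain_lora = block_hashes(request, adapter_id="adapter-a")

reusable_alora = shared_prefix_blocks(base, alora)
reusable_lora = shared_prefix_blocks(base, plain_lora)
mask = activation_mask(len(request), 10)
```

Under these assumptions, the two full blocks written by the base model are reusable by the aLoRA request, while a conventional adapter that keys every block by adapter ID reuses none. The mask marks only positions 10 onward as adapted, which is what lets the earlier key-value entries remain valid for both models.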

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Neural Network Applications