ExpertWeave: Efficiently Serving Expert-Specialized Fine-Tuned Adapters at Scale

Ge Shi; Hanieh Sadri; Qian Wang; Yu Zhang; Ying Xiong; Yong Zhang; Zhenan Fan

arXiv:2508.17624 · cs.DC · August 26, 2025

TL;DR

ExpertWeave is a system that enables efficient, scalable serving of multiple expert-specialized adapters for large language models, significantly reducing memory usage and increasing throughput with minimal latency overhead.

Contribution

The paper introduces a system that serves multiple ESFT adapters concurrently over a single shared MoE base model, with minimal resource overhead and non-intrusive integration into existing inference pipelines.

Findings

01 Serves multiple adapters on a single accelerator where the baseline fails

02 Achieves up to 94x more KV cache capacity and up to 18% higher throughput

03 Maintains low latency overhead even with 20 adapters

Abstract

Expert-Specialized Fine-Tuning (ESFT) adapts Mixture-of-Experts (MoE) large language models to enhance their task-specific performance by selectively tuning the top-activated experts for the task. Serving these fine-tuned models at scale is challenging: deploying merged models in isolation is prohibitively resource-hungry, while existing multi-adapter serving systems with LoRA-style additive updates are incompatible with ESFT's expert-oriented paradigm. We present ExpertWeave, a system that serves multiple ESFT adapters concurrently over a single shared MoE base model, drastically reducing the memory footprint and improving resource utilization. To seamlessly integrate into existing inference pipelines for MoE models with non-intrusive modifications and minimal latency overhead, ExpertWeave introduces a virtual-memory-assisted expert weight manager that co-locates base-model and adapter…
