SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
Sunghyeon Woo, Ahreum Seo, Jaegwang Lee, Jaeeun Kil, Hanbae Seo, Joonghoon Kim, Baeseong Park, Se Jung Kwon, Dongsoo Lee

TL;DR
SUN introduces a novel approach for multi-LLM serving that enables cross-model sharing of decode execution, significantly improving GPU utilization and throughput while maintaining accuracy and supporting low-bit decoding.
Contribution
The paper presents SUN, a method that decomposes Transformer decoders to allow shared decoding across models, reducing resource usage and increasing efficiency in multi-LLM serving.
Findings
SUN achieves up to 2.0x throughput improvement per GPU.
SUN maintains accuracy comparable to full fine-tuning.
QSUN, the low-bit decoding variant of SUN, speeds up decoding by 45% with similar accuracy.
Abstract
In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer…
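The routing idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the `DecodeWorker` class, the two routing functions, and the least-loaded selection rule are all hypothetical, chosen only to show why a frozen decode module shared across models lets the router ignore which model a request belongs to. The abstract itself says only that the policy balances decode requests across shared workers.

```python
from dataclasses import dataclass, field


@dataclass
class DecodeWorker:
    """Hypothetical decode worker; in the shared case it holds the frozen decode module."""
    name: str
    active: list = field(default_factory=list)  # (request_id, model_id) pairs in its batch


def route_partitioned(requests, workers_by_model):
    """Baseline: each model owns its own workers, so requests cannot cross models."""
    for req_id, model_id in requests:
        pool = workers_by_model[model_id]
        target = min(pool, key=lambda w: len(w.active))
        target.active.append((req_id, model_id))


def route_shared(requests, workers):
    """Model-agnostic routing (assumed least-loaded rule): any worker can take any request."""
    for req_id, model_id in requests:
        target = min(workers, key=lambda w: len(w.active))
        target.active.append((req_id, model_id))


def loads(workers):
    """Number of in-flight decode requests per worker."""
    return [len(w.active) for w in workers]


# Skewed workload: 7 decode requests for model "A", 1 for model "B".
requests = [(i, "A") for i in range(7)] + [(7, "B")]

part_workers = [DecodeWorker("w0"), DecodeWorker("w1")]
route_partitioned(requests, {"A": [part_workers[0]], "B": [part_workers[1]]})

shared_workers = [DecodeWorker("w0"), DecodeWorker("w1")]
route_shared(requests, shared_workers)

partitioned_loads = loads(part_workers)  # [7, 1]: one worker overloaded, one nearly idle
shared_loads = loads(shared_workers)     # [4, 4]: balanced across shared workers
print(partitioned_loads, shared_loads)
```

Under the skewed workload, the partitioned baseline leaves one worker nearly idle, while the shared policy spreads the same requests evenly. A real router would likely weigh memory and batch-size limits as well, which this sketch omits.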
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Network Packet Processing and Optimization
