SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Sunghyeon Woo; Ahreum Seo; Jaegwang Lee; Jaeeun Kil; Hanbae Seo; Joonghoon Kim; Baeseong Park; Se Jung Kwon; Dongsoo Lee

arXiv:2603.02599 · cs.AI · March 4, 2026

TL;DR

SUN enables cross-model sharing of decode execution in multi-LLM serving, improving GPU utilization and raising per-GPU throughput by up to 2.0x while keeping accuracy comparable to full fine-tuning and supporting low-bit decoding.

Contribution

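The paper presents SUN, a method that decomposes a decoder-only Transformer into a task-specific prefill module, which is fine-tuned, and a frozen decode module, which is shared across models. Sharing the decode module lets decode requests from different models run on the same workers, reducing the number of GPUs needed in multi-LLM serving.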

Findings

01. SUN achieves up to 2.0x throughput improvement per GPU.
02. SUN maintains accuracy comparable to full fine-tuning.
03. QSUN speeds up decoding by 45% with similar accuracy.

Abstract

In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer…
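The utilization argument in the abstract can be illustrated with a small simulation. The sketch below is not the paper's routing policy; it assumes a simple least-loaded rule, and the worker names, the request mix, and both function names are hypothetical. It compares model-specific partitioning, where a request can only go to workers dedicated to its model, with model-agnostic routing, where any shared decode worker can take any request.

```python
from collections import Counter


def route_partitioned(requests, workers_by_model):
    """Baseline (assumed): a request may only use workers dedicated to its model."""
    load = Counter()
    for model in requests:
        candidates = workers_by_model[model]
        # Least-loaded choice, restricted to this model's own workers.
        target = min(candidates, key=lambda w: load[w])
        load[target] += 1
    return load


def route_shared(requests, workers):
    """Model-agnostic routing (assumed least-loaded rule): any worker serves any request."""
    load = Counter({w: 0 for w in workers})
    for _ in requests:
        # The request's model is ignored because the decode module is shared.
        target = min(workers, key=lambda w: load[w])
        load[target] += 1
    return load


# Hypothetical skewed workload: nine requests for model "A", one for model "B".
requests = ["A"] * 9 + ["B"]
partitioned = route_partitioned(requests, {"A": ["w0"], "B": ["w1"]})
shared = route_shared(requests, ["w0", "w1"])

print("partitioned:", dict(partitioned))
print("shared:     ", dict(shared))
```

Under partitioning, the worker for the popular model carries nine requests while the other carries one; with shared routing the same ten requests split evenly. A real system would balance on batch occupancy or KV-cache memory rather than a raw request count.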

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Network Packet Processing and Optimization