SOMA: Efficient Multi-turn LLM Serving via Small Language Model

Xueqi Cheng; Qiong Wu; Zhengyi Zhou; Xugui Zhou; Tyler Derr; Yushun Dong

arXiv:2605.11317·cs.CL·May 13, 2026

TL;DR

SOMA introduces a framework that uses small language models with learned prompts to efficiently serve multi-turn LLM conversations, reducing costs while maintaining response quality.

Contribution

SOMA adapts a small surrogate model to the local region of an ongoing dialogue using soft prompts and LoRA, aiming to balance efficiency and coherence in multi-turn LLM serving.
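The serving pattern this implies can be pictured as a simple router: the large model answers a few early turns, and a surrogate adapted on those turns answers the rest. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `serve_session`, `warmup_turns`, and `adapt` are hypothetical, the soft-prompt and LoRA adaptation is hidden behind the `adapt` callable, and passing the full history to the surrogate is an assumption made for simplicity.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]          # (user_message, reply) pairs
Model = Callable[[History, str], str]    # (history, user_message) -> reply


def serve_session(
    messages: List[str],
    large_model: Model,
    surrogate_model: Model,
    adapt: Callable[[Model, History], Model],
    warmup_turns: int = 3,
) -> Tuple[History, List[str]]:
    """Route early turns to the large model, later turns to an adapted surrogate.

    `adapt` is a hypothetical hook standing in for whatever adaptation step
    is used (for example soft prompts plus LoRA); it is called once, after
    the warm-up turns have been collected.
    """
    history: History = []
    routed: List[str] = []
    for turn, message in enumerate(messages):
        if turn < warmup_turns:
            reply = large_model(history, message)
            routed.append("large")
        else:
            if turn == warmup_turns:
                # Adapt the surrogate once, using only the warm-up turns.
                surrogate_model = adapt(surrogate_model, history)
            reply = surrogate_model(history, message)
            routed.append("surrogate")
        history.append((message, reply))
    return history, routed


# Stub models so the sketch runs without any real LLM.
def stub_large(history: History, message: str) -> str:
    return "large:" + message


def stub_small(history: History, message: str) -> str:
    return "small:" + message


def stub_adapt(model: Model, history: History) -> Model:
    return model  # a real adaptation step would update the surrogate here


_, route = serve_session(["q1", "q2", "q3", "q4"], stub_large, stub_small,
                         stub_adapt, warmup_turns=2)
print(route)
```

Keeping the adaptation behind a single callable makes the routing logic independent of how the surrogate is actually tuned.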

Findings

01. SOMA achieves significant cost reduction in multi-turn dialogue serving.

02. The method maintains high response quality comparable to larger models.

03. Extensive experiments validate the effectiveness of SOMA.

Abstract

Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize…
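To make the cost of full-history concatenation concrete, the following sketch counts the prompt tokens sent over a session when every turn resends all earlier turns. The function name and the per-turn token counts are illustrative assumptions, not figures from the paper.

```python
from typing import List, Tuple


def full_history_prompt_tokens(turns: List[Tuple[int, int]]) -> int:
    """Total prompt tokens sent when each turn resends the whole history.

    `turns` holds (user_tokens, reply_tokens) for each turn. The prompt at a
    given turn is all earlier user messages and replies plus the new user
    message.
    """
    total = 0
    history = 0
    for user_tokens, reply_tokens in turns:
        total += history + user_tokens
        history += user_tokens + reply_tokens
    return total


# Hypothetical session: 10 turns, 50-token questions, 150-token answers.
session = [(50, 150)] * 10
print(full_history_prompt_tokens(session))  # 9500
print(sum(user for user, _ in session))     # 500 tokens if no history were resent
```

With uniform turns the total grows quadratically in the number of turns, which is why long sessions routed to a large model become expensive.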

Code & Models

Repositories

LabRAI/SOMA
github
