LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues

Haoyang Li; Zhanchao Xu; Yiming Li; Xuejia Chen; Darian Li; Anxin Tian; Qingfa Xiao; Cheng Deng; Jun Wang; Qing Li; Lei Chen; Mingxuan Yuan

arXiv:2507.13681 · cs.CL · September 29, 2025

TL;DR

LoopServe is an adaptive framework that accelerates large language model inference in multi-turn dialogues by dynamically selecting relevant context and compressing key-value caches, improving efficiency without sacrificing response quality.

Contribution

The paper introduces a dual-phase adaptive inference method that combines online sparsification with progressive compression, addressing the limitations of fixed heuristics in multi-turn dialogue processing.

Findings

01. Significantly faster inference across multiple dialogue datasets.

02. Outperforms existing acceleration baselines in effectiveness.

03. Maintains high response quality with reduced computational cost.

Abstract

Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key-value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. As a result, these methods cannot accurately identify and prioritize the most relevant context, leading to degraded response quality. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two…
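To make the abstract's contrast concrete, the toy sketch below compares a fixed position-based rule (keep the most recent cache entries) with a selection driven by observed attention weights. It is a generic illustration, not LoopServe's algorithm: the function names `keep_recent` and `keep_by_attention`, the attention matrix, and the budget are all assumptions made up for this example.

```python
from typing import List


def keep_recent(num_tokens: int, budget: int) -> List[int]:
    """Position-based heuristic: keep only the most recent `budget` cache entries."""
    start = max(0, num_tokens - budget)
    return list(range(start, num_tokens))


def keep_by_attention(attn: List[List[float]], budget: int) -> List[int]:
    """Score-based selection (hypothetical helper, not the paper's method).

    attn[q][k] is the attention weight from query q to cached position k.
    Keeps the `budget` positions that receive the most total attention.
    """
    num_tokens = len(attn[0])
    # Total attention each cached position receives across all query rows.
    scores = [sum(row[k] for row in attn) for k in range(num_tokens)]
    # Rank by descending score, breaking ties by lower index.
    ranked = sorted(range(num_tokens), key=lambda k: (-scores[k], k))
    return sorted(ranked[:budget])


# Toy input: 6 cached tokens and 2 recent queries that attend mostly to
# position 1 (an early turn) and position 4 (a recent one).
attn = [
    [0.05, 0.50, 0.05, 0.05, 0.30, 0.05],
    [0.05, 0.40, 0.05, 0.05, 0.40, 0.05],
]
recent = keep_recent(num_tokens=6, budget=2)
relevant = keep_by_attention(attn, budget=2)
print(recent)    # [4, 5]
print(relevant)  # [1, 4]
```

With the same budget of two entries, the position-based rule discards position 1 even though the queries attend to it heavily, which is the kind of mismatch the abstract attributes to fixed heuristics.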


Code & Models

Datasets

TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs
dataset · 106 dl
