SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching

Hong Chen; Xiang Liu; Bo Wang; Yuxuan Fan; Yuanlin Chu; Zongluo Li; Xiaowen Chu; Xuming Hu

arXiv:2601.21927 · cs.CL · January 30, 2026

TL;DR

SONIC is a learning-based framework that compresses the history of a multi-turn dialogue into compact Nexus tokens, shrinking the KV cache of a large language model while limiting context loss.

Contribution

SONIC introduces a learning-based compression method for multi-turn dialogue context that outperforms existing baselines such as H2O and StreamingLLM and, through dynamic budget training, adapts to varying memory constraints without retraining.

Findings

01

At compression ratios of 80% and 50%, outperforms baselines such as H2O and StreamingLLM on four multi-turn benchmarks.

02

Improves the average MTBench101 score by 35.55% over state-of-the-art baselines.

03

Speeds up inference by 50.1% compared to full-context generation.

Abstract

The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in…
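
To make the "linear growth" in the abstract concrete, here is a minimal back-of-the-envelope sketch. It is not the SONIC method. It only counts cache entries. The function names, the `keep_percent` parameter, and the choice to leave the current turn uncompressed are assumptions made for illustration. The abstract does not say whether an 80% compression ratio refers to the fraction removed or the fraction retained, so the retained percentage is passed explicitly.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard KV cache: one key and one value vector per token,
    per layer, per KV head. Grows linearly in num_tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def compressed_token_count(segment_lengths, keep_percent):
    """Token count after replacing each historical segment with
    ceil(keep_percent% of its length) stand-in tokens.

    Assumption for illustration: the last segment (the current turn)
    is kept whole, and every earlier segment gets the same budget.
    """
    *history, current = segment_lengths
    # Integer ceiling division avoids floating-point rounding surprises.
    kept = sum((n * keep_percent + 99) // 100 for n in history)
    return kept + current
```

For example, with hypothetical segment lengths `[100, 100, 100, 50]` and `keep_percent=20`, the cached token count drops from 350 to 110, and `kv_cache_bytes` shrinks by the same factor because it is linear in the token count.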

Taxonomy

Topics: Caching and Content Delivery · Topic Modeling · Parallel Computing and Optimization Techniques