SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching
Hong Chen, Xiang Liu, Bo Wang, Yuxuan Fan, Yuanlin Chu, Zongluo Li, Xiaowen Chu, Xuming Hu

TL;DR
SONIC is a learning-based framework that compresses multi-turn dialogue context into compact tokens, significantly improving efficiency and performance in large language model applications with minimal context loss.
Contribution
SONIC introduces a learned compression method for multi-turn dialogue contexts that outperforms existing baselines and, through dynamic budget training, adapts to varying memory constraints without retraining.
Findings
At compression ratios of 80% and 50%, outperforms baselines such as H2O and StreamingLLM on four multi-turn benchmarks.
Improves the average MTBench101 score by 35.55% over state-of-the-art baselines.
Speeds up inference by 50.1% compared to full-context generation.
Abstract
The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in…
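To make the "linear growth" claim concrete, here is a minimal back-of-the-envelope sketch in Python. It is not SONIC's method, only the standard KV cache size arithmetic for a decoder-only transformer. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the `keep_fraction` parameter are hypothetical choices for illustration. The abstract does not say here whether an 80% ratio counts tokens kept or tokens removed, so the sketch takes the kept fraction as an explicit input rather than guessing.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Two tensors (K and V) per layer, each of shape
    # num_tokens x num_kv_heads x head_dim, so the total is linear in num_tokens.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


def compressed_tokens(history_tokens, keep_fraction):
    """Cache slots left when history is reduced to keep_fraction of its length.

    keep_fraction is a hypothetical parameter, not a quantity defined in the paper.
    """
    return max(1, round(history_tokens * keep_fraction))


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

    full = kv_cache_bytes(4096, **shape)
    kept = compressed_tokens(4096, keep_fraction=0.5)
    small = kv_cache_bytes(kept, **shape)

    print(f"full history:   {full / 2**20:.0f} MiB for 4096 tokens")
    print(f"compressed:     {small / 2**20:.0f} MiB for {kept} tokens")
```

Under these assumed dimensions each cached token costs 128 KiB, so every additional dialogue turn adds memory in direct proportion to its length, and shrinking the cached history by some fraction shrinks the cache by the same fraction.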
Taxonomy
Topics: Caching and Content Delivery · Topic Modeling · Parallel Computing and Optimization Techniques
