When Less is More: The LLM Scaling Paradox in Context Compression
Ruishan Guo, Yibing Liu, Guoxin Ma, Yan Wang, Yueyang Zhang, Long Xia, Kecheng Chen, Zhiyuan Sun, Daiting Shi

TL;DR
This paper uncovers a paradox in LLM context compression: a larger compressor can reconstruct the context less faithfully even as reconstruction error falls, because of knowledge overwriting and semantic drift.
Contribution
It identifies and analyzes the Size-Fidelity Paradox in context compression, revealing how larger models can harm faithful context reconstruction.
Findings
Mid-sized compressors outperform larger ones in faithful recovery.
Larger models tend to overwrite facts and paraphrase content, reducing fidelity.
Compressors organize memory into broader semantic subspaces, increasing ambiguity.
Abstract
Scaling up model parameters has long been a prevalent training paradigm, driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor-decoder setup, we find a Size-Fidelity Paradox: increasing compressor size can reduce the faithfulness of reconstructed contexts even as reconstruction error decreases. Across 27 compressor setups spanning model families, scales, and compression rates, we attribute this paradox to two dominant factors: 1) knowledge overwriting: larger models increasingly replace source facts with their own prior beliefs, e.g., "the white strawberry" → "the red strawberry"; and 2) semantic drift: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, e.g., "Alice hit Bob" → "Bob hit…
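To make the gap between surface similarity and faithfulness concrete, here is a minimal sketch using the abstract's strawberry example. The two functions below (`token_overlap` and `is_verbatim`) are hypothetical stand-ins chosen for illustration; the text above does not specify the paper's actual reconstruction-error or fidelity metrics, so this should not be read as the authors' evaluation method.

```python
from collections import Counter


def token_overlap(source: str, reconstruction: str) -> float:
    """Fraction of source tokens that also appear in the reconstruction.

    A bag-of-words proxy for surface similarity (an assumption for this
    sketch, not a metric taken from the paper).
    """
    src = Counter(source.lower().split())
    rec = Counter(reconstruction.lower().split())
    shared = sum((src & rec).values())  # multiset intersection
    return shared / max(sum(src.values()), 1)


def is_verbatim(source: str, reconstruction: str) -> bool:
    """Strict faithfulness check: the reconstruction must match the source exactly."""
    return source.strip().lower() == reconstruction.strip().lower()


# Knowledge overwriting, using the example pair from the abstract.
source = "the white strawberry"
reconstruction = "the red strawberry"

overlap = token_overlap(source, reconstruction)
faithful = is_verbatim(source, reconstruction)

# Two of three source tokens survive, yet the stated fact has changed.
print(f"token overlap: {overlap:.2f}, verbatim: {faithful}")
```

Under these assumptions, a reconstruction can score well on an overlap-style measure while failing the strict check, which is the kind of divergence the paradox describes.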
