Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko

TL;DR
This paper introduces a method to detect token overflow: the point at which compressed token representations in a soft-compression LLM no longer carry enough information to answer a given query. This limit matters when compression is used to extend effective context length in resource-constrained environments.
Contribution
The paper defines token overflow and proposes a methodology to characterize and detect it in compressed representations, moving from query-agnostic to query-aware detection.
Findings
Query-agnostic saturation statistics can distinguish compressed from uncompressed tokens.
Query-aware classifiers achieve 0.72 AUC-ROC in overflow detection.
Incorporating query information improves detection accuracy.
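For readers unfamiliar with the metric behind the 0.72 figure, AUC-ROC is the probability that a detector scores a randomly chosen positive example (here, an overflow case) above a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. The function name `auc_roc` and the toy scores and labels are illustrative assumptions, not data or code from the paper.

```python
def auc_roc(scores, labels):
    """AUC-ROC as a pairwise ranking probability.

    scores: detector outputs, higher meaning "more likely overflow".
    labels: 1 for overflow, 0 for no overflow.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores for four examples (not from the paper).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc_roc(scores, labels))  # 3 of 4 positive/negative pairs are ordered correctly
```

Under this reading, a score of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The quadratic loop is chosen for clarity; a rank-based computation gives the same value on larger datasets.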
Abstract
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow…
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning and Algorithms · Topic Modeling
