Context Embeddings for Efficient Answer Generation in RAG
David Rau, Shuai Wang, Hervé Déjean, Stéphane Clinchant

TL;DR
This paper introduces COCOM, a context compression technique for RAG that significantly speeds up answer generation by reducing long contexts to embeddings, with adjustable quality-speed trade-offs.
Contribution
COCOM is a novel context compression method that efficiently handles multiple contexts, improving decoding speed and answer quality over previous approaches.
Findings
Achieves up to 5.69× speed-up in decoding time.
Outperforms existing context compression methods in quality and efficiency.
Effectively manages multiple contexts for faster answer generation.
Abstract
Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69× while achieving higher performance than existing efficient context compression methods.
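To make the compression-rate trade-off concrete, the following is a minimal arithmetic sketch, not the authors' implementation. It assumes each retrieved context is compressed independently into ceil(length / rate) Context Embeddings and that the question is passed through uncompressed; the function name, the context lengths, the question length, and the rates are all hypothetical values chosen for illustration.

```python
import math


def decoder_input_lengths(context_lengths, compression_rate, question_length):
    """Return (uncompressed, compressed) decoder input lengths.

    Assumption (not taken from the paper): a context of n tokens is
    replaced by ceil(n / compression_rate) embeddings, and each context
    is compressed on its own.
    """
    if compression_rate < 1:
        raise ValueError("compression_rate must be >= 1")
    uncompressed = sum(context_lengths) + question_length
    n_embeddings = sum(math.ceil(n / compression_rate) for n in context_lengths)
    return uncompressed, n_embeddings + question_length


# Hypothetical setup: 5 retrieved contexts of 128 tokens, a 20-token question.
contexts = [128] * 5
for rate in (4, 16, 128):
    full, short = decoder_input_lengths(contexts, rate, question_length=20)
    print(f"rate={rate}: {full} positions -> {short} positions")
```

A shorter decoder input is only a proxy for decoding time; the speed-up actually measured depends on the model, the hardware, and the length of the generated answer, so these counts should not be read as predicted speed-ups.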
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems
