Clustered Retrieval Augmented Generation (CRAG)
Simon Akesson, Frances A. Santos

TL;DR
CRAG is a novel method that significantly reduces prompt token usage in retrieval-augmented generation with LLMs while maintaining response quality; the efficiency gain is most pronounced with larger external knowledge sources.
Contribution
CRAG introduces an effective approach to minimize prompt tokens in RAG, addressing limitations of context window size and cost without degrading output quality.
Findings
CRAG reduces the number of prompt tokens by at least 46% and by up to 90% relative to RAG.
CRAG's token count stays stable as the number of reviews in the external knowledge source grows.
CRAG outperforms RAG in handling larger external knowledge sources.
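To make the reported percentages concrete, the arithmetic can be sketched as below. This is only an illustration: the 10,000-token baseline prompt and the helper names `tokens_after_reduction` and `token_reduction` are assumptions introduced here, not figures or code from the paper.

```python
def tokens_after_reduction(baseline_tokens: int, reduction_pct: int) -> int:
    """Prompt tokens left after removing reduction_pct percent of the baseline."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return baseline_tokens * (100 - reduction_pct) // 100


def token_reduction(baseline_tokens: int, reduced_tokens: int) -> float:
    """Fractional reduction of a prompt relative to the baseline prompt."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - reduced_tokens / baseline_tokens


# Hypothetical baseline: a RAG prompt of 10,000 tokens (not a number from the paper).
baseline = 10_000
lower_bound = tokens_after_reduction(baseline, 46)  # reported minimum reduction
upper_bound = tokens_after_reduction(baseline, 90)  # reported maximum reduction

print(f"46% reduction: {lower_bound} tokens remain")
print(f"90% reduction: {upper_bound} tokens remain")
```

Under these assumed numbers, the same query would need between one tenth and roughly half of the baseline prompt, which is where the cost and context-window savings come from.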
Abstract
Providing external knowledge to Large Language Models (LLMs) is key to using these models in real-world applications for several reasons, such as incorporating up-to-date content in real time, providing access to domain-specific knowledge, and helping to prevent hallucinations. The vector database-based Retrieval Augmented Generation (RAG) approach has been widely adopted to this end: any part of the external knowledge can be retrieved and provided to an LLM as input context. Despite the RAG approach's success, it may still be infeasible for some applications, because the retrieved context can demand a longer context window than the LLM supports. Even when the retrieved context fits into the context window, the number of tokens might be large and, consequently, increase costs and processing time, becoming impractical for most applications.…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare
