CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering
Hao Yang, Zhiyu Yang, Xupeng Zhang, Wei Wei, Yunjie Zhang, Lin Yang

TL;DR
CompactRAG reduces both the number of LLM calls and the token usage in multi-hop question answering by restructuring knowledge offline and decomposing queries online, maintaining accuracy while improving efficiency.
Contribution
The paper introduces a framework that decouples offline corpus restructuring from online reasoning, which minimizes LLM interactions in multi-hop RAG and lowers token consumption.
Findings
Achieves competitive accuracy on multiple datasets.
Reduces token consumption compared to iterative RAG.
Invokes the LLM only twice during inference, regardless of the number of reasoning hops.
Abstract
Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once…
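The online stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the abstract is cut off before it says what the two LLM calls are, so the sketch assumes one call for decomposition and one for final synthesis; `CountingLLM` is a hypothetical stub with canned plans rather than a real model client; token-overlap matching stands in for dense retrieval; returning the stored answer stands in for RoBERTa-based extraction; and the `#k` placeholder scheme is one hypothetical way to keep entities consistent across hops.

```python
import re
from dataclasses import dataclass, field

# Toy atomic QA knowledge base (assumed to have been built offline).
KB = [
    ("Who directed Inception?", "Christopher Nolan"),
    ("Who directed Titanic?", "James Cameron"),
    ("Where was Christopher Nolan born?", "London"),
    ("Which country is London the capital of?", "United Kingdom"),
]

# Canned decompositions so the stub needs no real model (hypothetical).
PLANS = {
    "Where was the director of Inception born?": [
        "Who directed Inception?",
        "Where was #1 born?",
    ],
    "Which country has as its capital the birthplace of the director of Inception?": [
        "Who directed Inception?",
        "Where was #1 born?",
        "Which country is #2 the capital of?",
    ],
}


@dataclass
class CountingLLM:
    """Hypothetical LLM stub that only counts how often it is invoked."""
    plans: dict = field(default_factory=dict)
    calls: int = 0

    def decompose(self, question: str) -> list:
        self.calls += 1  # assumed LLM call 1: query decomposition
        return self.plans[question]

    def synthesize(self, question: str, facts: list) -> str:
        self.calls += 1  # assumed LLM call 2: final answer synthesis
        return facts[-1]


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(kb: list, query: str) -> tuple:
    """Pick the QA pair whose question overlaps most with the query.
    Jaccard overlap is a stand-in for dense retrieval."""
    q = tokens(query)
    return max(kb, key=lambda pair: len(q & tokens(pair[0])) / len(q | tokens(pair[0])))


def rewrite(sub_question: str, answers: list) -> str:
    """Replace '#k' with the answer to sub-question k so that each hop
    names its entity explicitly (assumed rewriting scheme)."""
    return re.sub(r"#(\d+)", lambda m: answers[int(m.group(1)) - 1], sub_question)


def answer(question: str, llm: CountingLLM, kb: list) -> str:
    sub_questions = llm.decompose(question)
    answers = []
    for sub_q in sub_questions:
        grounded = rewrite(sub_q, answers)
        _, extracted = retrieve(kb, grounded)  # no LLM call inside the loop
        answers.append(extracted)
    return llm.synthesize(question, answers)


for q in PLANS:
    llm = CountingLLM(plans=PLANS)
    print(answer(q, llm, KB), llm.calls)
```

In this toy setup the loop over hops never touches the LLM, so the two-hop and three-hop questions both finish with `llm.calls` equal to 2. That is the property the abstract highlights; everything else in the sketch is scaffolding.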
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Advanced Graph Neural Networks
