LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
Yijia Zheng, Marcel Worring

TL;DR
LatentRAG moves reasoning and retrieval in agentic RAG from discrete language space into a continuous latent space, reducing inference latency by approximately 90% while keeping accuracy comparable to explicit methods on complex question answering tasks.
Contribution
The paper proposes a latent-space framework that aligns reasoning and retrieval, enabling efficient multi-step question answering with reduced latency.
Findings
Achieves comparable accuracy to explicit methods on benchmark datasets.
Reduces inference latency by approximately 90%.
Supports end-to-end joint optimization of reasoning and retrieval.
Abstract
Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries…
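The contrast the abstract draws between explicit and latent agentic loops can be illustrated with a toy sketch. This is not the authors' implementation: the names `embed`, `retrieve`, `explicit_agent`, and `latent_agent`, the per-step token counts, and the state-update rule are all hypothetical, and the assumption that a latent token is used directly as a dense retrieval query is ours, since the abstract is truncated before it describes that step. The sketch only counts decoder forward passes, a rough proxy for the autoregressive latency the abstract refers to.

```python
import math
import random

DIM = 8  # toy embedding size (assumption, not from the paper)


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def embed(text, dim=DIM):
    """Hypothetical stand-in encoder: a deterministic pseudo-random unit vector per text."""
    rng = random.Random(text)  # a str seed is hashed deterministically
    return normalize([rng.gauss(0, 1) for _ in range(dim)])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, corpus, k=1):
    """Dense retrieval: rank documents by inner product with the query vector."""
    ranked = sorted(corpus, key=lambda d: dot(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def explicit_agent(question, corpus, steps=2, tokens_per_step=20):
    """Explicit loop: each step decodes a textual thought/subquery token by token."""
    decode_calls = 0
    context = question
    evidence = []
    for _ in range(steps):
        decode_calls += tokens_per_step  # one forward pass per generated token
        subquery = context  # placeholder for the generated subquery text
        docs = retrieve(embed(subquery), corpus)  # text must be re-encoded to search
        evidence += docs
        context = context + " " + docs[0]
    return evidence, decode_calls


def latent_agent(question, corpus, steps=2, latent_tokens_per_step=1):
    """Latent loop (assumed): each step emits a few latent vectors used directly as the query."""
    decode_calls = 0
    state = embed(question)
    evidence = []
    for _ in range(steps):
        decode_calls += latent_tokens_per_step  # far fewer forward passes per step
        docs = retrieve(state, corpus)  # no text decoding or re-encoding
        evidence += docs
        # Toy state update (assumption): fold the retrieved document into the state.
        state = normalize([s + d for s, d in zip(state, embed(docs[0]))])
    return evidence, decode_calls


corpus = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Berlin is the capital of Germany.",
]
question = "Which river flows through the capital of France?"
_, explicit_calls = explicit_agent(question, corpus)
_, latent_calls = latent_agent(question, corpus)
print(explicit_calls, latent_calls)  # 40 2 under the assumed token counts
```

The 40-versus-2 gap follows entirely from the assumed token counts and says nothing about the paper's measured latency; the point is only that the per-step cost scales with the number of tokens decoded, which is what a latent formulation shrinks.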
