Latent Abstraction for Retrieval-Augmented Generation
Ha Lan N.T, Minh-Anh Nguyen, Dung D. Le

TL;DR
LAnR introduces a unified LLM framework that encodes, retrieves, and generates within its latent space, improving retrieval efficiency and performance on QA tasks by eliminating separate retriever components.
Contribution
The paper proposes a unified approach in which the LLM performs retrieval and generation jointly in latent space, removing the need for explicit query generation and a separate retriever module.
Findings
LAnR outperforms existing RAG methods on six QA benchmarks.
It reduces the number of retrieval calls, enhancing inference efficiency.
The entropy of the model's answer tokens is an effective signal of whether enough evidence has been retrieved.
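To illustrate the entropy finding above, here is a minimal sketch of how answer-token entropy could be turned into a stop-retrieving decision. The summary does not say how LAnR aggregates or thresholds entropy, so the mean over answer positions, the function names, and the threshold value of 0.5 nats are all assumptions made for illustration, and the probability lists are toy data rather than real model outputs.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_answer_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average entropy over all answer-token positions (assumed aggregation)."""
    if not step_probs:
        raise ValueError("need at least one answer-token distribution")
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


def retrieval_is_sufficient(
    step_probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> bool:
    """Treat low average entropy as a sign that no further retrieval is needed."""
    return mean_answer_entropy(step_probs) < threshold


# Toy distributions: one sharply peaked answer, one diffuse answer.
confident = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

print(retrieval_is_sufficient(confident))
print(retrieval_is_sufficient(uncertain))
```

With these toy inputs the peaked answer falls below the cutoff and the diffuse answer does not, so only the second case would trigger another retrieval step under this assumed rule.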
Abstract
Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose LAnR (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated [PRED] token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient…
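The latent retrieval step described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the hidden states and document vectors are hand-written lists standing in for real model activations, the token id 99 for [PRED] is hypothetical, and the use of cosine similarity with top-k selection is an assumption, since the abstract does not specify the matching function.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pred_query_vector(
    hidden_states: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    pred_token_id: int,
) -> Sequence[float]:
    """Return the hidden state at the last position holding the [PRED] token."""
    for pos in range(len(token_ids) - 1, -1, -1):
        if token_ids[pos] == pred_token_id:
            return hidden_states[pos]
    raise ValueError("[PRED] token not found in the sequence")


def retrieve(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Rank documents by similarity to the query vector and keep the top k."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data: three token positions, the last one being [PRED].
PRED_ID = 99  # hypothetical token id
token_ids = [5, 17, PRED_ID]
hidden_states = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.1], [0.9, 0.1, 0.0]]

# Document vectors, assumed to come from the same model's encoder pass.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]

query = pred_query_vector(hidden_states, token_ids, PRED_ID)
top = retrieve(query, doc_vecs, k=2)
print([doc_index for doc_index, _ in top])
```

The point of the sketch is that no textual query is ever produced: the vector used for matching is read directly from the model's hidden state, and documents are compared in that same space.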
