Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler

TL;DR
Multi-Head RAG (MRAG) enhances retrieval-augmented generation by using multi-head attention activations to better capture diverse semantic aspects, significantly improving retrieval success for complex, multi-aspect queries.
Contribution
Introducing MRAG, a novel approach that uses Transformer multi-head attention as retrieval keys to improve multi-aspect document retrieval in RAG systems.
Findings
Up to 20% higher retrieval success ratios on real-world tasks.
MRAG outperforms 18 RAG baselines in complex query retrieval.
Improved downstream language model generation quality.
Abstract
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by retrieving supporting documents into the prompt, but existing methods do not explicitly target queries that require fetching multiple documents with substantially different content. Such multi-aspect queries are challenging because relevant documents can be far apart in embedding space, making joint retrieval difficult. We introduce Multi-Head RAG (MRAG), which addresses this gap with a simple yet powerful idea: using Transformer multi-head attention activations, rather than the standard decoder-layer embedding, as retrieval keys. It leverages the observation that different heads capture different semantic aspects. This yields multi-aspect embeddings for both documents and queries, improving retrieval accuracy on complex queries. We show MRAG's design advantages over 18 RAG baselines, up to 20% higher retrieval…
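The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the attention activation is a single vector that can be cut into equal per-head slices, uses hand-written toy vectors instead of real model activations, and merges per-head results with a simple rank-discounted vote. The names `split_heads`, `multi_head_retrieve`, `per_head_k`, and `head_weights` are hypothetical.

```python
import numpy as np


def split_heads(activation, num_heads):
    """Cut a (d,) activation vector into (num_heads, d // num_heads) slices."""
    activation = np.asarray(activation, dtype=float)
    if activation.shape[-1] % num_heads != 0:
        raise ValueError("dimension must be divisible by num_heads")
    return activation.reshape(num_heads, -1)


def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-12)
    return d @ q


def multi_head_retrieve(query_act, doc_acts, num_heads,
                        per_head_k=1, top_k=2, head_weights=None):
    """Retrieve per head, then merge with a rank-discounted vote (assumed scheme)."""
    q_heads = split_heads(query_act, num_heads)                        # (h, d/h)
    d_heads = np.stack([split_heads(d, num_heads) for d in doc_acts])  # (n, h, d/h)
    if head_weights is None:
        head_weights = np.ones(num_heads)  # assumption: all heads equally important

    votes = np.zeros(len(doc_acts))
    for h in range(num_heads):
        scores = cosine_scores(q_heads[h], d_heads[:, h, :])
        ranked = np.argsort(-scores)[:per_head_k]
        for rank, doc_id in enumerate(ranked):
            votes[doc_id] += head_weights[h] / (rank + 1)

    order = np.argsort(-votes, kind="stable")[:top_k]
    return [int(i) for i in order]


# Toy data: 4-dimensional "activations" with 2 heads of size 2.
query = np.array([1.0, 0.0, 0.0, 1.0])
docs = [
    np.array([1.0, 0.0, 0.0, -1.0]),   # matches the query only in head 0
    np.array([-1.0, 0.0, 0.0, 1.0]),   # matches the query only in head 1
    np.array([0.0, 1.0, 1.0, 0.0]),    # matches the query in neither head
]
result = multi_head_retrieve(query, docs, num_heads=2)
```

In this toy setup each of the first two documents agrees with the query in exactly one head, so searching per head surfaces both of them, while the third document receives no votes. Because the slices together have the same total dimensionality as the original vector, the sketch stores no more numbers per document than a single-embedding index would.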
Peer Reviews
Decision: Submitted to ICLR 2026
- The proposed method for using multiple blocks and their multi-head attentions is novel. - The authors clarify the position of their proposed method by carefully referring to conventional models. - The authors created a new benchmark dataset that requires retrieving multi-aspect documents. - The experiments on the created benchmark dataset show the effectiveness of MRAG. - The authors discuss the validity of the computational complexity of MRAG.
- The experiments are conducted only with GPT-4o. To generalize the discussion about the observed results, additional models such as open language models should be used.
1) Training-free use of multi-head attention activations as aspect-specific embeddings; plug-and-play with any Transformer, no model changes, and same embedding dimensionality as standard RAG (so minimal storage/latency overhead). 2) MRAG matches vanilla RAG’s leading terms while outperforming many recent variants in practicality. 3) Comprehensive evaluation design for multi-aspect queries (three datasets + bespoke metrics) and clear gains in retrieval and downstream generation.
1) The reranker is heuristic (voting with head-importance); effects vs. strong cross-encoder rerankers or dense-sparse hybrids aren’t deeply quantified. 2) Fusion strategies can add variance and computational/token cost, tempering the “free lunch” narrative when stacking with other RAG upgrades.
1. The paper clearly identifies a genuine limitation of current RAG systems in retrieving documents that represent semantically distinct aspects of a complex query. 2. The proposed Multi-Head RAG is conceptually simple, practical to implement, and can be directly integrated into existing RAG pipelines and vector databases without additional training or storage overhead.
1. The core assumption that each attention head captures a distinct semantic aspect is not empirically validated within the experiments of this paper. 2. The work lacks qualitative evidence such as visualization or case analysis to show how different heads retrieve different information.
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Residual Connection · Layer Normalization · BERT · Byte Pair Encoding
