Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl

TL;DR
Revela introduces a scalable self-supervised training framework for dense retrievers that leverages language modeling techniques to learn semantic dependencies without requiring annotated data, outperforming supervised models on multiple benchmarks.
Contribution
The paper presents Revela, a novel method that adapts language modeling objectives for self-supervised dense retriever training, reducing data and compute needs while improving performance.
Findings
Revela surpasses larger supervised models on CoIR and BRIGHT benchmarks.
Revela achieves state-of-the-art unsupervised performance on BEIR with significantly less data.
Performance improves with larger batch sizes and model sizes, demonstrating scalability.
Abstract
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised…
Peer Reviews
Decision: ICLR 2026 Oral
1. The approach of using the NTP paradigm to "distill" similarity signals into a dense retriever, along with the in-batch attention and the weighting by similarity scores, are interesting and novel ideas.
2. Improved scalability and calibration compared to the quadratic baseline (pairwise distillation).
3. Great experimental design, including the selection of training data and baselines.
4. Thorough ablation study showcasing generalization and the effects of batch size, LLM base performance, and mixing training corpora.
1. Training data creation methodology and statistics are underspecified. It would be helpful to understand how the filtering was done and what the "handcrafted rules" (L249) are. Similarly, when constructing a batch (following the example in Appendix B.2), how were the topics chosen?
- Revela proposes a novel self-supervised framework that jointly trains a dense retriever and a language model by embedding retriever similarity scores as in-batch attention weights within transformer blocks. This allows the model to learn retrieval ability directly from next-token prediction without any labeled query-document pairs, achieving strong performance on multiple retrieval benchmarks with far less data and computation.
- The use of retriever similarity as in-batch attention weights se
- The motivation for using next-token prediction is unclear. The authors need to provide a detailed explanation of why this training objective can enhance the retriever’s capability.
- The In-batch Attention section is somewhat confusing. It states that In-batch Attention consists of two parts (Standard Self-Attention and In-batch Attention), but within In-batch Attention itself, there is another self-attention output s. I suggest the authors restructure this description for greater clarity.
-
**Unified objective without labels:** The method removes the need for query–document pairs by turning retrieval into language modeling. I like the idea of using NTP to train the retriever, as it can provide more fine-grained supervision than InfoNCE. A related work, “REPLUG”, also uses supervision from language modeling to train the retriever. But the approach from this paper: adding a simple in-batch attention path to a standard decoder-only stack and learning both the retriever an
**Applicability:** Adoption may be harder than plug-and-play methods like REPLUG. Revela adds an extra in-batch attention path inside decoder blocks and uses retriever similarities as attention weights, which means touching the LM internals and maintaining custom masking. Even if the code changes are “minimal,” such modifications increase maintenance and may impact speed or memory in non-obvious ways.
**Batch-composition dependence:** The method conditions next-token prediction on *other docu
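The mechanism the reviews refer to, retriever similarities acting as in-batch attention weights, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cosine`, `in_batch_weights`, `mix_cross_document`), the use of cosine similarity, the temperature parameter, and the choice to exclude a document from its own weights are all assumptions made for illustration. It only shows how each document's representation could be mixed with those of the other documents in the batch, weighted by a softmax over retriever similarity scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_weights(embeddings, temperature=1.0):
    """Softmax over retriever similarities to the *other* documents in the batch.

    Assumption: a document gets zero weight on itself. Needs a batch of >= 2.
    """
    n = len(embeddings)
    weights = []
    for i in range(n):
        scores = [cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        row.insert(i, 0.0)  # re-insert the excluded self position
        weights.append(row)
    return weights

def mix_cross_document(weights, doc_outputs):
    """Weighted sum of the other documents' outputs for each document."""
    n, dim = len(doc_outputs), len(doc_outputs[0])
    return [[sum(weights[i][j] * doc_outputs[j][k] for j in range(n))
             for k in range(dim)]
            for i in range(n)]

# Hypothetical batch: documents 0 and 1 are similar, document 2 is unrelated.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = in_batch_weights(embeddings)
mixed = mix_cross_document(weights, [[1.0], [2.0], [3.0]])
```

In this toy batch, document 0 places more weight on document 1 than on document 2, so its mixed output leans toward document 1. In a trainable version of this idea, a language-modeling loss computed on the mixed outputs would send gradients back through the weights into the retriever, which is the dependence on batch composition that the last review raises.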
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications
