DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers

Xueguang Ma; Xi Victoria Lin; Barlas Oguz; Jimmy Lin; Wen-tau Yih; Xilun Chen

arXiv:2502.18460·cs.CL·June 4, 2025

Open Access · 1 Repo · 5 Models

TL;DR

DRAMA is a training framework that uses large language models to train smaller dense retrievers, improving their generalization and their multilingual and long-context capabilities while keeping them efficient.

Contribution

The paper introduces a single-stage contrastive learning setup that uses pruned LLMs as the backbone and diverse LLM-augmented data to train smaller, more effective dense retrievers.
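
For readers unfamiliar with contrastive training of dense retrievers, the following is a minimal sketch of a generic InfoNCE loss with in-batch negatives. It is not taken from the paper: the function name info_nce_loss, the temperature value, and the use of dot-product similarity are assumptions made for illustration, and the paper's actual loss, negatives, and hyperparameters are not stated on this page.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (illustrative, not the paper's code).

    query_vecs[i] is paired with passage_vecs[i] as its positive; every other
    passage in the batch is treated as a negative for that query.
    The temperature default is an assumed value, not one reported by the paper.
    """
    if len(query_vecs) != len(passage_vecs):
        raise ValueError("each query needs exactly one positive passage")

    total = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every passage in the batch.
        logits = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive passage.
        total += log_denom - logits[i]
    return total / len(query_vecs)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    passages = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce_loss(queries, passages, temperature=1.0))
```

The loss is small when each query embedding is closest to its own passage and grows when the pairing is wrong, which is the signal a retriever encoder is trained on in this kind of setup.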

Findings

1. DRAMA outperforms traditional encoder-based retrievers on multilingual tasks.

2. It improves long-context capability in dense retrieval.

3. The framework achieves strong results across multiple languages and tasks.

Abstract

Large language models (LLMs) have demonstrated strong effectiveness and robustness when fine-tuned as dense retrievers. However, their large parameter size brings significant inference-time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across…

Code & Models

Repositories

facebookresearch/dpr-scale
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Contrastive Learning · ADaptive gradient method with the OPTimal convergence rate