Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation

Manveer Singh Tamber; Suleman Kazi; Vivek Sourabh; Jimmy Lin

arXiv:2502.19712 · cs.IR · February 28, 2025

TL;DR

This paper fine-tunes dense retrieval models for specialized tasks using listwise distillation from a cross-encoder teacher together with LLM-generated synthetic queries, addressing cases where standard InfoNCE fine-tuning degrades effectiveness instead of improving it.

Contribution

It proposes a novel training strategy combining listwise distillation and synthetic query augmentation to enhance domain-specific dense retrieval performance.

Findings

01. Listwise distillation improves retrieval effectiveness.

02. Synthetic queries can match human-written queries in training utility.

03. Limitations include cross-encoder teacher bottlenecks.

Abstract

While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise…
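The two training signals contrasted in the abstract can be illustrated with a small sketch. It assumes the listwise distillation objective is a KL divergence between softmax distributions over the same candidate list, which is one common formulation; the abstract does not state the exact loss. The function names, the temperature parameters, and the example scores are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def info_nce_loss(scores, positive_index=0, temperature=1.0):
    """InfoNCE for one query: negative log-probability of the positive
    passage among all candidates (the positive plus its negatives)."""
    return -log_softmax(scores, temperature)[positive_index]


def listwise_kl_loss(teacher_scores, student_scores,
                     teacher_temp=1.0, student_temp=1.0):
    """Assumed listwise distillation loss: KL(teacher || student) over the
    softmax distributions of one query's candidate list."""
    log_p = log_softmax(teacher_scores, teacher_temp)  # cross-encoder teacher
    log_q = log_softmax(student_scores, student_temp)  # dense retriever student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    # Hypothetical scores for one query and four candidate passages.
    student = [0.62, 0.58, 0.31, 0.10]   # e.g. query-passage similarities
    teacher = [4.1, 3.9, -1.2, -3.0]     # e.g. cross-encoder relevance logits
    print("InfoNCE:", info_nce_loss(student, positive_index=0, temperature=0.05))
    print("Listwise KL:", listwise_kl_loss(teacher, student, student_temp=0.05))
```

InfoNCE treats a single passage as the only correct answer, whereas the listwise loss lets the teacher assign graded relevance to every candidate. In the hypothetical scores above, the second passage is nearly as relevant as the first according to the teacher, and the KL term rewards the student for ranking it highly instead of pushing it away as a negative.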

Code & Models

Repositories

manveertamber/enhancing_domain_adaptation (PyTorch, official)

Models

🤗 kasys/jmed-me5-v0.1 (model)

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · InfoNCE