Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation
Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Jimmy Lin

TL;DR
This paper fine-tunes dense retrieval models for specialized tasks using listwise distillation from a cross-encoder teacher and LLM-generated synthetic queries, addressing cases where standard InfoNCE fine-tuning degrades effectiveness instead of improving it.
Contribution
It proposes a novel training strategy combining listwise distillation and synthetic query augmentation to enhance domain-specific dense retrieval performance.
Findings
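Listwise distillation from a cross-encoder teacher improves retrieval effectiveness, whereas standard InfoNCE fine-tuning can degrade it.
Synthetic queries can match human-written queries in training utility.
The effectiveness of the cross-encoder teacher can be a bottleneck.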
Abstract
While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise…
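To make the contrast in the abstract concrete, the sketch below compares the two training signals for a single query and its candidate passages. It is an illustration under stated assumptions, not the authors' implementation: the excerpt does not give the exact form of the listwise objective, so KL divergence between the teacher's and the student's softmax distributions is assumed here, and the function names and temperature parameters are hypothetical.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of relevance scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def info_nce_loss(student_scores, positive_index=0, temperature=1.0):
    """InfoNCE: cross-entropy against a single labeled positive.

    Every other candidate is treated as equally irrelevant.
    """
    return -log_softmax(student_scores, temperature)[positive_index]


def listwise_distillation_loss(teacher_scores, student_scores,
                               teacher_temp=1.0, student_temp=1.0):
    """Assumed listwise objective: KL(teacher || student) over the candidates.

    The teacher (e.g. a cross-encoder) supplies graded relevance for every
    candidate, so the student is trained on the whole ranking rather than
    on one hard label.
    """
    log_p = log_softmax(teacher_scores, teacher_temp)
    log_q = log_softmax(student_scores, student_temp)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical scores for one query and four candidate passages.
teacher = [4.0, 3.5, 0.5, -1.0]   # cross-encoder scores (illustrative values)
student = [2.0, 0.1, 0.3, -0.5]   # dense retriever similarities (illustrative values)

print(info_nce_loss(student, positive_index=0))
print(listwise_distillation_loss(teacher, student))
```

In this toy example the teacher rates the second passage almost as highly as the first. InfoNCE would push it down as a negative, while the distillation loss rewards the student for also scoring it highly.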
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · InfoNCE
