TL;DR
This paper systematically compares sparse and dense retrieval methods in decoder-only LLMs, revealing that sparse retrieval scales better and achieves state-of-the-art results when combined with contrastive and knowledge distillation training.
Contribution
It provides the first comprehensive analysis of how different retrieval paradigms and training objectives scale in decoder-only LLMs, highlighting the superiority of sparse retrieval at larger scales.
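To make the two paradigms concrete, here is a minimal sketch of how relevance is typically scored in each. It is illustrative only and is not taken from the paper: the function names, the toy vectors, and the term weights are all assumptions. Dense retrieval compares fixed-size embeddings, while sparse retrieval compares vocabulary-sized term-weight maps in which most entries are zero.

```python
from typing import Dict, List


def dense_score(query_vec: List[float], doc_vec: List[float]) -> float:
    """Dot product between two fixed-size dense embeddings."""
    if len(query_vec) != len(doc_vec):
        raise ValueError("dense vectors must share a dimension")
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def sparse_score(query_weights: Dict[str, float],
                 doc_weights: Dict[str, float]) -> float:
    """Dot product over vocabulary terms.

    Only terms with a nonzero weight in both the query and the document
    contribute, which is what makes inverted-index lookup possible.
    """
    # Iterate over the smaller map to keep the loop short.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights[t]
               for t, w in query_weights.items() if t in doc_weights)


# Toy inputs (hypothetical values, chosen only for illustration).
dense_example = dense_score([0.5, -1.0, 2.0], [1.0, 0.5, 0.25])
sparse_example = sparse_score(
    {"llama": 1.5, "scaling": 0.5},
    {"scaling": 2.0, "retrieval": 1.0, "llama": 1.0},
)
```

In both cases the score is a dot product; the difference lies in the representation the model is trained to produce.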
Findings
Sparse retrieval outperforms dense retrieval across benchmarks.
Scaling benefits are significant only with contrastive learning.
Combining contrastive loss (CL) and knowledge distillation (KD) at the 8B scale yields state-of-the-art results.
Abstract
Scaling large language models (LLMs) has shown great potential for improving retrieval model performance; however, previous studies have mainly focused on dense retrieval trained with contrastive loss (CL), neglecting the scaling behavior of other retrieval paradigms and optimization techniques, such as sparse retrieval and knowledge distillation (KD). In this work, we conduct a systematic comparative study on how different retrieval paradigms (sparse vs. dense) and fine-tuning objectives (CL vs. KD vs. their combination) affect retrieval performance across different model scales. Using MSMARCO passages as the training dataset, decoder-only LLMs (Llama-3 series: 1B, 3B, 8B), and a fixed compute budget, we evaluate various training configurations on both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks. Our key findings reveal that: (1) Scaling behaviors emerge clearly…
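The two fine-tuning objectives named in the abstract can be sketched as follows. This is a generic formulation under stated assumptions, not the authors' implementation: the softmax temperature, the KL direction (teacher to student), the `kd_weight` mixing factor, and all function names are assumptions introduced here for illustration.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over candidate scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(scores: List[float], positive_index: int = 0,
                     temperature: float = 1.0) -> float:
    """InfoNCE-style loss: negative log-probability of the positive passage."""
    probs = softmax(scores, temperature)
    return -math.log(probs[positive_index])


def distillation_loss(student_scores: List[float],
                      teacher_scores: List[float],
                      temperature: float = 1.0) -> float:
    """KL(teacher || student) over the same candidate passages."""
    teacher = softmax(teacher_scores, temperature)
    student = softmax(student_scores, temperature)
    return sum(t * math.log(t / s)
               for t, s in zip(teacher, student) if t > 0)


def combined_loss(student_scores: List[float], teacher_scores: List[float],
                  positive_index: int = 0, kd_weight: float = 1.0) -> float:
    """Weighted sum of the two objectives (the weighting is an assumption)."""
    return (contrastive_loss(student_scores, positive_index)
            + kd_weight * distillation_loss(student_scores, teacher_scores))
```

Here `scores` holds the retriever's relevance scores for one query against one positive and several negative passages, and `teacher_scores` would come from a stronger model such as a reranker scoring the same candidates.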
Taxonomy
Methods: Knowledge Distillation
