Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval

Youngjoon Jang; Seongtae Hong; Hyeonseok Moon; Heuiseok Lim

arXiv:2604.04734·cs.IR·April 29, 2026

TL;DR

This paper emphasizes the importance of maintaining the teacher score distribution in knowledge distillation for dense retrieval, proposing a stratified sampling method that improves model generalization.

Contribution

It introduces a Stratified Sampling strategy that emulates the teacher score distribution more closely, extending knowledge distillation beyond hard negative mining.
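To make the idea concrete, here is a minimal sketch of one way stratified sampling over teacher scores could work. It is not the authors' implementation: the equal-width binning, the round-robin selection, and the names `stratified_sample`, `n_bins`, and `seed` are all assumptions made for illustration, since this page does not describe the exact procedure.

```python
import random


def stratified_sample(scores, n_samples, n_bins=None, seed=0):
    """Pick n_samples candidate indices spread across the teacher score range.

    scores: teacher scores, one per candidate passage (hypothetical input).
    Returns a sorted list of selected indices.
    """
    if n_samples >= len(scores):
        return list(range(len(scores)))

    rng = random.Random(seed)
    n_bins = n_bins or n_samples
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal scores

    # Assumption: strata are equal-width intervals of the score range.
    bins = [[] for _ in range(n_bins)]
    for idx, s in enumerate(scores):
        b = min(int((s - lo) / width), n_bins - 1)
        bins[b].append(idx)
    for b in bins:
        rng.shuffle(b)

    # Round-robin over the bins so every populated stratum contributes
    # before any stratum contributes a second candidate.
    selected = []
    while len(selected) < n_samples:
        for b in bins:
            if b and len(selected) < n_samples:
                selected.append(b.pop())
    return sorted(selected)


if __name__ == "__main__":
    teacher_scores = [float(i) for i in range(100)]
    print(stratified_sample(teacher_scores, 4, n_bins=4))
```

By contrast, a top-K selection such as `sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]` would draw every candidate from the highest-scoring end of the range, which is the narrowing that the paper argues against.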

Findings

01. Stratified Sampling outperforms top-K and random sampling across in-domain and out-of-domain benchmarks.

02. Preserving the variance of teacher scores improves the student model's generalization.

03. Attending to the teacher score distribution, rather than to hard negatives alone, makes distillation more effective.

Abstract

Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While existing studies have largely focused on mining hard negatives to improve discrimination, the systematic composition of training data and the resulting teacher score distribution have received relatively less attention. In this work, we highlight that focusing solely on hard negatives prevents the student from learning the comprehensive preference structure of the teacher, potentially hampering generalization. To effectively emulate the teacher score distribution, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline,…
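The variance and entropy mentioned in the abstract can be checked on any sampled candidate set with a few lines of code. The sketch below is illustrative only: the truncated abstract does not say how entropy is computed, so taking it over a softmax of the teacher scores is an assumption, as are the function names and the `temperature` parameter.

```python
import math


def score_variance(scores):
    """Population variance of the teacher scores in one candidate set."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def softmax_entropy(scores, temperature=1.0):
    """Shannon entropy (in nats) of softmax(scores / temperature).

    Assumption: entropy is measured on the softmax-normalized teacher
    scores; the paper may define it differently.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    top_k_like = [9.0, 9.1, 9.2, 9.3]    # scores clustered at the hard end
    spread_out = [0.0, 3.0, 6.0, 9.3]    # scores covering the whole range
    for name, s in [("top-K-like", top_k_like), ("spread", spread_out)]:
        print(name, score_variance(s), softmax_entropy(s))
```

Comparing these two statistics for a sampled set against the full candidate pool gives a rough view of how much of the teacher's score distribution a sampling strategy retains.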
