Disentangling Locality and Entropy in Ranking Distillation

Andrew Parry; Debasis Ganguly; Sean MacAvaney

arXiv:2505.21058·cs.IR·May 28, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates the effects of sampling and data augmentation strategies in ranking distillation, revealing their impacts on model effectiveness and proposing more efficient training approaches.

Contribution

It provides a theoretical and empirical analysis of how sampling and entropy influence ranking distillation, offering insights to improve training efficiency and effectiveness.

Findings

01

Sampling strategies can be spurious or harmful in distillation.

02

Data augmentation affects model bias and intrinsic behavior.

03

Understanding training dynamics can lead to more efficient ranking models.

Abstract

The training process of ranking models involves two key data selection decisions: a sampling strategy and a labeling strategy. Modern ranking systems, especially those for performing semantic search, typically use a "hard negative" sampling strategy to identify challenging items using heuristics and a distillation labeling strategy to transfer ranking "knowledge" from a more capable model. In practice, these approaches have grown increasingly expensive and complex; for instance, popular pretrained rankers from SentenceTransformers involve 12 models in an ensemble, with data provenance hampering reproducibility. Despite their complexity, modern sampling and labeling strategies have not been fully ablated, leaving the underlying source of effectiveness gains unclear. Thus, to better understand why models improve and potentially reduce the expense of training effective models, we conduct…
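To make the two decisions named in the abstract concrete, the sketch below shows one common way they are realized: a hard-negative sampler that takes the top-ranked non-relevant documents from a first-stage ranker, and a distillation loss that matches a student's score distribution to a teacher's via KL divergence. It is an illustrative assumption, not the paper's implementation; the function names (`hard_negatives`, `kl_divergence`, and so on), the toy scores, and the choice of KL divergence as the loss are all hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw ranking scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a teacher's score distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): one possible distillation loss (assumed here)."""
    return sum(
        t * math.log(t / max(s, eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

def hard_negatives(first_stage_scores, positive_id, k):
    """Sampling strategy: the k highest-scored documents that are not the positive."""
    ranked = sorted(first_stage_scores, key=first_stage_scores.get, reverse=True)
    return [doc_id for doc_id in ranked if doc_id != positive_id][:k]

# Toy example with made-up scores for one query.
first_stage = {"d1": 12.0, "d2": 9.5, "d3": 9.1, "d4": 2.0}
negatives = hard_negatives(first_stage, positive_id="d1", k=2)

teacher = softmax([4.0, 1.0, 0.5])   # teacher scores for [d1] + negatives
student = softmax([2.0, 1.5, 1.0])   # student scores for the same documents
loss = kl_divergence(teacher, student)
teacher_entropy = entropy(teacher)
```

In this framing, the sampler controls which documents the student sees (how "local" they are to the query), while the teacher's entropy describes how peaked or flat its labels are over those documents.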

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. An important, although well-studied problem.
2. A novel (though potentially vacuous) bound that connects the diameter of the query and the teacher’s entropy to the generalization gap.
3. Paper has both in-domain and out-of-domain experiments.

Weaknesses

1. The paper is quite confusing: it uses inconsistent terminology (while not defining some crucial parts). It also makes quite a few technical claims (although tangential to its main results) that seem to be incorrect. The math is confusing: formulas appear to have errors (e.g., the risk minimization definition) and unexplained symbols (e.g., eta). Data augmentation is mentioned but never explained properly. More detailed comments are below.
2. The proposed bound appears to be vacuous because

Reviewer 02 · Rating 6 · Confidence 2

Strengths

Hard-negative mining and ranking distillation are widely used; the paper addresses a real gap in IR training methodology.

Weaknesses

1. Theoretical reasoning is largely descriptive, not prescriptive. The generalization bound identifies factors but does not provide actionable guidance for selecting sampling policies or entropy levels in practice. The theoretical exposition is mathematically sound but overly symbol-heavy.
2. While the study provides clear insight into bi-encoder retrievers trained via distillation, it remains unclear whether the identified effects of locality and teacher entropy extend to LLM-based retrieval se

Reviewer 03 · Rating 6 · Confidence 2

Strengths

* Intriguing study that questions the standard recipe (multi-stage hard negative mining) for neural ranking and seeks a deeper, more principled understanding of its training dynamics.
* Presents a theoretical framework that rigorously examines generalization bounds through the lenses of locality and entropy.
* Offers evidence that once locality and entropy are properly managed, complex multi-stage training pipelines add limited value.

Weaknesses

* The experimental configuration is somewhat narrow, particularly in its treatment of negative sampling. The paper relies on simplified strategies rather than the more sophisticated dual-encoder mining pipelines commonly used in practice.
* The evaluation is restricted to a limited set of ranking benchmarks, excluding broader or multitask-oriented suites such as MTEB that better capture generalization across domains. As a result, the findings may not fully extend to modern multitask embedding tr

Taxonomy

Topics: Multi-Criteria Decision Making