Impact of Shallow vs. Deep Relevance Judgments on BERT-based Reranking Models

Gabriel Iturra-Bocaz; Danny Vo; Petra Galuscakova

arXiv:2506.23191·cs.IR·July 1, 2025

TL;DR

This study compares how shallow and deep relevance judgments affect BERT-based reranking models in neural IR. On MS MARCO and LongEval, models trained on shallow-judged data generally generalize better, while adding more negative training examples may offset the disadvantage of deep-judged data.

Contribution

The paper provides an empirical comparison of shallow versus deep relevance judgments as training data and measures their effect on BERT-based reranking performance in neural IR.

Findings

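01 Shallow-judged datasets (many queries, few judgments each) generally improve the generalization and effectiveness of reranking models.

02 The disadvantage of deep-judged datasets (fewer queries, many judgments each) might be mitigated by adding more negative training examples.

03 Results are based on the MS MARCO and LongEval collections.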

Abstract

This paper investigates the impact of shallow versus deep relevance judgments on the performance of BERT-based reranking models in neural Information Retrieval. Shallow-judged datasets, characterized by numerous queries each with few relevance judgments, and deep-judged datasets, involving fewer queries with extensive relevance judgments, are compared. The research assesses how these datasets affect the performance of BERT-based reranking models trained on them. The experiments are run on the MS MARCO and LongEval collections. Results indicate that shallow-judged datasets generally enhance generalization and effectiveness of reranking models due to a broader range of available contexts. The disadvantage of the deep-judged datasets might be mitigated by a larger number of negative training examples.
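
To make the shallow versus deep distinction concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes relevance judgments are available as `(query_id, doc_id, label)` tuples and that documents without a judgment for a query can be treated as negatives. The function names `sample_judgments` and `with_negatives`, the toy data, and the budget of 20 judgments are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def sample_judgments(qrels, budget, per_query, seed=0):
    """Select up to `budget` judgments, taking at most `per_query` from each query.

    A small `per_query` yields a shallow-judged subset (many queries),
    a large `per_query` yields a deep-judged subset (few queries).
    """
    rng = random.Random(seed)
    by_query = defaultdict(list)
    for qid, doc_id, label in qrels:
        by_query[qid].append((doc_id, label))

    qids = sorted(by_query)
    rng.shuffle(qids)  # visit queries in a random but reproducible order

    subset = []
    for qid in qids:
        remaining = budget - len(subset)
        if remaining <= 0:
            break
        for doc_id, label in by_query[qid][:min(per_query, remaining)]:
            subset.append((qid, doc_id, label))
    return subset


def with_negatives(subset, corpus_ids, negatives_per_positive, seed=0):
    """Add sampled negatives for every positive judgment.

    Assumption: any document not judged for a query counts as a negative.
    """
    rng = random.Random(seed)
    judged = defaultdict(set)
    for qid, doc_id, _ in subset:
        judged[qid].add(doc_id)

    examples = []
    for qid, doc_id, label in subset:
        examples.append((qid, doc_id, label))
        if label > 0:
            pool = [d for d in corpus_ids if d not in judged[qid]]
            k = min(negatives_per_positive, len(pool))
            examples.extend((qid, neg, 0) for neg in rng.sample(pool, k))
    return examples


# Toy data (hypothetical): 20 queries, each with 10 relevant documents.
qrels = [(f"q{q}", f"d{q}_{i}", 1) for q in range(20) for i in range(10)]
corpus_ids = [doc_id for _, doc_id, _ in qrels]

# Same judgment budget, different depth.
shallow = sample_judgments(qrels, budget=20, per_query=1)
deep = sample_judgments(qrels, budget=20, per_query=10)

# More negatives per positive for the deep-judged subset.
deep_more_negatives = with_negatives(deep, corpus_ids, negatives_per_positive=3)
```

With the same budget of 20 judgments, the shallow subset covers 20 distinct queries and the deep subset only 2. That difference in query coverage is the "broader range of available contexts" the abstract refers to, and `negatives_per_positive` is the knob that the last sentence of the abstract suggests may help the deep-judged setting.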
