Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools

Baris Arat; Emre Sefer

arXiv:2602.18613·cs.LG·February 24, 2026

Open Access

TL;DR

This paper introduces a diagnostic that evaluates large language model rerankers independently of retrieval quality by giving every ranker the same fixed evidence pool. It finds that redundancy patterns differ by model and that LLMs fall short on lexical coverage at small selection budgets.

Contribution

The paper presents a controlled diagnostic framework that isolates reranking behavior, enabling direct comparison of models' ranking strategies without retrieval influence.

Findings

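01. Redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy.

02. LLMs underperform on lexical coverage at small selection budgets.

03. LLM rankings diverge substantially from both the BM25 and MMR baselines, pointing to model-specific ranking behavior.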

Abstract

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. We limit each pool to exactly eight documents and pass identical inputs to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy. In contrast, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than…
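To make the MMR reference point concrete, here is a minimal sketch of greedy Maximal Marginal Relevance selection over a fixed pool. It is not the authors' implementation: the bag-of-words cosine similarity, the `query` string, the trade-off weight `lam`, and the names `bow`, `cosine`, and `mmr_select` are all assumptions made for illustration, since the listing does not specify how the paper scores relevance or redundancy.

```python
import math
from collections import Counter


def bow(text):
    """Term-count vector from lowercased whitespace tokens (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, pool, k, lam=0.5):
    """Greedy MMR: return indices of up to k documents chosen from a fixed pool.

    lam=1.0 ranks purely by similarity to the query; lower values
    penalize similarity to documents that were already selected.
    """
    q = bow(query)
    vecs = [bow(doc) for doc in pool]
    selected = []
    remaining = list(range(len(pool)))

    def score(i):
        relevance = cosine(q, vecs[i])
        # Redundancy is the highest similarity to any already-selected document.
        redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)  # ties go to the earliest index
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy pool: documents 0 and 1 are exact duplicates.
pool = [
    "flood hits city",
    "flood hits city",
    "mayor responds to flood",
    "sports results today",
]
diverse = mmr_select("flood city", pool, k=2, lam=0.5)
relevance_only = mmr_select("flood city", pool, k=2, lam=1.0)
```

In this toy pool, the pure-relevance setting picks both duplicates, while the balanced setting skips the second copy in favor of a different document. The selection budget `k` plays the role of the budget at which redundancy and coverage would be compared across rankers.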

Taxonomy

Topics: Information Retrieval and Search Behavior · Expert finding and Q&A systems · Software Engineering Research