Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer

TL;DR
This paper introduces a diagnostic method that evaluates large language model rerankers independently of retrieval quality by giving every ranker the same fixed evidence pool. Redundancy behavior differs by model, and LLMs underperform on lexical coverage at small selection budgets.
Contribution
The paper presents a controlled diagnostic framework that isolates reranking behavior, enabling direct comparison of models' ranking strategies without retrieval influence.
Findings
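Redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy.
LLMs underperform on lexical coverage at small selection budgets.
LLM rankings diverge substantially from both the BM25 and MMR baselines, pointing to model-specific behavior.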
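The two quantities named in these findings, redundancy and lexical coverage, can be made concrete with a small sketch. The definitions below are assumptions chosen for illustration, not the paper's own metrics: redundancy is taken as the mean pairwise Jaccard similarity between selected documents, and lexical coverage as the fraction of tokens in a hypothetical target text (for example a reference summary) that appear in at least one selected document. The function names `token_set`, `redundancy`, and `lexical_coverage` are likewise hypothetical.

```python
import re
from itertools import combinations


def token_set(text):
    """Lowercase alphanumeric tokens as a set (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def redundancy(selected_docs):
    """Mean pairwise Jaccard similarity among the selected documents.

    Assumed definition for illustration; returns 0.0 for fewer than two docs.
    """
    sets = [token_set(d) for d in selected_docs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total / len(pairs)


def lexical_coverage(selected_docs, target_text):
    """Fraction of target tokens found in at least one selected document.

    Assumed definition for illustration; returns 0.0 for an empty target.
    """
    target = token_set(target_text)
    if not target:
        return 0.0
    covered = set().union(*(token_set(d) for d in selected_docs))
    return len(target & covered) / len(target)
```

Computing both values for the top-k documents of each ranker, at several values of k, would give one way to compare selections across budgets under these assumed definitions.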
Abstract
Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools. We limit each pool to exactly eight documents and pass identical inputs to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another increases redundancy. In contrast, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than…
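As a rough illustration of the MMR reference point mentioned in the abstract, the sketch below runs greedy Maximal Marginal Relevance over a fixed pool of documents. It is a minimal sketch under stated assumptions, not the authors' implementation: similarity is token-set Jaccard, the query string and the trade-off parameter `lam` are hypothetical, and the helper names (`tokenize`, `jaccard`, `mmr_select`) are invented for this example.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a simple stand-in tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(query, pool, budget, lam=0.5):
    """Greedy MMR: return indices of up to `budget` documents from a fixed pool.

    Each step picks the document maximizing
        lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    Jaccard similarity and lam=0.5 are assumptions made for illustration.
    """
    q = set(tokenize(query))
    docs = [set(tokenize(d)) for d in pool]
    selected = []
    remaining = list(range(len(pool)))
    while remaining and len(selected) < budget:

        def score(i):
            relevance = jaccard(q, docs[i])
            overlap = max((jaccard(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * overlap

        best = max(remaining, key=score)  # ties resolve to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lam=1.0` reduces the procedure to pure relevance ordering, while lower values penalize near-duplicate documents more heavily. Because the pool is fixed, any difference between two rankers' selections comes from the ranking policy rather than from retrieval.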
Taxonomy
Topics: Information Retrieval and Search Behavior · Expert finding and Q&A systems · Software Engineering Research
