How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models

Abdelrahman Abdallah; Bhawna Piryani; Jamshid Mozafari; Mohammed Ali; Adam Jatowt

arXiv:2508.16757·cs.CL·August 26, 2025

TL;DR

This paper systematically evaluates state-of-the-art reranking models, including LLM-based and lightweight approaches, across multiple benchmarks to understand their performance on familiar and novel queries in information retrieval.

Contribution

The paper provides an empirical comparison of 22 reranking methods, analyzing factors such as training data overlap, model architecture, and efficiency to assess how well the methods generalize.

Findings

01. LLM-based rerankers outperform on familiar queries

02. Lightweight models are more efficient and comparable on some tasks

03. Query novelty significantly affects reranking effectiveness

Abstract

In this work, we present a systematic empirical evaluation of state-of-the-art reranking methods for information retrieval, covering large language model (LLM)-based, lightweight contextual, and zero-shot approaches. We evaluate 22 methods in total, spanning 40 variants (depending on the LLM used), across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance gap exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to explain the causes of any observed differences. To disentangle confounding factors, we analyze the effects of training data overlap, model architecture,…

Code & Models

Datasets

abdoelsayed/FutureQueryEval
dataset· 79 dl
