Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
Zhiyuan Peng, Ting-ruen Wei, Tingyu Song, Yilun Zhao

TL;DR
This paper introduces two FLOPs-based metrics, RPP (ranking metrics per PetaFLOP) and QPP (queries per PetaFLOP), for evaluating LLM-based rerankers, giving a hardware-independent way to measure the efficiency-effectiveness tradeoff in information retrieval.
Contribution
The paper proposes new FLOPs-based metrics and a FLOPs estimator for evaluating the efficiency-effectiveness tradeoff of LLM-based rerankers independently of hardware and implementation details.
Findings
RPP and QPP offer consistent efficiency measurements across different hardware.
The FLOPs estimator accurately predicts model FLOPs without running experiments.
Comprehensive experiments reveal insights into the efficiency-effectiveness tradeoff.
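To illustrate what estimating FLOPs without running a model can look like, here is a minimal sketch based on the widely used rule of thumb of roughly 2 FLOPs per parameter per processed token for a dense decoder-only transformer. This is not the paper's estimator: the function name `approx_forward_flops` and the example numbers are hypothetical, and the rule ignores the attention term that grows with sequence length.

```python
def approx_forward_flops(num_params: float, num_tokens: int) -> float:
    """Rough forward-pass FLOPs for a dense transformer.

    Assumption (not from the paper): about 2 FLOPs per parameter per token,
    ignoring the sequence-length-dependent attention cost.
    """
    if num_params <= 0:
        raise ValueError("num_params must be positive")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return 2.0 * num_params * num_tokens


# Hypothetical workload: a 7e9-parameter reranker scoring
# 100 candidate passages of 256 tokens each for one query.
flops_per_query = approx_forward_flops(7e9, 100 * 256)
print(f"approx. FLOPs per query: {flops_per_query:.3e}")
```

Because the estimate depends only on model size and token counts, it gives the same number on any hardware, which is the property the findings above rely on.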
Abstract
Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. These metrics depend on hardware and run-time choices (e.g., parallel or not, batch size, etc.) and often fail to account for model size, making them difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose Efficiency-Effectiveness Reranking FLOPs (https://github.com/zhiyuanpeng/EER-FLOPs) for LLM-based rerankers: RPP (ranking metrics per PetaFLOP), measuring how much ranking quality (e.g., NDCG or MRR) a method achieves per PetaFLOP, and QPP (queries per PetaFLOP),…
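As a rough illustration of the two metrics named in the abstract, the sketch below computes RPP and QPP from a ranking score and a per-query FLOPs count. The abstract is truncated here, so the exact normalization is an assumption: both functions divide by the average FLOPs per query expressed in PetaFLOPs (1e15 FLOPs). The function names and example numbers are hypothetical.

```python
PETA = 1e15  # FLOPs in one PetaFLOP


def rpp(ranking_metric: float, flops_per_query: float) -> float:
    """Ranking metric (e.g., NDCG or MRR) achieved per PetaFLOP.

    Assumption: normalized by the average FLOPs spent per query.
    """
    if flops_per_query <= 0:
        raise ValueError("flops_per_query must be positive")
    return ranking_metric / (flops_per_query / PETA)


def qpp(flops_per_query: float) -> float:
    """Number of queries that can be processed with one PetaFLOP."""
    if flops_per_query <= 0:
        raise ValueError("flops_per_query must be positive")
    return PETA / flops_per_query


# Hypothetical reranker: NDCG@10 of 0.70 at 2e14 FLOPs per query.
print(f"RPP = {rpp(0.70, 2e14):.2f}, QPP = {qpp(2e14):.2f}")
```

Neither quantity involves wall-clock time or batch size, so two methods can be compared on the same footing even if they were run on different hardware.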
Taxonomy
Topics: Information Retrieval and Search Behavior · Natural Language Processing Techniques · Topic Modeling
