TL;DR
This study evaluates the efficiency, robustness, and reasoning overhead of LLM-based retrievers on the BRIGHT benchmark, highlighting trade-offs between effectiveness and latency and showing that retriever confidence scores are poorly suited to predicting query success.
Contribution
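It reproduces BRIGHT across 12 tasks and 14 retrievers, extends the evaluation beyond accuracy to indexing cost, query latency, throughput, corpus scaling, and robustness to query perturbations, and quantifies both reasoning overhead and the usefulness of confidence scores (AUROC) for predicting query success.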
Findings
Some reasoning-specialized retrievers achieve high effectiveness with competitive throughput.
Large LLM-based bi-encoders often incur high latency with modest gains.
Confidence scores are unreliable for downstream decision-making.
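The confidence finding rests on AUROC, which asks how well a per-query confidence score separates successful queries from failed ones. The sketch below is a minimal, dependency-free illustration of that metric, not the authors' evaluation code. The choice of confidence signal (for example, the top-1 similarity score) and the definition of "success" are assumptions made here for illustration, and the numbers are toy values rather than results from the paper.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen successful query has a higher
    confidence than a randomly chosen failed one (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): one confidence value per query, plus a binary
# success label under some assumed success criterion.
confidence = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
success = [1, 0, 1, 0, 1, 0]

print(f"AUROC = {auroc(confidence, success):.3f}")
```

An AUROC near 0.5 means the score is close to uninformative about success, while a value near 1.0 means it ranks successful queries above failed ones almost perfectly.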
Abstract
Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and confidence use (AUROC) for predicting query success. We also quantify reasoning overhead by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning…
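The abstract describes reasoning overhead as accuracy gain relative to added latency. One way to read that is as a simple ratio, sketched below. The exact formula, the accuracy metric (nDCG@10 is assumed here), and the names `RunStats` and `reasoning_overhead` are hypothetical, since the truncated abstract does not give the paper's definition. The values are toy numbers, not reported results.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Summary of one retriever on one query variant (hypothetical fields)."""
    ndcg_at_10: float        # assumed accuracy metric
    mean_latency_ms: float   # assumed latency summary


def reasoning_overhead(standard, augmented):
    """Return (accuracy gain, added latency in ms, gain per added ms).

    The ratio is None when the augmented variant adds no latency, because
    a per-millisecond gain is undefined in that case.
    """
    gain = augmented.ndcg_at_10 - standard.ndcg_at_10
    added_ms = augmented.mean_latency_ms - standard.mean_latency_ms
    ratio = gain / added_ms if added_ms > 0 else None
    return gain, added_ms, ratio


# Toy numbers for illustration only.
standard = RunStats(ndcg_at_10=0.20, mean_latency_ms=40.0)
augmented = RunStats(ndcg_at_10=0.26, mean_latency_ms=100.0)

gain, added_ms, ratio = reasoning_overhead(standard, augmented)
print(f"gain={gain:.3f}, added latency={added_ms:.1f} ms, gain per ms={ratio:.5f}")
```

Under this reading, a retriever whose augmented queries yield a large gain for little added latency has a high ratio, while one that pays substantial latency for a modest gain has a low one.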
