TL;DR
This paper systematically evaluates the robustness of LLM-based dense retrievers along two axes, generalizability across datasets and stability under perturbed and adversarial inputs, and draws out insights for future retriever design.
Contribution
It presents what the authors describe as the first systematic study of LLM-based retriever robustness, covering four benchmarks spanning 30 datasets and several attack types, and highlights the factors that influence stability and generalizability.
Findings
Instruction-tuned models generally excel, but those optimized for complex reasoning often pay a "specialization tax" and generalize less broadly.
LLM retrievers are more robust to typos and poisoning than encoder-only models.
Larger models tend to be more robust and stable.
Abstract
Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a "specialization tax," exhibiting limited generalizability…
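Illustration (not from the paper)
The abstract's point about separating model capability from dataset heterogeneity can be made concrete with a small sketch. The code below is a deliberately simplified fixed-effects analogue, an additive model (score = model effect + dataset offset) fitted by backfitting, and not the linear mixed-effects model the authors use. The model names, dataset names, and scores are made up for illustration. It shows how a raw average can hide a gap between two retrievers when one of them was not evaluated on a hard dataset, while the adjusted marginal mean recovers it.

```python
def raw_means(scores):
    """Plain per-model average over whatever datasets each model was run on."""
    out = {}
    for model in {m for m, _ in scores}:
        vals = [y for (m, _), y in scores.items() if m == model]
        out[model] = sum(vals) / len(vals)
    return out


def adjusted_marginal_means(scores, n_iter=200):
    """Fit score ~ model effect + dataset offset by backfitting.

    scores maps (model, dataset) -> score; cells may be missing.
    Returns, per model, the model effect plus the average dataset offset,
    i.e. the expected score averaged over ALL datasets.
    """
    models = sorted({m for m, _ in scores})
    datasets = sorted({d for _, d in scores})
    model_eff = {m: 0.0 for m in models}
    data_eff = {d: 0.0 for d in datasets}

    for _ in range(n_iter):
        # Model effect: mean score after removing the current dataset offsets.
        for m in models:
            res = [y - data_eff[d] for (mm, d), y in scores.items() if mm == m]
            model_eff[m] = sum(res) / len(res)
        # Dataset offset: mean score after removing the current model effects.
        for d in datasets:
            res = [y - model_eff[m] for (m, dd), y in scores.items() if dd == d]
            data_eff[d] = sum(res) / len(res)

    avg_offset = sum(data_eff.values()) / len(datasets)
    return {m: model_eff[m] + avg_offset for m in models}


# Hypothetical scores: retriever B was never evaluated on the "hard" dataset.
scores = {
    ("A", "easy"): 0.70, ("A", "medium"): 0.50, ("A", "hard"): 0.30,
    ("B", "easy"): 0.60, ("B", "medium"): 0.40,
}

raw = raw_means(scores)
adjusted = adjusted_marginal_means(scores)
for name in sorted(raw):
    print(name, round(raw[name], 3), round(adjusted[name], 3))
```

In this toy setup the raw averages of A and B coincide, because B skipped the hardest dataset, whereas the adjusted means separate them. A mixed-effects model goes further by treating dataset offsets as random draws and shrinking them, which this sketch does not attempt.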
