Language Model Re-rankers are Fooled by Lexical Similarities
Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge

TL;DR
This paper critically evaluates language model re-rankers in retrieval-augmented generation, revealing their vulnerabilities to lexical similarities and highlighting the need for more robust evaluation datasets.
Contribution
It provides a comprehensive analysis of LM re-rankers' performance across datasets, introduces a novel separation metric, and identifies key weaknesses and potential improvements.
Findings
LM re-rankers often fail to outperform BM25 on the DRUID dataset.
A new separation metric based on BM25 scores helps explain re-ranker errors.
Certain improvement methods are effective mainly on the NQ dataset.
Abstract
Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more…
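The abstract does not spell out how the separation metric is computed, so the following is only a minimal sketch of one plausible reading: the BM25 score of the gold passage minus the highest BM25 score among the other candidates. A negative value would then mean that a distractor is lexically closer to the query than the gold passage. The function names (`bm25_scores`, `gold_separation`), the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen for brevity.
    return text.lower().split()


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with Okapi BM25.

    Document frequencies and the average length are computed over
    the candidate passages themselves (an assumption of this sketch).
    """
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # The "+1" inside the log keeps the IDF strictly positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def gold_separation(query, passages, gold_index):
    """Hypothetical separation: gold BM25 score minus the best other score."""
    scores = bm25_scores(query, passages)
    best_other = max(s for i, s in enumerate(scores) if i != gold_index)
    return scores[gold_index] - best_other


if __name__ == "__main__":
    query = "capital of france"
    passages = [
        "paris is the seat of the french government",            # gold, little overlap
        "the capital of france question is a common quiz item",  # lexical distractor
    ]
    print(gold_separation(query, passages, gold_index=0))
```

Under this reading, queries with a strongly negative separation are the ones where a re-ranker that leans on surface overlap would be expected to prefer the distractor.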
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
