Reproducing Complex Set-Compositional Information Retrieval
Vincent Degenhart, Dewi Timman, Arjen P. de Vries, Faegheh Hasibi, Mohanna Hoveyda

TL;DR
This study benchmarks retrieval methods on complex set-based queries, revealing that neural methods excel on standard datasets but struggle with controlled, attribute-based relevance, especially as query complexity increases.
Contribution
The paper introduces LIMIT+, a new benchmark for evaluating retrieval methods on attribute-based constraint-satisfaction tasks, and provides a comprehensive reproducibility study of existing approaches.
Findings
Neural retrievers outperform BM25 on QUEST but not on LIMIT+.
Performance degrades with increased compositional complexity across all methods.
Classic lexical methods maintain high performance on LIMIT+ while neural methods collapse.
Abstract
Complex information needs may involve set-compositional queries using conjunction, disjunction, and exclusion, yet it remains unclear whether current retrieval paradigms genuinely satisfy such constraints or exploit "semantic shortcuts". We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and constraint satisfaction, and less on pretrained knowledge. Our findings show that (i) on QUEST, the best neural retrievers achieve an effectiveness that is more than double what can be achieved with BM25 (Recall@100 0.41 vs. 0.20), but reasoning-targeted methods like ReasonIR and Search-R1 do not outperform general-purpose retrievers uniformly; (ii) on LIMIT+, gains fail to transfer, where the strongest QUEST…
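To make the terms in the abstract concrete, here is a minimal sketch of how conjunction, disjunction, and exclusion over attribute predicates define a relevant set, and how Recall@k scores a ranking against it. The toy corpus, the attribute names, and the helpers `having` and `recall_at_k` are hypothetical illustrations, not the paper's data or evaluation code.

```python
from typing import Dict, List, Set

# Hypothetical toy corpus (an assumption for illustration):
# document id -> set of attributes the document satisfies.
DOCS: Dict[str, Set[str]] = {
    "d1": {"likes_apples", "likes_jazz"},
    "d2": {"likes_apples", "likes_chess"},
    "d3": {"likes_jazz"},
    "d4": {"likes_apples", "likes_jazz", "likes_chess"},
}


def having(attr: str) -> Set[str]:
    """Return the ids of all documents that carry the given attribute."""
    return {doc_id for doc_id, attrs in DOCS.items() if attr in attrs}


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant set that appears in the top-k of a ranking."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


apples = having("likes_apples")
jazz = having("likes_jazz")
chess = having("likes_chess")

conjunction = apples & jazz      # A AND B
disjunction = apples | jazz      # A OR B
exclusion = apples - chess       # A NOT C
composed = (apples & jazz) - chess  # (A AND B) NOT C

# A hypothetical system ranking to score against the composed query.
ranking = ["d4", "d1", "d3", "d2"]
print(sorted(composed), recall_at_k(ranking, composed, k=2))
```

In this toy ranking, `d4` sits at rank 1 even though it violates the exclusion constraint: it matches every mentioned attribute without satisfying the query. That is one way to picture the kind of shortcut the abstract asks about.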
