Towards Reliable Testing for Multiple Information Retrieval System Comparisons
David Otero, Javier Parapar, Álvaro Barreiro

TL;DR
This paper evaluates the reliability of multiple comparison procedures for testing Information Retrieval systems and finds that Wilcoxon with the Benjamini-Hochberg correction offers the best balance between Type I error control and statistical power among the procedures tested.
Contribution
The paper uses a new approach to assess the reliability of multiple comparison procedures in IR and demonstrates, through experiments, the effectiveness of Wilcoxon with the Benjamini-Hochberg correction.
Findings
Wilcoxon plus Benjamini-Hochberg keeps Type I error rates in line with the significance level for typical sample sizes.
This combination provides the best statistical power among tested methods.
The approach is validated on simulated and real TREC data.
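The Benjamini-Hochberg correction named in the findings can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `benjamini_hochberg` and the example p-values are hypothetical. In practice, each p-value would come from a paired Wilcoxon signed-rank test on the per-topic scores of two systems (for example via `scipy.stats.wilcoxon`).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four pairwise system comparisons.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)

alpha = 0.05
# A comparison is declared significant when its adjusted p-value is <= alpha.
significant = [p <= alpha for p in adj_p]
print(adj_p)
print(significant)
```

With these made-up inputs, three comparisons fall below 0.05 before correction, but only the first remains significant after it.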
Abstract
Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are just due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most of the IR real-world experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates according to the significance level for typical sample sizes while…
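A rough calculation shows why testing several systems at once may inflate errors. The sketch below is not from the paper. It assumes the pairwise tests are independent, which is a simplification because comparisons that share a system are correlated, and the function names are illustrative only.

```python
def num_pairwise_comparisons(num_systems):
    """Number of distinct system pairs among num_systems systems."""
    return num_systems * (num_systems - 1) // 2


def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across num_tests independent
    tests, each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


m = num_pairwise_comparisons(5)        # 5 systems give 10 pairwise tests
fwer = familywise_error_rate(0.05, m)  # uncorrected tests at alpha = 0.05
print(m, round(fwer, 3))               # prints: 10 0.401
```

Under these assumptions, comparing just five systems without any correction gives roughly a 40% chance of at least one spurious significant difference, which is the kind of inflation that multiple comparison procedures are meant to control.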
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Information Retrieval and Search Behavior
