Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research
Julián Urbano

TL;DR
This paper critically examines the misuse of the Wilcoxon signed-rank test in IR research, demonstrating its unreliability and advocating for its abandonment to improve methodological rigor.
Contribution
It provides a systematic review and empirical analysis showing that Wilcoxon is often misapplied and unreliable in IR benchmarking, challenging its continued use.
Findings
The Wilcoxon test frequently loses control of the Type I error rate in IR settings.
Misconceptions about Wilcoxon's assumptions are widespread in IR literature.
Abandoning Wilcoxon would enhance the methodological soundness of IR evaluation.
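To make the first finding concrete, here is a minimal simulation sketch of how a signed-rank test can reject far more often than its nominal level even when the true mean difference between two systems is zero. It rests on several assumptions that are not taken from the paper: the per-topic differences are synthetic rather than real IR scores, the null hypothesis is framed as "mean difference is zero", and the violated assumption is taken to be symmetry of the differences, which this summary does not name. The helper names (`wilcoxon_p`, `rejection_rate`, `draw_symmetric`, `draw_skewed`) are hypothetical, and the p-value uses the standard large-sample normal approximation rather than an exact computation.

```python
import math
import random


def wilcoxon_p(diffs):
    """Two-sided signed-rank p-value via the normal approximation.

    Assumes continuous data, so ties are ignored and exact zeros are dropped.
    """
    d = [x for x in diffs if x != 0.0]
    n = len(d)
    # Rank the absolute differences from smallest (rank 1) to largest (rank n).
    order = sorted(range(n), key=lambda i: abs(d[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(draw, n_topics=50, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the test rejects at alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        diffs = [draw(rng) for _ in range(n_topics)]
        if wilcoxon_p(diffs) < alpha:
            rejections += 1
    return rejections / n_trials


def draw_symmetric(rng):
    # Mean zero, symmetric around zero.
    return rng.gauss(0.0, 1.0)


def draw_skewed(rng):
    # Mean zero, but right-skewed (shifted exponential), so not symmetric.
    return rng.expovariate(1.0) - 1.0


rate_symmetric = rejection_rate(draw_symmetric)
rate_skewed = rejection_rate(draw_skewed)
print(f"symmetric differences: {rate_symmetric:.3f}")
print(f"skewed differences:    {rate_skewed:.3f}")
```

Both generators have a true mean of zero, so every rejection is a false positive with respect to that null. Under these assumptions the rate for symmetric differences should sit near the nominal 0.05, while the rate for skewed differences should come out well above it. This only illustrates the general mechanism and does not reproduce the paper's analysis.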
Abstract
In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and…
