Please, Don't Forget the Difference and the Confidence Interval when Seeking for the State-of-the-Art Status
Yves Bestgen

TL;DR
This paper advocates for using bootstrap confidence intervals over traditional SOTA comparisons in NLP, emphasizing their ability to highlight performance differences and quantify superiority, supported by case studies and a Python toolkit.
Contribution
It advocates the widespread use of bootstrap confidence intervals for comparing NLP systems, provides practical tools for computing them, and illustrates their advantages over significance testing.
Findings
Bootstrap confidence intervals effectively highlight performance differences.
Confidence intervals quantify the degree of system superiority.
Tools for calculating these intervals are freely available in Python.
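To make the idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the difference in mean score between two systems evaluated on the same test items. It is not the author's PyPI module: the function name `paired_bootstrap_ci`, its parameters, and the toy data are assumptions made for illustration, and the percentile method is only one of several ways to build a bootstrap interval.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Hypothetical helper: scores are per-item values (e.g. 1 = correct,
    0 = wrong) for two systems run on the same test items.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(scores_a)
    # Work on per-item differences so the pairing is preserved.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    boot_means = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the mean difference.
        sample = rng.choices(diffs, k=n)
        boot_means.append(sum(sample) / n)
    boot_means.sort()

    # Percentile bounds, clamped to valid list indices.
    lo_idx = min(max(int((alpha / 2) * n_resamples), 0), n_resamples - 1)
    hi_idx = min(max(int((1 - alpha / 2) * n_resamples) - 1, 0), n_resamples - 1)
    return observed, boot_means[lo_idx], boot_means[hi_idx]


# Toy data (invented): system A is correct on 80 of 100 items, system B on 70.
system_a = [1] * 80 + [0] * 20
system_b = [1] * 70 + [0] * 30
diff, lower, upper = paired_bootstrap_ci(system_a, system_b)
print(f"difference = {diff:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Reporting the difference together with its interval shows both how large the gain is and how uncertain it is, which a bare "new SOTA" claim does not.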
Abstract
This paper argues for the widest possible use of bootstrap confidence intervals for comparing NLP system performances instead of the state-of-the-art status (SOTA) and statistical significance testing. Their main benefits are to draw attention to the difference in performance between two systems and to help assess the degree of superiority of one system over another. Two case studies, one comparing several systems and the other based on a K-fold cross-validation procedure, illustrate these benefits. A Python module for obtaining these confidence intervals as well as a second function implementing the Fisher-Pitman test for paired samples are freely available on PyPI.
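For readers unfamiliar with the Fisher-Pitman test for paired samples, the sketch below shows one common way to approximate it: a Monte Carlo randomization test that randomly flips the sign of each per-item difference. This is an illustrative assumption about the general technique, not the implementation distributed by the author; the function name `paired_permutation_test` and its parameters are hypothetical.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test for paired samples.

    Hypothetical helper: under the null hypothesis the two systems are
    exchangeable on each item, so the sign of each difference is arbitrary.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    extreme = 0
    for _ in range(n_permutations):
        # Flip each difference's sign with probability 0.5.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Toy data (invented): system A is right on 30 items where system B is wrong.
p_value = paired_permutation_test([1] * 30, [0] * 30)
print(f"p = {p_value:.4f}")
```

A small p-value only says that a difference is unlikely under the null hypothesis; it does not say how large the difference is, which is why the paper recommends reporting the confidence interval as well.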
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Advanced Text Analysis Techniques · Explainable Artificial Intelligence (XAI)
