Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets
Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart

TL;DR
This paper introduces a statistically rigorous framework for analyzing whether the performance of NLP algorithms replicates across multiple datasets. It addresses the limitations of traditional significance testing in this setting and demonstrates the framework on four NLP tasks.
Contribution
The paper proposes a Replicability Analysis framework that puts comparisons of NLP algorithms across multiple datasets on a statistically sound footing.
Findings
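The framework provides statistically sound significance testing when algorithms are compared across multiple datasets.
It is validated empirically on four NLP applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification, and word similarity prediction.
The analysis highlights the limitations of the current, statistically unjustified practice in the NLP literature.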
Abstract
With the ever-growing amounts of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper, we propose a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks. We discuss the theoretical advantages of this framework over the current, statistically unjustified, practice in the NLP literature, and demonstrate its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.
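The abstract does not spell out the statistical procedure, so the following is only a minimal sketch of the multiple-comparisons problem it describes, not the paper's own method. It assumes one p-value per dataset (the values below are made up for illustration) and contrasts naive counting of datasets with p <= alpha against Holm's step-down correction, a standard procedure that controls the family-wise error rate. The function name holm_reject and the example numbers are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per p-value: True where Holm's step-down
    procedure rejects the null hypothesis at family-wise level alpha."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down: stop at the first failure; all larger p-values are retained.
            break
    return reject


# Hypothetical per-dataset p-values (illustrative only, not from the paper).
p_values = [0.001, 0.04, 0.03, 0.20, 0.012]

# Naive practice: count every dataset whose uncorrected p-value is below alpha.
naive = [p <= 0.05 for p in p_values]
holm = holm_reject(p_values, alpha=0.05)

print("naive count:", sum(naive))
print("Holm count: ", sum(holm))
```

With these made-up p-values, naive counting reports four significant datasets, while Holm's procedure retains only two. The gap shows how uncorrected per-dataset testing can overstate how widely an improvement holds.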
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
