TL;DR
This paper demonstrates that reporting a single performance score for sequence tagging models is insufficient, as seed variability can significantly affect results, and advocates for using score distributions from multiple runs for fair comparison.
Contribution
The paper proposes comparing score distributions over multiple runs instead of single scores. It also evaluates 50,000 LSTM networks across five sequence tagging tasks to identify architectures that are both accurate and stable with respect to the remaining hyperparameters.
Findings
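The random seed alone can produce statistically significant (p < 10^-4) differences for state-of-the-art systems; for two recent NER systems, F1 varies by one absolute percentage point depending on the seed.
Score distributions from multiple runs give a more reliable basis for comparison than a single score.
Some network architectures achieve both higher performance and greater stability with respect to the remaining hyperparameters.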
Abstract
In this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10^-4) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point F1-score depending on the selected seed value, making these systems perceived either as state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we present network architectures that both produce superior performance and are more stable with respect to the remaining hyperparameters.
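The proposal to compare score distributions rather than single scores can be illustrated with a minimal sketch. It is not the paper's evaluation code: the helper names `summarize` and `permutation_test` are hypothetical, the F1 values are invented for illustration, and the permutation test on the difference of means is a stand-in chosen here because it needs only the standard library, not necessarily the test used by the authors.

```python
import random
import statistics


def summarize(scores):
    """Return descriptive statistics for a list of per-seed scores."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Illustrative choice of test (an assumption, not the paper's method).
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled scores to the two systems.
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[: len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical F1 scores, one per random seed (invented for illustration).
system_a = [90.1, 90.6, 89.8, 90.9, 90.3, 90.0, 90.7, 90.4]
system_b = [90.5, 90.2, 90.8, 90.4, 90.6, 90.3, 90.7, 90.5]

if __name__ == "__main__":
    print("A:", summarize(system_a))
    print("B:", summarize(system_b))
    print("p-value:", permutation_test(system_a, system_b))
```

In this made-up data, the best and worst seeds of system A differ by about one F1 point, so a single reported run could place A either above or below B. Reporting the mean, spread, and a test over all runs makes that visible.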
