TL;DR
Robustness Gym is an extensible toolkit that unifies four NLP robustness evaluation paradigms on one platform, making evaluation methods easier to compare and develop. It is demonstrated through case studies on sentiment analysis, named entity linking (NEL), and summarization.
Contribution
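The paper introduces Robustness Gym (RG), a single platform covering four NLP robustness evaluation paradigms (subpopulations, transformations, evaluation sets, and adversarial attacks), which simplifies comparing results and developing and sharing new evaluation methods.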
Findings
Sentiment model performance degrades by 18%+ under robustness testing.
Commercial NEL systems lag academic ones by 10%+ on rare entities.
Summarization models struggle with abstraction, degrading by 9%+.
Abstract
Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all 4 evaluation paradigms with just a few clicks, and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate…
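To make two of the four paradigms named in the abstract concrete, here is a minimal plain-Python sketch of a subpopulation check and a transformation check. It does not use the Robustness Gym API. The classifier `toy_model`, the four-example dataset, and the helper names (`subpopulation`, `transformation`, `build_report`) are hypothetical and exist only for illustration. Evaluation sets and adversarial attacks are left out to keep it short.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, gold label); 1 = positive, 0 = negative


def toy_model(text: str) -> int:
    """Hypothetical stand-in classifier: predicts positive if 'good' appears."""
    return 1 if "good" in text.lower() else 0


def accuracy(model: Callable[[str], int], examples: List[Example]) -> float:
    """Fraction of examples the model labels correctly (NaN if the slice is empty)."""
    if not examples:
        return float("nan")
    correct = sum(model(text) == label for text, label in examples)
    return correct / len(examples)


def subpopulation(examples: List[Example],
                  predicate: Callable[[str], bool]) -> List[Example]:
    """Subpopulation: keep only the examples whose text satisfies the predicate."""
    return [ex for ex in examples if predicate(ex[0])]


def transformation(examples: List[Example],
                   transform: Callable[[str], str]) -> List[Example]:
    """Transformation: rewrite every text with a rule and keep the gold label."""
    return [(transform(text), label) for text, label in examples]


def build_report(model: Callable[[str], int],
                 examples: List[Example]) -> Dict[str, float]:
    """Evaluate the same model on the original data and on each derived slice."""
    slices = {
        "original": examples,
        "short_inputs": subpopulation(examples, lambda t: len(t.split()) <= 4),
        "typo_transform": transformation(examples, lambda t: t.replace("good", "g00d")),
    }
    return {name: accuracy(model, data) for name, data in slices.items()}


# Tiny made-up dataset, for illustration only.
dataset: List[Example] = [
    ("The movie was good", 1),
    ("A good, heartfelt story", 1),
    ("Not worth the time", 0),
    ("The plot was dull and the acting was flat throughout", 0),
]

report = build_report(toy_model, dataset)
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Reporting every slice next to the original score is what makes results comparable across paradigms. In this toy setup the keyword model is perfect on the original data but drops sharply once a simple character substitution removes its trigger word.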
