Robustness Gym: Unifying the NLP Evaluation Landscape

Karan Goel; Nazneen Rajani; Jesse Vig; Samson Tan; Jason Wu; Stephan Zheng; Caiming Xiong; Mohit Bansal; Christopher Ré

arXiv:2101.04840·cs.CL·January 14, 2021

TL;DR

Robustness Gym is an extensible toolkit that unifies four NLP evaluation paradigms (subpopulations, transformations, evaluation sets, and adversarial attacks), making robustness assessments easier to compare and develop. It is demonstrated through case studies on sentiment analysis, named entity linking (NEL), and summarization.

Contribution

The paper introduces Robustness Gym (RG), a single toolkit for diverse NLP robustness evaluations that simplifies comparison across methods and supports the development and sharing of new ones.

Findings

01. Sentiment model performance degrades by 18%+ under robustness testing.

02. Commercial NEL systems lag academic ones by 10%+ on rare entities.

03. Summarization models struggle with abstraction, degrading by 9%+.

Abstract

Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all 4 evaluation paradigms with just a few clicks, and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate…
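
To make two of the four paradigms concrete, here is a minimal Python sketch of scoring one model on a subpopulation and on a transformed copy of the same data. It does not use Robustness Gym's actual API: `toy_model`, `subpopulation`, `transformation`, `build_report`, and the four-example dataset are all hypothetical names and data invented for illustration. Evaluation sets and adversarial attacks are left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, label); 1 = positive, 0 = negative


def toy_model(text: str) -> int:
    """Hypothetical stand-in for a sentiment model: a case-sensitive keyword rule."""
    return 1 if "good" in text or "great" in text else 0


def accuracy(model: Callable[[str], int], examples: List[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    if not examples:
        return float("nan")
    return sum(model(text) == label for text, label in examples) / len(examples)


def subpopulation(examples: List[Example], keep: Callable[[str], bool]) -> List[Example]:
    """Subpopulation: filter the dataset to examples whose text satisfies `keep`."""
    return [ex for ex in examples if keep(ex[0])]


def transformation(examples: List[Example], rewrite: Callable[[str], str]) -> List[Example]:
    """Transformation: rewrite every text while keeping its label."""
    return [(rewrite(text), label) for text, label in examples]


def build_report(model: Callable[[str], int], examples: List[Example]) -> Dict[str, float]:
    """Score the same model on the original data and on each derived slice."""
    slices = {
        "original": examples,
        "short_texts": subpopulation(examples, lambda t: len(t.split()) <= 4),
        "uppercase": transformation(examples, str.upper),
    }
    return {name: accuracy(model, data) for name, data in slices.items()}


# Invented toy data, not taken from the paper.
DATA: List[Example] = [
    ("a good movie", 1),
    ("great acting and a good script overall", 1),
    ("boring and far too long", 0),
    ("not worth the ticket", 0),
]

if __name__ == "__main__":
    for name, score in build_report(toy_model, DATA).items():
        print(f"{name}: {score:.2f}")
```

On this toy data the keyword rule is perfect on the original examples and on the short-text subpopulation, but its accuracy halves on the uppercased copy because the rule is case-sensitive. Placing every slice in one report is the kind of side-by-side comparison the abstract describes, here reduced to a few lines.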
