Evaluating Models' Local Decision Boundaries via Contrast Sets
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang

TL;DR
This paper introduces contrast sets: small, manually written perturbations of test instances that typically change the gold label. Evaluating on them gives a more accurate picture of a model's linguistic capabilities than the standard test set alone.
Contribution
It proposes a systematic annotation paradigm for creating contrast sets that reveal models' local decision boundaries and true capabilities.
Findings
Model performance drops by up to 25% on contrast sets relative to the original test sets
Contrast sets reveal systematic gaps in models' understanding
The approach improves evaluation of linguistic capabilities
Abstract
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP…
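The evaluation idea in the abstract can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen for illustration: the `ContrastSet` container, the `evaluate_contrast_sets` function, the toy sentiment examples, and the keyword-based `keyword_model` are not taken from the paper's code. The sketch compares accuracy on original test instances with accuracy on their perturbed counterparts, and also reports a consistency score, assumed here to be the fraction of contrast sets in which the model gets the original and every perturbed instance right.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


@dataclass
class ContrastSet:
    """One original test instance plus its manually perturbed variants (hypothetical container)."""
    original: Example
    perturbed: List[Example]


def evaluate_contrast_sets(
    predict: Callable[[str], str], sets: List[ContrastSet]
) -> Dict[str, float]:
    """Compare accuracy on originals vs. perturbed instances, plus per-set consistency."""
    orig_correct = pert_correct = pert_total = consistent = 0
    for cs in sets:
        orig_ok = predict(cs.original[0]) == cs.original[1]
        pert_ok = [predict(text) == gold for text, gold in cs.perturbed]
        orig_correct += orig_ok
        pert_correct += sum(pert_ok)
        pert_total += len(pert_ok)
        # A set counts as consistent only if every instance in it is correct.
        consistent += orig_ok and all(pert_ok)
    return {
        "original_accuracy": orig_correct / len(sets),
        "contrast_accuracy": pert_correct / pert_total,
        "consistency": consistent / len(sets),
    }


def keyword_model(text: str) -> str:
    """Toy classifier that relies on a single surface cue (an assumed 'simple decision rule')."""
    return "positive" if "great" in text.lower() else "negative"


# Toy data: each perturbation is a small edit that flips the gold label.
toy_sets = [
    ContrastSet(("The movie was great.", "positive"),
                [("The movie was not great.", "negative")]),
    ContrastSet(("The plot was dull.", "negative"),
                [("The plot was gripping.", "positive")]),
    ContrastSet(("A great cast.", "positive"),
                [("A terrible cast.", "negative")]),
]

results = evaluate_contrast_sets(keyword_model, toy_sets)
print(results)
```

In this toy setup the keyword rule is right on every original instance but fails on two of the three perturbed ones, which is the kind of gap between in-distribution accuracy and local decision-boundary behavior that contrast sets are meant to expose.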
Videos
[Drama] Who invented Contrast Sets? · youtube
Evaluating NLP Models via Contrast Sets · youtube
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications
