Evaluating Models' Local Decision Boundaries via Contrast Sets

Matt Gardner; Yoav Artzi; Victoria Basmova; Jonathan Berant; Ben Bogin; Sihao Chen; Pradeep Dasigi; Dheeru Dua; Yanai Elazar; Ananth Gottumukkala; Nitish Gupta; Hanna Hajishirzi; Gabriel Ilharco; Daniel Khashabi; Kevin Lin; Jiangming Liu; Nelson F. Liu; Phoebe Mulcaire; Qiang Ning; Sameer Singh; Noah A. Smith; Sanjay Subramanian; Reut Tsarfaty; Eric Wallace; Ally Zhang; Ben Zhou

arXiv:2004.02709·cs.CL·October 5, 2020·44 cites

TL;DR

This paper introduces contrast sets, a new annotation method involving small perturbations to test data, to better evaluate models' true linguistic understanding beyond standard test sets.

Contribution

It proposes a systematic annotation paradigm for creating contrast sets that reveal models' local decision boundaries and true capabilities.

Findings

01 Model performance drops by up to 25% on contrast sets

02 Contrast sets reveal systematic gaps in models' understanding

03 The approach improves evaluation of linguistic capabilities

Abstract

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP…
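The gap the abstract describes can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the paper or its repository: the toy sentiment examples, the `keyword_model` shortcut classifier, and the `accuracy` helper are assumptions chosen only to show how a simple decision rule can score well on an original test set and fail on small, label-changing perturbations of the same instances.

```python
from typing import Callable, List, Tuple

# An example is a (text, gold label) pair. This representation is an assumption
# made for illustration, not the format used by any particular dataset.
Example = Tuple[str, str]


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the model's prediction matches the gold label."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(1 for text, gold in examples if model(text) == gold)
    return correct / len(examples)


def keyword_model(text: str) -> str:
    """Hypothetical shortcut classifier: 'positive' iff the word 'great' appears."""
    return "positive" if "great" in text.lower() else "negative"


# Toy original test instances (invented for this sketch).
original_test = [
    ("The acting was great.", "positive"),
    ("The plot was dull.", "negative"),
]

# Hand-written perturbations of the instances above: small edits that
# flip the gold label, in the spirit of a contrast set.
contrast_set = [
    ("The acting was not great.", "negative"),
    ("The plot was anything but dull.", "positive"),
]

original_accuracy = accuracy(keyword_model, original_test)
contrast_accuracy = accuracy(keyword_model, contrast_set)
print(f"original: {original_accuracy:.2f}, contrast: {contrast_accuracy:.2f}")
```

In this toy setup the keyword rule gets every original instance right and every perturbed instance wrong, which is the kind of local decision-boundary failure that in-distribution accuracy alone would hide.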

Code & Models

Repositories

allenai/contrast-sets · Official

Videos

[Drama] Who invented Contrast Sets? · YouTube

Evaluating NLP Models via Contrast Sets · YouTube

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications