Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
William Huang, Haokun Liu, and Samuel R. Bowman

TL;DR
Counterfactually-augmented SNLI training data does not improve model generalization or robustness compared to unaugmented data, and may even reduce performance, indicating the need for new data augmentation methods.
Contribution
This study evaluates the effectiveness of counterfactual data augmentation on natural language inference and finds it does not enhance model generalization or robustness.
Findings
Counterfactual augmentation does not improve out-of-domain generalization.
Models trained on augmented data can perform worse on challenge examples.
Standard crowdsourcing methods for augmentation are ineffective for NLI datasets.
Abstract
A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks (datasets collected from crowdworkers to create an evaluation task) while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data (data built by minimally editing a set of seed examples to yield counterfactual labels) to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness and find that models trained on a…
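The idea of controlling for dataset size can be illustrated with a minimal sketch. It is not the authors' code: the `NLIExample` type, the `size_matched_sets` helper, and the way extra original examples are sampled are all assumptions made for illustration. The sketch builds an augmented training set (seed examples plus their counterfactual revisions) and an unaugmented set of the same size (the same seeds plus an equal number of additional original examples), so that any difference in downstream accuracy is not simply due to one set being larger.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIExample:
    # Hypothetical container for one NLI example; field names are assumptions.
    premise: str
    hypothesis: str
    label: str  # "entailment", "neutral", or "contradiction"


def size_matched_sets(seeds, counterfactuals, extra_originals, rng):
    """Return (augmented, unaugmented) training sets with equal size.

    augmented   = seeds + counterfactual revisions of those seeds
    unaugmented = seeds + an equally sized random sample of other originals
    """
    n_extra = len(counterfactuals)
    if len(extra_originals) < n_extra:
        raise ValueError("not enough original examples to match the augmented set")
    augmented = list(seeds) + list(counterfactuals)
    unaugmented = list(seeds) + rng.sample(list(extra_originals), n_extra)
    return augmented, unaugmented


# Toy data, invented purely to exercise the helper.
seeds = [NLIExample("A man is sleeping.", "A person rests.", "entailment")]
counterfactuals = [NLIExample("A man is sleeping.", "A person is running.", "contradiction")]
pool = [
    NLIExample("A dog barks.", "An animal makes noise.", "entailment"),
    NLIExample("A child reads.", "The child is outside.", "neutral"),
]

augmented, unaugmented = size_matched_sets(seeds, counterfactuals, pool, random.Random(0))
```

Both sets would then be used to train the same model architecture and evaluated on the same out-of-domain and challenge sets.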
Taxonomy
Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning
