CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation

Tanay Dixit; Bhargavi Paranjape; Hannaneh Hajishirzi; Luke Zettlemoyer

arXiv:2210.04873·cs.CL·November 2, 2022

TL;DR

CORE introduces a retrieval-augmented framework that generates diverse, natural counterfactuals for data augmentation, significantly enhancing model robustness and out-of-distribution generalization in NLP tasks.

Contribution

The paper proposes CORE, a retrieve-then-edit framework that uses unlabeled data and large language models to produce diverse counterfactual examples for training-data augmentation.

Findings

1. CORE outperforms existing data augmentation methods in OOD generalization.

2. Retrieval-based perturbations increase diversity and naturalness of counterfactuals.

3. CORE can also be used to promote diversity in manual perturbations.

Abstract

Counterfactual data augmentation (CDA) -- i.e., adding minimally perturbed inputs during training -- helps reduce model reliance on spurious correlations and improves generalization to out-of-distribution (OOD) data. Prior work on generating counterfactuals only considered restricted classes of perturbations, limiting their effectiveness. We present COunterfactual Generation via Retrieval and Editing (CORE), a retrieval-augmented generation framework for creating diverse counterfactual perturbations for CDA. For each training example, CORE first performs a dense retrieval over a task-related unlabeled text corpus using a learned bi-encoder and extracts relevant counterfactual excerpts. CORE then incorporates these into prompts to a large language model with few-shot learning capabilities, for counterfactual editing. Conditioning language model edits on naturally occurring data results…
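The two stages described in the abstract (retrieve excerpts, then prompt a language model to edit) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the bag-of-words `embed` function is a toy stand-in for the learned bi-encoder, the prompt layout and the names `retrieve` and `build_edit_prompt` are hypothetical, and no language model is actually called. How the paper's retriever is trained to favour counterfactual excerpts is not shown.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned bi-encoder: sparse bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return the k corpus excerpts most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_edit_prompt(text, label, target_label, excerpts, demos):
    """Assemble a few-shot editing prompt (hypothetical layout, not the paper's)."""
    lines = []
    for d in demos:
        lines += [
            f"Input ({d['label']}): {d['text']}",
            f"Retrieved: {d['retrieved']}",
            f"Edited ({d['target']}): {d['edited']}",
            "",
        ]
    lines += [
        f"Input ({label}): {text}",
        f"Retrieved: {' | '.join(excerpts)}",
        f"Edited ({target_label}):",
    ]
    return "\n".join(lines)


# Made-up example data, for illustration only.
corpus = [
    "The acting was wooden and the plot dragged.",
    "The plot was gripping and the acting superb.",
    "Shipping took three weeks to arrive.",
]
demos = [
    {
        "label": "positive",
        "text": "A warm and funny film.",
        "retrieved": "cold and humorless",
        "target": "negative",
        "edited": "A cold and humorless film.",
    }
]
review = "The acting was great and the plot was fun."
excerpts = retrieve(review, corpus, k=2)
prompt = build_edit_prompt(review, "positive", "negative", excerpts, demos)
print(prompt)
```

The prompt ends at the open `Edited (negative):` slot, which is where a few-shot language model would be asked to write the counterfactual. Swapping `embed` for a trained dense encoder changes only the retrieval step; the prompt assembly stays the same.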

Code & Models

Repositories

tanay2001/core (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Counterfactuals Explanations