# Good-Enough Compositional Data Augmentation

**Authors:** Jacob Andreas

arXiv: 1904.09545 · 2020-05-20

## TL;DR

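This paper introduces a simple, model-agnostic data augmentation protocol that gives sequence models a compositional inductive bias: synthetic training examples are built by replacing fragments of real examples with other fragments that appear in at least one similar environment. It reduces error rate by up to 87% on SCAN diagnostic tasks and by 16% on a semantic parsing task, and lowers n-gram language model perplexity by roughly 1% on small corpora.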

## Contribution

The proposed protocol injects a compositional inductive bias through the training data alone, so it applies to both conditional and unconditional sequence models without changes to the model itself.

## Key findings

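- Reduces error rate by up to 87% on SCAN diagnostic tasks and by 16% on a semantic parsing task (neural sequence-to-sequence models).
- Decreases perplexity by roughly 1% on small corpora in several languages (n-gram language models).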
- Applicable to both neural sequence models and n-gram language models.

## Abstract

We propose a simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models. Under this protocol, synthetic training examples are constructed by taking real training examples and replacing (possibly discontinuous) fragments with other fragments that appear in at least one similar environment. The protocol is model-agnostic and useful for a variety of tasks. Applied to neural sequence-to-sequence models, it reduces error rate by as much as 87% on diagnostic tasks from the SCAN dataset and 16% on a semantic parsing task. Applied to n-gram language models, it reduces perplexity by roughly 1% on small corpora in several languages.
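The fragment-replacement idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it is a simplified version under several assumptions: fragments are contiguous token spans only (the abstract also allows discontinuous ones), an "environment" is taken to be a fixed window of tokens on each side of the fragment, and only unconditional sequences are handled. The names `augment`, `spans`, `environment`, `max_frag_len`, and `window` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Assumption: a sentence is a tuple of tokens; a fragment is a contiguous sub-tuple.


def spans(sentence, max_frag_len):
    """Yield (start, end) for every contiguous span of 1..max_frag_len tokens."""
    for start in range(len(sentence)):
        last = min(start + max_frag_len, len(sentence))
        for end in range(start + 1, last + 1):
            yield start, end


def environment(sentence, start, end, window):
    """Assumed definition: up to `window` tokens on each side of the fragment."""
    left = sentence[max(0, start - window):start]
    right = sentence[end:end + window]
    return (left, right)


def augment(corpus, max_frag_len=2, window=1):
    """Return synthetic sentences (as token tuples) that are not in the corpus."""
    train = {tuple(sentence) for sentence in corpus}

    # Step 1: group fragments by the environments they occur in.
    env_to_frags = defaultdict(set)
    for sent in train:
        for start, end in spans(sent, max_frag_len):
            env = environment(sent, start, end, window)
            env_to_frags[env].add(sent[start:end])

    # Step 2: two fragments are interchangeable if they share an environment.
    substitutes = defaultdict(set)
    for frags in env_to_frags.values():
        for frag in frags:
            substitutes[frag] |= frags - {frag}

    # Step 3: substitute interchangeable fragments wherever the original occurs.
    synthetic = set()
    for sent in train:
        for start, end in spans(sent, max_frag_len):
            for other in substitutes.get(sent[start:end], ()):
                candidate = sent[:start] + other + sent[end:]
                if candidate not in train:
                    synthetic.add(candidate)
    return synthetic


corpus = [
    "the cat sleeps".split(),
    "the dog sleeps".split(),
    "the cat eats fish".split(),
]
print(augment(corpus))  # {('the', 'dog', 'eats', 'fish')}
```

In this toy corpus, "cat" and "dog" both occur between "the" and "sleeps", so the sketch treats them as interchangeable and produces "the dog eats fish" from "the cat eats fish". The synthetic sentences would then be added to the training set of whatever model is being trained, which is what makes the approach model-agnostic.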

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.09545/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1904.09545/full.md

## References

50 references — full list in the complete paper: https://tomesphere.com/paper/1904.09545/full.md

---
Source: https://tomesphere.com/paper/1904.09545