Is linguistically-motivated data augmentation worth it?

Ray Groshan; Michael Ginn; Alexis Palmer

arXiv:2506.03593 · cs.CL · June 5, 2025

TL;DR

This paper systematically compares linguistically-naive and linguistically-motivated data augmentation methods for low-resource languages, finding that linguistically-motivated strategies can improve performance if the synthetic data closely resembles the original data distribution.

Contribution

The paper provides the first comprehensive empirical evaluation of linguistically-naive and linguistically-motivated augmentation strategies across two low-resource languages and two sequence-to-sequence tasks.

Findings

01 Linguistically-motivated augmentation can outperform naive methods.

02 Effectiveness depends on the similarity of synthetic data to training data.

03 Strategies show varied benefits across languages and tasks.

Abstract

Data augmentation, a widely employed technique for addressing data scarcity, involves generating synthetic examples that are then used to augment the available training data. Researchers have seen surprising success from simple methods, such as random perturbations of natural examples, where models seem to benefit even from data containing nonsense words or data that does not conform to the rules of the language. A second line of research produces synthetic data that does in fact follow all linguistic constraints; these methods require some linguistic expertise and are generally more challenging to implement. No previous work has done a systematic, empirical comparison of both linguistically-naive and linguistically-motivated data augmentation strategies, leaving uncertainty about whether the additional time and effort of linguistically-motivated data augmentation work in fact yields…
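
To make the "random perturbations" idea concrete, here is a minimal sketch of a linguistically-naive augmenter for sequence-to-sequence pairs. It is not the paper's implementation: the function names (`perturb_tokens`, `augment`), the two perturbation types (token deletion and character shuffling), the default probabilities, and the choice to perturb only the source side while keeping the target unchanged are all assumptions made for illustration.

```python
import random


def perturb_tokens(tokens, rng, p_delete=0.1, p_scramble=0.1):
    """Return a noisy copy of `tokens` (hypothetical helper).

    Each token is dropped with probability p_delete, or has its
    characters shuffled with probability p_scramble. Shuffling can
    produce nonsense words, which is the point of a naive strategy.
    """
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_delete:
            continue  # drop this token entirely
        if r < p_delete + p_scramble and len(tok) > 1:
            chars = list(tok)
            rng.shuffle(chars)
            tok = "".join(chars)
        out.append(tok)
    # Fall back to the original tokens if everything was deleted.
    return out or list(tokens)


def augment(pairs, n_copies=2, seed=0):
    """Append n_copies noisy variants of each (source, target) pair.

    Assumption: only the source side is perturbed; the target is kept
    as-is. Other setups may need to perturb both sides consistently.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    synthetic = []
    for src, tgt in pairs:
        for _ in range(n_copies):
            noisy = perturb_tokens(src.split(), rng)
            synthetic.append((" ".join(noisy), tgt))
    return list(pairs) + synthetic


if __name__ == "__main__":
    data = [("the cat sat on the mat", "LABEL-A")]
    for src, tgt in augment(data, n_copies=3):
        print(src, "->", tgt)
```

A linguistically-motivated strategy would replace `perturb_tokens` with a transformation that respects the language's grammar, which is where the extra expertise and implementation effort mentioned in the abstract comes in.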

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · ICT in Developing Communities