To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Gözde Gül Şahin

arXiv:2111.09618·cs.CL·November 19, 2021

Open Access

TL;DR

This study systematically compares various text augmentation techniques across multiple low-resource languages and NLP tasks, revealing their varying effectiveness and highlighting the importance of task, language, and model considerations.

Contribution

It provides a comprehensive analysis of syntax, token, and character-level augmentation methods for low-resource NLP tasks, filling a gap in systematic performance evaluation.

Findings

1. Character-level augmentation is most consistently effective.
2. Augmentation significantly improves dependency parsing performance.
3. Effectiveness varies by language, task, and model type.

Abstract

Data-hungry deep neural networks have established themselves as the standard for many NLP tasks, including traditional sequence tagging. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One way to counter this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently seen a large number of text augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies, which perform changes at the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare…
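As a rough illustration of two of the categories named in the abstract, the sketch below shows one possible form of character swapping and random word insertion. It is not the paper's implementation: the function names, the `prob` parameter, the choice to keep the first and last characters fixed, and the toy vocabulary are all assumptions made for this example. Syntax-level cropping is not sketched because it requires a dependency parse.

```python
import random


def swap_adjacent_chars(token, rng):
    """Swap one random pair of adjacent interior characters.

    Assumption for this sketch: the first and last characters stay in
    place, and tokens shorter than four characters are left unchanged.
    """
    if len(token) < 4:
        return token
    i = rng.randrange(1, len(token) - 2)  # i and i + 1 are both interior
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def char_swap_sentence(tokens, rng, prob=0.1):
    """Character-level augmentation: perturb each token with probability `prob`."""
    return [
        swap_adjacent_chars(tok, rng) if rng.random() < prob else tok
        for tok in tokens
    ]


def insert_random_word(tokens, vocabulary, rng):
    """Token-level augmentation: insert one vocabulary word at a random position."""
    position = rng.randrange(len(tokens) + 1)
    return tokens[:position] + [rng.choice(vocabulary)] + tokens[position:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = ["augmentation", "helps", "dependency", "parsing"]
    print(char_swap_sentence(sentence, rng, prob=0.5))
    print(insert_random_word(sentence, ["often", "sometimes"], rng))
```

Note that the character-level variant keeps the number of tokens unchanged, so per-token labels in a sequence tagging dataset still line up. The insertion variant changes the sequence length, so labels would have to be realigned; this sketch does not handle that.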

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: mBERT