Data Augmentation via Dependency Tree Morphing for Low-Resource Languages

Gözde Gül Şahin; Mark Steedman

arXiv:1903.09460 · cs.CL · March 25, 2019 · 5 cites

TL;DR

This paper introduces two dependency-tree-based data augmentation techniques, crop and rotate, that synthetically expand the training sets of low-resource languages and improve part-of-speech tagging for most of them.

Contribution

It proposes two dependency tree morphing methods for data augmentation and evaluates them on part-of-speech tagging for low-resource languages in the Universal Dependencies project.

Findings

01

Augmentation improves POS tagging accuracy for most low-resource languages.

02

Techniques are especially beneficial for languages with complex case systems.

03

Simple tree manipulations (removing dependency links, moving tree fragments around the root) improve over models trained on non-augmented data.

Abstract

Neural NLP systems achieve high scores in the presence of a sizable training dataset. Lack of such datasets leads to poor system performance in the case of low-resource languages. We present two simple text augmentation techniques using dependency trees, inspired by image processing. We crop sentences by removing dependency links, and we rotate sentences by moving the tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in the Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on the part-of-speech tagging task. We show that crop and rotate provide improvements over the models trained with non-augmented data for the majority of the languages, especially for languages with rich case marking systems.
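The two operations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy sentence, the (id, form, head) token layout, and the function names crop, rotate, children_of, subtree, and fragments are all hypothetical, and the sketch assumes a projective tree with a single root. It also ignores which dependency relations are eligible for each operation, a detail the abstract does not specify.

```python
import random
from collections import defaultdict

# Hypothetical toy sentence as (id, form, head) triples; head 0 marks the root.
SENTENCE = [
    (1, "She", 2),
    (2, "wrote", 0),
    (3, "a", 4),
    (4, "letter", 2),
    (5, "yesterday", 2),
]


def children_of(tokens):
    """Map each head id to the ids of its direct dependents."""
    kids = defaultdict(list)
    for tok_id, _, head in tokens:
        kids[head].append(tok_id)
    return kids


def subtree(tok_id, kids):
    """Return the ids of tok_id and all of its descendants."""
    ids = [tok_id]
    for child in kids[tok_id]:
        ids.extend(subtree(child, kids))
    return ids


def fragments(tokens):
    """Return the root id and one id-sorted fragment per root dependent."""
    kids = children_of(tokens)
    root = kids[0][0]  # assumes exactly one root token
    return root, [sorted(subtree(c, kids)) for c in kids[root]]


def crop(tokens, keep_child):
    """Keep the root and one dependent's subtree; remove all other links."""
    kids = children_of(tokens)
    root = kids[0][0]
    keep = set(subtree(keep_child, kids)) | {root}
    return [form for tok_id, form, _ in tokens if tok_id in keep]


def rotate(tokens, rng):
    """Shuffle the root and its fragments; word order inside a fragment stays."""
    root, frags = fragments(tokens)
    forms = {tok_id: form for tok_id, form, _ in tokens}
    pieces = frags + [[root]]
    rng.shuffle(pieces)
    return [forms[i] for piece in pieces for i in piece]
```

With this toy input, crop(SENTENCE, 4) keeps the root and the object fragment and returns ["wrote", "a", "letter"], while rotate(SENTENCE, random.Random(0)) returns the same five words in a new fragment order with "a letter" kept contiguous. Each output could then be added to the training set as an extra sentence, with every token keeping its original part-of-speech tag.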

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications