Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks

Minju Seo; Jinheon Baek; James Thorne; Sung Ju Hwang

arXiv:2402.13482 · cs.CL · February 22, 2024 · 2 cites

TL;DR

This paper introduces RADA, a retrieval-augmented data augmentation method that enhances low-resource domain tasks by retrieving relevant examples from other datasets and prompting LLMs to generate more diverse and relevant training data.

Contribution

The paper proposes a novel retrieval-augmented data augmentation framework that improves data diversity and relevance in low-resource settings by leveraging cross-dataset retrieval and LLM prompting.

Findings

01. RADA outperforms existing LLM-based augmentation methods in low-resource scenarios.
02. Retrieving relevant examples from other datasets enhances the quality of generated data.
03. The approach is effective across multiple datasets and augmentation scenarios.

Abstract

Despite large successes of recent language models on diverse tasks, they suffer from severe performance degeneration in low-resource settings with limited training data available. Many existing works tackle this problem by generating synthetic data from the training data and then training models on them, recently using Large Language Models (LLMs). However, in low-resource settings, the amount of seed data samples to use for data augmentation is very small, which makes generated samples suboptimal and less diverse. To tackle this challenge, we propose a novel method that augments training data by incorporating a wealth of examples from other datasets, along with the given training data. Specifically, we first retrieve the relevant instances from other datasets, such as their input-output pairs or contexts, based on their similarities with the given seed data, and then prompt LLMs to…
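
The retrieve-then-prompt idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the bag-of-words cosine similarity stands in for whatever retriever the paper uses, the function names (bow, cosine, retrieve, build_prompt), the prompt wording, and the toy examples are all hypothetical, and no LLM is actually called. It only shows how instances from an external pool might be ranked against a seed example and assembled into a prompt.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector as a token-count mapping (assumed stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(seed, pool, k=2):
    """Return the k pool examples whose input is most similar to the seed input."""
    seed_vec = bow(seed["input"])
    ranked = sorted(pool, key=lambda ex: cosine(seed_vec, bow(ex["input"])), reverse=True)
    return ranked[:k]


def build_prompt(seed, retrieved):
    """Assemble a few-shot prompt from the retrieved examples followed by the seed (hypothetical template)."""
    lines = ["Write one new input-output pair in the style of the examples below.", ""]
    for ex in retrieved + [seed]:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Output: {ex['output']}")
        lines.append("")
    lines.append("Input:")
    return "\n".join(lines)


# Toy data, invented for illustration only.
seed = {"input": "What enzyme breaks down lactose?", "output": "Lactase"}
external_pool = [
    {"input": "What enzyme breaks down starch?", "output": "Amylase"},
    {"input": "Who wrote Hamlet?", "output": "William Shakespeare"},
    {"input": "Which organ produces insulin?", "output": "The pancreas"},
]

retrieved = retrieve(seed, external_pool, k=1)
prompt = build_prompt(seed, retrieved)
print(prompt)
```

In this sketch the prompt string would be sent to an LLM of the reader's choice, and the generated pairs would be added to the training set. A practical setup would likely swap the bag-of-words scorer for a dense retriever, which is again an assumption rather than a detail taken from this page.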

Taxonomy

Topics: Web Data Mining and Analysis · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems