Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges
Vinay Samuel, Houda Aynaou, Arijit Ghosh Chowdhury, Karthik Venkat Ramanan, Aman Chadha

TL;DR
This paper investigates GPT-4's ability to generate synthetic datasets for low-resource reading comprehension tasks, assessing its effectiveness as a cost-efficient alternative to human annotation and exploring associated opportunities and challenges.
Contribution
The authors present this as the first analysis of GPT-4 as a synthetic data augmenter for QA. They provide augmented datasets and compare both performance and cost against human annotation.
Findings
GPT-4 can effectively augment low-resource datasets.
Synthetic datasets improve model performance in some cases.
Using GPT-4 reduces annotation costs significantly.
Abstract
Large Language Models (LLMs) have demonstrated impressive zero-shot performance on a wide range of NLP tasks, showing the ability to reason and apply commonsense. A relevant application is to use them for creating high-quality synthetic datasets for downstream tasks. In this work, we probe whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating data annotation processes has the potential to save large amounts of the time, money and effort that go into manually labelling datasets. In this paper, we evaluate the performance of GPT-4 as a replacement for human annotators for low-resource reading comprehension tasks, by comparing performance after fine-tuning and the cost associated with annotation. This work serves as the first analysis of LLMs as synthetic data augmenters for QA systems, highlighting the unique opportunities and…
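Because the abstract concerns extractive reading comprehension, a generated answer is only usable if it appears verbatim in its passage. The following is a minimal sketch of that sanity check, not the authors' pipeline: the function name `filter_extractive` and the dictionary keys (`context`, `question`, `answer`, `answer_start`) are assumptions chosen for illustration, loosely following a SQuAD-style layout.

```python
from typing import Dict, List


def filter_extractive(examples: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Keep only synthetic QA pairs whose answer is a verbatim span of the context.

    Hypothetical helper: the field names are illustrative assumptions,
    not taken from the paper.
    """
    kept: List[Dict[str, object]] = []
    for ex in examples:
        answer = ex["answer"].strip()
        # An empty answer cannot be a valid span; otherwise search the passage.
        start = ex["context"].find(answer) if answer else -1
        if start == -1:
            continue  # answer is not extractive, so drop the pair
        # Record the character offset, which extractive QA training needs.
        kept.append({**ex, "answer": answer, "answer_start": start})
    return kept


if __name__ == "__main__":
    passage = "The Nile flows north into the Mediterranean Sea."
    synthetic = [
        {"context": passage, "question": "Where does the Nile end?",
         "answer": "the Mediterranean Sea"},
        {"context": passage, "question": "Where does the Nile end?",
         "answer": "the Atlantic Ocean"},
    ]
    for row in filter_extractive(synthetic):
        print(row["question"], "->", row["answer"], "@", row["answer_start"])
```

A check like this only confirms that an answer is a span of the passage; it says nothing about whether the question is sensible or the answer correct, which would still need model-based or human review.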
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Layer Normalization · Label Smoothing · Byte Pair Encoding · Dropout · Softmax
