Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions

John Joon Young Chung; Ece Kamar; Saleema Amershi

arXiv:2306.04140 · cs.CL · August 11, 2023 · 5 cites

Open Access

TL;DR

This paper investigates methods to enhance diversity in text data generated by large language models while maintaining accuracy through human interventions, aiming to improve training datasets for better model performance.

Contribution

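It examines two diversification techniques (logit suppression and temperature sampling) and two human interventions (label replacement and out-of-scope filtering), and shows that label replacement can substantially improve the accuracy of models trained on LLM-generated datasets.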
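As a rough illustration of how the two human interventions named in the abstract could be applied to generated data, the sketch below takes reviewer decisions and either replaces a label or drops an instance judged out of scope. The `Instance` type, the `apply_human_review` function, and the convention of using `None` to mark an out-of-scope instance are assumptions made for this example; they are not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    text: str
    label: str


def apply_human_review(instances, reviews):
    """Apply per-instance reviewer decisions to generated data.

    `reviews` (hypothetical format) maps an instance index to either a
    corrected label (label replacement) or None, meaning the reviewer
    judged the instance out of scope. Unreviewed instances are kept as-is.
    """
    cleaned = []
    for i, inst in enumerate(instances):
        if i not in reviews:
            cleaned.append(inst)  # no review recorded: keep unchanged
            continue
        decision = reviews[i]
        if decision is None:
            continue  # out-of-scope filtering: drop the instance
        # label replacement: keep the text, swap in the corrected label
        cleaned.append(Instance(inst.text, decision))
    return cleaned
```

The function builds a new list rather than mutating its input, so the original generated data stays available for comparison.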

Findings

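01. Diversification increases data variety but often reduces data accuracy.

02. Label replacement raises the absolute accuracy of models trained on diversified datasets by 14.4%.

03. Some models trained on data generated with label replacement outperform LLM-based few-shot classification.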

Abstract

Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which minimizes the generation of languages that have already been frequently generated, and 2) temperature sampling, which flattens the token sampling probability. We found that diversification approaches can increase data diversity but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions, 1) label replacement (LR), correcting misaligned labels, and 2) out-of-scope filtering (OOSF), removing instances that…
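The two diversification approaches described above can be sketched on a toy next-token distribution. This is a minimal illustration, assuming a simple count-proportional penalty for logit suppression; the function names, the `penalty` parameter, and the example logits are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_logits(logits, token_counts, penalty=1.0):
    """Lower each token's logit in proportion to how often it was already generated.

    `penalty` is a hypothetical strength parameter, not a value from the paper.
    """
    return [x - penalty * token_counts.get(i, 0) for i, x in enumerate(logits)]


def sample_token(logits, token_counts, temperature=1.0, penalty=1.0, rng=random):
    """Sample one token index after applying suppression, then temperature."""
    adjusted = suppress_logits(logits, token_counts, penalty)
    probs = softmax(adjusted, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy usage: repeatedly sample and record counts so frequent tokens get suppressed.
toy_logits = [4.0, 2.0, 1.0, 0.5]
counts = Counter()
rng = random.Random(0)
for _ in range(20):
    token = sample_token(toy_logits, counts, temperature=1.3, penalty=0.5, rng=rng)
    counts[token] += 1
```

Both knobs push probability mass toward less likely tokens, which is consistent with the trade-off the abstract reports: more varied text, but a higher chance of text or labels that do not fit the target domain.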

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification