Under the Surface: Tracking the Artifactuality of LLM-Generated Data
Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, and Dongyeop Kang

TL;DR
This paper investigates the quality and implications of various types of LLM-generated artificial data, revealing hidden disparities compared to human data, especially in complex tasks, and emphasizing ethical considerations in data creation.
Contribution
First comprehensive analysis aggregating diverse LLM-generated text data and evaluating its quality against human data across multiple benchmarks.
Findings
LLM-generated data can match human performance in some tasks
Significant disparities exist in complex tasks involving nuanced understanding
The paper highlights ethical concerns and biases in LLM-generated content
Abstract
This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks.…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
