The Evolution of LLM Adoption in Industry Data Curation Practices
Crystal Qian, Michael Xieyang Liu, Emily Reif, Grady Simon, Nada Hussein, Nathan Clement, James Wexler, Carrie J. Cai, Michael Terry, Minsuk Kahng

TL;DR
This paper examines how industry practitioners are increasingly adopting large language models (LLMs) in data curation, shifting workflows towards insights-first approaches and integrating LLM-generated datasets to handle complex unstructured data.
Contribution
The paper provides empirical insights into evolving LLM adoption strategies, usage scenarios, and the shift towards insights-first workflows in industry data curation practices.
Findings
Shift from heuristic to insights-first workflows supported by LLMs
Use of LLM-generated 'silver' and 'super golden' datasets alongside traditional datasets
Growing integration of LLMs in large-scale unstructured data analysis
Abstract
As large language models (LLMs) grow increasingly adept at processing unstructured text data, they offer new opportunities to enhance data curation workflows. This paper explores the evolution of LLM adoption among practitioners at a large technology company, evaluating the impact of LLMs in data curation tasks through participants' perceptions, integration strategies, and reported usage scenarios. Through a series of surveys, interviews, and user studies, we provide a timely snapshot of how organizations are navigating a pivotal moment in LLM evolution. In Q2 2023, we conducted a survey to assess LLM adoption in industry for development tasks (N=84), and facilitated expert interviews to assess evolving data needs (N=10) in Q3 2023. In Q2 2024, we explored practitioners' current and anticipated LLM usage through a user study involving two LLM-based prototypes (N=12). While each study…
Taxonomy
Topics: Research Data Management Practices
