Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance
Shiqiang Wang, Herbert Woisetschläger, Hans Arno Jacobsen, Mingyue Ji

TL;DR
This paper advocates for creating synthetic data probes with specific statistical properties to systematically study how data characteristics influence large language model performance and behavior.
Contribution
It introduces the concept of data probes generated from random processes as a systematic methodology to understand data's impact on LLMs.
Findings
Proposes using synthetic data sequences to reveal data characteristics affecting LLMs.
Suggests theoretical tools like typical sets to analyze LLM behavior.
Provides a pathway for foundational insights into data's role in LLMs.
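To make the idea of a data probe and the typical-set tool concrete, here is a minimal Python sketch. It assumes the simplest possible random process, an i.i.d. source over a small alphabet, which is an illustrative choice and not necessarily the process the authors have in mind. The function names (`sample_probe`, `entropy_bits`, `empirical_rate_bits`, `is_typical`) and the example distribution are hypothetical. The check uses the standard weak-typicality definition: a sequence is typical if its per-symbol negative log-probability is within `eps` of the source entropy.

```python
import math
import random

def sample_probe(probs, length, seed=0):
    """Draw an i.i.d. sequence of symbol indices (an assumed, minimal probe)."""
    rng = random.Random(seed)
    symbols = list(range(len(probs)))
    return rng.choices(symbols, weights=probs, k=length)

def entropy_bits(probs):
    """Shannon entropy H(X) of the source, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def empirical_rate_bits(seq, probs):
    """Per-symbol negative log-probability of the sequence under the source."""
    return -sum(math.log2(probs[s]) for s in seq) / len(seq)

def is_typical(seq, probs, eps):
    """Weak typicality: |-(1/n) log2 p(x^n) - H(X)| <= eps."""
    return abs(empirical_rate_bits(seq, probs) - entropy_bits(probs)) <= eps

# Hypothetical four-symbol source chosen so the entropy is easy to verify.
probs = [0.5, 0.25, 0.125, 0.125]
probe = sample_probe(probs, length=10_000, seed=0)

print(entropy_bits(probs))                  # 1.75
print(empirical_rate_bits(probe, probs))    # close to the entropy for long probes
print(is_typical(probe, probs, eps=0.05))
```

A long sampled probe falls in the typical set with high probability, while a degenerate sequence such as a single repeated symbol does not. Under these assumptions, that contrast is one way a probe's statistical properties could be controlled before it is used in an LLM workflow.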
Abstract
Data is fundamental to large language models (LLMs). However, what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute-intensive and lack a principled way of understanding how specific data characteristics drive LLM behavior. In this position paper, we advocate for developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as…
