Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

Shiqiang Wang; Herbert Woisetschläger; Hans Arno Jacobsen; Mingyue Ji

arXiv:2605.18801 · cs.AI · May 20, 2026

TL;DR

This paper advocates for creating synthetic data probes with specific statistical properties to systematically study how data characteristics influence large language model performance and behavior.

Contribution

It introduces the concept of data probes generated from random processes as a systematic methodology to understand data's impact on LLMs.

Findings

01. Proposes using synthetic data sequences to reveal data characteristics affecting LLMs.

02. Suggests theoretical tools like typical sets to analyze LLM behavior.

03. Provides a pathway for foundational insights into data's role in LLMs.
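
To make the idea of a synthetic sequence with known statistical properties concrete, the following is a minimal sketch, not the authors' method. It assumes the simplest possible random process, an i.i.d. source over a small alphabet, and uses the standard weak-typicality criterion from information theory: a sequence is treated as typical when its per-symbol negative log-probability lies within `eps` of the source entropy. The function names (`generate_probe`, `is_typical`) and all parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def generate_probe(probs, length, rng):
    """Draw an i.i.d. sequence of symbol indices from the distribution `probs`.

    Assumption: the probe comes from a memoryless source; the paper's
    random processes may be far richer than this.
    """
    return rng.choices(range(len(probs)), weights=probs, k=length)


def entropy_bits(probs):
    """Shannon entropy of the source, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def empirical_rate_bits(seq, probs):
    """Per-symbol negative log2-probability of `seq` under the i.i.d. source."""
    return -sum(math.log2(probs[s]) for s in seq) / len(seq)


def is_typical(seq, probs, eps):
    """Weak typicality: the empirical rate is within `eps` of the entropy."""
    return abs(empirical_rate_bits(seq, probs) - entropy_bits(probs)) <= eps


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so runs are repeatable
    probs = [0.7, 0.2, 0.1]         # hypothetical three-symbol source
    probe = generate_probe(probs, 1000, rng)
    print("entropy (bits/symbol):", entropy_bits(probs))
    print("empirical rate (bits/symbol):", empirical_rate_bits(probe, probs))
    print("typical at eps=0.1:", is_typical(probe, probs, 0.1))
```

Because the generating distribution is known exactly, properties such as entropy can be varied in a controlled way before the sequences are fed into any stage of an LLM workflow, which is the kind of controlled study the page describes.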

Abstract

Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as…
