Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling

Tanya Rodchenko; Natasha Noy; Nino Scherrer

arXiv:2501.13779·cs.LG·June 4, 2025

TL;DR

This paper argues for intentional data acquisition in AI: understanding the structure of the data and the types of tasks involved can indicate where data scaling is likely to pay off, and can inform future compute paradigms for tasks where it is not.

Contribution

It introduces the idea that data quality and structure, not just quantity, should guide data scaling strategies for AI development.

Findings

01. Data structure influences task scalability

02. Not all tasks benefit equally from data scaling

03. Guidelines for targeted data acquisition

Abstract

While Large Language Models require more and more data to train and scale, rather than looking for any data to acquire, we should consider what types of tasks are more likely to benefit from data scaling. We should be intentional in our data acquisition. We argue that the shape of the data itself, such as its compositional and structural patterns, informs which tasks to prioritize in data scaling, and shapes the development of the next generation of compute paradigms for tasks where data scaling is inefficient, or even insufficient.

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)