Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Chenkai Pan; Xinglong Xu; Yuhang Xu; Yujun Wu; Siyuan Li; Jintao Chen; Conghui He; Jingxuan Wei; Cheng Tan

arXiv:2604.24819·cs.SE·April 29, 2026

TL;DR

This paper introduces a structured, feedback-driven approach to data engineering for large language models. By mapping the data-engineering lifecycle onto the software development lifecycle, it enables targeted data repairs and systematic improvement in place of indiscriminate data addition.

Contribution

The paper formalizes the Programming with Data principle, which treats the data-engineering lifecycle as an analogue of the software development lifecycle, and demonstrates its effectiveness across multiple scientific disciplines, with openly available resources.

Findings

01  Model failures can be traced to specific data deficiencies.

02  Targeted data patches improve model performance across scales.

03  Structured knowledge enables systematic data-driven model repair.

Abstract

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes…
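The feedback loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `KnowledgeUnit`, `Example`, `derive_examples`, `train`, `failing_units`, and `patch` are hypothetical and do not come from the paper or its repository, and the "model" is a lookup table standing in for a fine-tuned LLM. The only point shown is that when training items and test items are derived from the same knowledge units, a failed test identifies which unit the training data is missing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class KnowledgeUnit:
    """One fact extracted from the corpus (hypothetical schema)."""
    uid: str
    question: str
    answer: str


@dataclass(frozen=True)
class Example:
    """A training or test item that remembers which unit produced it."""
    unit_id: str
    prompt: str
    target: str


def derive_examples(units: List[KnowledgeUnit]) -> List[Example]:
    # Training and test items share one origin, so failures stay traceable.
    # A real pipeline would phrase the two differently; they are identical
    # here only to keep the sketch short.
    return [Example(u.uid, u.question, u.answer) for u in units]


def train(data: List[Example]) -> Callable[[str], str]:
    # Toy stand-in for fine-tuning ("compilation"): memorize prompt -> target.
    table: Dict[str, str] = {ex.prompt: ex.target for ex in data}
    return lambda prompt: table.get(prompt, "")


def failing_units(model: Callable[[str], str], tests: List[Example]) -> List[str]:
    # "Unit testing": report the knowledge units whose tests fail.
    return sorted({t.unit_id for t in tests if model(t.prompt) != t.target})


def patch(data: List[Example], units: List[KnowledgeUnit],
          failed_ids: List[str]) -> List[Example]:
    # Targeted repair: add examples only for the units that failed.
    by_id = {u.uid: u for u in units}
    return data + derive_examples([by_id[i] for i in failed_ids])


units = [
    KnowledgeUnit("k1", "Chemical symbol for sodium?", "Na"),
    KnowledgeUnit("k2", "SI unit of force?", "newton"),
    KnowledgeUnit("k3", "Boiling point of water at 1 atm, in Celsius?", "100"),
]
tests = derive_examples(units)
data = derive_examples(units[:2])           # training data is missing k3

failed = failing_units(train(data), tests)  # diagnose
data = patch(data, units, failed)           # repair only what failed
remaining = failing_units(train(data), tests)

print(failed, remaining)  # ['k3'] []
```

The design choice worth noting is the `unit_id` carried by every item: it is what turns a benchmark failure into a pointer at a specific gap in the training data rather than a bare score.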

Code & Models

Repositories

openraiser/ProDa (GitHub)
