TL;DR
This paper introduces a structured, feedback-driven approach to data engineering for large language models. By mapping the data-engineering lifecycle onto the software development lifecycle, it enables targeted data repairs in place of indiscriminate data addition.
Contribution
It formalizes the Programming with Data principle, which links training-data management to software engineering, and demonstrates the principle's effectiveness across multiple scientific disciplines with open resources.
Findings
Model failures can be traced to specific data deficiencies.
Targeted data patches improve model performance across scales.
Structured knowledge enables systematic data-driven model repair.
Abstract
Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes…
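The feedback loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the names `TrainingExample`, `TestItem`, `concept_id`, `trace_failures`, and `plan_patch` are hypothetical, and the rule "add examples until each failing concept has at least `min_examples`" is an assumption made only for illustration. The sketch shows just the core idea that training data and test items share keys from one structured knowledge representation, so a failed test can be traced back to a specific gap in the data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    concept_id: str  # hypothetical key into the shared knowledge representation
    text: str


@dataclass(frozen=True)
class TestItem:
    concept_id: str  # same key space as the training data
    question: str
    expected: str


def trace_failures(test_items, answer_fn):
    """Run the benchmark ("unit tests") and count failures per concept."""
    failures = Counter()
    for item in test_items:
        if answer_fn(item.question) != item.expected:
            failures[item.concept_id] += 1
    return dict(failures)


def plan_patch(failures, training_data, min_examples=2):
    """For each failing concept, report how many training examples are
    missing relative to an assumed minimum (illustrative rule only)."""
    coverage = Counter(ex.concept_id for ex in training_data)
    return {
        concept: min_examples - coverage[concept]
        for concept in failures
        if coverage[concept] < min_examples
    }


# Toy data: three concepts, unevenly covered by the training set.
training_data = [
    TrainingExample("boiling_point", "Water boils at 100 C at 1 atm."),
    TrainingExample("boiling_point", "At sea level water boils at 100 C."),
    TrainingExample("freezing_point", "Water freezes at 0 C at 1 atm."),
]
test_items = [
    TestItem("boiling_point", "Q1", "100 C"),
    TestItem("freezing_point", "Q2", "0 C"),
    TestItem("triple_point", "Q3", "0.01 C"),
]

# Stand-in for a trained model: it only answers Q1 correctly.
known_answers = {"Q1": "100 C"}
failures = trace_failures(test_items, lambda q: known_answers.get(q, ""))
patch_plan = plan_patch(failures, training_data)
print(failures)
print(patch_plan)
```

In this toy run the failing concepts are exactly those with thin or missing coverage, so the patch plan names where new training data should go instead of calling for more data across the board.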
