Automatic End-to-End Data Integration using Large Language Models
Aaron Steiner, Christian Bizer

TL;DR
This paper reports that a pipeline driven by GPT-5.2 can automate end-to-end data integration, producing results comparable to or better than those of human-designed pipelines at a fraction of the cost.
Contribution
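The paper introduces an end-to-end data integration approach in which an LLM generates all artifacts needed to adapt the pipeline to a specific use case: schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. This reduces the manual configuration and labeling effort otherwise required of data engineers.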
Findings
LLM-based pipelines achieve results similar to or better than those of human-designed pipelines.
Automating data integration with LLMs costs around $10 per case study.
End-to-end LLM pipelines produce integrated datasets comparable in size and density to those produced by human-designed pipelines.
Abstract
Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines along three case studies requiring the integration of…
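The artifact types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`SCHEMA_MAPPING`, `VALUE_MAPPING`, `apply_mappings`, `select_heuristic`), the sample attributes, and the two fusion heuristics are hypothetical. The artifacts are hard-coded where the paper would have an LLM generate them, and entity matching is left out for brevity. The sketch shows how a schema mapping and a value mapping could be applied to a source record, and how validation data could be used to choose a conflict resolution heuristic.

```python
from collections import Counter

# Hypothetical artifacts of the kinds listed in the abstract. In the paper
# they are generated by an LLM; here they are hard-coded for illustration.
SCHEMA_MAPPING = {"company_name": "name", "title": "name", "nation": "country"}
VALUE_MAPPING = {"country": {"USA": "United States", "U.S.": "United States"}}


def apply_mappings(record):
    """Rename source attributes to the target schema, then normalize values."""
    out = {}
    for attr, value in record.items():
        target = SCHEMA_MAPPING.get(attr)
        if target is None:
            continue  # attribute has no counterpart in the target schema
        out[target] = VALUE_MAPPING.get(target, {}).get(value, value)
    return out


# Two illustrative conflict resolution heuristics for data fusion.
def most_frequent(values):
    return Counter(values).most_common(1)[0][0]


def longest(values):
    return max(values, key=len)


HEURISTICS = {"most_frequent": most_frequent, "longest": longest}


def select_heuristic(clusters, validation):
    """Return the name of the heuristic that agrees most often with validation data.

    clusters maps an entity id to the conflicting values found for one attribute;
    validation maps the same entity ids to the value assumed to be correct.
    """
    def accuracy(fuse):
        hits = sum(fuse(values) == validation[key] for key, values in clusters.items())
        return hits / len(clusters)

    return max(HEURISTICS, key=lambda name: accuracy(HEURISTICS[name]))


if __name__ == "__main__":
    print(apply_mappings({"company_name": "Acme", "nation": "USA", "fax": "123"}))
    clusters = {
        "e1": ["Acme", "Acme", "Acme Corporation"],
        "e2": ["Globex", "Globex", "Globex Inc"],
    }
    validation = {"e1": "Acme", "e2": "Globex"}
    print(select_heuristic(clusters, validation))
```

In this sketch the validation data plays the same role as a labeled hold-out set: each candidate heuristic is scored against it and the best-scoring one is used for fusion.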
Taxonomy
Topics: Data Quality and Management · Natural Language Processing Techniques · Topic Modeling
