Automatic End-to-End Data Integration using Large Language Models

Aaron Steiner; Christian Bizer

arXiv:2603.10547 · cs.CL · March 12, 2026

Open Access

TL;DR

This paper shows that GPT-5.2 can generate all artifacts needed to configure a data integration pipeline end to end, producing results comparable to or better than those of human-designed pipelines at a fraction of the cost.

Contribution

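The paper introduces an end-to-end data integration approach in which an LLM generates every artifact needed to adapt the pipeline to a use case: schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. This replaces the manual configuration and labeling otherwise done by data engineers.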

Findings

01 LLM-based pipelines achieve results similar to or better than those of human-designed pipelines.

02 Automating data integration with LLMs costs around $10 per case study.

03 End-to-end, the LLM pipelines produce integrated datasets comparable in size and density to those of the human-designed pipelines.
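
The size and density comparison in finding 03 can be made concrete with a small sketch. The page does not define either term, so this assumes that size is the number of integrated records and density is the share of attribute values that are not missing. The function name and the sample records are hypothetical.

```python
from typing import Any, Optional


def size_and_density(
    records: list[dict[str, Optional[Any]]], attributes: list[str]
) -> tuple[int, float]:
    """Return (record count, share of non-missing attribute values).

    Assumption: a value counts as missing if it is absent, None, or "".
    """
    total_cells = len(records) * len(attributes)
    if total_cells == 0:
        return len(records), 0.0
    filled = sum(
        1
        for record in records
        for attr in attributes
        if record.get(attr) not in (None, "")
    )
    return len(records), filled / total_cells


# Hypothetical integrated dataset: 2 records, 3 attributes, 1 missing value.
integrated = [
    {"name": "Acme", "country": "United States", "founded": 1990},
    {"name": "Globex", "country": None, "founded": 2004},
]
size, density = size_and_density(integrated, ["name", "country", "founded"])
```

Computing the same two numbers for the output of a human-designed pipeline and of an LLM-configured pipeline gives a simple basis for the kind of comparison the finding describes.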

Abstract

Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines along three case studies requiring the integration of…
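
To illustrate how three of the artifacts named in the abstract might plug into a pipeline, here is a minimal sketch. It is not the authors' implementation: the artifacts are hard-coded where the paper would have the LLM generate them, and all names, attributes, heuristics, and sample values are hypothetical. Entity matching training data is left out to keep the example short.

```python
from collections import Counter

# Hypothetical artifacts; in the paper's setting an LLM would generate these.
schema_mapping = {"label": "name", "nation": "country"}  # source attr -> target attr
value_mapping = {"country": {"USA": "United States", "U.S.": "United States"}}


def apply_mappings(record, schema_mapping, value_mapping):
    """Rename attributes to the target schema, then normalize their values."""
    out = {}
    for attr, value in record.items():
        target = schema_mapping.get(attr, attr)
        out[target] = value_mapping.get(target, {}).get(value, value)
    return out


# Two simple conflict resolution heuristics for data fusion (illustrative only).
def most_frequent(values):
    return Counter(values).most_common(1)[0][0]


def longest(values):
    return max(values, key=len)


def select_heuristic(clusters, validation, heuristics):
    """Pick the heuristic whose fused values agree most often with validation data."""
    def accuracy(fuse):
        hits = sum(fuse(clusters[eid]) == truth for eid, truth in validation.items())
        return hits / len(validation)

    return max(heuristics, key=lambda name: accuracy(heuristics[name]))


# Conflicting values per matched entity, plus validation data for one attribute.
clusters = {
    "e1": ["United States", "United States", "US of A"],
    "e2": ["Germany", "Germany", "Federal Republic of Germany"],
}
validation = {"e1": "United States", "e2": "Germany"}
heuristics = {"most_frequent": most_frequent, "longest": longest}

normalized = apply_mappings({"label": "Acme", "nation": "USA"}, schema_mapping, value_mapping)
best = select_heuristic(clusters, validation, heuristics)
```

The point of the sketch is the division of labor: the pipeline code stays fixed, and only the mappings and the validation data change from one use case to the next.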

Taxonomy

Topics: Data Quality and Management · Natural Language Processing Techniques · Topic Modeling