Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu

TL;DR
This paper introduces a scalable framework for extracting detailed schema lineage from multilingual enterprise scripts, evaluates it with a new benchmark, and demonstrates that larger language models significantly improve extraction accuracy.
Contribution
The paper presents an automated method for schema lineage extraction, a composite evaluation metric (SLiCE), and a benchmark dataset, advancing the use of language models in enterprise data governance.
Findings
Extraction performance improves with model size and with more advanced prompting techniques.
A 32B open-source model can match GPT-series performance with minimal prompting.
The framework enhances data reproducibility and governance in complex pipelines.
Abstract
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A…
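As a rough illustration of the four-component representation the abstract describes (source schemas, source tables, transformation logic, and aggregation operations), a lineage record could be sketched as below. This is a minimal sketch under stated assumptions, not the paper's actual format: the class name `SchemaLineage`, all field names, the added `target_column` field, and the example values are hypothetical.

```python
from dataclasses import dataclass, asdict, field
from typing import Dict, List


@dataclass
class SchemaLineage:
    """Hypothetical lineage record for one output column.

    The four lineage fields mirror the components named in the abstract;
    their names and types are assumptions made for illustration.
    """
    target_column: str                                        # assumed: the column being traced
    source_schemas: List[str] = field(default_factory=list)   # upstream columns
    source_tables: List[str] = field(default_factory=list)    # upstream tables
    transformation: str = ""                                  # transformation logic, as text
    aggregation: str = ""                                     # aggregation operation, as text


def lineage_to_dict(lineage: SchemaLineage) -> Dict[str, object]:
    """Serialize a lineage record to a plain dict, e.g. for JSON output."""
    return asdict(lineage)


# Invented example: a revenue column derived from an orders table.
example = SchemaLineage(
    target_column="total_revenue",
    source_schemas=["price", "quantity"],
    source_tables=["sales.orders"],
    transformation="price * quantity",
    aggregation="SUM(price * quantity) GROUP BY region",
)
```

Keeping the four components as separate fields would let an evaluator compare each one against a reference lineage independently, which is one plausible reading of a "standardized representation".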
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Software Engineering Research
