Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

Jiaqi Yin; Yi-Wei Chen; Meng-Lung Lee; Xiya Liu

arXiv:2508.07179 · cs.CL · August 12, 2025

Open Access

TL;DR

This paper introduces a scalable framework for extracting detailed schema lineage from multilingual enterprise scripts, evaluates it with a new benchmark, and demonstrates that larger language models significantly improve extraction accuracy.

Contribution

The paper presents an automated method for schema lineage extraction, a composite evaluation metric (SLiCE), and a benchmark dataset, advancing the use of language models in enterprise data governance.

Findings

01

Performance improves with larger models and advanced prompting techniques.

02

A 32B open-source model can match GPT-series performance with minimal prompting.

03

The framework enhances data reproducibility and governance in complex pipelines.

Abstract

Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A…
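The four components named in the abstract can be pictured as fields of a single lineage record. The following is a minimal sketch only, not the paper's actual representation: the class name `SchemaLineage`, the `target_column` field, the reading of "source schemas" as source column names, and the example SQL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class SchemaLineage:
    """Hypothetical lineage record for one output column.

    Only the four component names come from the abstract; the field
    types and the extra target_column field are assumptions.
    """
    target_column: str                                        # assumed: the column being traced
    source_schemas: List[str] = field(default_factory=list)   # assumed to mean source columns
    source_tables: List[str] = field(default_factory=list)
    transformation: str = ""                                  # transformation logic as text
    aggregation: str = ""                                     # aggregation operation, if any


# Illustrative script fragment (invented for this sketch):
#   SELECT region, SUM(amount) AS total_sales
#   FROM sales.orders GROUP BY region
lineage = SchemaLineage(
    target_column="total_sales",
    source_schemas=["amount"],
    source_tables=["sales.orders"],
    transformation="SUM(amount)",
    aggregation="GROUP BY region",
)

# A plain dict is convenient for comparing a predicted record with a reference one.
record = asdict(lineage)
print(record)
```

Keeping each component in its own field would let an evaluation compare predicted and reference records component by component, which is the kind of structural check the abstract attributes to SLiCE; how SLiCE actually scores each part is not stated in the text above.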

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Software Engineering Research