LINEAGEX: A Column Lineage Extraction System for SQL
Shi Heng Zhang, Zhengjie Miao, Jiannan Wang

TL;DR
LINEAGEX is a lightweight Python tool that extracts and visualizes column-level data lineage from SQL queries, improving accuracy and coverage for data governance tasks.
Contribution
It introduces a novel, efficient approach to infer column lineage directly from SQL parse trees, overcoming limitations of existing systems.
Findings
High coverage and accuracy in lineage extraction
Effective handling of query ambiguities
Open source implementation available
Abstract
As enterprise data grows in size and complexity, column-level data lineage, which records the creation, transformation, and reference of each column in the warehouse, has become key to effective data governance, supporting tasks like data quality monitoring, storage refactoring, and workflow migration. Unfortunately, existing systems either introduce overhead by integrating with query execution or fail to achieve satisfactory accuracy for column lineage. In this paper, we demonstrate LINEAGEX, a lightweight Python library that infers column-level lineage from SQL queries and visualizes it through an interactive interface. LINEAGEX achieves high coverage and accuracy for column lineage extraction by intelligently traversing query parse trees and handling ambiguities. The demonstration walks through use cases of building lineage graphs and troubleshooting data quality issues. LINEAGEX is open…
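To make the idea of column-level lineage concrete, the following is a minimal sketch of resolving output columns to base-table columns by walking a query tree. It is not LINEAGEX's algorithm or API: the `Column`, `Func`, and `Select` classes, the `lineage` function, and the example schema are all hypothetical, and the tree is built by hand rather than produced by a real SQL parser. Unqualified column names are resolved against the columns of each source, and a name found in more than one source is reported as ambiguous.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple, Union


@dataclass
class Column:
    name: str
    table: Optional[str] = None  # None means the reference is unqualified


@dataclass
class Func:
    name: str
    args: List["Expr"]


Expr = Union[Column, Func]


@dataclass
class Select:
    outputs: List[Tuple[str, Expr]]           # (output column name, expression)
    sources: Dict[str, Union[str, "Select"]]  # alias -> base table name or subquery


def lineage(query: Select, schema: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each output column of `query` to the base 'table.column' names it reads."""
    # For every source alias, compute what each of its columns ultimately derives from.
    scopes: Dict[str, Dict[str, Set[str]]] = {}
    for alias, src in query.sources.items():
        if isinstance(src, str):  # base table: each column derives from itself
            scopes[alias] = {c: {f"{src}.{c}"} for c in schema[src]}
        else:                     # subquery: recurse into it
            scopes[alias] = lineage(src, schema)

    def resolve(expr: Expr) -> Set[str]:
        if isinstance(expr, Func):
            found: Set[str] = set()
            for arg in expr.args:
                found |= resolve(arg)
            return found
        if expr.table is not None:
            return set(scopes[expr.table][expr.name])
        # Unqualified reference: exactly one source must expose this column name.
        owners = [a for a, cols in scopes.items() if expr.name in cols]
        if len(owners) != 1:
            raise ValueError(f"cannot resolve column {expr.name!r}: candidates {owners}")
        return set(scopes[owners[0]][expr.name])

    return {name: resolve(expr) for name, expr in query.outputs}


# Hypothetical schema and a hand-built tree for:
#   SELECT customer, total * 2 AS double_total
#   FROM (SELECT c.name AS customer, SUM(amount) AS total
#         FROM orders o JOIN customers c ON o.customer_id = c.id
#         GROUP BY c.name) t
schema = {"orders": ["id", "customer_id", "amount"], "customers": ["id", "name"]}
inner = Select(
    outputs=[("customer", Column("name", "c")), ("total", Func("SUM", [Column("amount")]))],
    sources={"o": "orders", "c": "customers"},
)
outer = Select(
    outputs=[("customer", Column("customer")), ("double_total", Func("*", [Column("total", "t")]))],
    sources={"t": inner},
)
result = lineage(outer, schema)
```

Here `result` maps `customer` to `customers.name` and `double_total` to `orders.amount`. The sketch tracks only columns that appear in the select list; columns used in join or filter conditions, `SELECT *`, and views are left out for brevity.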
Taxonomy
Topics: Advanced Database Systems and Queries · Data Quality and Management · SAS software applications and methods
