Efficient Row-Level Lineage Leveraging Predicate Pushdown
Yin Lin, Cong Yan

TL;DR
PredTrace is a novel lineage inference method that leverages predicate pushdown to efficiently and accurately determine row-level lineage across diverse data pipelines, outperforming previous lazy approaches in speed and coverage.
Contribution
It introduces PredTrace, a new approach that combines the advantages of eager and lazy lineage tracking using predicate pushdown for better coverage, efficiency, and adaptability.
Findings
Achieves higher coverage than prior lazy approaches on TPC-H and real-world pipelines.
Infers lineage in seconds, up to 10x faster than prior methods.
Provides precise or approximate lineage depending on intermediate result availability.
Abstract
Row-level lineage explains which input rows produce an output row through a data processing pipeline, and it has many applications, such as data debugging, auditing, and data integration. Prior work on lineage falls into two lines: eager lineage tracking and lazy lineage inference. Eager tracking integrates lineage tracing tightly into the operator implementation, enabling efficient customized tracking. However, this approach is intrusive, system-specific, and lacks adaptability. In contrast, lazy inference generates additional queries to compute lineage; it can be easily applied to any database, but the lineage query is usually slow. Furthermore, both approaches have limited coverage of the types of data processing pipelines supported, because they rely on operator-specific tracking or inference rules. In this work, we propose PredTrace, a lineage inference approach that achieves easy adaptation, low runtime…
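The general idea of using predicate pushdown for lineage can be illustrated with a toy example. The sketch below is not PredTrace's algorithm, which the abstract above does not spell out. It only shows the basic intuition under simplifying assumptions: an output row is turned into a row-selection predicate, that predicate is pushed down through a filter and a group-by, and the resulting predicate is evaluated on the input table to obtain the lineage. The table `orders`, the pipeline, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input table, used only for illustration.
orders = [
    {"id": 1, "region": "EU", "amount": 10},
    {"id": 2, "region": "EU", "amount": 3},
    {"id": 3, "region": "US", "amount": 7},
    {"id": 4, "region": "EU", "amount": 20},
]


def pipeline(rows):
    """Toy pipeline: keep rows with amount > 5, then sum amount per region."""
    totals = defaultdict(int)
    for r in rows:
        if r["amount"] > 5:
            totals[r["region"]] += r["amount"]
    return [{"region": k, "total": v} for k, v in totals.items()]


def pushed_down_predicate(output_row):
    """Build the predicate on the input table for one output row.

    Start from the row-selection predicate on the output
    (region == output_row["region"]) and push it down:
      * through the group-by: a predicate on the grouping key
        applies unchanged to the rows that formed the group;
      * through the filter: conjoin it with the filter's own predicate.
    """
    region = output_row["region"]
    return lambda r: r["region"] == region and r["amount"] > 5


def lineage(output_row, rows):
    """Evaluate the pushed-down predicate on the input rows."""
    pred = pushed_down_predicate(output_row)
    return [r for r in rows if pred(r)]


output = pipeline(orders)
eu_row = next(r for r in output if r["region"] == "EU")
eu_lineage = lineage(eu_row, orders)
print(eu_row, [r["id"] for r in eu_lineage])
```

In this toy case the lineage is computed with a single filter over the input table and requires no changes to the pipeline's operators, which is the kind of non-intrusive, query-based behavior the abstract attributes to lazy inference.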
Taxonomy
Topics: VLSI and FPGA Design Techniques · Video Coding and Compression Technologies
