Towards Next Generation Data Engineering Pipelines
Kevin M. Kramer, Valerie Restat, Sebastian Strasser, Uta Störl, Meike Klettke

TL;DR
This paper envisions and proposes a framework for next-generation data engineering pipelines that are optimized, self-aware, and self-adapting to improve data quality and robustness.
Contribution
It introduces a conceptual framework and approaches for developing data pipelines that are optimized, self-aware, and capable of automatic adaptation.
Findings
Proposes three levels of advanced data pipelines: optimized, self-aware, and self-adapting.
Suggests methods for continuous monitoring and automatic reaction to data changes.
Lays out a roadmap for future research in resilient data engineering pipelines.
Abstract
Data engineering pipelines are a widespread way to provide high-quality data for all kinds of data science applications. However, numerous challenges remain in the composition and operation of such pipelines. Data engineering pipelines do not always deliver high-quality data. By default, they also do not react to changes. When incoming data deviates from prior data, the pipeline could crash or output undesired results. We therefore envision three levels of next-generation data engineering pipelines: optimized data pipelines, self-aware data pipelines, and self-adapting data pipelines. Pipeline optimization addresses the composition of operators and their parametrization in order to achieve the highest possible data quality. Self-aware data engineering pipelines enable continuous monitoring of their current state, notifying data engineers of significant changes.…
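To make the idea of a self-aware pipeline more concrete, the following is a minimal sketch, not the authors' method. It assumes that "monitoring the current state" can be approximated by profiling one numeric column per batch and comparing it against a baseline profile. The names `Profile`, `profile`, and `detect_changes`, as well as the two thresholds, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Sequence


@dataclass
class Profile:
    """Summary statistics of one numeric column in a batch (illustrative)."""
    null_rate: float
    mean: float
    std: float


def profile(values: Sequence[Optional[float]]) -> Profile:
    """Compute the share of missing values plus mean and spread of the rest."""
    present = [v for v in values if v is not None]
    null_rate = 1 - len(present) / len(values) if values else 0.0
    if not present:
        return Profile(null_rate, 0.0, 0.0)
    return Profile(null_rate, mean(present), pstdev(present))


def detect_changes(
    baseline: Profile,
    current: Profile,
    max_null_increase: float = 0.1,      # assumed threshold
    max_mean_shift_in_std: float = 3.0,  # assumed threshold
) -> List[str]:
    """Return the names of checks on which the new batch deviates."""
    alerts = []
    if current.null_rate - baseline.null_rate > max_null_increase:
        alerts.append("null_rate")
    # Fall back to a unit scale when the baseline has no spread.
    scale = baseline.std if baseline.std > 0 else 1.0
    if abs(current.mean - baseline.mean) / scale > max_mean_shift_in_std:
        alerts.append("mean_shift")
    return alerts


# Usage: compare a new batch against the batch the pipeline was built on.
baseline = profile([10.0, 11.0, 9.0, 10.0])
incoming = profile([10.0, None, None, 30.0])
alerts = detect_changes(baseline, incoming)
if alerts:
    print("Notify data engineer, deviating checks:", alerts)
```

In this sketch a non-empty result would trigger a notification to a data engineer. A self-adapting pipeline would presumably go one step further and react to the deviation automatically, but how it does so is not specified in the text above.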
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Database Systems and Queries · Machine Learning and Data Classification
