Data Pipeline Quality: Influencing Factors, Root Causes of Data-related Issues, and Processing Problem Areas for Developers
Harald Foidl, Valentina Golendukhina, Rudolf Ramler, Michael Felderer

TL;DR
This paper presents a comprehensive taxonomy of 41 factors affecting data pipeline quality, investigates root causes of data issues, and identifies key problem areas for developers, aiding quality assurance and future research.
Contribution
It introduces a validated taxonomy of influencing factors, analyzes root causes of data issues, and highlights main developer concerns in data pipeline processing.
Findings
Data issues are mainly caused by incorrect data types (33%)
The largest share of developer questions (47%) relates to data integration and ingestion
Compatibility issues are a distinct problem area in data pipelines
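To make the first finding concrete, here is a minimal sketch of a type check a pipeline could run at ingestion. It is not taken from the paper: the schema `EXPECTED_TYPES`, its column names, and the helper `find_type_issues` are hypothetical and assumed purely for illustration.

```python
from typing import Any

# Hypothetical schema (an assumption, not from the paper): column -> expected Python type.
EXPECTED_TYPES: dict[str, type] = {"order_id": int, "amount": float, "country": str}


def find_type_issues(record: dict[str, Any], expected: dict[str, type]) -> list[str]:
    """Return one human-readable message per missing or wrongly typed column."""
    issues: list[str] = []
    for column, expected_type in expected.items():
        if column not in record:
            issues.append(f"{column}: missing")
            continue
        value = record[column]
        # bool is a subclass of int in Python, so flag it explicitly
        # unless the column is actually declared as bool.
        is_bool_mismatch = isinstance(value, bool) and expected_type is not bool
        if is_bool_mismatch or not isinstance(value, expected_type):
            issues.append(
                f"{column}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return issues
```

Calling `find_type_issues({"order_id": "17", "amount": 9.5, "country": "AT"}, EXPECTED_TYPES)` reports the string-typed `order_id`. A check like this, placed at the ingestion boundary, surfaces a mismatch before it propagates to later pipeline stages.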
Abstract
Data pipelines are an integral part of various modern data-driven systems. However, despite their importance, they are often unreliable and deliver poor-quality data. A critical step toward improving this situation is a solid understanding of the aspects contributing to the quality of data pipelines. Therefore, this article first introduces a taxonomy of 41 factors that influence the ability of data pipelines to provide quality data. The taxonomy is based on a multivocal literature review and validated by eight interviews with experts from the data engineering domain. Data, infrastructure, life cycle management, development & deployment, and processing were found to be the main influencing themes. Second, we investigate the root causes of data-related issues, their location in data pipelines, and the main topics of data pipeline processing issues for developers by mining GitHub projects…
Taxonomy
Topics: Data Quality and Management · Big Data and Business Intelligence · Data Mining Algorithms and Applications
