A Multi-Layer Testing Framework for Automated Data Quality Assurance in Cloud-Native ELT Pipelines
Ismail Gargouri, Hassan Reza

TL;DR
This paper introduces a comprehensive multi-layer testing framework for enhancing data quality assurance in cloud-native ELT pipelines, combining orchestration validation, LLM-generated semantic tests, and cross-store checks.
Contribution
It presents a novel, integrated testing approach leveraging LLMs and cross-store validation to improve anomaly detection and data quality in complex ELT workflows.
Findings
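LLM-augmented tests detected all 16 injected anomalies, compared with 7 of 16 for the manual-only baseline; a manually expanded comparator also detected all 16.
Cross-store validation confirmed exact agreement between DuckDB and Snowflake across all three curated tables.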
The workflow completed in approximately 107 seconds, demonstrating operational efficiency.
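As an illustration of what a cross-store consistency check can look like, the sketch below compares two result sets by row count and an order-independent digest. It is not the authors' implementation: the function names `table_fingerprint` and `stores_agree` are hypothetical, and it assumes rows from each store have already been fetched and normalized to comparable Python types (for example, the same numeric and timestamp representations).

```python
import hashlib
from typing import Iterable, Sequence, Tuple


def table_fingerprint(rows: Iterable[Sequence[object]]) -> Tuple[int, str]:
    """Return (row_count, digest) for a table, independent of row order.

    Hypothetical helper: assumes values are already normalized so that
    equal values have equal repr() output in both stores.
    """
    # Hash each row on its own, then sort the hashes so that the final
    # digest does not depend on the order in which rows were returned.
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    digest = hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
    return len(row_hashes), digest


def stores_agree(rows_a: Iterable[Sequence[object]],
                 rows_b: Iterable[Sequence[object]]) -> bool:
    """True when both row sets have the same count and the same digest."""
    return table_fingerprint(rows_a) == table_fingerprint(rows_b)


# Example with in-memory rows standing in for query results from two stores.
local_rows = [(1, "alice", 10.0), (2, "bob", 20.5)]
warehouse_rows = [(2, "bob", 20.5), (1, "alice", 10.0)]  # same data, other order
print(stores_agree(local_rows, warehouse_rows))
```

Sorting per-row hashes keeps duplicate rows significant, which a set-based comparison would hide. In practice the rows would come from queries against each backend, and driver-specific type differences would need to be reconciled before hashing.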
Abstract
Ensuring data quality in cloud-native Extract-Load-Transform (ELT) pipelines is increasingly challenging due to heterogeneous data sources, evolving schemas, and multi-backend execution environments. This paper presents a unified, multi-layer testing framework that integrates orchestration-level validation, declarative dbt tests, large language model (LLM)-generated semantic tests, and cross-store consistency checking between DuckDB and Snowflake, orchestrated through Apache Airflow. Controlled anomaly-injection experiments demonstrate that a manual-only baseline detected 7 of 16 injected anomalies. In contrast, both a manually expanded comparator and the proposed LLM-augmented configuration detected all 16, representing a 128.57% relative improvement in detection rate over the baseline. Post-migration cross-store validation confirmed exact agreement across all three curated tables. Of…
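The 128.57% figure follows directly from the detection counts quoted in the abstract. A short worked calculation, with helper names chosen here only for illustration:

```python
def detection_rate(detected: int, injected: int) -> float:
    """Fraction of injected anomalies that were detected."""
    return detected / injected


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100


baseline = detection_rate(7, 16)    # manual-only baseline: 0.4375
augmented = detection_rate(16, 16)  # LLM-augmented configuration: 1.0

print(round(relative_improvement(baseline, augmented), 2))  # 128.57
```

Note that this is a relative gain over the baseline rate; the absolute gain in detection rate is 56.25 percentage points (from 43.75% to 100%).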
