A Multi-Layer Testing Framework for Automated Data Quality Assurance in Cloud-Native ELT Pipelines

Ismail Gargouri; Hassan Reza

arXiv:2605.20500 · cs.SE · May 21, 2026

TL;DR

This paper introduces a comprehensive multi-layer testing framework for enhancing data quality assurance in cloud-native ELT pipelines, combining orchestration validation, LLM-generated semantic tests, and cross-store checks.

Contribution

It presents a novel, integrated testing approach leveraging LLMs and cross-store validation to improve anomaly detection and data quality in complex ELT workflows.

Findings

01

LLM-augmented tests detected all 16 injected anomalies, compared with 7 of 16 for the manual-only baseline; a manually expanded comparator also detected all 16.

02

Post-migration cross-store validation confirmed exact agreement between DuckDB and Snowflake across all three curated tables.

03

The workflow completed in approximately 107 seconds, demonstrating operational efficiency.
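
The cross-store check in finding 02 can be illustrated with a minimal sketch. This is not the paper's implementation: the `table_fingerprint` and `stores_agree` helpers, the sample rows, and the choice of a row count plus an order-independent SHA-256 digest are all assumptions made for illustration. The rows are hard-coded here; in practice they would be fetched from each store with its own client.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple[object, ...]


def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, str]:
    """Return (row count, order-independent digest) for a table's rows.

    Hypothetical helper: rows are serialized with repr() and sorted so that
    the result does not depend on the order in which a store returns them.
    """
    encoded = sorted(repr(row).encode("utf-8") for row in rows)
    digest = hashlib.sha256()
    for item in encoded:
        # Length-prefix each row so adjacent rows cannot blur together.
        digest.update(len(item).to_bytes(8, "big"))
        digest.update(item)
    return len(encoded), digest.hexdigest()


def stores_agree(rows_a: Iterable[Row], rows_b: Iterable[Row]) -> bool:
    """True when both stores hold the same multiset of rows."""
    return table_fingerprint(rows_a) == table_fingerprint(rows_b)


# Illustrative data only: the same two rows returned in a different order.
duckdb_rows = [(1, "alice", 10.5), (2, "bob", 7.0)]
snowflake_rows = [(2, "bob", 7.0), (1, "alice", 10.5)]

print(stores_agree(duckdb_rows, snowflake_rows))
```

A comparison like this is only meaningful if both stores return values in the same Python types; a column that arrives as `Decimal` from one backend and `float` from the other would need normalizing before fingerprinting.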

Abstract

Ensuring data quality in cloud-native Extract-Load-Transform (ELT) pipelines is increasingly challenging due to heterogeneous data sources, evolving schemas, and multi-backend execution environments. This paper presents a unified, multi-layer testing framework that integrates orchestration-level validation, declarative dbt tests, large language model (LLM)-generated semantic tests, and cross-store consistency checking between DuckDB and Snowflake, orchestrated through Apache Airflow. Controlled anomaly-injection experiments demonstrate that a manual-only baseline detected 7 of 16 injected anomalies. In contrast, both a manually expanded comparator and the proposed LLM-augmented configuration detected all 16, representing a 128.57% relative improvement in detection rate over the baseline. Post-migration cross-store validation confirmed exact agreement across all three curated tables. Of…
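
The 128.57% figure follows directly from the two detection counts reported in the abstract. The short sketch below reproduces the arithmetic; the function names are illustrative and not taken from the paper.

```python
def detection_rate(detected: int, injected: int) -> float:
    """Fraction of injected anomalies that were detected."""
    if injected <= 0:
        raise ValueError("injected must be positive")
    return detected / injected


def relative_improvement(baseline_rate: float, new_rate: float) -> float:
    """Relative change from baseline_rate to new_rate, in percent."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


baseline = detection_rate(7, 16)     # manual-only baseline: 0.4375
augmented = detection_rate(16, 16)   # LLM-augmented configuration: 1.0
improvement = relative_improvement(baseline, augmented)

print(f"{improvement:.2f}%")  # prints "128.57%"
```

The relative figure is large because the baseline rate is low: the absolute gain is 56.25 percentage points (from 43.75% to 100%), which is 9/7 of the baseline rate.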
