Evaluating Data Quality Tools: Measurement Capabilities and LLM Integration
Tobias Rehberger, Thomas Hütter, Lisa Ehrlinger, Wolfram Wöß

TL;DR
This paper evaluates six data quality tools, analyzing their measurement capabilities and how they incorporate Large Language Models, highlighting differences between open-source and proprietary solutions.
Contribution
It compares six data quality tools against evaluation criteria derived from real-world use cases of company partners, and assesses to what extent the tools integrate LLMs.
Findings
Proprietary tools have more comprehensive measurement features.
Open-source tools offer greater flexibility but require more effort.
LLM integration is mainly limited to rule creation workflows.
Abstract
High data quality is critical for reliable analytics and operational efficiency. A growing ecosystem of tools has emerged to support data quality management, ranging from lightweight open-source libraries to comprehensive enterprise platforms. This paper evaluates six data quality tools: Great Expectations, Deequ, Evidently, Informatica, Experian, and Ataccama. The evaluation criteria cover rule definition, duplicate detection, metric aggregation, and uncertainty handling, and were derived from real-world use cases of company partners. We further examine to what extent these tools integrate Large Language Models (LLMs). Our findings show that proprietary tools offer more comprehensive measurement features and emerging LLM-based assistance, while open-source tools provide flexibility at the cost of higher implementation effort. Across all tools, LLM integration remains limited to rule…
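To make three of the evaluation criteria named in the abstract concrete (rule definition, duplicate detection, and metric aggregation), the following is a minimal sketch in plain Python. It does not use the API of any of the six evaluated tools. The sample records, rule names, helper functions, and the unweighted-mean aggregation are all illustrative assumptions and are not taken from the paper. Uncertainty handling is not covered.

```python
from collections import Counter

# Hypothetical sample records (assumption; not data from the paper).
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 51},
    {"id": 3, "email": "a@example.com", "age": 34},
    {"id": 4, "email": "b@example.com", "age": -5},
]

# Rule definition: each rule maps a record to True (pass) or False (fail).
rules = {
    "email_not_null": lambda r: r["email"] is not None,
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
}


def rule_pass_rates(rows, rule_set):
    """Return, for each rule, the fraction of rows that pass it."""
    return {
        name: sum(1 for r in rows if check(r)) / len(rows)
        for name, check in rule_set.items()
    }


def duplicate_ratio(rows, key_fields):
    """Fraction of rows that repeat an earlier row on key_fields (exact match only)."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    redundant = sum(c - 1 for c in counts.values())
    return redundant / len(rows)


def aggregate(metrics):
    """Metric aggregation as an unweighted mean (one simple choice among many)."""
    return sum(metrics.values()) / len(metrics)


pass_rates = rule_pass_rates(records, rules)
dup_ratio = duplicate_ratio(records, ["email", "age"])

# Combine rule pass rates with a uniqueness score into one overall figure.
metrics = dict(pass_rates, uniqueness=1 - dup_ratio)
overall = aggregate(metrics)
print(metrics, overall)
```

Exact-match duplicate detection is the simplest variant. Fuzzy matching and weighted aggregation would require further design decisions that this sketch deliberately leaves out.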
