Solving Data Quality Problems with Desbordante: a Demo
George Chernishev, Michael Polyntsov, Anton Chizhov, Kirill Stupakov, Ilya Shchuckin, Alexander Smirnov, Maxim Strutovsky, Alexey Shlyonskikh, Mikhail Firsov, Stepan Manannikov, Nikita Bobrov, Daniil Goncharov, Ilia Barutkin, Vladislav Shalnev, Kirill Muraviev

TL;DR
Desbordante is an industrial-grade, open-source data profiler that efficiently detects data quality issues, offers explanations, and seamlessly integrates with Python, addressing limitations of existing profiling tools.
Contribution
The paper introduces Desbordante, a scalable, resilient, and explainable data profiling tool designed for industrial use, with a performance-oriented C++ core and seamless Python integration.
Findings
Demonstrates effective typo detection, data deduplication, and anomaly detection.
Pairs seamless Python integration with a C++ core that provides the performance.
Explains why a given data pattern is absent, rather than only reporting that it was not found.
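To make the idea of explaining an absent pattern concrete, the following is a minimal pure-Python sketch of checking a functional dependency and reporting the rows that break it. It does not use Desbordante or its API; the function name `fd_violations`, the column names, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of row indices that agree on `lhs` but disagree on `rhs`.

    An empty result means the functional dependency lhs -> rhs holds.
    This is an illustrative sketch, not Desbordante's algorithm.
    """
    # Group row indices by their values in the left-hand-side columns.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in lhs)
        groups[key].append(index)

    # A group violates the dependency if it holds more than one rhs value.
    violations = {}
    for key, indices in groups.items():
        rhs_values = {rows[i][rhs] for i in indices}
        if len(rhs_values) > 1:
            violations[key] = indices
    return violations


# Hypothetical sample data: the second row contains a typo in "city".
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "94103", "city": "San Francisco"},
    {"zip": "94103", "city": "San Francisco"},
]

violations = fd_violations(rows, lhs=["zip"], rhs="city")
for key, indices in violations.items():
    # Show which rows prevent zip -> city from holding.
    print(key, [rows[i]["city"] for i in indices])
```

Here the dependency zip -> city fails only because of rows 0 and 1, which is the kind of explanation that points a user directly at a likely typo instead of just reporting that the dependency was not found.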
Abstract
Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not integrate properly with the tools used by contemporary data scientists, which creates a significant barrier to their adoption in industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e., why a given pattern is not found. This is a significant issue, because understanding the underlying reasons for a specific pattern's absence is essential to making informed decisions based on the data. Because of that, these patterns are effectively rest in…
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Big Data and Business Intelligence
