Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)
George Chernishev, Michael Polyntsov, Anton Chizhov, Kirill Stupakov, Ilya Shchuckin, Alexander Smirnov, Maxim Strutovsky, Alexey Shlyonskikh, Mikhail Firsov, Stepan Manannikov, Nikita Bobrov, Daniil Goncharov, Ilia Barutkin, Vladislav Shalnev, Kirill Muraviev

TL;DR
Desbordante is a high-performance, scalable, and resilient data profiler designed for industrial use, capable of discovering, validating, and applying complex data primitives across various data types.
Contribution
It introduces an industrial-grade, open-source data profiling system with novel features like primitive validation, multi-data-type support, and integration with user-defined pipelines.
Findings
Efficient primitive discovery algorithms implemented in C++
Supports validation and explanation of primitive violations
Works with tabular, graph, and transactional data
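To make "validation and explanation of primitive violations" concrete, here is a minimal sketch of validating one kind of primitive, a functional dependency, on tabular data. It is plain Python written for illustration only and is not Desbordante's API; the function name validate_fd, the row format, and the sample data are all assumptions.

```python
from collections import defaultdict


def validate_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows.

    rows: list of dicts (one per record); lhs: tuple of column names;
    rhs: a single column name. Hypothetical helper, not a library call.
    Returns (holds, violations), where violations maps each offending
    LHS value to the sorted list of distinct RHS values seen with it.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        seen[key].add(row[rhs])
    # The dependency is violated wherever one LHS value maps to several RHS values.
    violations = {k: sorted(v) for k, v in seen.items() if len(v) > 1}
    return (not violations, violations)


# Illustrative data: does zip -> city hold?
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": "Oakland"},
]
holds, violations = validate_fd(rows, ("zip",), "city")
```

In this sketch the returned violations dictionary plays the role of the explanation: it names the left-hand-side values for which the dependency fails and the conflicting right-hand-side values.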
Abstract
Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. This work presents Desbordante, a high-performance, science-intensive data profiler with open-source code. Unlike similar systems, it is built with an emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable. Its efficiency is ensured by implementing discovery algorithms in C++, its resilience is achieved by extensive use of containerization, and its scalability is based on replication of containers. Desbordante aims to open industrial-grade primitive discovery to a…
Taxonomy
Topics: Advanced Database Systems and Queries · Data Quality and Management · Scientific Computing and Data Management
