TL;DR
DataPrep.EDA is a new Python system for task-centric exploratory data analysis that simplifies specifying and executing EDA tasks. It is reported to run faster than Pandas-profiling and to improve usability.
Contribution
It introduces a declarative, task-centric approach to EDA in Python, addressing limitations of existing libraries and enhancing scalability, usability, and customizability.
Findings
DataPrep.EDA significantly outperforms Pandas-profiling in speed.
The system improves user experience in EDA tasks.
The authors developed acceleration techniques for the Dask-based data processing pipelines behind EDA tasks.
Abstract
Exploratory Data Analysis (EDA) is a crucial step in any data science project. However, existing Python libraries fall short in supporting data scientists to complete common EDA tasks for statistical modeling. Their API design is either too low-level, optimized for plotting rather than EDA, or too high-level, making it hard to specify more fine-grained EDA tasks. In response, we propose DataPrep.EDA, a novel task-centric EDA system in Python. DataPrep.EDA allows data scientists to declaratively specify a wide range of EDA tasks at different granularities with a single function call. We identify a number of challenges in implementing DataPrep.EDA, and propose effective solutions to improve the scalability, usability, and customizability of the system. In particular, we discuss some lessons learned from using Dask to build the data processing pipelines for EDA tasks and describe our…
