MLPrE -- A tool for preprocessing and exploratory data analysis prior to machine learning model construction
David S Maxwell, Michael Darkoh, Sidharth R Samudrala, Caroline Chung, Stephanie T Schmidt, Bissan Al-Lazikani

TL;DR
MLPrE is a scalable, flexible tool designed for preprocessing and exploratory data analysis that integrates seamlessly into larger machine learning workflows, handling diverse data formats and sizes efficiently.
Contribution
This paper introduces MLPrE, a novel, extensible tool that streamlines data preprocessing and analysis for machine learning, supporting multiple data types and integration with existing pipelines.
Findings
MLPrE processed six diverse datasets, demonstrating its versatility.
Multiple data fields could be processed independently and then recombined.
Data was prepared for graph databases, showing support for end-to-end workflows.
Abstract
With the recent growth of Deep Learning for AI, there is a need for tools to meet the demand of data flowing into those models. Source data may exist in multiple formats, so it must be investigated and properly engineered before it can feed a Machine Learning model or graph database. Overhead and lack of scalability in existing workflows limit their integration within a larger processing pipeline such as Apache Airflow, driving the need for a robust, extensible, and lightweight tool that preprocesses arbitrary datasets and scales with data type and size. To address this, we present Machine Learning Preprocessing and Exploratory Data Analysis (MLPrE), in which Spark DataFrames hold the data during processing to ensure scalability. A generalizable JSON input file format describes the stepwise changes applied to that DataFrame. Stages were implemented for…
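The idea of a JSON file describing stepwise changes to a DataFrame can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the stage names (`drop_nulls`, `rename_column`, `lowercase`), the `stages`/`params` keys, and the use of a plain list of dictionaries as a stand-in for a Spark DataFrame are hypothetical and are not taken from MLPrE itself.

```python
import json

# Hypothetical stages. Each takes the current rows (a list of dicts standing in
# for a Spark DataFrame) plus keyword parameters, and returns new rows.
def drop_nulls(rows, column):
    """Keep only rows whose value in `column` is not None."""
    return [r for r in rows if r.get(column) is not None]

def rename_column(rows, old, new):
    """Rename the key `old` to `new` in every row."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]

def lowercase(rows, column):
    """Lowercase the string value stored in `column`."""
    return [{**r, column: r[column].lower()} for r in rows]

# Registry mapping the stage name used in the JSON file to its implementation.
STAGES = {
    "drop_nulls": drop_nulls,
    "rename_column": rename_column,
    "lowercase": lowercase,
}

def run_pipeline(rows, config_text):
    """Apply the stages listed in the JSON config, in order."""
    config = json.loads(config_text)
    for step in config["stages"]:
        stage = STAGES[step["stage"]]
        rows = stage(rows, **step.get("params", {}))
    return rows

# Assumed JSON layout: an ordered list of stages, each with its parameters.
CONFIG = """
{
  "stages": [
    {"stage": "drop_nulls", "params": {"column": "Name"}},
    {"stage": "rename_column", "params": {"old": "Name", "new": "name"}},
    {"stage": "lowercase", "params": {"column": "name"}}
  ]
}
"""

ROWS = [
    {"Name": "Aspirin", "dose": 100},
    {"Name": None, "dose": 50},
    {"Name": "IBUPROFEN", "dose": 200},
]

RESULT = run_pipeline(ROWS, CONFIG)
print(RESULT)
```

Keeping the stage order in the JSON file rather than in code means a pipeline can be changed or extended by editing the configuration and registering a new function, which is one plausible reading of how a tool of this kind stays extensible.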
