TL;DR
DPASF is a Flink-based library that implements key data preprocessing algorithms for streaming Big Data, enabling efficient data correction, reduction, and potential accuracy improvement in real-time applications.
Contribution
It introduces a Flink library with six preprocessing algorithms (three for discretization, three for feature selection) tailored to continuous Big Data streams, addressing a gap left by prior research, which has focused mostly on static Big Data preprocessing.
Findings
Preprocessing reduces data size and maintains or improves accuracy.
The algorithms perform efficiently on large datasets in a streaming context.
DPASF is effective for real-time data correction and feature selection.
Abstract
Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widespread data preprocessing techniques. Although we can find many proposals for static Big Data preprocessing, there is little research devoted to the continuous Big Data problem. Apache Flink is a recent and novel Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. We have implemented six of the most popular data preprocessing algorithms, three for discretization and three for feature selection. The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the…
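To make the idea of feature selection on a stream concrete, here is a minimal sketch in Python. It is not DPASF's API and is not taken from the paper: the class name `StreamingInfoGain` and its methods are hypothetical, and information gain is used only as a generic illustration of a filter-style selector. The sketch assumes features are already discrete (for example, after a discretization step) and keeps only running counts, so each instance is seen once and never stored.

```python
import math
from collections import defaultdict


class StreamingInfoGain:
    """Hypothetical helper: ranks discrete features by information gain,
    using counts that are updated one instance at a time."""

    def __init__(self, n_features):
        self.n = 0                                  # instances seen so far
        self.class_counts = defaultdict(int)        # label -> count
        # Per feature: (value, label) -> count and value -> count
        self.joint_counts = [defaultdict(int) for _ in range(n_features)]
        self.value_counts = [defaultdict(int) for _ in range(n_features)]

    def update(self, x, y):
        """Fold one instance (feature vector x, label y) into the counts."""
        self.n += 1
        self.class_counts[y] += 1
        for j, v in enumerate(x):
            self.joint_counts[j][(v, y)] += 1
            self.value_counts[j][v] += 1

    @staticmethod
    def _entropy(counts, total):
        # Shannon entropy in bits of a distribution given as raw counts
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def info_gain(self, j):
        """IG(Y; X_j) = H(Y) - H(Y | X_j), from the current counts."""
        h_y = self._entropy(self.class_counts.values(), self.n)
        h_y_given_x = 0.0
        for v, n_v in self.value_counts[j].items():
            cond = [c for (vv, _), c in self.joint_counts[j].items() if vv == v]
            h_y_given_x += (n_v / self.n) * self._entropy(cond, n_v)
        return h_y - h_y_given_x

    def top_k(self, k):
        """Indices of the k features with the highest information gain."""
        gains = [(self.info_gain(j), j) for j in range(len(self.joint_counts))]
        return [j for _, j in sorted(gains, key=lambda g: -g[0])[:k]]


# Toy stream: feature 0 equals the label, feature 1 is independent of it.
selector = StreamingInfoGain(n_features=2)
for x, y in [((0, 1), 0), ((1, 1), 1), ((0, 0), 0), ((1, 0), 1)]:
    selector.update(x, y)
print(selector.top_k(1))
```

Because the state is a set of additive counts, partial counts computed on different partitions of a stream could in principle be summed, which is the kind of property that makes such statistics convenient in a distributed setting like Flink. How DPASF actually organizes this is not described in the abstract.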
