DPASF: A Flink Library for Streaming Data preprocessing

Alejandro Alcalde-Barros; Diego García-Gil; Salvador García; Francisco Herrera

arXiv:1810.06021·cs.DB·October 16, 2018

TL;DR

DPASF is a Flink-based library that implements key data preprocessing algorithms for streaming Big Data, enabling efficient data correction, reduction, and potential accuracy improvement in real-time applications.

Contribution

It introduces a Flink library with six preprocessing algorithms (three for discretization, three for feature selection) for continuous Big Data streams, an area that has received little attention compared with static Big Data preprocessing.

Findings

1. Preprocessing reduces data size and maintains or improves accuracy.

2. Algorithms perform efficiently on large datasets in a streaming context.

3. DPASF is effective for real-time data correction and feature selection.

Abstract

Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widely used data preprocessing techniques. Although many proposals exist for static Big Data preprocessing, there is little research devoted to the continuous Big Data problem. Apache Flink is a recent and novel Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. We have implemented six of the most popular data preprocessing algorithms, three for discretization and the rest for feature selection. The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data but also maintain or even improve the…
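To make the idea of discretizing a continuous stream concrete, the following is a minimal sketch in plain Python. It is not DPASF's API and not one of the library's algorithms: it assumes a simple equal-width scheme whose range is updated as each value arrives, and the names `StreamingEqualWidthDiscretizer` and `discretize_stream` are hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class StreamingEqualWidthDiscretizer:
    """Illustrative only: equal-width bins over the range seen so far."""
    n_bins: int = 4
    lo: float = math.inf    # smallest value observed so far
    hi: float = -math.inf   # largest value observed so far

    def update(self, x: float) -> None:
        # Widen the observed range with the new value (one pass, O(1) memory).
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def transform(self, x: float) -> int:
        # With no spread yet, every value falls in bin 0.
        if self.hi <= self.lo:
            return 0
        width = (self.hi - self.lo) / self.n_bins
        idx = int((x - self.lo) / width)
        # Clamp so the maximum value lands in the last bin.
        return max(0, min(self.n_bins - 1, idx))


def discretize_stream(values, n_bins=4):
    """Yield a bin index for each value, updating the range first."""
    d = StreamingEqualWidthDiscretizer(n_bins=n_bins)
    for x in values:
        d.update(x)
        yield d.transform(x)
```

Because the range grows as data arrives, a bin index is only meaningful relative to the range known at the time it was emitted. Handling that drift well is the kind of problem stream-specific discretizers are designed for.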


Code & Models

Repositories

elbaulp/dpasf (official)
