# Enabling Smart Data: Noise filtering in Big Data classification

**Authors:** Diego García-Gil, Julián Luengo, Salvador García, Francisco Herrera

arXiv: 1704.01770 · 2017-07-31

## TL;DR

This paper introduces two scalable ensemble filtering methods to effectively remove label noise from Big Data classification datasets, thereby improving data quality and enabling more accurate knowledge discovery.

## Contribution

It proposes two ensemble-based noise filters for Big Data preprocessing, a homogeneous ensemble and a heterogeneous ensemble, designed to remove noisy examples at a scale where traditional noise-handling methods struggle.

## Key findings

- The ensemble filters effectively reduce label noise in large datasets.
- The methods demonstrate high scalability and efficiency on Big Data classification tasks.
- Results show improved data quality leading to better classification performance.

## Abstract

In any knowledge discovery process the value of extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by massive growth in the scale of data observed in recent years, also follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances, and is known to be a very disruptive feature of data. However, in this Big Data era, the massive growth in the scale of the data poses a challenge to traditional proposals created to tackle noise, as they have difficulties coping with such a large amount of data. New algorithms need to be proposed to treat the noise in Big Data problems, providing high quality and clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed: a homogeneous ensemble and a heterogeneous ensemble filter, with special emphasis on their scalability and performance traits. The obtained results show that these proposals enable the practitioner to efficiently obtain a Smart Dataset from any Big Data classification problem.
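To make the idea of an ensemble noise filter concrete, here is a minimal single-machine sketch in plain Python. It is not the authors' implementation: this summary does not state the base learners, the partitioning scheme, or the voting rule, so the nearest-centroid and k-NN learners, the fold-based partitioning, and the `majority`/`consensus` options below are all illustrative assumptions, as are the names `ensemble_filter`, `nearest_centroid`, and `knn`. The sketch also ignores the distributed execution that the paper emphasizes. Passing one learner gives a homogeneous-style filter (the same learner fit on different partitions); passing several gives a heterogeneous-style filter (different learners voting).

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_centroid(train):
    """Hypothetical base learner: predict the class with the closest mean."""
    sums, counts = {}, Counter()
    for x, y in train:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
    return lambda x: min(centroids, key=lambda y: sq_dist(x, centroids[y]))

def knn(k):
    """Hypothetical base learner factory: k-nearest-neighbour majority vote."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda row: sq_dist(x, row[0]))[:k]
            return Counter(label for _, label in nearest).most_common(1)[0][0]
        return predict
    return fit

def ensemble_filter(data, learners, n_folds=4, vote="majority", seed=0):
    """Return the instances judged clean.

    Each instance is predicted only by models that were fit without it.
    It is flagged as noisy when a majority (or, with vote="consensus",
    all) of those models disagree with its given label.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    keep = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in idx if i not in held]
        models = [fit(train) for fit in learners]
        for i in held_out:
            x, y = data[i]
            wrong = sum(model(x) != y for model in models)
            if vote == "consensus":
                noisy = wrong == len(models)
            else:
                noisy = wrong > len(models) / 2
            if not noisy:
                keep.append(i)
    return [data[i] for i in sorted(keep)]

# Toy data: two well-separated clusters plus one deliberately mislabelled point.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0),
        ((0.5, 0.5), 0), ((0.2, 0.8), 0),
        ((10, 10), 1), ((10, 11), 1), ((11, 10), 1), ((11, 11), 1),
        ((10.5, 10.5), 1), ((10.2, 10.8), 1),
        ((0.4, 0.6), 1)]  # sits in the class-0 cluster but is labelled 1

homogeneous = ensemble_filter(data, [nearest_centroid])
heterogeneous = ensemble_filter(data, [nearest_centroid, knn(1), knn(3)])
```

In this toy setting the 1-NN learner can be misled by the mislabelled point when it lands in the training folds, which is one reason to combine several learners by vote rather than trust a single noise-sensitive one. The `consensus` option is the more conservative choice: it removes fewer instances, at the risk of leaving some noise behind.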

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.01770/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/1704.01770/full.md

## References

54 references — full list in the complete paper: https://tomesphere.com/paper/1704.01770/full.md

---
Source: https://tomesphere.com/paper/1704.01770