# Distributed Correlation-Based Feature Selection in Spark

**Authors:** Raul-Jose Palma-Mendoza, Luis de-Marcos, Daniel Rodriguez, and Amparo Alonso-Betanzos

arXiv: 1901.11286 · 2019-02-01

## TL;DR

This paper introduces Distributed CFS (DiCFS), a redesigned, parallel and distributed version of Correlation-Based Feature Selection implemented on Apache Spark. It handles larger datasets than the non-distributed WEKA version while returning exactly the same features.

## Contribution

The paper presents DiCFS, a complete redesign of the CFS feature selection algorithm for the Apache Spark cluster computing model. Two versions of the algorithm are implemented and compared, both aimed at the data volumes typical of big data applications.

## Key findings

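- Both DiCFS versions were superior in time-efficiency and scalability in the reported experiments.
- DiCFS returned exactly the same features as the original CFS implementation available in WEKA.
- By leveraging a computer cluster, DiCFS handled larger datasets than the non-distributed WEKA version. The tests used four publicly available datasets, each with a large number of instances and two also with a large number of features.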
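
One reason a distributed version can return exactly the same features is that correlations over discrete features are derived from counts, and counts can be accumulated per partition and merged without loss. The following is a minimal sketch in plain Python (no Spark), assuming discrete features and a row-partitioned dataset. The helper names `partial_table` and `merge_tables` and the toy data are hypothetical, and this is not the paper's implementation.

```python
from collections import Counter
from functools import reduce


def partial_table(rows, i, j):
    """Count co-occurrences of the values in columns i and j for one partition."""
    return Counter((row[i], row[j]) for row in rows)


def merge_tables(a, b):
    """Combine two partial contingency tables by adding their counts."""
    return a + b


# Toy dataset (hypothetical): two discrete features and a class label.
data = [
    ("a", "x", 1),
    ("a", "y", 0),
    ("b", "x", 1),
    ("b", "x", 1),
    ("a", "y", 0),
    ("b", "y", 0),
]

# Simulate three workers, each holding a slice of the rows.
partitions = [data[:2], data[2:4], data[4:]]

# Each "worker" builds a local table; the tables are then merged pairwise.
distributed = reduce(merge_tables, (partial_table(p, 0, 2) for p in partitions))

# Reference: the same table built in a single pass over all rows.
single_pass = partial_table(data, 0, 2)

print(distributed == single_pass)
```

Because the merged table equals the single-pass table, any correlation computed from it is identical as well. On a cluster, this pattern would typically correspond to a per-partition aggregation followed by a reduce.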

## Abstract

CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster computing model, currently gaining popularity due to its much faster processing times than Hadoop's MapReduce model. We tested our algorithms on four publicly available datasets, each consisting of a large number of instances and two also consisting of a large number of features. The results show that our algorithms were superior in terms of both time-efficiency and scalability. In leveraging a computer cluster, they were able to handle larger datasets than the non-distributed WEKA version while maintaining the quality of the results, i.e., exactly the same features were returned by our algorithms when compared to the original algorithm available in WEKA.
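
For readers unfamiliar with CFS, the algorithm scores a candidate feature subset with a merit heuristic that rewards correlation with the class and penalizes correlation among the features themselves. The sketch below illustrates that heuristic in its commonly cited form. The function name `cfs_merit` and the input values are hypothetical, the correlations are assumed to be precomputed, and the code is not taken from the paper.

```python
import math


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a subset with k features.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)

    r_cf: mean feature-class correlation
    r_ff: mean feature-feature correlation over all pairs in the subset
    """
    k = len(feature_class_corrs)
    if k == 0:
        return 0.0
    r_cf = sum(feature_class_corrs) / k
    # A single feature has no pairs, so the redundancy term is zero.
    r_ff = (
        sum(feature_feature_corrs) / len(feature_feature_corrs)
        if feature_feature_corrs
        else 0.0
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Two features with the same class correlations (hypothetical values):
independent = cfs_merit([0.6, 0.4], [0.0])  # features uncorrelated with each other
redundant = cfs_merit([0.6, 0.4], [1.0])    # features fully correlated with each other

print(independent > redundant)
```

The comparison shows the intended behavior: with equal class correlations, a subset of mutually redundant features scores lower than a subset of independent ones.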

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.11286/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1901.11286/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/1901.11286/full.md

---
Source: https://tomesphere.com/paper/1901.11286