An Information Theoretic Feature Selection Framework for Big Data under Apache Spark
Sergio Ramírez-Gallego, Héctor Mouriño-Talín, David Martínez-Rego, Verónica Bolón-Canedo, José Manuel Benítez, Amparo Alonso-Betanzos, Francisco Herrera

TL;DR
This paper presents a distributed feature selection framework based on Information Theoretic methods implemented on Apache Spark, enabling efficient handling of ultra-high dimensional big data with improved performance and accuracy.
Contribution
It introduces a novel parallelized framework for feature selection using Information Theoretic methods on Spark, scalable to extremely large datasets.
Findings
Outperforms sequential methods in speed and accuracy.
Effectively handles ultra-high dimensional datasets.
Demonstrates scalability on real-world big data.
Abstract
With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on huge datasets (both in number of instances and features). The purpose of this work is to demonstrate that standard feature selection methods can be parallelized in Big Data platforms like Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of a generic feature selection framework which includes a wide group of well-known Information Theoretic methods. Experimental results on a wide set of real-world datasets show that our distributed framework is capable of dealing with ultra-high dimensional datasets as well as those with a huge number of samples in a short period of time, outperforming the sequential…
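To make the idea of an information-theoretic selection criterion concrete, here is a minimal sequential sketch in plain Python. It is not the paper's Spark implementation. It assumes discrete features and uses a greedy mRMR-style score (relevance to the class minus mean redundancy with already selected features) as one well-known example of such a criterion; the abstract does not say which criteria the framework includes. The names `mutual_information` and `select_features` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def select_features(columns, labels, k):
    """Greedily pick k column indices with an mRMR-style score (an assumption,
    chosen for illustration): I(f; y) minus the mean of I(f; s) over selected s."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = set(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: column 1 duplicates column 0, column 2 is weaker but less redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 1],
]
print(select_features(columns, labels, 2))  # prints [0, 2]
```

The pairwise mutual-information computations inside `score` are independent of one another, which is the kind of work a platform such as Spark could spread across a cluster; how the paper actually distributes it is not described in this summary.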
Taxonomy
Topics: Face and Expression Recognition · Machine Learning and Data Classification · Neural Networks and Applications
