Feature selection in high-dimensional dataset using MapReduce
Claudio Reggiani, Yann-Aël Le Borgne, Gianluca Bontempi

TL;DR
This paper presents a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm, with an open-source Hadoop/Spark version whose scalability is illustrated on datasets involving millions of observations or features.
Contribution
It introduces a scalable, distributed implementation of a popular feature selection algorithm suitable for high-dimensional datasets using MapReduce frameworks.
Findings
Handles both tall/narrow (many observations) and wide/short (many features) datasets.
Provides an open-source implementation based on Hadoop/Spark.
Illustrates scalability on datasets involving millions of observations or features.
Abstract
This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features.
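To make the idea concrete, here is a minimal single-machine sketch of how mRMR can be phrased as map and reduce steps. It is not the authors' Hadoop/Spark implementation. It assumes discrete features, covers only the tall/narrow case (rows split into chunks), and uses the common "relevance minus mean redundancy" form of the mRMR score. All function names (`map_counts`, `reduce_counts`, `pairwise_mi`, `mrmr`) and the toy data are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import log

def map_counts(chunk, pairs):
    """Map step: turn one chunk of rows into joint-value counts per column pair."""
    counts = {p: Counter() for p in pairs}
    for row in chunk:
        for a, b in pairs:
            counts[(a, b)][(row[a], row[b])] += 1
    return counts

def reduce_counts(left, right):
    """Reduce step: sum the per-chunk contingency tables."""
    merged = {p: Counter(c) for p, c in left.items()}
    for p, c in right.items():
        merged[p].update(c)
    return merged

def mutual_information(joint):
    """Mutual information in nats from a Counter mapping (x, y) to a count."""
    n = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), c in joint.items():
        px[x] += c
        py[y] += c
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def pairwise_mi(chunks, pairs):
    """Map over chunks, reduce the counts, then score every requested pair."""
    if not pairs:
        return {}
    totals = reduce(reduce_counts, (map_counts(ch, pairs) for ch in chunks))
    return {p: mutual_information(t) for p, t in totals.items()}

def mrmr(chunks, n_columns, target, k):
    """Greedy mRMR: pick k columns maximising relevance minus mean redundancy."""
    features = [f for f in range(n_columns) if f != target]
    scored = pairwise_mi(chunks, [(f, target) for f in features])
    relevance = {f: v for (f, _), v in scored.items()}
    redundancy = {f: 0.0 for f in features}  # running sum of MI with selected features
    selected = []
    while len(selected) < min(k, len(features)):
        candidates = [f for f in features if f not in selected]
        if selected:
            last = selected[-1]
            update = pairwise_mi(chunks, [(f, last) for f in candidates])
            for (f, _), v in update.items():
                redundancy[f] += v

        def score(f):
            penalty = redundancy[f] / len(selected) if selected else 0.0
            return relevance[f] - penalty

        selected.append(max(candidates, key=score))
    return selected

# Toy data (hypothetical): column 1 duplicates column 0, column 3 is the target.
rows = [(a, a, b, 2 * a + b) for a in range(3) for b in range(2)]
chunks = [rows[:3], rows[3:]]  # stands in for partitions on different workers
print(mrmr(chunks, n_columns=4, target=3, k=2))
```

In this toy data the duplicated column is skipped on the second pick because its redundancy with the first selected column cancels its relevance. In a real cluster the `reduce` call would be replaced by the framework's own aggregation, and a wide/short dataset would need to be partitioned along features instead of rows, which this sketch does not attempt.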
Taxonomy
Topics: Gene expression and cancer classification · Data Mining Algorithms and Applications · Artificial Intelligence in Healthcare
