Feature selection in high-dimensional dataset using MapReduce

Claudio Reggiani; Yann-Aël Le Borgne; Gianluca Bontempi

arXiv:1709.02327·cs.DC·September 8, 2017

TL;DR

This paper presents a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm. An open-source Hadoop/Spark version is shown to scale to datasets with millions of observations or features.

Contribution

It introduces a scalable, distributed MapReduce implementation of the minimum Redundancy Maximum Relevance feature selection algorithm that handles both tall/narrow and wide/short high-dimensional datasets.

Findings

01. Handles datasets with millions of observations or features.

02. Provides an open-source implementation based on Hadoop/Spark.

03. Illustrates scalability on both tall/narrow and wide/short datasets.

Abstract

This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features.
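
The greedy selection that mRMR performs can be sketched in a few lines. The block below is a minimal single-machine illustration in plain Python, not the authors' Hadoop/Spark code: it assumes discrete features, the common difference form of the criterion (relevance minus mean redundancy with the already selected features), and uses Python's built-in `map` and `functools.reduce` only to mimic the shape of a MapReduce job. The names `mutual_information` and `mrmr_select` are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def mrmr_select(columns, target, k):
    """Greedy mRMR over a dict {feature_name: list_of_discrete_values}.

    Assumed criterion: score(f) = I(f; target) - mean_{s in selected} I(f; s).
    """
    # "Map" step: compute the relevance of every feature independently.
    relevance = dict(
        map(lambda kv: (kv[0], mutual_information(kv[1], target)), columns.items())
    )
    # Running sum of I(f; s) over the features s selected so far.
    redundancy_sum = {name: 0.0 for name in columns}
    selected = []

    while len(selected) < min(k, len(columns)):
        scored = [
            (
                relevance[name]
                - (redundancy_sum[name] / len(selected) if selected else 0.0),
                name,
            )
            for name in columns
            if name not in selected
        ]
        # "Reduce" step: keep the best-scoring candidate (first one wins ties).
        _, best = reduce(lambda a, b: a if a[0] >= b[0] else b, scored)
        selected.append(best)

        # Update redundancy of the remaining candidates against the new pick.
        for name in columns:
            if name not in selected:
                redundancy_sum[name] += mutual_information(
                    columns[name], columns[best]
                )
    return selected
```

As a toy usage example, take a target `[0, 0, 1, 1, 2, 2, 3, 3]` with features `a = [0, 0, 0, 0, 1, 1, 1, 1]`, `b` an exact copy of `a`, and `c = [0, 0, 1, 1, 0, 0, 1, 1]`. Calling `mrmr_select({"a": a, "b": b, "c": c}, target, 2)` returns `["a", "c"]`: `b` is as relevant as `a` but fully redundant with it, so `c` is preferred. In a distributed setting the per-feature scoring would run on partitions of the data rather than in a local loop.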

Code & Models

Repositories

creggian/spark-ifs (official)

Taxonomy

Topics: Gene expression and cancer classification · Data Mining Algorithms and Applications · Artificial Intelligence in Healthcare