On the (In)Significance of Feature Selection in High-Dimensional Datasets

Bhavesh Neekhra; Debayan Gupta; Partha Pratim Chakrabarti

arXiv:2508.03593·cs.LG·September 22, 2025

TL;DR

This study reveals that in high-dimensional datasets, small random feature subsets often match or outperform carefully selected features, questioning the importance of feature selection for predictive performance.

Contribution

The paper demonstrates that random feature subsets can perform as well as or better than selected features in high-dimensional data, challenging the assumed value of feature selection.

Findings

01. Random feature subsets match or outperform selected features in 28/30 datasets.

02. Selected features do not significantly outperform arbitrary feature sets.

03. Results highlight the need for rigorous validation of feature importance.

Abstract

Feature selection (FS) is assumed to improve predictive performance and identify meaningful features in high-dimensional datasets. Surprisingly, small random subsets of features (0.02-1%) match or outperform the predictive performance of both full feature sets and FS across 28 out of 30 diverse datasets (microarray, bulk and single-cell RNA-Seq, mass spectrometry, imaging, etc.). In short, any arbitrary set of features is as good as any other (with surprisingly low variance in results) - so how can a particular set of selected features be "important" if they perform no better than an arbitrary set? These results challenge the assumption that computationally selected features reliably capture meaningful signals, emphasizing the importance of rigorous validation before interpreting selected features as actionable, particularly in computational genomics.
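The comparison the abstract describes (full feature set vs. selected features vs. small random subsets) can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the synthetic data generator, the nearest-centroid classifier, the univariate mean-difference filter, and all function names and parameter values (`make_synthetic`, `compare_feature_sets`, `k=20`, and so on) are assumptions introduced here. The subset size `k=20` out of 2000 features corresponds to 1%, the upper end of the range quoted in the abstract.

```python
import numpy as np


def make_synthetic(n_samples=200, n_features=2000, n_informative=400,
                   shift=0.5, seed=0):
    """Hypothetical two-class data: many features each carry a weak signal."""
    rng = np.random.default_rng(seed)
    y = np.arange(n_samples) % 2                      # balanced labels 0/1
    X = rng.normal(size=(n_samples, n_features))
    X[:, :n_informative] += shift * y[:, None]        # shift class-1 means
    return X, y


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Classify each test sample by the closer class centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_test).mean())


def compare_feature_sets(k=20, n_repeats=20, seed=0):
    """Accuracy with all features, top-k filtered features, and random k."""
    X, y = make_synthetic(seed=seed)
    rng = np.random.default_rng(seed + 1)
    perm = rng.permutation(len(y))
    n_train = int(0.7 * len(y))
    tr, te = perm[:n_train], perm[n_train:]

    def acc(cols):
        return nearest_centroid_accuracy(
            X[tr][:, cols], y[tr], X[te][:, cols], y[te])

    full = acc(np.arange(X.shape[1]))

    # Univariate filter, fitted on the training split only to avoid leakage.
    m0 = X[tr][y[tr] == 0].mean(axis=0)
    m1 = X[tr][y[tr] == 1].mean(axis=0)
    score = np.abs(m1 - m0) / (X[tr].std(axis=0) + 1e-12)
    top_k = np.argsort(-score)[:k]
    selected = acc(top_k)

    # Repeated random subsets of the same size k.
    random_accs = [
        acc(rng.choice(X.shape[1], size=k, replace=False))
        for _ in range(n_repeats)
    ]
    return {
        "full": full,
        "selected": selected,
        "random_mean": float(np.mean(random_accs)),
        "random_std": float(np.std(random_accs)),
    }


if __name__ == "__main__":
    for name, value in compare_feature_sets().items():
        print(f"{name}: {value:.3f}")
```

How close the random subsets come to the selected features in such a sketch depends heavily on the assumed fraction of informative features and their effect size, so the numbers it prints say nothing about the paper's 30 real datasets.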

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 10·Confidence 4

Strengths

ML as a community needs more papers like this that challenge conventional norms. The results are quite powerful and a bit surprising. Having analyzed many of these datasets, I couldn't help but wonder why others haven't published such papers earlier. A biologist can have one of two reactions to the claims in this paper: "wow, this is surprising/shocking" or "the analysis is flawed because of ..."! For either of these extreme reactions, this paper may be worth accepting to provoke deeper discussions. I

Weaknesses

The paper may have flaws in the analysis, but I think the authors have been honest about it, have opened up their code, and have tried to validate their results independently.

Reviewer 02·Rating 2·Confidence 5

Strengths

The authors address an important question regarding the utility of feature selection in high-dimensional datasets, a topic with significant implications for machine learning and computational biology.

Weaknesses

The paper has several limitations that reduce the strength of its conclusions:

1) The dataset selection process is not described. The authors provide no inclusion or exclusion criteria, making it unclear how the 30 datasets were chosen.
2) The dataset pool is heavily biased toward cancer-related gene expression studies from the Gene Expression Omnibus (GEO), yet the conclusions are generalized to the entire field of feature selection.
3) Cancer and inflammation datasets are known to display larg

Reviewer 03·Rating 2·Confidence 3

Strengths

- Reporting results on a long list of datasets. This is common in feature selection literature.
- Demonstrating with workflows implementing established methods
- Proposing a metric, minimum sufficient random sample size, that interpretably evaluates the collective strength of the features of a dataset

Weaknesses

- The results do not seem to support a key claim of the paper (randomly selected features performing at least as well as cleverly selected features)
  - In Table 2, all values in column D are lower than the corresponding values in column A.
  - In Table 2, 2 out of 6 values in column E are higher than those in column A, but this is problematic:
    - The difference between columns D and E is that E uses an ensemble of classifiers. Column A could presumably also benefit from such ensembling.
