Feature selection for high-dimensional integrated data

Charles Zheng; Scott Schwartz; Robert Chapkin; Raymond Carroll; Ivan Ivanov

arXiv:1111.6283 · stat.AP · November 29, 2011 · SDM

Open Access

TL;DR

This paper introduces a feature selection model for high-dimensional data in which only a subset of the predictors depends on a multidimensional response, and compares thresholding and SVD selection methods through simulations and an application to real biological data.

Contribution

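It proposes a feature selection framework that separates predictors dependent on the response from an independent noise set, and evaluates thresholding and SVD selection methods, using stochastic optimization to bound the small-sample accuracy of an asymptotic approximation.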

Findings

01 Thresholding and SVD methods perform well in simulations.

02 Empirical bounds on small-sample accuracy are established.

03 Methods demonstrate utility on gene expression and metagenomics data.

Abstract

Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors $X_{t}$ are dependent on the multidimensional variate $Y$, and the remainder of the predictors constitute a "noise set" $X_{u}$ independent of $Y$. Using Monte Carlo simulations, we investigated the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine "empirical bounds" on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods on a recent infant intestinal gene expression and metagenomics dataset.
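The abstract does not spell out the exact statistics behind the two methods, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes that thresholding scores each predictor by its largest absolute sample correlation with any component of $Y$, and that the SVD method ranks predictors by their loading on the leading left singular vector of the sample cross-correlation matrix. The function names, the toy data-generating model, and the values of `tau` and `k` are all hypothetical.

```python
import numpy as np


def simulate(n=1000, p=50, q=4, k=5, noise_sd=0.5, seed=0):
    """Toy data (assumed model): the first k predictors depend on Y, the rest are noise."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, q))
    # Shared unit-variance signal built from Y; an illustrative choice only.
    z = Y.sum(axis=1) / np.sqrt(q)
    X_t = z[:, None] + noise_sd * rng.standard_normal((n, k))  # dependent set
    X_u = rng.standard_normal((n, p - k))                      # noise set, independent of Y
    return np.hstack([X_t, X_u]), Y


def cross_correlation(X, Y):
    """Sample cross-correlation matrix between columns of X and columns of Y (p x q)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]


def threshold_select(C, tau):
    """Keep predictors whose largest absolute correlation with any Y column exceeds tau."""
    scores = np.abs(C).max(axis=1)
    return np.flatnonzero(scores > tau)


def svd_select(C, k):
    """Keep the k predictors with the largest absolute loading on the leading left singular vector."""
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    loadings = np.abs(U[:, 0])  # absolute value removes the sign ambiguity of the SVD
    return np.sort(np.argsort(-loadings)[:k])


X, Y = simulate()
C = cross_correlation(X, Y)
print("thresholding:", threshold_select(C, tau=0.3))
print("SVD:", svd_select(C, k=5))
```

In this toy setup the dependent predictors share a single signal derived from $Y$, so one singular vector is enough to capture them; with several distinct dependence patterns, a sketch like this would need to pool loadings over more than one singular vector.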

Taxonomy

Topics: Gene expression and cancer classification · Statistical Methods and Inference · Bayesian Methods and Mixture Models