Identification of Signal, Noise, and Indistinguishable Subsets in High-Dimensional Data Analysis
X. Jessie Jeng

TL;DR
This paper introduces a statistical framework for categorizing high-dimensional data into signal, noise, and indistinguishable subsets, aiding in efficient data analysis and follow-up studies.
Contribution
It develops a data-driven procedure that adaptively identifies the three subsets, with theoretical guarantees and practical validation.
Findings
The procedure accurately separates signals from noise under certain conditions.
It adapts to unknown signal strengths, reducing the indistinguishable subset as signals strengthen.
The procedure is validated through simulations and an application to real genomic data.
Abstract
Motivated by applications in high-dimensional data analysis where strong signals often stand out easily and weak ones may be indistinguishable from the noise, we develop a statistical framework to provide a novel categorization of the data into the signal, noise, and indistinguishable subsets. The three-subset categorization is especially relevant under high-dimensionality as a large proportion of signals can be obscured by the large amount of noise. Understanding the three-subset phenomenon is important for the researchers in real applications to design efficient follow-up studies. For example, candidates belonging to the signal subset may have priority for more focused study, while those in the noise subset can be removed; and, for candidates in the indistinguishable subset, additional data may be collected to further separate weak signals from the noise. We develop an efficient…
Taxonomy
Topics: Gene expression and cancer classification · Algorithms and Data Compression · Genomics and Chromatin Dynamics
