Variable Selection for Multi-Source Count Data with Controlled False Discovery Rate
Shan Tang, Shanjun Mao, Shourong Ma, Falong Tan

TL;DR
This paper introduces ZIPG-SK, a novel variable selection method for multi-source count data that controls the false discovery rate and improves power by modeling zero inflation and skewness. It is validated through simulations and on real biomedical datasets.
Contribution
We develop ZIPG-SK, a new FDR-controlled variable selection method tailored for multi-source count data using a Gaussian copula and e-value aggregation, addressing zero inflation and skewness.
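The contribution mentions e-value aggregation without giving details in this summary. As a rough illustration only, one standard way to combine e-values from several runs is to average them and then apply the e-BH procedure at a target FDR level. Whether ZIPG-SK uses exactly this scheme is an assumption; the function names `average_e_values` and `ebh` are hypothetical and not taken from the paper.

```python
def average_e_values(runs):
    """Average per-variable e-values across runs.

    `runs` is a list of equal-length lists, one per run (assumed layout).
    The arithmetic mean of e-values is itself an e-value.
    """
    n_runs = len(runs)
    return [sum(column) / n_runs for column in zip(*runs)]


def ebh(e_values, alpha=0.1):
    """e-BH procedure: return sorted indices of the selected variables.

    Selects the k_hat variables with the largest e-values, where k_hat is
    the largest k such that the k-th largest e-value is >= p / (alpha * k).
    """
    p = len(e_values)
    # Variable indices ordered from largest to smallest e-value.
    order = sorted(range(p), key=lambda j: e_values[j], reverse=True)
    k_hat = 0
    for k, j in enumerate(order, start=1):
        if e_values[j] >= p / (alpha * k):
            k_hat = k
    return sorted(order[:k_hat])


# Toy example with made-up e-values for four variables over two runs.
runs = [[60, 1, 20, 0], [40, 1, 40, 1]]
selected = ebh(average_e_values(runs), alpha=0.1)
print(selected)
```

The toy numbers are illustrative, not results from the paper.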
Findings
ZIPG-SK outperforms existing methods in simulations.
It effectively identifies key variables in biomedical datasets.
The method provides new mechanistic insights into disease data.
Abstract
The rapid generation of complex, highly skewed, and zero-inflated multi-source count data poses significant challenges for variable selection, particularly in biomedical domains like tumor development and metabolic dysregulation. To address this, we propose a new variable selection method, Zero-Inflated Poisson-Gamma Simultaneous Knockoff (ZIPG-SK), specifically designed for multi-source count data. Our method leverages a Gaussian copula based on the Zero-Inflated Poisson-Gamma (ZIPG) distribution to construct knockoffs that properly account for the properties of count data, including high skewness and zero inflation, while effectively incorporating covariate information. This framework enables the detection of common features across multi-source datasets with guaranteed false discovery rate (FDR) control. Furthermore, we enhance the power of the method by incorporating e-value…
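To make the zero inflation and skewness described in the abstract concrete, here is a minimal sketch that draws counts from a generic zero-inflated Poisson-Gamma mixture using NumPy. It only illustrates the marginal distribution; it does not build the Gaussian copula or the knockoffs. The parameterization (`mean`, `dispersion`, `zero_prob`) and the function name `sample_zipg` are assumptions made for illustration and may differ from the paper's.

```python
import numpy as np


def sample_zipg(n, mean, dispersion, zero_prob, rng=None):
    """Draw n counts from a zero-inflated Poisson-Gamma mixture (assumed form).

    A Gamma multiplier with mean 1 and variance `dispersion` scales the
    Poisson rate, which produces overdispersed, right-skewed counts.
    Each draw is then set to zero with probability `zero_prob`.
    """
    rng = np.random.default_rng(rng)
    # Gamma(shape=1/d, scale=d) has mean 1 and variance d.
    multiplier = rng.gamma(shape=1.0 / dispersion, scale=dispersion, size=n)
    counts = rng.poisson(mean * multiplier)
    # Structural zeros, on top of zeros the Poisson-Gamma part already yields.
    structural_zero = rng.random(n) < zero_prob
    counts[structural_zero] = 0
    return counts


counts = sample_zipg(10_000, mean=5.0, dispersion=2.0, zero_prob=0.3, rng=0)
print("fraction of zeros:", (counts == 0).mean())
print("sample mean:", counts.mean())
```

Because zeros arise both structurally and from the Poisson-Gamma part, the observed fraction of zeros exceeds `zero_prob`.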
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Data-Driven Disease Surveillance · Data Stream Mining Techniques
