Variable Selection for Multi-Source Count Data with Controlled False Discovery Rate

Shan Tang; Shanjun Mao; Shourong Ma; Falong Tan

arXiv:2411.18986 · stat.AP · November 11, 2025

Open Access

TL;DR

This paper introduces ZIPG-SK, a variable selection method for multi-source count data that controls the false discovery rate (FDR) and improves power by modeling zero inflation and skewness. It is validated through simulations and on real biomedical datasets.

Contribution

We develop ZIPG-SK, a new FDR-controlled variable selection method tailored for multi-source count data using a Gaussian copula and e-value aggregation, addressing zero inflation and skewness.
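As a rough illustration of what e-value based selection can look like, the sketch below averages e-values for the same hypotheses and then applies the generic e-BH procedure. It is not the paper's implementation: the function name `ebh`, the toy e-values, the interpretation of rows as repeated runs, and the level `alpha=0.2` are all assumptions made for the example, and the aggregation scheme actually used by ZIPG-SK is not specified on this page.

```python
import numpy as np

def ebh(e_values, alpha=0.1):
    """Generic e-BH: reject the k* hypotheses with the largest e-values,
    where k* is the largest k whose k-th largest e-value is >= m / (alpha * k)."""
    e = np.asarray(e_values, dtype=float)
    m = e.size
    order = np.argsort(-e)                    # indices by decreasing e-value
    k = np.arange(1, m + 1)
    ok = e[order] >= m / (alpha * k)          # e-BH condition for each k
    if not ok.any():
        return np.array([], dtype=int)        # nothing can be rejected
    k_star = np.max(np.nonzero(ok)[0]) + 1
    return np.sort(order[:k_star])

# Hypothetical e-values: rows are repeated runs, columns are candidate variables.
e_by_run = np.array([
    [40.0, 0.5, 30.0, 1.0, 0.2],
    [60.0, 1.5, 50.0, 0.0, 0.4],
])

# The arithmetic mean of e-values for the same null hypothesis is again an e-value.
e_avg = e_by_run.mean(axis=0)
selected = ebh(e_avg, alpha=0.2)
print(selected)
```

With these toy numbers the averaged e-values are 50, 1, 40, 0.5 and 0.3, and only the first and third variables clear the e-BH thresholds.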

Findings

01. ZIPG-SK outperforms existing methods in simulations.

02. It effectively identifies key variables in biomedical datasets.

03. The method provides new mechanistic insights into disease data.

Abstract

The rapid generation of complex, highly skewed, and zero-inflated multi-source count data poses significant challenges for variable selection, particularly in biomedical domains like tumor development and metabolic dysregulation. To address this, we propose a new variable selection method, Zero-Inflated Poisson-Gamma Simultaneous Knockoff (ZIPG-SK), specifically designed for multi-source count data. Our method leverages a Gaussian copula based on the Zero-Inflated Poisson-Gamma (ZIPG) distribution to construct knockoffs that properly account for the properties of count data, including high skewness and zero inflation, while effectively incorporating covariate information. This framework enables the detection of common features across multi-source datasets with guaranteed false discovery rate (FDR) control. Furthermore, we enhance the power of the method by incorporating e-value…
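To make the data properties named in the abstract concrete, here is a minimal sketch of sampling from a zero-inflated Poisson-Gamma mixture and checking its zero fraction and skewness. The function `sample_zipg`, its parameterization (`mean`, `dispersion`, `zero_prob`), and the chosen values are assumptions for illustration only. The paper's ZIPG model, which also incorporates covariates, may be parameterized differently, and the knockoff construction itself is not shown.

```python
import numpy as np

def sample_zipg(n, mean, dispersion, zero_prob, rng):
    """Draw n counts from a zero-inflated Poisson-Gamma mixture (illustrative).

    With probability zero_prob a structural zero is emitted; otherwise the
    count is Poisson(lam) with lam ~ Gamma(shape=1/dispersion,
    scale=mean*dispersion), so the Gamma rate has expectation `mean`.
    """
    lam = rng.gamma(shape=1.0 / dispersion, scale=mean * dispersion, size=n)
    counts = rng.poisson(lam)
    structural_zero = rng.random(n) < zero_prob
    counts[structural_zero] = 0               # overwrite with structural zeros
    return counts

rng = np.random.default_rng(0)
y = sample_zipg(10_000, mean=5.0, dispersion=2.0, zero_prob=0.3, rng=rng)

zero_fraction = np.mean(y == 0)               # structural plus sampling zeros
skewness = np.mean((y - y.mean()) ** 3) / y.std() ** 3
print(f"zero fraction: {zero_fraction:.2f}, skewness: {skewness:.2f}")
```

The zero fraction exceeds `zero_prob` because the Poisson-Gamma component produces zeros of its own, and the positive skewness reflects the long right tail typical of such counts.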

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Data-Driven Disease Surveillance · Data Stream Mining Techniques