A critical comparison of handling zeros in high-dimensional compositional count data
Wenqi Tang, Kamila Fačevicová, Klaus Nordhausen, and Sara Taskinen

TL;DR
This paper reviews zero-handling strategies in high-dimensional compositional count data from sequencing, highlighting challenges, evaluating imputation methods, and discussing future directions for robust analysis.
Contribution
It systematically compares existing zero-handling methods for compositional count data, emphasizing their limitations and proposing directions for improved approaches.
Findings
The performance of imputation strategies varies with the discreteness of the data and the degree of zero inflation.
Violations of continuity assumptions cause numerical instability and bias.
Current methods often fail to jointly address compositional constraints and zero inflation.
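The instability named in the findings above can be illustrated with a minimal sketch. It applies the centered log-ratio (CLR) transform to a count vector that contains zeros, then to the same vector after adding a constant pseudocount. The count values, the function names, and the choice of a pseudocount replacement are assumptions made for illustration only; they are not taken from the paper and do not represent any specific method it evaluates.

```python
import numpy as np

def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def clr_with_pseudocount(counts, pseudocount):
    """Add a constant to every count before CLR (illustrative replacement only)."""
    return clr(np.asarray(counts, dtype=float) + pseudocount)

# Hypothetical read counts for five taxa in one sample; two are zero.
counts = np.array([120, 0, 3, 0, 45])

# log(0) is -inf, so the mean of the logs is -inf and no CLR coordinate is finite.
with np.errstate(divide="ignore", invalid="ignore"):
    raw = clr(counts)

# A pseudocount makes the transform finite, but the result depends on its value.
a = clr_with_pseudocount(counts, 0.5)
b = clr_with_pseudocount(counts, 1e-6)

print("raw CLR:            ", raw)
print("CLR, pseudocount 0.5:", np.round(a, 3))
print("CLR, pseudocount 1e-6:", np.round(b, 3))
```

Comparing the last two printed vectors shows that the coordinates of the nonzero parts also shift with the pseudocount, because every CLR coordinate shares the same centering term.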
Abstract
The growing use of high-throughput sequencing (HTS) has enabled the large-scale production of compositional count data, driving progress in microbiome research. However, such count data are often high-dimensional, over-dispersed, and heavily zero-inflated, and they conflict with the continuity assumptions underlying log-ratio-based compositional data analysis (CoDA), creating substantial methodological challenges. This review provides an overview of zero-handling strategies in compositional data, covering zero-tolerant transformations, imputation approaches for rounded zeros, and statistical models for essential zeros. We specifically highlight the problems that arise when applying the log-ratio framework to sequencing-derived compositional count data, where violations of continuity can induce numerical instabilities and biased statistical inferences. Motivated by these issues, we…
