# Compositional data modeling of high-dimensional single cell RNA-seq (CoDA-hd): its advantages over commonly used normalization approaches

**Authors:** Jinghan Huang, Sheung Chi Phillip Yam, K. S. Leung, Minghua Deng, Nelson L. S. Tang

PMC · DOI: 10.1186/s12967-025-07157-z · Journal of Translational Medicine · 2025-10-21

## TL;DR

This paper applies compositional data analysis (CoDA) log-ratio transformations to high-dimensional single-cell RNA-seq count matrices and reports advantages over log-normalization in dimension-reduction visualization, clustering, and trajectory inference on the tested real and simulated datasets.

## Contribution

The paper introduces CoDA-hd, a compositional data approach for high-dimensional scRNA-seq data that handles sparsity through count addition schemes and produces log-ratio representations intended to be compatible with Euclidean downstream analyses.

## Key findings

- Count addition schemes such as SGM allow CoDA to be applied to high-dimensional, sparse scRNA-seq data.
- CoDA LR transformations such as count-added CLR showed advantages in dimension-reduction visualization, clustering, and trajectory inference on the tested real and simulated datasets.
- CLR improved Slingshot trajectory inference and eliminated a suspicious trajectory that was probably caused by dropouts.

## Abstract

Compositional data analysis (CoDA) is an emerging statistical framework that has been extended to microbiome data, bulk RNA-seq, and cell type proportions in single-cell RNA-seq (scRNA-seq), which typically has 50–200 components. Here, we explore the high-dimensional application of CoDA (CoDA-hd) and its various log-ratio (LR) transformations to the raw count matrix of scRNA-seq, which has over 20,000 components (e.g., protein-coding genes). scRNA-seq matrices are typically sparse and high-dimensional. Common normalization approaches such as log-normalization may lead to suspicious findings, as previously shown for trajectory inference. Although RNA-seq data are compositional by nature, the geometry of CoDA in the high-dimensional simplex is not compatible with most downstream analyses of scRNA-seq, which are based on Euclidean space. In this study, we explore: (1) the adaptability of CoDA to scRNA-seq; (2) the handling of zeros: prior log-normalization, imputation, or a specific count addition scheme; (3) transformation to Euclidean space and compatibility with downstream analyses.
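Point (3) rests on a standard CoDA fact: the centered log-ratio (CLR) maps a strictly positive composition to real coordinates in which the ordinary Euclidean distance equals the Aitchison distance on the simplex. The sketch below checks this numerically. The function names and the toy vectors are illustrative assumptions and are not taken from the paper or from the CoDAhd package.

```python
import numpy as np


def clr(x):
    """Centered log-ratio of one strictly positive composition."""
    log_x = np.log(np.asarray(x, dtype=float))
    # Subtracting the mean log equals dividing by the geometric mean.
    return log_x - log_x.mean()


def aitchison_distance(x, y):
    """Aitchison distance computed from all pairwise log-ratios."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    d = lx.size
    # Pairwise log-ratio matrices: entry (i, j) is log(x_i / x_j).
    ratios_x = lx[:, None] - lx[None, :]
    ratios_y = ly[:, None] - ly[None, :]
    return np.sqrt(((ratios_x - ratios_y) ** 2).sum() / (2 * d))


# Toy compositions (hypothetical values, no zeros).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.0, 1.0, 2.0, 8.0])

d_simplex = aitchison_distance(x, y)
d_euclid = np.linalg.norm(clr(x) - clr(y))
print(d_simplex, d_euclid)  # the two values agree up to rounding
```

This equivalence is why CLR-transformed data can be handed to Euclidean tools such as PCA or k-nearest-neighbor graphs. It holds only for strictly positive inputs, which is where zero handling becomes necessary.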

Our results suggest that (1) the innovative count addition schemes (e.g., SGM) enable the application of CoDA to high-dimensional sparse data (i.e., scRNA-seq); (2) log-normalized data can be transformed to a CoDA LR representation; (3) CoDA LR transformations such as the count-added centered log-ratio (CLR) had some advantages in dimension-reduction visualization, clustering, and trajectory inference in the tested real and simulated datasets. CLR provided more distinct and well-separated clusters in dimension reductions, improved Slingshot trajectory inference, and eliminated the suspicious trajectory that was probably caused by dropouts.
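To make the idea of a count-added CLR concrete, here is a minimal sketch that adds a count to every entry of a sparse matrix before taking the CLR per cell. It assumes a uniform pseudocount of 1 and a cells-by-genes layout. This is a stand-in for illustration only: it is not the SGM scheme described in the paper, and `clr_with_pseudocount` is a hypothetical helper, not a function of the CoDAhd package.

```python
import numpy as np


def clr_with_pseudocount(counts, pseudocount=1.0):
    """CLR per cell (row) after adding a uniform pseudocount.

    Assumption: a uniform pseudocount replaces the paper's count
    addition schemes, whose details are not given in this summary.
    """
    counts = np.asarray(counts, dtype=float)
    log_x = np.log(counts + pseudocount)  # zeros become log(pseudocount)
    # Center each cell by its own mean log value (log geometric mean).
    return log_x - log_x.mean(axis=1, keepdims=True)


# Toy sparse count matrix: 3 cells x 5 genes (hypothetical values).
counts = np.array([
    [0, 3, 0, 12, 0],
    [5, 0, 0, 1, 2],
    [0, 0, 7, 0, 0],
])

clr_matrix = clr_with_pseudocount(counts)
print(clr_matrix.round(3))
```

Each row of `clr_matrix` sums to zero, and all zero counts within a cell share the same negative value. How that value is chosen is exactly what a count addition scheme controls.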

We therefore conclude that CoDA may be a preferred scale-free model for handling scRNA-seq data in these downstream tasks. Additionally, an R package, 'CoDAhd', was developed for conducting CoDA LR transformations on high-dimensional scRNA-seq data. The code for implementing CoDA-hd, along with some example datasets, is available at https://github.com/GO3295/CoDAhd.
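One way to read "scale-free" here is the scale invariance of log-ratios: multiplying every count in a cell by the same factor, as a change in sequencing depth would, leaves the CLR unchanged. The sketch below illustrates this on a strictly positive toy vector. The names and values are assumptions for illustration and do not come from the paper.

```python
import numpy as np


def clr(x):
    """Centered log-ratio of one strictly positive composition."""
    log_x = np.log(np.asarray(x, dtype=float))
    return log_x - log_x.mean()


# The same hypothetical cell observed at two sequencing depths.
cell = np.array([10.0, 40.0, 50.0, 100.0])
deeper = 3.0 * cell

same_clr = np.allclose(clr(cell), clr(deeper))
same_log = np.allclose(np.log(cell), np.log(deeper))
print(same_clr, same_log)
```

The plain log values shift by log(3), while the CLR values coincide. Once a pseudocount is added to handle zeros, the invariance holds only approximately, because the added count does not scale with depth.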

The online version contains supplementary material available at 10.1186/s12967-025-07157-z.

## Full-text entities

- **Genes:** YWHAZ (tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein zeta) [NCBI Gene 7534] {aka 14-3-3-zeta, HEL-S-3, HEL-S-93, HEL4, KCIP-1, POPCHAS}, ACTB (actin beta) [NCBI Gene 60] {aka BKRNS, BNS, BRWS1, CSMH, DDS1, PS1TP5BP1}, PKD2 (polycystin 2, transient receptor potential cation channel) [NCBI Gene 5311] {aka APKD2, PC2, PKD4, Pc-2, TRPP2}, DCLK3 (doublecortin like kinase 3) [NCBI Gene 85443] {aka CLR, DCAMKL3, DCDC3C, DCK3}, PCSK1 (proprotein convertase subtilisin/kexin type 1) [NCBI Gene 5122] {aka BMIQ12, NEC1, PC1, PC1/3, PC3, SPC3}, GAPDH (glyceraldehyde-3-phosphate dehydrogenase) [NCBI Gene 2597] {aka G3PD, GAPD, HEL-S-162eP}, SDHA (succinate dehydrogenase complex flavoprotein subunit A) [NCBI Gene 6389] {aka CMD1GG, FP, MC2DN1, NDAXOA, PGL5, PPGL5}, UBC (ubiquitin C) [NCBI Gene 7316] {aka HMG20}
- **Diseases:** ARI (MESH:D000275), lung tumor (MESH:D008175), HD (MESH:D006816), CoDA (MESH:D058617), NMI (MESH:C537354)
- **Chemicals:** CoDA (-), DPT (MESH:C059372)
- **Cell lines:** CellBench-10X-5CL — Homo sapiens (Human), Human papillomavirus-related endocervical adenocarcinoma, Cancer cell line (CVCL_2768), -10X- — Homo sapiens (Human), Induced pluripotent stem cell (CVCL_ZD98), H9 — Homo sapiens (Human), Sezary syndrome, Cancer cell line (CVCL_1240), H1 — Homo sapiens (Human), Induced pluripotent stem cell (CVCL_HA53), H1975 — Homo sapiens (Human), Lung adenocarcinoma, Cancer cell line (CVCL_1511), HKGLR — Homo sapiens (Human), Human papillomavirus-related endocervical adenocarcinoma, Cancer cell line (CVCL_B2IE)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12539173/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12539173/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12539173/full.md

---
Source: https://tomesphere.com/paper/PMC12539173