# Semi-supervised contrastive learning variational autoencoder Integrating single-cell multimodal mosaic datasets

**Authors:** Zihao Wang, Zeyu Wu, Minghua Deng

PMC · DOI: 10.1186/s12859-025-06239-5 · 2025-08-04

## TL;DR

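This paper introduces scGCM, a variational autoencoder-based method for integrating single-cell multimodal mosaic data (datasets in which some modalities are missing) and removing batch effects; it reports better clustering accuracy and data consistency than existing integration methods.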

## Contribution

The novel contribution is scGCM, a flexible integration framework based on a variational autoencoder for single-cell multimodal mosaic data.

## Key findings

- scGCM outperforms state-of-the-art multimodal integration methods in clustering accuracy and data consistency across multiple datasets.
- The framework is designed to address high dimensionality, sparsity, and batch effects in multimodal mosaic datasets.

## Abstract

As single-cell sequencing technology has become widely used, researchers have found that single-modality data alone cannot meet the needs of studying complex biological systems. To address this, they have begun to collect multi-modal single-cell omics data simultaneously. However, different sequencing technologies often produce datasets in which one or more modalities are missing, so mosaic datasets are common in practice. The high dimensionality and sparsity of the data make analysis more difficult, and batch effects pose an additional challenge. To address these challenges, we propose scGCM, a flexible integration framework based on a variational autoencoder. The main task of scGCM is to integrate single-cell multimodal mosaic data and eliminate batch effects. The method was evaluated on multiple datasets encompassing different modalities of single-cell data. The results demonstrate that, compared to state-of-the-art multimodal data integration methods, scGCM offers significant advantages in clustering accuracy and data consistency. The source code of scGCM can be accessed at https://github.com/closmouz/scCGM.
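To make the term "mosaic dataset" concrete, the following is a minimal toy sketch, not taken from the scGCM code. The batch names, the modality names (RNA, ADT, ATAC), and the helper functions `modality_mask` and `is_mosaic` are all assumptions chosen for illustration. It only shows how batches that measure different subsets of modalities can be described with a per-batch availability mask.

```python
# Illustrative toy example only: batch names, modality names, and helpers
# are hypothetical and do not come from the scGCM implementation.
from typing import Dict, List, Set, Tuple

# Each batch lists the modalities that were actually measured for its cells.
batches: Dict[str, Set[str]] = {
    "batch_1": {"RNA", "ADT"},   # two modalities measured together
    "batch_2": {"RNA", "ATAC"},  # a different pair of modalities
    "batch_3": {"RNA"},          # a single modality only
}


def modality_mask(
    batches: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return the sorted modality names and a 0/1 availability row per batch."""
    modalities = sorted(set().union(*batches.values()))
    mask = {
        name: [1 if m in observed else 0 for m in modalities]
        for name, observed in batches.items()
    }
    return modalities, mask


def is_mosaic(mask: Dict[str, List[int]]) -> bool:
    """Treat the collection as mosaic if any batch lacks any modality."""
    return any(0 in row for row in mask.values())


modalities, mask = modality_mask(batches)
```

In this toy setup no batch observes every modality, so an integration method has to align cells across batches using whatever modalities each batch happens to share with the others.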

## Full-text entities

- **Genes:** CMAHP (cytidine monophospho-N-acetylneuraminic acid hydroxylase, pseudogene) [NCBI Gene 8418] {aka CMAH, CSAH}, CD8A (CD8 subunit alpha) [NCBI Gene 925] {aka CD8, CD8alpha, IMD116, Leu2, p32}, CD22 (CD22 molecule) [NCBI Gene 933] {aka SIGLEC-2, SIGLEC2}, BCR (BCR activator of RhoGEF and GTPase) [NCBI Gene 613] {aka ALL, BCR1, CML, D22S11, D22S662, PHL}, MS4A1 (membrane spanning 4-domains A1) [NCBI Gene 931] {aka B1, Bp35, CD20, CVID5, FMC7, LEU-16}, BANK1 (B cell scaffold protein with ankyrin repeats 1) [NCBI Gene 55024] {aka BANK}, KRT1 (keratin 1) [NCBI Gene 3848] {aka AEI2, CK1, EHK, EHK1, EPPK, K1}, MAP9 (microtubule associated protein 9) [NCBI Gene 79884] {aka ASAP}, NKG7 (natural killer cell granule protein 7) [NCBI Gene 4818] {aka GIG1, GMP-17, p15-TIA-1}, TCF4 (transcription factor 4) [NCBI Gene 6925] {aka CDG2T, E2-2, FCD2, FECD3, ITF-2, ITF2}
- **Diseases:** tumor (MESH:D009369), ARI (MESH:D000275)
- **Chemicals:** ADT (-)
- **Species:** Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]
- **Cell lines:** ASAP-10X — Mus musculus (Mouse), Hybridoma (CVCL_C2LW)

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12323256/full.md

---
Source: https://tomesphere.com/paper/PMC12323256