# scVGAMF: a novel imputation method for scRNA-seq data by integrating linear and non-linear features

**Authors:** Zhiyuan Zhou, Wei Zhang, Xiaoying Zheng, Juan Shen, Yuanyuan Li

PMC · DOI: 10.1093/bib/bbaf562 · 2025-10-27

## TL;DR

scVGAMF imputes dropout events in scRNA-seq data by combining linear features from non-negative matrix factorization with non-linear features from two variational graph autoencoders.

## Contribution

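scVGAMF integrates linear features (extracted by non-negative matrix factorization) and non-linear features (captured by variational graph autoencoders) in a single scRNA-seq imputation model, with a fully connected neural network combining both to predict missing values.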
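
For orientation, the two building blocks named here are usually written as follows. These are the textbook formulations, given as a reminder only; the paper's exact losses, regularizers, and similarity-graph construction are not stated on this page and may differ. The symbols ($X$, $W$, $H$, $A$, $Z$, $k$) are notation chosen for this note.

$$
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2,
\qquad X \in \mathbb{R}_{\ge 0}^{n \times m},\; W \in \mathbb{R}_{\ge 0}^{n \times k},\; H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

$$
\mathcal{L}_{\text{VGAE}} = \mathbb{E}_{q(Z \mid X, A)}\big[\log p(A \mid Z)\big] - \mathrm{KL}\big(q(Z \mid X, A) \,\Vert\, p(Z)\big),
\qquad p(A_{ij} = 1 \mid z_i, z_j) = \sigma\big(z_i^{\top} z_j\big)
$$

The first objective is linear in the sense that each entry of $X$ is approximated by an inner product of non-negative factors. The second is maximized over a graph-convolutional encoder $q$, which is what lets a similarity graph $A$ (gene-gene or cell-cell) contribute non-linear structure.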

## Key findings

- scVGAMF outperforms existing methods in gene expression recovery, cell clustering accuracy, differential gene identification, and pseudo-trajectory analysis.
- Ablation studies show that integrating linear and non-linear features significantly improves imputation performance compared with using either alone.
- These results hold on both simulated dropout datasets and real scRNA-seq data.
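
To make "gene expression recovery on simulated dropout datasets" concrete, here is a minimal sketch of how such an evaluation is commonly set up: hide a fraction of the observed non-zero entries, impute, and score only the hidden entries. The dropout rate, the Poisson toy data, the RMSE and Pearson metrics, and the gene-mean baseline are all assumptions made for illustration; the paper's actual simulation protocol and metrics are not given on this page.

```python
import numpy as np

def simulate_dropout(X, rate, rng):
    """Zero out a random fraction `rate` of the non-zero entries of X."""
    X = np.asarray(X, dtype=float)
    nz_rows, nz_cols = np.nonzero(X)
    n_mask = int(round(rate * nz_rows.size))
    pick = rng.choice(nz_rows.size, size=n_mask, replace=False)
    mask = np.zeros(X.shape, dtype=bool)
    mask[nz_rows[pick], nz_cols[pick]] = True
    X_drop = X.copy()
    X_drop[mask] = 0.0
    return X_drop, mask

def gene_mean_impute(X_drop):
    """Hypothetical baseline: fill zeros with the gene's mean non-zero value."""
    nonzero = X_drop > 0
    counts = np.maximum(nonzero.sum(axis=0), 1)  # avoid division by zero
    means = X_drop.sum(axis=0) / counts
    return np.where(nonzero, X_drop, means)

def recovery_scores(X_true, X_imputed, mask):
    """RMSE and Pearson correlation restricted to the hidden entries."""
    t, p = X_true[mask], X_imputed[mask]
    rmse = float(np.sqrt(np.mean((t - p) ** 2)))
    pcc = float(np.corrcoef(t, p)[0, 1])
    return rmse, pcc

# Toy data: rows are cells, columns are genes (an assumed layout).
rng = np.random.default_rng(0)
gene_rates = rng.uniform(1.0, 10.0, size=30)
X_true = rng.poisson(gene_rates, size=(50, 30)).astype(float)

X_drop, mask = simulate_dropout(X_true, 0.3, rng)
rmse_mean, pcc_mean = recovery_scores(X_true, gene_mean_impute(X_drop), mask)
rmse_zero = float(np.sqrt(np.mean(X_true[mask] ** 2)))  # score with no imputation
print(f"no imputation RMSE={rmse_zero:.3f}, gene-mean RMSE={rmse_mean:.3f}")
```

Scoring only the masked entries matters: entries that were never hidden are identical before and after imputation and would otherwise dilute the error.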

## Abstract

Single-cell RNA sequencing (scRNA-seq) is crucial for elucidating gene expression dynamics and cellular heterogeneity at the individual cell level, thereby advancing our understanding of transcriptional regulation across distinct cell populations. However, a significant challenge in scRNA-seq data analysis is the prevalence of dropout events, which complicate downstream analyses. Most existing imputation tools either rely solely on linear assumptions or overlook the non-linear regulatory relationships embedded in the data. To address this issue, we propose single-cell variational graph autoencoder and matrix factorization (scVGAMF), a novel imputation method that integrates both linear and non-linear features. Specifically, scVGAMF first identifies highly variable genes and partitions them into groups. Cells are then clustered by applying spectral clustering to the principal component analysis results of the representative groups. Based on the resulting submatrices, along with the gene similarity and cell–cell similarity matrices, scVGAMF employs non-negative matrix factorization to extract underlying linear features while utilizing two variational graph autoencoders to capture non-linear features. A fully connected neural network then integrates these features to predict missing values. Extensive experimental evaluations on simulated dropout datasets and real scRNA-seq data demonstrate that scVGAMF outperforms existing methods in gene expression recovery, cell clustering accuracy, differential gene identification, and pseudo-trajectory analysis. Furthermore, ablation studies confirm that the integration of both linear and non-linear features significantly enhances overall data imputation performance.
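
The linear branch of the pipeline described above can be illustrated with a small NumPy sketch: pick highly variable genes, factorize the submatrix with non-negative matrix factorization, and read candidate values for the zeros off the reconstruction. This is a simplified stand-in, not the authors' implementation. In particular, ranking genes by raw variance, treating every zero as missing, the masked multiplicative updates, and filling zeros directly from the NMF reconstruction are assumptions; in scVGAMF the NMF features are instead passed, together with the features from the two variational graph autoencoders, to a fully connected network. The function names `select_hvg`, `masked_nmf`, and `impute_zeros` are hypothetical.

```python
import numpy as np

def select_hvg(X, n_top):
    """Column indices of the n_top genes with the largest variance."""
    return np.argsort(X.var(axis=0))[::-1][:n_top]

def masked_nmf(X, observed, rank, n_iter=200, eps=1e-9, seed=0):
    """NMF fitted on observed entries only, via multiplicative updates.

    Minimizes || M * (X - W H) ||_F^2 with W, H >= 0, where M is the
    0/1 mask of observed entries and * is the elementwise product.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    M = observed.astype(float)
    MX = M * X
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + eps)
    return W, H

def impute_zeros(X, rank=3, n_iter=200):
    """Keep observed values; fill zeros with the low-rank reconstruction."""
    observed = X > 0
    W, H = masked_nmf(X, observed, rank, n_iter)
    return np.where(observed, X, W @ H)

# Toy example: a strictly positive rank-3 matrix (cells x genes) with 20% dropout.
rng = np.random.default_rng(1)
X_true = (rng.random((40, 3)) + 0.1) @ (rng.random((3, 20)) + 0.1)
dropped = rng.random(X_true.shape) < 0.2
X_drop = np.where(dropped, 0.0, X_true)

hvg = select_hvg(X_drop, 10)  # restrict to the 10 most variable genes
X_sub = X_drop[:, hvg]
imputed = impute_zeros(X_sub, rank=3, n_iter=500)

hidden = dropped[:, hvg]
rmse_nmf = float(np.sqrt(np.mean((imputed[hidden] - X_true[:, hvg][hidden]) ** 2)))
rmse_zero = float(np.sqrt(np.mean(X_true[:, hvg][hidden] ** 2)))
print(f"zeros left as-is RMSE={rmse_zero:.3f}, masked NMF RMSE={rmse_nmf:.3f}")
```

Fitting on observed entries only keeps the dropout zeros from pulling the factors toward zero, which is the main reason to mask rather than factorize the raw matrix in this sketch.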

## Full-text entities

- **Genes:** DNMT3B (DNA methyltransferase 3 beta) [NCBI Gene 1789] {aka FSHD4, ICF, ICF1, M.HsaIIIB}, ERBB4 (erb-b2 receptor tyrosine kinase 4) [NCBI Gene 2066] {aka ALS19, HER4, p180erbB4}, PAX6 (paired box 6) [NCBI Gene 5080] {aka AN, AN1, AN2, ASGD5, D11S812E, FVH1}, MYCT1 (MYC target 1) [NCBI Gene 80177] {aka MTLC}, LHX1 (LIM homeobox 1) [NCBI Gene 3975] {aka LIM-1, LIM1}, LEFTY1 (left-right determination factor 1) [NCBI Gene 10637] {aka LEFTB, LEFTYB}, NODAL (nodal growth differentiation factor) [NCBI Gene 4838] {aka HTX5}
- **Species:** Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]
- **Cell lines:** H1 — Homo sapiens (Human), Induced pluripotent stem cell (CVCL_HA53), H9 — Homo sapiens (Human), Sezary syndrome, Cancer cell line (CVCL_1240)

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12554638/full.md

---
Source: https://tomesphere.com/paper/PMC12554638