# scSemiPLC: a semi-supervised learning framework for annotating single-cell RNA-Seq data by generating pseudo-labels through clustering

**Authors:** QianYi Ma, LinJie Wang, Wei Li

PMC · DOI: 10.1128/msystems.00223-25 · mSystems · 2025-12-08

## TL;DR

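This paper introduces scSemiPLC, a semi-supervised framework for automatically annotating cell types in single-cell RNA-seq data. It generates pseudo-labels for unlabeled cells through clustering, which makes fuller use of unlabeled data and improves annotation accuracy and efficiency.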

## Contribution

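scSemiPLC uses existing labels to guide the clustering of unlabeled cells, assigns pseudo-labels from that clustering, weights them by estimated reliability, and applies consistency regularization so that predictions on perturbed data stay close to the pseudo-labels.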

## Key findings

- scSemiPLC outperforms classical automatic annotation methods and mainstream semi-supervised learning methods in annotation accuracy and stability.
- The method extracts biologically meaningful representations from single-cell data.
- Its performance is robust to variation in the number of labeled cells.

## Abstract

Single-cell RNA sequencing (scRNA-seq) technology enables researchers to explore heterogeneity of diverse cell types within complex tissues at the single-cell resolution. Cell annotation, as a crucial step in scRNA-seq data analysis, provides biologically meaningful cell identity information for biological research. With the proliferation of publicly available datasets and the expansion of sequencing data scale, traditional annotation methods reliant on manual marker gene matching have become increasingly cumbersome and time-consuming. Consequently, efficient and convenient automated cell annotation methods are gradually becoming mainstream. In this paper, we propose a single-cell semi-supervised annotation training framework called scSemiPLC, which generates pseudo-labels through clustering and consistency regularization. Specifically, scSemiPLC utilizes existing label information to guide the clustering of unlabeled data. During model training, it assigns pseudo-labels to the unlabeled samples and constrains the prediction of perturbed data to be similar to the pseudo-labels. This strategy addresses the low utilization of unlabeled data caused by the fixed high threshold pseudo-labeling paradigm, offering a new approach for cell annotation in the semi-supervised learning field. Experimental results demonstrate the superior performance of scSemiPLC in annotation accuracy and stability, extraction of biologically meaningful representations, and robustness to the number of cell labels, significantly outperforming classical automatic annotation and mainstream semi-supervised learning methods.
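The label-guided clustering step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a seeded k-means style procedure in which centroids start at the class means of the labeled cells and unlabeled cells take the label of their nearest centroid. The function name `cluster_pseudo_labels`, the Euclidean distance, and the fixed iteration count are all illustrative assumptions.

```python
import numpy as np

def cluster_pseudo_labels(x_lab, y_lab, x_unl, n_iter=10):
    """Assign pseudo-labels to unlabeled cells by label-seeded clustering.

    Hypothetical sketch: centroids start at the class means of the labeled
    cells and are refined with k-means-style updates over all cells.
    """
    classes = np.unique(y_lab)
    # Map each labeled cell to the index of its class in `classes`.
    y_lab_idx = np.searchsorted(classes, y_lab)
    # Seed one centroid per known cell type from the labeled embeddings.
    centroids = np.stack([x_lab[y_lab == c].mean(axis=0) for c in classes])
    x_all = np.concatenate([x_lab, x_unl])
    for _ in range(n_iter):
        # Squared Euclidean distance of each unlabeled cell to each centroid.
        dist = ((x_unl[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        # Labeled cells keep their true class; unlabeled cells use `assign`.
        idx_all = np.concatenate([y_lab_idx, assign])
        centroids = np.stack([x_all[idx_all == k].mean(axis=0)
                              for k in range(len(classes))])
    dist = ((x_unl[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)], dist
```

Because the labeled cells always stay in their own class, no cluster can become empty, and every unlabeled cell receives a pseudo-label rather than being dropped by a confidence threshold.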

This work proposes a novel cell annotation training framework, scSemiPLC, which significantly enhances the efficiency and accuracy of annotation by fully leveraging unlabeled data. In the semi-supervised learning component, the framework innovatively generates pseudo-labels through clustering. Subsequently, it evaluates the reliability of these pseudo-labels and assigns corresponding weights, thereby balancing both their quantity and quality. This approach provides new insights into the direction of automatic cell annotation within the realm of semi-supervised learning.
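As a rough illustration of weighting pseudo-labels by reliability, the sketch below converts cell-to-centroid distances into a per-cell confidence and uses it to weight a consistency loss on perturbed inputs. It is a sketch under stated assumptions, not the paper's formulation: the softmax over negative distances, the `temperature` parameter, the cross-entropy form of the loss, and both function names are hypothetical choices made for this example.

```python
import numpy as np

def reliability_weights(dist, temperature=1.0):
    """Map distances (n_cells, n_classes) to a confidence between 1/K and 1.

    Assumption: reliability is the largest softmax probability over
    negative distances; the paper may score reliability differently.
    """
    logits = -dist / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def weighted_consistency_loss(probs_perturbed, pseudo_idx, weights, eps=1e-12):
    """Weighted cross-entropy between perturbed predictions and pseudo-labels."""
    pseudo_idx = np.asarray(pseudo_idx)
    # Probability the model gives to each cell's pseudo-label class.
    picked = probs_perturbed[np.arange(len(pseudo_idx)), pseudo_idx]
    per_cell = -np.log(picked + eps)
    # Normalize by total weight so the loss scale does not depend on it.
    return float((weights * per_cell).sum() / (weights.sum() + eps))
```

In this scheme an ambiguous cell still contributes to training, only with a smaller weight, which is one way to trade off pseudo-label quantity against quality instead of discarding everything below a fixed threshold.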

## Full-text entities

- **Genes:** Cd4 (CD4 antigen) [NCBI Gene 12504] {aka L3T4, Ly-4}
- **Diseases:** Kidney (MESH:D007674)
- **Chemicals:** scSemiGAN (-)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** S2 — Drosophila melanogaster (Fruit fly), Spontaneously immortalized cell line (CVCL_Z232)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12817951/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12817951/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12817951/full.md

---
Source: https://tomesphere.com/paper/PMC12817951