# Pattern Learning and Knowledge Distillation for Single-Cell Data Annotation

**Authors:** Ming Zhang, Boran Ren, Xuedong Li

PMC · DOI: 10.3390/biology15010002 · Biology · 2025-12-19

## TL;DR

This paper introduces PLKD, a new method for improving cell type annotation in single-cell data by using pattern learning and knowledge distillation to reduce batch effects and enhance accuracy.

## Contribution

The novel contribution is PLKD, which combines pattern-level biological insights with knowledge distillation for efficient and accurate cell type annotation.

## Key findings

- PLKD achieves high accuracy and robustness in cell type annotation while reducing batch effects.
- The method supports multi-modal cell type annotation and integration tasks efficiently.
- Knowledge distillation yields a lightweight model that is resistant to noise and fast at inference.

## Abstract

Single-cell technologies allow researchers to measure gene expression at the level of individual cells, providing a powerful way to study cell identities and understand biological processes. However, differences between datasets, known as batch effects, can make accurate cell type annotation challenging. Existing deep learning methods often require heavy computation or fail to jointly address batch correction and cell type identification. In this study, we introduce PLKD, a new method that uses biologically meaningful gene groups, called patterns, together with knowledge distillation to improve cell type annotation. PLKD first learns pattern-level information using a Transformer-based Teacher model and then transfers this knowledge to a lightweight Student model. This design enables PLKD to accurately classify cell types, reduce batch effects, and provide interpretable biological insights. Our results show that PLKD achieves high accuracy and robustness while remaining efficient for large-scale datasets, offering a practical and interpretable tool for single-cell analysis.
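To make the idea of pattern-level features concrete, the sketch below collapses one cell's gene-level expression into per-pattern scores by averaging over each gene set. This is only an illustration under stated assumptions: the function name `pattern_scores`, the two example patterns, and the expression values are hypothetical, and mean pooling is a simple stand-in, not the Transformer-based grouping the Teacher uses in the paper.

```python
from statistics import fmean


def pattern_scores(expression, patterns):
    """Average one cell's expression over the genes in each pattern.

    expression: dict mapping gene symbol -> expression value for one cell.
    patterns:   dict mapping pattern name -> list of gene symbols.
    Genes absent from the cell are treated as zero.
    """
    return {
        name: fmean(expression.get(gene, 0.0) for gene in genes)
        for name, genes in patterns.items()
    }


# Hypothetical grouping for illustration only; the paper's patterns are not
# listed in this summary.
patterns = {
    "erythroid": ["HBB", "AHSP"],
    "t_cell": ["CD8A", "IL7R"],
}

# Made-up expression values for a single cell.
cell = {"HBB": 4.0, "AHSP": 2.0, "CD8A": 0.0, "IL7R": 1.0}

scores = pattern_scores(cell, patterns)
print(scores)
```

A downstream classifier would then see a short vector of pattern scores instead of thousands of individual genes, which is one intuition for why pattern-level inputs may be less sensitive to gene-level batch noise.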

Transferring cell type annotations from a reference dataset to a query dataset is a fundamental problem in AI-based single-cell data analysis. However, single-cell measurement techniques introduce domain gaps between batches or datasets. Existing deep learning methods do not account for batch integration when learning reference annotations, which makes cell type annotation across multiple query batches difficult. For cell representation, batch integration can not only eliminate the gaps between batches or datasets but also improve the heterogeneity of cell clusters. In this study, we propose PLKD, a cell type annotation method based on pattern learning and knowledge distillation. PLKD consists of a Teacher (Transformer) and a Student (MLP). The Teacher groups all input genes (features) into different gene sets (patterns), and each pattern represents a specific biological function. This design enables the model to focus on interactions between biologically relevant functions rather than on gene-level expression, which is susceptible to batch gaps. In addition, knowledge distillation makes the lightweight Student resistant to noise, allowing it to infer quickly and robustly. Furthermore, PLKD supports multi-modal cell type annotation, multi-modal integration, and other tasks. Benchmark experiments demonstrate that PLKD achieves accurate and robust cell type annotation.
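The Teacher-to-Student transfer described above can be illustrated with the classic temperature-scaled distillation objective. This summary does not state the loss PLKD actually uses, so the following is a minimal sketch of the standard formulation (soft-target KL divergence plus hard-label cross-entropy); the function names, the `temperature` and `alpha` values, and the example logits are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    Soft term: KL(teacher || student) at the given temperature, scaled by
    temperature**2 as in the standard formulation.
    Hard term: cross-entropy of the student against the true label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Hypothetical logits for a three-class problem; the true class is index 0.
teacher = [2.0, 0.5, -1.0]
matched = distillation_loss(teacher, teacher, label=0)
mismatched = distillation_loss([-1.0, 0.5, 2.0], teacher, label=0)
print(matched < mismatched)
```

When the Student reproduces the Teacher's logits, the soft term vanishes and only the cross-entropy term remains, so the loss is lower than for a Student that ranks the classes in the opposite order.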

## Full-text entities

- **Genes:** CDK1 (cyclin dependent kinase 1) [NCBI Gene 983] {aka CDC2, CDC28A, P34CDC2}, FUT8 (fucosyltransferase 8) [NCBI Gene 2530] {aka CDGF, CDGF1}, CXCL9 (C-X-C motif chemokine ligand 9) [NCBI Gene 4283] {aka CMK, Humig, MIG, SCYB9, crg-10}, HBB (hemoglobin subunit beta) [NCBI Gene 3043] {aka CD113t-C, ECYT6, beta-globin}, TNFAIP3 (TNF alpha induced protein 3) [NCBI Gene 7128] {aka A20, AIFBL1, AISBL, OTUD7C, TNFA1P2}, IGKC (immunoglobulin kappa constant) [NCBI Gene 3514] {aka HCAK1, IGKCD, Km}, PHEX (phosphate regulating endopeptidase X-linked) [NCBI Gene 5251] {aka HPDR, HPDR1, HYP, HYP1, LXHR, PEX}, LYST (lysosomal trafficking regulator) [NCBI Gene 1130] {aka CHS, CHS1, Mauve}, SNAR-E (small NF90 (ILF3) associated RNA E) [NCBI Gene 100170220], ISG15 (ISG15 ubiquitin like modifier) [NCBI Gene 9636] {aka G1P2, IFI15, IMD38, IP17, UCRP, hUCRP}, ITIH3 (inter-alpha-trypsin inhibitor heavy chain 3) [NCBI Gene 3699] {aka H3P, ITI-HC3, SHAP}, HLA-DRA (major histocompatibility complex, class II, DR alpha) [NCBI Gene 3122] {aka HLA-DRA1}, ARIH1 (ariadne RBR E3 ubiquitin protein ligase 1) [NCBI Gene 25820] {aka ARI, HARI, HHARI, UBCH7BP}, AHSP (alpha hemoglobin stabilizing protein) [NCBI Gene 51327] {aka EDRF, ERAF}, ACACA (acetyl-CoA carboxylase alpha) [NCBI Gene 31] {aka ACAC, ACACAD, ACACalpha, ACC, ACC1, ACCA}, IL7R (interleukin 7 receptor) [NCBI Gene 3575] {aka CD127, CDW127, IL-7R-alpha, IL-7Ralpha, IL7RA, IL7Ralpha}, RGS2 (regulator of G protein signaling 2) [NCBI Gene 5997] {aka G0S8}, SEL1L3 (SEL1L family member 3) [NCBI Gene 23231] {aka Sel-1L3}, CD8A (CD8 subunit alpha) [NCBI Gene 925] {aka CD8, CD8alpha, IMD116, Leu2, p32}, MYO1E (myosin IE) [NCBI Gene 4643] {aka FSGS6, HuncM-IC, MYO1C}, CD14 (CD14 molecule) [NCBI Gene 929], KL (klotho) [NCBI Gene 9365] {aka HFTC3, KLA}, MZB1 (marginal zone B and B1 cell specific protein) [NCBI Gene 51237] {aka MEDA-7, PACAP, pERp1}, CLEC9A (C-type lectin domain containing 9A) [NCBI Gene 283420] {aka CD370, DNGR-1, DNGR1, UNQ9341}
- **Diseases:** Pan-cancer (MESH:D009369), injury to (MESH:D014947), inflammatory (MESH:D007249), esophageal carcinoma (MESH:D004938), thyroid carcinoma (MESH:D013964), uterine corpus endometrial carcinoma (MESH:D016889)
- **Chemicals:** iron (MESH:D007501), MNN (-)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** 32,231 — Mus musculus (Mouse), Hybridoma (CVCL_L524), Mono — Homo sapiens (Human), Adult acute myelomonocytic leukemia, Transformed cell line (CVCL_WN74)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12785110/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12785110/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/PMC12785110/full.md

---
Source: https://tomesphere.com/paper/PMC12785110