Prototype Guided Post-pretraining for Single-Cell Representation Learning
Sachini Weerasekara, Natasha Darras, Sagar Kamarthi, Colles Price, Jacqueline Isaacs

TL;DR
This paper introduces CellRefine, a post-pretraining method for single-cell models that uses marker-gene priors to improve downstream task performance, addressing generalization issues in gene expression data.
Contribution
CellRefine is a post-pretraining approach that operates between the pretraining and fine-tuning stages of a single-cell foundation model. It incorporates marker-gene sets as structural priors, with reported downstream gains of up to 15%.
Findings
CellRefine improves downstream performance by up to 15%.
It effectively refines the latent embedding manifold of cells.
The method targets poor generalization under long-tailed cell-type distributions and covariate shifts in gene expression data.
Abstract
Single-cell representation learning (SCRL) from gene expression data offers a way to uncover the complex regulatory logic underlying cellular function. Inspired by large language models in natural language modeling, several single-cell pretrained models have recently been proposed that treat genes as tokens and cells as sentences. However, these models are fundamentally limited by the long-tailed nature of cell-type distributions and struggle to generalize under covariate shifts in gene expression data. While fine-tuning is often used to mitigate these issues, we observe that performance remains bounded. To address this challenge, we introduce CellRefine, a post-pretraining method that operates between the pretraining and fine-tuning stages of a single-cell foundation model. CellRefine uses a multi-faceted objective that incorporates marker-gene sets as structural priors to guide…
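The "genes as tokens and cells as sentences" framing in the abstract can be made concrete with a toy tokenizer. The sketch below is an illustration under stated assumptions, not CellRefine's tokenizer or that of any specific model: it orders genes by expression level, which is one scheme used by some single-cell pretrained models. The function name `cell_to_sentence` and the toy counts are hypothetical.

```python
def cell_to_sentence(expression, top_k=None):
    """Turn one cell's expression profile into an ordered list of gene tokens.

    expression: dict mapping gene symbol -> non-negative expression value.
    Genes with zero expression are dropped; the rest are ordered from highest
    to lowest expression, with ties broken alphabetically for determinism.
    Assumption: rank-by-expression ordering, chosen here only for illustration.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    tokens = [gene for gene, _ in expressed]
    return tokens if top_k is None else tokens[:top_k]


# Toy cell with made-up counts (illustrative only, not real data).
toy_cell = {"CD3E": 5.0, "MS4A1": 0.0, "CD8A": 2.0, "GAPDH": 9.0}
sentence = cell_to_sentence(toy_cell)
```

Under this scheme each cell becomes a variable-length sequence of gene symbols, which a sequence model can consume much as it would a sentence of words. Truncating with `top_k` keeps only the most highly expressed genes when a fixed context length is needed.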
