Mix-Geneformer: Unified Representation Learning for Human and Mouse scRNA-seq Data

Yuki Nishio; Takayoshi Yamashita; Keita Ito; Tsubasa Hirakawa; Hironobu Fujiyoshi

arXiv:2507.07454 · q-bio.GN · July 11, 2025

Open Access

TL;DR

Mix-Geneformer is a Transformer-based model that unifies human and mouse scRNA-seq data, enabling improved cross-species analysis and translational research through a hybrid self-supervised learning approach.

Contribution

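The paper introduces a hybrid self-supervised training method (Masked Language Modeling combined with a SimCSE-based contrastive loss) and a rank-value encoding scheme for integrating human and mouse scRNA-seq data; the resulting model matches or outperforms existing models on key tasks.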
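As a rough illustration of what a rank-value encoding can look like, the sketch below orders the genes detected in one cell by depth-normalized expression divided by a per-gene reference median, so that ubiquitously high genes are pushed down the ranking. This follows the general idea used by Geneformer-style models and is only an assumption about the scheme here; the function name `rank_value_encode`, the `max_len` default, and the toy counts and medians are all hypothetical and do not come from the paper.

```python
def rank_value_encode(cell_counts, gene_medians, max_len=2048):
    """Return gene IDs ordered by median-normalized expression, highest first.

    cell_counts:  dict mapping gene ID -> raw count in one cell
    gene_medians: dict mapping gene ID -> reference median (assumed precomputed
                  over a corpus; hypothetical values in the example below)
    """
    total = sum(cell_counts.values())
    scored = []
    for gene, count in cell_counts.items():
        median = gene_medians.get(gene)
        if count <= 0 or not median:
            continue  # skip undetected genes and genes without a usable median
        # Normalize by sequencing depth, then by the gene's reference median.
        scored.append((count / total / median, gene))
    # Highest score first; ties broken by gene ID so the output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored[:max_len]]


# Toy cell: made-up counts and medians, for illustration only.
cell = {"ACTB": 120, "GAPDH": 80, "NPHS1": 6, "MALAT1": 0}
medians = {"ACTB": 0.40, "GAPDH": 0.20, "NPHS1": 0.005, "MALAT1": 0.25}
tokens = rank_value_encode(cell, medians)
print(tokens)
```

In this toy cell the low-count gene ranks first because its count is large relative to its reference median, while the two high-count genes fall behind it and the undetected gene is dropped.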

Findings

01 Achieved 95.8% accuracy in mouse kidney cell classification.

02 Successfully identified key regulatory genes validated by in vivo studies.

03 Matched or outperformed state-of-the-art models in key tasks.

Abstract

Single-cell RNA sequencing (scRNA-seq) enables single-cell transcriptomic profiling, revealing cellular heterogeneity and rare populations. Recent deep learning models like Geneformer and Mouse-Geneformer perform well on tasks such as cell-type classification and in silico perturbation. However, their species-specific design limits cross-species generalization, which is crucial for advancing translational research and drug discovery. We present Mix-Geneformer, a novel Transformer-based model that integrates human and mouse scRNA-seq data into a unified representation via a hybrid self-supervised approach combining Masked Language Modeling (MLM) and SimCSE-based contrastive loss to capture both shared and species-specific gene patterns. A rank-value encoding scheme further emphasizes high-variance gene signals during training. Trained on about 50 million…
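To make the hybrid objective concrete, here is a minimal sketch of how an MLM cross-entropy term and a SimCSE-style contrastive term could be combined. It assumes the standard SimCSE setup, in which two dropout-perturbed encodings of the same input form a positive pair and the other items in the batch act as negatives. The function names, the temperature of 0.05, and the additive weighting `weight` are assumptions made for illustration; the abstract does not state how the two losses are balanced.

```python
import math


def log_softmax_at(logits, target):
    """Log-probability of index `target` under a numerically stable softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_sum_exp


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mlm_loss(masked_logits, masked_targets):
    """Mean cross-entropy over masked positions only."""
    losses = [-log_softmax_at(l, t) for l, t in zip(masked_logits, masked_targets)]
    return sum(losses) / len(losses)


def simcse_loss(view_a, view_b, temperature=0.05):
    """In-batch contrastive loss: view_a[i] should match view_b[i]."""
    losses = []
    for i, anchor in enumerate(view_a):
        sims = [cosine(anchor, other) / temperature for other in view_b]
        losses.append(-log_softmax_at(sims, i))
    return sum(losses) / len(losses)


def hybrid_loss(mlm, contrastive, weight=1.0):
    """Weighted sum of the two terms; the weighting is a hypothetical choice."""
    return mlm + weight * contrastive


# Toy batch of two cell embeddings, each encoded twice (made-up values).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
total = hybrid_loss(
    mlm_loss([[2.0, 0.5, 0.1]], [0]),
    simcse_loss(view_a, view_b),
)
print(total)
```

In a real training loop the two views would come from running the same tokenized cell through the encoder twice with dropout enabled, and both terms would be computed on tensors so gradients can flow; the pure-Python version above only shows the arithmetic.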

Taxonomy

Topics: Single-cell and spatial transcriptomics · Cell Image Analysis Techniques · Domain Adaptation and Few-Shot Learning