Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen
Alessandro Palma, Till Richter, Hanyi Zhang, Manuel Lubetzki, Alexander Tong, Andrea Dittadi, Fabian Theis

TL;DR
CFGen is a flow-based generative model that produces realistic, multi-modal, and multi-attribute single-cell data while preserving data discreteness, enabling improved biological data simulation and analysis.
Contribution
This work introduces CFGen, a novel flow-based model that effectively generates multi-modal, multi-attribute single-cell data while maintaining data discreteness, addressing limitations of prior models.
Findings
CFGen accurately recovers biological data characteristics.
It enhances rare cell type augmentation and batch correction.
It demonstrates effectiveness across diverse datasets.
Abstract
Generative modeling of single-cell RNA-seq data is crucial for tasks like trajectory inference, batch effect removal, and simulation of realistic cellular data. However, recent deep generative models simulating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, overlooking the discrete nature of single-cell data, which limits their effectiveness and hinders the incorporation of robust noise models. Additionally, aspects like controllable multi-modal and multi-label generation of cellular data remain underexplored. This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics while tackling relevant generative…
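To make the point about discreteness concrete, here is a minimal sketch contrasting integer counts drawn from a count noise model with the continuous, log-normalized values that the abstract says earlier models operate on. The abstract does not name CFGen's noise model, so the negative binomial (sampled as a gamma-Poisson mixture) is an assumption chosen because it is a common choice for scRNA-seq counts. The function names, gene means, and inverse-dispersion value are all hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's multiplication method (fine for small rates)."""
    if rate <= 0.0:
        return 0
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mean, inv_dispersion, rng):
    """Draw one negative binomial count as a gamma-Poisson mixture.

    The result has expectation `mean` and variance mean + mean**2 / inv_dispersion.
    """
    if mean <= 0.0:
        return 0
    # random.gammavariate takes (shape, scale), so the gamma mean is shape * scale = mean.
    rate = rng.gammavariate(inv_dispersion, mean / inv_dispersion)
    return sample_poisson(rate, rng)


rng = random.Random(0)

# Hypothetical per-gene means, standing in for what a decoder might output for one cell.
gene_means = [0.1, 0.5, 2.0, 4.0, 8.0]
inv_dispersion = 2.0  # assumed value, not taken from the paper

counts = [sample_negative_binomial(m, inv_dispersion, rng) for m in gene_means]
log_normalized = [math.log1p(c) for c in counts]  # continuous approximation

print("raw counts:     ", counts)
print("log-normalized: ", [round(v, 3) for v in log_normalized])
```

The raw counts stay non-negative integers with variance larger than their mean, while the log-transformed values are real-valued and no longer follow a count distribution.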
Peer Reviews
Decision: ICLR 2025 Poster
- Adapting flow matching for single-cell data generation is a novel contribution.
- The proposed framework CFGen can be easily adapted for different uni- and multi-modal scenarios, as long as there are modality-specific autoencoders with a common latent space.
- scVI should be included as a baseline in Figure 2 because scVI accounts for overdispersion and zero inflation, whereas the current baselines in Figure 2 (scDiffusion and scGAN) do not.
- For downstream applications that rely on conditional generation, it is unclear how the classifier guidance strength is determined.
- Quantitative results are lacking when evaluating the compositional classifier guidance in Section 5.3. The change in MMD and WD with respect to the target distribution when incre
1. The paper addresses an important problem in single-cell data generation by generating raw count values, and further extending this to multimodal generation.
2. The paper is well-written, and the authors convey major limitations of their model clearly.
3. The results show that CFGen is able to capture characteristics of the training dataset and generate single cell data with similar statistical properties.
4. They also show the effectiveness of generating rare cell-types to improve classificat
1. Fig. 3 is not really clear to me. Firstly, I suggest adding contrasting colors for points representing generated and real data. Secondly, what are the red points representing? I also suggest adding a quantitative metric (perhaps an oracle model that predicts the attributes) as well.
2. I also suggest removing the bars from Fig. 2b, as they make it hard to observe the overlapping density curves, which are easier to infer from.
3. For Sec 5.2, it might be worthwhile to also add a compariso
- The authors nicely demonstrate practical applications of their method such as data augmentation in rare cell types, improving downstream classification, and performing batch correction.
- The idea to extend flow matching for generation with multiple attributes is interesting and important for single-cell data.
- The paper is well-written, the related work is appropriately referenced, and the experimental setup is detailed.
- The authors do not discuss the computational complexity of the proposed method. A more detailed breakdown of computational requirements, including training and sampling times for the proposed method and the baselines, would improve the paper.
- One important task in single-cell data analysis is gene expression imputation, where missing or zero-inflated gene expression values are inferred to provide a more complete view of cellular states. It is unclear from the paper whether CFGen can effect
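One of the reviews above asks for MMD and WD to be reported against a target distribution. Reading these as maximum mean discrepancy and Wasserstein distance, the sketch below shows how such distances between real and generated samples could be computed. The RBF kernel, the fixed bandwidth, the restriction to one-dimensional equal-size samples for the Wasserstein distance, and the toy data are all assumptions for illustration; the excerpts here do not describe the paper's actual evaluation setup.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two points given as equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D samples.

    For equal-size empirical distributions this is the mean absolute
    difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Toy 2D embeddings standing in for real and generated cells (hypothetical values).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
generated_near = [(0.1, 0.0), (0.9, 0.1), (0.0, 1.1), (1.0, 0.9)]
generated_far = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0), (4.0, 4.0)]

print("MMD^2, near:", mmd_squared(real, generated_near))
print("MMD^2, far: ", mmd_squared(real, generated_far))

# The 1D distance is applied to a single coordinate (e.g. one gene or one component).
print("WD, near:", wasserstein_1d([p[0] for p in real], [p[0] for p in generated_near]))
print("WD, far: ", wasserstein_1d([p[0] for p in real], [p[0] for p in generated_far]))
```

Both quantities shrink as the generated sample approaches the reference sample, which is why tracking them as guidance strength changes would give the quantitative evidence the reviewer requests.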
Taxonomy
Topics: Cell Image Analysis Techniques · Single-cell and spatial transcriptomics · Model-Driven Software Engineering Techniques
Methods: Sparse Evolutionary Training
