CHAMMI-75: Pre-training multi-channel models with heterogeneous microscopy images
Vidit Agrawal, John Peters, Tyler N. Thompson, Mohammad Vali Sanian, Chau Pham, Nikita Moshkov, Arshad Kazi, Aditya Pillai, Jack Freeman, Byunguk Kang, Samouil L. Farhi, Ernest Fraenkel, Ron Stewart, Lassi Paavolainen, Bryan A. Plummer, Juan C. Caicedo

TL;DR
This paper introduces CHAMMI-75, an open-access dataset of heterogeneous, multi-channel microscopy images from 75 biological studies, which supports the development of channel-adaptive models of cellular morphology across imaging modalities.
Contribution
The creation of the CHAMMI-75 dataset and a demonstration of channel-adaptive models that generalize across heterogeneous microscopy images.
Findings
Training with CHAMMI-75 improves multi-channel bioimaging performance.
High diversity in microscopy modalities enhances model robustness.
Channel-adaptive models can be reused across different biological studies.
Abstract
Quantifying cell morphology using images and machine learning has proven to be a powerful tool to study the response of cells to treatments. However, models used to quantify cellular morphology are typically trained with a single microscopy imaging type. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different number of channels). Here, we present CHAMMI-75, an open access dataset of heterogeneous, multi-channel microscopy images from 75 diverse biological studies. We curated this resource from publicly available sources to investigate cellular morphology models that are channel-adaptive and can process any microscopy image type. Our experiments show that training with CHAMMI-75 can improve performance in multi-channel bioimaging tasks primarily because of its high diversity in microscopy…
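The channel-count mismatch described in the abstract can be illustrated with a small sketch. This is not the authors' method: it assumes one simple channel-adaptive strategy, in which a shared single-channel encoder is applied to every channel independently and the per-channel features are averaged, so that images with any number of channels map to an embedding of the same size. The names `encode_single_channel`, `embed_image`, and `random_image`, the toy linear encoder, and all dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def encode_single_channel(channel, weights):
    """Hypothetical stand-in for a shared single-channel encoder.

    Flattens one H x W channel and applies a linear projection followed
    by tanh. A real model would use a trained network here.
    """
    pixels = [p for row in channel for p in row]
    return [math.tanh(sum(p * w for p, w in zip(pixels, column)))
            for column in weights]


def embed_image(image, weights):
    """Embed an image given as a list of C channels, for any C >= 1.

    Every channel goes through the same encoder, and the per-channel
    features are mean-pooled, so the output size does not depend on C.
    """
    per_channel = [encode_single_channel(c, weights) for c in image]
    count = len(per_channel)
    return [sum(values) / count for values in zip(*per_channel)]


def random_image(rng, channels, height, width):
    """Generate a synthetic image with pixel values in [0, 1)."""
    return [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


# Assumed toy dimensions: 8 x 8 pixels per channel, 16-dimensional features.
HEIGHT, WIDTH, DIM = 8, 8, 16
rng = random.Random(0)
scale = 1.0 / math.sqrt(HEIGHT * WIDTH)
weights = [[rng.gauss(0.0, scale) for _ in range(HEIGHT * WIDTH)]
           for _ in range(DIM)]

three_channel = random_image(rng, 3, HEIGHT, WIDTH)
five_channel = random_image(rng, 5, HEIGHT, WIDTH)

# Both images yield embeddings of the same length despite different channel counts.
print(len(embed_image(three_channel, weights)),
      len(embed_image(five_channel, weights)))
```

Mean pooling is only one way to make the output independent of the channel count; it discards channel identity, which a practical model may need to keep.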
Peer Reviews
Decision: ICLR 2026 Poster
- The dataset has unmatched scale and heterogeneity for multi-channel microscopy, which supports broad generalization.
- The benchmarks cover diverse, realistic tasks and include novel channel combinations, which enhances external validity.
- The scaling analysis is careful and practical, which offers guidance on data size, model size, and multi-channel strategy.
- The SSL results are consistently strong across tasks, which validates CHAMMI-75 as a useful pre-training resource. And the release p

- The work offers limited methodological novelty, which centers contributions on data and benchmarking.
- The comparison to SubCell mixes training regimes and model sizes, which hinders clean conclusions about capability gaps.
- The LLM-assisted metadata extraction lacks error quantification, which raises concerns about label noise in curation.
The dataset’s large scale is impressive and would be a great unified resource for other researchers. The results for downstream tasks provide real value. The authors also added metadata information, specifically creating 22 metadata fields. Additionally, they segmented the nucleus of 1.8 billion cells using CellPose, which is very useful for the community. The benchmarks across different microscopy tasks demonstrate the generalizability of the proposed models. The statement from the authors to m
* While multiple benchmarks are used, the analysis could be strengthened by deeper biological interpretation of learned representations e.g., probing whether CHAMMI-75 pre-training improves biological feature disentanglement or transfer to unseen modalities
* The paper acknowledges missing and inconsistent metadata, but the implications for training robustness and domain bias are missing/underexplored.
* The evaluation seems to be focusing primarily on fluorescence microscopy. Brightfield data
1. The CHAMMI-75 dataset integrates 75 biological projects and 18 imaging platforms, providing unprecedented scale and diversity for microscopy SSL. This makes it a valuable benchmark for future research.
2. The authors conduct careful cleaning, redundancy reduction, and balanced sampling, improving representativeness and reproducibility.
3. The paper includes a thorough comparison of SSL methods and analyzes performance scaling with dataset size and channel configurations, validating the datase
1. The paper’s main contribution lies in data curation and benchmarking. The experimental pipeline relies heavily on existing models without proposing new architectures or training strategies tailored to multi-channel adaptation.
2. Although the dataset is multi-source and multi-channel, the study does not analyze how factors such as number of channels, organism diversity, or imaging modality affect representation learning. This limits understanding of what drives cross-domain generalization.
3.
Taxonomy
Topics: Cell Image Analysis Techniques · Single-cell and spatial transcriptomics · AI in cancer detection
