A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy
Maria Wyrzykowska, Gabriel della Maggiora, Nikita Deshpande, Ashkan Mokarian, Artur Yakimovich

TL;DR
This paper introduces a benchmark and datasets for virtual staining of virus-infected cells in microscopy images, aiming to improve detection accuracy and detail.
Contribution
The work introduces a new benchmark and datasets for virus infection reporter virtual staining (VIRVS), enabling better machine learning approaches.
Findings
A benchmark and datasets were created for virus infection virtual staining using diverse viruses and imaging methods.
U-Net and pix2pix models were tested for VIRVS, showing the feasibility of regressive and generative approaches.
The study defines a new interdisciplinary challenge at the intersection of data science and virology.
Abstract
Detecting virus-infected cells in light microscopy requires a reporter signal commonly achieved by immunohistochemistry or genetic engineering. While classification-based machine learning approaches to the detection of virus-infected cells have been proposed, their results lack the nuance of a continuous signal. Such a signal can be achieved by virtual staining. Yet, while this technique has been rapidly growing in importance, the virtual staining of virus-infected cells remains largely uncharted. In this work, we propose a benchmark and datasets to address this. We collate microscopy datasets, containing a panel of viruses of diverse biology and reporters obtained with a variety of magnifications and imaging modalities. Next, we explore the virus infection reporter virtual staining (VIRVS) task employing U-Net and pix2pix architectures as prototypical regressive and generative models.…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4- —501100002347Bundesministerium für Bildung und Forschung (Federal Ministry of Education and Research)
- —501100006478Bundesministerium der Verteidigung (Federal Ministry of Defence)
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCell Image Analysis Techniques · Image Processing Techniques and Applications · Molecular Biology Techniques and Applications
Background & Summary
Animal and human cells and tissues appear achromatic under a conventional brightfield light microscope, making individual components hard to distinguish. Staining of such samples using conventional or fluorescent dyes serves an important role in understanding biochemical mechanisms, disease diagnosis, or tracking pathogen transmission^1,2^. Not only does staining improve the contrast for microscopic examination, but functional approaches like fluorescent immunohistochemistry allow for selective highlighting of the meaningful molecules in the sample^3,4^. Some staining techniques require treating samples with various buffers, reagents and dyes that selectively bind to cellular components like lipids, proteins, carbohydrates or nucleic acids^4,5^. Others employ genetic engineering to incorporate fluorescent proteins into the specimen^6^. However, such sample preparation is complex and demands significant time and resources, as well as expertise and can potentially harm delicate samples or bias the results^7^. To address this, the scientific community has explored virtual staining algorithms rooted in deep neural networks (DNN) that produce labelled images from label-free microscopy^8,9^. A multitude of histological stain types has been successfully replicated^8^, often leveraging Convolutional Neural Networks (CNNs) as a staple technique in Computer Vision. Unlike classification, functional staining of proteins of interest akin to fluorescence microscopy allows visualising more subtle phenotypic nuances in biological images, including the shape and intensity of the signal. These features bear crucial biomedical information, including severity and multiplicity of infection.
Virtual staining can be formulated as either a regression or generation problem. Regression tasks produce deterministic outputs minimising the pixel-wise error, while generative models aim to learn the data distribution producing stochastic outputs^10^. This implies that, while regression allows for more precise control over the pixel-to-pixel conversion, generative models can produce more diverse and realistic images. A popular choice for solving regression tasks is the U-Net^11^ architecture based on the encoder-decoder structure^8,9,12,13^. For the image generation task in virtual staining, the generative adversarial networks (GANs) have been dominant^8,9,12,13^. In a GAN architecture^14^, one network called the generator learns to generate the desired output images, while a second network -the discriminator -learns to distinguish between the GT images and the images produced by the generator. Examples include virtual staining of autopsy tissue^12^, as well as multi-staining^13^
While both regression and generation-based virtual staining have been successfully applied to Digital Pathology^8,9,13^ and microscopy^15,16^, a systematic benchmark for virtual staining of virus-infected cells has thus far not been established. At the same time, the classification of cells infected by nucleotropic viruses like human adenovirus (HAdV) and herpes simplex virus (HSV) based on an unspecific nuclear stain has been previously demonstrated^17,18^. This suggests that the morphological inductive biases relevant to virus infection of cells are present in the images of infected and uninfected cell nuclei. Image classification alone does not provide a similar amount of nuance as virtual staining. For example, infected cell morphology and the intensity of the fluorescent reporter expressed during virus infection correlate with the duration and severity of infection^19^. Building models and benchmarks for virtual infection staining can significantly advance single-cell biology and aid in pathogen research or differential diagnostics.
To address this, we propose the Virus Infection Reporter Virtual Staining (VIRVS) benchmark, consisting of infected cell microscopy datasets and metrics on typical models for regression and generation tasks. The datasets consist of high-content multi-channel fluorescence and brightfield microscopy of 4 different cell types infected by 5 different viruses spanning 2 different levels of magnification. Specifically, the VIRVS benchmark includes Human Adenovirus (HAdV), Herpes simplex virus 1 (HSV), vaccinia virus (VACV), influenza A virus (IAV) and human rhinovirus (RV) (Fig. 1). HAdV is a non-enveloped virus with an icosahedral capsid replicating exclusively in the nucleus^20^. HSV is another example of a nucleotropic virus^21^ we added to VIRVS. However, unlike HAdV, HSV is enveloped. VACV, IAV and RV replicate in the cytoplasm. VACV is a prototypic member of poxviruses - a family of large, enveloped double-stranded DNA viruses^22,23^. Together, the panel of the viruses used in VIRVS spans a variety of viruses with various tropisms, biology and lifecycles.Fig. 1. Schematic depiction of dataset sources and composition including Human Adenovirus (HAdV), Herpes simplex virus 1 (HSV), vaccinia virus (VACV), influenza A virus (IAV) and human rhinovirus (RV). HAdV is non-eveloped, approximately 90 nm in diameter and has a double-stranded DNA genome of around 36 kbp, replicates in the nucleus^42,43^. HSV is enveloped with an icosahedral capsid, 155 to 240 nm in diameter^44^, has a double-stranded linear DNA genome that is approximately 152 kbp in length^21^. VACV is enveloped, its genome is approximately 190 kbp and the dimensions of the brick-shaped capsid-containing virions are roughly 360 × 270 × 250 nm^45^. IAV is enveloped, with an elliptical-shaped capsid of approximately 80 - 120 nm in diameter^46^, has a segmented negative-sense RNA genome and replicates in the cytoplasm^47^. RV is non-enveloped, 30-40 nm in size, replicates in the cell cytoplasm and has an icosahedral capsid, as well as a single-stranded positive-sense RNA genome of 7.2 to 8.5 kb in length^48,49^.
The multi-channel images used in the VIRVS benchmark are well registered and contain specific virus infection fluorescence signal used as ground truth (GT, Y) and unspecific fluorescence or brightfield signals used as input (X). The microscopy images were collated using the data from published works^17,24,25^. The aim of the virtual staining models is to reproduce the GT as prediction \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{Y}$$\end{document} , being as close as possible to Y. To demonstrate the performance of regression models, we employed a widely used U-Net architecture^11^. As a representative generative model, we employed the pix2pix^26^. We evaluated whether regression or generation are the most adequate approaches in addressing this task, performing experiments on the aforementioned datasets spanning a broad range of characteristics and pathogens.
Methods
Data Sources
The raw data for this work was collected in the context of references respective to viruses VACV^19^, HAdV^17^, HSV, IAV, RV^25^. Images of HSV, IAV, RV specifically are available under CC BY 4.0 license from the Image Data Resource (idr.openmicroscopy.org) with the accession numbers idr0130, idr0128, idr0129, respectively. Images of HAdV and VACV are available upon request to the corresponding author of references^17,19^ and are shared here upon acquiring explicit permission.
VIRVS Dataset Collation
To establish the VIRVS benchmark we have collated datasets containing multi-channel high-content microscopy of virus-infected cells^17,24,25^. These datasets include virus-specific reporter channel, for example green fluorescence protein expressed by the virus-infected cells, and an unspecific channel allowing to visualise all cells in the field of view. In some datasets the unspecific channel includes nuclear staining, in other cases brightfield microscopy. Virus-specific reporter channel was reserved as GT, while unspecific was used as input (see Fig. 1).
Specifically, to devise a dataset for HAdV we employed data from^17^. In this dataset the nuclei channel represents Hoechst 33342 staining and the viral channel indicates GFP concentration. Additionally, brightfield microscopy images of the cell monolayer were available. Infection was performed in the A549 human lung carcinoma-derived cell line. In the case of VACV we obtained the data from^24^, where GFP expression from a transgenic VACV indicated infection in the monkey kidney BSC40 cell line, while the brightfield modality allowed to visualise all the cells. As several adjacent fields of view were acquired per well, we performed an overlap-based image stitching ensuring broader resampling. HAdV and VACV datasets were obtained at 10x and 4x magnification respectively. In both cases microscopy was performed in a live-cell imaging mode, where images were obtained as a time-lapse series, visualising the progression of infection across multiple time points. To avoid bias, we have selected only visually distinct timepoints. Specifically, these were 49 (87 hours post infection) for HAdV and 100 (50h), 108 (54h) and 115 (57.5h) for VACV.
Data for the IAV, RV and HSV was obtained from a small molecule screening dataset performed at 4x magnification^25^. HeLa Ohio Geneva cell line was used for RV, and A549 for IAV and HSV. Unlike HAdV and VACV, these datasets were obtained from fixed end-point assays. To avoid introducing morphological variations from the small molecule perturbations, we used exclusively control measurements containing virus-infected cells under mock conditions (dimethyl sulfoxide treatment). All images were obtained at 4x magnification and contained nuclear and virus channels. The source of virus signal in RV and HSV was GFP, while in the case of IAV the signal was obtained immunohistochemically using an anti-nucleoprotein (α-NP) antibody (see^25^ for further details).
Notably, while HAdV and HSV are nucleotropic, affecting the nuclei phenotype visibly in microscopy images, VACV, IAV, and RV replicate in the cell cytoplasm. Given that only a nuclear channel is available as input for IAV and RV we presumed the inductive bias to be weaker or absent. Together, these datasets represent a wide range of imaging and infection characteristics, constituting a comprehensive benchmark of the models’ capabilities. Detailed descriptions of the number of images and resolution are presented in Table 1. For each dataset, we employed a data processing pipeline, resulting in training, validation, and test splits (see Methods section).Table 1. Final sizes of the datasets, including size of an individual image and number of images in training, validation and testing set for HAdV, VACV, HSV, IAV and RV data.DatasetOriginal image sizeTraining samplesValidation samplesTesting samplesHAdV^17^2048 × 204841111859VACV^24^5948 × 6048661812HSV^25^2160 × 216067219296IAV^25^2048 × 2048761218109RV^25^2048 × 2048761218109
Data Preparation
Each of VACV plaques was imaged to produce 9 files per channel, that need to be stitched to recreate the whole plaque. To achieve this, multiview-stitcher toolbox has been used. The stitching was first performed on the third channel, representing the brightfield microscopy image of the samples. Then, the parameters found for this channel were used to stitch the rest of the channels. VACV dataset represents a timelapse, from which timesteps 100, 108 and 115 have been selected to produce the data then used in the experiments. Images have been centre-cropped to 5948 × 6048 to match the size of the smallest image in the dataset (rounded down to the closest multiple of 2). The data was additionally manually filtered to remove the samples that constituted only uninfected cells (C02, C07, D02, D07, E02, E07, F02, F07).
HAdV dataset is also a time-lapse, from which only the last timestep (49th) has been selected.
For the rest of the datasets (HSV, IAV, RV) only the negative control data was used, which was selected in the following way: from the data collected at the University of Zürich, from the screen samples only the first 2 columns were selected and from the ZPlates and prePlates samples only the first 12 columns.
All of the datasets were divided into training, validation and test holdouts in 0.7:0.2:0.1 ratios, using random seed 42 to ensure reproducibility. Evaluation of splits with varied random seeds did not show significant differences, suggesting adequate representation (Supplementary Table S1). For the time-lapse data, it was ensured that the same sample from different timesteps only exists in one of the holdouts, to prevent information leakage and ensure fair evaluation. All of the samples were normalised to [−1, 1] range, by subtracting the 3rd percentile and dividing by the difference between percentile 99.8 and 3, clipping to [0, 1] and scaling to [−1, 1] range. For the brightfield channel of HAdV, percentiles 0.1 and 99.9 were used. These cutoff points were selected based on the analysis of the histograms of the values attained by the data, to make the best use of the available data range. Specific values used for the normalisation are summarised in Table 2 in Related/alternate identifiers.Table 2. Percentile values used to normalise the ground truth (GT), nuclear stain and brightfield channels for HAdV, VACV, HSV, IAV and RV datasets.DatasetGTGTNuclear stainNuclear stainBrightfieldBrightfieldHAdV0.01040.07410.00350.01780.0320.4089VACV0.00671.0——0.49561.0HSV0.00330.19260.00740.2907——IAV0.00910.17690.0110.2156——RV0.00820.17120.00860.2193——
To prepare the cell nuclei masks, the Cellpose model with pre-trained weights cyto3 has been used on the fluorescence channel. The diameter was set to 7 for all the datasets except for HAdV, for which the automatic estimation of the diameter was employed. Cell masks were prepared using Cellpose with pre-trained weights cyto3, with a diameter set to 70 on brightfield images stacked with fluorescence nuclei signal.
The data preparation can be reproduced by first downloading the datasets and then running scripts that are located in the ‘scripts/data_processing’ directory of the VIRVS repository https://github.com/casus/virvs, first modifying the paths in them:
- for HAdV data: ‘preprocess_hadv.py’
- for VACV data: ‘stitch_vacv.py’ + ‘preprocess_vacv.py’
- for the rest of the viruses: ‘preprocess_other.py’
- for the rest of the viruses: to prepare Cellpose predictions: ‘prepare_cellpose_preds.py’ (for cells) and ‘prepare_cellpose_preds_nuc.py’ (for nuclei).
Model architecture
Pix2pix model was implemented as described in the original paper^26^, based on official Tensorflow pix2pix tutorial, licensed under the Apache 2.0 License. The architecture of the U-Net model^27^ was identical to the one used as the generator in pix2pix.
Cellpose masks preparation
To prepare the cell nuclei masks, the CellPose^28^ model with pre-trained weights cyto3^29^ have been used on the fluorescence channel. The diameter was set to 7 for all the datasets except for HAdV, for which the automatic estimation of the diameter was employed. To prepare the cell nuclei masks, the CellPose model with pre-trained weights cyto3 has been used on the fluorescence channel. Cell masks were prepared using Cellpose with pre-trained weights cyto3 with a diameter set to 70 on brightfield images stacked with fluorescence nuclei signal.
Training setup
The size of the input expected by both of the models is 256 × 256 px. Random 256 × 256 crops of the data of the original size were used for training. For the model trained on both modalities of HAdV data, they were provided to the model as two separate channels. For both of the models, random jitter as suggested as an augmentation in pix2pix paper^26^ was applied, in addition to random up and down and left and right flips with probability p = 0.5.
The hyperparameters used for training come from the pix2pix paper; Adam optimiser^30^ with learning rate 2e-4 and β1 momentum parameter set to 0.5. The batch size was set to 8.
The loss function used for pix2pix was the one suggested in the paper, using 0.5 as the weight of the discriminator loss. For UNet mean absolute error (MAE) was used.
All the models were trained for 100000 steps. A step is defined as a gradient descent update after processing one batch.
A random seed equal to 42 was used to ensure reproducibility. All experiments were conducted on NVIDIA RTX A100 and V100 GPU devices available in Hemera cluster of the HZDR.
Evaluation setup
Models have been evaluated on the test set. For evaluation purposes, we used the whole images, center-cropped to the closest multiple of 256, as required by the model architecture. The batch size used for the evaluation is 1. Random seeds used during the evaluation were 42 for U-Net and 42, 43, 44 for pix2pix.
During evaluation, we used the batch normalisation statistics of the test batch for both of the models. This approach, coined as Instance Normalisation, was originally used in pix2pix, but also suggested for U-Net applied to biomedical data in^31^. Additionally, for pix2pix the dropout layers are activated, as recommended in the pix2pix paper^26^.
Metrics used to evaluate the performance are mean square error (MSE)^32^, structural similarity index (SSIM)^33^ and peak signal-to-noise ratio (PSNR)^34^ and are implemented as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{MSE}}\,(Y,\widehat{Y})=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({\widehat{Y}}_{i}-{Y}_{i})}^{2}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{SSIM}}\,(Y,\widehat{Y})=\frac{(2{\mu }_{Y}{\mu }_{\widehat{Y}}+{C}_{1})(2{\sigma }_{Y\widehat{Y}}+{C}_{2})}{({\mu }_{Y}^{2}+{\mu }_{\widehat{Y}}^{2}+{C}_{1})({\sigma }_{Y}^{2}+{\sigma }_{\widehat{Y}}^{2}+{C}_{2})}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{PSNR}}\,(Y,\widehat{Y})=10\,{\log }_{10}\left(\frac{MAX{(Y)}^{2}}{\,{\rm{MSE}}\,(Y,\widehat{Y})}\right)$$\end{document}where μY and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mu }_{\widehat{Y}}$$\end{document} are the average of Y and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{Y}$$\end{document} , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\sigma }_{Y}^{2}$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\sigma }_{\widehat{Y}}^{2}$$\end{document} are the variances, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\sigma }_{Y\widehat{Y}}$$\end{document} is the covariance of Y and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{Y}$$\end{document} , and C1, C2 are constants as described in the original implementation of SSIM^33^ and MAX_Y_ is the maximum possible data range in image, in our case 2.
The classification is performed through the following steps:
- Cell nuclei are segmented using Cellpose.
- For each segmented nucleus, we calculate the fraction of pixels in the GFP or αNP channel (either from ground truth or produced by the trained model) that have values higher than − 0.9 for HAdV or − 0.8 for other viruses. If this fraction exceeds 0.5, the nucleus is considered infected. These threshold values were selected based on a careful visual examination.
- A mask is generated for each image. Background pixels (where no nucleus was segmented by Cellpose) and pixels of healthy nuclei are labelled with 0, while pixels of infected nuclei are labelled with 1.
These masks are created based on ground truth data, as well as predictions from the pix2pix and U-Net models. The overall accuracy is then calculated using the following formula:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{Accuracy}}\,=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{\mathbb{I}}({\widehat{M}}_{i}={M}_{i})$$\end{document}where M is the mask created based on ground truth data, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{M}$$\end{document} is the mask created based on the model’s prediction, n is the size of the image, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathbb{I}}(\cdot )$$\end{document} is the indicator function that equals 1 when its argument is true and 0 otherwise.
This, which we include for completeness sake, can be misleadingly high. This is because most of the pixels in the mask image typically do not correspond to any objects and are set to 0, regardless of the model’s prediction. For this reason, we exclude regions of the data without any objects from the calculation of the other metrics and accuracy should be interpreted alongside these other metrics.
Based on the masks, a confusion matrix for the binary classification of cell nuclei is created on a per-pixel basis, excluding pixels that do not belong to any cell nucleus. The number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are then used to calculate the following metrics:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{IoU}}\,=\frac{TP}{TP+FN+FP}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{F1}}\,=\frac{2TP}{2TP+FP+FN}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{Precision}}\,=\frac{TP}{TP+FP}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\,{\rm{Recall}}\,=\frac{TP}{TP+FN}$$\end{document}Data Record
The raw and processed VIRVS data is deposited in RODARE^35^.
Dataset Organisation
For each virus (HADV, VACV, IAV, RV and HSV), we provide the processed data in a separate directory, divided into three subdirectories: ‘train’, ‘val’ and ‘test’, containing the proposed data split. Each of the subfolders contains two npy files: ‘x.npy’ and ‘y.npy’, where ‘x.npy’ contains the fluorescence or brightfield signal (both for HAdV, as separate channels) of the cells or nuclei and ‘y.npy’ contains the viral signal. The data is already processed as described in the Data preparation section.
Additionally, Cellpose masks are made available for the test data in a separate ‘masks’ directory. For each virus except for VACV, there is a subdirectory ‘test’ containing nuclei masks (‘nuc.npy’) for the samples from test split. For HAdV cell masks are also available (‘cell.npy’).
To enable connecting the npy arrays and the original data, metadata filename_to_index.json files with a mapping between the indexes of npy arrays and the filenames of images are supplied along with each npy file (https://github.com/casus/virvs). For timelapse data, the metadata indicate the original name of the file (with the channel suffix removed) and the timestep used, i.e. 180301-VA-dE3B-TC-6_G02_s29_49 means that the matching index in the numpy array was created based on timestep 49 of HAdV sample 180301-VA-dE3B-TC-6_G02_s29_w1, _w2 and _w4. For non-timelapse data, the metadata matches the original name of the file with the channel suffix removed.
To facilitate linking the npy arrays with the original data, metadata files named filename_to_index.json are provided alongside the splits. These files contain mappings between the indices in the npy arrays and the corresponding image filenames. For timelapse data, the metadata specifies the original filename (with the channel suffix removed, as the array contains data from all significant channels) and the timestep used. Example: 180301-VA-dE3B-TC-6_G02_s29_49 indicates that the corresponding index in the npy array was derived from timestep 49 of the HADV sample: 180301-VA-dE3B-TC-6_G02_s29_w1, _w2, and _w4. For non-timelapse data, the metadata matches the original filename (with the channel suffix removed).
Technical Validation
VIRVS Benchmark for Human Adenovirus
To validate the suitability of the dataset^35^ for deep learning prediction of the virtual staining, we tested the state-of-the-art virtual staining architectures. Specifically, we trained and evaluated U-Net^11^ and pix2pix^26^ models. For HAdV, we initially trained models on fluorescence microscopy images of nuclei. Subsequently, we trained separate models on combined fluorescence and brightfield images (HAdV (2ch)) to investigate the inductive bias introduced by cell shape visible only there. This approach was motivated by the observation that virus infection causes a cytopathic effect (CPE) - a visible change in cell morphology^36–38^. Furthermore, given that the GFP signal in all the datasets is localised in the cytoplasm, the quality of the resulting staining images is likely to be influenced by the cell shape. Additionally, we trained models solely on transmission light images, but the results were significantly worse, indicating that nuclear morphology is important for identifying the HAdV infection. This aligns with the understanding that HAdV, being a nucleotropic virus, affects the morphology of infected nuclei.
Figure 2 illustrates examples of predictions made by the pix2pix and U-Net models for the HAdV data. Consistent with previous work, the Hoechst nuclear signal of infected cells appears visually distinct (Fig. 2a, zoomed-in insets), suggesting the presence of a strong inductive bias. Model prediction results suggested that while both architectures commit similar errors, the pix2pix predictions are visually closer to the GT, capturing texture, shapes, and detail with greater fidelity.Fig. 2. Examples of predictions made by the pix2pix and U-Net models for the HAdV data. The top row (a) displays the nuclear channel, GT and the virtual stains predicted by pix2pix and U-Net trained only on the nuclear channel. The bottom row (b) shows input with the nuclear channel superimposed on the brightfield channel (HAdV (2ch)), GT and the corresponding virtual stains predicted by the models trained on both channels. The scale bar is 500 μm. Insets show respective zoomed-in regions. The colour code depicted in the colour bar.
To evaluate the model performance quantitatively, we computed image-wide metrics commonly used in image-to-image tasks, including mean square error (MSE), peak signal-to-noise ratio (PSNR)^39^, and structural similarity index (SSIM)^33^. To understand the models’ behaviour in a more comprehensive way, we additionally employed a rule-based approach to classify cell nuclei as either infected or healthy (see Methods section). We report classification and segmentation metrics such as IoU, F1, accuracy, precision and recall based on the results, summarised in the Table 3. Furthermore, in the Supplement, we report additional results: MSE, PSNR and SSIM calculated separately for the background and foreground sections of the images in Supplementary Table S2, IoU, F1, accuracy, precision and recall calculated for whole cells for HAdV data in Supplementary Table S3 and means and standard deviations calculated for every metric across the data points in Supplementary Tables S4 and S5.Table 3. Performance of U-Net and pix2pix models, trained on HAdV, HAdV (2ch), VACV, HSV, IAV and RV data. Metrics that we report are MSE, PSNR, SSIM, IoU, F1, accuracy, precision and recall. For pix2pix, we report the mean and standard deviation of each metric based on 3 runs.DatasetModelMSE (↓)PSNR (↑)SSIM (↑)IoU (↑)F1 (↑)Accuracy (↑)Precision (↑)Recall (↑)HAdVpix2pix0.034 ± 5.0e-0524.214 ± 7e-030.870 ± 3.0e-050.300 ± 7.0e-04****0.437 ± 6.0e-040.959 ± 4.0e-050.503 ± 7.0e-040.473 ± 1.0e-03U-Net0.03125.3310.8990.2680.4040.963****0.5570.357HAdV (2ch)pix2pix0.032 ± 4.0e-05****24.824 ± 1.1e-020.892 ± 1.0e-040.239 ± 8.0e-04****0.364 ± 1.0e-030.960 ± 8.0e-050.524 ± 1.0e-03****0.322 ± 2.0e-03U-Net0.03224.7670.8950.2350.3600.9610.5220.306VACVpix2pix0.083 ± 4.0e-0521.087 ± 1.5e-020.825 ± 5.0e-05—————U-Net0.05724.5630.860—————HSVpix2pix0.016 ± 9.0e-0727.514 ± 2.0e-030.917 ± 1.0e-050.660 ± 2.0e-040.794 ± 1.0e-040.985 ± 6.0e-060.862 ± 3.0e-050.741 ± 3.0e-04U-Net0.01029.6730.9380.7270.841****0.9880.8580.830IAVpix2pix0.054 ± 1.0e-0522.266 ± 4.0e-030.875 ± 2.0e-050.249 ± 1.0e-04****0.391 ± 9.0e-050.967 ± 2.0e-050.485 ± 3.0e-040.341 ± 3.0e-04U-Net0.04124.3740.9050.1740.2900.971****0.6550.190RVpix2pix0.045 ± 6.0e-0621.668 ± 1.0e-040.863 ± 1.0e-050.219 ± 2.0e-040.357 ± 3.0e-040.963 ± 2.0e-050.347 ± 4.0e-040.379 ± 3.0e-04U-Net0.02424.7380.9170.2940.4480.9780.6140.358
In our analysis, the U-Net model showed superior results for the HAdV dataset when trained on a single nuclear channel, with a lower MSE, and a higher PSNR, SSIM, accuracy and precision compared to pix2pix. However, for the combination of brightfield and nuclear channels (HAdV (2ch)), pix2pix slightly outperformed U-Net in most metrics, despite having a marginally lower SSIM and accuracy. These findings suggest that pix2pix tends to overestimate the GFP signal, leading to higher recall but worse or comparable precision, while U-Net delivers better accuracy, especially when excluding the brightfield channel.
Together with the visual examination, the results suggest that while pix2pix provided more realistic-looking cell shapes, this did not significantly improve the metrics such as MSE, PSNR or SSIM of the models for HAdV, which remained close to each other. The difference is more noticeable when metrics such as IoU, precision and recall are incorporated. Notably, the CPE in the brightfield images was not very pronounced at this stage of infection, which might have limited the inductive bias and introduced unnecessary noise, disrupting U-Net’s performance.
VIRVS Benchmark for Vaccinia Virus
To further explore if CPE in brightfield microscopy can provide a robust inductive bias for virtual staining we evaluated the performance of U-Net and pix2pix in predicting virus signal of VACV, known to induce a strong CPE^40^. Noteworthy, the CPE is visible prior to the GFP signal, as it takes time to express and accumulate GFP molecules in the infected cell. Figure 3 illustrates examples of predictions made by the pix2pix and U-Net for the VACV data. Both perform well on this dataset, producing signal comparable to the GT. U-Net underestimated both the magnitude and the extent of the signal (Fig. 3, dashed lines in the zoomed-in insets) compared to GT, while pix2pix overestimated the magnitude, overestimating the extent only slightly. Unsurprisingly, the predictions of both architectures suffered from the artefacts introduced by the stitching procedure, however, since they occurred in predictions of both architectures to a comparable degree, we assumed they did not significantly impact our analysis.Fig. 3. Examples of predictions made by the pix2pix and U-Net models for the VACV data. Starting from the left, the brightfield channel and GT are visualised, followed by pix2pix and U-Net predictions. Insets show respective zoomed-in regions. The dashed green line indicates the approximate edge of the GFP signal. Dashed white - an approximate edge of the cytopathic effect. The colour code is depicted in the colour bar. Scale bar 2000 μm.
Looking at the metrics (Table 3), the U-Net model outperformed pix2pix. The U-Net achieved a lower MSE, higher PSNR, and higher SSIM, indicating superior overall performance. This may be attributed to the overestimation of the signal magnitude committed by the pix2pix. We do not report other metrics due to the lack of the fluorescence nuclei signal for VACV hampering the accurate segmentation in the dense cell monolayer.
VIRVS Benchmark for HSV, IAV and RV
Finally, to explore potential inductive bias in the nuclear signal of cells infected with cytoplasm-replicating IAV and RV compared to nucleotropic HSV, we trained U-Net and pix2pix on a dataset derived from a small molecule screen^25^. Figure 4a-c illustrates examples of predictions for the HSV, IAV and RV respectively. Notably, the nuclear staining of HSV data displayed strong intensity irregularities corresponding to the GT GFP signal, confirming the presence of a strong inductive bias. Unsurprisingly, both U-Net and pix2pix performed equally well visually on this task.Fig. 4. Examples of predictions made by the pix2pix and U-Net models for the HSV, IAV and RV data. In each row, starting from the left, the brightfield channel and GT are visualised, followed by pix2pix and U-Net predictions. The top row (a) displays the HSV data, the middle row (b) shows the IAV data, and the bottom row (c) presents the RV data. Scale bar 1000 μm.
Interestingly, IAV nuclear signal showed some irregularities corresponding to the virus signal as well, suggesting the presence of some inductive bias. As the intensity of the nuclear stain is directly connected to the cell cycle and given the role of IAV in host cell cycle arrest^41^, it is tempting to speculate that the inductive bias facilitating the prediction may be the host DNA content. Curiously, while both pix2pix and U-Net demonstrated some visual overlap with the signal in the GT, their performance was weak. This manifested as blurry, sporadic patches generated by the U-Net and multiple false positive pixels (hallucinations) generated by the pix2pix. At the same time, the performance of the models on RV proved to be consisting mostly of false positives. This likely suggested that in the absence of a strong inductive bias, both U-Net and pix2pix hallucinated the results. An alternative explanation is that both U-Net and pix2pix lacked either expressive capacity and a future foundation model could success where they failed.
In our evaluation of the HSV, IAV, and RV metrics (see Table 3), U-Net consistently outperformed pix2pix in terms of image prediction metrics. For the HSV dataset, U-Net achieved a better MSE, PSNR and SSIM, indicating superior similarity to GT images. It also achieved a high IoU, F1, accuracy and recall, with slightly worse precision than pix2pix. Analogously, in the IAV dataset, U-Net’s MSE, PSNR, SSIM, accuracy and precision surpassed pix2pix’s. On the other hand, pix2pix achieved significantly higher IoU, F1 and recall, suggesting that pix2pix may have overestimated the αNP signal in contrast to U-Net’s underestimation. For the RV dataset, U-Net continued to excel. Only recall was higher for pix2pix.
Consistent with our expectations, the nucleotropic HSV proved to be the easiest to predict based on the nuclear signal, reaching the best metrics overall. For U-Net, the performance of the RV was higher than the IAV according to all the metrics, except for the precision, contrarily to the visual impression, that suggested high hallucination rate, exhibiting that these metrics are vulnerable to hallucinations. RV may have outperformed IAV in these metrics due to the high false positive rate in RV predictions.
Detecting virus-infected cells in microscopy can help unveil novel host-pathogen interactions in the context of Cell Biology^17^ and facilitate the discovery of novel antivirals in the context of high-content screening^24,25^. In this work, we demonstrate that the emerging field of deep-learning-based Virtual Staining can facilitate the detection of virus-infected cells from seemingly irrelevant signal. Compared to detection by classification^17^, the ability to generate a pixel-level prediction provides a greater level of nuance including visual guidance to severity and duration of infection (see Fig. 3). Furthermore, we collate the datasets, baselines, and metrics into a benchmark VIRVS that is readily available for the community to download and improve upon.
Virtual Staining can be formulated as either regression or image generation. Using state-of-the-art architectures for these tasks, we demonstrate that in the case of VIRVS, both tasks are applicable provided sufficient inductive bias is present in the data. In our case, the generative model tended to yield more realistic-looking results, yet it often hallucinated where the signal wasn’t present in the GT. The regression model, in turn, always produced less realistic results with more conservative predictions. Overall, the stronger the inductive bias the better were the results of both generative and regression models.
From the metric perspective, the values of MSE, PSNR and SSIM suggest that overall U-Net demonstrated favourable performance across all viruses and image resolutions, except the HAdV (2ch) data, where pix2pix achieved comparable results. It is tempting to speculate, that this can be attributed to more conservative predictions of U-Net and a higher false-positive rate of pix2pix. Additional metrics, such as IoU, F1, accuracy, precision and recall demonstrated more nuance between the models and enabled more detailed comparison. Notably, performance on HSV images ranked highest overall, whereas VACV ranked lowest. Moreover, in contrast to the visual impression, pix2pix on VACV data performed the worst overall in all the metrics. Additionally, while pix2pix attained lower scores and sometimes hallucinated, its outputs appeared more realistic, detailed, and sharp compared to U-Net’s, which can be observed in HAdV or IAV predictions. This suggests that the commonly used metrics, such as MSE or PSNR may not always capture all the nuances of the virtual staining. The inclusion of metrics such as IoU, precision or recall could help alleviate this issue but computing these requires segmentation-level annotations. To thoroughly evaluate the model’s behaviour, one should consider all of these metrics in conjunction. This can be further enhanced by examining them at a finer level of detail, such as analysing them separately for the background and foreground of the image, as suggested in the Supplement. Additionally, a visual examination of the predictions should be incorporated to provide a more comprehensive understanding. Only then it is possible to gain a deep understanding of the performance.
While accuracy is of utmost importance for this task, overly cautious modelling or blurring out details to achieve the lowest error is insufficient. On the other hand, sacrificing truthfulness for more visually appealing predictions is unsound. Future works should focus on striking a balance between the advantages and shortcomings of the regressive and generative approach to this task and strive to find meaningful evaluation techniques to assess it. The VIRVS benchmark introduced in this work will be instrumental in these efforts. Through it, we demonstrated the feasibility of virtual staining for virus infection in cells for the first time. We hope that introducing these benchmark datasets together with our baseline characterisation will spur interest in exploring virtual staining in the context of viral infections.
Supplementary information
Supplementary information to the manuscript: A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Pommerville, J. C.Fundamentals of microbiology (Jones & Bartlett Publishers, 2013).
- 2Moreno, V., Smith, E. A. & Piña-Oviedo, S. Fluorescent immunohistochemistry. Immunohistochemistry and Immunocytochemistry: Methods and Protocols 131–146 (2022).10.1007/978-1-0716-1948-3_934859403 · doi ↗ · pubmed ↗
- 3Mok, H. & Yakimovich, A. Click chemistry-based labeling of poxvirus genomes. Vaccinia Virus: Methods and Protocols 209–220 (2019).10.1007/978-1-4939-9593-6_1331240680 · doi ↗ · pubmed ↗
- 4Beveridge, T. J., Lawrence, J. R. & Murray, R. G. Sampling and staining for light microscopy. Methods for general and molecular microbiology 19–33 (2007).
- 5Wiedenmann, J., D’Angelo, C. & Nienhaus, G. U. Fluorescent proteins: Nature’s colorful gifts for live cell imaging. Fluorescent Proteins II: Application of Fluorescent Protein Technology 3–33 (2012).
- 6Bai, B., Yang, X. & Li, Y. e. a. Deep learning-enabled virtual histological staining of biological samples. Light: Science & Applications 12, (2023).10.1038/s 41377-023-01104-7PMC 998174036864032 · doi ↗ · pubmed ↗
- 7Rivenson, Y., Wang, H. & Wei, Z. e. a. Virtual histological staining of unlabelled tissue-autofluorescence images via deep learning. Nature Biomedical Engineering 3, (2019).10.1038/s 41551-019-0362-y 31142829 · doi ↗ · pubmed ↗
- 8Ng, A. & Jordan, M. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in neural information processing systems 14, (2001).
