Missing MRI Pulse Sequence Synthesis using Multi-Modal Generative Adversarial Network

Anmol Sharma; Ghassan Hamarneh

arXiv:1904.12200·eess.IV·October 3, 2019

TL;DR

This paper introduces a multi-modal GAN that synthesizes missing MRI pulse sequences by leveraging available sequences, reducing scan time and handling incomplete data for improved diagnosis and analysis.

Contribution

A novel multi-input, multi-output GAN architecture that effectively synthesizes missing MRI sequences using multi-modal data in a single forward pass.

Findings

01 Outperforms existing methods quantitatively and qualitatively.

02 Successfully synthesizes all missing sequences in various missing data scenarios.

03 Validated on two brain MRI datasets with four sequences each.

Tables

Table 1. Comparison between P2P, pGAN and MI-GAN. Values in boldface represent best performance values. Reported values are mean $\pm$ std.

Model	MSE	PSNR	SSIM
${P2P}_{T_{1}}$	0.0135 $\pm$ 0.0044	22.1168 $\pm$ 2.1001	0.8864 $\pm$ 0.0180
${pGAN}_{T_{1}}$	0.0107 $\pm$ 0.0048	23.8645 $\pm$ 2.8851	0.8992 $\pm$ 0.0203
${MI-GAN}_{T_{1}}$	0.0052 $\pm$ 0.0026	26.6057 $\pm$ 1.3801	0.9276 $\pm$ 0.0118
${P2P}_{T_{2}}$	0.0050 $\pm$ 0.0019	25.0606 $\pm$ 1.2020	0.8931 $\pm$ 0.0176
${pGAN}_{T_{2}}$	0.0050 $\pm$ 0.0033	25.4511 $\pm$ 1.6773	0.9008 $\pm$ 0.0250
${MI-GAN}_{T_{2}}$	0.0049 $\pm$ 0.0041	26.1233 $\pm$ 2.6630	0.9078 $\pm$ 0.0324

Table 2. Comparison with unimodal method REPLICA and multimodal method MM-Synthesis. The reported values are mean squared error (MSE). Boldface values represent lowest values of the three methods for a particular scenario.

Scenarios ($T_{1}$ $T_{2}$ DW)	REPLICA	MM-Synthesis	MM-GAN (Proposed)
- - ✓	0.278 $\pm$ 0.09	0.285 $\pm$ 0.13	0.210 $\pm$ 0.057
- ✓ -	0.374 $\pm$ 0.16	0.321 $\pm$ 0.12	0.279 $\pm$ 0.055
- ✓ ✓	0.235 $\pm$ 0.08	0.214 $\pm$ 0.09	0.182 $\pm$ 0.033
✓ - -	0.301 $\pm$ 0.11	0.249 $\pm$ 0.09	0.281 $\pm$ 0.071
✓ - ✓	0.225 $\pm$ 0.08	0.198 $\pm$ 0.02	0.191 $\pm$ 0.039
✓ ✓ -	0.271 $\pm$ 0.12	0.214 $\pm$ 0.08	0.254 $\pm$ 0.066
✓ ✓ ✓	0.210 $\pm$ 0.08	0.171 $\pm$ 0.06	0.182 $\pm$ 0.041
Mean	0.271 $\pm$ 0.10	0.236 $\pm$ 0.08	0.226 $\pm$ 0.046

Table 3. Performance on BraTS2018 High Grade Glioma (HGG) Cohort

Scenarios ($T_{1}$ $T_{2}$ $T_{1c}$ $T_{2f}$)	MSE	PSNR	SSIM
- - - ✓	0.0143 $\pm$ 0.0086	23.196 $\pm$ 4.2908	0.8973 $\pm$ 0.0668
- - ✓ -	0.0072 $\pm$ 0.0065	24.524 $\pm$ 4.0671	0.8984 $\pm$ 0.0726
- - ✓ ✓	0.0060 $\pm$ 0.0061	25.863 $\pm$ 3.2218	0.9166 $\pm$ 0.0339
- ✓ - -	0.0102 $\pm$ 0.0065	23.469 $\pm$ 4.1744	0.9074 $\pm$ 0.0680
- ✓ - ✓	0.0136 $\pm$ 0.0048	22.900 $\pm$ 2.1989	0.9156 $\pm$ 0.0260
- ✓ ✓ -	0.0073 $\pm$ 0.0070	24.792 $\pm$ 2.9524	0.9140 $\pm$ 0.0311
- ✓ ✓ ✓	0.0091 $\pm$ 0.0053	24.173 $\pm$ 3.2754	0.9228 $\pm$ 0.0190
✓ - - -	0.0072 $\pm$ 0.0056	24.879 $\pm$ 3.8216	0.9091 $\pm$ 0.0651
✓ - - ✓	0.0073 $\pm$ 0.0041	26.189 $\pm$ 2.1337	0.9264 $\pm$ 0.0328
✓ - ✓ -	0.0040 $\pm$ 0.0032	26.150 $\pm$ 1.8470	0.9107 $\pm$ 0.0275
✓ - ✓ ✓	0.0017 $\pm$ 0.0026	28.678 $\pm$ 2.3290	0.9349 $\pm$ 0.0262
✓ ✓ - -	0.0068 $\pm$ 0.0041	25.242 $\pm$ 2.0339	0.9175 $\pm$ 0.0275
✓ ✓ - ✓	0.0098 $\pm$ 0.0066	24.372 $\pm$ 2.2792	0.9239 $\pm$ 0.0375
✓ ✓ ✓ -	0.0033 $\pm$ 0.0040	26.397 $\pm$ 1.9733	0.9150 $\pm$ 0.0275
mean $\pm$ std	0.0082 $\pm$ 0.0054	24.789 $\pm$ 2.8999	0.9120 $\pm$ 0.0401

Table 4. Performance on BraTS2018 Low Grade Glioma (LGG) Cohort

Scenarios ($T_{1}$ $T_{2}$ $T_{1c}$ $T_{2f}$)	MSE	PSNR	SSIM
- - - ✓	0.0092 $\pm$ 0.0037	24.7832 $\pm$ 3.3197	0.8758 $\pm$ 0.0651
- - ✓ -	0.0072 $\pm$ 0.0032	25.7012 $\pm$ 2.9797	0.8925 $\pm$ 0.0486
- - ✓ ✓	0.0040 $\pm$ 0.0022	26.9985 $\pm$ 2.4075	0.9179 $\pm$ 0.0178
- ✓ - -	0.0108 $\pm$ 0.0044	23.6527 $\pm$ 3.4296	0.8811 $\pm$ 0.0560
- ✓ - ✓	0.0129 $\pm$ 0.0054	22.9238 $\pm$ 2.6554	0.8873 $\pm$ 0.0337
- ✓ ✓ -	0.0084 $\pm$ 0.0036	24.7581 $\pm$ 2.2131	0.8984 $\pm$ 0.0144
- ✓ ✓ ✓	0.0061 $\pm$ 0.0035	25.9841 $\pm$ 2.1926	0.9288 $\pm$ 0.0137
✓ - - -	0.0120 $\pm$ 0.0063	23.6018 $\pm$ 3.8153	0.8908 $\pm$ 0.0509
✓ - - ✓	0.0109 $\pm$ 0.0040	23.8408 $\pm$ 2.1715	0.9028 $\pm$ 0.0244
✓ - ✓ -	0.0102 $\pm$ 0.0030	23.9202 $\pm$ 2.0746	0.8792 $\pm$ 0.0178
✓ - ✓ ✓	0.0057 $\pm$ 0.0046	25.6005 $\pm$ 2.7909	0.9030 $\pm$ 0.0303
✓ ✓ - -	0.0128 $\pm$ 0.0048	22.1330 $\pm$ 1.8389	0.8885 $\pm$ 0.0164
✓ ✓ - ✓	0.0120 $\pm$ 0.0025	22.4980 $\pm$ 1.1684	0.9086 $\pm$ 0.0253
✓ ✓ ✓ -	0.0113 $\pm$ 0.0040	23.0852 $\pm$ 1.5142	0.8692 $\pm$ 0.0224
mean $\pm$ std	0.0095 $\pm$ 0.0039	24.2487 $\pm$ 2.4694	0.8946 $\pm$ 0.0312

Table 5. MM-GAN performance variation with respect to the number of missing sequences for the HGG and LGG cohorts.

Dataset	Missing	MSE	PSNR	SSIM
HGG	1	0.0539 $\pm$ 0.0215	29.1162 $\pm$ 0.9716	0.9268 $\pm$ 0.0115
	2	0.0602 $\pm$ 0.0170	28.9064 $\pm$ 0.8095	0.9223 $\pm$ 0.0090
	3	0.0752 $\pm$ 0.0125	28.0801 $\pm$ 0.4110	0.9087 $\pm$ 0.0071
LGG	1	0.1296 $\pm$ 0.0485	26.1362 $\pm$ 2.8872	0.9080 $\pm$ 0.0383
	2	0.1499 $\pm$ 0.0208	25.5641 $\pm$ 1.3161	0.8976 $\pm$ 0.0157
	3	0.1914 $\pm$ 0.0358	24.8987 $\pm$ 0.8363	0.8732 $\pm$ 0.0205

Equations

$$\theta_{\mathcal{G}}^{*}=\arg\min_{\theta_{\mathcal{G}}}\;\lambda\,\mathcal{L}_{1}\big(\mathcal{G}(X_{z}|\theta_{\mathcal{G}}),X_{r}\big)+(1-\lambda)\,\mathcal{L}_{2}\big(\mathcal{D}(X_{i},X_{r}|\theta_{\mathcal{D}}),L_{ar}\big).$$

$$\theta_{\mathcal{D}}^{*}=\arg\min_{\theta_{\mathcal{D}}}\;\mathcal{L}_{2}\big(\mathcal{D}(X_{r},X_{r}|\theta_{\mathcal{D}}),L_{ar}\big)+\mathcal{L}_{2}\big(\mathcal{D}(X_{r},X_{i}|\theta_{\mathcal{D}}),L_{r}\big).$$

Full text

Missing MRI Pulse Sequence Synthesis using Multi-Modal Generative Adversarial Network

Anmol Sharma, Ghassan Hamarneh

This work was partially supported by the NSERC-CREATE Bioinformatics 2018-2019 Scholarship. Anmol Sharma and Ghassan Hamarneh are with the Medical Image Analysis Laboratory, School of Computing Science, Simon Fraser University, Canada. e-mail: {asa224, hamarneh}@sfu.ca

Abstract

Magnetic resonance imaging (MRI) is being increasingly utilized to assess, diagnose, and plan treatment for a variety of diseases. The ability to visualize tissue in varied contrasts in the form of MR pulse sequences in a single scan provides valuable insights to physicians and enables automated systems that perform downstream analysis. However, many issues such as prohibitive scan time, image corruption, different acquisition protocols, or allergies to certain contrast materials may hinder the process of acquiring multiple sequences for a patient. This poses challenges to both physicians and automated systems since complementary information provided by the missing sequences is lost. In this paper, we propose a variant of the generative adversarial network (GAN) capable of leveraging redundant information contained within multiple available sequences in order to generate one or more missing sequences for a patient scan. The proposed network is designed as a multi-input, multi-output network which combines information from all the available pulse sequences and synthesizes the missing ones in a single forward pass. We demonstrate and validate our method on two brain MRI datasets, each with four sequences, and show the applicability of the proposed method in simultaneously synthesizing all missing sequences in any possible scenario where either one, two, or three of the four sequences may be missing. We compare our approach with competing unimodal and multi-modal methods, and show that we outperform both quantitatively and qualitatively.

Index Terms:

generative adversarial networks, multi-modal, missing modality, pulse sequences, MRI, synthesis.

I Introduction

Medical imaging forms the backbone of modern healthcare systems, providing means to assess, diagnose, and plan treatments for a variety of diseases. Imaging techniques such as computed tomography (CT), magnetic resonance imaging (MRI), and X-ray have been in use for many decades. Of these, MRI is particularly interesting in that a single MRI scan is a grouping of multiple pulse sequences, each of which provides varying tissue contrast views and spatial resolutions, without the use of ionizing radiation. These sequences are acquired by varying the spin echo and repetition times during scanning, and are widely used to show pathological changes in internal organs and the musculoskeletal system. Some of the commonly acquired sequences are $T_{1}$-weighted, $T_{2}$-weighted, $T_{1}$-with-contrast-enhanced ($T_{1c}$), and $T_{2}$-fluid-attenuated inversion recovery ($T_{2flair}$), though there exist many more [1].

A combination of sequences provides both redundant and complementary information to the physician about the imaged tissue, and certain diagnoses are best performed when a particular sequence is observed. For example, $T_{1}$ and $T_{2flair}$ sequences provide clear delineations of the edema region of the tumor in the case of glioblastoma, $T_{1c}$ provides clear demarcation of the enhancing region around the tumor used as an indicator to assess growth/shrinkage, and the $T_{2flair}$ sequence is used to detect white matter hyperintensities for diagnosing vascular dementia (VD) [2].

In clinical settings, however, it is common to have MRI scans acquired using varying protocols, and hence varying sets of sequences per patient. Sequences which are routinely acquired may be unusable or missing altogether due to scan corruption, artifacts, incorrect machine settings, allergies to certain contrast agents and limited available scan time [3, 4, 5]. This phenomenon is problematic for many downstream data analysis pipelines that assume presence of a certain set of pulse sequences to perform their task. For instance, most of the segmentation methods [6, 7, 8, 9, 10] proposed for brain MRI scans depend implicitly on the availability of a certain set of sequences in their input in order to perform the task. Most of these methods are not designed to handle missing inputs, and hence may fail in the event where some or most of the sequences may be absent.

Modifying existing pipelines in order to handle missing sequences is hard, and may lead to performance degradation. Also, the option of redoing a scan to acquire the missing/corrupted sequence is impractical due to the expensive nature of the acquisition, longer wait times for patients with non-life-threatening cases, the need for registration between old and new scans, and rapid changes in the anatomy of the imaged area between scans due to highly active abnormalities such as glioblastoma. Hence there is a clear advantage in retrieving any missing sequence, or an estimate thereof, without having to redo the scan or change the downstream pipelines.

To this end, we propose a multi-modal generative adversarial network (MM-GAN) which is capable of synthesizing missing sequences by combining information from all available sequences. The proposed method exhibits the ability to synthesize, with high accuracy, all the required sequences which are deemed missing in a single forward pass through the network. The term “multi-modal” simply refers to the fact that the GAN can take multiple modalities of available information as input, which in this case are different pulse sequences. Just as the input is multi-modal, the output of our method is multi-modal, containing synthesized versions of the missing sequences. Since most of the downstream analysis pipelines commonly target $C=4$ pulse sequences $S=\{T_{1},T_{1c},T_{2},T_{2flair}\}$ as their input [11, 12, 7], we design our method around the same number of sequences, although we note that our method can be generalized to any number $C$ and set $S$ of sequences. The input to our network is a 4-channel (corresponding to $C=4$ sequences) 2D axial slice, where a zero image is imputed for channels corresponding to missing sequences. The output of the network is a 4-channel 2D axial slice, in which the originally missing sequences are synthesized by the network.
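As a small illustration of what "any possible scenario" means for $C=4$, the sketch below enumerates every availability pattern in which at least one sequence is present and at least one is missing. The names `SEQUENCES` and `missing_scenarios` are hypothetical helpers introduced here for illustration; they are not part of the authors' code.

```python
from itertools import product

# Sequence names in an assumed channel order, for illustration only.
SEQUENCES = ("T1", "T2", "T1c", "T2flair")

def missing_scenarios(num_sequences=len(SEQUENCES)):
    """List availability masks with at least one available and one missing sequence.

    Each mask is a tuple of booleans; True marks an available sequence.
    """
    return [
        mask
        for mask in product((False, True), repeat=num_sequences)
        if any(mask) and not all(mask)
    ]

scenarios = missing_scenarios()

# Count how many scenarios have one, two, or three sequences missing.
by_missing_count = {}
for mask in scenarios:
    n_missing = mask.count(False)
    by_missing_count[n_missing] = by_missing_count.get(n_missing, 0) + 1
```

With four sequences this gives $2^{4}-2=14$ scenarios (4 with one sequence missing, 6 with two, 4 with three), which matches the 14 scenario rows reported for each BraTS2018 cohort.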

The rest of the paper is organized as follows: Section II presents a review of the MR sequence synthesis literature. Section III provides an overview of the key contributions of this work. Section IV presents the proposed method in detail. Section V provides details about the method implementation, datasets used, as well as outlines experimental setup for the current work. Section VI discusses the results and observations for the proposed method, and finally the paper is concluded in Section VII.

II Related Work

There has been an increased amount of interest in developing methods for synthesizing MR pulse sequences [13, 14, 15, 16, 17, 18, 19, 20, 2, 21, 22, 5, 23, 24, 25, 26]. We present a brief overview of previous work in this field in two sections: unimodal, where both the input and output of the system are a single pulse sequence (one-to-one); and multimodal, where methods are able to leverage multiple input sequences to synthesize a single sequence (many-to-one) or multiple sequences (many-to-many).

II-A Unimodal Synthesis

In unimodal synthesis (one-to-one), a common strategy includes building an atlas or a database that maps intensity values between given sequences. Jog et al. [15] used a bagged ensemble of regression trees trained from an atlas. The training data $(\mathcal{A}_{1},\mathcal{A}_{2})$ consisted of multiple image patches $\mathcal{A}_{1}$ around a voxel $i$ in a source sequence, and a single intensity value at the same voxel in a target sequence, as $\mathcal{A}_{2}$. Using image patches to predict the intensity value of a single voxel in the output sequence allows a many-to-many relationship between the intensity values of the input and target sequences to be represented. Ye et al. [14] propose an inverse method, which performs a local patch-based search in a database for every voxel in the target pulse sequence. Once the patches are found, they are “fused” together using a data-driven regularization approach. Another atlas-based method was proposed in [19], where a $T_{2}$ whole-head sequence (including skull, eyes, etc.) is synthesized from the available $T_{1}$ images. The synthesized $T_{2}$ sequence is used to correct distortion in diffusion-weighted MR images by using it as a template for registration, in the absence of a real $T_{2}$ sequence. Yawen et al. [20] leverage joint dictionary learning (JDL) for synthesizing any unavailable MRI sequence from available MRI data. JDL is performed by minimizing the inconsistency between statistical distributions of the dictionary codes for input MRI sequences while preserving the geometrical structure of the input image.

Supervised machine learning and deep learning (DL) based methods have also been employed in sequence synthesis pipelines. A 3D continuous-valued conditional random field (CRF) is proposed in [18] to synthesize $T_{2}$ images from $T_{1}$. The synthesis step is encoded as a maximum a posteriori (MAP) estimate of Gaussian distribution parameters built from a learnt regression tree. Nguyen et al. [27] were among the first to employ DL, in the form of a location-sensitive deep network (LSDN), for sequence synthesis. LSDN predicts the intensity value of the target voxel by using voxel-centered patches extracted from an input sequence. The network models the responses of hidden nodes as a product of feature and spatial responses. Similarly, Bowles et al. [2] generate “pseudo-healthy” images by performing voxel-wise kernel regression instead of deep networks to learn local relationships between intensities in $T_{1}$ and $T_{2flair}$ sequences of healthy subjects. Since most of the methods were based on local features in the form of patches and did not leverage global features of the input sequence, Sevetlidis et al. [21] proposed an encoder-decoder style deep neural network trained layer-wise using restricted Boltzmann machine (RBM) based training. The method utilized the global context of the input sequence by taking a full slice as input. Recently, Jog et al. [22] proposed a random forest based method that learns an intensity mapping between input patches centered around a voxel extracted from a single pulse sequence, and the intensity of the corresponding voxel in the target sequence. The method utilized multi-resolution patches by building a Gaussian pyramid of the input sequence. Yu et al. [26] propose a unimodal GAN architecture to synthesize missing pulse sequences in a one-to-one setting. The approach uses an edge detection module that tries to preserve the high-frequency edge features of the input sequence in the synthesized sequence. Recently, Ul Hassan Dar et al. [28] proposed using a conditional GAN to synthesize missing MR pulse sequences in a unimodal setting for two sequences, $T_{1}$ and $T_{2}$.

II-B Multimodal Synthesis

Multimodal synthesis has been a relatively new and unexplored avenue in the MR synthesis literature. One of the first multi-input, single-output (many-to-one) methods was proposed by Jog et al. [16]: a regression-based approach to reconstruct the $T_{2flair}$ sequence using combined information from $T_{1}$, $T_{2}$, and proton density (PD) sequences. Reconstruction is performed by a bagged ensemble of regression trees predicting the $T_{2flair}$ voxel intensities. Chartsias et al. [5] were among the first to propose a multi-input, multi-output (many-to-many) encoder-decoder based architecture for sequence synthesis, although their multimodal method is tested only with a single output ($T_{2flair}$), i.e. in a many-to-one setting. Their network is trained using a combination of three loss functions, and uses a feature fusion step in the middle that separates the encoders and decoders. Olut et al. [23] present a GAN-based framework to generate a magnetic resonance angiography (MRA) sequence from available $T_{1}$ and $T_{2}$ sequences. The method uses a novel loss function formulation, which preserves and reproduces vascularities in the generated images. Although for a different application, Mehta et al. [24] proposed a multi-task, multi-input, multi-output 3D CNN that outputs a segmentation mask of the tumor, as well as a synthesized version of the $T_{2flair}$ sequence. The main aim remains to predict the tumor segmentation mask from three available sequences $T_{1}$, $T_{2}$, and $T_{1c}$, and no quantitative results for $T_{2flair}$ synthesis using $T_{1}$, $T_{2}$, and $T_{1c}$ are provided.

Though all the methods discussed above propose a multi-input method, none of the methods have been proposed to synthesize multiple missing sequences (multi-output), and in one single pass. All three methods [16], [5], and [24] synthesize only one sequence (either $T_{2flair}$ or $T_{2}$ , many-to-one setting) in the presence of varying number of input sequences, while [23] only synthesizes MRA using information from multiple inputs (many-to-one). Although the work presented in [23] is close to our proposed method, theirs is not a truly multimodal network (many-to-many), since there is no empirical evidence that their method will generalize to multiple scenarios. Similarly, the framework proposed in [5] can theoretically work in a many-to-many setting, but no empirical results are given to demonstrate its scalability and applicability in a variety of different scenarios, as we do in this work. The authors briefly touch upon this by adding a new decoder to already trained many-to-one network, but do not explore it any further. To the best of our knowledge, we are the first to propose a method that is capable of synthesizing multiple missing sequences using a combination of various input sequences (many-to-many), and demonstrate the method on the complete set of scenarios (i.e., all combinations of missing sequences).

The main motivation for most synthesis methods is to retain the ability to meaningfully use some downstream analysis pipelines like segmentation or classification despite the partially missing input. However, there have been efforts by researchers working on those analysis pipelines to bypass any synthesis step by making the analysis methods themselves robust to missing sequences. Most notably, Havaei et al. [3] and Varsavsky et al. [4] provide methods for tumor segmentation using brain MRI that are robust to missing sequences [3], or to missing sequence labels [4]. Although these methods bypass the requirement of having a synthesis step before the actual downstream analysis, the performance of these robust versions of analysis pipelines often does not match the state-of-the-art performance of other non-robust methods when all sequences are present. This is because the methods not only have to learn how to perform the task (segmentation/classification) well, but also how to handle any missing input data. This two-fold objective for a single network raises a trade-off between robustness and performance.

III Contributions

The following are the key contributions of this work:

1. We propose the first empirically validated multi-input multi-output MR pulse sequence synthesizer capable of synthesizing missing pulse sequences using any combination of available sequences as input without the need for tuning or retraining of models, in a many-to-many setting.

2. The proposed method is capable of synthesizing any combination of target missing sequences as output in one single forward pass, and requires only a single trained model for synthesis. This provides significant savings in terms of computational overhead during training time compared to training multiple models in the case of unimodal and multi-input single-output methods.

3. We propose to use implicit conditioning (IC), a combination of three design choices, namely imputation in place of missing sequences for input to generator, sequence-selective loss computation in the generator, and sequence-selective discrimination. We show that IC improves overall quantitative synthesis performance of generator compared to the baseline approach without IC.

4. To the best of our knowledge, we are the first to incorporate curriculum learning based training for GAN by varying the difficulty of examples shown to the network during training.

5. Through experiments, we show that we outperform both the current state-of-art in unimodal (REPLICA [22] and pGAN [28]), as well as the multi-input single-output synthesis (MM-Synthesis [5]) method. We also set up new benchmarks on a complete set of scenarios using the BraTS2018 dataset.

IV Methodology

IV-A Background

Generative adversarial networks (GANs) were first proposed by Goodfellow et al. [29] in order to generate realistic looking images. A GAN is typically built using a combination of two networks: a generator ($\mathcal{G}$) and a discriminator ($\mathcal{D}$). The generator network is tasked with generating realistic data, typically by learning a mapping from a random vector $z$ to an image $I$, $\mathcal{G}:z\rightarrow I$, where $I$ is said to belong to the generator’s distribution $p_{\mathcal{G}}$. The discriminator $\mathcal{D}:I\rightarrow t$ maps its input $I$ to a target label $t\in\{0,1\}$, where $t=0$ if $I\in p_{\mathcal{G}}$, i.e. a fake image generated by $\mathcal{G}$, and $t=1$ if $I\in p_{r}$, where $p_{r}$ is the distribution of real images. A variant of GANs, called conditional-GAN (cGAN) [30], proposes a generator that learns a mapping from a random vector $z$ and a class label $y$ to an output image $I\in p_{\mathcal{G}}$, $\mathcal{G}:(z,y)\rightarrow I$. Another variant of cGAN called Pix2Pix [31] develops a GAN in which the generator learns a mapping from an input image $x\in p_{r}$ to an output image $I\in p_{\mathcal{G}}$, $\mathcal{G}:x\rightarrow I$, and the discriminator learns a mapping from two input images, $x_{1}$ and $x_{2}$, to $T$, $\mathcal{D}:(x_{1},x_{2})\rightarrow T$. $x_{1}$ and $x_{2}$ may belong to either $p_{r}$ (real) or $p_{\mathcal{G}}$ (fake). The output $T$ in this case is not a single class label, but a binary prediction tensor representing whether each $N\times N$ patch in the input image is real or fake [31].

A GAN is trained in an adversarial setting, where the generator (parameterized by $\theta_{\mathcal{G}}$ ) is trained to synthesize realistic output that can “fool” the discriminator into classifying them as real, and the discriminator (parameterized by $\theta_{\mathcal{D}}$ ) is trained to accurately distinguish between real data and fake data synthesized by the generator. GAN input/outputs can be images [31], text [32] or even music [33]. Both the generator and discriminator act as adversaries to each other, hence the training formulation forces both networks to continuously get better at their tasks. GANs found tremendous success in a variety of different tasks, ranging from face-image synthesis [34], image stylization [35], future frame prediction in videos [36], text-to-image synthesis [32] and synthesizing scene images using scene attributes and semantic layout [37]. GANs have also been utilized in medical image analysis [38], particularly for image segmentation [39, 40, 41], normalization [42], synthesis [23, 26, 28] as well as image registration [43].

IV-B Proposed Method

We propose a variant of the Pix2Pix architecture [31] called Multi-Modal Generative Adversarial Network (MM-GAN) for the task of synthesizing missing MR pulse sequences in a single forward pass while leveraging all available sequences. The following subsections outline the detailed architecture of our model.

IV-B1 Generator

The generator of the proposed method is a UNet [44], which has proven useful in a variety of segmentation and synthesis tasks due to its contracting and expanding paths in the form of encoder and decoder subnetworks. The architecture is illustrated in Figure 1. The convolution kernel size for each layer in the generator is set to $4\times 4$. The generator network is a combination of UNetUp and UNetDown blocks. The input to the generator is a 2D axial slice from a patient scan with $C=4$ channels representing four pulse sequences, and a spatial size of $256\times 256$ pixels. The network is designed with a fixed input size of 4 channels, where channels $C=0,1,2,\text{and }3$ correspond to $T_{1}$, $T_{2}$, $T_{1c}$, and $T_{2flair}$, respectively. Hence for any $C$-sequence 3D scan, the proposed method works sequentially on $C$-channel 2D axial slices. In order to synthesize missing sequences, the channels corresponding to each missing sequence are imputed with zeros. The imputed version (along with the real sequences) becomes the input to the generator and is represented by $X_{z}$. For instance, if sequences $T_{1}$ and $T_{2}$ are missing, channels $C=0$ and $C=1$ in the input image are imputed with a zero image of size $256\times 256$. The output of the generator is given by $\mathcal{G}(X_{z}|\theta_{\mathcal{G}})$ and is of the same size as the input. By design, the generator always outputs 4 channels; however, as we outline in the subsequent text, the output channels corresponding to the existing real sequences are not used for loss computation and are replaced with the real sequences before being relayed as input to the discriminator. During training, the ground truth image $X_{r}$ (short for “real”), which is of the same size as $X_{z}$, contains all ground truth sequences at their respective channel indices. We use the term “image” for a single 2D slice with 4 channels.

MM-GAN observes both an input image and imputed zeros $z$ in the form of $X_{z}$, in contrast to vanilla Pix2Pix where the generator is conditioned just on an observed image $x$. The reasons behind this design choice are discussed in subsection IV-B3. We also investigated different imputation strategies in Suppl. Mat. Section I-A, and found that zero-based imputation performs the best quantitatively.
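The construction of $X_{z}$ described above can be sketched as follows. This is a minimal NumPy illustration under the channel order stated in the text ($T_{1}$, $T_{2}$, $T_{1c}$, $T_{2flair}$ at indices 0 to 3); the function name `impute_missing` and the random stand-in data are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def impute_missing(x_r, missing):
    """Build X_z from a ground-truth slice x_r of shape (4, H, W).

    `missing` holds the channel indices treated as absent. Those channels
    are replaced with zero images; all other channels are copied unchanged.
    """
    x_z = x_r.copy()
    for k in missing:
        x_z[k] = 0.0
    return x_z

# Random stand-in for a 4-channel 256x256 axial slice (not real MRI data).
rng = np.random.default_rng(0)
x_r = rng.random((4, 256, 256), dtype=np.float32)

# Scenario from the text: T1 (channel 0) and T2 (channel 1) are missing.
x_z = impute_missing(x_r, missing={0, 1})
```

Because the input always has four channels, the same network can be fed any of the missing-sequence scenarios without architectural changes.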

To optimize $\theta_{\mathcal{G}}$ , our generator adopts the general form of the generator loss in Pix2Pix, which is a combination of a reconstruction loss $\mathcal{L}_{1}$ and an adversarial loss $\mathcal{L}_{2}$ used to train the generator to fool the discriminator, i.e.

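$$\theta_{\mathcal{G}}^{*}=\arg\min_{\theta_{\mathcal{G}}}\;\lambda\,\mathcal{L}_{1}\big(\mathcal{G}(X_{z}|\theta_{\mathcal{G}}),X_{r}\big)+(1-\lambda)\,\mathcal{L}_{2}\big(\mathcal{D}(X_{i},X_{r}|\theta_{\mathcal{D}}),L_{ar}\big).$$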

To calculate $\mathcal{L}_{1}$ , we select synthesized sequences from $\mathcal{G}(X_{z}|\theta_{\mathcal{G}})$ , that were originally missing, and compute the L1 norm of the difference between the synthesized versions of the sequence and the available ground truth from $X_{r}$ . Mathematically, given the set $K$ containing the indices of missing sequences (e.g. $K=\{0,2\}$ when $T_{1}$ and $T_{1c}$ are missing) in the current input, we calculate $\mathcal{L}_{1}$ only for the sequences that are missing ( $k=0,2$ ), and sum the values.
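A minimal sketch of this sequence-selective reconstruction term is given below. The function name `selective_l1` is hypothetical, and the per-channel reduction (mean absolute difference rather than a raw sum over pixels) is an assumption of this example; the text only states that the L1 terms of the missing sequences are summed.

```python
import numpy as np

def selective_l1(g_out, x_r, missing):
    """Sum per-sequence L1 terms over the missing channel indices K only.

    g_out, x_r: arrays of shape (4, H, W). Channels not in `missing`
    contribute nothing, however wrong the generator output is there.
    The per-channel mean is an assumed reduction.
    """
    return sum(float(np.abs(g_out[k] - x_r[k]).mean()) for k in sorted(missing))

x_r = np.ones((4, 8, 8))
g_out = x_r.copy()
g_out[0] -= 0.5   # error on a missing channel: counted
g_out[1] += 2.0   # error on an available channel: ignored for K = {0, 2}

loss = selective_l1(g_out, x_r, missing={0, 2})
```

In this toy case only the channel-0 error contributes, so the large deviation on the available channel 1 does not affect the loss.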

To calculate $\mathcal{L}_{2}$, we compute the squared L2 norm of the difference between the discriminator’s predictions $\mathcal{D}(X_{i},X_{r}|\theta_{\mathcal{D}})$ and a dummy ground truth tensor $L_{ar}$ of the same size as the output of $\mathcal{D}$. In order to encourage the generator to synthesize sequences that confuse or “fool” the discriminator into predicting they are real, we set all entries of $L_{ar}$ to ones, masquerading all generated sequences as real. $X_{i}$ is introduced in the next section.

The choice of L1 as a reconstruction loss term for the generator is motivated by its ability to prevent too much blurring in the final synthesized sequences, as compared to using an L2 loss (similar to [31]).
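Putting the two terms together, the generator objective can be sketched as below, assuming the terms are weighted as $\lambda\mathcal{L}_{1}+(1-\lambda)\mathcal{L}_{2}$. The value `lam=0.9` is a placeholder chosen for this example (the text here does not state $\lambda$), the mean reduction of the squared error is an assumption, and `generator_loss` is a hypothetical name.

```python
import numpy as np

def generator_loss(l1_term, d_pred, lam=0.9):
    """Weighted sum of reconstruction and adversarial terms for the generator.

    l1_term: precomputed selective L1 reconstruction loss (a float).
    d_pred:  discriminator prediction tensor, e.g. shape (4, 16, 16).
    lam:     weighting factor; 0.9 is a placeholder, not a value from the text.
    """
    l_ar = np.ones_like(d_pred)                    # all-real dummy target L_ar
    adv = float(np.mean((d_pred - l_ar) ** 2))     # squared-error adversarial term
    return lam * l1_term + (1.0 - lam) * adv

# Toy discriminator output: undecided (0.5) everywhere.
d_pred = np.full((4, 16, 16), 0.5)
loss = generator_loss(l1_term=1.0, d_pred=d_pred)
```

The adversarial term vanishes only when the discriminator predicts "real" for every patch of every channel, which is what the all-ones target encodes.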

IV-B2 Discriminator

We use the PatchGAN architecture [31] for the discriminator part of our MM-GAN. PatchGAN architecture learns to take into account the local characteristics of its input, by predicting a real/fake class for every $N\times N$ patch of its input, compared to classic GANs where the discriminator outputs a single real/fake prediction for the whole input image. This encourages the generator to synthesize images not just with proper global features (shape, size), but also with accurate local features (texture, distribution, high-frequency details). In our case we set $N=16$ .

The discriminator is built using four blocks followed by a zero padding layer and a final convolutional layer (Figure 1). The convolutional kernel sizes, stride, and padding are identical to the values used in the generator (subsection IV-B1). Due to the possibility of having a varying number of sequences missing, instead of providing just the synthesized sequences and their real counterparts as input to the discriminator, we first create a modified version of $\mathcal{G}(X_{z}|\theta_{\mathcal{G}})$ by dropping the reconstructed sequences that were originally present, and replacing them with the original sequences from $X_{r}$. The modified version of $\mathcal{G}(X_{z}|\theta_{\mathcal{G}})$ is represented by $X_{i}$, short for “imputed”. The input to the discriminator is a concatenation of $X_{i}$ and $X_{r}$. This is also illustrated in Figure 1.
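The construction of $X_{i}$ and of the discriminator input can be sketched as follows. The function names are hypothetical, and concatenating along the channel axis is an assumption of this example; the text only says the two images are concatenated.

```python
import numpy as np

def build_x_i(g_out, x_r, missing):
    """Keep synthesized channels only where the sequence was missing.

    Channels that were available in the input are taken from the real
    image x_r instead of the generator output g_out.
    """
    x_i = x_r.copy()
    for k in missing:
        x_i[k] = g_out[k]
    return x_i

def discriminator_input(x_i, x_r):
    """Concatenate X_i and X_r along the channel axis (assumed axis 0)."""
    return np.concatenate([x_i, x_r], axis=0)

rng = np.random.default_rng(0)
x_r = rng.random((4, 256, 256))      # stand-in ground truth
g_out = rng.random((4, 256, 256))    # stand-in generator output

x_i = build_x_i(g_out, x_r, missing={0, 2})
d_in = discriminator_input(x_i, x_r)
```

Under this assumption the discriminator sees an 8-channel image in which channels 1 and 3 of $X_{i}$ are identical to their real counterparts.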

The discriminator is trained to output a 2D patch of size $16\times 16$ pixels, with 4 channels corresponding to each sequence. In order to supervise the discriminator during training, we use a 4-channel 2D image based target, in which each channel corresponds to a sequence. More specifically, given missing sequences $K$ (e.g., $K=\{0,2\}$ , $T_{1}$ and $T_{1c}$ missing), the target (i.e. ground truth) variable for $\mathcal{D}$ is $L_{r}^{k}=\{0.0\;\text{(fake)}\;\text{if}\;k\in K,\;\text{else}\;1.0\;\text{(real)}\}$ . Note that $L_{r}^{k}$ is a 2D tensor of size $16\times 16$ (since each 256 $\times$ 256 image is divided into 16 $\times$ 16 patches) yet the assignment of 0.0 or 1.0 represents an assignment to the whole $16\times 16$ $L_{r}^{k}$ tensor (since the whole image is either real or fake and not patch-specific). This is also illustrated in Figure 1.
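As an illustration of the target described above, the following sketch builds $L_{r}$ for a given set of missing sequences $K$. The helper name `discriminator_target` and its default arguments are hypothetical; only the shapes and the 0.0/1.0 convention come from the text.

```python
import numpy as np

def discriminator_target(missing, num_sequences=4, grid=16):
    """Return L_r: one (grid x grid) map per sequence, 0.0 if fake, 1.0 if real."""
    idx = np.array(sorted(missing), dtype=int)
    target = np.ones((num_sequences, grid, grid), dtype=np.float32)  # 1.0 = real
    target[idx] = 0.0                                                # 0.0 = fake
    return target

l_r = discriminator_target({0, 2})   # K = {0, 2}: T1 and T1c missing
print(l_r.shape)                     # (4, 16, 16)
print(l_r[:, 0, 0])                  # [0. 1. 0. 1.]
```

Every entry within a channel shares the same label, reflecting that a whole sequence is either real or synthesized.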

Between the output of discriminator $\mathcal{D}(X_{i},X_{r}|\theta_{\mathcal{D}})$ and $L_{r}$ , an L2 loss is computed. The final discriminator loss becomes:

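$$\mathcal{L}_{\mathcal{D}}=\mathcal{L}_{2}\left(\mathcal{D}(X_{i},X_{r}|\theta_{\mathcal{D}}),\,L_{r}\right)$$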

This is equivalent to a least-squares GAN since the loss function incorporates an L2 loss.
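The least-squares objective can be illustrated numerically. The sketch below is hedged: it assumes the L2 loss is averaged over all entries (the text only specifies a squared L2 norm), and the discriminator outputs are hypothetical constants rather than predictions from a trained network.

```python
import numpy as np

def l2_loss(prediction, target):
    """Mean squared difference (averaging is an assumption of this sketch)."""
    return float(np.mean((prediction - target) ** 2))

# Targets for K = {0, 2}: L_r labels synthesized channels as fake (0.0),
# while L_ar (used in the generator's adversarial term) is all ones.
l_r = np.ones((4, 16, 16))
l_r[[0, 2]] = 0.0
l_ar = np.ones_like(l_r)

undecided = np.full_like(l_r, 0.5)   # hypothetical D output: 0.5 everywhere
perfect = l_r.copy()                 # hypothetical D output: always correct

print(l2_loss(undecided, l_r), l2_loss(undecided, l_ar))   # 0.25 0.25
print(l2_loss(perfect, l_r), l2_loss(perfect, l_ar))       # 0.0 0.5
```

When the discriminator is perfect its own loss vanishes while the generator's adversarial term is large, which is the tension that drives adversarial training.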

IV-B3 Implicit conditioning

Due to the inherent design of deep learning architectures, the input as well as the output of a convolutional neural network model has to have a fixed channel dimension. In our use case however, both the input and output channel dimensions vary (since the number of available sequences can vary). In order to address this problem, we propose a combination of three design choices, which we collectively refer to as implicit conditioning (IC). In IC, the varying input channels problem is solved by imputing a fixed value (zeros) to the input channels where the sequences are missing. For the problem of the generator output size being fixed in the channel dimension, one possible approach is to synthesize all four input sequences. The loss function can then be calculated between the four ground truth sequences and the four synthesized sequences. However, this poses a challenge for the generator, as it is burdened with generating all sequences, including reconstructing the ones that were provided as input. In order to address this, we propose selective loss computation in $\mathcal{G}$, where the loss is only calculated between the ground truth sequences that were missing and the corresponding output channels of the generator. In conjunction, we also propose selective discrimination in $\mathcal{D}$, which ensures stability during training by preventing the discriminator from overpowering the generator. We also show that IC-based training outperforms the baseline training methodology of generating all sequences and penalizing inaccurate synthesis of every one of them (Suppl. Mat. Section I-B). The design choices are individually summarized below.

Input imputation: The input $X_{z}$ of the generator always contains an imputed value ( $z=$ zeros) in place of the missing sequences which acts as a way to condition the generator and informs which sequence(s) to synthesize.

Selective loss computation in $\mathcal{G}$ : In conjunction, the $\mathcal{L}_{1}(\mathcal{G})$ loss is computed only on the synthesized (originally missing) sequences and then backpropagated. This allows the generator to focus on synthesizing the missing sequences properly, while ignoring its performance on the sequences that were already present.

Selective discrimination in $\mathcal{D}$ : Imputing real sequences at the generator output (i.e. $X_{i}$ ) before providing it as discriminator input forces the discriminator to accurately learn to delineate only between the synthesized sequences and their real counterparts. Since the generator loss function also has a term that tries to fool the discriminator, this allows selective backpropagation into the generator where it is penalized only for incorrectly synthesizing the missing sequences, and not for incorrectly synthesizing the sequences that were already present. This relieves the generator of the difficult task of synthesizing all sequences in the presence of some sequences.
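The first two design choices can be sketched together. This is a minimal NumPy illustration under stated assumptions: `impute_input` and `selective_l1` are hypothetical helper names, and the L1 loss is averaged over the selected channels, which the text does not specify.

```python
import numpy as np

def impute_input(real, missing, value=0.0):
    """Input imputation: build X_z by filling missing channels with a constant."""
    x_z = real.copy()
    x_z[np.array(sorted(missing), dtype=int)] = value
    return x_z

def selective_l1(generated, real, missing):
    """Selective loss: L1 error computed only on the channels that were missing."""
    idx = np.array(sorted(missing), dtype=int)
    return float(np.mean(np.abs(generated[idx] - real[idx])))

rng = np.random.default_rng(0)
x_r = rng.random((4, 8, 8))            # toy real sequences
x_z = impute_input(x_r, missing={3})   # last sequence missing, imputed with zeros

# Hypothetical generator output: perfect on the missing channel,
# badly wrong on a channel that was already available as input.
g_out = x_r.copy()
g_out[0] += 10.0

print(selective_l1(g_out, x_r, missing={3}))   # 0.0
```

The error on channel 0 does not contribute to the loss, which is the sense in which the generator is relieved of reconstructing its own inputs.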

IV-B4 Curriculum learning

In order to train our proposed method we use a curriculum learning (CL) [45] based approach. In CL based training, the network is initially shown easier examples followed by increasingly difficult examples as the training progresses. We hypothesized that CL can benefit in training of MM-GAN due to an ordering in the level of difficulty across the various scenarios that the network has to handle. If some cases are “easier” than others, it might be useful if the easier cases are shown first to the network in order to allow the network to effectively learn when ample supervision is available. As the network trains, “harder” cases can be introduced so that the network adapts without diverging. In our work, scenarios with 1 sequence missing are considered “easy”, followed by a “moderate” set of scenarios with 2 missing sequences, and lastly, the scenarios with 3 missing sequences are considered “hard”. We adopted this ordering in our work, and showed the network easier examples first, followed by moderate and finally hard examples. After a threshold of 30 epochs, we show every scenario with uniform probability.
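A possible scenario sampler for this curriculum is sketched below. It assumes zero-based epoch indices and a difficulty step of 10 epochs (the step size is the one stated later in the training details); the function name `sample_scenario` and the tuple encoding (1 = sequence present, 0 = sequence missing) are illustrative choices.

```python
import random
from itertools import product

NUM_SEQUENCES = 4
# Every presence/absence pattern except "all missing" and "all present".
SCENARIOS = [bits for bits in product((0, 1), repeat=NUM_SEQUENCES)
             if 0 < sum(bits) < NUM_SEQUENCES]

def sample_scenario(epoch, rng, step=10, threshold=30):
    """Pick a scenario: easy -> moderate -> hard, then uniform over all."""
    if epoch >= threshold:
        return rng.choice(SCENARIOS)
    n_missing = min(epoch // step + 1, NUM_SEQUENCES - 1)
    pool = [s for s in SCENARIOS if s.count(0) == n_missing]
    return rng.choice(pool)

rng = random.Random(0)
for epoch in (0, 10, 20, 30):
    print(epoch, sample_scenario(epoch, rng))
```

Before the threshold, epochs 0-9 draw only single-missing scenarios, epochs 10-19 draw two-missing scenarios, and epochs 20-29 draw three-missing scenarios.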

V Experimental Setup

In this section we describe different aspects of the experiments that are performed in this work.

V-A Datasets

In order to validate our method we use brain MRI datasets from two sources, namely the Ischemic Stroke Lesion Segmentation Challenge 2015 (ISLES2015) [46] and the Multimodal Brain Tumor Segmentation Challenge 2018 (BraTS2018) [47].

1) ISLES2015 dataset is a publicly available database with multi-spectral MR images [46]. We choose the sub-acute ischemic stroke lesion segmentation (SISS) cohort of patients, which contains 28 training and 36 testing cases. The patient scans are skull stripped using BET2 [48], and resampled to an isotropic resolution of $1$ mm$^{3}$. Each scan consists of four sequences, namely $T_{1}$, $T_{2}$, DWI, and $T_{2flair}$, which are rigidly co-registered to the $T_{2flair}$ sequence using the elastix toolbox [49]. More information about the preprocessing steps can be found in the original publication [46]. We use 22 patients from the SISS training set for experiments.

2) BraTS2018 consists of a total of 285 patient MR scans acquired from 19 different institutions, divided into two cohorts: glioblastoma/high grade glioma (GBM/HGG) and low grade glioma (LGG). Each patient scan contains four pulse sequences: $T_{1}$, $T_{2}$, $T_{1c}$, and $T_{2flair}$. All scans are resampled to $1$ mm$^{3}$ isotropic resolution using a linear interpolator, skull stripped, and co-registered with a single anatomical template using a rigid registration model with a mutual information similarity metric. Detailed preprocessing information can be found in [47]. In order to demonstrate our method’s ability to synthesize sequences with both high grade and low grade glioma tumors present, we use a total of 210 patients from the HGG cohort and 75 patients from the LGG cohort. 195 patients are reserved for training in the HGG cohort, while 65 are used for training in the LGG experiments. For validation, we use 5 patients for each of the HGG and LGG cohorts. In order to test our trained models, we use 10 patients from the HGG cohort (due to the larger amount of data available), while we report results using 5 patients from the LGG cohort for testing.

V-B Preprocessing

Each patient scan is normalized by dividing each sequence by its mean intensity value. This ensures that the distribution of intensity values is preserved [5]. Normalization by mean is less sensitive to high or low intensity outliers than min-max normalization procedures, which can be heavily distorted by the presence of just a single high or low intensity voxel in a sequence. This is especially common in the presence of various pathologies, like tumors as in the BraTS2018 datasets, which tend to have very high intensity values in some sequences ($T_{2}$, $T_{2flair}$) and recessed intensities in others ($T_{1}$, $T_{1c}$). In practice, we observed this for the BraTS2018 HGG cohort, where some voxels had an unusually high intensity value due to a pathology. On performing min-max normalization to scale intensities to [0,1], we found that the presence of a very high intensity voxel squashed the remaining intensities to lie very close to zero. This artificially inflated the performance numbers for the generator: since most voxels lay close to zero, the generator could synthesize images with intensity values close to zero and easily achieve a low L1 error. On the other hand, mean normalization was relatively unaffected, because a large number of voxels lie in a defined range (0-4000) and the mean value was therefore not strongly affected by the presence of one or more high/low intensity voxels. We also tested the method internally with zero mean and unit variance based standardization, and found the results to be on par with mean normalization. In order to crop out the brain region from each sequence, we calculate the largest bounding box that can accommodate each brain in the whole dataset, and then use the coordinates to crop each sequence in every patient scan. The final cropped size of a single patient scan with all sequences contains 148 axial slices of size $194\times 155$.
Each slice in every sequence is resized to a spatial resolution of $256\times 256$ using bilinear interpolation, in order to maintain compatibility with the UNet architecture of the generator. We note that avoiding resampling twice (once during registration performed by the original authors of the dataset, and once during resampling to $256\times 256$ in this work) may preserve some information in the scans that may otherwise be lost. However, it is a necessary preprocessing step in order to maintain compatibility with the various network architectures that we utilize, which include inherent assumptions that the input size is a power of two to allow the successive contracting and expanding steps used in many encoder-decoder style architectures. In order to fully avoid the second resampling step (to $256\times 256$ in this work), a different network architecture without the encoder-decoder setup may be used, though the performance of such networks may not be on par with the modern encoder-decoder style networks established in the synthesis field [28, 31, 50].
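The outlier argument made for mean normalization earlier in this subsection can be checked on synthetic data. The sketch below uses hypothetical intensities drawn uniformly from the 0-4000 range plus a single artificially bright voxel; it is an illustration of the effect, not a reproduction of the BraTS2018 intensity statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
voxels = rng.uniform(0.0, 4000.0, size=10_000)   # hypothetical intensities
voxels[0] = 1e6                                  # one extreme outlier voxel

mean_norm = voxels / voxels.mean()
min_max = (voxels - voxels.min()) / (voxels.max() - voxels.min())

# With min-max scaling, every voxel except the outlier is squashed near zero.
print(float(np.mean(min_max < 0.01)))
# With mean normalization, the bulk of the intensities stays spread out.
print(float(np.median(mean_norm)), float(np.median(min_max)))
```

Under min-max scaling a generator that outputs values near zero would already achieve a small L1 error on such data, which is the failure mode described in the text.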

V-C Benchmark Methods

We compare our method with three competing methods, one unimodal and two multimodal. The unimodal (single-input, single-output, one-to-one) method we compare against is pGAN [28], while the multimodal (many-to-one) models are REPLICA [22] (in a multi-input setting) and that of Chartsias et al. [5], called MM-Synthesis hereafter. Both pGAN and MM-Synthesis were recently published (2019 and 2018), and they outperform all other methods before them (MM-Synthesis outperforms LSDN [27], Modality Propagation [14], and REPLICA [22], while pGAN outperforms both REPLICA and MM-Synthesis in one-to-one synthesis). To the best of our knowledge, no other methods claim to outperform either pGAN or MM-Synthesis, and so we decided to evaluate our method against them.

For comparison with pGAN [28], we reimplement the method using the open source code provided with the publication, and train both pGAN and our method on a randomly chosen subset of data from the BraTS2018 LGG cohort. We also compare with a standard baseline, which is a vanilla Pix2Pix [31] model trained and tested in a one-to-one setting. For our multimodal (many-to-one) experiments, we report mean squared error (MSE) results for both REPLICA and MM-Synthesis directly from [5], as we recreate the exact same testbed for comparison with MM-GAN as used in MM-Synthesis. We adopt the same testing strategy (5-fold cross validation), database (ISLES2015), and scenarios (7 scenarios where $T_{2flair}$ is always missing and is the only one that is synthesized). As highlighted in [5], the multi-input version of REPLICA required seven models, one for each of the seven valid scenarios in the many-to-one setting synthesizing the $T_{2flair}$ sequence. MM-Synthesis and our proposed MM-GAN only required a single multimodal (many-to-one) network which generalized to all seven scenarios. For our final extended set of experiments, we demonstrate the effectiveness of our method in a multi-input multi-output (many-to-many) setting, where we perform testing on the HGG and LGG cohorts of the BraTS2018 dataset, for which we report results of all 14 valid scenarios (=16 $-$ 2, as the scenarios where all sequences are missing or all are present are invalid for our experiments) instead of just 7. The results showcase our method’s generalizability on different use-cases with varying input and output subsets of sequences, and different difficulty levels. We use a fixed order of sequences ($T_{1}$, $T_{2}$, $T_{1c}$, $T_{2flair}$) throughout this paper, and represent each scenario as a 4-bit string, where a zero (0) represents the absence of a sequence at that location, while a one (1) represents its presence.
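The scenario encoding and the counts quoted above (14 valid scenarios in the many-to-many setting, 7 when the last sequence is always the missing target) can be reproduced with a short enumeration. The helper names below are illustrative only.

```python
from itertools import product

ORDER = ("T1", "T2", "T1c", "T2flair")   # fixed sequence order used in the paper

def valid_scenarios(n=4):
    """All n-bit strings except all-zeros (nothing present) and all-ones."""
    return ["".join(map(str, bits)) for bits in product((0, 1), repeat=n)
            if 0 < sum(bits) < n]

def describe(scenario, order=ORDER):
    """Split a bit string into (present, missing) sequence names."""
    present = [name for bit, name in zip(scenario, order) if bit == "1"]
    missing = [name for bit, name in zip(scenario, order) if bit == "0"]
    return present, missing

mimo = valid_scenarios()
miso = [s for s in mimo if s[-1] == "0"]   # last sequence is always the target

print(len(mimo), len(miso))   # 14 7
print(describe("1010"))       # (['T1', 'T1c'], ['T2', 'T2flair'])
```

Note that for ISLES2015 the third position holds DWI rather than $T_{1c}$; the counting argument is unchanged.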

V-D Training and Implementation Details

In order to optimize our networks, we use Adam [51] optimizer with learning rate $\eta=0.0002$ , $\beta_{1}=0.5$ and $\beta_{2}=0.999$ . Both the generator and discriminator networks are initialized with weights sampled from a Gaussian distribution with $\mu=0,\sigma=0.02$ .

We perform four experiments. The first establishes that multi-input synthesis is better than single-input synthesis (many-to-one vs one-to-one respectively). The second addresses $T_{2flair}$ synthesis using multiple inputs (many-to-one) (called MISO, short for multi-input single-output) using the ISLES2015 dataset to compare with REPLICA and MM-Synthesis. The third set of experiments encompasses validation of multiple key components proposed throughout this paper, in terms of their contribution towards overall network performance. We test different imputation strategies ($z=\{average,noise,zeros\}$), as well as the effect of curriculum learning (CL) and implicit conditioning (IC). These are included in the supplementary materials accompanying this manuscript, in Suppl. Mat. Section I. The final set of experiments pertains to multimodal synthesis (MIMO, short for multi-input multi-output), which sets a new benchmark for many-to-many synthesis models using the BraTS2018 HGG and LGG cohorts. We refer to the second and fourth experiments as MISO and MIMO, respectively, hereafter. We use a batch size of 4 slices to train models, except for MISO, where we use a batch size of 2. We train the models for 30 epochs in the MISO and 60 epochs in the MIMO sets of experiments, with no data augmentation.

We choose $\lambda=0.9$ for the generator loss given in equation 1, while we multiply the discriminator loss by 0.5 which essentially slows down the rate at which discriminator learns compared to generator. During each epoch, we alternate between a single gradient descent step on the generator, and one single step on the discriminator.

For our MIMO experiments, we use the original PatchGAN [31] discriminator. However for our MISO experiments, due to lack of training data, we used a smaller version of the PatchGAN discriminator with just two discriminator blocks, followed by a zero padding and final convolution layer. Also, random noise was added to both $X_{r}$ and $X_{i}$ inputs of the discriminator in MISO experiments. This was done to reduce the complexity of the discriminator to prevent it from overpowering the generator, which we observed when original PatchGAN with no noise imputation in its inputs was used for this set of experiments. The generator’s final activation was set to ReLU for MIMO and linear for MISO experiments due to the latter having negative intensity values for some patients.

For our implementation we used Python as our main programming language. We implemented the Pix2Pix architecture in PyTorch. The computing hardware consisted of an i7 CPU with 64 GB RAM and GTX1080Ti 12 GB VRAM GPU. Throughout our experiments we use fixed random seeds in order to ensure reproducibility of our experiments. For our MIMO experiments, we use curriculum learning by raising the difficulty of the scenarios shown to the network every 10 epochs (starting from one missing sequence) until epoch 30 (by which point examples with three missing sequences are shown), after which the scenarios are shown randomly with uniform probability until epoch 60. For MISO experiments we train the model without curriculum learning, and show all scenarios with uniform probability to the network for 30 epochs. MM-GAN takes an average time of 0.1536 $\pm$ 0.0070 seconds per patient, as it works in constant time at test-time with respect to the number of sequences missing.

V-E Evaluation Metrics

Evaluating the quality of synthesized images should ideally take into account both the quantitative aspect (per pixel synthesis error) as well as qualitative differences mimicking human perception. In order to cover this spectrum, we report results using three metrics, namely mean squared error (MSE), peak signal-to-noise ratio (PSNR) and structural similarity index metric (SSIM). The MSE is given as $\frac{1}{N}\sum_{i=1}^{N}(y_{i}-y_{i}^{\prime})^{2}$, where $y_{i}$ is the original sequence and $y_{i}^{\prime}$ is the synthesized version. MSE however depends heavily on the scale of intensities, hence for fair comparison, similar intensity normalization procedures were followed. In this work, we adopt the normalization procedure used in [5]. We report all results except in Section VI-B after normalizing both ground truth and synthesized image to the range [0, 1]. We do this in order to maintain consistency across the study, and allow all future methods to easily compare with our reported values regardless of the standardization/normalization procedure used in network training. We note that the generator successfully learns to synthesize images that lie in the same normalized range as the ground truth and input training images, and hence there is no need for re-normalization after synthesis. Re-normalization in our case was only applied before evaluation to ensure fair comparison for current and future works. For Section VI-B, in order to directly compare with the results reported in [5], we report results without re-normalizing. In order to still provide a normalization agnostic metric, we report PSNR, which takes into account both the MSE and the largest possible intensity value of the image, given as: $10\log_{10}\left(I_{max}^{2}/\text{MSE}\right)$, where $I_{max}$ is the maximum intensity value that the image supports, which depends on the datatype. We also report SSIM, which tries to capture the human perceived quality of images by comparing two images.
SSIM is given as: $\frac{(2\mu_{x}\mu_{y}+c_{1})(2\sigma_{xy}+c_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+c_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+c_{2})}$ , where $x,y$ are two images to be compared, and $\mu=$ mean intensity, $\sigma^{2}=$ variance of image, and $\sigma_{xy}=$ covariance of $x,y$ .
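The three metrics can be written directly from the formulas above. The following NumPy sketch makes several assumptions that are not stated in the text: images are already scaled to [0, 1] (so $I_{max}=1$), SSIM is evaluated once over the whole image rather than averaged over local windows as is common in practice, and the constants use the conventional choices $c_{1}=(0.01\,I_{max})^{2}$ and $c_{2}=(0.03\,I_{max})^{2}$.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between original and synthesized image."""
    return float(np.mean((y - y_hat) ** 2))

def psnr(y, y_hat, i_max=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(I_max^2 / MSE)."""
    err = mse(y, y_hat)
    return float("inf") if err == 0.0 else float(10.0 * np.log10(i_max ** 2 / err))

def ssim_global(x, y, i_max=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (simplification of this sketch)."""
    c1, c2 = (k1 * i_max) ** 2, (k2 * i_max) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

rng = np.random.default_rng(0)
truth = rng.random((32, 32))
synth = truth + 0.1                  # constant intensity offset as a toy error

print(round(mse(truth, synth), 6))   # 0.01
print(round(psnr(truth, synth), 3))  # 20.0
print(ssim_global(truth, synth) < ssim_global(truth, truth))   # True
```

A constant offset of 0.1 gives an MSE of 0.01 and hence a PSNR of 20 dB for $I_{max}=1$, which is a convenient sanity check for any implementation.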

VI Results and Discussion

In this section we present the results of our experiments validating our method and comparing it with competing unimodal (one-to-one) methods (Section VI-A) and multi-input single-output (many-to-one) methods (Section VI-B). Finally we present benchmark results in multi-input multi-output (MIMO) synthesis in Section VI-C.

VI-A Single-Input VS Multi-Input Synthesis

In order to understand the performance difference between using a single sequence versus using information from multiple sequences to synthesize a missing sequence, we set up an experiment evaluating the two approaches. Our hypothesis for this experiment is that multiple sequences provide complementary information about the imaged tissue, and hence should be objectively better than just using one sequence for the task of synthesizing a missing one. We set up an experiment to compare a multi-input single-output model with two single-input single-output models in two tasks, namely synthesizing missing $T_{1}$ and $T_{2}$ sequences respectively. For single-input single-output models, we set up a Pix2Pix [31] model as baseline, called P2P. We also compare with a state-of-the-art single-input single-output synthesis method called pGAN [28]. pGAN adopts its generator architecture from [50] and proposes a combination of L1 reconstruction loss and perceptual loss using VGG16 as the loss-network in a generative adversarial network framework. The discriminator for pGAN was adopted from [31]. We use the official pGAN implementation (https://github.com/icon-lab/pGAN-cGAN) for training and testing the pGAN (k=3) model. Finally, for our multi-input single-output model, we implement a multi-input single-output variant of our proposed MM-GAN called MI-GAN (multi-input GAN).

We call the baseline Pix2Pix models $\text{P2P}_{T_{1}}$ (synthesizing $T_{1}$ from $T_{2}$) and $\text{P2P}_{T_{2}}$ (synthesizing $T_{2}$ from $T_{1}$). Similarly, pGAN models are named $\text{pGAN}_{T_{1}}$ and $\text{pGAN}_{T_{2}}$. For our multi-input variants, the two variants of MI-GAN are named $\text{MI-GAN}_{T_{1}}$ and $\text{MI-GAN}_{T_{2}}$, which synthesize $T_{1}$ from ($T_{2}$, $T_{1c}$, $T_{2flair}$) and $T_{2}$ from ($T_{1}$, $T_{1c}$, $T_{2flair}$) respectively. For P2P and MI-GAN models, training was performed for 60 epochs, using the same set of network hyperparameters used throughout this paper. For pGAN, training was performed as outlined in the original paper [28] for 100 epochs, with k=3 slices as input to the network. All networks were trained on 70 patients from the LGG cohort of the BraTS2018 dataset, and tested on 5 patients. Although the input normalization differs between P2P/MI-GAN and pGAN, all metrics were calculated after normalizing both ground truth and synthesized image to the range [0, 1]. We also perform Wilcoxon signed-rank tests across all test patients and report p-values wherever the performance difference is statistically significant ($p<$ 0.05).

Table I presents MSE, PSNR and SSIM results for all three model architectures and their variants. We observe that both variants of MI-GAN outperform both P2P and pGAN models in all metrics.

Comparing $\text{MI-GAN}_{T_{1}}$ and the baseline $\text{P2P}_{T_{1}}$, $\text{MI-GAN}_{T_{1}}$ performed better by 61.48% in terms of MSE ($p<$ 0.05), 20.29% in PSNR ($p<$ 0.05) and 4.64% in SSIM ($p<$ 0.05). $\text{MI-GAN}_{T_{1}}$ also outperformed the state-of-the-art single-input single-output model $\text{pGAN}_{T_{1}}$ in all metrics, with improvements of 51.40% in MSE ($p<$ 0.01), 11.48% in PSNR ($p<$ 0.05) and 3.15% in SSIM ($p<$ 0.05).

Similarly for $T_{2}$ synthesis, $\text{MI-GAN}_{T_{2}}$ outperforms both $\text{P2P}_{T_{2}}$ and $\text{pGAN}_{T_{2}}$. With respect to $\text{P2P}_{T_{2}}$, $\text{MI-GAN}_{T_{2}}$ performs better by 2% in MSE, 4.24% in PSNR and 1.64% in terms of SSIM. Compared to $\text{pGAN}_{T_{2}}$, $\text{MI-GAN}_{T_{2}}$ shows an improvement of 2% in MSE, 2.64% in PSNR and 0.77% in SSIM.

These improvements of MI-GAN over P2P and pGAN models can be attributed to the availability of multiple sequences as input, which the network utilizes to synthesize missing sequences. The qualitative results showing axial slices from a test patient are provided in Suppl. Mat. Figure S1, in which the red arrow points to the successful synthesis of tumor regions in the case of MI-GAN. This was possible because the three available input sequences carry tumor specific information about the various tumor sub-regions (edema, enhancing and necrotic core), which is not available in its entirety to the single-input single-output methods. We also notice that MI-GAN performs consistently for a single patient, without showing significant deviation from the ground truth intensity distributions in both $T_{1}$ and $T_{2}$.

Superior quantitative and qualitative results showing MI-GAN outperforming P2P and pGAN reinforce the hypothesis that using multiple input sequences for the synthesis of a missing sequence is objectively better than using just one input sequence. Moreover, using multi-input methods reduces the number of required models by a factor of $C-1$: a multi-input single-output (many-to-one) approach requires only $C=4$ models, compared to $C(C-1)=12$ for a single-input single-output (one-to-one) approach, where $C$ is the number of sequences. A multi-input multi-output (many-to-many) model, which we explore in this work, improves this further by requiring just a single model to perform all possible synthesis tasks for a given $C$, leading to substantial computational savings during training.
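The model counts quoted above follow from simple counting, which the short sketch below makes explicit. The function name and dictionary keys are illustrative.

```python
def models_required(num_sequences):
    """Number of trained models needed to cover every synthesis task."""
    c = num_sequences
    return {
        "one_to_one": c * (c - 1),   # one model per ordered (source, target) pair
        "many_to_one": c,            # one model per target sequence
        "many_to_many": 1,           # a single model handles every scenario
    }

print(models_required(4))
# {'one_to_one': 12, 'many_to_one': 4, 'many_to_many': 1}
```

The gap between the one-to-one and many-to-many settings widens quadratically with the number of sequences.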

VI-B $T_{2flair}$ Synthesis (MISO)

In this second set of experiments we train our MM-GAN model to synthesize the $T_{2flair}$ sequence in the presence of a varied number of input sequences (one, two or three). In contrast to the MI-GAN models, this model is trained to generalize over a number of different scenarios depending on the available input sequences. In this case, the number of valid scenarios is 7. We perform validation on the ISLES2015 dataset in order to directly compare with REPLICA [22] and MM-Synthesis [5]. The quantitative results are given in Table II. We note that the proposed MM-GAN (0.226 $\pm$ 0.046) clearly outperforms REPLICA’s unimodal synthesis models (0.271 $\pm$ 0.10) in all scenarios, as well as MM-Synthesis (0.236 $\pm$ 0.08) in the majority (4/7) of scenarios. Our method also demonstrates an overall lower MSE standard deviation throughout testing (ranging between [0.03, 0.07], compared to REPLICA [0.08, 0.16] and MM-Synthesis [0.02, 0.13]) in all scenarios but one ($T_{2}$ missing). The qualitative results for ISLES2015 are shown in Figure 2. Compared to MM-Synthesis (from qualitative results shown in their original paper [5]), our results are visibly sharper, with fewer blurring artifacts. MM-GAN also preserves high frequency details of the synthesized sequence, while MM-Synthesis and REPLICA seem to miss most of these details. We request the readers to refer to Figures 5 and 6 of the original MM-Synthesis manuscript [5] for comparison with our proposed MM-GAN’s qualitative results given in Figure 2 of the current manuscript. Qualitatively from Figure 2, MM-GAN follows the intensity distribution of the real $T_{2flair}$ sequence in its synthesized version of $T_{2flair}$.

We found that using CL based learning did not help in MISO experiments, as the presence of more sequences does not necessarily increase the amount of information available. For example, the presence of both $T_{1}$ and $T_{2}$ does not result in better $T_{2flair}$ synthesis (MSE 0.2541) compared to the presence of DWI alone (MSE 0.2109). This is because, for every missing sequence, there tend to be some “highly informative” sequences that, if absent, reduce the synthesis performance by a larger margin. On the other hand, the presence of these highly informative sequences can dramatically boost performance, even in cases where no other sequence is present. Due to this, the assumption that leveraging a higher number of present sequences implies an easier case (i.e. more accurate synthesis) does not hold, and thus it becomes problematic to rank scenarios based on how easy they are, which renders CL ineffective in this case. Globally (for all valid scenarios, presented in the next subsection), however, this assumption tends to hold due to the complex nature of interactions between sequences. CL helps tremendously in achieving stable training of the network in the subsequent experiments (MIMO). For MISO, every scenario was shown to the network with uniform probability throughout training.

VI-C Multimodal Synthesis (MIMO)

We present results for our experiments on BraTS2018’s HGG and LGG cohorts in Tables III and IV. We set $z=$ zeros for imputation, and train the networks with implicit conditioning (IC) and curriculum learning (CL). In this experiment we train our proposed MM-GAN model on all 14 valid scenarios, in order to synthesize any missing sequence from any number or combination of available sequences. We observe that the proposed MM-GAN model performs consistently well when synthesizing just one sequence, with high overall SSIM ($>$ 0.90 in most cases except one in LGG) and PSNR ($>$ 22.0) values, and low MSE ($<$ 0.015). As more sequences start missing, the task of synthesizing missing sequences gets harder. During the initial epochs of training, MM-GAN tends to learn the general global structure of the brain, without considering the local level details. This seems to be enough for the generator to fool the discriminator initially. However, as the training progresses and the discriminator becomes stronger, the generator is forced to learn the local features of the slice, which include small details, especially the boundaries between the grey and white matter visible in the sequence. The qualitative results shown in Figure 3 show how MM-GAN effectively synthesizes the missing sequence in various scenarios, while preserving high frequency details that delineate between grey and white matter of the brain, as well as recreating the tumor region in the frontal lobe by combining information from available sequences. The synthesis of the tumor in the final images depends heavily on the available sequences. For example, the contrast sequence $T_{1c}$ provides clear delineation of the enhancing ring-like region around the necrotic mass, which is an important indicator of the size of the tumor. Presence of $T_{1}$ and/or $T_{2flair}$ sequence leads to improved synthesis of edema features.
The contrast sequence $T_{1c}$ provides unique information about the enhancing region around the tumor, which is usually not visible in any other sequence. Qualitatively, the $T_{2}$ sequence does not seem to directly aid in synthesizing a particular region of tumor well, but coupled with other available sequences, it helps in better synthesis of tumor mass in the final synthesized slice (Figure 3).

As shown in Figure 4, we also observe that the method fills in lost details, as can be seen in the $T_{2flair}$ sequence. The original ground truth sequence has the frontal lobe part cut off, probably due to patient movement or misregistration. However MM-GAN recreates that part by using information from the available sequences. Another interesting side-effect of our approach is visible in $T_{2flair}$ synthesis, where the synthesized versions of $T_{2flair}$ exhibit higher quality details (Figure 4) than the original sequence, which was acquired at a very low resolution. This effect is the consequence of the method using high-resolution input sequences (all sequences except $T_{2flair}$ are acquired at higher resolution) to synthesize the missing $T_{2flair}$ sequence. This also suggests that our method may be used for improving or upscaling the resolution of available sequences. However we do not investigate this further here, and leave it as future work. We found that sequences normalized by their mean value are easier to train with, and naturally support final layer activations like ReLU.

We observe that the generators perform really well when they are constrained using a non-linear activation function at the final layer. However in the case of MISO, the limitation of the normalization type (dividing by mean value of sequence) used in MM-Synthesis prevents us from using a non-linear activation at the end of generator. This is due to the fact that some patients’ data in SISS cohort from ISLES2015 contain negative intensity values, which after normalization stay negative. It can be seen that the MSE values reported in Table II tend to be higher than the ones reported in Tables III and IV, due to the latter set of experiments using ReLU activation at the end of generator.

Although MM-GAN observes different scenarios, and hence different fake sequences, in each iteration, which may affect stability during training, we did not observe any unstable behaviour during the training process. The use of implicit conditioning (IC) helped ensure stable training of the networks by making the task challenging for the discriminator, preventing it from overpowering the generator, which in turn led to the generator converging to a good local minimum.

We also observe that the proposed method shows graceful degradation as the number of missing sequences increases, which is apparent both qualitatively and quantitatively in Figure 3 and Table V. For instance, in the HGG experiments, compared to having one sequence missing, the performance of MM-GAN drops on average by 27.1%, 2.7% and 0.7% in MSE, PSNR and SSIM respectively for scenarios where two sequences are missing. For scenarios where three sequences are missing, the performance drops on average by 39.1%, 7.2% and 2.2% in terms of MSE, PSNR and SSIM respectively compared to one sequence missing, and by 29.3%, 4.6% and 1.5% compared to scenarios where two sequences are missing. We observe that the method holds up well in generating sequences with high fidelity in terms of PSNR and SSIM even in harder scenarios where multiple sequences are missing.
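As a rough illustration of how such percentage drops can be computed, the sketch below uses a plain relative-change formula. Both the formula and the metric values are assumptions made for illustration: the numbers are hypothetical and not taken from Table V, and the exact averaging used for the reported figures may differ. Note that a performance drop corresponds to an increase in MSE but a decrease in PSNR and SSIM.

```python
def relative_change_percent(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Hypothetical metric values (NOT from the paper's tables).
mse_one_missing, mse_two_missing = 0.010, 0.0125
psnr_one_missing, psnr_two_missing = 30.0, 29.1

# For MSE, lower is better, so a performance drop is a positive change.
mse_drop = relative_change_percent(mse_one_missing, mse_two_missing)

# For PSNR (and SSIM), higher is better, so the drop is the negated change.
psnr_drop = -relative_change_percent(psnr_one_missing, psnr_two_missing)

print(f"MSE degradation:  {mse_drop:.1f}%")
print(f"PSNR degradation: {psnr_drop:.1f}%")
```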

Qualitatively, we also investigated which sequences are the most valuable for synthesizing each of the four sequences in the BraTS2018 HGG cohort. For every sequence that is synthesized, we rank the remaining sequences based on our inspection of the results. For synthesizing $T_{2flair}$, we found that the $T_{1c}$ sequence was the most important, followed by the $T_{2}$ and $T_{1}$ sequences. This is also apparent in Figure 3, where the removal of $T_{1c}$ in column (c) led to the synthesized sequence missing the necrotic part of the tumor completely, while the removal of $T_{1}$ (columns (b) and (f)) and $T_{2}$ (columns (b) and (d)) did not affect the performance dramatically. For the synthesis of $T_{1c}$, we found that the $T_{2}$ sequence held the highest significance, followed by $T_{1}$ and $T_{2flair}$ (comparing column (b) with columns (d), (f) and (c)). This is also evident from row 3, column (d) in Figure 3, where the removal of the $T_{2}$ sequence led to increased blurring artifacts in the synthesized version of $T_{1c}$, which were not as pronounced when $T_{1}$ or $T_{2flair}$ were removed. For $T_{2}$ synthesis, we found that the $T_{1c}$ sequence contributed the most towards accurate synthesis (comparing columns (b) and (d)), with the $T_{1}$ sequence also playing an important role (columns (b) and (f)), followed lastly by $T_{2flair}$. Finally, for $T_{1}$ synthesis, we found that $T_{1c}$ was the most important sequence for accurate synthesis (columns (b) and (d)), followed closely by $T_{2}$ (columns (b) and (f)) and $T_{2flair}$ (columns (b) and (c)).

Because MM-GAN is a single unified model, it relieves the end-user from the difficult task of choosing the right model for synthesis at inference time. For instance, in the case where sequences ($T_{1}$, $T_{2}$, $T_{1c}$) are present and $T_{2flair}$ is to be synthesized, a single-input single-output method would have three networks capable of synthesizing $T_{2flair}$, from $T_{1}$, $T_{2}$ and $T_{1c}$ respectively. The decision as to which network should be chosen is hard, since each unimodal network involves trade-offs in terms of synthesis quality, especially in tumorous areas where individual sequences do not provide full information. This decision problem is mitigated in multi-input models (MM-Synthesis and MI-GAN), but there still exists the computational overhead of training a separate model for each output sequence (4 in total for both MM-Synthesis and MI-GAN). MM-GAN, on the other hand, is completely multimodal and requires training just one model, which provides computational savings during training by eliminating the need to train multiple models (if the number of sequences is $C=4$, then 12 models in the unimodal case and 4 models in the case of multi-input single-output architectures).
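The model counts quoted above follow from simple counting, sketched here for a general number of sequences $C$. The function names are hypothetical. The scenario count additionally assumes that at least one sequence is missing and at least one is available.

```python
def unimodal_model_count(c):
    """One model per ordered (input, output) pair of distinct sequences."""
    return c * (c - 1)


def multi_input_single_output_model_count(c):
    """One model per output sequence."""
    return c


def unified_model_count(c):
    """A single multi-input multi-output model covers every case."""
    return 1


def missing_scenario_count(c):
    """Subsets of missing sequences, excluding 'none missing' and 'all missing'."""
    return 2 ** c - 2


C = 4
print(unimodal_model_count(C))                   # 12
print(multi_input_single_output_model_count(C))  # 4
print(unified_model_count(C))                    # 1
print(missing_scenario_count(C))                 # 14
```

The gap widens quickly with more sequences: the unimodal count grows quadratically in $C$, while the unified approach always trains one model.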

VII Conclusion

We propose a multi-modal generative adversarial network (MM-GAN) capable of synthesizing missing MR pulse sequences using combined information from the available sequences. Most approaches so far in this domain have been either unimodal or partially multi-modal (multi-input, single-output). We present a truly multi-modal method that is multi-input and multi-output, generalizing to any combination of available and missing sequences. The synthesis process runs in a single forward pass of the network regardless of the number of sequences missing, so the run time is constant with respect to the number of missing sequences.
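The single-forward-pass interface can be sketched as follows. This is a toy illustration, not the trained network: `toy_generator` is a hypothetical stand-in that simply averages its input channels, and zero-filling of missing channels and keeping the real data for available sequences are assumptions made here for illustration. Images are represented as flat lists of intensities to keep the example dependency-free.

```python
SEQUENCES = ("T1", "T2", "T1c", "T2flair")  # fixed channel order


def build_input(available):
    """Stack all C channels; missing ones are zero-filled (assumption)."""
    n = len(next(iter(available.values())))
    return [list(available[s]) if s in available else [0.0] * n
            for s in SEQUENCES]


def toy_generator(channels):
    """Hypothetical stand-in for a trained generator: C channels in, C out."""
    n = len(channels[0])
    mean = [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]
    return [list(mean) for _ in channels]


def synthesize_missing(available):
    """One forward pass, whatever the number of missing sequences."""
    outputs = toy_generator(build_input(available))
    # Keep real data where it exists; use synthesized channels elsewhere.
    return {s: list(available[s]) if s in available else outputs[i]
            for i, s in enumerate(SEQUENCES)}


# Usage: two sequences available, two missing.
result = synthesize_missing({"T1": [1.0, 2.0], "T2": [3.0, 4.0]})
print(sorted(result))
print(result["T1c"])
```

Because the input and output always contain all $C$ channels, the same call handles any combination of missing sequences without selecting a scenario-specific model.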

The first variant of our proposed MM-GAN, called MI-GAN, outperformed the unimodal version pGAN in all three metrics (Table I). We also show that MM-GAN outperforms the best multimodal synthesis method REPLICA [22], as well as MM-Synthesis [5], in multi-input single-output synthesis of the $T_{2flair}$ sequence (Table II), and produces objectively sharper and more accurate results. In another set of experiments, we train our method on the BraTS2018 dataset to set up a new benchmark in terms of MSE, PSNR and SSIM (Tables III and IV), and show qualitative results for the same (Figure 3). MM-GAN performance degrades as the number of missing sequences increases (Table V), but it remains robust, maintaining high PSNR and SSIM values even in harder scenarios. Finally, we show that our method is capable of filling in details missing from the original ground truth sequences, and is also capable of improving the quality of the synthesized sequences (Figure 4).

Although our approach qualitatively and quantitatively performs better than all other competing methods, we note that it has problems in synthesizing the enhancing subregion in the $T_{1c}$ sequence properly. This, however, is expected since the $T_{1c}$ sequence contains highly specific information about the enhancing region of the tumor that is not present in any other sequence. An inherent limitation of all synthesis methods stems from the fact that MR sequences provide both redundant and unique information. This creates challenges for all synthesis methods, unimodal and multimodal alike. Unimodal methods provide a one-to-one mapping between sequences, but each such model (12 in total for 4 sequences) involves trade-offs in synthesis accuracy. For instance, in the experiments given in Section VI-B, we found that, in terms of overall sequence synthesis performance, synthesizing $T_{2flair}$ from DWI tends to be more accurate (MSE 0.2109) than synthesizing it from $T_{1}$ (MSE 0.2813) or $T_{2}$ (MSE 0.2799). This reinforces the fact that each sequence has some inherent characteristics, which can only be faithfully synthesized if another sequence that captures similar characteristics is present. The sequences provide complementary visual information for a human reader, though there are underlying correlations imperceptible to the naked eye, since they all originate from common underlying physics and from the same subject. Multi-input methods like ours can exploit the correlations between available sequences, and synthesize the missing sequence by leveraging information from all input sequences. This is evident from the quantitative results in Tables II, III and IV, and is summarized in Table V, where more available sequences allow better synthesis of missing ones.

For future work, we note that the inherent design of our method is 2D, and an extension of the work that takes either 2.5D or 3D images into account may perform better both quantitatively and qualitatively. Another area of investigation would be to explore the up-scaling capabilities of MM-GAN, where, given a low-quality ground truth scan with missing scan areas, the method can generate a higher quality version with the missing details filled in. It would also be interesting to test MM-GAN by deploying it as part of a pipeline for downstream analysis, for example segmentation. This natural placement in the pipeline would allow the downstream methods to become robust to missing pulse sequences. Compared to HeMIS [3] and PIMMS [4], this can be another approach to make segmentation algorithms robust to missing pulse sequences.

VIII Acknowledgements

The authors would like to thank NVIDIA Corporation for donating a Titan X GPU. This research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca). We thank the anonymous reviewers for their insightful feedback that resulted in a much improved paper.

Bibliography

[1] E. F. Jackson, L. E. Ginsberg, D. F. Schomer, and N. E. Leeds, “A review of MRI pulse sequences and techniques in neuroimaging,” Surgical Neurology, vol. 47, no. 2, pp. 185–199, Feb 1997.
[2] C. Bowles et al., “Pseudo-healthy image synthesis for white matter lesion segmentation,” in Simulation and Synthesis in Medical Imaging. Springer International Publishing, Oct 2016, pp. 87–96.
[3] M. Havaei, N. Guizard, N. Chapados, and Y. Bengio, “HeMIS: Hetero-Modal Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. Springer International Publishing, 2016, pp. 469–477.
[4] T. Varsavsky et al., “PIMMS: Permutation Invariant Multi-modal Segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer International Publishing, 2018, pp. 201–209.
[5] A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris, “Multimodal MR Synthesis via Modality-Invariant Latent Representation,” IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 803–814, Mar 2018.
[6] H. Chen et al., “VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images,” NeuroImage, vol. 170, pp. 446–455, 2018.
[7] S. Bakas et al., “2017 International MICCAI BraTS Challenge,” 2017.
[8] F. Isensee et al., “Brain Tumor Segmentation and Radiomics Survival Prediction: Contribution to the BRATS 2017 Challenge,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Springer International Publishing, 2018, pp. 287–297.
