A Distance Map Regularized CNN for Cardiac Cine MR Image Segmentation
Shusil Dangi, Cristian Linte, and Ziv Yaniv

TL;DR
This paper introduces a multi-task learning regularization framework for cardiac MRI segmentation that improves accuracy and generalization by incorporating pixel-wise distance map regression into CNNs.
Contribution
The authors propose a novel distance map regularizer added to CNNs for cardiac MRI segmentation, enhancing performance without increasing model complexity.
Findings
Improved segmentation accuracy, with average Dice coefficients of 0.84 and 0.91 on two publicly available cardiac cine MRI datasets.
Enhanced cross-dataset generalization, with up to 42% improvement in myocardium Dice coefficient.
The regularizer improves the robustness of CNNs for cardiac MRI segmentation.
| Network | Train time, ACDC (min/epoch) | Train time, LVSC (min/epoch) | Test time, ACDC (ms/volume) | Test time, LVSC (ms/volume) | #Parameters (), Train | #Parameters (), Test |
|---|---|---|---|---|---|---|
| SegNet | 2.49 | 14.91 | 70 | 67 | 2.96 | 2.96 |
| USegNet | 2.41 | 14.49 | 70 | 67 | 3.75 | 3.75 |
| UNet | 2.65 | 15.50 | 72 | 68 | 4.10 | 4.10 |
| DMR-SegNet | 4.44 | 20.57 | 70(157) | 63(94) | 3.56 | 2.96 |
| DMR-USegNet | 4.84 | 19.03 | 73(158) | 65(96) | 4.35 | 3.75 |
| DMR-UNet | 4.85 | 21.16 | 75(160) | 67(97) | 4.70 | 4.10 |
| Metric | ED: SN | ED: DMR SN | ED: USN | ED: DMR USN | ED: UNet | ED: DMR UNet | ES: SN | ES: DMR SN | ES: USN | ES: DMR USN | ES: UNet | ES: DMR UNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dice (%) | 91.1 | 91.7∗∗ | 91.5 | 92.0∗∗ | 91.6 | 92.2∗∗ | 87.3 | 88.0∗ | 87.7 | 88.7∗∗ | 87.2 | 88.8∗ |
| Jaccard (%) | 84.0 | 85.1∗∗ | 84.7 | 85.5∗∗ | 85.0 | 85.9∗∗ | 78.1 | 79.3∗ | 78.7 | 80.3∗∗ | 78.3 | 80.4∗ |
| MSD (mm) | 0.55 | 0.53∗ | 0.58 | 0.52∗ | 0.54 | 0.53∗ | 0.92 | 0.85 | 0.92 | 0.84 | 1.08 | 0.83 |
| HD (mm) | 10.26 | 9.87 | 10.26 | 9.67 | 10.03 | 9.52 | 11.33 | 10.31∗ | 11.66 | 10.91 | 12.61 | 10.96∗ |
| Index | Corr: SN | Corr: DMR SN | Corr: USN | Corr: DMR USN | Corr: UNet | Corr: DMR UNet | Bias+LOA: SN | Bias+LOA: DMR SN | Bias+LOA: USN | Bias+LOA: DMR USN | Bias+LOA: UNet | Bias+LOA: DMR UNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LV EF | 0.939 | 0.947 | 0.944 | 0.970 | 0.962 | 0.963 | 1.00 (13.15) | 0.31 (12.44) | 0.58 (12.57) | -0.42 (9.24) | 0.31 (10.41) | 0.40 (10.40) |
| RV EF | 0.874 | 0.871 | 0.866 | 0.895 | 0.856 | 0.870 | 1.04 (17.40) | 1.77 (17.34) | 0.85 (17.40) | 0.38 (15.42) | 0.09 (18.94) | 0.29 (18.30) |
| Myo Mass | 0.948 | 0.970 | 0.958 | 0.973 | 0.933 | 0.978 | 3.10 (32.94) | -0.43 (25.17) | 0.35 (29.65) | 0.21 (23.89) | 2.85 (37.39) | 0.80 (21.75) |
| Method | ED LV Dice | ED LV HD | ED RV Dice | ED RV HD | ED Myo Dice | ED Myo HD | ED Myo Corr | ES LV Dice | ES LV HD | ES RV Dice | ES RV HD | ES Myo Dice | ES Myo HD | ES Myo Corr | EF LV Corr | EF RV Corr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baumgartner[43] | 0.96 | 6.53 | 0.93 | 12.67 | 0.89 | 8.70 | 0.982 | 0.91 | 9.17 | 0.88 | 14.69 | 0.90 | 10.64 | 0.983 | 0.988 | 0.851 |
| Khened[44] | 0.96 | 8.13 | 0.94 | 13.99 | 0.89 | 9.84 | 0.990 | 0.92 | 8.97 | 0.88 | 13.93 | 0.90 | 12.58 | 0.979 | 0.989 | 0.858 |
| Isensee[45] | 0.97 | 7.38 | 0.95 | 10.12 | 0.90 | 8.72 | 0.989 | 0.93 | 6.91 | 0.90 | 12.14 | 0.92 | 8.67 | 0.985 | 0.991 | 0.901 |
| DMR-UNet | 0.96 | 6.05 | 0.94 | 9.52 | 0.89 | 7.92 | 0.989 | 0.92 | 8.16 | 0.88 | 13.05 | 0.91 | 8.39 | 0.987 | 0.989 | 0.851 |
| Metric | ED: SN | ED: DMR SN | ED: USN | ED: DMR USN | ED: UNet | ED: DMR UNet | ES: SN | ES: DMR SN | ES: USN | ES: DMR USN | ES: UNet | ES: DMR UNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dice (%) | 82.2 | 83.0∗ | 82.5 | 83.2∗∗ | 83.1 | 83.6 | 83.5 | 84.2 | 83.8 | 84.3∗ | 84.3 | 84.6 |
| Jaccard (%) | 70.0 | 71.1∗ | 70.4 | 71.5∗∗ | 71.3 | 72.0 | 71.9 | 72.9 | 72.4 | 73.0∗ | 73.0 | 73.5 |
| MSD (mm) | 0.78 | 0.74 | 0.79 | 0.72∗ | 0.74 | 0.70 | 0.81 | 0.77 | 0.77 | 0.78 | 0.74 | 0.75 |
| HD (mm) | 13.20 | 13.14 | 13.67 | 13.12 | 12.98 | 12.80 | 12.96 | 12.96 | 12.71∗ | 13.71 | 13.08 | 12.51 |
| Mass (Corr) | 0.908 | 0.937 | 0.923 | 0.938 | 0.917 | 0.936 | 0.921 | 0.935 | 0.929 | 0.926 | 0.939 | 0.922 |
| Mass(gram) (Bias+LOA) | 2.56 (35.25) | -0.92 (29.52) | 3.91 (32.48) | 3.34 (29.08) | 2.88 (33.75) | 0.06 (29.92) | 5.48 (32.49) | 1.96 (29.58) | 5.04 (30.92) | 5.49 (31.49) | 5.18 (28.79) | 2.56 (32.18) |
| Method | SA/FA | Jaccard | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|
| AU [47] | SA | 0.84 (0.17) | 0.89 (0.13) | 0.96 (0.06) | 0.91 (0.13) | 0.95 (0.06) |
| CNR [46] | SA | 0.77 (0.11) | 0.88 (0.09) | 0.95 (0.04) | 0.86 (0.11) | 0.96 (0.02) |
| FCN [9] | FA | 0.74 (0.13) | 0.83 (0.12) | 0.96 (0.03) | 0.86 (0.10) | 0.95 (0.03) |
| DFCN [44] | FA | 0.74 (0.15) | 0.84 (0.16) | 0.96 (0.03) | 0.87 (0.10) | 0.95 (0.03) |
| DMR-UNet | FA | 0.74 (0.16) | 0.85 (0.16) | 0.95 (0.03) | 0.86 (0.10) | 0.95 (0.03) |
| AO [48] | SA | 0.74 (0.16) | 0.88 (0.15) | 0.91 (0.06) | 0.82 (0.12) | 0.94 (0.06) |
| SCR [49] | FA | 0.69 (0.23) | 0.74 (0.23) | 0.96 (0.05) | 0.87 (0.16) | 0.89 (0.09) |
| INR [50] | FA | 0.43 (0.10) | 0.89 (0.17) | 0.56 (0.15) | 0.50 (0.10) | 0.93 (0.09) |
| Metric | ED: SN | ED: DMR SN | ED: USN | ED: DMR USN | ED: UNet | ED: DMR UNet | ES: SN | ES: DMR SN | ES: USN | ES: DMR USN | ES: UNet | ES: DMR UNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dice (%) | 70.4 | 73.3∗∗ | 68.3 | 76.6∗∗ | 72.3 | 76.7∗∗ | 68.0 | 71.9∗∗ | 65.5 | 74.9∗∗ | 69.7 | 76.4∗∗ |
| Jaccard (%) | 55.6 | 58.9∗∗ | 53.6 | 62.9∗∗ | 58.0 | 63.1∗∗ | 53.3 | 58.1∗∗ | 50.8 | 61.5∗∗ | 55.5 | 63.1∗∗ |
| MSD (mm) | 2.68 | 2.07∗∗ | 3.33 | 1.80∗∗ | 2.46 | 1.80∗∗ | 3.56 | 2.93∗∗ | 4.19 | 2.58∗∗ | 3.49 | 2.35∗∗ |
| HD (mm) | 25.01 | 22.44∗∗ | 26.93 | 20.33∗∗ | 24.61 | 20.16∗∗ | 25.96 | 22.62∗∗ | 27.37 | 21.67∗∗ | 25.68 | 20.98∗∗ |
| Metric | ED: SN | ED: DMR SN | ED: USN | ED: DMR USN | ED: UNet | ED: DMR UNet | ES: SN | ES: DMR SN | ES: USN | ES: DMR USN | ES: UNet | ES: DMR UNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dice (%) | 69.5 | 78.4∗∗ | 62.5 | 80.1∗∗ | 62.1 | 80.2∗∗ | 57.7 | 77.6∗∗ | 51.9 | 79.3∗∗ | 50.3 | 79.1∗∗ |
| Jaccard (%) | 56.5 | 66.3∗∗ | 49.3 | 68.2∗∗ | 49.3 | 68.5∗∗ | 45.4 | 65.3∗∗ | 40.1 | 67.3∗∗ | 38.8 | 67.1∗∗ |
| MSD (mm) | 4.92 | 1.77∗∗ | 6.75 | 1.30∗∗ | 6.29 | 1.59∗∗ | 9.59 | 2.53∗∗ | 13.27 | 2.35∗∗ | 10.97 | 2.52∗∗ |
| HD (mm) | 26.04 | 17.06∗∗ | 29.08 | 13.93∗∗ | 29.50 | 14.16∗∗ | 35.13 | 19.25∗∗ | 39.60 | 18.77∗∗ | 37.44 | 19.58∗∗ |
Shusil Dangi, Cristian A. Linte, and Ziv Yaniv
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Shusil Dangi is with the Center for Imaging Science, Rochester Institute of Technology, Rochester, NY USA. E-mail: [email protected]
Cristian A. Linte is with the Biomedical Engineering and Center for Imaging Science, Rochester Institute of Technology, Rochester NY USA. E-mail: [email protected]
Ziv Yaniv is with the National Institute of Allergy and Infectious Diseases, Bethesda MD USA and MSC LLC., Rockville MD USA. E-mail: [email protected]
Abstract
Cardiac image segmentation is a critical process for generating personalized models of the heart and for quantifying cardiac performance parameters. Several convolutional neural network (CNN) architectures have been proposed to segment the heart chambers from cardiac cine MR images. Here we propose a multi-task learning (MTL)-based regularization framework for cardiac MR image segmentation. The network is trained to perform the main task of semantic segmentation, along with a simultaneous, auxiliary task of pixel-wise distance map regression. The proposed distance map regularizer is a decoder network added to the bottleneck layer of an existing CNN architecture, facilitating the network to learn robust global features. The regularizer block is removed after training, so that the original number of network parameters does not change. We show that the proposed regularization method improves both binary and multi-class segmentation performance over the corresponding state-of-the-art CNN architectures on two publicly available cardiac cine MRI datasets, obtaining average Dice coefficients of 0.84±0.03 and 0.91±0.04, respectively. Furthermore, we also demonstrate improved generalization performance of the distance map regularized network on cross-dataset segmentation, showing as much as 42% improvement in myocardium Dice coefficient from 0.56±0.28 to 0.80±0.14.
Index Terms:
Magnetic resonance imaging (MRI), Heart Segmentation, Convolutional Neural network, Regularization
I Introduction
Magnetic Resonance Imaging (MRI) is the standard-of-care imaging modality for non-invasive cardiac diagnosis, due to its high contrast sensitivity to soft tissue, good image quality, and lack of exposure to ionizing radiation. Cine cardiac MRI enables the acquisition of high resolution two-dimensional (2D) anatomical images of the heart throughout the cardiac cycle, capturing the full cardiac dynamics via multiple 2D + time short axis acquisitions spanning the whole heart.
Segmentation of the heart structures from these images enables measurement of important cardiac diagnostic indices such as myocardial mass and thickness, left/right ventricle (LV/RV) volumes and ejection fraction. Furthermore, high-quality personalized heart models can be generated for cardiac morphology assessment, treatment planning, and precise localization of pathologies during an image-guided intervention. Manual delineation is the standard cardiac image segmentation approach, but it is time consuming and susceptible to high inter- and intra-observer variability. Hence, there is a critical need for semi-/fully-automatic methods for cardiac cine MRI segmentation. However, MR imaging artifacts such as bias fields, respiratory motion, intensity inhomogeneity, and fuzziness render the segmentation of heart structures challenging. Fig. 1 shows a reference segmentation and the results of our automatic segmentation method.
A comprehensive review of cardiac MR segmentation techniques can be found in [1, 2]. These techniques can be classified based on the amount of prior knowledge used during segmentation. First, the no-prior based methods rely solely on the image content to segment the heart structures based on intensity thresholds, and edge- and/or region-information. Hence these methods are often ineffective for the segmentation of ill-defined boundary regions. Second, the deformable models such as active contours and level-set methods incorporate weak-prior information regarding the smoothness of the segmented boundaries; similarly, graph theoretical models assume connectivity between the neighboring pixels providing piece-wise smooth segmentation results. Third, the Active shape and appearance models and Atlas-based methods impose very strong-prior information regarding the geometry of the heart structures and sometimes are too restricted by the training set. These weak-/strong-prior based methods may overcome segmentation challenges in ill-defined boundary regions but, nevertheless, at a high computational cost. Lastly, Machine Learning based methods aim to predict the probability of each pixel in the image belonging to the foreground/background class based on either patch-wise or image-wise training. These methods are able to produce fast and accurate segmentation, provided the training set captures the population variability.
In the context of deep learning, Long et al. [3] proposed the first fully convolutional network (FCN) for semantic image segmentation, exploiting the capability of Convolutional Neural Networks (CNNs) [4, 5, 6] to learn task-specific hierarchical features in an end-to-end manner. However, their initial adoption in the medical domain was challenging, due to the limited availability of medical imaging data and associated costly manual annotation. These challenges were later circumvented by patch-based training, data augmentation, and transfer learning techniques [7, 8].
Specifically, in the context of cardiac image segmentation, Tran [9] adapted an FCN architecture for segmentation of various cardiac structures from short-axis MR images. Similarly, Poudel et al. [10] proposed a recurrent FCN architecture to leverage inter-slice spatial dependencies between the 2D cine MR slices. Avendi et al. [11] reported improved accuracy and robustness of the LV segmentation by using the output of an FCN to initialize a deformable model. Further, Oktay et al. [12] pre-trained an auto-encoder network on ground-truth segmentations and imposed anatomical constraints on a CNN by adding an $\ell_2$-loss between the auto-encoder representations of the output and of the corresponding ground-truth segmentation. Several modifications to the FCN architecture and various post-processing schemes have been proposed to improve the semantic segmentation results, as summarized in [13].
To improve the generalization performance of neural networks, various regularization techniques have been proposed. These include parameter norm penalty (e.g. weight decay [14]), noise injection [15], dropout [16], batch normalization [17], adversarial training [18], and multi-task learning (MTL) [19]. In this paper we focus on MTL-based network regularization. When a network is trained on multiple related tasks, the inductive bias provided by the auxiliary tasks causes the model to prefer a hypothesis that explains more than one task. This helps the network ignore task-specific noise and hence focus on learning features relevant to multiple tasks, improving the generalization performance [19]. Furthermore, MTL reduces the Rademacher complexity [20] of the model (i.e. its ability to fit random noise), hence reducing the risk of overfitting. An overview of MTL applied to deep neural networks can be found in [21].
MTL has been widely employed in computer vision problems due to the similarity between various tasks being performed. A FCN architecture with a common encoder and task specific decoders was proposed in [22] to perform joint classification, detection, and semantic segmentation, targeting real-time applications such as autonomous driving. A similar single-encoder-multiple-decoder architecture described in [23] performs semantic segmentation, depth regression, and instance segmentation, simultaneously. The architecture was further expanded by [24] to automatically learn the weights for each task based on its uncertainty, obtaining state-of-the-art results.
In the context of medical image analysis, Moeskops et al. [25] demonstrated the use of MTL for joint segmentation of six tissue types from brain MRI, the pectoral muscle from breast MRI, and the coronary arteries from cardiac Computed Tomography Angiography (CTA) images, with performance equivalent to networks trained on individual tasks. Similarly, Valindria et al. [26] employed a MTL framework to improve the performance for multi-organ segmentation from CT and MR images, exploring various encoder-decoder network architectures. Specific to the cardiac MR applications, Xue et al. [27] proposed a network capable of learning multi-task relationship in a Bayesian framework to estimate various local/global LV indices for full quantification of the LV. Similarly, Dangi et al. [28] performed joint segmentation and quantification of the LV myocardium using the learned task uncertainties to weigh the losses, improving upon the state-of-the-art results. Most of these MTL methods in medical image analysis aim to perform various clinically relevant tasks simultaneously. However, the focus of this work is on improving the segmentation performance of various FCN architectures using MTL as a network regularizer.
We propose to use the rich information available in the distance map of the segmentation mask as an auxiliary task for the image segmentation network. Since each pixel in the distance map represents its distance from the closest object boundary, this representation is redundant and robust compared to the per-pixel image label used for semantic segmentation. Furthermore, the distance map represents the shape and boundary information of the object to be segmented. Hence, training the segmentation network on the additional task of predicting the distance map is equivalent to enforcing shape and boundary constraints for the segmentation task; hence the name distance map regularized convolutional neural network.
Work related to ours includes [29], which takes an image and its semantic segmentation as input and predicts the distance transform of the object instances, such that thresholding the distance map yields the instance segmentation. Similarly, [30] represents the boundary of the object instances using a truncated distance map, which is used to refine the instance segmentation result. However, unlike these methods, our goal is not to perform instance segmentation, but to refine the semantic segmentation result using the distance map as an auxiliary task. The most closely related work to ours is presented in [31] for segmentation of building footprints from satellite images using an MTL framework. In their study, the truncated distance map is predicted at the end of the decoder network and is further used to refine the boundary of the predicted segmentation, resulting in increased model complexity. Unlike that work, we impose a global shape constraint at the bottleneck layer of FCN architectures, using MTL as a network regularizer without increasing the model complexity. The proposed model is customized towards cardiac MR image segmentation, as we account for slices containing no foreground pixels (in apical and basal regions). Furthermore, we demonstrate better generalization performance of the proposed network with improved cross-dataset segmentation results.
*Contributions:* In this work, we propose to impose shape and boundary constraints in a CNN framework to accurately segment the heart chambers from cardiac cine MR images. We impose soft constraints by including distance map prediction as an auxiliary task in an MTL framework. We extensively evaluate our proposed model on two publicly available cardiac cine MRI datasets. We demonstrate that the addition of a distance map regularization block improves the segmentation performance of three FCN architectures, without increasing the model complexity and inference time. We employ a task uncertainty-based weighting scheme to automatically learn the weights for the segmentation and distance map regression tasks during training, and show that this method improves segmentation performance over the fixed equal-weighting scheme. Additionally, we show that the proposed regularization technique improves the segmentation performance in the challenging apical and basal slices, as well as across several different pathological heart conditions. This improvement is also reflected in the computed clinical indices important for cardiac health diagnosis. Finally, we demonstrate better generalization ability using the proposed regularization technique with significantly improved cross-dataset segmentation performance, without tuning the network to a new data distribution.
II Methods and Materials
II-A CNN for Semantic Image Segmentation
Let $x = \{x_i \in \mathbb{R},\ i \in S\}$ be the input intensity image and $y = \{y_i \in L,\ i \in S\}$ be the corresponding image segmentation, with $L = \{0, 1, \ldots, C-1\}$ representing a set of $C$ class labels, and $S$ representing the image domain. The task of a CNN-based segmentation model, with weights $W$, is to learn a discriminative function $f^W(\cdot)$ that models the underlying conditional probability distribution $P(y \mid x)$. The output of a CNN model is passed through a softmax function to produce a probability distribution over the class labels, such that the function can be learned by maximizing the likelihood:
$$P(y = c \mid x, W) = \mathrm{softmax}\big(f^W(x)\big)_c = \frac{\exp\big(f^W_c(x)\big)}{\sum_{c'=0}^{C-1} \exp\big(f^W_{c'}(x)\big)} \tag{1}$$
where $f^W_c(x)$ represents the $c$'th element of the vector $f^W(x)$. In practice, the negative log-likelihood $-\log P(y = c \mid x, W)$ is minimized to learn the optimal CNN model weights, $W$. This is equivalent to minimizing the cross-entropy loss of the ground-truth segmentation, $y$, with respect to the softmax of the network output, $f^W(x)$.
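The softmax likelihood and cross-entropy loss described above can be illustrated with a minimal sketch in plain Python. This is not the authors' PyTorch implementation; the function names `softmax` and `cross_entropy` and the toy scores are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Convert a vector of per-class scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pixel_logits, pixel_labels):
    """Mean negative log-likelihood of the reference labels over all pixels."""
    losses = []
    for logits, label in zip(pixel_logits, pixel_labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Toy example: two pixels, four classes (hypothetical network outputs).
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
labels = [0, 2]
loss = cross_entropy(logits, labels)
```

Minimizing `loss` with respect to the network weights corresponds to maximizing the likelihood of the reference labels; the loss grows when the reference label receives a low softmax probability.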
A typical FCN architecture (Fig. 2) for image segmentation consists of an encoder and a decoder network. The encoder network includes multiple pooling (max/average pooling) layers applied after several convolution and non-linear activation layers (e.g. Rectified linear unit (ReLU) [32]). It encodes hierarchical features important for the image segmentation task. To obtain per-pixel image segmentation, the global features obtained at the bottleneck layer need to be up-sampled to the original image resolution using the decoder network. The up-sampling filters can either be fixed (e.g. nearest-neighbor or bilinear upsampling), or can be learned during the training (deconvolutional layer). The final output of a decoder network is passed to a softmax classifier to obtain a per-pixel classification.
In a SegNet [33] (Fig. 2a) architecture, the decoder produces sparse feature maps by up-sampling its inputs using the pooling indices transferred from its encoder. These sparse feature maps are then convolved with a trainable filter bank to obtain dense feature maps, and are finally passed through a softmax classifier to produce per-pixel image segmentation. Since the decoder in the SegNet architecture uses only the global features obtained at the bottleneck layer of the encoder, the high frequency details in the segmentation are lost during the up-sampling process.
The U-Net architecture [34] (Fig. 2b) introduced skip connections, by concatenating output of encoder layers at different resolutions to the input of the decoder layers at corresponding resolutions, hence preserving the high frequency details important for accurate image segmentation. Furthermore, the skip connections are known to ease the network optimization [35] by introducing multiple paths for backpropagation of the gradients, hence, mitigating the vanishing/exploding gradient problem. Similarly, skip connections also allow the network to learn lower level details in the outer layers and focus on learning the residual global features in the deeper encoder layers. Hence, the U-Net architecture is able to produce excellent segmentation results using limited training data with augmentation, and has been extensively used in medical image segmentation.
We observed that learned deconvolution filters in the original U-Net architecture can be replaced by a SegNet-like decoder to form a hybrid architecture with reduced network parameters. We refer to this modified architecture as U-SegNet (Fig 2e) throughout this paper, and use it as one of the baseline FCN architectures.
II-B Distance Map Regularization Network
The distance map of a binary segmentation mask can be obtained by computing the Euclidean distance of each pixel from the nearest boundary pixel [36]. This representation provides rich, redundant, and robust information about the boundary, shape, and location of the object to be segmented. For a binary segmentation mask, where $\Omega$ is the set of foreground pixels, $\partial\Omega$ represents the boundary pixels, and $d(p, q)$ is the Euclidean distance between any two pixels $p$ and $q$, the truncated signed distance map, $D(p)$, is computed as:
$$D(p) = \begin{cases} \min\big(d_{\min}(p),\ R\big) & \text{if } p \in \Omega \\ \max\big(-d_{\min}(p),\ -R\big) & \text{otherwise} \end{cases} \tag{2}$$
where,
$$d_{\min}(p) = \min_{q \in \partial\Omega} d(p, q)$$
is the minimum distance of pixel $p$ from the boundary pixels $\partial\Omega$. We truncate the signed distance map at a predefined distance threshold, $R$, hence assigning this maximum negative distance to the slices not containing any foreground pixels (i.e. $\Omega = \emptyset$), indicating all pixels in the slice are far from the foreground (typically in the apical/basal regions of cardiac cine MR images).
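The truncated signed distance map described above can be sketched with a brute-force computation on a small 2D mask. This is an illustrative sketch only: the function name `truncated_signed_distance_map`, the 4-connected definition of boundary pixels, and the convention of positive values inside the object are assumptions, and a real pipeline would rely on an efficient distance transform instead of nested loops.

```python
import math

def truncated_signed_distance_map(mask, threshold):
    """Brute-force truncated signed distance map of a binary 2D mask.

    Assumptions of this sketch: boundary pixels are foreground pixels with at
    least one 4-connected background neighbour, and distances are positive
    inside the object and negative outside.
    """
    rows, cols = len(mask), len(mask[0])

    def is_boundary(r, c):
        if not mask[r][c]:
            return False
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and not mask[rr][cc]:
                return True
        return False

    boundary = [(r, c) for r in range(rows) for c in range(cols)
                if is_boundary(r, c)]
    dist_map = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if boundary:
                d = min(math.hypot(r - br, c - bc) for br, bc in boundary)
            else:
                d = float("inf")  # no boundary, e.g. a slice without foreground
            if mask[r][c]:
                dist_map[r][c] = min(d, threshold)
            else:
                dist_map[r][c] = max(-d, -threshold)
    return dist_map

# Toy example: a 3x3 foreground square centred in a 5x5 slice.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
dmap = truncated_signed_distance_map(mask, threshold=10.0)
```

For a slice with no foreground pixels, every entry of the returned map equals `-threshold`, which matches the handling of empty apical and basal slices described in the text.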
The distance map regularization network is a SegNet-like decoder network, up-sampling the feature maps obtained at the bottleneck layer of the encoder to the size of the input image, with the number of output channels equal to the number of foreground classes (i.e. $C-1$). For example, for a four-class segmentation problem ($C = 4$): background, RV blood-pool, LV myocardium, and LV blood-pool, the regularization network has three output channels, predicting the truncated signed distance maps (Eq. 2) computed from the binary masks of the foreground classes: RV blood-pool, LV myocardium, and LV blood-pool.
Fig. 3 shows the regularization network added to the bottleneck layer of existing FCN architectures. Network training loss is the weighted sum of the cross-entropy loss for segmentation and the mean absolute difference (MAD) loss between the predicted and the reference distance maps. Since our goal is to perform semantic segmentation we do not need the distance map prediction at inference time. Therefore, we remove the regularization block after training, such that, the original FCN architecture remains unchanged. Additionally, we found that the quality (mean absolute difference) of the predicted distance maps is insufficient for improving the predicted segmentations from the standard path (see Fig. S2 in supplement).
II-C MTL using Uncertainty-based Loss Weighting
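We model the likelihood for a segmentation task as the squashed and scaled version of the model output through a softmax function:
$$P\big(y \mid f^W(x), \sigma\big) = \mathrm{softmax}\left(\frac{1}{\sigma^2} f^W(x)\right) \tag{3}$$
where $\sigma$ is a positive scalar, equivalent to the temperature, for the defined Gibbs/Boltzmann distribution. The magnitude of $\sigma$ determines how uniform the discrete distribution is, and hence relates to the uncertainty of the prediction measured in entropy. The log-likelihood for the segmentation task can be written as:
$$\begin{aligned}
\log P\big(y = c \mid f^W(x), \sigma\big) &= \frac{1}{\sigma^2} f^W_c(x) - \log \sum_{c'} \exp\left(\frac{1}{\sigma^2} f^W_{c'}(x)\right) \\
&= \frac{1}{\sigma^2} \log \mathrm{softmax}\big(f^W(x)\big)_c - \log \frac{\sum_{c'} \exp\left(\frac{1}{\sigma^2} f^W_{c'}(x)\right)}{\left(\sum_{c'} \exp\big(f^W_{c'}(x)\big)\right)^{1/\sigma^2}} \\
&\approx \frac{1}{\sigma^2} \log \mathrm{softmax}\big(f^W(x)\big)_c - \log \sigma
\end{aligned} \tag{4}$$
where $f^W_c(x)$ is the $c$'th element of the vector $f^W(x)$. In the last step, a simplifying assumption $\frac{1}{\sigma} \sum_{c'} \exp\left(\frac{1}{\sigma^2} f^W_{c'}(x)\right) \approx \left(\sum_{c'} \exp\big(f^W_{c'}(x)\big)\right)^{1/\sigma^2}$, which becomes an equality when $\sigma \to 1$, has been made, resulting in a simple optimization objective with improved empirical results [24].
Similarly, for the regression task, we define our likelihood as a Laplacian distribution with its mean and scale parameter given by the neural network output:
$$P\big(y \mid f^W(x), \sigma\big) = \frac{1}{2\sigma} \exp\left(-\frac{|y - f^W(x)|}{\sigma}\right) \tag{5}$$
The log-likelihood for the regression task can be written as:
$$\log P\big(y \mid f^W(x), \sigma\big) \propto -\frac{1}{\sigma} |y - f^W(x)| - \log \sigma \tag{6}$$
where $\sigma$ is the neural network's observation noise parameter, capturing the noise in the output. A constant term has been removed for simplicity, as it does not affect the optimization.
For a network with two outputs, a continuous output $y_1$ modeled with a Laplacian likelihood and a discrete output $y_2$ modeled with a softmax likelihood, the joint loss $\mathcal{L}(W, \sigma_1, \sigma_2)$ is:
$$\begin{aligned}
\mathcal{L}(W, \sigma_1, \sigma_2) &= -\log P\big(y_1, y_2 = c \mid f^W(x)\big) \\
&= -\log \Big[ P\big(y_1 \mid f^W(x), \sigma_1\big) \cdot P\big(y_2 = c \mid f^W(x), \sigma_2\big) \Big] \\
&\approx \frac{1}{\sigma_1} \mathcal{L}_1(W) + \frac{1}{\sigma_2^2} \mathcal{L}_2(W) + \log \sigma_1 + \log \sigma_2
\end{aligned} \tag{7}$$
where $\mathcal{L}_1(W) = |y_1 - f^W(x)|$ is the MAD loss of $y_1$ and $\mathcal{L}_2(W) = -\log \mathrm{softmax}\big(f^W(x)\big)_c$ is the cross-entropy loss of $y_2$. To arrive at Eq. 7, the two tasks are assumed independent. During the training, the joint likelihood loss $\mathcal{L}(W, \sigma_1, \sigma_2)$ is optimized with respect to $W$, as well as $\sigma_1$, $\sigma_2$.
From Eq. 7, we can observe that the losses for individual tasks are weighted by the inverse of their corresponding uncertainties ($\sigma_1$, $\sigma_2^2$) learned during the training. Hence, the task with higher uncertainty will be weighted less and vice versa. Furthermore, the uncertainties cannot grow too large due to the penalty imposed by the last two terms in Eq. 7. In practice, the network is trained to predict the log variance, $s := \log \sigma^2$, for numerical stability and to avoid any division by zero, such that the positive scale parameter, $\sigma^2$, can be computed via the exponential mapping $\exp(s)$.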
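The uncertainty-based weighting can be sketched as a small function, assuming the joint loss has the form "inverse-uncertainty weight times each task loss, plus log-uncertainty penalties" described above and in [24]. The function name `joint_loss` and the log-parameterization (`s1` as the log of the Laplacian scale, `s2` as the log of the squared softmax temperature) are illustrative assumptions, not the authors' code.

```python
import math

def joint_loss(mad_loss, ce_loss, s1, s2):
    """Uncertainty-weighted sum of a regression (MAD) and a segmentation (CE) loss.

    s1 and s2 are learnable log-uncertainties (a parameterization assumed for
    this sketch); exponentiating them keeps both weights strictly positive.
    """
    weight_reg = math.exp(-s1)  # inverse of the Laplacian scale
    weight_seg = math.exp(-s2)  # inverse of the squared softmax temperature
    penalty = s1 + 0.5 * s2     # log-uncertainty terms that stop the scales growing
    return weight_reg * mad_loss + weight_seg * ce_loss + penalty

# With zero log-uncertainties both tasks are weighted equally.
equal_weight_loss = joint_loss(mad_loss=2.0, ce_loss=1.0, s1=0.0, s2=0.0)
```

Under this parameterization, for fixed task losses the objective is minimized with respect to `s1` when the Laplacian scale equals the MAD loss, so a task with a larger residual is automatically down-weighted.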
II-D Clinical Datasets
II-D1 Left Ventricle Segmentation Challenge (LVSC)
This study employed 97 de-identified cardiac MRI image datasets from patients suffering from myocardial infarction and impaired LV contraction, available as a part of the STACOM 2011 Cardiac Atlas Segmentation Challenge project [37, 38] database (http://www.cardiacatlas.org/challenges/lv-segmentation-challenge/). Cine-MRI images in short-axis and long-axis views are available for each case. The images were acquired using the Steady-State Free Precession (SSFP) MR imaging protocol with the following settings: typical thickness , gap , TR , TE , flip angle , FOV , spatial resolution to and image matrix using multiple scanners from various manufacturers. Corresponding reference myocardium segmentations generated from expert-analyzed 3D surface finite element models are available for all 97 cases throughout the cardiac cycle.
II-D2 Automated Cardiac Diagnosis Challenge (ACDC)
This dataset (https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html) is composed of short-axis cardiac cine-MR images acquired for 100 patients divided into 5 evenly distributed subgroups: normal, myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, and abnormal right ventricle, available as a part of the STACOM 2017 ACDC challenge [39]. The acquisitions were obtained over a 6-year period using two MRI scanners of different magnetic strengths (1.5T and 3.0T). The images were acquired using the SSFP sequence with the following settings: thickness (sometimes ), interslice gap , spatial resolution to , to frames per cardiac cycle. Corresponding manual segmentations for RV blood-pool, LV myocardium, and LV blood-pool, performed by a clinical expert for the end-systole (ES) and end-diastole (ED) phases, are provided.
II-E Data Preprocessing and Augmentation
SimpleITK [40] was used to resample short-axis images to a common resolution of 1.5625 and crop/zero-pad to a common size of and for LVSC and ACDC dataset, respectively. Image intensities were clipped at the 99th percentile and normalized to zero mean and unit standard deviation. Each dataset was divided into train, validation, and test sets with five non-overlapping folds for cross-validation. The train-validation-test split was performed randomly over the whole LVSC dataset, whereas it was performed per subgroup (stratified sampling) for the ACDC dataset to maintain an even distribution of subgroups over the training, validation, and testing sets. The training images were subjected to a random similarity transform with: isotropic scaling of to , rotation of to , and translation of to of the image size along both x- and y-axes. The training sets for the LVSC and ACDC datasets included the original images along with two and four randomly transformed versions of each image, respectively. We heavily augment the ACDC dataset, as the labels are available only for the ES and ED phases, whereas we lightly augment the LVSC dataset, as the labels are available throughout the cardiac cycle.
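The intensity preprocessing step (clipping at the 99th percentile, then normalizing to zero mean and unit standard deviation) can be sketched as follows. The function name `normalize_intensities` and the nearest-rank percentile definition are assumptions for illustration; the text does not state which percentile estimator was used.

```python
import math

def normalize_intensities(values, percentile=99.0):
    """Clip at the given percentile, then scale to zero mean and unit std."""
    ordered = sorted(values)
    # Nearest-rank percentile (an assumption; libraries often interpolate).
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    cap = ordered[rank - 1]
    clipped = [min(v, cap) for v in values]
    mean = sum(clipped) / len(clipped)
    var = sum((v - mean) ** 2 for v in clipped) / len(clipped)
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0 for _ in clipped]  # constant image: nothing to scale
    return [(v - mean) / std for v in clipped]

# Toy example: a flattened image with one very bright outlier pixel.
pixels = [float(v) for v in range(100)] + [10000.0]
normalized = normalize_intensities(pixels)
```

Clipping before normalization keeps a few very bright pixels from dominating the mean and standard deviation, so the bulk of the tissue intensities ends up on a comparable scale across scans.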
II-F Network Training and Testing Details
Networks implemented in PyTorch (https://github.com/pytorch/pytorch) were initialized with the Kaiming uniform initializer [41] and trained for 30 and 100 epochs for the LVSC and ACDC datasets, respectively, with a batch size of 15 images. We used the RMSprop optimizer [42] with learning rates of 0.0001 and 0.0005 for single- and multi-task networks, respectively, decayed by a factor of 0.99 every epoch. We saved the model with the best average Dice coefficient on the validation set and evaluated it on the test set.
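To make the per-epoch decay concrete, the learning rate after a given number of epochs can be computed as below. This is an illustrative sketch with a hypothetical helper name, `learning_rate`; in PyTorch the same schedule would typically be obtained with `torch.optim.lr_scheduler.ExponentialLR` and `gamma=0.99`, although the text does not say which mechanism was used.

```python
def learning_rate(epoch, base_lr, decay=0.99):
    """Learning rate after `epoch` completed epochs of exponential decay."""
    return base_lr * decay ** epoch

# Multi-task network on ACDC: starts at 0.0005 and is trained for 100 epochs.
final_lr = learning_rate(100, 0.0005)
```

With these settings the learning rate at the end of 100 epochs is a little over a third of its initial value, so the decay is gentle.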
Networks were trained on an NVIDIA Titan Xp GPU. The distance map threshold was selected empirically and set to a large value of , i.e. full distance map. The weights of the cross-entropy and MAD losses were initialized equally to 1.0, and the optimal weighting was then learned automatically. The auxiliary task of distance map regression was removed after the network training. The obtained 2D slice segmentations were rearranged into a 3D volume, and the largest connected component for each heart chamber was retained to yield the final segmentation. Model complexity and average timing requirements for training and testing the models are shown in Table I.
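The largest-connected-component step can be sketched as follows for a single heart chamber. This is a minimal pure-Python illustration under stated assumptions: the mask is a nested list indexed as `mask[z][y][x]`, 6-connectivity is used (the text does not specify the connectivity), and the function name `largest_component` is hypothetical.

```python
from collections import deque

def largest_component(mask):
    """Keep only the largest 6-connected component of a 3D binary mask."""
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, best = set(), []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                # Breadth-first flood fill from an unvisited foreground voxel.
                component, queue = [], deque([(z, y, x)])
                seen.add((z, y, x))
                while queue:
                    cz, cy, cx = queue.popleft()
                    component.append((cz, cy, cx))
                    for dz, dy, dx in steps:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = (0 <= nz < depth and 0 <= ny < height
                                  and 0 <= nx < width)
                        if inside and mask[nz][ny][nx] and (nz, ny, nx) not in seen:
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                if len(component) > len(best):
                    best = component
    keep = set(best)
    return [[[1 if (z, y, x) in keep else 0 for x in range(width)]
             for y in range(height)] for z in range(depth)]

# Two slices of one row each: the isolated voxel at x=4 is discarded.
cleaned = largest_component([[[1, 1, 0, 0, 1]], [[1, 0, 0, 0, 0]]])
```

For a multi-class label map, the same function would be applied once per chamber label to the corresponding binary mask.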
II-G Evaluation Metrics
We use overlap and surface distance measures to evaluate the segmentation. Additionally, we evaluate the clinical indices associated with the segmentation.
II-G1 Dice and Jaccard Coefficients
Given two binary segmentation masks, A and B, the Dice and Jaccard coefficients are defined as:
$$\text{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad \text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where $|\cdot|$ gives the cardinality (i.e. the number of non-zero elements) of each set. The maximum and minimum values (1.0 and 0.0, respectively) of the Dice and Jaccard coefficients occur when there is 100% and 0% overlap between the two binary segmentation masks, respectively.
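The two overlap measures follow directly from the standard set definitions. The sketch below represents each mask as a set of foreground pixel coordinates; the function names and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice(a, b):
    """Dice coefficient of two masks given as collections of foreground pixels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard coefficient (intersection over union) of two masks."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same assumed convention as above
    return len(a & b) / len(a | b)

# Two 2x2 squares that share one column of two pixels.
mask_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
mask_b = {(0, 1), (1, 1), (0, 2), (1, 2)}
overlap = (dice(mask_a, mask_b), jaccard(mask_a, mask_b))
```

The two measures are monotonically related through J = D / (2 - D), so they always rank a pair of segmentations in the same order.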
II-G2 Mean Surface Distance and Hausdorff Distance
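Let $S_A$ and $S_B$ be surfaces (with $N_A$ and $N_B$ points, respectively) corresponding to two binary segmentation masks, A and B, respectively. The mean surface distance (MSD) is defined as:
$$\text{MSD} = \frac{1}{N_A + N_B} \left( \sum_{p \in S_A} d(p, S_B) + \sum_{q \in S_B} d(q, S_A) \right)$$
Similarly, the Hausdorff distance (HD) is defined as:
$$\text{HD} = \max \left( \max_{p \in S_A} d(p, S_B), \; \max_{q \in S_B} d(q, S_A) \right)$$
where
$$d(p, S) = \min_{q \in S} \lVert p - q \rVert_2$$
is the minimum Euclidean distance of point $p$ from the points $q \in S$. Hence, MSD computes the mean distance between the two surfaces, whereas HD computes the largest distance between the two surfaces and is sensitive to outliers.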
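A brute-force sketch of the two surface measures is given below. It assumes the symmetric forms, with MSD normalized by the total number of surface points and HD taken as the larger of the two directed maxima; the function names are hypothetical, and surfaces are plain lists of coordinate tuples in physical units.

```python
import math

def point_to_surface(p, surface):
    """Minimum Euclidean distance from point p to any point of the surface."""
    return min(math.dist(p, q) for q in surface)

def mean_surface_distance(sa, sb):
    """Symmetric mean surface distance between two point sets."""
    total = (sum(point_to_surface(p, sb) for p in sa)
             + sum(point_to_surface(q, sa) for q in sb))
    return total / (len(sa) + len(sb))

def hausdorff_distance(sa, sb):
    """Symmetric Hausdorff distance between two point sets."""
    return max(max(point_to_surface(p, sb) for p in sa),
               max(point_to_surface(q, sa) for q in sb))

surface_a = [(0.0, 0.0), (1.0, 0.0)]
surface_b = [(0.0, 0.0), (4.0, 0.0)]
msd = mean_surface_distance(surface_a, surface_b)
hd = hausdorff_distance(surface_a, surface_b)
```

In this toy case the single distant point at (4, 0) sets the Hausdorff distance on its own, while it contributes only one of four terms to the mean, which illustrates the sensitivity of HD to outliers. The quadratic cost of the brute-force search is acceptable for a sketch; a spatial index or a distance transform would be used on full volumes.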
II-G3 Ejection Fraction and Myocardial Mass
Ejection Fraction (EF) is an important cardiac parameter quantifying the cardiac output. EF is defined as:
$$\text{EF} = \frac{\text{EDV} - \text{ESV}}{\text{EDV}} \times 100\%$$
where, EDV is the end-diastolic volume, and ESV is the end-systolic volume. Similarly, the myocardial mass can be computed from the myocardial volume as:
$$\text{Myocardial mass} = \text{Myocardial volume} \times \text{myocardial tissue density}$$
The correlation coefficients for the EF and myocardial mass computed from the ground-truth versus those computed from the automatic segmentation are reported. A correlation coefficient of $+1$ ($-1$) represents a perfect positive (negative) linear relationship, whereas a coefficient of $0$ represents no linear relationship between the two variables.
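The clinical indices and their correlation can be sketched as below. The function names are hypothetical, EF is expressed in percent, and the tissue density is passed in as a parameter because its numerical value is not recoverable from the text here; the value 1.05 used in the example is purely illustrative.

```python
import math

def ejection_fraction(edv, esv):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv - esv) / edv

def myocardial_mass(volume_cm3, density_g_per_cm3):
    """Myocardial mass in grams; the density value is supplied by the caller."""
    return volume_cm3 * density_g_per_cm3

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ef = ejection_fraction(edv=120.0, esv=50.0)   # volumes in mL
mass = myocardial_mass(100.0, 1.05)           # 1.05 g/cm^3 is illustrative only
```

In an evaluation, `pearson` would be applied to the per-patient indices derived from the reference and from the automatic segmentations.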
II-G4 Limits of Agreement
To compare the clinical indices computed from the ground-truth versus those obtained from the automatic segmentation, we take the difference between each pair of observations. The mean of these differences is termed the bias, and the 95% confidence interval, mean $\pm$ 1.96 $\times$ standard deviation (assuming a Gaussian distribution), is termed the limits of agreement (LoA).
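The bias and limits of agreement can be computed as in the following sketch. The sign convention (automatic minus reference) and the use of the sample standard deviation are assumptions, and the function name is hypothetical.

```python
import statistics

def limits_of_agreement(reference, automatic):
    """Return the bias and the (lower, upper) 95% limits of agreement."""
    # Assumed sign convention: automatic measurement minus reference measurement.
    diffs = [a - r for r, a in zip(reference, automatic)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (assumed)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Ejection fractions (percent) for four hypothetical patients.
bias, (lower, upper) = limits_of_agreement(
    reference=[50.0, 60.0, 70.0, 55.0],
    automatic=[52.0, 59.0, 73.0, 56.0],
)
```

A bias close to zero with narrow limits indicates that the automatic indices can stand in for the reference ones.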
III Results
III-A Segmentation and Clinical Indices Evaluation
The proposed Distance Map Regularized (DMR) SegNet, USegNet, and UNet models along with the baseline models were trained for the joint segmentation of RV blood-pool, LV myocardium, and LV blood-pool from the ACDC challenge dataset. The provided reference segmentation and the corresponding automatic segmentation obtained from the DMR-UNet model for a test patient is shown in Fig. 1. Automatic segmentation obtained from all networks, for ED and ES phases, are evaluated against the reference segmentation and summarized in Table IIa; also shown is the evaluation of subsequently computed clinical indices in Table IIb.
We observe consistent improvement in the average segmentation performance of the models after the DM-Regularization. Specifically, there is statistically significant improvement (Wilcoxon signed-rank test) on several segmentation metrics for all evaluated models. The same results carry over to the clinical indices, with better correlation and LoA for both EF and myocardial mass. Furthermore, the DMR-UNet model outperforms the other evaluated networks in many segmentation metrics.
To further analyze the improvement in segmentation performance, we performed a regional analysis by sub-dividing the slices into apical (25% of slices in the apical region and beyond), basal (25% of slices in the basal region and beyond), and mid-region (remaining 50% mid slices), based on the reference segmentation. From Fig. 4(a), we can observe consistent improvement in segmentation performance at the problematic apical and basal slices [39]; however, due to the small size of these regions, the improvement does not have a large effect on the overall performance, though it is of significance when constructing patient-specific models of the heart for simulation purposes [51]. We postulate that the additional constraint imposed by a very high negative distance assigned to empty apical/basal slices prevents the network from over-segmenting these regions, hence improving the regional Dice overlap and effectively reducing the overall Hausdorff distance.
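One way to realize this regional split is sketched below. It is an interpretation rather than the authors' procedure: slices are assumed to be ordered from base to apex, the 25% share is rounded to a whole number of slices, and empty slices beyond either end are assigned to the adjacent region. The function name `label_regions` is hypothetical.

```python
def label_regions(has_heart):
    """Label each slice index as 'basal', 'mid', or 'apical'.

    has_heart[i] is True when the reference segmentation of slice i is
    non-empty; slices are assumed to be ordered from base to apex.
    """
    covered = [i for i, flag in enumerate(has_heart) if flag]
    first, last = covered[0], covered[-1]
    quarter = max(1, round(0.25 * (last - first + 1)))
    labels = []
    for i in range(len(has_heart)):
        if i < first + quarter:
            labels.append("basal")    # includes empty slices beyond the base
        elif i > last - quarter:
            labels.append("apical")   # includes empty slices beyond the apex
        else:
            labels.append("mid")
    return labels

# Ten slices, of which the middle eight contain the heart.
regions = label_regions([False] + [True] * 8 + [False])
```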
To study the effect of the distance map regularization across the five patient sub-groups, we plot the average Dice coefficient for each sub-group computed for all six models in Fig. 5. As expected, we observe the segmentation performance is better for the normal patients in comparison to the pathological cases. Furthermore, we observe consistent improvement in segmentation performance after the distance map regularization for all patient sub-groups.
We segmented the heart structures from the ACDC held-out test set of 50 patients and submitted the results to the challenge organizers. Majority voting over the predictions of an ensemble of DMR-UNet models trained for five-fold cross-validation, followed by a 3D connected component analysis, yielded the final segmentation. Table IV shows the comparison of our segmentation results against the top three methods submitted to the challenge. Baumgartner et al. [43] tested several architectures and found that a 2D U-Net with a cross-entropy loss performed the best. Khened et al. [44] used a 2D U-Net with dense blocks and an inception first layer to obtain the segmentation. Isensee et al. [45] ensembled 2D and 3D U-Net architectures trained with a Dice loss to obtain the best result in the challenge. Our 2D DMR-UNet model performs as well as or better than the other two 2D methods; however, the combination of 2D and 3D context yields a marginal improvement in the Dice overlap metric. Based on this observation, we believe an ensemble of 2D and 3D DMR-UNet models should be able to perform as well as or better than [45], although this is not the main objective of this work. Nonetheless, we can observe that the constraint imposed by the DM regularization is successful in reducing the errors in apical/basal regions, manifested in the improved Hausdorff distance.
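Per-pixel majority voting over the fold models can be sketched as follows. Label maps are nested lists of integer class labels, the function name is hypothetical, and ties are broken in favour of the smallest label, which is an assumption since the text does not describe tie handling.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse equally sized 2D label maps from several models by majority vote."""
    rows, cols = len(predictions[0]), len(predictions[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            votes = Counter(pred[r][c] for pred in predictions)
            top = max(votes.values())
            # Assumed tie-break: the smallest label among the most voted ones.
            fused_row.append(min(lab for lab, n in votes.items() if n == top))
        fused.append(fused_row)
    return fused

# Three hypothetical models voting on a 2x2 label map.
fused = majority_vote([
    [[0, 1], [2, 3]],
    [[0, 1], [2, 0]],
    [[0, 2], [3, 0]],
])
```

With five fold models and four classes, ties such as a 2-2-1 vote are possible, which is why an explicit tie-break rule is needed in practice.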
Table V shows the segmentation performance evaluated on the LVSC dataset, demonstrating superior performance of the DM regularized models over their baseline. Specifically, there is statistically significant improvement on the Dice and Jaccard metric for the ED phase. Furthermore, the correlation and LoA for the myocardial mass improves after network regularization. The improvement in performance is consistent across different heart regions as shown in Fig. 4(b).
We segmented the myocardium from the LVSC held-out validation set of 100 patients. Majority voting prediction from ensemble of DMR-UNet models trained for five-fold cross-validation followed by a 3D connected-component analysis yielded the final segmentation. Table VI shows our segmentation results (computed per slice) compared against several other semi-/fully-automatic algorithms. Reported segmentation results are computed against the consensus segmentation (CS∗) built from multiple challenge submissions [38]. Segmentation results for the four challenge participants — AU [47], AO [48], SCR [49], and INR [50], and the details on segmentation evaluation metrics can be found in the challenge summary report [38]. The AU method [47] used the interactive guide-point modeling technique to fit a finite element cardiac model to the CMR data and required expert approval of all slices and all frames. This segmentation was provided as the reference segmentation to the challenge participants. The CNN regression (CNR) method [46] regressed the endo- and epi-cardium contours in polar coordinates, while manually eliminating the problematic slices beyond the apex and base of the heart, hence, obtaining a good segmentation result. The mean (std dev) of Jaccard coefficients computed for our DMR-UNet model in apical, mid, and basal slices are 0.66 (0.18), 0.77 (0.12), and 0.74 (0.17), respectively. Our DMR-UNet model has similar performance to competing fully-automatic segmentation algorithms based on the fully convolutional network (FCN) [9] and the densely connected FCN (DFCN) [44] architectures. The DFCN method involves a computationally expensive region of interest (ROI) identification based on a Fourier transform applied across the cardiac cycle, followed by the circular Hough transform; whereas our method requires minimal pre-processing.
Lastly, the segmentation performance on the LVSC dataset (Table VI) is significantly lower than that on the ACDC dataset (Table IV), due to the larger variability and noise exhibited by the LVSC data compared to the ACDC data.
III-B Cross Dataset Evaluation (Transfer Learning)
To analyze the generalization ability of our proposed distance map regularized networks, we performed a cross-dataset segmentation evaluation. The networks trained on ACDC dataset for five-fold cross-validation were tested on the LVSC dataset, and vice versa; such that, the majority voting scheme produced the final per-pixel segmentation. We observe a significant boost in Dice coefficient of 5% to 12% for distance map regularized networks over their baseline models when trained on ACDC and tested on LVSC dataset (194 ED and ES volumes), as shown in Table VIIa. Similarly, the distance map regularized models significantly outperform the baseline models by 23% to 42% improvement in Dice coefficient, when trained on LVSC and tested on ACDC dataset (200 ED and ES volumes), as shown in Table VIIb. The improvement in generalization performance for the regularization networks trained on LVSC dataset is higher, likely due to the availability of large number of heterogeneous training examples. Similar improvement can be observed in the MSD and HD metric. We want to emphasize that our networks are trained separately on each dataset and are completely unaware of the new data distribution, unlike a typical domain adaptation [52] setting. Nonetheless, the distance map regularized networks are able to generalize better to a new dataset compared to the baseline models.
We further analyzed the feature maps across different layers of the baseline and distance map regularized networks (supplementary material Fig. S3). We can observe the baseline models preserve the intensity information and propagate it throughout the network, hence, they are more sensitive to the dataset-specific intensity distribution. On the other hand, the multi-task regularized networks focus more on the edges and other discriminative features, producing sparse feature maps, while ignoring dataset-specific intensity distribution. Moreover, from the feature maps at the decoding layers, we observe a clear delineation of several cardiac structures in the regularized network, while those for the baseline models are less discriminative, and contain information about all structures present in the image. Hence, we verify that multi-task learning-based distance map regularization helps the network learn generalizable features important for the segmentation task, demonstrated by their excellent transfer learning capabilities (see Supplementary Materials for details on feature visualization (Fig. S3) and network learning curves showing the robustness of distance map regularized models against overfitting (Fig. S4)).
IV Discussion
We performed an extensive study on the effects of hyper-parameters on the performance of the proposed regularization framework. Here we summarize the effects of the learned vs. fixed task weighting, and various choices of the distance map threshold. Furthermore, we analyzed the distribution of network weights before and after regularization.
*Task Weighting:* At first, we initialized the weights for the cross-entropy and MAD losses equally to 1.0. However, for the best performing models on the validation set, the learned weights for the cross-entropy and MAD losses were around 0.01 and 17 for the ACDC dataset, and around 0.02 and 13 for the LVSC dataset, respectively.
To determine the effect of learned task weighting scheme presented in section II-C, we analyzed the average Dice coefficient of the test set segmentation results for both ACDC (100 volumes) and LVSC (1050 volumes) datasets with fixed versus learned weighting. From Fig. 6, we can observe a significant improvement in average Dice coefficient (based on the 95% bootstrap confidence intervals) with learned weights compared to fixed (equal) weighting. Since the scales of the two losses are different, the equal weighting scheme emphasizes the distance map regression task more than it should, hence deteriorating the segmentation performance. On the other hand, the learned task weighting scheme is able to automatically weigh the two losses, bringing them to a similar scale, such that the two tasks are given equal importance, ultimately improving the segmentation performance.
*Effect of Distance Map Threshold:* We selected three extreme values for the distance map threshold: , , and . The task weights for the cross-entropy and MAD losses were equally initialized to 1.0, and the networks were trained with automatically learned task weighting for a fixed number of epochs. The average Dice coefficient on the test set, obtained from the best performing models on the validation set across five-fold cross-validation, is summarized in Fig. 7. We observe similar performance for different threshold values, demonstrating the low sensitivity of the proposed method to the distance map threshold. Hence, we decided to use a very high threshold of pixels, which is almost equivalent to regressing the full distance map and neglecting this hyper-parameter.
*Network Weight Distribution:* We also analyzed the weight distribution of the network before and after distance map regularization, as shown in the Supplementary Materials (Fig. S5). We observe that the number of non-zero weights increases after the distance map regularization, indicating better utilization of the network capacity. A similar flattening of the network weight histogram has been reported for dropout regularization and Bayesian neural networks [53], both of which reduce overfitting and hence improve generalization. Specifically, the network weights are randomly dropped during dropout, forcing the network to use the remaining weights to identify the patterns in data (spreading the weight histogram), hence creating an ensemble effect with reduced over-fitting and improved generalization. We observe a similar pattern in the weight distribution after the distance map regularization.
V Conclusion
In this work we proposed and implemented a multi-task learning-based regularization method for fully convolutional networks for semantic image segmentation and demonstrated its benefits in the context of cardiac MR image segmentation. To implement the proposed method, we appended a decoder network at the bottleneck layer of existing FCN architectures to perform an auxiliary task of distance map prediction, which is removed after training.
We automatically learned the weighting of the tasks based on their uncertainty. As the distance map contains robust information regarding the shape, location, and boundary of the object to be segmented, it facilitates the FCN encoder to learn robust global features important for the segmentation task.
Our experiments verify that introducing the distance map regularization improves the segmentation performance of three FCN architectures for both binary and multi-class segmentation across two publicly available cardiac cine MRI datasets featuring significant patient anatomy and image variability. Specifically, we observed consistent improvement in segmentation performance in the challenging apical and basal slices in response to the soft constraints imposed by the distance map regularization. We also showed consistent segmentation improvement on all five patient pathologies in the ACDC dataset. Furthermore, these improvements were also reflected in the computed clinical indices important for the diagnosis of various heart conditions. Lastly, we demonstrated that the proposed regularization significantly improved the generalization ability of the networks on cross-dataset segmentation (transfer learning), without being aware of the new data distribution, with 5% to 42% improvement in average Dice coefficient over the baseline FCN architectures.
Acknowledgment
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R35GM128877 and by the Office of Advanced Cyberinfrastructure of the National Science Foundation under Award No. 1808530. Ziv Yaniv’s work was supported by the Intramural Research Program of the U.S. National Institutes of Health, National Library of Medicine.
- [1] P. Peng, K. Lekadir, A. Gooya, L. Shao, S. E. Petersen, and A. F. Frangi, “A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging,” Magnetic Resonance Materials in Physics, Biology and Medicine, vol. 29, no. 2, pp. 155–195, Apr 2016.
- [2] C. Petitjean and J.-N. Dacher, “A review of segmentation methods in short axis cardiac MR images,” Medical Image Analysis, vol. 15, no. 2, pp. 169–184, 2011.
- [3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- [4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
- [5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
- [6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
- [7] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, Jun 2017.
- [8] G. Litjens et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
