4D CNN for semantic segmentation of cardiac volumetric sequences

Andriy Myronenko; Dong Yang; Varun Buch; Daguang Xu; Alvin Ihsani,; Sean Doyle; Mark Michalski; Neil Tenenholtz; Holger Roth

arXiv:1906.07295·eess.IV·October 11, 2019

4D CNN for semantic segmentation of cardiac volumetric sequences

Andriy Myronenko, Dong Yang, Varun Buch, Daguang Xu, Alvin Ihsani,, Sean Doyle, Mark Michalski, Neil Tenenholtz, Holger Roth

PDF

TL;DR

This paper introduces a 4D CNN for cardiac CT segmentation that effectively utilizes sparse annotations and temporal information to produce consistent segmentations across sequences.

Contribution

The paper presents a novel 4D CNN architecture with a sparse loss function that leverages unlabeled data for cardiac volumetric sequence segmentation.

Findings

01

The 4D CNN achieves accurate, temporally consistent segmentation results.

02

It outperforms traditional 3D segmentation methods on cardiac 4D CCTA data.

03

The approach effectively uses sparse annotations to train the model.

Abstract

We propose a 4D convolutional neural network (CNN) for the segmentation of retrospective ECG-gated cardiac CT, a series of single-channel volumetric data over time. While only a small subset of volumes in the temporal sequence is annotated, we define a sparse loss function on available labels to allow the network to leverage unlabeled images during training and generate a fully segmented sequence. We investigate the accuracy of the proposed 4D network to predict temporally consistent segmentations and compare with traditional 3D segmentation approaches. We demonstrate the feasibility of the 4D CNN and establish its performance on cardiac 4D CCTA.

Tables2

Table 1. Table 1: Network structure, where GN stands for group normalization (with group size of 8), Conv - 3x3x3x3 convolution, AddId - addition of identity/skip connection. Repeat column shows the number of repetitions of the block. The output, after softmax, has 3 channels (background and 2 foreground classes)

Name	Ops	Repeat	Output size
Input			1x96x96x64x16
InitConv	Conv 3x3x3x3		8x96x96x64x16
EncoderBlock0	GN,ReLU,Conv,GN,ReLU,Conv, AddId		8x96x96x64x16
EncoderDown1	Conv 3x3x3x3 stride 2		16x48x48x32x8
EncoderBlock1	GN,ReLU,Conv,GN,ReLU,Conv, AddId	x2	16x48x48x32x8
EncoderDown2	Conv 3x3x3x3 stride 2		32x24x24x16x4
EncoderBlock2	GN,ReLU,Conv,GN,ReLU,Conv, AddId	x4	32x24x24x16x4
DecoderUp1	Conv1, UpNearest, +EncoderBlock1		16x48x48x32x8
DecoderBlock1	GN,ReLU,Conv,GN,ReLU,Conv, AddId		16x48x48x32x8
DecoderUp0	Conv1, UpNearest, +EncoderBlock0		8x96x96x64x16
DecoderBlock0	GN,ReLU,Conv,GN,ReLU,Conv, AddId		8x96x96x64x16
DecoderEnd	Conv 1x1x1x1, Softmax		3x96x96x64x16

Table 2. Table 2: Performance evaluation of the 4D semantic segmentation network. LVM - left ventricular myocardium and LV - left ventricle. We also measure temporal smoothness in the result using the L2 norm of temporal derivative of the predictions and average surface distance between the consecutive frames. The proposed 4D network produces temporally smoother results with comparable dice scores.

	Dice		Smoothness
Arch	LVM	LV	L2	Surf
3D network	0.85	0.91	1.28	0.74
4D network	0.85	0.90	1.05	0.59

Equations4

L = i \in labeled \sum D (p_{true}^{i}, p_{pred}^{i}) + i = 0 \sum K - 2 ∣∣ p_{pred}^{i + 1} - p_{pred}^{i} ∣ ∣^{2}

L = i \in labeled \sum D (p_{true}^{i}, p_{pred}^{i}) + i = 0 \sum K - 2 ∣∣ p_{pred}^{i + 1} - p_{pred}^{i} ∣ ∣^{2}

D (p_{true}, p_{pred}) = 1 - \frac{2 * \sum p _{true} * p _{pred}}{\sum p _{true}^{2} + \sum p _{pred}^{2} + ε}

D (p_{true}, p_{pred}) = 1 - \frac{2 * \sum p _{true} * p _{pred}}{\sum p _{true}^{2} + \sum p _{pred}^{2} + ε}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

11institutetext: NVIDIA 11email: {amyronenko,dongy,daguangx,aihsani,hroth}@nvidia.com 22institutetext: MGH and BWH Center for Clinical Data Science 22email: {varun.buch,sdoyle}@mgh.harvard.edu 22email: {mmichalski1,ntenenholtz}@partners.org

4D CNN for semantic segmentation of cardiac volumetric sequences

Andriy Myronenko 11

Dong Yang 11

Varun Buch 22

Daguang Xu 11

Alvin Ihsani 11

Sean Doyle 22

Mark Michalski 22

Neil Tenenholtz 22

Holger Roth 11

Abstract

We propose a 4D convolutional neural network (CNN) for the segmentation of retrospective ECG-gated cardiac CT, a series of single-channel volumetric data over time. While only a small subset of volumes in the temporal sequence is annotated, we define a sparse loss function on available labels to allow the network to leverage unlabeled images during training and generate a fully segmented sequence. We investigate the accuracy of the proposed 4D network to predict temporally consistent segmentations and compare with traditional 3D segmentation approaches. We demonstrate the feasibility of the 4D CNN and establish its performance on cardiac 4D CCTA111video: https://drive.google.com/uc?id=1n-GJX5nviVs8R7tque2zy2uHFcN_Ogn1.

1 Introduction

Cardiovascular disease is responsible for 18 million deaths annually, making it one of the leading causes of mortality globally [13]. Coronary computed tomography angiography (CCTA) uses contrast-enhanced CT to evaluate cardiac muscle morphology, function, and vascular patency. Two measurements derived from CCTA with significant diagnostic and prognostic importance are the Left Ventricular Ejection Fraction (LVEF) and Left Ventricular Wall Thickness. Both measurements require the segmentation of the left ventricular muscle, with the former requiring temporal segmentation over the cardiac cycle. The American College of Radiology (ACR) has highlighted the importance of these measurements by listing them among the most important initial ‘use cases’ of artificial intelligence as applied to radiology [1]. A segmentation model of the left ventricular muscle and cavity over the cardiac cycle, especially the end-systole and end-diastole time points, would allow for automated determination of both measurements from 4D CCTA studies. The clinical utility of such a model is highly relevant as it reduces study reading time and improves the consistency of measurements, thereby potentially preventing missed pathology in cases where the measurements may not have otherwise been performed.

Modern 4D CCTA images are acquired over the entire cardiac cycle, including end-systole and end-diastole. A typical 4D scan includes 20 3D volumes reflecting the cardiac anatomy at equally-spaced time points within a 240 ms time interval. This allows for enough temporal resolution to study the heart’s function. In order to limit the amount of effort required to annotate these images, we restrict the annotation to only certain frames, an example of which is shown in Fig. 2.

While convolutional neural networks (CNNs) have demonstrated state-of-the-art performance across a variety of segmentation tasks [8], the adoption of 4D CNNs for 4D medical imaging (3D + time – e.g., CT or ultrasound) has been limited due to the high computational complexity and lack of manually segmented data. The cost of annotating volumetric imaging is significant, making 4D labeling prohibitively expensive. Nevertheless, the temporal dimension offers valuable information that is otherwise lost when treating each volume independently.

In this work, we propose a 4D CNN for the segmentation of the left ventricle (LV) and left ventricular myocardium (LVM) from 4D CCTA images, enabling the computation of the aforementioned cardiac measurements. To reduce annotation costs, our 4D dataset is sparsely labeled across the temporal dimension – only a fraction of volumes in the sequence are labeled. This enables us to leverage a 4D CNN with a sparse loss function, allowing our algorithm to take advantage of unlabeled images which would otherwise be discarded in a 3D model. The network jointly segments the sequence of volumes, implicitly learning temporal correlations and imposing a soft temporal smoothness constraint. We describe the 4D convolution layer generalization in Section 3.1 and introduce a sparse Dice loss function as well as a temporal consistency regularization in Section 3.2. We demonstrate the feasibility of a 4D CNN and compare its performance to a traditional 3D CNN in Section 4.

2 Related work

Deep learning has achieved state-of-the-art segmentation performance in 2D natural images [2] and 2D [8] & 3D medical images [6, 7]. To leverage the temporal dependency and account for segmentation continuity, recurrent neural networks (RNNs) have been adopted for videos [11] and 2D+T cardiac MRI datasets [16]. 3D CNNs have also been applied spatio-temporally and proven effective in segmentation of videos [9, 10] and 2D+T cardiac MRIs [15].

For sequences of volumetric imaging, such as 3D+T CT or ultrasound, 4D CNNs are a natural extension. Wang et al. [12] proposed a CNN for 4D light-field material recognition incorporating separable 4D convolutions to reduce computational complexity. Clark et al. [3] adopted a 4D CNN for the de-noising of low-dose CT, where three independent 3D convolutions (with fixed cyclic time delay) were used to simulate 4D convolutions.

To date, 4D CNNs for semantic segmentation have not been explored in similar depth to 2D and 3D CNNs, in part due to their high computational requirements and lack of available annotations. In this work, we demonstrate the feasibility and advantageousness of a true 4D CNN.

3 Methods

Our 4D segmentation network architecture follows an encoder-decoder semantic segmentation strategy, typical for 2D and 3D images. Throughout the network, we use 4D convolutions with a kernel size of 3x3x3x3, where the last dimension corresponds to time. The network architecture follows the one proposed in [7], where only the main decoder branch is used and modified to fit 4D images within GPU memory limits. The input size of the network is 1x1x96x96x64x16 (corresponding to a batch size of 1, input channel 1, and a spatial crop of 96x96x64 with 16 frames). We randomly crop this 4D array from the input data during training. No other form of augmentation is employed in this study.

Each building block of the network consists of two convolutions with group normalization [14] and ReLU, followed by identity skip-connections similar to ResNet [5] blocks. A sequence of the building blocks is applied sequentially at different spatial levels. In the encoder part of the network, we downscale the spatial dimension after each level and double the feature dimension. We use strided convolutions (stride of 2) for downsizing, and all convolutions are 3x3x3x3. We use one block at level 0 (initial size), two blocks at level 1, and four blocks at level 2. At the smallest scale, the input image crop is downsized by a factor of 4 (to 24x24x16x4), which provides a balance between network depth and GPU memory limits. For the encoder branch, we leverage a similar structure with a single block per each spatial level. To upsample, we use 4D nearest-neighbor interpolation after 1x1x1x1 convolution. Finally, we use additive skip-connections between the corresponding levels. The details of network structure are shown in Table 1 and in Fig. 1.

3.1 4D convolutions

While 4D convolutional layers are not available in common deep-learning frameworks (such as TensorFlow111https://www.tensorflow.org of PyTorch222https://pytorch.org), they can be represented as a sum over a sequence of 3D convolutions along the fourth (temporal) dimension. For efficiency, we rearranged the loop to avoid repeated 3D convolutions by implementing 4D convolution as a custom TensorFlow layer. This strategy allows for a true (non-separable) 4D convolution. A common approach to maintain the same image dimension is to zero-pad prior to a convolution. We were concerned that such an approach may introduce boundary effect for the very first and last frames (when padding with zeros). We have experimented with several padding strategies for the 4th dimension only, including zero padding, mirror reflection, and replication but did not observe any noticeable performance differences, thus we decided to use conventional zero padding.

3.2 Loss

Our training dataset is sparsely labeled along the temporal dimension since labeling medical images in 4D (and even in 3D) is complex and time-consuming. Therefore, we have defined a sparse loss function that is applied only to the labeled time-frames and includes a regularization term to ensure temporal consistency between frames.

The proposed loss function is therefore composed of two terms,

[TABLE]

where $D$ is a soft dice loss [6] applied only to labeled time points (3D images) $p_{\mathrm{true}}$ to match the corresponding outputs $p_{\mathrm{pred}}$ :

[TABLE]

$K$ is the number of frames (K=16 in our case, since we use the 96x96x64x16 crop size). The second term in (1) is a first-order derivative over time to enforce similarity between frames. Re-weighting the contributions between the loss terms did not show consistent difference, so we kept the equal contributions.

3.3 Optimization

Similar to [7], we apply the Adam optimizer with an initial learning rate of $\alpha_{0}=\mathtt{1}\mathrm{e}{\mathtt{-3}}$ and progressively decrease it according to the following schedule $\alpha=\alpha_{0}\left(1-\eta/N_{\eta}\right)^{0.9}$ , where $\eta$ is an epoch counter, and $N_{\eta}$ is the total number of training epochs.

We use a batch size of 1 and sample input sequences randomly (ensuring that each training sequence is drawn once per epoch). From each 4D sequence, we apply a random crop of size 96x96x64x16 centered on a foreground (with a probability of 0.6), otherwise centered on a background voxel. Thus, at each iteration, a different number of ground truth labels is available, depending on the location of the crop window (16) of the time dimension.

3.4 Dataset

Our dataset consists of 61 4D CCTA sequences, each of 512x512x(40-108)x20 size (512x512 axial size, with 40-108 slices of variable thickness and 20 time points). The spatial image resolution is (0.24-0.46)x(0.24-0.46)x2mm. All images were acquired at Massachusetts General Hospital, Boston, USA, using a 128-slice dual-source multi-detector CT with retrospective ECG gating and tube current modulation. Sequences were reconstructed from multiple R-R333R corresponds to the peak of the QRS complex in the ECG wave. intervals, measured via electrocardiogram.

All images were resampled to an isotropic spatial resolution of 1x1x1mm, retaining the temporal resolution. After re-sampling, the 4D image sizes vary between 112x122x80x20 and 238x238x158x20 voxels. We apply a random data split, with 49 4D images used for training and 12 4D images for validation.

The number of annotated frames in each sequence varies widely, ranging from only 2 out 20 (i.e. end-systole and end-diastole) to 9 (every second time point). Overall, 247 time-points have been annotated throughout the dataset, which represents approximately 20% of all frames. We include studies with differing numbers of annotated frames in both training and validation splits to maximize temporal coverage during both training and validation.

As a second form of validation, we compare our model’s segmentation results with clinical findings. One such clinical finding is the ejection fraction measure which typically is being judged as reduced when less than 55% [4].

4 Results

We implemented our 4D network in Tensorflow and trained it on an NVIDIA Tesla P100 SXM2 GPU with 16GB memory based on the NVIDIA Clara Train SDK444https://devblogs.nvidia.com/annotate-adapt-model-medical-imaging-clara-train-sdk. Data is normalized to [-1,1] using a fixed scaling from input CT range [-1024,1024]. We train for 500 epochs and use the model at the end of training for evaluations.

For comparison, we also implemented a 3D network largely following the same architecture as in Fig. 1, except that all convolutions are 3D and include a greater number of layers with one additional down-sampling level (the end of the encoder being of size 12x12x8) as GPU memory requirements permit deeper architecture in the 3D case. For the 3D network, we use a crop of size 96x96x64 and train it only on labeled 3D frames. The 3D network learns to predict segmentation without any temporal constraint considerations. We acknowledge that such a 3D network is trained on less number of images (only the annotate frames), and weakly-supervised 3D segmentation might be a candidate for better comparison.

Segmentation performance: We evaluate both networks on the validation set, using only the labeled frames, in terms of average Dice score. In addition, we assess the temporal continuity of the produced results. A temporal smoothness metric, we compute the L2 norm of the first-order time derivative of segmentation labels, as well as the average surface distance between the consecutive frames. Intuitively, accurate segmentation results must respect the temporal continuity of the heart motion, and are expected to be smoother in the time domain.

The evaluation results are shown in Table 2. In terms of the dice score alone, the proposed 4D network demonstrated only comparable results, with one of the structures (LV cavity) 1% better dice of the 3D network. One reason for this might be that 4D network is not as deep as its 3D counterpart and the dice score is estimated frame by frame; frame-by-frame Dice score may not be the most representative accuracy measure of temporal sequence segmentations as it does not account for consistency across frames.

Visually, the 4D CNN segmentation results have superior temporal consistency, where the label changes more “fluidly” between time-frames. Our smoothness metric confirms this observation, with the proposed 4D network achieving lower smoothness loss than its 3D counterpart (see Table 2). We also observe that in many cases, 4D CNN results look better than the ground truth (See Fig. 2). The manual annotations are done slice-by-slice, which results in jittery out-of-plane annotation profiles; this especially visible in sagittal and coronal views. The proposed 4D segmentation network is able to average out these errors while learning from the overall dataset and produce coherent results both spatially and temporally. In future work, manual relabeling of some cases in all 2D planes consistently (in spatial and time dimensions) could result in a clearer advantage of our 4D approach.

Ejection fraction: We computed the ejection fraction for 12 cases (10 with normal and 2 with reduced ejection fraction) based on the ratio of minimum and maximum LV cavity volume throughout the cardiac cycle as predicted by our models. For both, 3D and 4D models, we achieve a 100% sensitivity and specificity in detecting reduced ejection fraction when compared to the findings reported in the clinical reports (provided by radiologists).

5 Conclusion

We proposed a 4D convolutional neural network for semantic segmentation of the left ventricle (LV) and left ventricular myocardium (LVM) from 4D CCTA studies. The network is fully convolutional and jointly segments a temporal sequence of volumetric images from CCTA.

We utilize a sparse Dice loss function and a temporal consistency regularization to handle the problem of sparse temporal annotation. We have demonstrated the feasibility and advantageousness of a true 4D CNN compared to 3D CNNs, where the first shows improvement in segmentation temporal consistency. The model’s result showed promise in being useful for automatically quantifying clinically measures, such as ejection fraction.

Bibliography16

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] American College of Radiology: Touch-ai directory (2019), https://www.acrdsi.org/DSI-Services/TOUCH-AI
2[2] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segm. ar Xiv:1802.02611 (2018)
3[3] Clark, D., Badea, C.: Convolutional regularization methods for 4d, x-ray ct reconstruction. In: Medical Imaging: PMI. vol. 10948 (2019)
4[4] Curtis, J.P., Sokol, S.I., Wang, Y., Rathore, S.S., Ko, D.T., Jadbabaie, F., Portnay, E.L., Marshalko, S.J., Radford, M.J., Krumholz, H.M.: The association of left ventricular ejection fraction, mortality, and cause of death in stable outpatients with heart failure. ACC 42 (4), 736–742 (2003)
5[5] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision (ECCV) (2016)
6[6] Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: Fourth International Conference on 3D Vision (3DV) (2016)
7[7] Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: Brain Les, MICCAI. pp. 311–320. LNCS, Springer (2018), https://arxiv.org/abs/1810.11654
8[8] Ronneberger, O., P.Fischer, Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). LNCS, vol. 9351, pp. 234–241. Springer (2015)