Automatic Segmentation of Vestibular Schwannoma from T2-Weighted MRI by Deep Spatial Attention with Hardness-Weighted Loss
Guotai Wang, Jonathan Shapey, Wenqi Li, Reuben Dorent, Alex Demitriadis, Sotirios Bisdas, Ian Paddick, Robert Bradford, Sébastien Ourselin, Tom Vercauteren

TL;DR
This paper presents a novel 2.5D CNN with supervised attention and a hardness-weighted loss for improved automatic segmentation of vestibular schwannoma tumors in MRI, enhancing accuracy and robustness in clinical settings.
Contribution
It introduces a 2.5D CNN architecture with supervised attention and a hardness-weighted Dice loss, outperforming existing methods for tumor segmentation in MRI.
Findings
The proposed 2.5D CNN outperforms 2D and 3D models.
Supervised attention improves segmentation accuracy.
Hardness-weighted Dice loss enhances training effectiveness.
Table 1: Quantitative evaluation of different networks trained with Dice loss (mean ± standard deviation).
| Network | Dice (%) | ASSD (mm) | RVE (%) | Time (s) |
|---|---|---|---|---|
| 2D U-Net | 80.38 ± 10.42 | 0.92 ± 0.68 | 18.01 ± 17.23 | 3.56 ± 0.36 |
| 3D U-Net | 83.61 ± 13.69 | 0.84 ± 0.62 | 18.01 ± 17.48 | 3.90 ± 0.49 |
| 2.5D U-Net | 85.69 ± 7.07 | 0.67 ± 0.45 | 16.02 ± 14.71 | 3.49 ± 0.39 |
| 2.5D U-Net + AG [5] | 85.93 ± 6.96 | 0.58 ± 0.41* | 15.45 ± 12.37 | 3.51 ± 0.34 |
| 2.5D U-Net + PA | 86.09 ± 6.94 | 0.55 ± 0.32* | 14.87 ± 12.19 | 3.52 ± 0.37 |
| 2.5D U-Net + SpvPA | 86.71 ± 4.99* | 0.53 ± 0.29* | 13.40 ± 9.34* | 3.52 ± 0.37 |
1 School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, China
2 School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK
3 Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
4 NVIDIA, Cambridge, UK
5 Queen Square Radiosurgery Centre (Gamma Knife), National Hospital for Neurology and Neurosurgery, London, UK
6 Neuroimaging Analysis Centre, Queen Square, London, UK
7 Department of Neurosurgery, National Hospital for Neurology and Neurosurgery, London, UK
Email: [email protected]
Guotai Wang 1,2,3
Jonathan Shapey 2,3,7
Wenqi Li 4
Reuben Dorent 2
Alex Demitriadis 5
Sotirios Bisdas 6
Ian Paddick 5
Robert Bradford 5,7
Sébastien Ourselin 2
Tom Vercauteren 2
Abstract
Automatic segmentation of vestibular schwannoma (VS) tumors from magnetic resonance imaging (MRI) would facilitate efficient and accurate volume measurement to guide patient management and improve clinical workflow. The accuracy and robustness are challenged by low contrast, a small target region and low through-plane resolution. We introduce a 2.5D convolutional neural network (CNN) able to exploit the different in-plane and through-plane resolutions encountered in standard-of-care imaging protocols. We use an attention module to enable the CNN to focus on the small target and propose a supervision on the learning of attention maps for more accurate segmentation. Additionally, we propose a hardness-weighted Dice loss function that gives higher weights to harder voxels to boost the training of CNNs. Experiments with ablation studies on the VS tumor segmentation task show that: 1) the proposed 2.5D CNN outperforms its 2D and 3D counterparts, 2) our supervised attention mechanism outperforms unsupervised attention, 3) the voxel-level hardness-weighted Dice loss can improve the performance of CNNs. Our method achieved an average Dice score and ASSD of 0.87 and 0.43 mm respectively. This will facilitate patient management decisions in clinical practice.
1 Introduction
Vestibular schwannoma (VS) is a benign tumor arising from one of the balance nerves connecting the brain and inner ear. The incidence of VS has risen significantly in recent years and is now estimated to be between 14 and 20 cases per million per year [4]. High-quality magnetic resonance imaging (MRI) is required for diagnosis, and expectant management with serial imaging is usually advised for smaller tumors. Current MR protocols include contrast-enhanced T1-weighted (ceT1) and high-resolution T2-weighted (hrT2) images, but there is increasing concern about the potentially harmful cumulative side-effects of gadolinium contrast agents. Accurate measurement of VS tumor volume from MRI is desirable for growth detection and guiding management of the tumor. However, current clinical practice relies on labor-intensive manual segmentation.
This paper aims for automatic segmentation of the VS tumor from high-resolution T2-weighted MRI. This will improve clinical workflow and enable patients to undergo surveillance imaging without the need for gadolinium contrast, thus improving patient safety. However, this task is challenging for several reasons. First, T2 images have a relatively low contrast and the exact boundary of the tumor is hard to detect. Second, the VS tumor is a relatively small structure with large shape variations in the whole brain image. Additionally, the image is often acquired with low through-plane resolution, as shown in Fig. 1.
In the literature, a Bayesian model was proposed for automatic VS tumor segmentation from ceT1 MRI [9], but it can hardly be applied to T2 images with much lower contrast. Semi-automated tools for this task suffer from inter-operator variations [8]. In recent years, convolutional neural networks (CNNs) have achieved state-of-the-art performance for many segmentation tasks [1, 2, 6]. However, most of them are proposed to segment images with isotropic resolution, and are not readily applicable to our VS images with high in-plane resolution and low through-plane resolution. To segment small structures from large image contexts, Yu et al. [12] used a coarse-to-fine approach with recurrent saliency transformation. Oktay et al. [5] learned an attention map to enable the CNN to focus on target structures. However, the attention map was not learned with explicit supervision during training, and may not be well-aligned with the target region, which can limit the segmentation accuracy. Therefore, we hypothesise that end-to-end supervision on the learning of attention maps will lead to better results. Complementary approaches to deal with small structures include the use of adapted loss functions such as Dice loss [3] and generalized Dice loss [7]. They can mitigate the class imbalance between foreground and background by image-level weighting during training. Considering the fact that some voxels are harder than others to learn during training, we propose a voxel-level hardness-weighted Dice loss function to further improve the segmentation accuracy.
The contribution of this paper is three-fold. First, to the best of our knowledge, this is the first work on automatic VS tumor segmentation using deep learning. We propose a 2.5D CNN combining 2D and 3D convolutions to deal with the low through-plane resolution. Second, we propose an attention module to enable the CNN to focus on the target region. Unlike previous works [5], we explicitly supervise the learning of attention maps so that they can highlight the target structure better. Finally, we propose a voxel-level hardness-weighted Dice loss function to boost the performance of CNNs. The proposed method was validated with T2-weighted MR images of 245 patients with VS tumor.
2 Methods
2.0.1 2.5D CNN for Segmentation of Images with Anisotropic Resolutions.
For our images with high in-plane resolution and low through-plane resolution, 2D CNNs applied slice-by-slice will ignore inter-slice correlation. Isotropic 3D CNNs may need to upsample the image to an isotropic 3D resolution to balance the physical receptive field (in terms of mm rather than voxels) along each axis, which requires more memory and may limit the depth or feature numbers of the CNNs. Therefore, it is desirable to design a 2.5D CNN that can not only use inter-slice features but also be more efficient than 3D CNNs. In addition, to make the receptive field isotropic in terms of physical dimensions, the number of convolutions along each axis should be different when dealing with such images. In [10], a 2.5D CNN was proposed for brain tumor segmentation. However, it was designed for isotropically resampled 3D images and limited by a small physical receptive field along the through-plane axis.
We propose a novel attention-based 2.5D CNN combining 2D and 3D convolutions. As shown in Fig. 2, the main structure follows the typical encoder and decoder design of U-Net [6]. The encoder contains five levels of convolutions. The first two levels (L1-L2) and the other three levels (L3-L5) use 2D and 3D convolutions/max-poolings, respectively. This is motivated by the fact that the in-plane resolution of our VS tumor images is about 4 times that of the through-plane resolution. After the first two max-pooling layers that downsample the feature maps only in 2D, the feature maps in L3 and the following levels have a near-isotropic 3D resolution. At each level, we use a block of layers containing two convolution layers, each followed by batch normalization (BN) and parametric rectified linear unit (pReLU). The number of output feature channels at level $l$ is denoted as $N_l$. $N_l$ is set as […] in our experiments. The decoder contains similar blocks of 2D and 3D layers. Additionally, to deal with the small target region, we add a spatial attention module to each level of the decoder, which is depicted in Fig. 2 and detailed in the following.
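The effect of mixing 2D and 3D pooling on the physical voxel spacing can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `encoder_spacings` is hypothetical, and it assumes stride-2 pooling and an input spacing of 0.4 mm × 0.4 mm × 1.5 mm (the approximate acquisition resolution reported in the experiments).

```python
def encoder_spacings(spacing=(0.4, 0.4, 1.5), n_levels=5, n_2d_levels=2):
    """Return the voxel spacing (x, y, z) in mm of the feature maps at each level.

    Stride-2 pooling doubles the spacing along every pooled axis.
    The pooling that follows a 2D level only acts on the in-plane axes.
    """
    x, y, z = spacing
    levels = [(x, y, z)]                 # level L1 works at the input resolution
    for level in range(1, n_levels):     # pooling applied after level `level`
        x, y = 2 * x, 2 * y
        if level > n_2d_levels:          # pooling after a 3D level also halves z
            z = 2 * z
        levels.append((x, y, z))
    return levels


spacings = encoder_spacings()
for name, s in zip(["L1", "L2", "L3", "L4", "L5"], spacings):
    print(name, s)
```

Under these assumptions the spacing at L3 is about 1.6 mm in-plane and 1.5 mm through-plane, which is the near-isotropic resolution the text refers to; from there on, 3D pooling keeps the three axes roughly balanced.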
2.0.2 Multi-Scale Supervised Spatial Attention.
Previous works have shown that spatial attention can be automatically learned in CNNs to enable the network to focus on the target region in a large image context [5]. Building upon these works, we further introduce an explicit supervision on the learning of attention to improve its accuracy. A spatial attention map can be seen as a single-channel image of attention coefficients $\alpha_i$, each a score of relative importance for spatial position $i$. As shown in Fig. 2, the proposed attention module consists of two convolution layers. For an input feature map at level $s$ with channel number $N_s$, the first convolution layer reduces the channel number to […] and is followed by ReLU. The second convolution layer further reduces the channel number to 1 and is followed by sigmoid to generate the spatial attention map $A_s$ at level $s$. $A_s$ is multiplied with the input feature map. We also use a residual connection in the attention module, as depicted in Fig. 2.
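A forward pass through such an attention module might look like the following NumPy sketch. Several details are assumptions made for illustration and are not taken from the paper: the convolutions are written as pointwise (1 × 1 × 1) convolutions, the intermediate channel number is set to half the input channels, the residual connection is taken to be `x * A + x`, and the weights are random. The function name `spatial_attention` is hypothetical.

```python
import numpy as np


def spatial_attention(x, w1, b1, w2, b2):
    """Apply a two-layer spatial attention module to a feature map.

    x:  feature map of shape (C, D, H, W)
    w1: (C_mid, C), b1: (C_mid,)  first pointwise convolution, followed by ReLU
    w2: (1, C_mid), b2: (1,)      second pointwise convolution, followed by sigmoid
    Returns the calibrated features and the attention map of shape (1, D, H, W).
    """
    hidden = np.einsum("mc,cdhw->mdhw", w1, x) + b1[:, None, None, None]
    hidden = np.maximum(hidden, 0.0)                        # ReLU
    logits = np.einsum("om,mdhw->odhw", w2, hidden) + b2[:, None, None, None]
    attention = 1.0 / (1.0 + np.exp(-logits))               # sigmoid, values in (0, 1)
    calibrated = x * attention + x                          # assumed residual form
    return calibrated, attention


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 16, 16))                         # toy feature map
w1, b1 = 0.1 * rng.normal(size=(4, 8)), np.zeros(4)         # assumed C_mid = C / 2
w2, b2 = 0.1 * rng.normal(size=(1, 4)), np.zeros(1)
y, a = spatial_attention(x, w1, b1, w2, b2)
print(y.shape, a.shape)
```

The attention map is broadcast over the channel axis, so every channel at a given position is scaled by the same coefficient. With the assumed residual form, positions with low attention are passed through unchanged rather than zeroed out.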
We propose an attention loss to supervise the learning of spatial attention explicitly during training. Let $G$ denote the multi-channel one-hot ground truth segmentation of an image and $G^f$ denote the single-channel binary foreground mask. For attention map $A_s$ at level $s$, let $G^f_s$ denote the average-pooled version of $G^f$ so that it has the same resolution as $A_s$. Our loss function for training is:
$$L = \frac{1}{S}\sum_{s=1}^{S} \ell(A_s, G^f_s) + \ell(P, G) \qquad (1)$$
where $S$ is the number of resolution levels ($S$ = […] in our case). $\ell(A_s, G^f_s)$ measures the difference between $A_s$ and $G^f_s$. It drives the attention maps to be as close to the foreground mask as possible. $P$ denotes the prediction output of CNN, i.e., the probability of belonging to each class for each voxel. $\ell(P, G)$ is the segmentation loss. The multi-scale supervision in Eq. (1) is similar to the holistic loss [11]. However, here we apply it to multi-scale attention maps rather than the network’s final prediction output. The two terms in Eq. (1) share the same underlying loss function $\ell$, as discussed in the following.
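The multi-scale supervision can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the attention terms are averaged over the levels, uses a plain single-channel soft Dice loss as the underlying loss, treats the segmentation output as a single foreground channel, and uses hypothetical pooling factors. The helper names `average_pool`, `soft_dice_loss` and `total_loss` are made up for this example.

```python
import numpy as np


def average_pool(mask, factors):
    """Average-pool a 3D array by integer factors (fd, fh, fw) along (D, H, W)."""
    fd, fh, fw = factors
    d, h, w = mask.shape
    blocks = mask.reshape(d // fd, fd, h // fh, fh, w // fw, fw)
    return blocks.mean(axis=(1, 3, 5))


def soft_dice_loss(p, g, eps=1e-5):
    """Plain single-channel soft Dice loss; stands in for the underlying loss."""
    return 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)


def total_loss(attention_maps, pool_factors, prediction, foreground):
    """Attention loss at every level plus the segmentation loss.

    attention_maps[s] must have the resolution of `foreground` pooled by pool_factors[s].
    """
    attention_terms = [
        soft_dice_loss(a, average_pool(foreground, f))
        for a, f in zip(attention_maps, pool_factors)
    ]
    # Assumption: the attention terms are averaged over the levels.
    return np.mean(attention_terms) + soft_dice_loss(prediction, foreground)


foreground = np.zeros((8, 32, 32))
foreground[2:6, 8:24, 8:24] = 1.0                           # toy foreground mask
pool_factors = [(1, 1, 1), (1, 2, 2), (1, 4, 4), (2, 8, 8)]  # hypothetical levels

perfect_maps = [average_pool(foreground, f) for f in pool_factors]
empty_maps = [np.zeros_like(m) for m in perfect_maps]

loss_perfect = total_loss(perfect_maps, pool_factors, foreground, foreground)
loss_empty = total_loss(empty_maps, pool_factors, foreground, foreground)
print(loss_perfect, loss_empty)
```

In this toy setup the foreground block is aligned with the pooling grid, so the pooled masks stay binary and attention maps that match them give a loss of essentially zero, while attention maps that miss the target are penalised even when the final prediction is correct.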
2.0.3 Voxel-Level Hardness-Weighted Dice Loss.
A good choice of $\ell$ is the Dice loss [3] proposed to train CNNs for binary segmentation, and it has shown good performance in dealing with imbalanced foreground and background classes. For segmentation of small structures with low contrast, some voxels are harder than the others to learn. Treating all the voxels for a certain class equally as in [3] may limit the performance of CNNs on hard voxels. Therefore, we propose automatic hard voxel weighting in the loss function by defining a voxel-level weight:
$$w_{ci} = \lambda \, |p_{ci} - g_{ci}| + (1 - \lambda) \qquad (2)$$
where $p_{ci}$ is the probability of being class $c$ for voxel $i$ predicted by a CNN, and $g_{ci}$ is the corresponding ground truth value. $\lambda \in [0, 1]$ controls the degree of hard voxel weighting. Our proposed hardness-weighted Dice loss (HDL) is defined as:
$$\ell_{HDL}(P, G) = 1 - \frac{1}{C}\sum_{c} \frac{2\sum_{i} w_{ci}\, p_{ci}\, g_{ci} + \epsilon}{\sum_{i} w_{ci}\,(p_{ci} + g_{ci}) + \epsilon} \qquad (3)$$
where $C$ is the channel number of $P$ and $G$, and $\epsilon$ = […] is a small number for numerical stability. Similarly to [3], the gradient of $\ell_{HDL}$ with respect to $p_{ci}$ can be easily computed. Note that for the first term in Eq. (1) dealing with attention maps, the channel number is one.
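A hardness-weighted Dice loss of this kind can be sketched in NumPy as below. The sketch assumes a voxel weight of the form `lam * |p - g| + (1 - lam)` with `lam` in [0, 1], and a per-class ratio of weighted overlap to weighted sum averaged over the classes. The function name and the default values of `lam` and `eps` are placeholders chosen for the example, not values taken from the paper.

```python
import numpy as np


def hardness_weighted_dice_loss(p, g, lam=0.5, eps=1e-5):
    """Hardness-weighted soft Dice loss.

    p, g: arrays of shape (C, N) holding per-class probabilities and one-hot labels.
    lam:  degree of hard voxel weighting in [0, 1]; lam = 0 gives uniform weights.
    eps:  small constant for numerical stability (placeholder value).
    """
    w = lam * np.abs(p - g) + (1.0 - lam)        # larger weight for harder voxels
    numerator = 2.0 * np.sum(w * p * g, axis=1) + eps
    denominator = np.sum(w * (p + g), axis=1) + eps
    return 1.0 - np.mean(numerator / denominator)


g = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])             # one-hot labels, 2 classes, 4 voxels
p = np.array([[0.9, 0.4, 0.1, 0.2],
              [0.1, 0.6, 0.9, 0.8]])             # the second voxel is a hard one

loss_plain = hardness_weighted_dice_loss(p, g, lam=0.0)
loss_weighted = hardness_weighted_dice_loss(p, g, lam=0.6)
loss_perfect = hardness_weighted_dice_loss(g, g, lam=0.6)
print(loss_plain, loss_weighted, loss_perfect)
```

With `lam = 0` every voxel has weight one and the expression reduces to an unweighted soft Dice form. As `lam` grows, confidently correct voxels contribute less to both numerator and denominator, so the loss is dominated by the voxels the network still gets wrong.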
3 Experiments and Results
3.0.1 Data and Implementation.
T2-weighted MRI of 245 patients with a single sporadic VS tumor were acquired in axial view before radiosurgery treatment, with high in-plane resolution around 0.4 mm × 0.4 mm, in-plane size 512 × 512, slice thickness and inter-slice spacing 1.5 mm, and slice number 19 to 118. The ground truth was manually annotated by an experienced neurosurgeon and physicist. We randomly split the images into 178, 20 and 47 for training, validation and testing respectively. Each image was manually cropped with a bounding box of size 100 mm × 50 mm × 50 mm, and normalized by its intensity mean and standard deviation. The CNNs were implemented in TensorFlow and NiftyNet [2] on an Ubuntu desktop with an NVIDIA GTX 1080 Ti GPU. For training, we used Adam optimizer with weight decay […] and batch size 2. The learning rate was initialized to […] and halved every 10k iterations. The training was ended when performance on the validation set stopped increasing. For quantitative evaluation, we measured Dice, average symmetric surface distance (ASSD) and relative volume error (RVE) between segmentation results and the ground truth.
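For reference, the intensity normalization and two of the evaluation metrics can be written compactly in NumPy. This is a sketch under common definitions rather than the evaluation code used in the paper: Dice is taken as twice the overlap divided by the sum of the two volumes, and RVE as the absolute volume difference divided by the ground truth volume. ASSD is omitted because it requires surface extraction. The function names are hypothetical.

```python
import numpy as np


def normalize(image):
    """Normalize an image by its intensity mean and standard deviation."""
    return (image - image.mean()) / image.std()


def dice_score(seg, gt):
    """Dice overlap between two binary masks (assumes they are not both empty)."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(seg, gt).sum() / (seg.sum() + gt.sum())


def relative_volume_error(seg, gt):
    """Absolute volume difference relative to the ground truth volume."""
    return abs(float(seg.sum()) - float(gt.sum())) / float(gt.sum())


gt = np.zeros((8, 8, 8), dtype=np.uint8)
gt[2:6, 2:6, 2:6] = 1                  # 4 x 4 x 4 ground truth cube
seg = np.zeros_like(gt)
seg[3:7, 2:6, 2:6] = 1                 # same cube shifted by one voxel

rng = np.random.default_rng(0)
image = normalize(rng.normal(loc=100.0, scale=20.0, size=(8, 8, 8)))

print(dice_score(seg, gt), relative_volume_error(seg, gt))
```

The shifted cube shares 48 of its 64 voxels with the ground truth, giving a Dice of 0.75 while the RVE is 0. This is why the paper reports both: volume agreement alone does not imply spatial agreement.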
3.0.2 Comparison of Different Networks.
First, we evaluate the performance of our 2.5D network, and refer to our CNN without the attention module as 2.5D U-Net. Its 2D and 3D counterparts with the same configuration except the dimension of convolution/deconvolution and max-pooling are referred to as 2D U-Net and 3D U-Net respectively. For 3D U-Net, the images were resampled to isotropic resolution of 0.4 mm × 0.4 mm × 0.4 mm. The performance of these networks trained with Dice loss is shown in Table 1. It can be observed that our 2.5D U-Net achieves higher accuracy than its 2D and 3D counterparts. In addition, it is more efficient than the other two. Its lower inference time than slice-by-slice 2D U-Net is due to the 3D down-sampled feature maps in L3-L5.
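The memory argument against isotropic resampling can be made concrete with a rough voxel count. The sketch below assumes the same 100 mm × 50 mm × 50 mm cropped box for both settings and uses the approximate spacings quoted in the text; `voxel_count` is a hypothetical helper.

```python
def voxel_count(box_mm, spacing_mm):
    """Approximate number of voxels covering a box, given the voxel spacing in mm."""
    volume = box_mm[0] * box_mm[1] * box_mm[2]
    voxel_volume = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    return volume / voxel_volume


box = (100.0, 50.0, 50.0)
native = voxel_count(box, (0.4, 0.4, 1.5))      # acquisition resolution
isotropic = voxel_count(box, (0.4, 0.4, 0.4))   # resampled input for the 3D U-Net
ratio = isotropic / native
print(round(native), round(isotropic), round(ratio, 2))
```

Because only the through-plane spacing changes, the ratio is simply 1.5 / 0.4 = 3.75: under these assumptions the resampled volume fed to the 3D network holds almost four times as many voxels as the native one, without adding any acquired information.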
3.0.3 Effect of Supervised Attention.
We further investigated the effect of our proposed attention (PA) module and supervised attention (SpvPA). PA was compared with the attention gate (AG) module proposed in [5]. We combined these modules with our 2.5D U-Net respectively. AG was used to calibrate features from the encoder, as implemented in [5], while our PA and SpvPA were designed to calibrate the concatenation of encoder and decoder features, as shown in Fig. 2. These variants were trained with Dice loss and their performance is shown in Table 1. It can be observed that both AG [5] and PA lead to a better segmentation accuracy than the 2.5D U-Net without attention, and our proposed PA performs slightly better than AG [5]. By using our SpvPA, the segmentation accuracy can be further improved from that of PA.
Fig. 3 shows a visual comparison of these three different attention methods. It can be observed that the attention map of AG [5] successfully suppresses most of the background region, but the magnitude for the target region is lower than that of PA and SpvPA. The attention map of PA highlights the target region, but also assigns high attention coefficients for strong edges in the input image. This is mainly because the input for the PA module is a concatenation of high-level and low-level features. Benefiting from our explicit supervision on the learning of attention, the attention map of SpvPA focuses more on the target region and is less blurry than that of AG [5].
3.0.4 Performance of Voxel-Level Hardness-Weighted Dice Loss.
We additionally used HDL to train 2.5D U-Net and 2.5D U-Net + SpvPA respectively. The average Dice, ASSD and RVE of these two networks with different values of $\lambda$ are shown in Fig. 4. Note that $\lambda = 0$ is the baseline without hard voxel weighting and a higher value of $\lambda$ corresponds to assigning higher weights to harder voxels during training. The figure shows that our HDL with different values of $\lambda$ leads to higher segmentation performance. An improvement of accuracy is observed when $\lambda$ increases from 0.0 to 0.4. Interestingly, when $\lambda$ is higher than 0.6, the segmentation accuracy decreases, as shown by the curves of Dice in Fig. 4. This indicates that giving too much emphasis to hard voxels may decrease the generalization ability of the CNNs. As a result, we suggest a proper range of $\lambda$ as [0.4, 0.6]. Quantitative comparison between Dice loss [3] and our proposed HDL with $\lambda$ = […] is presented in Table 2. It shows that our proposed HDL outperforms Dice loss [3] for both 2.5D U-Net and 2.5D U-Net + SpvPA.
4 Discussion and Conclusion
In this work, we propose a 2.5D CNN for automatic VS tumor segmentation from high-resolution T2-weighted MRI. Our network is a trade-off between standard 2D and 3D CNNs and specifically designed for images with high in-plane resolution and low through-plane resolution. Experiments show that it outperforms its 2D and 3D counterparts in terms of segmentation accuracy and efficiency. To deal with the small target region, we propose a multi-scale spatial attention mechanism with explicit supervision on the learning of attention maps. Experimental results demonstrate that the supervised attention can guide the network to focus more accurately on the target region, leading to higher accuracy of the final segmentation. We also combine automatic hard voxel weighting with the existing Dice loss [3], and the proposed voxel-level hardness-weighted Dice loss can lead to further performance improvement. This will facilitate the rapid adoption of these techniques into clinical practice, providing clinicians with the means for accurate, automatically generated segmentations that will be used to inform patient management decisions. Though our methods can also be applied to ceT1 images, this work on T2 image segmentation improves patient safety by enabling patients to undergo serial imaging without the need to use potentially harmful contrast agents.
4.0.1 Acknowledgements.
This work was supported by Wellcome Trust [203145Z/16/Z; 203148/Z/16/Z; WT106882], and EPSRC [NS/A000050/1; NS/A000049/1] funding. TV is supported by a Medtronic / Royal Academy of Engineering Research Chair [RCSRF1819/7/34].
References
- [1] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In: MICCAI. pp. 424–432 (2016)
- [2] Gibson, E., Li, W., Sudre, C., Fidon, L., Shakir, D.I., Wang, G., Eaton-Rosen, Z., Gray, R., Doel, T., Hu, Y., Whyntie, T., Nachev, P., Modat, M., Barratt, D.C., Ourselin, S., Cardoso, M.J., Vercauteren, T.: NiftyNet: A deep-learning platform for medical imaging. Comput. Methods Programs Biomed. 158, 113–122 (2018)
- [3] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: IC3DV. pp. 565–571 (2016)
- [4] Moffat, D.A., Hardy, D.G., Irving, R.M., Viani, L., Beynon, G.J., Baguley, D.M.: Referral patterns in vestibular schwannomas. Clin. Otolaryngol. Allied Sci. 20(1), 80–83 (1995)
- [5] Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., Glocker, B., Rueckert, D.: Attention U-Net: Learning where to look for the pancreas. arXiv Prepr. arXiv:1804.03999 (2018)
- [6] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241 (2015)
- [7] Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M.J.: Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep Learn. Med. Image Anal. Multimodal Learn. Clin. Decis. Support. pp. 240–248 (2017)
- [8] Tysome, J., Patterson, A., Das, T., Donnelly, N., Mannion, R., Axon, P., Graves, M., MacKeith, S.: A comparison of semi-automated volumetric vs linear measurement of small vestibular schwannomas. Eur. Arch. Oto-Rhino-Laryn. 275(4), 867–874 (2018)
