FRNET: Flattened Residual Network for Infant MRI Skull Stripping
Abstract
Skull stripping of brain MR images is a basic segmentation task. Although many methods have been proposed, most of them focus on adult MR images. Skull stripping of infant MR images is more challenging because of the small size and dynamic intensity changes of brain tissues at early ages. In this paper, we propose a novel CNN-based framework that robustly extracts the brain region from infant MR images without any human assistance. Specifically, we propose a simplified but more robust flattened residual network architecture (FRnet). We also introduce a new boundary loss function to highlight ambiguous and low-contrast regions between brain and non-brain regions. To make the whole framework more robust to MR images of varying quality, we further introduce an artifact simulator for data augmentation. We trained and tested the proposed framework on a large dataset (N=343) covering newborns to 48-month-olds and obtained better performance than state-of-the-art methods in all age groups.
**Index Terms:** Skull stripping, Infant brain, Deep learning
1 INTRODUCTION
Skull stripping, also called brain extraction, aims to retain brain parenchyma and discard non-brain tissues such as skull, scalp, and dura [1]. Because it is a fundamental problem in brain MR image analysis, numerous methods have been proposed over the past 20 years. Some are based on morphological operations, e.g., brain surface extraction (BSE) [2], and others are based on deformable models that try to fit the brain surface, e.g., the brain extraction tool (BET) [1]. However, most of these methods focus only on adult MR images, and only a few are dedicated to infant brain MRI [3, 4]. The main challenge in skull stripping of infant brain MR images is the rapid change of brain tissues during early life [5]. As an example, Fig. 1 shows infant brain images from the first 4 years of life. These images show low contrast and dynamic changes in imaging intensity, brain size, and brain shape. Recently, deep convolutional neural networks (CNNs) [6] have achieved great success in medical image segmentation. Among them, UNet [7], an evolutionary variant of CNN, has achieved excellent performance by effectively combining high-level and low-level features in the network architecture. Inspired by UNet, in this paper we propose a new deep learning-based framework to deal with the size variance and dynamic intensity changes of MR images across age groups.
In the training process, we introduce a new boundary loss function that uses spatial information and applies voxel-wise weights to speed up training and avoid poor local minima. As shown in Fig. 5, this loss function substantially improves segmentation accuracy. We also introduce an artifact simulator for data augmentation to make the proposed framework more robust to low-quality images.
2 METHOD
2.1 Network Architecture
Our proposed flattened residual net (FRnet) is motivated by UNet [7]. To address the over-fitting of UNet on some datasets and to improve its ability to generalize to image changes across age groups: 1) we simplify the encoder section and apply only one convolution after down-sampling in each layer; 2) we use stride-2 convolutional layers for down-sampling and deconvolutional layers for up-sampling; 3) we introduce residual paths in the decoder section to help training. The architecture of FRnet is shown in Fig. 2.
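The three design choices above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the use of 3D convolutions, the network depth, the channel widths, the kernel sizes, the ReLU activations, and the class names `Down`, `Up`, and `FRNetSketch` are all assumptions made for illustration; the actual configuration is the one given in Fig. 2.

```python
import torch
import torch.nn as nn


class Down(nn.Module):
    """Stride-2 convolution for down-sampling, then a single convolution."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(c_out, c_out, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)


class Up(nn.Module):
    """Deconvolution for up-sampling, skip concatenation, and a residual path."""

    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.up = nn.ConvTranspose3d(c_in, c_out, kernel_size=2, stride=2)
        self.fuse = nn.Conv3d(c_out + c_skip, c_out, kernel_size=3, padding=1)
        self.conv = nn.Conv3d(c_out, c_out, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x, skip):
        x = self.up(x)
        x = self.relu(self.fuse(torch.cat([x, skip], dim=1)))
        return self.relu(x + self.conv(x))  # residual path in the decoder


class FRNetSketch(nn.Module):
    """Two-level encoder/decoder; depth and widths are illustrative only."""

    def __init__(self, in_ch=1, n_classes=2, width=(8, 16, 32)):
        super().__init__()
        w0, w1, w2 = width
        self.stem = nn.Sequential(
            nn.Conv3d(in_ch, w0, kernel_size=3, padding=1), nn.ReLU()
        )
        self.down1 = Down(w0, w1)
        self.down2 = Down(w1, w2)
        self.up1 = Up(w2, w1, w1)
        self.up2 = Up(w1, w0, w0)
        self.head = nn.Conv3d(w0, n_classes, kernel_size=1)

    def forward(self, x):
        s0 = self.stem(x)
        s1 = self.down1(s0)
        s2 = self.down2(s1)
        x = self.up1(s2, s1)
        x = self.up2(x, s0)
        return self.head(x)  # per-voxel class logits (soft-max applied later)
```

With these assumed settings, an input of shape `(1, 1, 16, 16, 16)` yields logits of shape `(1, 2, 16, 16, 16)`; spatial sizes must be divisible by 4 so that the skip connections line up.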
2.2 Boundary Loss
The output of the network goes through a soft-max layer, so that the value at each voxel represents the probability of each class. Most studies use the cross-entropy loss to train the network. However, this loss commonly decreases only for a while at the beginning of training, then oscillates strongly and ends up in a poor local minimum. This yields a network that can only roughly locate the regions but fails to reproduce the details. Some methods address this issue with weighted loss functions, e.g., the weighted cross-entropy (WCE) loss [8], which applies a weight to each class during training, as shown below.
$$\mathrm{WCE}(p_t) = -\alpha_t \log(p_t)$$
where $p_t$ is the predicted probability of the target class at a voxel, and $\alpha_t$ is a manually selected weight for that class. During training, the loss value is usually also divided by the volume size.
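As a rough NumPy illustration of the weighted cross-entropy described above, the sketch below takes the per-voxel loss to be $-\alpha_t \log(p_t)$ and divides the sum by the volume size. The function name `weighted_cross_entropy`, the `(C, ...)` array layout, and the clipping constant are assumptions for the example, not details from the paper.

```python
import numpy as np


def weighted_cross_entropy(probs, labels, alpha):
    """Mean class-weighted cross-entropy over all voxels.

    probs:  (C, ...) soft-max probabilities for C classes
    labels: (...) integer class index of each voxel
    alpha:  (C,) manually selected weight for each class
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    alpha = np.asarray(alpha, dtype=np.float64)
    # p_t: probability assigned to the true class of each voxel
    p_t = np.take_along_axis(probs, labels[np.newaxis], axis=0)[0]
    loss = -alpha[labels] * np.log(np.clip(p_t, 1e-12, 1.0))
    return loss.sum() / labels.size  # divide by the volume size
```

For a binary brain mask, `alpha` would hold two values, one for the non-brain class and one for the brain class.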
An improved version of WCE is the well-known focal loss [9], which modifies the weight dynamically according to the output of the network:
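$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where $\gamma \ge 0$ is the focusing parameter of [9], which down-weights voxels that are already well classified.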
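For comparison, the focal loss of [9] can be sketched in the same style, using its standard form $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$. The function name `focal_loss`, the default `gamma=2.0`, and the array layout are assumptions for the example.

```python
import numpy as np


def focal_loss(probs, labels, alpha, gamma=2.0):
    """Mean focal loss over all voxels.

    probs:  (C, ...) soft-max probabilities for C classes
    labels: (...) integer class index of each voxel
    alpha:  (C,) weight for each class
    gamma:  focusing parameter; gamma = 0 gives weighted cross-entropy
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    alpha = np.asarray(alpha, dtype=np.float64)
    p_t = np.take_along_axis(probs, labels[np.newaxis], axis=0)[0]
    # (1 - p_t)^gamma shrinks the loss of confidently correct voxels
    modulator = (1.0 - p_t) ** gamma
    loss = -alpha[labels] * modulator * np.log(np.clip(p_t, 1e-12, 1.0))
    return loss.sum() / labels.size
```

The weight here depends only on the network output at each voxel, not on where the voxel lies in the image.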
Our proposed method, boundary loss, takes into consideration the spatial information derived from the whole brain mask. To avoid being dominated by the easily classified inner regions in the training process, we increase the weight of loss generated by the voxels near the boundary during the training process.
[TABLE]
where $B$ is a density map derived from the boundary of the target region using a Gaussian filter, as shown in Fig. 3.
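One possible way to build such a density map and use it as a voxel-wise weight is sketched below in NumPy. Several choices are assumptions, not the paper's definition: boundary voxels are taken to be mask voxels with at least one face-adjacent background neighbor, $B$ is normalized to a peak of 1, and the weight is assumed to be $1 + \lambda B$. The names `boundary_map`, `boundary_weighted_ce`, `sigma`, and `lam` are hypothetical.

```python
import numpy as np


def _smooth_1d(line, kernel):
    """Zero-padded 1D convolution that preserves the input length."""
    radius = len(kernel) // 2
    return np.convolve(np.pad(line, radius), kernel, mode="valid")


def boundary_map(mask, sigma=2.0):
    """Gaussian-smoothed boundary of a binary mask, scaled to a peak of 1."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    eroded = padded.copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            eroded &= np.roll(padded, shift, axis=axis)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    # Boundary: mask voxels removed by a one-voxel erosion
    b = (padded & ~eroded)[core].astype(np.float64)
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    for axis in range(b.ndim):  # separable Gaussian filter
        b = np.apply_along_axis(_smooth_1d, axis, b, kernel)
    peak = b.max()
    return b / peak if peak > 0 else b


def boundary_weighted_ce(probs, labels, b_map, lam=1.0):
    """Cross-entropy with the assumed voxel-wise weight 1 + lam * B."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    p_t = np.take_along_axis(probs, labels[np.newaxis], axis=0)[0]
    weights = 1.0 + lam * np.asarray(b_map, dtype=np.float64)
    loss = -weights * np.log(np.clip(p_t, 1e-12, 1.0))
    return loss.sum() / labels.size
```

Under these assumptions, errors near the brain boundary cost up to `1 + lam` times as much as errors deep inside or far outside the brain, which is the effect the boundary loss aims for.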
2.3 Artifact Simulator
MR imaging is a time-consuming process and is therefore very sensitive to motion, which can cause ghosting, blurring, and geometric distortion in the images and severely compromise subsequent analysis. Over the past 30 years, during which MRI has been used for medical diagnosis, numerous methods have been proposed to mitigate or correct this artifact, but no single method can be applied in all imaging situations [10] because of the huge number of unknown motion parameters. Therefore, instead of trying to correct the motion artifacts, we propose to use an artifact simulator for data augmentation during training, with the aim of improving skull stripping rather than image acquisition. We find that this method improves the robustness of the whole skull-stripping framework for images both with and without artifacts.
The raw data acquired by an MR scanner are stored in the so-called k-space domain [11]. To simulate motion effects in an MR image, we first apply a Fourier transform to obtain the k-space data. We then apply a random phase shift to each k-space line along the readout direction, where the random phase shift is given by:
[TABLE]
where $k_y$ and $k_z$ are the k-space coordinates of the k-space line along the phase-encoding and partition-encoding directions, and $\sigma_y$ and $\sigma_z$ are the random motion amounts, which are assumed to follow a Gaussian distribution. After the random phase shift is applied, an inverse Fourier transform of the k-space data yields the motion-corrupted image. The standard deviation $\sigma$ of the Gaussian distribution is varied to generate images with different severities of motion artifacts. Fig. 4 shows the simulated images.
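The procedure can be sketched in NumPy as follows. Because the exact phase expression is not reproduced here, the sketch assumes the Fourier shift theorem form $\phi = -2\pi (k_y d_y + k_z d_z)$, with an independent Gaussian displacement $(d_y, d_z)$ in voxels drawn for every k-space line. The axis order (readout, phase, partition), the voxel units of `sigma`, and the function name `simulate_motion` are also assumptions.

```python
import numpy as np


def simulate_motion(volume, sigma, rng=None):
    """Corrupt a 3D volume with random per-line phase shifts in k-space.

    volume: 3D array with assumed axes (readout, phase, partition)
    sigma:  standard deviation of the random displacement, in voxels
    """
    rng = np.random.default_rng() if rng is None else rng
    volume = np.asarray(volume, dtype=np.float64)
    _, ny, nz = volume.shape
    kspace = np.fft.fftn(volume)
    # k-space coordinates in cycles per voxel
    ky = np.fft.fftfreq(ny)[:, np.newaxis]
    kz = np.fft.fftfreq(nz)[np.newaxis, :]
    # One random displacement per k-space line, i.e. per (ky, kz) pair
    dy = rng.normal(0.0, sigma, size=(ny, nz))
    dz = rng.normal(0.0, sigma, size=(ny, nz))
    phase = -2.0 * np.pi * (ky * dy + kz * dz)
    # The same phase is applied to every sample along the readout axis
    corrupted = kspace * np.exp(1j * phase)[np.newaxis, :, :]
    # The result is complex in general; keep the magnitude image
    return np.abs(np.fft.ifftn(corrupted))
```

Calling the function with a small and a large `sigma` on the same volume gives a mildly and a heavily corrupted copy, while `sigma=0` returns the original intensities for a non-negative input.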
3 EXPERIMENTS
3.1 Dataset
All data used in this study were obtained from the UNC/UMN Baby Connectome Project (BCP) dataset [12]. All scans of subjects less than two years old were acquired while the infants were naturally sleeping and fitted with ear protection, with their heads secured in a vacuum-fixation device. T1w MR images were acquired with 320 sagittal slices using the parameters TR/TE = 2400/2.24 ms and resolution = 0.8×0.8×0.8 mm³.
As shown in Table 1, all scans were labeled by human experts and grouped into 7 age classes: 0 (newborn), 3, 6, 9, 12, 24, and 48 months. We randomly chose 20% of the scans from each class as our testing set.
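A per-class random split of this kind can be sketched in plain Python. The scan identifiers, the fixed seed, the rounding rule for the test count, and the function name `stratified_split` are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(age_by_scan, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the scans within each age class.

    age_by_scan: dict mapping a scan identifier to its age class in months
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for scan_id, age in age_by_scan.items():
        groups[age].append(scan_id)
    train, test = [], []
    for age in sorted(groups):
        ids = sorted(groups[age])  # sort first so the seed is reproducible
        rng.shuffle(ids)
        n_test = round(test_fraction * len(ids))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```

Splitting within each age class keeps every age group represented in the testing set, which matters here because image appearance changes strongly with age.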
3.2 Implementation Details
We implemented our network in PyTorch and used the Python libraries SimpleITK and scikit-image for data preprocessing. We trained and tested our framework on a Linux workstation equipped with an Intel Xeon E5-2650 v4 CPU and NVIDIA TITAN Xp GPUs (12 GB each).
All MR images were preprocessed with N4 bias correction [13], and we also used randomized reorientation and resizing for data augmentation. For each training image, we used the artifact simulator to generate two simulated images for training, with σ = 0.3 and σ = 1.0, respectively.
All CNN models were initialized with Xavier initialization [14], and we chose Adam as the optimizer with a fixed learning rate of 0.003. We used 2-fold cross-validation on the training set to select the best model for testing.
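The 2-fold cross-validation used for model selection can be outlined in plain Python. The random halving, the seed, and the function name `two_fold_indices` are assumptions; the text does not specify how the folds were formed.

```python
import random


def two_fold_indices(n_items, seed=0):
    """Split item indices into two halves and return both fold assignments.

    Returns a list of (train_indices, validation_indices) pairs, so that
    each half serves once as the validation set.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    half = n_items // 2
    first, second = idx[:half], idx[half:]
    return [(second, first), (first, second)]
```

Each candidate model is then trained on the first element of a pair and scored on the second, and the candidate with the best validation score is kept for the final test.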
3.3 Results
As shown in Fig. 5, the widely used cross-entropy loss function cannot clearly distinguish the brain and non-brain regions, because the easy-to-segment inner region is too large and stalls the training process. In contrast, our proposed boundary loss forces the optimizer to focus more on the outer regions and thus produces a much better result.
We further quantitatively evaluate our two main contributions: 1) the proposed FRnet architecture for segmentation and 2) the boundary loss for training. Among traditional methods, we chose BSE and BET for comparison; among deep learning-based methods, we chose UNet. We also tested different loss functions on each network architecture. The results are shown in Table 2.
4 CONCLUSION
In this work, we have proposed a novel framework to train a CNN model for skull stripping of infant brain MR images. We believe the same framework could also be applied to other segmentation tasks, such as white matter (WM) and gray matter (GM) segmentation. Our proposed FRnet is a simplified version of UNet, and we have demonstrated its robustness to the dynamic intensity changes and low contrast seen in MR images of early age groups. Since UNet is a widely used deep learning model for segmentation tasks, we believe FRnet can help related applications where UNet is used. The proposed boundary loss may also benefit other tasks, because the boundary is often the most crucial part of a segmentation map. Finally, the artifact simulator could be helpful in other tasks such as registration and landmark detection, because motion artifacts are a common problem in MR-related applications.