Automatic detection and segmentation of lumbar vertebra from X-ray images for compression fracture evaluation

Kang Cheol Kim; Hyun Cheol Cho; Tae Jun Jang; Jong Mun Choi; Jin Keun Seo

arXiv:1904.07624 · physics.med-ph · April 17, 2019

Open Access

TL;DR

This paper presents a hierarchical deep-learning and level-set based method for automatic lumbar vertebra segmentation in X-ray images, improving accuracy for compression fracture assessment despite challenges like overlapping shadows and unclear boundaries.

Contribution

It introduces a novel structured approach combining pose-driven learning, M-net segmentation, and level-set refinement specifically for lumbar vertebra detection in challenging X-ray images.

Findings

01. Center position detection error: 25.35±10.86 pixels

02. Mean Dice similarity: 91.60±2.22%

03. Validated on 160 clinical test X-ray images, with a 96.25% success rate in detecting all five lumbar vertebrae

Abstract

For compression fracture detection and evaluation, an automatic X-ray image segmentation technique that combines deep-learning and level-set methods is proposed. Automatic segmentation is much more difficult for X-ray images than for CT or MRI images because they contain overlapping shadows of thoracoabdominal structures including lungs, bowel gases, and other bony structures such as ribs. Additional difficulties include unclear object boundaries, the complex shape of the vertebrae, inter-patient variability, and variations in image contrast. Accordingly, a structured hierarchical segmentation method is presented that combines the advantages of two deep-learning methods. Pose-driven learning is used to selectively identify the five lumbar vertebrae in an accurate and robust manner. With knowledge of the vertebral positions, M-net is employed to segment the individual vertebrae. Finally,…

Tables (2)

Table 1: Center position errors for the five lumbar vertebrae in pixel space, reported as mean ± standard deviation.

Distance error in pixel space
L1	L2	L3	L4	L5	All
$26.84 \pm 10.58$	$24.43 \pm 11.45$	$24.57 \pm 10.63$	$26.73 \pm 10.24$	$24.17 \pm 11.18$	$25.35 \pm 10.86$

Table 2: Comparison of segmentation results for several methods, evaluated with multiple metrics. Dice coefficient, precision, sensitivity, and specificity are reported as mean ± standard deviation.

Region-based metrics (%)
Method	Dice	Precision	Sensitivity	Specificity
Pose-net+M-net+Level set	$91.60 \pm 2.22$	$84.57 \pm 3.64$	$90.13 \pm 2.91$	$99.59 \pm 0.17$
Pose-net+U-net+Level set	$91.05 \pm 3.50$	$83.74 \pm 5.47$	$90.76 \pm 3.72$	$99.49 \pm 0.26$
Pose-net+M-net	$90.38 \pm 4.31$	$82.72 \pm 6.86$	$92.74 \pm 4.33$	$99.32 \pm 0.33$
Pose-net+U-net	$90.14 \pm 4.26$	$82.32 \pm 6.79$	$93.61 \pm 3.69$	$99.24 \pm 0.035$
Original M-net	$88.31 \pm 5.97$	$79.55 \pm 9.01$	$89.22 \pm 7.27$	$99.31 \pm 0.49$
Original U-net	$87.39 \pm 7.13$	$78.27 \pm 10.46$	$89.07 \pm 9.01$	$99.22 \pm 0.46$

Equations

$$f_{\mbox{\tiny{L5}}}(I)=W^{l}\circledast\eta\left(\cdots\mathcal{P}\left(\eta\left(W^{2}\circledast\left(\eta\left(W^{1}\circledast I\right)\right)\right)\right)\right) \tag{1}$$

$$\mathcal{L}(\theta_{\mbox{\tiny{pose}}})=\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}^{(n)}(\theta_{\mbox{\tiny{pose}}}) \tag{2}$$

$$\mathcal{L}^{(n)}(\theta_{\mbox{\tiny{pose}}})=\left\|f_{\mbox{\tiny{L5}}}(I^{(n)})-y_{\mbox{\tiny{L5}}}^{(n)}\right\|_{2}^{2}+\left\|f_{\mbox{\tiny{L1-5}}}\left(I_{*}^{(n)},f_{\mbox{\tiny{L5}}}(I^{(n)})\right)-y_{\mbox{\tiny{L1-5}}}^{(n)}\right\|_{2}^{2}. \tag{3}$$

$$y_{\mbox{\tiny{L5}}}^{(n)}(\mathbf{x})=\exp\left(-\frac{\|\mathbf{x}-{\bf p}_{5}\|_{2}^{2}}{\sigma^{2}}\right) \tag{4}$$

$$y_{\mbox{\tiny{L1-5}}}(\mathbf{x})=\max\left\{y_{\mbox{\tiny{L1}}}(\mathbf{x}),\cdots,y_{\mbox{\tiny{L5}}}(\mathbf{x})\right\}. \tag{5}$$

$$h=w=\begin{cases}3\left|({\bf p}_{o,j}-{\bf p}_{o,j+1})_{y}\right|, & \text{for } j=1\\ \frac{3}{2}\left(\left|({\bf p}_{o,j-1}-{\bf p}_{o,j})_{y}\right|+\left|({\bf p}_{o,j}-{\bf p}_{o,j+1})_{y}\right|\right), & \text{for } j=2,3,4\\ 3\left|({\bf p}_{o,j-1}-{\bf p}_{o,j})_{y}\right|, & \text{for } j=5\end{cases} \tag{6}$$

$$f_{\mbox{\tiny seg}}(I_{p};\theta_{\mbox{\tiny{seg}}})=\frac{1}{4}\sum_{i=1}^{4}f_{\mbox{\tiny seg},i}(I_{p};\theta_{\mbox{\tiny{seg}},i}) \tag{7}$$

$$\mathcal{L}(\theta_{\mbox{\tiny{seg}}})=\frac{1}{4N}\sum_{n=1}^{N}\sum_{i=1}^{4}\left[-\frac{1}{\aleph_{\tiny{\mbox{pixel}}}}\left\langle y_{\mbox{\tiny{seg}}}^{(n)},{\mathbb{L}}\left(f_{\mbox{\tiny seg},i}(I_{p}^{(n)};\theta_{\mbox{\tiny{seg}},i})\right)\right\rangle\right] \tag{8}$$

$$\Phi(\varphi)=\int_{\Omega}g(\mathbf{x})\,\delta(\varphi(\mathbf{x}))\,|\nabla\varphi(\mathbf{x})|\,d\mathbf{x}+\frac{1}{2}\int_{\Omega}\left(|\nabla\varphi(\mathbf{x})|-1\right)^{2}d\mathbf{x}+\int_{\Omega}g(\mathbf{x})\,H(-\varphi(\mathbf{x}))\,d\mathbf{x}+\lambda\int_{\Omega}y_{\mbox{\tiny seg}}(\mathbf{x})\,H(-\varphi(\mathbf{x}))\,d\mathbf{x} \tag{9}$$

$$\frac{\partial}{\partial t}\varphi(\mathbf{x})=\delta(\varphi)\left[\left(\nabla g\cdot\frac{\nabla\varphi}{|\nabla\varphi|}+\nabla\cdot\left(\frac{\nabla\varphi}{|\nabla\varphi|}\right)g\right)|\nabla\varphi|\right]+\left[\nabla^{2}\varphi-\nabla\cdot\left(\frac{\nabla\varphi}{|\nabla\varphi|}\right)\right]+g\,\delta(\varphi)+\lambda\,y_{\mbox{\tiny seg}}\,\delta(\varphi). \tag{10}$$

$$d_{D}:=\frac{2\,|O_{\mbox{GT}}\cap O_{\mbox{SG}}|}{|O_{\mbox{GT}}|+|O_{\mbox{SG}}|},\qquad d_{P}:=\frac{|O_{\mbox{GT}}\cap O_{\mbox{SG}}|}{|O_{\mbox{GT}}\cup O_{\mbox{SG}}|} \tag{11}$$

$$d_{Sen}:=\frac{|O_{\mbox{GT}}\cap O_{\mbox{SG}}|}{|O_{\mbox{GT}}|},\qquad d_{Spe}:=\frac{|(O_{\mbox{GT}}\cup O_{\mbox{SG}})^{c}|}{|(O_{\mbox{GT}})^{c}|}. \tag{12}$$

Taxonomy

Topics: Medical Imaging and Analysis · COVID-19 diagnosis using AI · Dental Radiography and Imaging

Full text

Automatic detection and segmentation of lumbar vertebra from X-ray images for compression fracture evaluation

Kang Cheol Kim†, Hyun Cheol Cho†, Tae Jun Jang†, Jong Mun Choi, MD‡ (to whom correspondence should be addressed: [email protected]), and Jin Keun Seo

†Department of Computational Science and Engineering, Yonsei University, Seoul 03722, South Korea

‡DEEPNOID Inc., Seoul, South Korea

Abstract

For compression fracture detection and evaluation, an automatic X-ray image segmentation technique that combines deep-learning and level-set methods is proposed. Automatic segmentation is much more difficult for X-ray images than for CT or MRI images because they contain overlapping shadows of thoracoabdominal structures including lungs, bowel gases, and other bony structures such as ribs. Additional difficulties include unclear object boundaries, the complex shape of the vertebrae, inter-patient variability, and variations in image contrast. Accordingly, a structured hierarchical segmentation method is presented that combines the advantages of two deep-learning methods. Pose-driven learning is used to selectively identify the five lumbar vertebrae in an accurate and robust manner. With knowledge of the vertebral positions, M-net is employed to segment the individual vertebrae. Finally, fine-tuning segmentation is applied by combining the level-set method with the previously obtained segmentation results. The performance of the proposed method was validated using clinical data, resulting in a center position detection error of $25.35\pm 10.86$ pixels and a mean Dice similarity metric of $91.60\pm 2.22\%$.

1 Introduction

Compression fractures usually occur when patients with osteoporosis slip and fall. Severe osteoporosis can cause them even without a major traumatic event. Patients with compression fractures generally have symptoms such as back pain, but the symptoms are not always clear. Accurate and rapid diagnosis is essential so that patients with a suspected fracture do not miss the appropriate window for treatment.

There are a variety of modalities for diagnosing compression fractures. A plain lumbar X-ray is the frontline imaging examination for diagnosing spinal fractures and for monitoring their progression. X-rays are generally obtained in the first instance because the procedure is fast, inexpensive, and simple. On the other hand, compared with CT or MRI, X-ray images have the disadvantage of overlapping shadows from other three-dimensional thoracoabdominal structures. In clinical terms, accurate segmentation of the lumbar vertebrae could assist in accurate compression fracture diagnosis and progress estimation.

Automatic segmentation of the lumbar vertebrae is desirable because manual segmentation is cumbersome and time-consuming. It can help guide the clinician’s assessment and reduce misdiagnosis caused by human error. However, compared with CT and MRI images, automatic segmentation of the lumbar spine from X-ray images is challenging because of the overlapping shadows of complex 3D structures such as the rib cage. It is difficult to segment the five lumbar vertebrae selectively without using anatomical and morphological information.

Various automated vertebral segmentation methods have been developed for use with medical imaging modalities, most commonly for CT and to a lesser degree for X-ray images. Most of the methods are based on variants of deformable models [Caselles1988, Cootes1995, Davatzikos2002], with some constraints to improve accuracy and robustness. Klinder et al. [Klinder2009] developed a mean-shape-constrained deformable model for CT images, and Ibragimov et al. [Ibragimov2014, Ibragimov2017] developed a landmark-assisted deformable model for CT images that combines the advantages of landmark detection and deformable models with Laplacian shape-editing into a supervised multi-energy segmentation framework. Lim et al. [Lim2013] integrated an edge-mounted Willmore flow with prior shape energies into a level-set framework for the segmentation of spinal vertebrae from CT images. Kadoury et al. [Kadoury2011, Kadoury2013] used an articulated deformable model for spine segmentation in CT, where manifold embeddings are used to infer constellations accounting for deformations, and higher-order Markov random fields are used to infer articulated objects directly from low-dimensional parameters. Glocker et al. [Glocker2012, Glocker2013] developed a method based on regression forests for the rough detection of vertebrae and probabilistic graphical models for accurate localization and identification of individual vertebrae in CT. For a comprehensive comparative study of vertebral segmentation in CT, see [Yao2016] and references therein. With regard to segmentation from X-ray images, Al Arif et al. [AlArif2018] developed a deep-learning-based fully automatic framework for segmentation of cervical vertebrae.

Difficulties in directly applying existing segmentation methods to lumbar spine X-rays include the multiple overlapping shadows of the ribs and pelvis, relatively weak contrast, and the need to identify the five lumbar vertebrae individually. Accordingly, it is necessary to use the local and global characteristics of lumbosacral spine X-rays and consider factors such as the position of the sacrum and the curve of the vertebral column.

This paper proposes a fully automatic method for lumbar vertebral segmentation from X-ray images that combines deep-learning techniques and level-set methods. The proposed method comprises four main steps, as follows.

1. Pre-processing of the X-ray images by adaptive histogram equalization, which is used for adaptive contrast enhancement.

2. Pose-driven learning to identify each of the five lumbar vertebrae.

3. After their positions are known, M-net-based segmentation of the individual lumbar vertebrae.

4. Subsequent fine-tuning of the segmentation by combining the deep-learning segmentation results from the previous steps with the level-set method.

The performance of the proposed method was validated on clinical X-ray images from 80 normal and 80 abnormal subjects. The experimental results show that the proposed method provides reasonable performance for localization and segmentation of the lumbar vertebrae. We achieved a center position detection error of $25.35\pm 10.86$ pixels and a Dice similarity metric of $91.60\pm 2.22\%$ for segmentation of the five lumbar vertebrae.

2 Methods

This section proposes a fully automated method for segmentation of the five lumbar vertebrae from X-ray images. Segmentation from X-ray images is complicated by the overlapping shadows of other three-dimensional thoracoabdominal structures. In addition, lateral-view X-ray images usually include thoracic vertebrae, which are adjacent to one another and similar in shape to the lumbar vertebrae.

To address these problems, a three-part hierarchical method is adopted that mimics the steps in the clinician’s decision process: spine localization, segmentation of the lumbar vertebrae, and fine-tuning of the segmentation. The overall process is shown in Fig. 1. A landmark detection method is used to identify the centers of the lumbar vertebrae. From these center positions, we extract bounding boxes corresponding to the five individual lumbar vertebrae. Deep-learning-based segmentation is then applied to the extracted patches of the individual vertebral levels, and the level-set method is subsequently used to improve the quality of the segmentation.

2.1 Pre-processing

Given that X-ray images have a narrow intensity distribution, an adaptive histogram equalization method [Pizer1987] is employed in pre-processing to increase the contrast of the images by spreading the intensity values. Then, Gaussian filtering is applied to the contrast-enhanced images to alleviate the background noise. See Fig. 2 for images illustrating these preprocessing steps.

Throughout this paper, $I_{o}(\mathbf{x})\in\mathbb{R}^{3072\times 1536}$ represents an image obtained by applying the preprocessing procedure, where $\mathbf{x}=(x_{1},x_{2})$ denotes the pixel position. Each image $I_{o}$ is resized to $512\times 256$ pixels, and the resized image, denoted by $I(\mathbf{x})$, is used as the input to the pose-estimation method described in the next section.

2.2 Localization of the five lumbar vertebra

This section describes how to automatically find the center positions of the five lumbar vertebrae (lower back) between the rib cage and the pelvis, which are denoted L1 to L5, starting at the top. Selective detection of the five lumbar vertebrae is a difficult task because it requires observing neighboring structures such as the sacrum and the whole spine. To solve this problem, we employ a deep-learning-based pose estimation method [Toshev2014, Wei2016, Cao2017], which is widely used in computer vision for detecting human joints.

The pose estimation aims to predict the pose of the five lumbar vertebrae in the image $I$, where the output pose is expressed by ${\bf P}=({\bf p}_{1},\cdots,{\bf p}_{5})\in{\mathbb{R}}^{2\times 5}$, the set of center positions of the five lumbar vertebrae. To achieve this, we use a two-stage neural network, namely Pose-net, denoted by the functions $f_{{\mbox{\tiny{L5}}}}$ and $f_{{\mbox{\tiny{L1-5}}}}$, to generate two confidence maps: the first confidence map $y_{{\mbox{\tiny{L5}}}}=f_{{\mbox{\tiny{L5}}}}(I)\in\mathbb{R}^{512\times 256}$ provides the belief of the center of the L5 vertebra, and the second confidence map $y_{\mbox{\tiny{L1-5}}}=f_{\mbox{\tiny{L1-5}}}(I_{*},y_{{\mbox{\tiny{L5}}}})\in\mathbb{R}^{512\times 256}$ provides the belief of ${\bf P}$ by taking advantage of the first map $y_{{\mbox{\tiny{L5}}}}$. Here, $I_{*}$ is an intermediate feature layer of $f_{\mbox{\tiny{L5}}}$ with input $I$, as shown in Fig. 3. We adopt convolutional neural networks (CNNs) to learn the functions $f_{\mbox{\tiny{L5}}}$ and $f_{\mbox{\tiny{L1-5}}}$. The input of $f_{\mbox{\tiny{L5}}}$ is $I$, and $f_{\mbox{\tiny{L5}}}(I)$ is expressed as

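$$f_{\mbox{\tiny{L5}}}(I)=W^{l}\circledast\eta\left(\cdots\mathcal{P}\left(\eta\left(W^{2}\circledast\left(\eta\left(W^{1}\circledast I\right)\right)\right)\right)\right) \tag{1}$$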

where $W\circledast{\mathbf{h}}$ is the convolution of ${\mathbf{h}}$ with the weight $W$; $\mathcal{P}$ is max pooling; and $\eta$ is the rectified linear unit (ReLU) activation function. The input of $f_{\mbox{\tiny{L1-5}}}$ is the concatenation of $y_{\mbox{\tiny{L5}}}$ and $I_{*}$. Fig. 3 shows the architecture of $f_{\mbox{\tiny{L1-5}}}$.

These networks $f_{\mbox{\tiny{L5}}}$ and $f_{\mbox{\tiny{L1-5}}}$ are learned simultaneously, using the training data $\mathcal{S}_{\mbox{\tiny pose}}:=\{I^{(n)},y_{\mbox{\tiny{L5}}}^{(n)},y_{\mbox{\tiny{L1-5}}}^{(n)}\}_{n=1}^{N}$ . The loss function is given by

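$$\mathcal{L}(\theta_{\mbox{\tiny{pose}}})=\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}^{(n)}(\theta_{\mbox{\tiny{pose}}}) \tag{2}$$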

where $\theta_{\mbox{\tiny{pose}}}$ is a set of all parameters in the network and $\mathcal{L}^{(n)}(\theta_{\mbox{\tiny{pose}}})$ is the sum of the intermediate loss and the final loss:

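$$\mathcal{L}^{(n)}(\theta_{\mbox{\tiny{pose}}})=\left\|f_{\mbox{\tiny{L5}}}(I^{(n)})-y_{\mbox{\tiny{L5}}}^{(n)}\right\|_{2}^{2}+\left\|f_{\mbox{\tiny{L1-5}}}\left(I_{*}^{(n)},f_{\mbox{\tiny{L5}}}(I^{(n)})\right)-y_{\mbox{\tiny{L1-5}}}^{(n)}\right\|_{2}^{2}. \tag{3}$$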

Here, the labeled data $y^{(n)}_{\mbox{\tiny{L5}}}$ is given by

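$$y_{\mbox{\tiny{L5}}}^{(n)}(\mathbf{x})=\exp\left(-\frac{\|\mathbf{x}-{\bf p}_{5}\|_{2}^{2}}{\sigma^{2}}\right) \tag{4}$$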

where ${\bf p}_{5}$ is the ground-truth center position of the L5 vertebra and $\sigma^{2}$ is given by 1/4 of the L5 vertebra height in the image $I^{(n)}$. The others ($y_{\mbox{\tiny{L1}}},\cdots,y_{\mbox{\tiny{L4}}}$) are given in the same way. Then the confidence map $y_{\mbox{\tiny{L1-5}}}$ can be obtained by

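$$y_{\mbox{\tiny{L1-5}}}(\mathbf{x})=\max\left\{y_{\mbox{\tiny{L1}}}(\mathbf{x}),\cdots,y_{\mbox{\tiny{L5}}}(\mathbf{x})\right\}. \tag{5}$$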

Note that the functions $f_{\mbox{\tiny{L5}}}$ and $f_{\mbox{\tiny{L1-5}}}$ are determined by minimizing the loss function in (2) using the training data. Hence, given a test image $I$, Pose-net provides the confidence map $y_{\mbox{\tiny{L1-5}}}(\mathbf{x})=f_{\mbox{\tiny{L1-5}}}({I_{*}},f_{\mbox{\tiny{L5}}}(I))$, as shown in Fig. 4 (a).
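The ground-truth confidence maps used as training labels can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: positions are taken as (row, column) pairs, the function names `gaussian_confidence` and `combined_confidence` are hypothetical, and the value passed as `sigma2` is left to the caller (the text sets $\sigma^{2}$ from the vertebra height).

```python
import numpy as np


def gaussian_confidence(shape, center, sigma2):
    """Gaussian label exp(-||x - p||^2 / sigma^2) on a pixel grid; center = (row, col)."""
    rows, cols = np.indices(shape)
    dist2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-dist2 / sigma2)


def combined_confidence(shape, centers, sigma2s):
    """Pixel-wise maximum over the per-vertebra Gaussian labels."""
    maps = [gaussian_confidence(shape, c, s) for c, s in zip(centers, sigma2s)]
    return np.max(maps, axis=0)


# Hypothetical example with two centers on a small grid.
label = combined_confidence((64, 32), [(10, 8), (40, 20)], [9.0, 9.0])
```

Taking the maximum rather than the sum keeps every peak at exactly 1, so neighboring vertebrae do not inflate each other's confidence.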

Now, we are ready to explain our method for determining the center positions ${\bf P}=({\bf p}_{1},\cdots,{\bf p}_{5})\in{\mathbb{R}}^{2\times 5}$ of the five lumbar vertebrae from this confidence map $y_{\mbox{\tiny{L1-5}}}(\mathbf{x})$. We first apply Otsu’s thresholding [Otsu1979] to the confidence map $y_{\mbox{\tiny{L1-5}}}(\mathbf{x})$ in Fig. 4 (a) to remove small local perturbations, so that local maxima distant from the spine are filtered out. See Fig. 4 (c), which shows six local maxima in the image. These local maximum points are the candidates for the center positions. Next, we need to select five points ${\bf P}=({\bf p}_{1},\cdots,{\bf p}_{5})\in{\mathbb{R}}^{2\times 5}$ from the candidates. To do this, we compute a score for each candidate by averaging the pixel values of $y_{\mbox{\tiny{L1-5}}}(\mathbf{x})$ in a window of size $31\times 31$ centered at the local maximum point. We exclude the candidates whose scores are less than half of the mean score. Finally, we select five candidates starting from the bottom candidate, as shown in Fig. 4 (d).
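The scoring and selection steps of this procedure can be sketched as follows. The sketch assumes that the candidate local maxima have already been extracted (the thresholding and peak detection are not shown), that positions are (row, column) pairs with the row index increasing toward the bottom of the image, and that windows are clipped at the image border. The function names are hypothetical.

```python
import numpy as np


def window_score(conf, point, half=15):
    """Mean confidence in a 31x31 window centered at point (clipped at the borders)."""
    r, c = point
    r0, r1 = max(r - half, 0), min(r + half + 1, conf.shape[0])
    c0, c1 = max(c - half, 0), min(c + half + 1, conf.shape[1])
    return float(conf[r0:r1, c0:c1].mean())


def select_centers(conf, candidates, n_keep=5):
    """Drop weak candidates, then keep the n_keep lowest ones, returned top to bottom."""
    scores = [window_score(conf, p) for p in candidates]
    cutoff = 0.5 * np.mean(scores)  # half of the mean score
    kept = [p for p, s in zip(candidates, scores) if s >= cutoff]
    kept.sort(key=lambda p: p[0], reverse=True)  # bottom candidate first
    return sorted(kept[:n_keep], key=lambda p: p[0])  # order L1 ... L5
```

Starting from the bottom means that, when a sixth strong candidate is present, the topmost one is the one dropped, which fits the description that spurious detections tend to come from the thoracic spine.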

2.3 Deep learning-based segmentation of lumbar vertebra

Our segmentation method takes advantage of the knowledge of the center positions ${\bf P}$ of the five vertebrae (obtained from the Pose-net explained in the previous section) to greatly reduce the area over which segmentation is performed. The segmentation is performed in the original high-resolution image $I_{o}\in\mathbb{R}^{3072\times 1536}$ instead of the resized low-resolution image $I\in\mathbb{R}^{512\times 256}$. Given ${\bf P}$ in the resized image $I$, it is easy to calculate the corresponding vector ${\bf P}_{o}=({\bf p}_{o,1},\cdots,{\bf p}_{o,5})\in{\mathbb{R}}^{2\times 5}$ representing the center positions of the five vertebrae in the original image $I_{o}$.
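As a minimal sketch of this coordinate conversion, the centers in $I$ can be mapped to $I_{o}$ by rescaling each coordinate. The sketch assumes (row, column) pairs and a plain rescaling with no pixel-center offset; the paper does not state the convention, and the function name `to_original_coords` is hypothetical.

```python
def to_original_coords(centers, small_shape=(512, 256), orig_shape=(3072, 1536)):
    """Map (row, col) centers from the resized image I to the original image I_o.

    Assumes a plain rescaling; the half-pixel convention is not given in the paper.
    """
    scale_r = orig_shape[0] / small_shape[0]  # 3072 / 512 = 6
    scale_c = orig_shape[1] / small_shape[1]  # 1536 / 256 = 6
    return [(r * scale_r, c * scale_c) for r, c in centers]


# Example: a center at (100, 120) in I maps to (600, 720) in I_o.
P_o = to_original_coords([(100.0, 120.0)])
```

Because both axes are scaled by the same factor of 6, a localization error of one pixel in $I$ corresponds to six pixels in $I_{o}$.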

The proposed segmentation method takes as input the content of a bounding box around each lumbar vertebra, as shown in Fig. 5. The five bounding boxes are centered at ${\bf P}_{o}$ and have equal height $h$ and width $w$:

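$$h=w=\begin{cases}3\left|({\bf p}_{o,j}-{\bf p}_{o,j+1})_{y}\right|, & \text{for } j=1\\ \frac{3}{2}\left(\left|({\bf p}_{o,j-1}-{\bf p}_{o,j})_{y}\right|+\left|({\bf p}_{o,j}-{\bf p}_{o,j+1})_{y}\right|\right), & \text{for } j=2,3,4\\ 3\left|({\bf p}_{o,j-1}-{\bf p}_{o,j})_{y}\right|, & \text{for } j=5\end{cases} \tag{6}$$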

where the subscript $y$ stands for the vertical component of the corresponding vector. See Fig. 5 for a description of the bounding box.
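The box-size rule can be illustrated with a short sketch: the side length is three times the vertical spacing to the single neighbor for L1 and L5, and three times the average spacing to both neighbors for L2 to L4. The function name `box_size` and the example coordinates are hypothetical.

```python
def box_size(centers_y, j):
    """Side length h = w of the bounding box for vertebra j (1-based: L1 ... L5).

    centers_y holds the vertical components of the five centers in I_o, top to bottom.
    """
    y = centers_y
    if j == 1:
        return 3 * abs(y[0] - y[1])
    if j == 5:
        return 3 * abs(y[3] - y[4])
    if j in (2, 3, 4):
        return 1.5 * (abs(y[j - 2] - y[j - 1]) + abs(y[j - 1] - y[j]))
    raise ValueError("j must be in 1..5")


# Hypothetical vertical center positions (pixels) for L1 ... L5.
ys = [600, 900, 1210, 1530, 1840]
sizes = [box_size(ys, j) for j in range(1, 6)]  # [900, 915.0, 945.0, 945.0, 930]
```

Tying the box size to the spacing between neighboring centers lets the patch scale with the patient's anatomy instead of using a fixed crop.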

We use the M-net [Fu2018] to learn the segmentation map $f_{\mbox{\tiny seg}}:{I_{p}}\mapsto y_{\mbox{\tiny seg}}$, where $I_{p}$ denotes the content of a bounding box (resized to $224\times 224$) and $y_{\mbox{\tiny seg}}$ is a binary image representing the vertebra segmentation corresponding to the patch $I_{p}$. The M-net is based on U-net [Ronneberger2015] and extends it with two major components: (1) a multi-scale layer used to construct an image pyramid input and (2) a multi-label loss function with side-output layers to learn local and global information at the same time. Here, the multi-scale input integrates receptive fields at multiple levels [Fu2018], and the multi-label loss in (8) can deal with the vanishing gradient problem by replenishing back-propagated gradients [Wei2016].

Fig. 6 shows the M-net structure, and $f_{\mbox{\tiny seg}}$ is expressed as

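$$f_{\mbox{\tiny seg}}(I_{p};\theta_{\mbox{\tiny{seg}}})=\frac{1}{4}\sum_{i=1}^{4}f_{\mbox{\tiny seg},i}(I_{p};\theta_{\mbox{\tiny{seg}},i}) \tag{7}$$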

where $f_{\mbox{\tiny seg},i}$ is the function producing the $i$-th side output. Here, $\theta_{\mbox{\tiny{seg}}}=(\theta_{\mbox{\tiny{seg}},1},\cdots,\theta_{\mbox{\tiny{seg}},4})$ is the set of parameters related to $f_{\mbox{\tiny seg},i}$ for $i=1,\cdots,4$. The M-net $f_{\mbox{\tiny{seg}}}$ is learned using the training data $\mathcal{S}_{\mbox{\tiny seg}}:=\{I_{p}^{(n)},y_{\mbox{\tiny{seg}}}^{(n)}\}_{n=1}^{N}$. The multi-label loss function is given by

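$$\mathcal{L}(\theta_{\mbox{\tiny{seg}}})=\frac{1}{4N}\sum_{n=1}^{N}\sum_{i=1}^{4}\left[-\frac{1}{\aleph_{\tiny{\mbox{pixel}}}}\left\langle y_{\mbox{\tiny{seg}}}^{(n)},{\mathbb{L}}\left(f_{\mbox{\tiny seg},i}(I_{p}^{(n)};\theta_{\mbox{\tiny{seg}},i})\right)\right\rangle\right] \tag{8}$$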

where $\aleph_{\tiny{\mbox{pixel}}}$ denotes the number of pixels of the input image, $\langle\cdot,\cdot\rangle$ denotes an inner product, and ${\mathbb{L}}(\cdot)$ denotes an element-wise log operation. The segmentation function $f_{\mbox{\tiny seg}}$ is obtained by minimizing the loss in (8).
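For a single training sample, the averaging of the four side outputs and the multi-label loss can be sketched with NumPy as below. Several points are assumptions made for illustration and are not stated in the paper: the label is stored one-hot with shape (H, W, 2) for background and vertebra, each side output is a per-pixel probability map of the same shape, and a small `eps` is added inside the logarithm for numerical safety. The function names are hypothetical.

```python
import numpy as np


def average_side_outputs(side_outputs):
    """Final prediction as the mean of the side-output probability maps."""
    return np.mean(side_outputs, axis=0)


def multilabel_loss(y_true, side_outputs, eps=1e-12):
    """Per-sample loss: mean over side outputs of -(1 / n_pixel) * <y, log f_i>.

    y_true: one-hot label of shape (H, W, C) (assumed layout).
    side_outputs: list of probability maps with the same shape.
    """
    n_pixel = y_true.shape[0] * y_true.shape[1]
    losses = [-np.sum(y_true * np.log(f + eps)) / n_pixel for f in side_outputs]
    return float(np.mean(losses))


# Hypothetical 4x4 patch whose label is all background.
y = np.zeros((4, 4, 2))
y[..., 0] = 1.0
uninformed = [np.full((4, 4, 2), 0.5) for _ in range(4)]
loss = multilabel_loss(y, uninformed)  # close to log(2) for uniform predictions
```

Averaging this quantity over the $N$ training samples gives the full loss.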

2.4 Fine-tuning of segmentation

To fine-tune the segmentation, one may use the level-set method [Kass1987, Caselles1993, Malladi1995, Sussman1994, Osher2001] together with the previously obtained segmentation results. Given an X-ray image patch $I_{p}$ and the deep-learning-based segmentation $y_{\mbox{\tiny seg}}$, the following energy functional is used to provide a fine segmentation:

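$$\Phi(\varphi)=\int_{\Omega}g(\mathbf{x})\,\delta(\varphi(\mathbf{x}))\,|\nabla\varphi(\mathbf{x})|\,d\mathbf{x}+\frac{1}{2}\int_{\Omega}\left(|\nabla\varphi(\mathbf{x})|-1\right)^{2}d\mathbf{x}+\int_{\Omega}g(\mathbf{x})\,H(-\varphi(\mathbf{x}))\,d\mathbf{x}+\lambda\int_{\Omega}y_{\mbox{\tiny seg}}(\mathbf{x})\,H(-\varphi(\mathbf{x}))\,d\mathbf{x} \tag{9}$$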

where $\varphi$ is a level set function, $H$ is the one-dimensional Heaviside function, $G_{\sigma}$ is a Gaussian kernel, $g(\mathbf{x})=\frac{1}{1+|\nabla G_{\sigma}\ast I_{p}(\mathbf{x})|^{2}}$ is an edge detector, $\delta$ is a regularized delta function, and $y_{\mbox{\tiny seg}}$ is the binary image obtained by M-net in the previous section. The segmented region $\{\mathbf{x}:\varphi(\mathbf{x})<0\}$ is obtained via a level set function $\varphi$ that minimizes the energy functional in (9). To compute a minimizer $\varphi$ of the energy functional $\Phi(\varphi)$, the following parabolic equation is solved until a steady state is reached:

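$$\frac{\partial}{\partial t}\varphi(\mathbf{x})=\delta(\varphi)\left[\left(\nabla g\cdot\frac{\nabla\varphi}{|\nabla\varphi|}+\nabla\cdot\left(\frac{\nabla\varphi}{|\nabla\varphi|}\right)g\right)|\nabla\varphi|\right]+\left[\nabla^{2}\varphi-\nabla\cdot\left(\frac{\nabla\varphi}{|\nabla\varphi|}\right)\right]+g\,\delta(\varphi)+\lambda\,y_{\mbox{\tiny seg}}\,\delta(\varphi). \tag{10}$$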

In the last term of (9), the Chan-Vese method [Chan2001] is applied to the binary image $y_{\mbox{\tiny seg}}$, which serves both as the initial segmentation and as a strong fidelity term for the target segmentation. The key role of the last term is to keep the zero level set of a minimizer $\varphi$ of (9) very close to the edge of $y_{\mbox{\tiny seg}}$. The first three terms of (9) follow [Li2010]: a distance regularization term and external energies are used to push the contour toward the target area. See Fig. 7 for the fine-tuning of segmentation. Other level-set methods may also be used for fine-tuning.
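The edge detector $g$ that weights the level-set energy can be sketched with NumPy alone. This is only an illustration of $g=1/(1+|\nabla G_{\sigma}\ast I_{p}|^{2})$: the separable blur with edge replication, the kernel truncation at about three standard deviations, the default `sigma=1.5`, and the function names are all assumptions, and the level-set evolution itself is not shown.

```python
import numpy as np


def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian kernel truncated at about three standard deviations."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def smooth(image, sigma):
    """Separable Gaussian blur with edge replication (stand-in for G_sigma * I_p)."""
    k = gaussian_kernel1d(sigma)
    pad = len(k) // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, tmp)


def edge_indicator(image, sigma=1.5):
    """g = 1 / (1 + |grad(G_sigma * I)|^2): close to 1 in flat regions, small at edges."""
    gy, gx = np.gradient(smooth(image, sigma))
    return 1.0 / (1.0 + gx ** 2 + gy ** 2)
```

Since $g$ is small where the smoothed image changes rapidly, the terms weighted by $g$ contribute little energy along strong edges, which is what draws the contour toward them.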

3 Experiments and Results

In these experiments, Python with TensorFlow was used to implement the deep-learning framework and MATLAB was used for data processing. All processing was performed on a workstation equipped with two Intel(R) Xeon(R) E5-2630 v4 @ 2.20GHz CPUs, 128GB DDR4 RAM, and four NVIDIA GeForce GTX 1080 Ti 11GB GPUs.

3.1 Data

The training data are 637 Digital Radiography (DR) X-ray images, and the test data are 160 X-ray images consisting of DR and Computed Radiography (CR) X-ray images. In the training process, we split the training data into 537 and 100 images for training and validation, respectively. The X-ray images were approximately $3000\times 1500$ pixels in size, and we resized them to $3072\times 1536$. We first manually labeled the center positions, denoted by yellow points in Fig. 8 (a). The segmentation label of each lumbar vertebra (Fig. 8 (c)) was obtained by plotting 8 red points (Fig. 8 (b)). The segmentation label of the five lumbar vertebrae is shown in Fig. 8 (d).

The data processing and augmentation methods for the training data are as follows:

1. For the training data of Pose-net, we resized all images to $512\times 256$. Then, using (4) and (5), we computed the ground-truth confidence maps from the ground-truth center positions ${\bf P}$.

2. For the training data of M-net, we first extracted patches of the input image and the segmentation label using the center positions ${\bf P}_{o}$. We then resized all extracted patches to $224\times 224$.

3. In the patch extraction process, random noise was added to ${\bf P}_{o}$ to reflect the center position errors that occur during the test stage.

4. For data augmentation, we applied random contrast adjustment, random cropping, and random rotation within the range $-15^{\circ}$ to $15^{\circ}$.
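The center-position noise and the rotation sampling described in the list above can be sketched as follows. The paper does not specify the noise distribution, its scale, or how the rotation angle is drawn, so the zero-mean Gaussian noise, the default `noise_std=25.0` pixels, and the uniform angle are assumptions chosen for illustration; the function names are hypothetical.

```python
import random


def jitter_centers(centers, noise_std=25.0, rng=None):
    """Add zero-mean Gaussian noise (assumed form and scale, in pixels) to (row, col) centers."""
    rng = rng or random.Random()
    return [(r + rng.gauss(0.0, noise_std), c + rng.gauss(0.0, noise_std))
            for r, c in centers]


def sample_rotation_angle(rng=None):
    """Draw a rotation angle in degrees from [-15, 15] (uniform sampling is an assumption)."""
    rng = rng or random.Random()
    return rng.uniform(-15.0, 15.0)


# Seeded generator so that the augmentation is reproducible.
rng = random.Random(0)
noisy = jitter_centers([(600.0, 720.0), (900.0, 760.0)], rng=rng)
angle = sample_rotation_angle(rng)
```

Perturbing the centers during training exposes the segmentation network to the off-center patches it will receive at test time, when the centers come from Pose-net rather than from manual labels.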

3.2 Training and validation of the proposed networks

The proposed networks are trained by minimizing the loss functions in (2) and (8) using the Adam method [Kingma2014]. The batch size was set to 4 in consideration of our computational resources. We used batch normalization [Ioffe2015], which allows a higher learning rate and thus a relatively short training time. We set the learning rate to $10^{-3}$. For the initial 50 epochs, we used a warm-up learning rate [He2015, Baumgartner2017] of $10^{-5}$ to prevent a rapid decrease of the training loss.
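The learning-rate schedule described above can be written as a small function. The sketch assumes a constant warm-up value followed by an abrupt switch to the base rate, with epochs counted from zero; the paper does not say whether the warm-up ramps gradually, and the function name is hypothetical.

```python
def learning_rate(epoch, warmup_epochs=50, warmup_lr=1e-5, base_lr=1e-3):
    """Learning rate for a 0-based epoch index: warm-up value first, then the base rate."""
    return warmup_lr if epoch < warmup_epochs else base_lr


# Rates at the start of training, the last warm-up epoch, and the first regular epoch.
schedule = [learning_rate(e) for e in (0, 49, 50)]  # [1e-05, 1e-05, 0.001]
```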

We trained the Pose-net and the M-net for 700 and 400 epochs, respectively. Training was stopped when the validation loss stopped decreasing. Fig. 9 shows the training and validation loss for Pose-net and M-net.

3.3 Results and Quantitative evaluations

3.3.1 Center position detection results

For the quantitative evaluation of the Pose-net, we used the distance error between the output of the proposed method and the ground-truth center positions in pixel space ($3072\times 1536$). The error was computed only for the cases in which all five lumbar vertebrae were detected correctly. The success rate was $96.25\%$ on the 160 test images. Failure cases, in which fewer than five center positions were predicted or a wrong part (including the T-spine and background) was detected, were excluded from the center position evaluation. The distance errors in pixel space are shown in Table 1.
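The distance error and its summary statistics can be sketched with the Python standard library. The coordinates below are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not state which estimator was used. The last line only restates the reported success rate as a count of test images.

```python
import math
from statistics import mean, pstdev


def center_errors(predicted, ground_truth):
    """Euclidean distance (in pixels) between matching predicted and true centers."""
    return [math.dist(p, t) for p, t in zip(predicted, ground_truth)]


def summarize(errors):
    """Mean and population standard deviation of the errors (estimator is an assumption)."""
    return mean(errors), pstdev(errors)


# Hypothetical predictions and ground truth in the 3072 x 1536 pixel space.
pred = [(603.0, 724.0), (900.0, 760.0)]
true = [(600.0, 720.0), (906.0, 752.0)]
errors = center_errors(pred, true)   # distances of 5 and 10 pixels
n_success = round(0.9625 * 160)      # 154 of the 160 test images
```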

We then visualized the cumulative distribution of the center position error for each of the five lumbar vertebrae and for all lumbar vertebrae together in Fig. 10 (a). From the cumulative distribution, it can be observed that most center position errors are within 50 pixels. Fig. 10 (b) shows the boxplot of the center position detection error.

For the qualitative evaluation of the pose-estimation network, we visualized the confidence map and the output center positions in Fig. 13 (b).

3.3.2 Lumbar vertebra segmentation results

Fig. 11 shows the segmentation results using M-net with Pose-net and fine-tuning segmentation using level set.

For the evaluation of the proposed segmentation method, we used region-based metrics [Udupa2006], including the Dice similarity metric, precision, sensitivity, and specificity. The Dice similarity metric ($d_{D}$) and the precision ($d_{P}$) between $O_{\mbox{GT}}$ and $O_{\mbox{SG}}$ are defined as follows:

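$$d_{D}:=\frac{2\,|O_{\mbox{GT}}\cap O_{\mbox{SG}}|}{|O_{\mbox{GT}}|+|O_{\mbox{SG}}|},\qquad d_{P}:=\frac{|O_{\mbox{GT}}\cap O_{\mbox{SG}}|}{|O_{\mbox{GT}}\cup O_{\mbox{SG}}|} \tag{11}$$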

where $O_{\mbox{SG}}$ is the lumbar vertebra region obtained from segmentation and $O_{\mbox{GT}}$ is the ground-truth segmentation. Here, $d_{D}$ describes how closely the ground-truth and detected regions overlap with each other. The sensitivity ($d_{Sen}$) and the specificity ($d_{Spe}$) are defined as

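$$d_{Sen}:=\frac{|O_{\mbox{GT}}\cap O_{\mbox{SG}}|}{|O_{\mbox{GT}}|},\qquad d_{Spe}:=\frac{|(O_{\mbox{GT}}\cup O_{\mbox{SG}})^{c}|}{|(O_{\mbox{GT}})^{c}|}. \tag{12}$$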
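The four region-based metrics can be computed from two binary masks as sketched below, with $d_{D}$ as twice the intersection over the sum of the two region sizes, $d_{P}$ as intersection over union, $d_{Sen}$ as intersection over the ground-truth size, and $d_{Spe}$ as the size of the complement of the union over the size of the complement of the ground truth. The function name `region_metrics` and the toy masks are hypothetical, and the sketch assumes non-empty masks and a non-empty background.

```python
import numpy as np


def region_metrics(gt, sg):
    """Return (d_D, d_P, d_Sen, d_Spe) for a ground-truth mask gt and a segmentation sg."""
    gt = np.asarray(gt, dtype=bool)
    sg = np.asarray(sg, dtype=bool)
    inter = np.logical_and(gt, sg).sum()
    union = np.logical_or(gt, sg).sum()
    dice = 2.0 * inter / (gt.sum() + sg.sum())
    d_p = inter / union
    sens = inter / gt.sum()
    spec = np.logical_not(np.logical_or(gt, sg)).sum() / np.logical_not(gt).sum()
    return dice, d_p, sens, spec


# Toy 4x4 example: two 2x2 blocks that overlap in one column.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:2] = True
sg = np.zeros((4, 4), dtype=bool)
sg[0:2, 1:3] = True
dice, d_p, sens, spec = region_metrics(gt, sg)  # 0.5, 1/3, 0.5, 5/6
```

With this definition, $d_{P}$ has the form of the Jaccard index, so for a single image $d_{P}=d_{D}/(2-d_{D})$; in the toy example, $0.5/1.5=1/3$.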

We compared the proposed multi-step method with existing deep-learning segmentation methods. For comparison, we used an M-net that takes an image $I\in{\mathbb{R}}^{512\times 256}$ as input and produces an output of size $512\times 256$ for segmentation of the five lumbar vertebrae. We refer to this M-net as the original M-net to distinguish it from the M-net used in the proposed method. Note that the M-net used in the proposed method takes as input the extracted patch $I_{p}$ to segment the individual lumbar vertebrae. We also used U-net [Ronneberger2015] both within the proposed method and as an existing method, namely the original U-net. The evaluation results are reported in Table 2. Here, Pose-net+M-net+Level set denotes the segmentation with fine-tuning and Pose-net+M-net denotes the segmentation without fine-tuning.

From these results, we can see that the proposed method achieves an improved Dice similarity metric by combining the deep-learning method and the level-set method (from $90.38\%$ to $91.60\%$ for M-net in Table 2). The level-set method combined with deep-learning segmentation takes advantage of the clear edges at the anterior wall and the upper and lower plates of the vertebral body; therefore, the Dice similarity metric increased through a reduction in false positives. However, the posterior wall of the vertebral body has an unclear boundary because the two pedicles overlap in the lateral view of the lumbar X-ray image, so it is difficult to capture the boundary of the posterior wall using the level set. This causes the level-set method to segment a region inside the vertebral body, resulting in a decrease in sensitivity (from $92.74\%$ to $90.13\%$). We expect that the level-set method can be improved to achieve more accurate segmentation, but this is beyond the scope of this paper.

Fig. 12 shows the comparison results for two cases. From the red box in Fig. 12, we can observe that the original M-net failed to segment the L5 vertebra, whereas the proposed method can segment it by taking advantage of accurate localization of the spine. This accurate localization can also prevent segmentation of the thoracic spine; see the blue boxes in Fig. 12 (c) and (d).

The results of the entire process for six selected subjects are shown in Fig. 13.

4 Discussion and Conclusion

The main contribution of the proposed method is that it achieves (i) accurate and robust identification of each lumbar vertebra using a pose-driven deep-learning technique, and (ii) fine segmentation of the individual vertebrae using a hierarchical method that combines M-net and level-set methods.

Lumbar compression fractures are becoming increasingly prevalent in Korea as the incidence of osteoporosis increases with the aging population. Compression fracture is the most common fracture in osteoporosis patients. In Korea, the medical imaging burden associated with aging is increasing rapidly, and the growth in the number of radiologists is not keeping pace. As a result, it is becoming more difficult for radiologists to read images quickly and accurately. In particular, a missed or delayed imaging diagnosis of a spinal compression fracture can lead to complications such as height reduction and scoliosis. Therefore, automatic vertebral segmentation could play an important role in improving physicians’ workflow by enabling quick and accurate image-based diagnosis.

Automatic vertebral segmentation may lead to follow-up studies on the automatic grading of compression fractures in place of the existing semiquantitative grading system (Genant grade). Such an automatic quantitative grading method would result in a clear and reproducible definition of compression fracture. In addition to compression fractures in the lumbar vertebrae, automatic vertebral segmentation can also enable research on other diseases such as degenerative changes (including disc space narrowing and degenerative spondylolisthesis, as shown in Fig. 14) and traumatic conditions such as burst fracture. It is also believed that studies of various diseases of the spine (bone tumors such as metastasis, infectious diseases such as pyogenic spondylitis, autoimmune diseases such as ankylosing spondylitis, etc.) could be possible if the method were extended to the cervicothoracic spine, sacrum, and coccyx.

Acknowledgement

This work was supported by Samsung Science & Technology Foundation (No. SSTF-BA1402-01). K.C.K. was supported by NRF grant 2017R1E1A1A03070653.

References

[1] Al Arif SMMR, Knapp K, Slabaugh G, Fully automatic cervical vertebra segmentation framework for X-ray images, *Computer Methods and Programs in Biomedicine*, vol. 157, pp. 95–111, 2018.
[2] Baumgartner C F, Kamnitsas K, Matthew J, Fletcher T P, Smith S, Koch L M, Kainz B and Rueckert D, SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand Ultrasound, *IEEE Trans. Med. Imaging*, vol. 36, no. 11, 2017.
[3] Cao Z, Simon T, Wei S and Sheikh Y, Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, *IEEE CVPR*, 2017.
[4] Caselles V, Catte F, Coll T and Dibos F, A geometric model for active contours in image processing, *Numerische Mathematik*, vol. 66, no. 1, pp. 1–31, 1993.
[5] Caselles V, Kimmel R and Sapiro G, Geodesic Active Contours, *IJCV*, vol. 22, no. 1, pp. 61–79, 1997.
[6] Chan T F and Vese L A, Active contours without edges, *IEEE TIP*, vol. 10, no. 2, pp. 266–277, 2001.
[7] Cootes T F, Taylor C J, Cooper D H and Graham J, Active shape models - their training and application, *CVIU*, vol. 61, no. 1, pp. 38–59, 1995.
[8] Davatzikos C, Liu D, Shen D and Herskovits E H, Spatial normalization of spine MR images for statistical correlation of lesions with clinical symptoms, *Radiology*, vol. 224, no. 3, pp. 919–926, 2002.
[9] Fu H, Cheng J, Xu Y, Wong D W K, Liu J and Cao X, Joint Optic Disc and Cup Segmentation Based on Multi-Label Deep Network and Polar Transformation, *IEEE Trans. Med. Imaging*, vol. 37, no. 7, 2018.
[10] Glocker B, Feulner J, Criminisi A, Haynor D R and Konukoglu E, Automatic localization and identification of vertebra in arbitrary field-of-view CT scans, *MICCAI*, pp. 590–598, 2012.
[11] Glocker B, Zikic D, Konukoglu E, Haynor D R and Criminisi A, Vertebrae localization in pathological spine CT via dense classification from sparse annotations, *MICCAI*, pp. 262–270, 2013.
[12] He K, Zhang X, Ren S and Sun J, Deep residual learning for image recognition, *IEEE CVPR*, 2016.
[13] Heimann T et al, Comparison and evaluation of methods for liver segmentation from CT datasets, *IEEE Trans. Med. Imaging*, vol. 28, no. 8, pp. 1251–1265, 2009.
[14] Ibragimov B, Korez R, Likar B, Pernus F, Xing L and Vrtovec T, Segmentation of Pathological Structures by Landmark-Assisted Deformable Models, *IEEE Trans. Med. Imaging*, vol. 36, no. 7, 2017.
[15] Ibragimov B, Likar B, Pernus F and Vrtovec T, Shape representation for efficient landmark-based segmentation in 3-D, *IEEE Trans. Med. Imaging*, vol. 33, no. 4, pp. 861–874, 2014.
[16] Ioffe S and Szegedy C, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, *arXiv:1502.03167v3*, 2015.
[17] Kadoury S, Labelle H and Paragios N, Automatic inference of articulated spine models in CT images using high-order Markov random fields, *Med. Image Anal.*, vol. 15, no. 4, pp. 426–437, 2011.
[18] Kadoury S, Labelle H and Paragios N, Spine segmentation in medical images using manifold embeddings and higher-order MRFs, *IEEE Trans. Med. Imaging*, vol. 32, no. 7, pp. 1227–1238, 2013.
[19] Kass M, Witkin A and Terzopoulos D, Snakes: Active contour models, *IJCV*, vol. 1, no. 4, pp. 321–331, 1987.
[20] Kingma D P and Ba J, Adam: A Method for Stochastic Optimization, *arXiv:1412.6980*, 2014.
[21] Klinder T, Ostermann J, Ehm M, Franz A, Kneser R and Lorenz C, Automated model-based vertebra detection, identification, and segmentation in CT images, *Med. Image Anal.*, vol. 13, no. 3, pp. 471–482, 2009.
[22] Klinder T, Ostermann J, Ehm M, Franz A, Kneser R and Lorenz C, Automated model-based vertebra detection, identification, and segmentation in CT images, *Med. Image Anal.*, vol. 13, no. 3, pp. 471–482, 2009.
[23] Leventon M, Grimson W and Faugeras O, Statistical shape influence in geodesic active contours, *5th IEEE EMBS International Summer School on Biomedical Imaging*, vol. 1, pp. 316–323, 2000.
[24] Li C, Xu C, Gui C and Fox M D, Distance Regularized Level Set Evolution and Its Application to Image Segmentation, *IEEE TIP*, vol. 19, no. 12, pp. 3243–3254, 2010.
[25] Lim P, Bagci U and Bai L, Introducing Willmore flow into level set segmentation of spinal vertebra, *IEEE Trans. Biomed. Eng.*, vol. 60, no. 1, pp. 115–122, 2013.
[26] Malladi R, Sethian J A and Vemuri B C, Shape modeling with front propagation: A level set approach, *IEEE TPAMI*, vol. 17, no. 2, pp. 158–175, 1995.
[27] Osher S and Fedkiw R P, Level set methods: an overview and some recent results, *Journal of Computational Physics*, vol. 169, no. 2, pp. 463–502, 2001.
[28] Otsu N, A Threshold Selection Method from Gray-Level Histograms, *IEEE Trans. on SMC*, vol. SMC-9, no. 1, 1979.
[29] Pizer S M et al, Adaptive histogram equalization and its variations, *Computer Vision, Graphics, and Image Processing*, vol. 39, no. 3, pp. 355–368, 1987.
[30] Ronneberger O, Fischer P and Brox T, U-Net: Convolutional Networks for Biomedical Image Segmentation, *Proc. Med. Image Comput. Comput.-Assist. Intervention*, pp. 234–241, 2015.
[31] Sussman M, Smereka P and Osher S, A level set approach for computing solutions to incompressible two-phase flow, *Journal of Computational Physics*, vol. 114, no. 1, pp. 146–159, 1994.
[32] Toshev A and Szegedy C, DeepPose: Human Pose Estimation via Deep Neural Networks, *IEEE CVPR*, 2014.
[33] Udupa J K et al, A framework for evaluating image segmentation algorithms, *Comput. Med. Imaging Graph.*, vol. 30, no. 2, pp. 75–87, 2006.
[34] Wei S E, Ramakrishna V, Kanade T and Sheikh Y, Convolutional Pose Machines, *IEEE CVPR*, 2016.
[35] Yao J et al, A multi-center milestone study of clinical vertebral CT segmentation, *Computerized Medical Imaging and Graphics*, vol. 49, pp. 16–28, 2016.
