TL;DR
This paper presents an efficient deep learning framework using hourglass networks and transfer learning for accurate knee anatomical landmark localization in X-ray images, improving generalization across different OA stages.
Contribution
It introduces a novel hourglass network architecture with soft-argmax for landmark prediction and explores transfer learning from low-budget annotations to enhance accuracy.
Findings
Outperforms existing methods on independent datasets
Improves localization accuracy across OA severity stages
Demonstrates effective transfer learning benefits
| Setting | 1 mm | 1.5 mm | 2 mm | 2.5 mm | % out |
|---|---|---|---|---|---|
| AAM (IGO [40]) | | | | | |
| AAM (LBP [29]) | | | | | |
| CLM (IGO [40]) | | | | | |
| CLM (LBP [29]) | | | | | |
| L2 loss | | | | | |
| L1 loss | | | | | |
| Robust loss [5] | | | | | |
| Elastic loss | | | | | |
| Wing loss [15] | | | | | |
| Wing + regular res. block | | | | | |
| Wing + mixup | | | | | |
| Wing + mixup | | | | | |
| Wing + mixup | | | | | |
| Wing + mixup | | | | | |
| Wing + mixup (no wd) | | | | | |
| Wing + mixup (no wd) | | | | | |
| Wing + mixup (no wd) | | | | | |
| Wing + mixup (no wd) | | | | | |
| Wing + mixup (no wd, no dropout) | | | | | |
| Wing + mixup + jitter (no wd) | | | | | |
| Wing + mixup + cutout 5% (no wd) | | | | | |
| Wing + mixup + cutout 10% (no wd) | | | | | |
| Wing + mixup + cutout 25% (no wd) | | | | | |
| Wing + mixup + cutout 10% (no wd, finetune) | | | | | |
KNEEL: Knee Anatomical Landmark Localization Using Hourglass Networks
Aleksei Tiulpin
University of Oulu,
Oulu University Hospital
Iaroslav Melekhov
Aalto University
Simo Saarakkala
University of Oulu,
Oulu University Hospital
Abstract
This paper addresses the challenge of localizing anatomical landmarks in knee X-ray images at different stages of osteoarthritis (OA). Landmark localization can be viewed as a regression problem, where the landmark position is directly predicted from the region of interest or even from full-size images, which leads to a large memory footprint, especially for high-resolution medical images. In this work, we propose an efficient deep neural network framework with an hourglass architecture utilizing a soft-argmax layer to directly predict normalized coordinates of the landmark points. We provide an extensive evaluation of different regularization techniques and various loss functions to understand their influence on the localization performance. Furthermore, we introduce the concept of transfer learning from low-budget annotations, and experimentally demonstrate that such an approach improves the accuracy of landmark localization. In contrast to prior methods, we validate our model on two datasets that are independent of the training data and assess the performance of the method for different stages of OA severity. The proposed approach demonstrates better generalization performance than the current state-of-the-art.
1 Introduction
Anatomical landmark localization is a challenging problem that appears in many medical image analysis problems [31]. One particular realm where the localization of landmarks is of high importance is the analysis of knee plain radiographs at different stages of osteoarthritis (OA) – the most common joint disorder and highest disability factor in the world [2].
In the knee OA research field, as well as in other domains, two sub-tasks that form a typical pipeline for landmark localization can be defined: the region of interest (ROI) localization and the landmark localization itself [41]. In knee radiographs, the former is typically applied in the analysis of whole knee images [3, 4, 28, 36, 38], while the latter is used for bone shape and texture analyses [6, 19, 34]. Furthermore, Tiulpin et al. also used landmark localization for image standardization applied after the ROI localization step [36, 37].
Manual annotation of knee landmarks is not a trivial problem without knowledge of knee anatomy, and it becomes even more challenging as the severity of OA increases. In particular, it makes the annotation of fine-grained bone edges and tibial spines intractable and time-consuming. In Fig. 2, we show examples of landmark annotations for each stage of OA severity graded according to the gold-standard Kellgren-Lawrence system (grades 0 to 4) [20]. It can be seen from this figure that as the disease progresses, bone spurs (osteophytes) and the general bone deformity affect the appearance of the image. Other factors, such as the X-ray beam angle, are also known to affect the image appearance [22].
In this paper, we propose a novel Deep Learning based framework for localization of anatomical landmarks in knee plain radiographs and validate its generalization performance. First, we train a model to localize ROIs in a bilateral radiograph using low-cost labels, and subsequently, train a model on the localized ROIs to predict the location of anatomical landmarks in femur and tibia. Here, we utilize transfer learning and use the model weights from the first step of our pipeline for initialization of the second-stage model. The proposed approach is schematically illustrated in Fig. 1.
Our method is based on the hourglass convolutional network [27] that localizes the landmarks in a weakly-supervised manner and subsequently uses the soft-argmax layer to directly estimate the location of every landmark point. To summarize, the contributions of this study are the following:
- We leverage recent advances in landmark detection using hourglass networks and combine the best design choices in our method.
- For the first time, we propose to use MixUp [42] data augmentation principle for anatomical landmark localization and perform a thorough ablation study for the knee radiographs.
- We demonstrate an effective strategy of enhancing the performance of our landmark localization method by pre-training it on low-budget landmark annotations.
- We evaluate our method on two independent datasets and demonstrate better generalization ability of the proposed approach compared to the current state-of-the-art baseline.
- The pre-trained models, source code and the annotations performed for the Osteoarthritis Initiative (OAI) dataset are publicly available at https://github.com/MIPT-Oulu/KNEEL.
2 Related Work
In the literature, there exist only a few studies specifically focused on localization of landmarks in plain knee radiographs. Specifically, the current state-of-the-art was proposed by Lindner et al. [24, 25] and it is based on a combination of random forest regression voting (RFRV) with constrained local models (CLM) fitting.
There are several methods focusing solely on the ROI localization. Tiulpin et al. [39] proposed a novel anatomical proposal method to localize the knee joint area. Antony et al. [3] used fully convolutional networks for the same problem. Recently, Chen et al. [9] proposed to use object detection methods to measure the knee OA severity.
The proposed approach is related to the regression-based methods for keypoint localization [41]. We utilize an hourglass network, which is an encoder-decoder model initially introduced for human pose estimation [27], and address both ROI and landmark localization tasks. Several other studies in the medical imaging domain also leveraged a similar approach by applying U-Net [33] to the landmark localization problem [12, 31]. However, encoder-decoder networks are computationally heavy during the training phase since they regress a tensor of high-resolution heatmaps, which is challenging for medical images that are typically large. Notably, decreasing the image resolution could negatively impact the accuracy of landmark localization. In addition, most of the existing approaches use a refinement step, which makes the computational burden even harder to cope with. Nevertheless, hourglass CNNs are widely used in human pose estimation [27] because there the resolution can be lowered and precise ground truth is absent.
More similar to our approach, Honari et al. [18] recently leveraged deep learning and applied a soft-argmax layer to feature maps at full image resolution to improve landmark localization performance, leading to remarkable results. However, such a strategy is computationally heavy for medical images due to their high resolution. In contrast, we first moderately reduce the image resolution by embedding it into a feature space, utilize an hourglass module to process the obtained feature maps at all scales, and eventually apply the soft-argmax operator. This makes the proposed configuration more applicable to high-resolution images and yields sub-pixel accurate landmark coordinates.
3 Method
3.1 Network architecture
Overview.
Our model comprises several architectural components of modern hourglass-like encoder-decoder models for landmark localization. In particular, we utilize the hierarchical multi-scale parallel (HMP) residual block [7], which improves the gradient flow compared to the traditional bottleneck layer described in [17, 27]. The HMP block structure is illustrated in Fig. 3.
The architecture of the proposed model is represented in Fig. 4. In general, our model comprises three main components: entry block, hourglass block, and output block. The whole network is parameterized by two hyperparameters – width and depth , where the latter is related to the number of max-pooling steps in the hourglass block. In our experiments we found the width of and the depth of to be optimal to maintain both high accuracy and speed of computations.
Entry block.
Similar to the original hourglass model [27] we apply a convolution with stride and zero padding of and pass the results into a residual module. Further, we use a max-pooling and utilize three residual modules before the hourglass block. This block allows to simultaneously downscale the image times and obtain representative feature embeddings suitable for multi-scale processing performed in the hourglass block.
Hourglass block.
This block starts with a max-pooling and recursively repeats dual-path structure times as can be seen in Fig. 4. In particular, each level of the hourglass block starts with a max-pooling subsequently followed by HMP residual blocks. At the next stage, the representations from the current level are passed to the next hourglass’ level and also passed forward to be summed with the up-sampled outputs of the hourglass level . Since spatial resolution of the feature maps at level and is different, the nearest-neighbours up-sampling is used [27]. At level , we simply feed the representations into the HMP block instead of the next hourglass level due to the reached limit of hourglass’ depth.
Output block.
The final block of the model uses the representations coming from the hourglass module and sequentially applies two blocks of dropout () and convolutional block with batch normalization and ReLU. At the final stage, a convolution and soft-argmax [8] are utilized to regress the coordinates of each landmark point.
Soft-argmax.
Since soft-argmax is an important component of our model, we review its formulation in this paragraph. This operator can be defined as a sequence of two steps, where the first one calculates the spatial softmax for pixel :
[TABLE]
At the next stage, the obtained spatial softmax is multiplied by the expected value of landmark coordinate at every pixel:
[TABLE]
where
[TABLE]
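The two steps described above (a spatial softmax over the heatmap, followed by an expectation over pixel coordinates) can be sketched in plain Python. This is only an illustration of the general soft-argmax idea, not the authors' PyTorch code: the function name `soft_argmax`, the temperature parameter `beta`, and the normalization of coordinates to the range [0, 1] by dividing by `width - 1` and `height - 1` are assumptions made for this sketch.

```python
import math


def soft_argmax(heatmap, beta=1.0):
    """Return normalized (x, y) as the softmax-weighted mean of pixel coordinates.

    heatmap: list of rows (H x W) of floats.
    beta: softmax temperature (an assumption of this sketch).
    """
    height, width = len(heatmap), len(heatmap[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in heatmap)
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for j, row in enumerate(weights):              # j indexes rows (y axis)
        for i, w in enumerate(row):                # i indexes columns (x axis)
            prob = w / total                       # spatial softmax at pixel (i, j)
            x += prob * i / max(width - 1, 1)      # assumed normalization to [0, 1]
            y += prob * j / max(height - 1, 1)
    return x, y
```

Because the result is a weighted average rather than a hard maximum, the estimate is differentiable with respect to the heatmap and is not restricted to integer pixel positions, which is what makes sub-pixel predictions possible.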
3.2 Loss function
We assessed various loss functions for training our model and finalized our choice at wing loss [15] that is closely related to loss. However, in the case of wing loss, the errors in a small vicinity of [math] – are better amplified due to the logarithmic nature of the function:
[TABLE]
where – is a ground truth, – prediction, (, ) – range of non-linear part of the loss, – constant smoothly linking the linear and non-linear parts.
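As an illustration, the wing loss can be sketched in plain Python following the piecewise form published in the wing loss paper [15]: logarithmic for small errors and linear for large ones, with a constant that joins the two pieces continuously. The function name and the default values `w=10.0` and `eps=2.0` are placeholders chosen for this sketch and should not be read as the settings used in this study.

```python
import math


def wing_loss(pred, target, w=10.0, eps=2.0):
    """Mean wing loss over paired coordinate values."""
    # c makes the logarithmic and linear pieces meet at |x| = w.
    c = w - w * math.log(1.0 + w / eps)
    total = 0.0
    for p, t in zip(pred, target):
        x = abs(p - t)
        # Logarithmic part amplifies small errors; linear part handles large ones.
        total += w * math.log(1.0 + x / eps) if x < w else x - c
    return total / len(pred)
```

Near zero the logarithmic branch has a steeper slope than an absolute-error loss, so small localization errors still contribute noticeable gradients.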
3.3 Training techniques
MixUp
We use a MixUp technique [42] to improve the performance of our method. In particular, MixUp mixes the data inputs and , the corresponding keypoint arrays and :
[TABLE]
thereby augmenting the dataset with the new interpolated examples. Our implementation of MixUp does not differ from the one proposed in the original work (https://github.com/facebookresearch/mixup-cifar10) and we do not compute the mixed targets . Instead, we optimize the following loss function calculated mini-batch-wise:
[TABLE]
where and are the outputs of the network for and , respectively. Here, the points for every point are generated by a simple mini-batch shuffling.
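The following plain-Python sketch follows the general pattern of the referenced mixup-cifar10 implementation: inputs are blended with partners obtained by shuffling the mini-batch, and the loss is a weighted sum of two terms rather than a loss on interpolated targets. It is an approximation for illustration and may differ from the authors' exact formulation. The function names, the flat-list data layout, and the default `alpha=0.75` are assumptions of this sketch.

```python
import random


def mixup_batch(inputs, targets, alpha=0.75, rng=random):
    """Blend each sample with a shuffled partner from the same mini-batch."""
    lam = rng.betavariate(alpha, alpha)        # mixing coefficient
    order = list(range(len(inputs)))
    rng.shuffle(order)                         # partners come from a batch shuffle
    mixed = [
        [lam * a + (1.0 - lam) * b for a, b in zip(inputs[i], inputs[j])]
        for i, j in enumerate(order)
    ]
    partner_targets = [targets[j] for j in order]
    return mixed, targets, partner_targets, lam


def mixup_loss(criterion, outputs, targets, partner_targets, lam):
    """Weighted sum of two losses; the targets themselves are never mixed."""
    return (lam * criterion(outputs, targets)
            + (1.0 - lam) * criterion(outputs, partner_targets))
```

Here `criterion` is any callable taking predictions and targets and returning a scalar, for example a wing loss or a mean absolute error.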
Data Augmentation.
Medical images can vary in appearance due to different data acquisition settings or patient-related anatomical features. To tackle the issue of limited data, we applied data augmentation. We use geometric and textural augmentations, similarly to the face landmark detection problem [16]. The former included all classes of homographic transformations, while the latter included gamma correction, salt-and-pepper noise, blur (both median and Gaussian) and the addition of Gaussian noise. Interestingly, homographic transformations were shown to be effective in improving, for example, self-supervised learning [23, 26]; however, only a narrower class of transformations (affine) has been applied to landmark localization in faces [16].
Transfer learning from low-budget annotations.
As shown in Fig. 1, the problem of localizing the landmarks comprises two stages: identification of the ROI and the actual landmark localization. We previously mentioned the two classes of labels that are needed to train such a pipeline: low-cost ( points / image) and high-cost labels ( points). The low-cost labels can be noisy / inaccurate and are quick to produce, while the high-cost labels require the expert knowledge. In this work, we first train the ROI localization model ( landmark per leg) on the low-cost labels – knee joint centers (see Fig. 1) and then re-use the pre-trained weights from this stage to train the landmark localization model ( landmarks per knee joint).
4 Experiments
4.1 Datasets
Annotation Process
For all the following datasets, we applied the same annotation process. Firstly, we ran the BoneFinder tool (see Sec. 4.2) on all the images in all the datasets. At the second stage, for every image, a person experienced in knee anatomy and OA manually refined all the landmark points. In Fig. 1, we highlight the numbering of the landmarks that we use in this paper. Specifically, we marked the corner landmarks in tibia from [math] to and in femur from to (lateral to medial). To perform the annotations, we used the VGG image annotation tool [14].
OAI.
We trained our model and performed model selection using the images from the Osteoarthritis Initiative (OAI) dataset (https://oai.epi-ucsf.org/datarelease/). Roughly knee joint images per KL grade were sampled to be included into the dataset. The final dataset size comprised knee joints in total. In the case of the ROI localization, we used a half of the image that corresponded to each knee.
Dataset A.
These data were collected at our hospital (Oulu University Hospital, Finland) [32], and thus, it comes from a completely different population than OAI (from USA). It includes the images from subjects, and KL grade-wise the data have the following distribution: 4 knees with KL [math], knees with KL , knees with KL , knees with KL 3 and knees with KL . From this dataset, we excluded knee due to an implant, thereby using knees for testing of our model.
Dataset B.
This dataset was also acquired from our hospital (Oulu University Hospital, Finland; ClinicalTrials.gov ID: NCT02937064) and included originally subjects. Out of these, 5 knee joints were excluded, thereby making a dataset of knees ( implants and due to error during the annotation process). With respect to OA severity, these data had cases with KL [math], with KL , with KL , with KL and with KL . This dataset was also used solely for testing of our model.
4.2 Baseline methods
We used several baseline methods at the model selection phase and one strong pre-trained baseline method at the test phase. In particular, we used Active Appearance Models [10] and Constrained Local Models [11] with both Image Gradient Orientations (IGO) [40] and Local Binary Patterns Features (LBP) [29]. Our implementation is based on the available methods with default hyperparameters from the Menpo library [1].
At the test phase, we used the pre-trained RFRV-CLM method [25] implemented in the BoneFinder tool. Here, the RFRV-CLM model was trained on images from the OAI dataset. However, we did not have access to the training data to assess which samples were used for training this method; therefore, we used this tool only for testing on datasets A and B.
4.3 Implementation Details
Ablation experiments
All our ablation experiments were conducted on the same -fold patient-wise cross-validation split stratified by a KL grade to ensure equal distribution of different stages of OA severity. Both ROI and landmark localization models were trained using the same split.
During the training, we used exactly the same hyperparameters for all the experiments. In particular, we used and for our network. The learning rate and the batch size were fixed to and , respectively. In some of our experiments where the weight decay was used, we set it to . All the models were trained with Adam optimizer [21]. The pixel spacing for ROI localization was set to mm and for the landmark localization to mm. We used bi-linear interpolation for image resizing.
All the ablation experiments were conducted solely on landmark localization task and eventually, after selecting the best configuration, we used it for training the ROI localization model due to the similarity of the tasks. We used the ground truth annotations to crop the mm ROIs around the tibial center (landmark in Fig. 1) to create the data for model selection and training the landmark localization model. In our experiments, we flipped all the left ROI images to look like the right ones, however this strategy was not applied for the ROI localization task.
When performing the fine-tuning of landmark localization model using the pre-trained weights of the ROI localization model, we simply initialized all the layers of the former with the weights of the latter one. We note here that the last layer was initialized randomly and we did not freeze the pre-trained part for simplicity.
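The initialization strategy described above can be sketched with plain dictionaries standing in for model weights. This is a hypothetical illustration rather than the authors' code: the function name, the layer names, the dict-of-lists representation and the `skip` prefix convention are all assumptions. In PyTorch, a comparable effect is commonly obtained by filtering a state dict and calling `load_state_dict` with `strict=False`.

```python
def init_from_pretrained(target_weights, source_weights, skip=("out",)):
    """Copy matching layers from source into target, leaving skipped layers as they are.

    Both arguments map hypothetical layer names to flat lists of parameters.
    Layers whose name starts with a prefix in `skip` keep their random initialization.
    """
    copied = []
    for name, value in source_weights.items():
        if name not in target_weights or name.startswith(tuple(skip)):
            continue
        # Only copy when the parameter sizes agree.
        if len(value) == len(target_weights[name]):
            target_weights[name] = list(value)
            copied.append(name)
    return copied
```

Nothing is frozen in this sketch: after the copy, every layer of the target model remains trainable, which matches the description of fine-tuning the whole network.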
In our experiments, we used PyTorch v [30] on a single Nvidia GTX1080Ti. For data augmentation, we used SOLT library [35]. For training AAM and CLM, we used Menpo [1], as mentioned earlier.
Evaluation and Metrics
To assess the results of our method, we used multiple metrics and evaluation strategies. Firstly, we performed the ablation experiments and used the landmarks for evaluation of the results (see Fig. 1). At the test time, when comparing the performance of the full system, we used an extended set of landmarks for evaluation – . The intuition here is to compare the landmark methods on those landmark points that are the most crucial in applications (tibial corners for landmark localization as well as tibial and femoral centers for the ROI localization). Besides, we excluded all the knees with implants from the evaluation.
As the main metric for comparing the landmark localization methods, we used the Percentage of Correct Keypoints (PCK). This metric shows the percentage of points that fall within a neighborhood of a given radius around the ground truth landmark (recall at different precision thresholds). In our experiments, we used thresholds of 1 mm, 1.5 mm, 2 mm and 2.5 mm for quantitative comparison.
Finally, we also assessed the number of outliers in the landmark localization task. An outlier was defined as a landmark that does not fall within the mm neighbourhood of the ground truth landmark. This value was computed for all the landmark points, in contrast to PCK.
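Both metrics reduce to counting how many predicted points lie within a given physical distance of their ground truth. A minimal sketch in plain Python is shown below; the function names and the `spacing_mm` parameter (millimetres per pixel, used to convert pixel distances) are assumptions of this example, and the radius is left as an argument rather than fixed to any value from the study.

```python
import math


def pck(pred, gt, threshold_mm, spacing_mm=1.0):
    """Fraction of predicted points within threshold_mm of their ground truth."""
    hits = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        dist_mm = math.hypot(px - gx, py - gy) * spacing_mm   # pixels to millimetres
        hits += dist_mm <= threshold_mm
    return hits / len(pred)


def outlier_rate(pred, gt, radius_mm, spacing_mm=1.0):
    """Fraction of points that fall outside radius_mm of the ground truth."""
    return 1.0 - pck(pred, gt, radius_mm, spacing_mm)
```

Evaluating `pck` at several thresholds yields the kind of recall-versus-precision curve that the text describes.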
4.4 Ablation Study
Conventional approaches.
We first investigated the conventional approaches for landmark localization. The benchmarks of AAM and CLM with IGO and LBP features with default hyperparameters from Menpo [1] showed satisfactory results. The best model here was CLM with IGO features (Tab. 1).
Loss Function.
In the initial experiments with our model, we assessed different loss functions (see Tab. 1). In particular, we used L2, L1, wing [15] and elastic loss (sum of the L1 and L2 losses). Besides, we also utilized a recently introduced general adaptive robust loss with the default hyperparameters [5]. Our experiments showed that wing loss with the default hyperparameters as in the original paper ( and ) produces the best results.
Effect of Multi-scale Residual Blocks.
The experiments on loss functions were conducted using the HMP block. However, it is worth assessing the added value of this block compared to the bottleneck residual block. Tab. 1 demonstrates that the bottleneck residual block ("Wing + regular res. block" in the table) fell behind the HMP block ("Wing loss") in terms of PCK.
MixUp vs. Weight Decay
After observing that the wing loss and HMP block yield the best default configuration, we experimented with various forms of regularization. In this series of experiments, we used our default configuration and applied MixUp with different . Our experiments showed that using MixUp with the default configuration and weight decay degrades the performance (Tab. 1). However, MixUp itself is also a powerful regularizer; therefore, we conducted the experiments without weight decay (marked as no wd in Tab. 1). Interestingly, setting the weight decay to zero increases the performance of our model with any . To assess the strength of regularization, we also conducted an experiment with (best) and without dropout. We observed that having dropout helps MixUp.
CutOut vs. Target Jitter
Besides MixUp, we tested two other data augmentation techniques: cutout [13] and noise addition to the ground truth annotations during the training (uniform distribution, pixel). We observed that the latter did not improve the results of our configuration with MixUp; however, the former halved the number of outliers while yielding nearly the same localization performance. This configuration had a cutout of of the image. These results are also presented in Tab. 1.
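For readers unfamiliar with cutout, the idea is to mask a random square region of the input so that the network cannot rely on any single local cue. The sketch below is a plain-Python illustration under stated assumptions: the function name is hypothetical, and `fraction` is interpreted as a fraction of the shorter image side, which may not match how the percentage was defined in the experiments.

```python
import random


def cutout(image, fraction=0.1, rng=random, fill=0.0):
    """Return a copy of image (list of rows) with one random square patch set to fill."""
    height, width = len(image), len(image[0])
    # Patch side in pixels; the meaning of `fraction` is an assumption of this sketch.
    side = max(1, int(round(fraction * min(height, width))))
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    out = [list(row) for row in image]         # leave the input untouched
    for r in range(top, top + side):
        for c in range(left, left + side):
            out[r][c] = fill
    return out
```

Unlike target jitter, this augmentation changes only the image and leaves the landmark coordinates intact.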
Transfer Learning from Low-cost Labels.
At the final stage of our experiments, we used the best configuration that included the wing loss, MixUp with , weight decay of [math] and cutout to train the ROI localization model. Essentially, both of these methods are landmark localization approaches, therefore, in our cross-validation experiments, we also assessed the performance of ROI localization using PCK. In our experiments, we found that pre-training of the landmark localization model on the ROI localization task significantly increases the performance of the former (see the last row of Tab. 1). The performance of both these models on cross-validation is presented in Fig. 5. Quantitatively, ROI localization model yielded PCK of , , , at mm, mm, mm and mm thresholds, respectively and had outliers.
4.5 Test datasets
Testing on the full datasets
Testing of our model was conducted on datasets A and B, respectively. We provide the quantitative results in Tab. 2. In this table, we present two versions of our pipeline, one is a single stage, where the landmark localization follows directly after the ROI localization step, and also a two-stage pipeline that includes ROI localization as a first step, initial inference of the landmark points as a second step, and re-centering of the ROI to the predicted tibial center and a second pass of landmark localization model as a third step.
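The two-stage variant described above (ROI localization, a first landmark pass, re-centering on the predicted tibial center, and a second landmark pass) can be sketched as follows. The two models are passed in as hypothetical callables, coordinates are integer pixels, and the helper names, the `half` crop size and the `center_index` argument are assumptions made for illustration.

```python
def crop_around(image, center, half):
    """Crop a square of side 2*half+1 around center (x, y), clamped to the image."""
    cx, cy = center
    height, width = len(image), len(image[0])
    size = 2 * half + 1
    left = min(max(cx - half, 0), max(width - size, 0))
    top = min(max(cy - half, 0), max(height - size, 0))
    patch = [row[left:left + size] for row in image[top:top + size]]
    return patch, (left, top)


def two_stage_localize(image, roi_model, landmark_model, half, center_index=0):
    """ROI localization, first landmark pass, re-centering, second landmark pass."""
    center = roi_model(image)                              # step 1: coarse joint center
    patch, (left, top) = crop_around(image, center, half)
    first = [(x + left, y + top) for x, y in landmark_model(patch)]   # step 2
    # Step 3: re-center the crop on the predicted center landmark and predict again.
    patch, (left, top) = crop_around(image, first[center_index], half)
    return [(x + left, y + top) for x, y in landmark_model(patch)]
```

The single-stage variant corresponds to stopping after the first landmark pass and returning `first`.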
Testing with Respect to the presence of Radiographic Osteoarthritis
To better understand the behaviour of our model on the test datasets, we investigated the performance of our 2-stage pipeline and BoneFinder for cases having KL and KL , respectively. These results are presented in Fig. 6. Our method performs on par with BoneFinder for Dataset A and even exceeds its localization performance for precision thresholds above mm for radiographic OA. In Dataset B, on average, our method performs better than BoneFinder when both methods are benchmarked for both non-OA and OA cases. To provide better insights into the performance of our method for different stages of OA severity, we show examples of landmark localization done by our method, BoneFinder and manually (Fig. 7).
5 Conclusions
In this paper, the problem of anatomical landmark localization in knee radiographs was addressed. We proposed a new method that combines the latest advances in facial landmark localization and pose estimation, which allowed us to accurately localize the landmarks on unseen data.
Compared to the current state-of-the-art [24, 25], our method generalized better to the unseen test datasets that had completely different acquisition settings and patient populations. Consequently, these results suggest that our new method may be easily applicable to various tasks in clinical and research settings.
Our study still has some limitations. Firstly, the comparison with BoneFinder should ideally be conducted when it is trained on the same mm resolution data with the same KL grade-wise stratification, or at full image resolution. However, we did not have access to the training code of BoneFinder, which leaves a more systematic comparison to future studies. Another limitation of this study is the ground truth annotation process. Specifically, we used BoneFinder to pre-annotate the landmarks for all the images in both train and test sets. In theory, this might give an advantage to BoneFinder compared to our method. On the other hand, all the landmarks were still manually refined, which should decrease this advantage.
The core methodological novelties of the study were in adapting MixUp, the soft-argmax layer and transfer learning from low-cost annotations for training our model. We think that the latter has applications in other, even non-medical domains, such as human pose estimation and facial landmark localization. It was shown that, compared to RFRV-CLM, Deep Learning methods scale with the amount of training data, and therefore, we also expect our method to yield even better results when it is trained on larger datasets [12]. Besides, we also expect semi-supervised learning [18] to help in this task.
To summarize, we developed a robust method for anatomical landmark localization that has the potential to scale with the amount of training data and to be applied in other domains. Our source code and the annotations made for the OAI dataset will be made publicly available.
6 Acknowledgements
This study was supported by KAUTE foundation, Infotech Oulu, University of Oulu strategic funding and Sigrid Juselius Foundation.
The OAI is a public-private partnership comprised of five contracts (N01-AR-2-2258; N01-AR-2-2259; N01-AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by the OAI Study Investigators. Private funding partners include Merck Research Laboratories; Novartis Pharmaceuticals Corporation, GlaxoSmithKline; and Pfizer, Inc. Private sector funding for the OAI is managed by the Foundation for the National Institutes of Health.
Development and maintenance of VGG Image Annotator (VIA) is supported by EPSRC programme grant Seebibyte: Visual Search for the Era of Big Data (EP/M013774/1).
We thank Dr. Claudia Lindner for providing BoneFinder.
References
- [1] J. Alabort-i Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou. Menpo: A comprehensive platform for parametric image alignment and visual deformable models. In Proceedings of the 22nd ACM international conference on Multimedia, pages 679–682. ACM, 2014.
- [2] K. D. Allen and Y. M. Golightly. Epidemiology of osteoarthritis: state of the evidence. Current opinion in rheumatology, 27(3):276, 2015.
- [3] J. Antony, K. Mc Guinness, K. Moran, and N. E. O’Connor. Automatic detection of knee joints and quantification of knee osteoarthritis severity using convolutional neural networks. In International conference on machine learning and data mining in pattern recognition, pages 376–390. Springer, 2017.
- [4] J. Antony, K. Mc Guinness, N. E. O’Connor, and K. Moran. Quantifying radiographic knee osteoarthritis severity using deep convolutional neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 1195–1200. IEEE, 2016.
- [5] J. T. Barron. A general and adaptive robust loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4331–4339, 2019.
- [6] A. Brahim, R. Jennane, R. Riad, T. Janvier, L. Khedher, H. Toumi, and E. Lespessailles. A decision support tool for early detection of knee osteoarthritis using x-ray imaging and machine learning: Data from the osteoarthritis initiative. Computerized Medical Imaging and Graphics, 73:11–18, 2019.
- [7] A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In Proceedings of the IEEE International Conference on Computer Vision, pages 3706–3714, 2017.
- [8] O. Chapelle and M. Wu. Gradient descent optimization of smoothed information retrieval metrics. Information retrieval, 13(3):216–235, 2010.
