Confident Head Circumference Measurement from Ultrasound with Real-time Feedback for Sonographers

Samuel Budd; Matthew Sinclair; Bishesh Khanal; Jacqueline Matthew; David Lloyd; Alberto Gomez; Nicolas Toussaint; Emma Robinson; Bernhard Kainz

arXiv:1908.02582 · eess.IV · August 8, 2019

TL;DR

This paper introduces a real-time deep learning system for fetal head circumference measurement from ultrasound, providing confidence feedback to improve measurement accuracy and consistency among sonographers.

Contribution

A novel probabilistic deep learning approach that offers real-time fetal head circumference estimates with confidence metrics to guide ultrasound scanning.

Findings

1. Predicted HC within 1.81 mm of ground truth
2. 50% of images fully within confidence margins
3. Average deviation from margins is 1.82 mm

Tables

Table 1: Single-sample results of three U-Nets. Baseline: trained on Dataset A data only. Dataset A + HC18: trained on Dataset A data and HC18 Challenge data transformed to the same format as Dataset A data. Dropout: trained on Dataset A and HC18 Challenge data with dropout ($p=0.6$, the value found to perform best across a variety of dropout configurations). We compare the mean absolute difference in the final HC measurement, the DICE overlap of the fitted ellipse with the ground truth ellipse, and the Hausdorff distance between the outline of the fitted ellipse and the outline of the ground truth ellipse. Results are calculated on Dataset A test data.

Model	Mean abs difference $\pm$ std (mm)	Mean DICE $\pm$ std	Mean Hausdorff distance $\pm$ std (mm)
Baseline	2.09 $\pm$ 1.97	0.982 $\pm$ 0.011	1.289 $\pm$ 0.880
Dataset A + HC18	1.90 $\pm$ 1.90	0.982 $\pm$ 0.010	1.292 $\pm$ 0.791
Dropout $p = 0.6$	1.808 $\pm$ 1.65	0.982 $\pm$ 0.008	1.295 $\pm$ 0.664
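
The two overlap measures named in the Table 1 caption can be illustrated with a minimal sketch. It assumes binary masks stored as equally sized 2D lists of 0/1 and outlines stored as lists of (x, y) points; the function names `dice` and `hausdorff` are illustrative and are not taken from the paper's code.

```python
import math

def dice(mask_a, mask_b):
    """DICE overlap of two binary masks given as equally sized 2D lists of 0/1."""
    intersection = sum(
        a and b for row_a, row_b in zip(mask_a, mask_b) for a, b in zip(row_a, row_b)
    )
    total = sum(map(sum, mask_a)) + sum(map(sum, mask_b))
    # Two empty masks are treated as a perfect match (a convention, not from the paper).
    return 2.0 * intersection / total if total else 1.0

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two outlines given as (x, y) point lists."""
    def directed(src, dst):
        # Largest distance from any source point to its nearest destination point.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))
```

Distances come out in the units of the input points, so outlines would need to be scaled to mm before the values are comparable with the table.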

Table 2: Multi-sampling results for the two methods. We report the performance measures of a single-sampled point predictor (Det., deterministic), the mean/median of $N=10$ samples from the Probabilistic U-Net (Prob. U-Net), and our previous best U-Net with Monte Carlo dropout during inference (MC(inf.), $p=0.6$). We also report the percentage of ground truth HC values that lie within the calculated upper/lower bound range. This percentage varies significantly with $N$; for MC(inf.), $N=2$: 14.8%; $N=1000$: 50.4%. See Supplementary Material Figures 1-3.

Method	Mean abs difference	Mean DICE	Mean Hausdorff distance	$LB \leq HC_{gt} \leq UB$ (%)
Det., MC $p = 0.6$	1.81 $\pm$ 1.65	0.982 $\pm$ 0.008	1.295 $\pm$ 0.664	N/A
Prob. U-Net, mean	2.22 $\pm$ 2.15	0.980 $\pm$ 0.011	1.413 $\pm$ 0.751	20.4
Prob. U-Net, median	2.21 $\pm$ 2.15	0.980 $\pm$ 0.011	1.410 $\pm$ 0.748	20.4
MC(inf.), mean	2.15 $\pm$ 2.09	0.981 $\pm$ 0.010	1.313 $\pm$ 0.613	27.8
MC(inf.), median	2.15 $\pm$ 2.07	0.981 $\pm$ 0.010	1.307 $\pm$ 0.604	27.8
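
The last column of Table 2 can be read as a coverage rate. The sketch below shows one way such a number could be computed. It assumes, purely for illustration, that the lower and upper bounds are the minimum and maximum of the $N$ sampled HC values per image; this excerpt does not state how the bounds are actually derived. The function names are hypothetical.

```python
from statistics import mean, median

def summarise_samples(hc_samples):
    """Aggregate N sampled HC values (mm) for one image into point estimates and a range."""
    return {
        "mean": mean(hc_samples),
        "median": median(hc_samples),
        "lower": min(hc_samples),  # assumption: bounds are the sample extremes
        "upper": max(hc_samples),
    }

def coverage_percent(per_image_samples, ground_truth):
    """Percentage of images whose ground truth HC lies inside [lower, upper]."""
    hits = sum(
        min(samples) <= gt <= max(samples)
        for samples, gt in zip(per_image_samples, ground_truth)
    )
    return 100.0 * hits / len(ground_truth)
```

Under this assumption the range can only widen as more samples are drawn, which would be consistent with the caption's observation that coverage grows with $N$.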

Full text

(1) Imperial College London, Dept. Computing, BioMedIA, London, UK; (2) King’s College London, ISBE, London, UK; (3) NAAMII, Kathmandu, Nepal

Samuel Budd (1), Matthew Sinclair (1), Bishesh Khanal (3, 2), Jacqueline Matthew (2), David Lloyd (2), Alberto Gomez (2), Nicolas Toussaint (2), Emma Robinson (2), Bernhard Kainz (1)

Abstract

Manual estimation of fetal Head Circumference (HC) from Ultrasound (US) is a key biometric for monitoring the healthy development of fetuses. Unfortunately, such measurements are subject to large inter-observer variability, resulting in low early-detection rates of fetal abnormalities. To address this issue, we propose a novel probabilistic Deep Learning approach for real-time automated estimation of fetal HC. This system feeds back statistics on measurement robustness to inform users how confident a deep neural network is in evaluating suitable views acquired during free-hand ultrasound examination. In real-time scenarios, this approach may be exploited to guide operators to scan planes that are as close as possible to the underlying distribution of training images, for the purpose of improving inter-operator consistency. We train on free-hand ultrasound data from over 2000 subjects (2848 training/540 test) and show that our method is able to predict HC measurements within $1.81\pm 1.65$ mm deviation from the ground truth, with 50% of the test images fully contained within the predicted confidence margins, and an average of $1.82\pm 1.78$ mm deviation from the margin for the remaining cases that are not fully contained.

1 Introduction

Fetal Ultrasound (US) scanning is a vital part of ensuring good health of mothers and fetuses during and after pregnancy. Accurate anomaly detection and assessment of fetal development from US scans are required to ensure that the best care is given at the earliest identifiable stage. In many countries a mid-trimester US scan is carried out between 18-22 weeks gestation as a part of standard prenatal care. ‘Standardized plane’ views are used to acquire images in which distinct anatomical features can be extracted [13]. From some of these standard plane views, measurements of the head, abdomen and femur are most commonly used to predict fetal age and weight, and are the key biometrics identified from US. Biometrics acquired longitudinally can be used to predict the fetal development trajectory. Unfortunately, rates for early detection of fetal abnormalities are low, largely due to the high level of skill required by the sonographer to perform such scans and extract the relevant biometrics [12].

Recently, automatic US scanning approaches have been developed using deep learning [2], which mitigate the problems of manual US measurement through automatic detection of diagnostically relevant anatomical planes. Such systems have allowed development of robust automated methods for estimation of anatomical biometrics [14, 16] in diverse acquisition conditions with various imaging artefacts, outperforming non-deep learning approaches [3, 8, 11]. Critically, such methods only provide point estimates of HC without confidence or uncertainty measures, and do not provide any means to assess the quality of individual measurements during real-time scans. This can lead to many, potentially contradicting, measurements without any means to control the trustworthiness of the predictions during examination or retrospectively.

To this end, several approaches have been proposed for estimation of uncertainty in Deep Networks. These include Monte-Carlo Dropout (MC Dropout), the most common dropout method, which has been shown to model a posterior mixture of Gaussians well. Weights in a deep neural network are ‘dropped’ randomly during inference with a given probability $p$, which has been shown to approximate Bayesian inference in deep Gaussian processes [5]. In addition, ensemble approaches produce $N$ prediction samples per input image by training a set of $N$ separate networks for the same task. The results are then combined to produce a final segmentation, which seems to offer a good trade-off between robustness and accuracy [6]. Finally, the Probabilistic U-Net is a generative segmentation model that combines a U-Net with a conditional variational autoencoder. It is capable of producing an unlimited number of plausible hypotheses, reproducing the possible segmentation variants as well as the frequencies with which they occur [7].

Contribution: In this paper, we extend a state-of-the-art convolutional Deep Learning approach for automatic fetal HC measurement [14] to develop a new approach for automated probabilistic fetal HC measurement with real-time feedback on measurement robustness. Two probabilistic deep learning methods are evaluated: MC Dropout during inference and Probabilistic U-Net. These are used to return an ensemble of segmentations, from which upper and lower bounds on the measurement are generated. In addition, we propose the derivation of a ‘variance score’, used to reject acquired images that produce sub-optimal HC measurements. In this way, the system will guide operators towards acquiring optimal US views, resulting in more consistent and accurate measurements.

2 Method

Biometric estimation: Our HC estimation builds on the approach developed in [14], which achieves human-level performance. First, a U-Net [10] segmentation network masks out the head from a US image. Then, an ellipse is fitted to the segmented contours [4]. We extract the ellipse centroid co-ordinates ($c_{x}$ and $c_{y}$), the major and minor axis radii ($a$ and $b$), each in pixels, and the angle of rotation ($\alpha$), and estimate HC in mm using the Ramanujan approximation II [1] as $HC=\pi(a+b)\left(1+\frac{3h}{10+\sqrt{4-3h}}\right)s_{xy}$, where $h=\frac{(a-b)^{2}}{(a+b)^{2}}$ and $s_{xy}$ is the pixel-to-mm scale factor. The error of this approximation is $O(h^{5})$, i.e. tenth order in $(a-b)/(a+b)$, which is negligible for near-circular ellipses. This ellipse fitting process mimics the sonographer’s manual actions when extracting a HC measurement during fetal US screening.
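
As a worked illustration of the HC formula, the following minimal sketch evaluates the Ramanujan approximation II for given ellipse radii. The function name and the `mm_per_px` argument (standing in for $s_{xy}$, assumed here to be an isotropic pixel-to-mm scale) are illustrative choices, not the paper's implementation.

```python
import math

def head_circumference_mm(a_px, b_px, mm_per_px):
    """Ellipse perimeter via Ramanujan's second approximation, scaled to mm.

    a_px, b_px: semi-axis lengths in pixels; mm_per_px: assumed isotropic scale.
    """
    h = (a_px - b_px) ** 2 / (a_px + b_px) ** 2
    perimeter_px = math.pi * (a_px + b_px) * (1.0 + 3.0 * h / (10.0 + math.sqrt(4.0 - 3.0 * h)))
    return perimeter_px * mm_per_px
```

For a circle ($a=b$) the term $h$ vanishes and the expression reduces to $2\pi a\,s_{xy}$, which gives a quick sanity check on any implementation.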

Probabilistic segmentation: Given the inherent variability between sonographers’ annotations in the training data, we generate a set of $N$ plausible segmentations from a single input using the following methods:

i) MC Dropout: We randomly drop weights of the network with probability $p$ to predict $N$ segmentation samples. Here, single-sample experiments ( $N=1$ ) were used to optimise the configuration of the network. This led to implementation of a single dropout layer ( $p=0.6$ ) before the bottleneck layer of the U-Net during inference.

ii) Probabilistic U-Net: We sample a set of $N$ plausible segmentations from a Probabilistic U-Net, following the training scheme of [7].
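
The sampling idea behind method i) can be sketched without a deep learning framework. In the toy example below, `encoder` and `decoder` are hypothetical callables standing in for the parts of the U-Net before and after the single dropout layer, and dropout is applied to a flat list of features; a real network would apply it to feature maps. This is an illustration of keeping dropout active at inference, not the authors' code.

```python
import random

def dropout(features, p, rng):
    """Zero each feature with probability p and rescale the survivors (inverted dropout)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [f / keep if rng.random() >= p else 0.0 for f in features]

def mc_dropout_samples(encoder, decoder, image, n_samples, p=0.6, seed=0):
    """Draw n_samples stochastic predictions by keeping dropout active at inference."""
    rng = random.Random(seed)
    features = encoder(image)  # deterministic part, computed once per image
    # Each pass uses a fresh random mask, so the predictions differ between samples.
    return [decoder(dropout(features, p, rng)) for _ in range(n_samples)]
```

With $p=0$ every sample is identical, which corresponds to the deterministic point predictor; larger $p$ spreads the samples further apart.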

Variance Estimation: With a probabilistic mapping function $g_{P}(X)=\hat{X}_{i}$ , in our case a deep probabilistic neural network, we can map a continuous input image to a possible segmentation mask $\hat{X}_{i}$ . We assume a deterministic function $f(\hat{X}_{i})=[a,b,\theta,x_{c},y_{c}]^{T}$ , with semi-major axis length $a$ , semi-minor axis length $b$ , angle of orientation $\theta$ and center $C(x_{c},y_{c})$ , which provides a least-squares solution to the problem of fitting an ellipse to the set of points $\hat{X}$ as proposed by [9]. Based on $f(\hat{X}_{i})$ we can evaluate hypotheses for their suitability as a metric of robustness during inference, given $N$ prediction samples from $g_{P}(X)$ . The proposed metrics are:

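h1) Ellipse parameter variance: $\sum_{i=1}^{5}\mathrm{Var}_{n}\left(f(\hat{X}_{n})_{i}\right)$ , i.e. the variance of each of the five ellipse parameters across the $N$ samples, summed over the parameters;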

h2) Total ring area: $\sum(f(\bigcup_{i=1}^{N}\hat{X}_{i})-f(\bigcap_{i=1}^{N}\hat{X}_{i}))\cdot s_{xyz}$ , where $s_{xyz}$ scales $\hat{X}_{i}$ to world space in $mm$ ;

h3) Mask classification entropy: $\sum_{x,y}^{K}\underline{\hat{X}}(x,y)\log(\underline{\hat{X}}(x,y))$ , where $K$ is the number of pixels in $\underline{\hat{X}}\in\mathbb{Z}_{2}$ after $argmax(\hat{X}_{i})$ class assignment and $\underline{\hat{X}}=\frac{1}{N}\cdot\sum^{N}_{i}\hat{X}_{i}$ ; and

h4) Softmax confidence entropy: given $\hat{X}_{i}\in\mathbb{R}$ before class assignment, after conversion of the network’s final layer’s logits with $Softmax(x_{i})=\frac{\exp(x_{i})}{\sum_{j}\exp(x_{j})}$ , the resulting $\hat{X}_{i}^{\ast}$ can be interpreted as two-element prediction confidence $[p_{f},p_{b}]_{i}=\hat{X}_{i}^{\ast}(x,y)$ for foreground $p_{f}$ and background $p_{b}$ . Thus we can estimate class-agnostic prediction entropy by $\sum_{k}^{K}p_{k}\log(p_{k})$ where $p_{k}=\sum_{i}^{N}\max([p_{f},p_{b}]_{i})$ is evaluated at pixel $k$ .
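
Two of the candidate metrics, h1 and h3, are simple enough to sketch directly. The code below assumes ellipse parameters are given as one five-element list per sample and binary masks as 2D lists of 0/1. It uses the population variance and follows the h3 expression as written in the text, without the leading minus sign of Shannon entropy, so its values are non-positive. The function names are illustrative.

```python
import math
from statistics import pvariance

def ellipse_parameter_variance(param_samples):
    """h1: variance of each ellipse parameter across samples, summed over parameters."""
    # zip(*...) turns N rows of [a, b, theta, x_c, y_c] into five columns.
    return sum(pvariance(column) for column in zip(*param_samples))

def mask_entropy(masks):
    """h3 as written in the text: sum over pixels of m * log(m) for the mean mask m."""
    n = len(masks)
    total = 0.0
    # Flatten each 2D mask, then walk through the N values of every pixel together.
    for pixel_values in zip(*(sum(mask, []) for mask in masks)):
        m = sum(pixel_values) / n
        if m > 0.0:  # convention: 0 * log(0) = 0
            total += m * math.log(m)
    return total
```

Both scores are zero when all $N$ samples agree and move away from zero as the samples disagree, which is the behaviour a rejection threshold would rely on. Note that treating the orientation angle with an ordinary variance ignores its periodicity; that simplification is part of this sketch.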

3 Experiments and Results

Data: Our base dataset, subsequently referred to as Dataset A, consists of 2,724 two-dimensional US examinations from volunteers at 18-22 weeks gestation, acquired and labelled during routine screening by 45 expert sonographers. Several images were taken during each session, including the standard transverse brain view at the posterior horn of the ventricle (TV) plane used for HC measurement. This data was combined with the HC18 Challenge [15] dataset, which consists of 1,334 two-dimensional US images of the standard plane that is used to measure HC. Each image is $800\times 540$ pixels, with a pixel size ranging from 0.052 mm to 0.326 mm. Each image in the training set has an accompanying manual annotation of the HC (ellipse outline) performed by a single trained sonographer [15]. We resample all images to $320\times 384$ pixels and produce a head mask from the expert ground truth delineation. Training data is randomly flipped both horizontally and vertically, and a random rotation ( $\pm 5^{\circ}$ ) is performed.
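
The augmentation described above can be sketched as follows. The flip probability of 0.5 and the uniform distribution of the rotation angle are assumptions made for illustration; the text only states that flips and a rotation of up to $\pm 5^{\circ}$ are applied. The rotation itself is left to an image library, and the same sampled parameters would need to be applied to the image and its head mask.

```python
import random

def sample_augmentation(rng, max_angle_deg=5.0):
    """Draw one set of augmentation parameters (assumed distributions, see lead-in)."""
    return {
        "flip_h": rng.random() < 0.5,
        "flip_v": rng.random() < 0.5,
        "angle_deg": rng.uniform(-max_angle_deg, max_angle_deg),
    }

def apply_flips(image, flip_h, flip_v):
    """Flip a 2D list image left-right and/or top-bottom without modifying the input."""
    if flip_h:
        image = [row[::-1] for row in image]
    if flip_v:
        image = image[::-1]
    return image
```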

Single-Sampling Experiments: In the first instance, single-sample experiments, generating a single segmentation and HC measurement ( $N=1$ ) per subject, were used to verify the performance of the proposed model against the state-of-the-art [14]. Table 1 reports performance measures for all single-sampling experiments. These show comparable performance relative to [14] for our U-Net implementation, trained on Dataset A. This result improves further when the same model is trained on Dataset A and HC18 data. Dropout during training further improves the result. For subsequent analysis, all experiments for MC Dropout (during inference) use the combined data and are trained using dropout.

Bibliography

[1] Barnard, R.W., Pearce, K., Schovanec, L.: Inequalities for the Perimeter of an Ellipse. Journal of Mathematical Analysis and Applications 260(2), 295–306 (2001). https://doi.org/10.1006/JMAA.2000.7128
[2] Baumgartner, C.F., et al.: SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand Ultrasound. IEEE Trans Med Imag 36(11), 2204–2215 (2017). https://doi.org/10.1109/TMI.2017.2712367
[3] Carneiro, G., Georgescu, B., Good, S., Comaniciu, D.: Detection and Measurement of Fetal Anatomies from Ultrasound Images using a Constrained Probabilistic Boosting Tree. IEEE Trans Med Imag 27(9), 1342–1355 (2008). https://doi.org/10.1109/TMI.2008.928917
[4] Fitzgibbon, A., Pilu, M., Fisher, R.: Direct least squares fitting of ellipses. In: 13th ICPR’96. pp. 253–257. IEEE (1996). https://doi.org/10.1109/ICPR.1996.546029
[5] Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML’16. pp. 1050–1059 (2016)
[6] Kamnitsas, K., et al.: Ensembles of Multiple Models and Architectures for Robust Brain Tumour Segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. pp. 450–462 (2018). https://doi.org/10.1007/978-3-319-75238-9_38
[7] Kohl, S., et al.: A probabilistic U-Net for segmentation of ambiguous images. In: Advances in Neural Information Processing Systems. pp. 6965–6975 (2018)
[8] Li, J., et al.: Automatic Fetal Head Circumference Measurement in Ultrasound Using Random Forest and Fast Ellipse Fitting. IEEE J Biomed Health Inform 22(1), 215–223 (2018). https://doi.org/10.1109/JBHI.2017.2703890