Deep learning in ultrasound imaging

Ruud JG van Sloun; Regev Cohen; Yonina C Eldar

arXiv:1907.02994·eess.SP·July 30, 2019


TL;DR

This paper reviews how deep learning can enhance ultrasound imaging by improving signal processing, adaptive techniques, and image quality, highlighting recent advances and potential impacts across the imaging pipeline.

Contribution

It provides a comprehensive overview of deep learning applications in ultrasound, including novel methods for adaptive processing and structured signal recovery.

Findings

1. Deep learning improves adaptive beamforming and spectral Doppler.
2. Learned compressive encodings enhance color Doppler imaging.
3. Frameworks for structured signal recovery enable super-resolution ultrasound.

Abstract

We consider deep learning strategies in ultrasound systems, from the front-end to advanced applications. Our goal is to provide the reader with a broad understanding of the possible impact of deep learning methodologies on many aspects of ultrasound imaging. In particular, we discuss methods that lie at the interface of signal acquisition and machine learning, exploiting both data structure (e.g. sparsity in some domain) and data dimensionality (big data) already at the raw radio-frequency channel stage. As some examples, we outline efficient and effective deep learning solutions for adaptive beamforming and adaptive spectral Doppler through artificial agents, learn compressive encodings for color Doppler, and provide a framework for structured signal recovery by learning fast approximations of iterative minimization problems, with applications to clutter suppression and…

Equations

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \mathbf{w}^{H}\mathbf{R}_{x}\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{H}\mathbf{a} = 1,$$

$$f(\mathbf{x}) = \begin{bmatrix} \left[\dfrac{\mathbf{x}-\bar{\mathbf{x}}}{\|\mathbf{x}-\bar{\mathbf{x}}\|_{2}}\right]_{+} \\ \left[-\dfrac{\mathbf{x}-\bar{\mathbf{x}}}{\|\mathbf{x}-\bar{\mathbf{x}}\|_{2}}\right]_{+} \end{bmatrix},$$

$$\mathcal{L} = \left\|\log_{10}([\hat{\mathbf{y}}]_{+}) - \log_{10}([\mathbf{y}]_{+})\right\|_{2}^{2} + \left\|\log_{10}([-\hat{\mathbf{y}}]_{+}) - \log_{10}([-\mathbf{y}]_{+})\right\|_{2}^{2},$$

$$\hat{\mathbf{w}}_{\omega} = \arg\min_{\mathbf{w}_{\omega}} \;\ldots\; \quad \text{s.t.} \quad \mathbf{w}_{\omega}^{H}\mathbf{e}_{\omega} = 1,$$

$$D(x, z, t) = L(x, z, t) + S(x, z, t),$$

$$\mathbf{D} = \mathbf{L} + \mathbf{S}.$$

$$\min_{\mathbf{L},\mathbf{S}} \; \frac{1}{2}\left\|\mathbf{D} - (\mathbf{L} + \mathbf{S})\right\|_{F}^{2} + \lambda_{1}\|\mathbf{L}\|_{*} + \lambda_{2}\|\mathbf{S}\|_{1,2},$$

$$\mathbf{L}^{k+1} = \mathrm{ST}_{\lambda_{1}/2}\left(\tfrac{1}{2}\mathbf{L}^{k} - \mathbf{S}^{k} + \mathbf{D}\right), \qquad \mathbf{S}^{k+1} = \mathrm{MT}_{\lambda_{2}/2}\left(\tfrac{1}{2}\mathbf{S}^{k} - \mathbf{L}^{k} + \mathbf{D}\right).$$

$$\mathbf{L}^{k+1} = \mathrm{ST}_{\lambda_{1}/2}\left(\mathbf{W}_{1}\mathbf{D} + \mathbf{W}_{3}\mathbf{S}^{k} + \mathbf{W}_{5}\mathbf{L}^{k}\right), \qquad \mathbf{S}^{k+1} = \mathrm{MT}_{\lambda_{2}/2}\left(\mathbf{W}_{2}\mathbf{D} + \mathbf{W}_{4}\mathbf{S}^{k} + \mathbf{W}_{6}\mathbf{L}^{k}\right).$$

$$\mathbf{L}^{k+1} = \mathrm{ST}_{\lambda_{1}^{k}}\left(\mathbf{W}_{1}^{k} * \mathbf{D} + \mathbf{W}_{3}^{k} * \mathbf{S}^{k} + \mathbf{W}_{5}^{k} * \mathbf{L}^{k}\right), \qquad \mathbf{S}^{k+1} = \mathrm{MT}_{\lambda_{2}^{k}}\left(\mathbf{W}_{2}^{k} * \mathbf{D} + \mathbf{W}_{4}^{k} * \mathbf{S}^{k} + \mathbf{W}_{6}^{k} * \mathbf{L}^{k}\right).$$

$$E(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\left(\left\|\mathbf{S}_{i} - \hat{\mathbf{S}}_{i}(\theta)\right\|_{F}^{2} + \left\|\mathbf{L}_{i} - \hat{\mathbf{L}}_{i}(\theta)\right\|_{F}^{2}\right)$$

$$\mathrm{CNR} = \frac{|\mu_{s} - \mu_{b}|}{\sqrt{\sigma_{s}^{2} + \sigma_{b}^{2}}}, \qquad \mathrm{CR} = \frac{\mu_{s}}{\mu_{b}},$$

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w},$$

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_{2}^{2} + \lambda\|\mathbf{x}\|_{1},$$

$$\mathcal{L}(\mathbf{Y}, \mathbf{X}_{t} \mid \theta) = \left\|f(\mathbf{Y} \mid \theta) - G(\sigma) * \mathbf{X}_{t}\right\|_{2}^{2} + \gamma\left\|f(\mathbf{Y} \mid \theta)\right\|_{1},$$

$$\mathbf{x}^{k+1} = \;\ldots$$


Full text

Deep learning in Ultrasound Imaging

Deep learning is taking an ever more prominent role in medical imaging. This paper discusses applications of this powerful approach in ultrasound imaging systems along with domain-specific opportunities and challenges.

Ruud J.G. van Sloun, Regev Cohen and Yonina C. Eldar

R. J. G. van Sloun is with the Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands. R. Cohen is with the Department of Electrical Engineering, Technion, Israel. Y. C. Eldar is with the Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel.

Accepted for publication in the Proceedings of the IEEE. ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

ABSTRACT $|$ We consider deep learning strategies in ultrasound systems, from the front-end to advanced applications. Our goal is to provide the reader with a broad understanding of the possible impact of deep learning methodologies on many aspects of ultrasound imaging. In particular, we discuss methods that lie at the interface of signal acquisition and machine learning, exploiting both data structure (e.g. sparsity in some domain) and data dimensionality (big data) already at the raw radio-frequency channel stage. As some examples, we outline efficient and effective deep learning solutions for adaptive beamforming and adaptive spectral Doppler through artificial agents, learn compressive encodings for color Doppler, and provide a framework for structured signal recovery by learning fast approximations of iterative minimization problems, with applications to clutter suppression and super-resolution ultrasound. These emerging technologies may have considerable impact on ultrasound imaging, showing promise across key components in the receive processing chain.

KEYWORDS $|$ Deep learning; ultrasound imaging; image reconstruction; beamforming; Doppler; compression; deep unfolding; super resolution.

I Introduction

Diagnostic imaging plays a critical role in healthcare, serving as a fundamental asset for timely diagnosis, disease staging and management as well as for treatment choice, planning, guidance, and follow-up. Among the diagnostic imaging options, ultrasound imaging [1] is uniquely positioned, being a highly cost-effective modality that offers the clinician an unmatched and invaluable level of interaction, enabled by its real-time nature. Its portability and cost-effectiveness permit point-of-care imaging at the bedside, in emergency settings, rural clinics, and developing countries. Ultrasonography is increasingly used across many medical specialties, spanning from obstetrics to cardiology and oncology, and its global market share is growing.

On the technological side, ultrasound probes are becoming increasingly compact and portable, with the market demand for low-cost ‘pocket-sized’ devices expanding [2, 3]. Transducers are miniaturized, allowing e.g. in-body imaging for interventional applications. At the same time, there is a strong trend towards 3D imaging [4] and the use of high-frame-rate imaging schemes [5]; both accompanied by dramatically increasing data rates that pose a heavy burden on the probe-system communication and subsequent image reconstruction algorithms. Systems today offer a wealth of advanced applications and methods, including shear wave elasticity imaging [6], ultra-sensitive Doppler [7], and ultrasound localization microscopy for super-resolution microvascular imaging [8].

With the demand for high-quality image reconstruction and signal extraction from unfocused planar wave transmissions that facilitate fast imaging, and a push towards miniaturization, modern ultrasound imaging leans heavily on innovations in powerful receive channel processing. In this paper, we discuss how artificial intelligence and deep learning methods can play a compelling role in this process, and demonstrate how these data-driven systems can be leveraged across the ultrasound imaging chain. We aim to provide the reader with a broad understanding of the possible impact of deep learning on a variety of ultrasound imaging aspects, placing particular emphasis on methods that exploit both the power of data and signal structure (for instance sparsity in some domain) to yield robust and data-efficient solutions. We believe that methods that exploit models and structure together with learning from data can pave the way to interpretable and powerful processing methods from limited training sets. As such, throughout the paper, we will typically first discuss an appropriate model-based solution for the problems considered, and then follow by a data-driven deep learning solution derived from it.

We start by briefly describing a standard ultrasound imaging chain in Section II. We then elaborate on several dedicated deep learning solutions that aim at improving key components in this processing pipeline, covering adaptive beamforming (Section III-A), adaptive spectral Doppler (Section III-B), compressive tissue Doppler (Section III-C), and clutter suppression (Section III-D). In Section IV, we show how the synergetic exploitation of deep learning and signal structure enables robust super-resolution microvascular ultrasound imaging. Finally, we discuss future perspectives, opportunities, and challenges for the holistic integration of artificial intelligence and deep learning methods in ultrasound systems.

II The Ultrasound Imaging Chain at a Glance

II-A Transmit schemes

The resolution, contrast, and overall fidelity of ultrasound pulse-echo imaging rely on careful optimization across the entire imaging chain. At the front-end, imaging starts with the design of appropriate transmit schemes.

At this stage, crucial trade-offs are made, in which the frame rate, imaging depth, and attainable axial and lateral resolution are weighed carefully against each other: improved resolution can be achieved through the use of higher pulse modulation frequencies; yet, these shorter wavelengths suffer from increased absorption and thus lead to reduced penetration depth. Likewise, high frame rate can be reached by exploiting parallel transmission schemes based on e.g. planar or diverging waves. However, use of such unfocused transmissions comes at the cost of loss in lateral resolution compared to line-based scanning with tightly focused beams. As such, optimal transmit schemes depend on the application.

Today, an increasing number of ultrasound applications rely on high frame-rate (dubbed ultrafast) imaging. Among these are e.g. ultrasound localization microscopy (see Section IV), highly-sensitive Doppler, and shear wave elastography. Where the former two mostly exploit the sheer volume of data to obtain accurate signal statistics, the latter leverages high-speed imaging to track ultrasound-induced shear waves propagating at several meters per second.

With the expanding use of ultrafast transmit sequences in modern ultrasound imaging, a strong burden is placed on the subsequent receive channel processing. High data rates not only raise substantial hardware complications related to power consumption, data storage and data transfer, but the corresponding unfocused transmissions also require much more advanced receive beamforming and clutter suppression to reach satisfactory image quality.

II-B Receive processing, sampling and beamforming

Modern receive channel processing is shifting towards the digital domain, relying on computational power and very-high-bandwidth communication channels to enable advanced digital parallel (pixel-based) beamforming and coherent compounding across multiple transmit/receive events. For large channel counts, e.g. in dense matrix probes that facilitate high-resolution 3D imaging, the number of coaxial cables required to connect all probe elements to the back-end system quickly becomes infeasible. To address this, dedicated switching and processing already takes place in the probe head, e.g. in the form of multiplexing or microbeamforming. In ultrasound imaging we make a distinction between slow- and fast-time: slow-time refers to a sequence of snapshots (i.e., across multiple transmit/receive events) at the pulse repetition rate, whereas fast-time refers to samples along depth. Slow-time multiplexing distributes the received channel data across multiple transmits, by only communicating a subset of the number of channels to the back-end for each such transmit. This consequently reduces the achieved frame rate. In microbeamforming, an analog pre-beamforming step is performed to compress channel data from multiple (adjacent) elements into a single focused line. This however impairs flexibility in subsequent digital beamforming, limiting the achievable image quality. Other approaches aim at mixing multiple channels through analog modulation with chipping sequences [9]. Additional analog processing includes signal amplification by a low-noise amplifier (LNA) as well as depth (i.e. fast-time) dependent gain compensation (TGC) for attenuation correction.

Digital receive beamforming in ultrasound imaging is dynamic, i.e. receive focusing is dynamically optimized based on the scan depth. The industry standard is delay-and-sum beamforming, where depth-dependent channel tapering (or apodization) is optimized and fine-tuned based on the system and application. Delay-and-sum beamforming is commonplace due to its low complexity, providing real-time image reconstruction, albeit at a high sampling rate and non-optimal image quality.
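As a rough illustration of delay-and-sum, the sketch below beamforms a single pixel from plane-wave channel data. It is a minimal sketch under stated assumptions, not a production beamformer: the function name `delay_and_sum_pixel`, the 0-degree plane-wave transmit, the linear interpolation of delays, and the synthetic single-scatterer data are all illustrative choices that do not come from the paper.

```python
import numpy as np

def delay_and_sum_pixel(rf, element_x, pixel_x, pixel_z, fs, c=1540.0, apod=None):
    """Beamform one pixel (hypothetical helper, 0-degree plane-wave transmit assumed).

    rf        : (n_elements, n_samples) channel data, sample 0 at transmit time
    element_x : (n_elements,) lateral element positions in metres
    fs        : sampling rate in Hz; c : assumed constant speed of sound in m/s
    apod      : (n_elements,) apodization weights, uniform if None
    """
    n_elements, n_samples = rf.shape
    if apod is None:
        apod = np.full(n_elements, 1.0 / n_elements)
    # Time of flight: plane wave down to the pixel, then back to each element.
    t_tx = pixel_z / c
    t_rx = np.sqrt((pixel_x - element_x) ** 2 + pixel_z ** 2) / c
    sample_idx = (t_tx + t_rx) * fs
    # Linear interpolation gives sub-sample delay resolution.
    samples = np.arange(n_samples)
    aligned = np.array([
        np.interp(sample_idx[i], samples, rf[i], left=0.0, right=0.0)
        for i in range(n_elements)
    ])
    # Weighted sum of the time-aligned echoes.
    return float(np.sum(apod * aligned))

# Synthetic data: one point scatterer at (x, z) = (0, 20 mm), ideal impulses.
fs, c = 40e6, 1540.0
element_x = np.linspace(-0.01, 0.01, 32)
rf = np.zeros((32, 2048))
delays = (0.02 + np.sqrt(element_x ** 2 + 0.02 ** 2)) / c * fs
rf[np.arange(32), np.round(delays).astype(int)] = 1.0

on_target = delay_and_sum_pixel(rf, element_x, 0.0, 0.02, fs, c)
off_target = delay_and_sum_pixel(rf, element_x, 0.005, 0.015, fs, c)
```

In this toy setting the echoes add coherently at the scatterer position and vanish at the off-target pixel. The fixed `apod` vector is the kind of static weighting that adaptive methods replace with data-dependent weights.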

Performing beamforming in the digital domain requires sampling the signals received at the transducer elements and transmitting the samples to a back-end processing unit. To achieve sufficient delay resolution for focusing, the received signals are typically sampled at 4-10 times their bandwidth, i.e., the sampling rate may far exceed the Nyquist rate. A possible approach for sampling rate reduction is to consider the received signals within the framework of finite rate of innovation (FRI) [10, 11]. Tur et al. [12] modeled the received signal at each element as a finite sum of replicas of the transmitted pulse backscattered from reflectors. The replicas are fully described by their unknown amplitudes and delays, which can be recovered from the signals’ Fourier series coefficients. The latter can be computed from low-rate samples of the signal using compressed sensing (CS) techniques [10, 13]. In [14, 15], the authors extended this approach and introduced compressed beamforming. It was shown that the beamformed signal follows an FRI model and thus can be reconstructed from a linear combination of the Fourier coefficients of the received signals. Moreover, these coefficients can be obtained from low-rate samples of the received signals taken according to the Xampling framework [16, 17, 18]. Chernyakova et al. showed that this Fourier domain relationship between the beam and the received signals holds irrespective of the FRI model. This leads to a general concept of frequency domain beamforming (FDBF) [3], which is equivalent to beamforming in time. FDBF allows sampling the received signals at their effective Nyquist rate without assuming a structured model, and thus avoids the oversampling dictated by digital implementation of beamforming in time. Furthermore, when assuming that the beam obeys an FRI model, the received signals can be sampled at sub-Nyquist rates, leading to up to a 28-fold reduction in sampling rate [19, 20, 21].

II-C B-mode, M-mode, and Doppler

Ultrasound imaging provides anatomical information through the so-called Brightness-mode (B-mode). B-mode imaging is performed by envelope-detecting the beamformed signals, e.g. through calculation of the magnitude of the complex in-phase and quadrature (IQ) data. For visualization purposes, the dynamic range of these envelope-detected signals is subsequently compressed via a logarithmic transformation, or specifically-designed compression curves based on a look-up table. Scan conversion then maps these intensities to the desired (Cartesian) pixel coordinate system. The visualization of a single B-mode scan line (i.e. brightness over fast time) across multiple transmit-receive events (i.e. slow-time), is called motion-mode (M-mode) imaging.
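The envelope detection and logarithmic compression steps can be sketched as follows. This is an illustrative example only: the function name `bmode_from_iq`, the normalization to the frame maximum, and the 60 dB display range are assumptions rather than settings taken from the paper, and scan conversion is omitted.

```python
import numpy as np

def bmode_from_iq(iq, dynamic_range_db=60.0):
    """Map beamformed complex IQ data to log-compressed B-mode values in dB.

    Hypothetical helper: output lies in [-dynamic_range_db, 0], with 0 dB
    assigned to the strongest sample in the frame.
    """
    # Envelope detection: magnitude of the complex IQ signal.
    envelope = np.abs(iq)
    peak = envelope.max()
    if peak == 0:
        return np.full(envelope.shape, -dynamic_range_db)
    # Logarithmic compression, clipped at the lower end of the display range.
    floor = 10.0 ** (-dynamic_range_db / 20.0)
    return 20.0 * np.log10(np.maximum(envelope / peak, floor))

iq = np.array([[1.0 + 0j, 0.1j], [0.001, 0.0]])
bmode = bmode_from_iq(iq)
```

For the small example frame, an amplitude ten times weaker than the peak maps to -20 dB, and anything at or below the display floor is clipped to -60 dB.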

Beyond anatomical information, ultrasound imaging also permits the measurement of functional parameters related to blood flow and tissue displacement. The extraction of such velocity signals is called Doppler processing. We distinguish between two types of velocity estimators: Color Doppler and Spectral Doppler. Color Doppler provides an estimate of the mean velocity through evaluation of the first lag of the autocorrelation function for a series of snapshots across slow-time [22]. Spectral Doppler provides the entire velocity distribution in a specified image region through estimation of the full power spectral density, and visualizes its evolution over time in a spectrogram [23]. Spectral Doppler methods are relevant for e.g. detecting turbulent flow in stenotic arteries or across heart valves. Besides assessing blood flow, Doppler processing also finds applications in measurement of tissue velocities (tissue Doppler), e.g. for assessment of myocardial strain.
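The lag-one autocorrelation estimate used in color Doppler can be sketched for a single pixel as below. The function name `autocorrelation_velocity`, the sign convention, and the example parameters (pulse repetition frequency, center frequency, speed of sound) are assumptions made for illustration. The conversion uses the standard Doppler relation $f_{d} = 2 v f_{0} / c$.

```python
import numpy as np

def autocorrelation_velocity(iq_slow_time, prf, f0, c=1540.0):
    """Mean axial velocity from the first lag of the slow-time autocorrelation.

    iq_slow_time : complex IQ samples of one pixel across transmit events
    prf          : pulse repetition frequency in Hz
    f0           : transmit center frequency in Hz
    """
    x = np.asarray(iq_slow_time)
    # First lag of the autocorrelation across slow-time.
    r1 = np.sum(x[1:] * np.conj(x[:-1]))
    # Mean phase advance per pulse, converted to a Doppler shift in Hz.
    doppler_shift = np.angle(r1) * prf / (2.0 * np.pi)
    # Doppler equation: f_d = 2 v f0 / c.
    return c * doppler_shift / (2.0 * f0)

# Synthetic slow-time signal with a pure 500 Hz Doppler shift.
n = np.arange(16)
iq = np.exp(2j * np.pi * 500.0 * n / 4000.0)
velocity = autocorrelation_velocity(iq, prf=4000.0, f0=5e6)
```

Because only the phase of a single autocorrelation lag is used, the estimate is unambiguous only while the Doppler shift stays within half the pulse repetition frequency in either direction.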

II-D Advanced applications

In addition to B-mode, M-mode, and Doppler scanning, ultrasound data is used in a number of advanced applications. For instance, elastography methods aim at measuring mechanical parameters related to tissue elasticity, and rely on analysis of displacements following some form of imposed stress. Stress may be delivered manually (through gentle pushing), naturally (e.g. in the myocardium of a beating heart) or acoustically, as done in acoustic radiation force impulse imaging (ARFI) [24]. Alternatively, the speed of laterally traveling shear waves induced by an acoustic push-pulse can be measured, with this speed being directly related to the shear modulus [6]. Shear wave elasticity imaging (SWEI) also permits measurement of tissue viscosity in addition to stiffness through assessment of wave dispersion [25]. All the above methods rely on adequate measurement of local tissue velocity or displacement through some form of tissue Doppler processing.

While Doppler methods enable estimation of blood flow, detection of low-velocity microvascular flow is challenging since its Doppler spectrum overlaps with that of the strong tissue clutter. Contrast-enhanced ultrasound (CEUS) permits visualization and characterization of microvascular perfusion through the use of gas-filled microbubbles [26, 27]. These intravascular bubbles are sized similarly to red blood cells, reaching the smallest capillaries in the vascular net, and exhibit a particular nonlinear response when insonified. The latter is specifically exploited in contrast-enhanced imaging schemes, which aim at isolating this nonlinear response through dedicated pulse sequences. Unfortunately, this does not lead to complete tissue suppression, since tissue itself also generates harmonics [28]. Thus, clutter rejection algorithms are becoming increasingly popular, in particular when used in conjunction with ultrafast imaging [29].

Recent developments also leverage the microbubbles used in CEUS to yield super-resolution imaging [30, 31, 32, 33]. Ultrasound localization microscopy (ULM) is a particularly popular approach to achieve this [8]. ULM methods rely on adequate detection, isolation and localization of the microbubbles, typically achieved through precisely tuned tissue clutter suppression algorithms and by posing strong constraints on the allowable concentrations. We will further elaborate on this approach and its limitations in Section IV, where we discuss a dedicated deep learning solution for super-resolution ultrasound that aims at addressing some of these disadvantages.

III Deep learning for (Front-end) ultrasound processing

The effectiveness of ultrasound imaging and its applications is dictated by adequate front-end beamforming, compression, signal extraction (e.g. clutter suppression) and velocity estimation. In this section we demonstrate how neural networks, being universal function approximators [34], can learn to act as powerful artificial agents and signal processors across the imaging chain to improve resolution and contrast, adequately suppress clutter, and enhance spectral estimation. We here refer to artificial agents [35] whenever these learned networks impact the processing chain by actively and adaptively changing the settings or parameters of a particular processor depending on the context.

Deep learning is the process of learning a hierarchy of parameterized nonlinear transformations (or layers) such that it performs a desired function. These elementary nonlinear transformations in a deep network can take many forms and may embed structural priors. A popular example of the latter is the translational invariance in images that is exploited by convolutional neural networks, but we will see that in fact many other structural priors can be exploited.

The methods proposed throughout this work are both model-based and learn from data. We complement this approach with a-priori knowledge on signal structure, to develop deep learning models that are both effective and data-efficient, i.e. ‘fast learners’. An overview is given in Fig. 1. We assume that the reader is familiar with the basics of (deep) neural networks. For a general introduction to deep learning, we refer the reader to [36].

III-A Beamforming

III-A1 Deep neural networks as beamformers

The low complexity of delay-and-sum beamforming has made it the industry standard and commonplace for real-time ultrasound beamforming. There are however a number of factors that cause deteriorated reconstruction quality of this naive spatial filtering strategy. First, the channel delays for time-of-flight correction are based on the geometry of the scene and assume a constant speed of sound across the medium. As a consequence, variations in speed of sound and resulting aberrations impair proper alignment of echoes stemming from the same scatterer [37]. Second, the a-priori determined channel weighting (apodization) of pseudo-aligned echoes before summation requires a trade-off between main-lobe width (resolution) and side-lobe level (leakage) [38].

Delay-and-sum beamformers are typically hand-tailored based on knowledge of the array geometry and medium properties, often including specifically designed array apodization schemes that may vary across imaging depth. Interestingly, it is possible to learn the delays and apodizations from paired channel-image data through gradient descent by dedicated “delay layers” [39]. To show this, unfocused channel data was obtained from echocardiography of six patients for both single-line and multi-line acquisitions. While the latter allows for increased frame rates, it leads to deteriorated image quality when applying standard delay-and-sum beamforming. The authors therefore propose to train a more appropriate delay-and-sum beamforming chain that takes multi-line channel data as an input, and produces beamformed images that are as close as possible to those obtained from single-line acquisitions, minimizing their $\ell_{1}$ distance. Since the introduced delay and apodization layers are differentiable, efficient learning is enabled through backpropagation. Although such an approach potentially enables discovery of a better set of parameters dedicated to each application, the fundamental problem of having a-priori-determined static delays and weights remains.

Several other data-driven beamforming methods have recently been proposed. In contrast to [39], these are mostly based on “general-purpose” deep neural networks, such as stacked autoencoders [40], encoder-decoder architectures [41], and fully-convolutional networks that map pre-delayed channel data to beamformed outputs [42]. In the latter, a 29-layer convolutional network was applied to a 3D stack of array response vectors for all lateral positions and a set of depths, to yield a beamformed in-phase and quadrature output for those lateral positions and depths. Others exploit neural networks to process channel data in the Fourier domain [43]. To that end, axially gated sections of pre-delayed channel data first undergo discrete Fourier-transformation. For each frequency bin, the array responses are then processed by a separate fully connected network. The frequency spectra are subsequently inverse Fourier-transformed and summed across the array to yield a beamformed radiofrequency signal associated to that particular axial location. The networks were specifically trained to suppress off-axis responses (outside the first nulls of the beam) from simulations of ultrasound channel data for point targets.

Beyond beamforming for suppression of off-axis scattering, the authors in [44] propose deep convolutional neural networks for joint beamforming and speckle reduction. Rather than applying the latter as a post-processing technique, it is embedded in the beamforming process itself, permitting exploitation of both channel and phase information that is otherwise irreversibly lost. The network was designed to accept 16 beamformed subaperture radio frequency (RF) signals as an input, and outputs speckle-reduced B-mode images. The final beamformed images exhibit speckle reduction comparable to that of delay-and-sum images post-processed with the optimized Bayesian nonlocal means algorithm [45], yet at an improved resolution. Additional applications of deep learning in this context include removal of artifacts in time-delayed and phase-rotated element-wise I/Q data in multi-line acquisitions for high-frame-rate imaging [46], and synthesizing multi-focus images from single-focus images through generative adversarial networks [47]. In [48], such generative adversarial networks were used for joint beamforming and segmentation of cyst phantoms from unfocused RF channel data acquired after a single plane-wave transmission.

While the flexibility and capacity of very deep neural networks in principle allow for learning context-adaptive beamforming schemes, such highly overparameterized networks notoriously rely on vast amounts of RF channel data to yield robust inference under a wide range of conditions. Moreover, large networks have a large memory footprint, complicating resource-limited implementations.

III-A2 Leveraging model-based algorithms

One approach to constraining the solution space while explicitly embedding adaptivity is to borrow concepts from model-based adaptive beamforming methods. These techniques steer away from the fixed-weight presumption and calculate an array apodization depending on the measured signal statistics. In the case of pixel-based reconstruction, apodization weights can be adaptively optimized per pixel. A popular adaptive beamforming method is the minimum variance distortionless response (MVDR), or Capon, beamformer, where optimal weights are defined as those that minimize signal variance/power, while maintaining distortionless response of the beamformer in the desired source direction. This amounts to solving:

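$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \mathbf{w}^{H}\mathbf{R}_{x}\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{H}\mathbf{a} = 1, \tag{1}$$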

where $\mathbf{R}_{x}$ denotes the covariance matrix calculated over the receiving array elements and $\mathbf{a}$ is a steering vector. When receive signals are already time-of-flight corrected, $\mathbf{a}$ is a vector of ones.

Solving (1) involves the inversion of $\mathbf{R}_{x}$ , whose computational complexity grows cubically with the number of array elements [50]. To improve stability, it is often combined with subspace selection through eigendecomposition, further increasing the computational burden. Another problem is the accurate estimation of $\mathbf{R}_{x}$ , typically requiring some form of averaging across sub-arrays and the fast- and slow-time scales. While this implementation of MVDR beamforming is impractical for typical ultrasound arrays (e.g. $256$ elements) or matrix-transducers (e.g. $64\times 64$ elements), it does provide a framework in which deep neural networks can be leveraged efficiently and effectively.
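For reference, the MVDR problem has the well-known closed-form solution $\hat{\mathbf{w}} = \mathbf{R}_{x}^{-1}\mathbf{a} / (\mathbf{a}^{H}\mathbf{R}_{x}^{-1}\mathbf{a})$, which makes the role of the matrix inversion explicit. The sketch below evaluates it with NumPy on synthetic data. The function name `mvdr_weights`, the sample-covariance estimate, and the optional diagonal loading term are illustrative assumptions, not the estimation procedure of any cited work.

```python
import numpy as np

def mvdr_weights(snapshots, steering=None, loading=0.0):
    """Closed-form MVDR weights w = R^{-1} a / (a^H R^{-1} a).

    snapshots : (n_elements, n_snapshots) pre-delayed channel samples
    steering  : steering vector a, a vector of ones if None
    loading   : optional diagonal loading factor (hypothetical stabilizer)
    """
    n_elements, n_snapshots = snapshots.shape
    # Sample covariance estimate across the available snapshots.
    R = snapshots @ snapshots.conj().T / n_snapshots
    if loading > 0:
        R = R + loading * np.trace(R).real / n_elements * np.eye(n_elements)
    a = np.ones(n_elements) if steering is None else np.asarray(steering)
    # Solving the linear system is the cubic-cost step discussed in the text.
    Ri_a = np.linalg.solve(R, a)
    return Ri_a / np.vdot(a, Ri_a)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)) + 1j * rng.standard_normal((8, 64))
w = mvdr_weights(x)
uniform = np.ones(8) / 8

def output_power(weights):
    # Average power of the beamformed output, i.e. w^H R w.
    return float(np.mean(np.abs(weights.conj() @ x) ** 2))
```

Both `w` and the uniform delay-and-sum weights satisfy the distortionless constraint, but `w` yields the lower output power on these snapshots, at the price of one linear solve per set of weights.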

Instead of attempting to replace the beamforming process entirely, a neural network can be used specifically to act as an artificial agent that calculates the optimal apodization weights $\mathbf{w}$ for each pixel, given the received pre-delayed channel signals at the array. By only replacing this bottleneck component in the MVDR beamformer, and constraining the problem further by enforcing close-to-distortionless response during training (i.e. $\Sigma_{i}w_{i}\approx 1$ ), this solution is highly data-efficient, interpretable, and has the ability to learn powerful models from only few images [49].

The neural network proposed in [49] is compact, consisting of four fully connected layers comprising 128 nodes for the input and output layers, and 32 nodes for the hidden layers. This dimensionality reduction enforces compact representation of the data, mitigating the impact of noise. Between every fully connected layer, dropout is applied with a probability of 0.2. The input of the network is the pre-delayed (focused) array response for a particular pixel (i.e. a vector of length $N$ , with $N$ being the number of array elements), and its outputs are the corresponding array apodizations $\mathbf{w}$ . This apodization is subsequently applied to the network inputs to yield a beamformed pixel. Since pixels are processed independently by the network, a large amount of training data is available per acquisition. Inference is fast and real-time rates are achievable on a GPU-accelerated system. For an array of 128 elements, adaptive calculation of a set of apodization weights through MVDR requires $>N^{3}(=2,097,152)$ floating point operations (FLOPs), while the deep-learning architecture only requires 74,656 FLOPs [49], leading to a more than $400\times$ speed-up in reconstruction time. Additional details regarding the adopted network and training strategy are given in Section III-A3.
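The data flow described above (array response in, apodization weights out, weights applied back to the input) can be sketched as follows. This is a structural illustration with untrained random weights, not the trained model of [49]. The helper names, the real-valued input, the omission of dropout at inference, and the assumption that each anti-rectifier activation doubles the feature width seen by the next layer are all choices made for this sketch.

```python
import numpy as np

def antirectifier(x, eps=1e-12):
    # Mean-center, L2-normalize, then concatenate positive and negative parts.
    x = x - x.mean()
    x = x / (np.linalg.norm(x) + eps)
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])

def init_layers(n_input, layer_sizes, rng):
    """Random (untrained) dense layers; hypothetical initialization."""
    layers, n_in = [], n_input
    for n_out in layer_sizes:
        W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
        layers.append((W, np.zeros(n_out)))
        n_in = 2 * n_out  # the anti-rectifier doubles the feature width
    return layers

def predict_apodization(x, layers):
    h = x
    for W, b in layers[:-1]:
        h = antirectifier(W @ h + b)
    W, b = layers[-1]
    return W @ h + b  # linear output: one weight per array element

rng = np.random.default_rng(0)
n_elements = 128
layers = init_layers(n_elements, [128, 32, 32, 128], rng)

x = rng.standard_normal(n_elements)   # pre-delayed array response of one pixel
w = predict_apodization(x, layers)    # content-dependent apodization
pixel = float(w @ x)                  # beamformed pixel value
```

Because the network only predicts weights and the final pixel is still a weighted sum of the measured channel data, the output remains tied to the physics of the measurement.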

Fig. 2 exemplifies the effectiveness of this approach on plane-wave ultrasound acquisitions obtained using a linear array transducer. Compared to standard delay-and-sum, adaptive beamforming with a deep network serving as an artificial agent visually provides reduced clutter and enhanced tissue contrast. Quantitatively it yields a slightly elevated contrast-to-noise ratio (10.96 dB vs 11.48 dB), along with significantly improved resolution (0.43 mm vs 0.34 mm, and 0.85 mm vs 0.70 mm in the axial and lateral directions, respectively).

Interestingly, the neural network exhibits increased stability and robustness compared to the MVDR weight estimator. This can be attributed to its small bottleneck latent space, enforcing apodization weight realizations that are represented in a compact basis.

III-A3 Design and training considerations

The large dynamic range and modulated nature of radio-frequency ultrasound channel data motivates the use of specific nonlinear activation functions. While rectified linear units (ReLUs) are typically used in image processing, popular for their sparsifying nature and ability to avoid vanishing gradients due to their positive unbounded output, they inherently cause many ‘dying nodes’ (neurons that no longer update since their gradient is zero) for ultrasound channel data, as a ReLU does not preserve the abundant negative values. To circumvent this, a hyperbolic tangent function could be used. Unfortunately, the large dynamic range of ultrasound signals makes it difficult to stay in the ‘sweet spot’ where gradients are sufficiently large, so gradients tend to vanish during backpropagation across multiple layers.

A powerful alternative that is by nature unbounded and preserves both positive and negative values is the class of concatenated rectified linear units [51]. A particular case is the anti-rectifier function:

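$$f(\mathbf{x})=\begin{bmatrix}\left[\dfrac{\mathbf{x}-\bar{\mathbf{x}}}{||\mathbf{x}-\bar{\mathbf{x}}||_{2}}\right]_{+}\\[4pt]\left[-\dfrac{\mathbf{x}-\bar{\mathbf{x}}}{||\mathbf{x}-\bar{\mathbf{x}}||_{2}}\right]_{+}\end{bmatrix}, \tag{2}$$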

where $[\cdot]_{+}=\textrm{max}(\cdot,0)$ is the positive part operator, $\mathbf{x}$ is a vector containing the linear responses of all neurons (before activation) at a particular layer, and $\bar{\mathbf{x}}$ is its mean value across all those neurons. The anti-rectifier does not suffer from vanishing gradients, nor does it lead to dying nodes for negative values, yet provides the nonlinearity that facilitates learning complex models and representations. This dynamic-range preserving activation scheme is therefore well-suited for processing radio-frequency or IQ-demodulated ultrasound channel data, and is also used for the results presented in Fig. 2. These advantages come at the cost of a higher computational complexity compared to a standard ReLU activation.
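As a rough illustration, the anti-rectifier can be sketched in NumPy by mean-centring the pre-activation vector, normalising it by its $\ell_{2}$ norm, and concatenating its positive and negative parts. The function name `anti_rectifier`, the small constant `eps` that guards against division by zero, and the restriction to a single real-valued vector are assumptions made here for illustration; they are not taken from [49] or [51].

```python
import numpy as np

def anti_rectifier(x, eps=1e-12):
    """Concatenate the positive and negative parts of the centred, normalised input."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()                               # subtract the mean across neurons
    normed = centred / (np.linalg.norm(centred) + eps)   # eps is an illustrative safeguard
    # [.]_+ applied to +normed and to -normed, then stacked into one vector
    return np.concatenate([np.maximum(normed, 0.0), np.maximum(-normed, 0.0)])

x = np.array([3.0, -1.0, -2.0, 4.0])
y = anti_rectifier(x)
```

The output has twice the length of the input, and the difference between its two halves recovers the normalised input, so neither sign is discarded.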

When training a neural-network-based ultrasound beamforming algorithm, it is important to consider the impact of subsequent signal transformations in the processing chain. In particular, envelope-detected beamformed signals typically undergo significant dynamic range compression (e.g. through a logarithmic transformation) to project the high dynamic range of backscattered ultrasound signals onto the limited dynamic range of a display, and allow for improved interpretation and diagnostics. To incorporate this aspect in the neural network’s training loss, beamforming errors can be transformed to attain a mean squared logarithmic error:

[TABLE]

where $\hat{\mathbf{y}}$ is a vector containing the neural-network-based prediction of the beamformed responses for all pixels, and $\mathbf{y}$ contains the target beamformed signals. For our model-based adaptive beamforming solution [49], $\mathbf{y}$ contains the MVDR beamformer outputs for each pixel, and $\hat{\mathbf{y}}$ is the corresponding set of pixel responses after application of the apodization weights calculated by the neural network.

III-B Adaptive spectral estimation for spectral Doppler

As mentioned in Section II, beamformed ultrasound signals are not only used to visualize anatomical information in B-mode; they also permit the extraction of velocities by processing subsequent frames across slow-time.

Spectral Doppler ultrasound enables measurement of blood (and tissue) velocity distributions through the generation of a Doppler spectrogram from slow-time data sequences, i.e. a series of subsequent pulse-echo snapshots. In commercial systems, spectra are estimated using Fourier-transform-based periodogram methods, e.g. the standard Welch approach. Such techniques however require long observation windows (denoted as ‘coherent processing intervals’) to achieve high spectral resolution and mitigate spectral leakage. This deteriorates the temporal resolution.

Data-adaptive spectral estimators alleviate the strong time-frequency resolution tradeoff, providing superior spectral estimates and resolution for a given temporal resolution [52]. The latter is determined by the coherent processing interval, which is in turn defined by the pulse repetition frequency and the number of slow-time snapshots required for a spectral estimate. Adaptive approaches steer away from the standard periodogram methods, and rely on content-matched filterbanks. The filter coefficients for each frequency of interest $\omega$ are adaptively tuned to e.g. minimize signal energy while being constrained to unity frequency response. This Capon spectral estimator is given by solving [52]:

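$$\hat{\mathbf{w}}_{\omega}=\underset{\mathbf{w}}{\arg\min}\;\mathbf{w}^{H}\mathbf{R}_{y}\mathbf{w}\quad\textrm{s.t.}\quad\mathbf{w}^{H}\mathbf{e}_{\omega}=1, \tag{4}$$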

where $\mathbf{R}_{y}$ is the covariance matrix of the (slow-time) input signal vector $\mathbf{y}$ , and $\mathbf{e}_{\omega}$ is the corresponding Fourier vector. While this adaptive spectral estimator indeed improves upon standard approaches and significantly lowers the required observation window while gaining spectral fidelity, it unfortunately suffers from high computational complexity stemming from the need for inversion of the signal covariance matrix.
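To make the cost of the covariance inversion concrete, the following sketch evaluates a Capon-type estimate with the standard closed-form solution of the constrained minimization, $\mathbf{w}=\mathbf{R}_{y}^{-1}\mathbf{e}_{\omega}/(\mathbf{e}_{\omega}^{H}\mathbf{R}_{y}^{-1}\mathbf{e}_{\omega})$. The function names, the sample covariance built from overlapping sub-vectors, and the diagonal loading term are assumptions chosen for illustration and are not necessarily those used in [52] or [53].

```python
import numpy as np

def capon_filter(R, omega):
    """Filter minimising w^H R w subject to w^H e_omega = 1."""
    e = np.exp(1j * omega * np.arange(R.shape[0]))   # Fourier vector e_omega
    Rinv_e = np.linalg.solve(R, e)                   # R^{-1} e without forming the inverse
    return Rinv_e / np.vdot(e, Rinv_e)               # vdot conjugates its first argument

def capon_spectrum(y, filter_len, omegas, loading=1e-3):
    """Capon power estimate for each frequency in `omegas` (rad/sample)."""
    y = np.asarray(y, dtype=complex)
    # Assumption: estimate R_y from overlapping sub-vectors of the slow-time signal.
    snaps = np.array([y[k:k + filter_len] for k in range(len(y) - filter_len + 1)])
    R = snaps.T @ snaps.conj() / snaps.shape[0]
    # Assumption: light diagonal loading keeps the linear solve well conditioned.
    R = R + loading * np.real(np.trace(R)) / filter_len * np.eye(filter_len)
    power = np.empty(len(omegas))
    for i, omega in enumerate(omegas):
        w = capon_filter(R, omega)
        power[i] = np.real(np.vdot(w, R @ w))        # output energy w^H R w
    return power

rng = np.random.default_rng(0)
n = np.arange(64)                                    # 64 slow-time samples
y = np.exp(1j * 0.8 * n) + 0.1 * (rng.standard_normal(64) + 1j * rng.standard_normal(64))
omegas = np.linspace(-np.pi, np.pi, 128, endpoint=False)
spectrum = capon_spectrum(y, filter_len=16, omegas=omegas)
```

Each frequency bin requires a linear solve with the covariance matrix, which is the step a learned estimator would aim to avoid.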

As for the MVDR beamformer (Section III-A), we here demonstrate that neural networks can also be exploited to provide fast estimators for the optimal matched filter coefficients, acting as an artificial agent. An overview of this approach is given in Fig. 3, for pulsed-wave phantom data of the arteria femoralis [53]. The neural network takes a beamformed slow-time RF signal as input, and outputs a set of filter coefficients for each filter in the filterbank. The slow-time input signal is then passed through this filterbank to attain a spectral estimate. The neural network is trained by minimizing the mean squared logarithmic error (3) between the resulting spectrum and the output spectrum of the high-quality adaptive Capon spectral estimator. It comprises 128 4-layer fully-connected subnetworks, each predicting the coefficients for one of the 128 filters in the filterbank. The optimization problem is regularized by penalizing deviations from unity frequency response (4). The length of the slow-time observation window was only 64 samples, taken from a single depth sample. Compared to Welch’s periodogram-based method, adaptive spectral estimation by deep learning achieves far less spectral leakage, and higher spectral resolution (Fig. 3b and c).

Training the artificial agent is subject to considerations similar to those outlined in Section III-A3. First, slow-time input samples have a large dynamic range such that a non-saturating activation scheme is preferred (2). Second, Doppler spectra are typically presented in decibels, advocating for the use of a log-transformed training loss as in (3). Third, training is regularized by adding an additional loss to penalize predicted filterbanks that deviate from unity frequency response.

The above approach is designed to process uniformly sampled slow-time signals. In practice, there is a desire to expand these techniques to estimators that have the ability to cope with ‘gaps’, or even sparsely sampled signals, since spectral Doppler processing is typically interleaved with B-mode imaging for navigation purposes (Duplex mode). To that end, extensions of data-adaptive estimators for periodically gapped data [54], and recovery for nested slow-time sampling [55] can be used.

III-C Compressive encodings for tissue Doppler

From a hardware perspective, a significant challenge for the design of ultrasound devices and transducers is coping with the limited cable bandwidth and related connectivity constraints [56]. This is particularly troublesome for catheter transducers used in interventional applications (e.g. intra-vascular ultrasound or intra-cardiac echography), where data needs to pass through a highly restricted number of cables. While this is less of a concern for transducers with only a few elements, the number of transducer elements has expanded greatly in recent devices to facilitate high-resolution 2D or 3D imaging [57]. Beyond the limited capacity of miniature devices, (future) wireless transducers will pose similar constraints on data rates [58]. Today, front-end connectivity and bandwidth challenges are addressed through e.g. application-specific integrated circuits that perform microbeamforming [59] or simple summation of the receive signals across neighbouring elements [60] to compress the full channel data into a manageable amount, and multiplexing of the receive signals. This inherently entails information loss, and typically leads to reduced image quality.

Instead of Nyquist-rate sampling of pre-beamformed and multiplexed channel data, compressive sub-Nyquist sampling methods permit reduced-rate imaging without sacrificing quality [3, 19]. After (reduced-rate) digitization, additional compression may be achieved through neural networks that serve as application-specific encoders. Advances in low-power neural edge computing may permit placing such a trained encoder at the probe, further alleviating probe-scanner communication, and a subsequent high-end decoder at the remote processor [61].

Instead of aiming at decoding the full input signals from the encoded representation, one can also envisage decoding only a specific signal or source that is to be extracted from the input. This may enable stronger compression during encoding whenever this component has a more restricted entropy than the full signal. In ultrasound imaging, such signal-extracting compressive deep encoder-decoders can e.g. be used for velocity estimation in colour Doppler [62]. Fig. 4 shows how these networks enable decoding of tissue Doppler signals from encoded IQ-demodulated input data acquired in an in-vivo open-chest experiment of a porcine model, using intra-cardiac diverging-wave imaging in the right atrium at a frame rate of 474 Hz.

Here the encoding neural network comprised a series of three identical blocks, each composed of two subsequent convolutional layers across fast- and slow-time, followed by an aggregation of this processing through spatial downsampling (max pooling). The decoder had a similar, mirrored, architecture. The degree of IQ data compression achieved by the encoder can be changed by varying the number of channels (in the context of image processing often referred to as feature maps) at the latent layer. The encoder and decoder network parameters can then be learnt by mimicking the phase (and therewith, velocity) estimates obtained using the well-known Kasai autocorrelator on the full input data (see Fig. 4b). Interestingly, IQ compression rates as high as 32 can be achieved (see Fig. 4c), while retaining reasonable Doppler signal quality, yielding a relative phase root-mean-squared-error of approximately $0.02$ . These errors drop when requiring lower compression rates. Higher compression rates lead to an increased degree of spatial consistency, displaying fewer spurious variations which could not be represented in the compact latent encoding.
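For reference, the training target mentioned above can be sketched as follows: the Kasai estimator takes the phase of the lag-one autocorrelation of the IQ data along slow-time. The function names, the convention that slow-time is the last array axis, the default speed of sound, and the sign convention of the velocity are assumptions made for this illustration.

```python
import numpy as np

def kasai_phase(iq):
    """Mean Doppler phase shift per slow-time sample (radians), slow-time on the last axis."""
    iq = np.asarray(iq, dtype=complex)
    # Lag-one autocorrelation summed over the slow-time ensemble
    r1 = np.sum(iq[..., 1:] * np.conj(iq[..., :-1]), axis=-1)
    return np.angle(r1)

def phase_to_velocity(phase, prf, f0, c=1540.0):
    """Axial velocity in m/s for pulse repetition frequency `prf` and centre frequency `f0` (Hz)."""
    # The sign convention (towards or away from the probe) is an assumption here.
    return c * prf * phase / (4.0 * np.pi * f0)

n = np.arange(16)
iq = np.exp(1j * 0.3 * n)      # synthetic ensemble with a constant phase step of 0.3 rad
phase = kasai_phase(iq)
```

In practice the sum would also run over a fast-time range gate, which is exactly the spatiotemporal support that has to be tuned by hand in conventional estimators.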

The design of traditional Doppler estimators involves careful optimization of the slow- and fast-time range gates across which the estimation is performed, amounting to a trade-off between the estimation quality and spatiotemporal resolution [22]. For many practical applications, the optimal settings not only vary across measurements and desired clinical objectives, but also within a single measurement. In contrast, a convolutional encoder-decoder network can learn to determine the effective spatiotemporal support of the given input data required for adequate Doppler encoding and prediction.

III-D Unfolding Robust PCA for clutter suppression

An important ultrasound-based modality is contrast-enhanced ultrasound (CEUS) [63], which allows the detection and visualization of small blood vessels. In particular, CEUS is used for imaging perfusion at the capillary level [64, 65], and for estimating different properties of the blood such as relative volume, velocity, shape and density. These physical parameters are related to different clinical conditions, including cancer [66].

The main idea behind CEUS is the use of encapsulated gas microbubbles, serving as ultrasound contrast agents (UCAs), which are injected intravenously and can flow throughout the vascular system due to their small size [67]. To visualize them, strong clutter signals originating from stationary or slowly moving tissues must be removed as they introduce significant artifacts in the resulting images [68]. The latter poses a major challenge in ultrasonic vascular imaging and various methods have been proposed to address it. In [69], a high-pass filtering approach was presented to remove tissue signals using finite impulse response (FIR) or infinite impulse response (IIR) filters. However, this approach is prone to failure in the presence of fast tissue motion. An alternative strategy is second harmonic imaging [70], which exploits the non-linear response of the UCAs to separate them from the tissue. This technique, however, does not remove the tissue completely, as tissue also exhibits a nonlinear response.

One of the most popular approaches for clutter suppression is spatio-temporal filtering based on the singular value decomposition (SVD). This strategy has led to various techniques for clutter removal [68, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]. SVD filtering includes collecting a series of consecutive frames, stacking them as vectors in a matrix, performing SVD of the matrix and removing the largest singular values, assumed to be related to the tissue. Hence, a crucial step in SVD filtering is determining an appropriate threshold which discriminates between tissue related and blood related singular values. However, the exact setting of this threshold is difficult to determine and may vary dramatically between different scans and subjects, leading to significant defects in the constructed images.
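The SVD filtering steps described above can be sketched in a few lines. The function name `svd_clutter_filter`, the array layout (depth, lateral position, frame) and the hard cut-off `n_tissue` are illustrative assumptions; choosing that cut-off is precisely the difficulty discussed in the text.

```python
import numpy as np

def svd_clutter_filter(frames, n_tissue):
    """Remove the `n_tissue` largest singular components from a stack of frames.

    frames: array of shape (nz, nx, T), with T consecutive frames.
    """
    nz, nx, T = frames.shape
    casorati = frames.reshape(nz * nx, T)        # each frame becomes one column
    U, s, Vh = np.linalg.svd(casorati, full_matrices=False)
    s_filtered = s.copy()
    s_filtered[:n_tissue] = 0.0                  # largest values assumed to be tissue
    blood = (U * s_filtered) @ Vh                # rebuild from the remaining components
    return blood.reshape(nz, nx, T)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 6, 10))         # synthetic stand-in for beamformed data
filtered = svd_clutter_filter(frames, n_tissue=2)
```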

To overcome these limitations, in [81, 82, 83], the task of clutter removal was formulated as a convex optimization problem by leveraging a low-rank-and-sparse decomposition. The authors of [81] then proposed an efficient deep learning solution to this convex optimization problem through an algorithm-unfolding strategy [84]. To enable explicit embedding of signal structure in the resulting network architecture, the following model for the signal after beamforming was proposed.

Denote the received beamformed signal at snapshot time $t$ by $D(x,z,t)$ , where $(x,z)$ are image coordinates. Then we may write:

$$D(x,z,t)=L(x,z,t)+S(x,z,t), \tag{5}$$

where the term $L(x,z,t)$ represents the tissue and $S(x,z,t)$ is the signal stemming from the blood. Similar to SVD filtering, a series of consecutive snapshots ( $t=1,...,T$ ) is acquired and stacked as vectors into a matrix, leading to the matrix model:

$${\bf D}={\bf L}+{\bf S}. \tag{6}$$

The tissue exhibits high spatio-temporal coherence, hence, the matrix $\bf L$ is assumed to be low rank. The matrix $\bf S$ is considered to be sparse, since small blood vessels sparsely populate the image plane.

These assumptions on the rank of $\bf L$ and the sparsity in $\bf S$ enable formulation of the task of clutter suppression as a robust principal component analysis (RPCA) problem [85]:

[TABLE]

where $\lambda_{1}$ and $\lambda_{2}$ are threshold parameters. The symbol $||\cdot||_{*}$ stands for the nuclear norm, which sums the singular values of $\bf L$ . The term $||\cdot||_{1,2}$ is the mixed $\ell_{1,2}$ norm [33, 86], which promotes sparsity of the blood vessels along with consistency of their locations over consecutive frames. RPCA is widely used in the area of computer vision, and can be solved iteratively using the fast iterative shrinkage/soft-thresholding algorithm (FISTA) [87], leading to the following update rules:

[TABLE]

Here $\mathcal{MT}_{\alpha}({\bf X})$ is the mixed $\ell_{1,2}$ soft-thresholding operator, which applies the function $\max(0,1-\frac{\alpha}{||{\bf x}||}){\bf x}$ to each row $\bf x$ of the input matrix $\bf X$ . Assuming the input matrix is given by its SVD ${\bf X}={\bf U\Sigma V}^{H}$ , the singular value thresholding (SVT) operator is defined as $\mathcal{ST}_{\alpha}({\bf X})={\bf U}\mathcal{S}_{\alpha}({\bf\Sigma}){\bf V}^{H}$ , where $\mathcal{S}_{\alpha}(x)=\max(0,x-\alpha)$ is applied point-wise to the singular values in $\bf\Sigma$ . A diagram of this iterative solution is given in Fig. 5a.
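The two thresholding operators defined above are the only nonlinearities in the iterative scheme, and a minimal NumPy sketch of both is given below. The function names `svt` and `mixed_soft_threshold` and the constant `eps` protecting all-zero rows are assumptions introduced for this illustration.

```python
import numpy as np

def svt(X, alpha):
    """Singular value thresholding: shrink every singular value of X by alpha."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - alpha, 0.0)) @ Vh

def mixed_soft_threshold(X, alpha, eps=1e-12):
    """Row-wise l_{1,2} soft-thresholding: scale each row x by max(0, 1 - alpha / ||x||)."""
    row_norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - alpha / np.maximum(row_norms, eps))
    return scale * X

X = np.array([[3.0, 4.0],
              [0.3, 0.4]])
S = mixed_soft_threshold(X, alpha=1.0)   # first row is shrunk, second row is zeroed
L = svt(np.diag([3.0, 1.0]), alpha=2.0)  # singular values 3 and 1 become 1 and 0
```

Rows whose norm falls below the threshold are set to zero as a whole, which is what encourages a pixel to be either active or inactive across all frames.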

As shown in Fig. 5c, the iterative solution (8) outperforms SVD filtering and leads to improved clutter suppression. However, it suffers from two major drawbacks. The threshold parameters $\lambda_{1},\lambda_{2}$ need to be properly tuned as they have a significant impact on the final result. Moreover, depending on the dynamic range between the tissue and the blood, FISTA may require many iterations to converge, thus, making it impractical for real-time imaging. This motivates the pursuit of a solution with fixed complexity in which the threshold parameters are adjusted automatically.

Such a fixed-complexity solution can be attained through unfolding [88, 89], in which a known iterative solution is unrolled as a feedforward neural network. In this case, the iterative solution is the FISTA algorithm (8), which can be rewritten as

[TABLE]

Here ${\bf W}_{1}={\bf W}_{2}={\bf I}$ , ${\bf W}_{3}={\bf W}_{6}=-{\bf I}$ and ${\bf W}_{4}={\bf W}_{5}=\frac{1}{2}{\bf I}$ . From this, the deep multi-layer network takes a form in which the $k$ th layer is given by

[TABLE]

In (10), the matrices $\left({\bf W}_{1}^{k},\cdots,{\bf W}_{6}^{k}\right)$ and the regularization parameters $\lambda_{1}^{k}$ and $\lambda_{2}^{k}$ differ from one layer to another and are learned during training. Moreover, $\left({\bf W}_{1}^{k},\cdots,{\bf W}_{6}^{k}\right)$ were chosen to be convolution kernels where $\ast$ denotes the convolution operator. The latter facilitates spatial invariance along with a notable reduction in the number of learned parameters. This results in a CNN that is specifically tailored for solving RPCA, whose non-linearities are the soft-thresholding and SVT operators, and is termed Convolutional rObust pRincipal cOmpoNent Analysis (CORONA). A diagram of a single layer from CORONA is given in Fig. 5b.

The training process of CORONA is performed by back-propagation in a supervised manner, leveraging both simulations, for which the true decomposition is known, and in-vivo data for which the decomposition of FISTA (8) is considered as the ground truth. Moreover, data augmentation is performed and the training is done on 3D patches extracted from the input measurements. The loss function was chosen as the sum of mean squared errors (MSE)

[TABLE]

where $\left\{{\bf S}_{i},{\bf L}_{i}\right\}_{i=1}^{N}$ are the ground truth and $\left\{\hat{\bf S}_{i},\hat{\bf L}_{i}\right\}_{i=1}^{N}$ are the network’s outputs. The learned parameters are denoted by ${\theta}=\left\{{\bf W}_{1}^{k},\cdots,{\bf W}_{6}^{k},\lambda_{1}^{k},\lambda_{2}^{k}\right\}_{k=1}^{K}$ where $K$ is the number of layers. Backpropagation through the SVD was done using PyTorch’s Autograd function [90].

Fig. 5 shows how CORONA effectively suppresses clutter on contrast-enhanced ultrasound scans of two rat brains, outperforming SVD filtering and RPCA through FISTA (8). The recovered CEUS (blood) signals are given in Fig. 5c, including enlarged views of regions of interest. Visually judging, FISTA achieves moderately better contrast than SVD filtering, while CORONA outperforms both approaches by a large margin. For a quantitative comparison, the contrast-to-noise ratio (CNR) and contrast ratio (CR) were assessed, defined as

[TABLE]

where $\mu_{s}$ and $\sigma_{s}^{2}$ are the mean and variance of the regions of interest in Fig. 5c, and $\mu_{b}$ and $\sigma_{b}^{2}$ are the mean and variance of the noisy reference area indicated by the yellow box. In both metrics, higher values imply higher contrast ratios, which suggests better noise suppression. FISTA obtained slightly better performance than SVD filtering (CR $\approx 5.4$ dB and $\approx 4.6$ dB, respectively) and CORONA outperformed both (CR $\approx 15$ dB). In most cases, the performance of CORONA was about an order of magnitude better than that of SVD. Thus, combining a model for the separation problem with a data-driven approach leads to improved separation of UCA and tissue signals, together with noise reduction as compared to the popular SVD approach.

The complexity of all three methods is governed by the singular-value decomposition which requires $O(MN^{2})$ FLOPS for an $M\times N$ matrix, where $M\geq N$ . However, FISTA may require thousands of iterations, i.e., thousands of such SVD operations. Hence, FISTA for RPCA is computationally significantly heavier than regular SVD-filtering. On the other hand, for CORONA, up to 10 layers were shown to be sufficient (i.e., up to 10 SVD operations), therewith offering a dramatic increase in performance at the expense of only a moderate increase in complexity. All three methods can benefit from using inexact decompositions that exhibit reduced computational load, such as the truncated SVD and randomized SVD.

IV Deep learning for super-resolution

IV-A Ultrasound localization microscopy

While the above described advances in front-end ultrasound processing can boost resolution, suppress clutter, and drastically improve tissue contrast, the attainable resolution of ultrasonography remains fundamentally limited by wave diffraction, i.e. the minimum distance between separable scatterers is half a wavelength. Simply increasing the transmit frequency to shorten the wavelength unfortunately comes at the cost of reduced penetration depth, since higher frequencies suffer from stronger absorption compared to waves with a longer wavelength. This trade-off between resolution and penetration depth particularly hampers deep high-resolution microvascular imaging, being a cornerstone for many diagnostic applications.

Recently, this trade-off was circumvented by the introduction of ultrasound localization microscopy (ULM) [91, 92]. ULM leverages principles that formed the basis for the Nobel-prize-winning concept from optics of super-resolution fluorescence microscopy, and adapts these to ultrasound imaging: if individual point-sources are well-isolated from diffraction-limited scans, and their centers subsequently precisely pinpointed on a sub-diffraction grid, then the accumulation of many such localizations over time yields a super-resolved image. In optics, stochastic ‘blinking’ of subsets of fluorophores is exploited to provide such sparse point sources. In ULM, intravascular lipid-shelled gas microbubbles fulfill this role [93]. This approach permits achieving a resolution that is up to 10 times smaller than the wavelength [8].

Since the fidelity of ULM depends on the number of localized microbubbles and the localization accuracy, it gives rise to a new trade-off that balances the required microbubble sparsity for accurate localization and acquisition time. To achieve the desired signal sparsity for straightforward isolation of the backscattered echoes, ULM is typically performed using a very diluted solution of microbubbles. On regular ultrasound systems, this constraint leads to tediously long acquisition times (on the order of hours) to cover the full vascular bed. Using an ultrafast plane-wave ultrasound system rather than regular scanning, Errico et al. performed ultrafast ULM (uULM) in a rat brain [8]. Empowered by high frame rates (500 frames per second), the acquisition time was lowered to minutes instead of hours. Ultrafast imaging indeed enables taking many snapshots of individual microbubbles as they travel through the vasculature, thereby facilitating very high-fidelity reconstruction of the larger vessels. Nevertheless, mapping the full capillary bed remains dictated by the requirement of microbubbles to pass through each of the capillaries. As such, long acquisitions of tens of minutes are required, even with uULM [94]. To boost the achieved coverage in a given time-span, methods that enable the use of higher concentrations can be leveraged [32, 33, 95, 96, 97].

IV-B Exploiting signal structure

To strongly relax the constraints on microbubble concentration and therewith cover more vessels in a shorter time, standard ULM can be extended by incorporating knowledge of the measured signal structure; in particular its sparsity in a transform domain. To that end, a received contrast-enhanced image frame can be modeled as:

$$\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{w}, \tag{11}$$

where $\mathbf{x}$ is a vector which describes the sparse microbubble distribution on a high-resolution image grid, $\mathbf{y}$ is the vectorized image frame of the ultrasound sequence, $\mathbf{A}$ is the measurement matrix where each column of $\mathbf{A}$ is the point-spread-function shifted by a single pixel on the high-resolution grid, and $\mathbf{w}$ is a noise vector.
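A one-dimensional toy version of this measurement model may help fix ideas. In the sketch below, each column of the matrix is a point-spread-function centred on one high-resolution grid point and sampled on the coarser image grid. The Gaussian shape of the point-spread-function, the function name `build_measurement_matrix`, and all numerical values are assumptions chosen purely for illustration.

```python
import numpy as np

def build_measurement_matrix(n_high, factor, psf_sigma):
    """Columns are a (hypothetical) Gaussian PSF shifted by one high-resolution pixel each."""
    high_grid = np.arange(n_high)
    # Centres of the low-resolution pixels, expressed in high-resolution coordinates
    low_centres = np.arange(0, n_high, factor) + (factor - 1) / 2.0
    return np.exp(-0.5 * ((low_centres[:, None] - high_grid[None, :]) / psf_sigma) ** 2)

rng = np.random.default_rng(0)
A = build_measurement_matrix(n_high=64, factor=4, psf_sigma=3.0)
x = np.zeros(64)
x[[20, 26]] = 1.0                                 # two closely spaced microbubbles
w = 0.01 * rng.standard_normal(A.shape[0])        # additive noise vector
y = A @ x + w                                     # low-resolution measurement
```

With these values the two sources are separated by less than the width of the point-spread-function, so their responses overlap in `y` even though `x` itself is sparse.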

Leveraging this signal prior, i.e., assuming that the microbubble distribution is sparse on a sufficiently high-resolution grid (or, the number of non-zero entries in $\mathbf{x}$ is low) we can formulate the following $\ell_{1}$ -regularized inverse problem:

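$$\hat{\mathbf{x}}=\underset{\mathbf{x}}{\arg\min}\;||\mathbf{y}-\mathbf{A}\mathbf{x}||_{2}^{2}+\lambda||\mathbf{x}||_{1}, \tag{12}$$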

where $\lambda$ is a regularization parameter that weighs the influence of $||\textbf{x}||_{1}$ .

Equation (12) may be solved using a numerical proximal gradient scheme such as FISTA [87]. We will discuss this FISTA-based solution in Section IV-C2. After estimating $\mathbf{x}$ for each frame, the estimates are summed across all frames to yield the final super-resolution image.

Beyond sparsity on a frame-by-frame basis, signal structure may also be leveraged across multiple frames. To that end, a multiple-measurement vector model [98] and its structure in a transformed domain can be considered, e.g. by assuming that a temporal stack of frames $\mathbf{x}$ is sparse in the temporal correlation domain [32, 33]. Considering the temporal dimension, sparse recovery may be improved by exploiting the motion of microbubbles, allowing the application of a prior on the spatial microbubble distribution through Kalman tracking [99].

Exploiting signal structure through sparse recovery indeed enables improved localization precision and recall for high microbubble concentrations [95, 97]. Unfortunately, proximal gradient schemes like FISTA typically require numerous iterations to converge (yielding a very time-consuming reconstruction process), and their effectiveness is strongly dependent on careful tuning of the optimization parameters (e.g. $\lambda$ and the step size). In addition, the linear model in (11) is an approximation of what is actually a nonlinear relation between the microbubble distribution and the resulting beamformed and envelope-detected image frame. While this approximation is valid for microbubbles that are sufficiently far apart, the significant image-domain implications of the radio-frequency interference patterns of very closely-spaced microbubbles cannot be neglected.

IV-C Deep learning for fast high-fidelity sparse recovery

IV-C1 Encoder-decoder architectures

In pursuit of fast and robust sparse recovery for the nonlinear measurement model, we leveraged deep learning to solve the complex inverse problem based on adequate simulations of the forward problem [95, 96]. This data-driven approach, named deep-ULM, harnesses a fully convolutional neural network to map a low-resolution input image containing many overlapping microbubble signals, to a high-resolution sparse output image in which the pixel intensities reflect recovered backscatter levels. This process is illustrated in Fig. 6a. The network comprises an encoder and a decoder, with the former expressing input frames in a latent representation, and the latter decoding such representation into a high-resolution output. The encoder is composed of a contracting path of 3 blocks, each block consisting of two successive $3\times 3$ convolution layers and one $2\times 2$ max-pooling operation. This is followed by two $3\times 3$ convolutional layers and a dropout layer that randomly disables nodes with a probability of $0.5$ to mitigate overfitting. The subsequent decoder also consists of 3 blocks; the first two blocks encompassing two $5\times 5$ convolution layers, of which the second has an output stride of 2, followed by $2\times 2$ nearest-neighbour up-sampling. The last block consists of two convolution layers, of which the second again has an output stride of 2, preceding another $5\times 5$ convolution which maps the feature space to a single-channel image through a linear activation function. All other activation functions in the network were leaky rectified linear units [100]. The full deep encoder-decoder network (see Fig. 6a) effectively scales the input image dimensions up by a factor 8, and provides a powerful model that has the capacity to learn the sparse decoding problem, while yielding simultaneous denoising through the compact latent space.

The network is trained on simulations of contrast-enhanced ultrasound acquisitions, using an estimate of the real system point-spread-function, the RF modulation frequency, and pixel spacing. Noise, clutter and artifacts were included by randomly sampling from real measurements across frames in which no microbubbles are present. Similar to [101], we adopt a specific loss function that acts as a surrogate for the real localization error:

[TABLE]

where $\mathbf{Y}$ and $\mathbf{X}_{t}$ are the low-resolution input and sparse super-resolution target frames, respectively, $f(\mathbf{Y}|\theta)$ is the nonlinear neural network function, and $\mathbf{G}(\sigma)$ is an isotropic Gaussian convolution kernel. Jointly, the $\ell_{1}$ penalty that acts on the reconstructions and the kernel $\mathbf{G}(\sigma)$ that operates on the targets, yield a loss function that increases when the reconstructed images exhibit less sparsity and when the Euclidean distances between the localizations and the targets become larger. We note that selection of the relative weighting of this sparsity penalty by $\gamma$ is less critical than the thresholding parameter $\lambda$ adopted in the sparse recovery problem (12), since the measurement model $\mathbf{A}$ (characterized by the point-spread-function) exhibits a much smaller bandwidth than $\mathbf{G}(\sigma)$ for low values of $\sigma$ as adopted here. Consequently, the degree of bandwidth extension necessary to yield sparse outputs is less in the latter case.

Fig. 6c displays the super-resolution ultrasound reconstruction of a rat spinal cord [102], qualitatively showing how deep-ULM achieves a significantly higher resolution and contrast than the diffraction-limited maximum intensity projection image (Fig. 6b). Deep-ULM achieves a resolution of about 20-30 $\mu$ m, being a 4-5 fold improvement compared to standard imaging with the adopted linear 15-MHz transducer [95]. In terms of speed, recovery on a $4096\times 1328$ grid takes roughly 100 milliseconds per frame using GPU acceleration, making it about four orders of magnitude faster than a Fourier-domain implementation of sparse recovery through the FISTA proximal gradient scheme [33].

IV-C2 Deep unfolding for robust and fast sparse decoding

While deep encoder-decoder architectures (as used in deep-ULM) serve as a general model for many regression problems and are widely used in computer vision, their large flexibility and capacity also likely make them overparameterized for the sparse decoding problem at hand. To promote robustness by exploiting knowledge of the underlying signal structure (i.e. microbubble sparsity), we propose using a dedicated and more compact network architecture that borrows inspiration from the proximal gradient methods introduced in Section IV-B [87].

To do so, we first briefly describe the ISTA scheme for the sparse decoding problem in (12):

[TABLE]

where $\mu$ determines the step size, and $\mathcal{T}_{\lambda}(\mathbf{x})_{i}=(|x_{i}|-\lambda)_{+}\textrm{sgn}(x_{i})$ is the proximal operator of the $\ell_{1}$ norm. Equation (14) is compactly written as:

[TABLE]

with $\mathbf{W}_{1}=\mu\mathbf{A}^{T}$ , and $\mathbf{W}_{2}=\mathbf{I}-\mu\mathbf{A}^{T}\mathbf{A}$ . Similar to our approach to robust PCA in Section III-D, we can unfold this recurrent structure into a $K$ -layer feedforward neural network as in LISTA (‘learning ISTA’) [88], with each layer consisting of trainable convolutions $\mathbf{W}_{1}^{k}$ and $\mathbf{W}_{2}^{k}$ , along with a trainable shrinkage parameter $\lambda^{k}$ . This enables learning a highly-efficient fixed-length iterative scheme for fast and robust ULM, with an optimal set of kernels and parameters per iteration, which we term deep unfolded ULM. Unlike LISTA, we avoid vanishing gradients in the ‘dead zone’ of the proximal soft-thresholding operator $\mathcal{T}$ by replacing it with a smooth sigmoid-based soft-thresholding operation [103]. An overview of this approach is given in Fig. 7b, contrasting this dedicated sparse-decoding-inspired solution with a general deep encoder-decoder network architecture in Fig. 7a. Both networks are trained on the same, synthetically generated, data.
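The recurrence that is being unfolded can be sketched as a stack of layers, each holding its own $\mathbf{W}_{1}$ , $\mathbf{W}_{2}$ and threshold. In the sketch below the layers are simply initialised to the plain ISTA values rather than trained, dense matrices stand in for the convolution kernels, and the ordinary soft-thresholding operator is used instead of the smooth sigmoid-based variant; these simplifications, the function names, and the choice to pass the threshold directly to $\mathcal{T}$ (with any step-size scaling absorbed into it) are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the l1 norm: (|x_i| - lam)_+ sgn(x_i)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista_layers(A, lam, n_layers):
    """Per-layer parameters (W1, W2, lam), all initialised to plain ISTA."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2          # step size from the largest singular value
    W1 = mu * A.T
    W2 = np.eye(A.shape[1]) - mu * (A.T @ A)
    return [(W1, W2, lam) for _ in range(n_layers)]

def unfolded_forward(y, layers):
    """Apply x <- T_lam(W1 y + W2 x) once per layer, starting from x = 0."""
    x = np.zeros(layers[0][1].shape[0])
    for W1, W2, lam in layers:                    # in a trained network these differ per layer
        x = soft_threshold(W1 @ y + W2 @ x, lam)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 40))
x_true = np.zeros(40)
x_true[[3, 17, 31]] = [2.0, -1.5, 1.0]
x_hat = unfolded_forward(A @ x_true, ista_layers(A, lam=0.01, n_layers=10))
```

Training would replace the shared ISTA values with layer-specific kernels and thresholds, which is what allows a small, fixed number of layers to stand in for many iterations.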

Tests on synthetic data show that both deep learning methods significantly outperform standard ULM and sparse decoding through FISTA for high microbubble concentrations (Fig. 7c). On such simulations, the deep encoder-decoder used in deep-ULM yields higher recall and lower localization errors than deep unfolded ULM. Interestingly, when applying the trained networks to in-vivo ultrasound data, we instead observe that deep unfolded ULM yields super-resolution images with higher fidelity. It thus translates much better to real acquisitions than the large deep encoder-decoder network (see Figs. 6c and 7d for comparison).

Our 10-layer deep unfolded ULM, comprising $5\times 5$ convolutional kernels, has far fewer parameters (merely 506, compared to almost 700,000 for the encoder-decoder scheme). It therefore exhibits a drastically lower memory footprint and reduced power consumption, in addition to achieving higher inference rates. The encoder-decoder approach requires over 4 million FLOPs to map a low-resolution patch of 16 by 16 pixels into a super-resolution patch of 128 by 128 pixels. The unfolded ISTA architecture is much more efficient, requiring just over 1000 FLOPs.

The lower number of trainable parameters may also explain the improved robustness and better generalization towards real data compared to its over-parameterized counterpart. On the other hand, complex image artifacts such as the strong bone reflections visible in the bottom left of Fig. 7d remain more prominent using the compact unfolding scheme.

V Other applications of Deep Learning in Ultrasound

While this paper predominantly focuses on deep learning strategies for ultrasound-specific receive processing methods along the imaging chain, the first application of deep learning in ultrasound to thrive was spurred by computer vision: automated analysis of the images obtained with traditional systems [104]. Such image analysis methods aim at dramatically accelerating (and potentially improving) current clinical diagnostics.

A classic application of ultrasonography lies in prenatal screening, where fetal growth and development are monitored to identify possible problems and aid diagnosis. These routine examinations can be complex and cumbersome, requiring years of training to swiftly identify the scan planes and structures of interest. The authors in [105] effectively leverage deep learning to drastically simplify this procedure, enabling real-time detection and localization of standard fetal scan planes in freehand ultrasound. Similarly, in [106], [107], deep learning was used to accelerate echocardiographic exams by automatically recognizing the relevant standard views for further analysis, even permitting automated myocardial strain imaging [108]. In [109], a CNN was trained to perform thyroid nodule detection and recognition. Similar applications of deep learning include automated identification and segmentation of tumors in breast ultrasound [110], [111], [112], localization of clinically relevant B-line artifacts in lung ultrasonography [113], and real-time segmentation of anatomical zones on transrectal ultrasound (TRUS) scans [114]. In [115], the authors show how such anatomical landmarks and boundaries can be exploited by a deep neural network to attain accurate voxel-level registration of TRUS and MRI.

Beyond these computer-vision applications, other learning-based techniques aim at extracting relevant medium parameters for tissue characterization. Among such approaches is data-driven elasticity imaging [116], [117]. In these works, the authors propose neural-network-based models that produce spatially varying linear elastic material properties from force-displacement measurements, free from prior assumptions on the underlying constitutive models or material properties. In [118], a deep convolutional neural network is used for speed-of-sound estimation from (single-sided) B-mode channel data. In [119], the authors address the same problem by introducing an unfolding strategy that yields a dedicated network based on the iterative wave reflection tracking algorithm. The ability to measure speed of sound not only permits tissue characterization, but also adequate refraction correction in beamforming.

VI Discussion and future perspectives

Over the past years, deep learning has revolutionized a number of domains, spurring breakthroughs in computer vision, natural language processing and beyond. In this paper we aimed to highlight the potential that this powerful approach carries when leveraged in ultrasound image and signal reconstruction. We argue and show that deep learning methods profit considerably from integrating signal priors and structure, as embodied by the proposed deep unfolding schemes for clutter suppression and super-resolution imaging, and by the learned beamforming approaches. In addition, several ultrasound-specific considerations regarding suitable activation and loss functions were given.

We designed and showcased a number of independent building blocks, with trained artificial agents and neural signal processors dedicated to distinct applications. Some of the presented methods operate on images (Sections III-D and IV) or IQ data (Section III-C), while others process channel data directly (Sections III-A and III-B). A full processing chain may easily comprise a number of such components, which can be optimized holistically. This proposition enables imaging chains that are dedicated to the application and fully adaptive.

Designing neural networks that can efficiently process channel data in real time comes with a number of challenges. First, in contrast to images, channel data has a very large dynamic range and is radio-frequency modulated. This makes the activation functions typically used in image analysis (often ReLUs or hyperbolic tangents) less suitable. In Section III-A3, we argue that the class of concatenated rectified linear units provides a possible alternative. Second, channel data is extremely large, in particular for large arrays or matrix transducers and when sampled at the Nyquist rate. This may be alleviated significantly by leveraging sub-Nyquist sampling schemes [3, 14, 15, 17, 55], permitting high-end processing of low-rate channel data after (wireless) transfer to a remote (or cloud) processor. Such a new scheme, with a wireless probe that streams low-rate channel data for subsequent deep learning in the cloud, would open up many new possibilities for intelligent image formation and advanced processing in ultrasonography.
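As an illustration of why concatenated rectified linear units suit signed, radio-frequency modulated data, the sketch below implements a generic concatenated ReLU in NumPy. It is not the exact activation of Section III-A3. The mean-centered, $\ell_2$-normalized variant is an assumption added to show one way of taming a large dynamic range, and the toy pulse and function names are hypothetical.

```python
import numpy as np

def crelu(x, axis=-1):
    """Concatenate the positive and the negative part of x along `axis`."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

def normalized_crelu(x, axis=-1, eps=1e-12):
    """Assumed variant: center and l2-normalize x before the concatenated ReLU."""
    centered = x - x.mean(axis=axis, keepdims=True)
    norm = np.linalg.norm(centered, axis=axis, keepdims=True)
    return crelu(centered / (norm + eps), axis=axis)

# Toy RF-like signal: a modulated Gaussian pulse with large amplitude
n = 64
t = np.linspace(0.0, 1.0, n)
rf = 1e3 * np.exp(-((t - 0.5) ** 2) / 0.01) * np.cos(2 * np.pi * 20 * t)

relu_out = np.maximum(rf, 0.0)   # discards all negative half-cycles
crelu_out = crelu(rf)            # keeps both polarities in 2n channels
recovered = crelu_out[:n] - crelu_out[n:]
norm_out = normalized_crelu(rf)  # values confined to [0, 1]
```

A plain ReLU discards the negative half-cycles, whereas the signed signal can be recovered exactly from the concatenated output, at the cost of doubling the number of channels.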

Deep learning typically relies on vast amounts of training data. Although several approaches to make learning more data-efficient and robust have been discussed throughout this paper, a significant amount of data is still required. In the framework of supervised learning, training data typically consists of input data and desired targets. What these targets are, and how they should be obtained, depends on the application and goal. Sometimes it is desirable, for instance, to mimic an existing high-performance algorithm that is too complex and costly to implement in real time. Examples of this are the adaptive beamforming and spectral Doppler applications described in Sections III-A and III-B, respectively. At other times, training data may only be obtainable through simulations or measurements on well-characterized in-vitro phantoms. In such cases, the performance of a deep learning algorithm on in-vivo data stands or falls with the realism of these training data and their coverage of the real-world data distribution. As shown in Section IV-C2, leveraging structural signal priors in the network architecture strongly aids generalization beyond simulations.

Once trained, inference can be fast through the exploitation of high-performance GPUs. While advanced high-end imaging systems may be equipped with GPUs to facilitate the deployment of deep neural networks at the remote processor, FPGAs or ASICs may be more appropriate for resource-limited low-power settings [120]. In the consumer market, small neural- and tensor-processing units (NPUs and TPUs, respectively) are enabling neural network inference at the edge [121]; one can envisage a similar paradigm for front-end ultrasound processing. As such, designing compact and efficient neural networks for memory-constrained (edge) settings is highly relevant, particularly for miniature and highly portable ultrasound systems, where memory size, inference speed, and network bandwidth are all strictly constrained. This may be achieved by favouring (multiple) artificial agents that have very specific and well-defined tasks (Sections III-A and III-B), as opposed to a single highly complex end-to-end deep neural network. We also showed that embedding signal priors in neural architectures permits drastically reduced memory footprints. In that context, the difference between a deep convolutional encoder-decoder network (no prior) and a deep unfolded ISTA network (structural sparsity prior) is illustrative: where the former consists of almost 700,000 parameters, the latter can perform super-resolution recovery with just over 500. Additional strategies to condense large models include knowledge distillation [122] and parameter pruning, as well as weight quantization [123].
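Two of the model-condensing strategies named above can be sketched in a few lines of NumPy. The example is a generic illustration under stated assumptions, not the procedure of [122] or [123]. It applies global magnitude pruning and symmetric per-tensor 8-bit quantization to a single hypothetical $5\times 5$ kernel. All function names are illustrative, and the 40% sparsity level is arbitrary.

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

def quantize_int8(w):
    """Symmetric uniform quantization to int8 with one scale per tensor."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kernel = rng.standard_normal((5, 5)).astype(np.float32)  # hypothetical kernel

pruned = prune_by_magnitude(kernel, sparsity=0.4)  # 10 of 25 weights set to zero
q, scale = quantize_int8(kernel)                   # 1 byte per weight instead of 4
max_err = float(np.max(np.abs(dequantize(q, scale) - kernel)))
```

In practice both steps are usually followed by fine-tuning to recover accuracy, which this sketch omits.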

Once deployed in the field, artificial agents in next-generation ultrasound systems should ultimately be able to embrace the vastness of data at their disposal and learn continuously throughout their ‘lifetime’. To that end, unsupervised or self-supervised learning becomes increasingly relevant [124]. This holds true for many artificial intelligence applications, and extends beyond ultrasound imaging.

The promise that deep learning holds for ultrasound imaging is significant; it may spur a paradigm shift in the design of ultrasound systems, where smart wireless probes facilitated by sub-Nyquist and neural edge computing are connected to the cloud, with AI-driven imaging modes and algorithms that are dedicated to specific applications. Empowered by deep learning, next-generation ultrasound imaging may become a much stronger modality with devices that continuously learn to provide better images and clinical insight, leading to improved and more widely accessible diagnostics through cost-effective, highly portable and intelligent imaging.

Acknowledgements

The authors would like to thank Ben Luijten, Frederik de Bruijn and Harold Schmeitz for their contribution to the adaptive beamforming and spectral Doppler applications. They also want to thank Matthew Bruce and Zin Khaing for acquiring the spinal cord data used to evaluate the super-resolution algorithms.

Bibliography


[1] Thomas L Szabo. Diagnostic ultrasound imaging: inside out. Academic Press, 2004.
[2] Jonathan M Baran and John G Webster. Design of low-cost portable ultrasound systems. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 792–795. IEEE, 2009.
[3] Tanya Chernyakova and Yonina C Eldar. Fourier-domain beamforming: the path to compressed ultrasound imaging. IEEE transactions on ultrasonics, ferroelectrics, and frequency control, 61(8):1252–1267, 2014.
[4] Jean Provost, Clement Papadacci, Juan Esteban Arango, Marion Imbault, Mathias Fink, Jean-Luc Gennisson, Mickael Tanter, and Mathieu Pernot. 3D ultrafast ultrasound imaging in vivo. Physics in Medicine & Biology, 59(19):L1, 2014.
[5] Mickael Tanter and Mathias Fink. Ultrafast imaging in biomedical ultrasound. IEEE transactions on ultrasonics, ferroelectrics, and frequency control, 61(1):102–119, 2014.
[6] Jérémy Bercoff, Mickael Tanter, and Mathias Fink. Supersonic shear imaging: a new technique for soft tissue elasticity mapping. IEEE transactions on ultrasonics, ferroelectrics, and frequency control, 51(4):396–409, 2004.
[7] Charlie Demené, Thomas Deffieux, Mathieu Pernot, Bruno-Félix Osmanski, Valérie Biran, Jean-Luc Gennisson, Lim-Anna Sieu, Antoine Bergel, Stephanie Franqui, Jean-Michel Correas, et al. Spatiotemporal clutter filtering of ultrafast ultrasound data highly increases Doppler and fUltrasound sensitivity. IEEE transactions on medical imaging, 34(11):2271–2285, 2015.
[8] Claudia Errico, Juliette Pierre, Sophie Pezet, Yann Desailly, Zsolt Lenkei, Olivier Couture, and Mickael Tanter. Ultrafast ultrasound localization microscopy for deep super-resolution vascular imaging. Nature, 527(7579):499, 2015.