Probabilistic Neural-Network Based 2D Travel Time Tomography

Stephanie Earp; Andrew Curtis

arXiv:1907.00541·physics.comp-ph·May 8, 2020

Probabilistic Neural-Network Based 2D Travel Time Tomography

Stephanie Earp, Andrew Curtis

PDF

TL;DR

This paper introduces a fast, probabilistic 2D travel time tomography method using neural networks, which significantly reduces computational costs compared to traditional Monte Carlo approaches and highlights the impact of prior information on uncertainty estimates.

Contribution

It presents the first fully non-linear, rapid Bayesian inversion technique for 2D travel time tomography using mixture density networks, demonstrating the importance of prior information.

Findings

01

Neural networks enable rapid probabilistic tomography after initial training.

02

Prior information significantly influences uncertainty estimates in the inversion.

03

The method achieves high-dimensional inversions in about a second on a standard desktop.

Abstract

Travel time tomography for the velocity structure of a medium is a highly non-linear and non-unique inverse problem. Monte Carlo methods are becoming increasingly common choices to provide probabilistic solutions to tomographic problems but those methods are computationally expensive. Neural networks can often be used to solve highly non-linear problems at a much lower computational cost when multiple inversions are needed from similar data types. We present the first method to perform fully non-linear, rapid and probabilistic Bayesian inversion of travel time data for 2D velocity maps using a mixture density network. We compare multiple methods to estimate probability density functions that represent the tomographic solution, using different sets of prior information and different training methodologies. We demonstrate the importance of prior information in such high dimensional…

Tables2

Table 1. Table 1: Network configurations of the networks with 4 fully connected (FC) layers. Each row in the table represent a separate networks trained. Eight networks were trained for the 8 x 8 models and four networks for the 16 x 16 models.

Size of model	FC 1	FC 2	FC 3	FC 4	Total No. of Parameters
	270	1000	380	600	1,544,765
	100	500	450	550	1,622,685
	800	325	100	300	1,165,660
8 x 8	200	400	200	50	334,335
	300	250	200	50	331,685
	900	700	70	550	2,077,505
	200	250	200	50	274,185
	300	400	200	50	406,835
	375	500	300	600	5,265,470
16 x 16	300	250	200	50	625,445
	200	400	200	50	628,095
	800	1000	500	550	6,076,995

Table 2. Table 2: Network configurations of the convolutional networks with three convolutional (Conv) layers and 4 fully connected (FC) layers. Each row in the table represent a separate networks trained.

Conv 1		Conv 2		Conv 3		FC 1	FC 2	FC 3	FC 4	Total No. of Parameters
Filter	Kernel	Filter	Kernel	Filter	Kernel					8x8	16x16
128	5	128	5	64	1	800	150	600	1500	4,717,405	13,363,165
32	9	32	5	16	1	500	300	600	1500	4,354,183	12,999,943
32	9	32	5	16	1	500	200	2000	1250	5,641,438	12,847,243
32	9	8	5	16	1	500	300	600	1750	4,986,575	15,054,335
32	9	32	5	16	1	500	1500	50	1250	3,528,333	10,734,093

Equations22

p (m ∣ d) = k p (d ∣ m) p (m)

p (m ∣ d) = k p (d ∣ m) p (m)

p (m^{i} ∣ d) = \int_{m_{j} \neq = m_{i}} p (m ∣ d) d m_{j}

p (m^{i} ∣ d) = \int_{m_{j} \neq = m_{i}} p (m ∣ d) d m_{j}

p (m ∣ d) ≃ i = 1 \sum M α_{i} (d) Θ_{i} (m ∣ d)

p (m ∣ d) ≃ i = 1 \sum M α_{i} (d) Θ_{i} (m ∣ d)

Θ_{i} (m ∣ d) = \frac{1}{\prod _{k = 1}^{c} ( 2 π σ _{ik} ( d ))} e x p {- \frac{1}{2} k = 1 \sum c \frac{( m _{i} - μ _{ik} ( d ) ) ^{2}}{σ _{ik}^{2} ( d )}}

Θ_{i} (m ∣ d) = \frac{1}{\prod _{k = 1}^{c} ( 2 π σ _{ik} ( d ))} e x p {- \frac{1}{2} k = 1 \sum c \frac{( m _{i} - μ _{ik} ( d ) ) ^{2}}{σ _{ik}^{2} ( d )}}

p (m ∣ d) ≃ i = 1 \sum M j = 1 \sum c \frac{E _{i} α _{ij}}{\sum _{k = 1}^{M} E _{k}} (d) Θ_{ij} (m ∣ d)

p (m ∣ d) ≃ i = 1 \sum M j = 1 \sum c \frac{E _{i} α _{ij}}{\sum _{k = 1}^{M} E _{k}} (d) Θ_{ij} (m ∣ d)

D_{KL} (p (m ∣ d), p (m)) = \int_{- \infty}^{\infty} p (m ∣ d) l n (\frac{p ( m ∣ d )}{p ( m )}) d x

D_{KL} (p (m ∣ d), p (m)) = \int_{- \infty}^{\infty} p (m ∣ d) l n (\frac{p ( m ∣ d )}{p ( m )}) d x

l (x, y) = \frac{2 μ _{x} μ _{y} + C _{1}}{μ _{x}^{2} + μ _{y}^{2} + C _{1}}

l (x, y) = \frac{2 μ _{x} μ _{y} + C _{1}}{μ _{x}^{2} + μ _{y}^{2} + C _{1}}

c (x, y) = \frac{2 σ _{x} σ _{y} + C _{2}}{σ _{x}^{2} + σ _{y}^{2} + C _{2}}

c (x, y) = \frac{2 σ _{x} σ _{y} + C _{2}}{σ _{x}^{2} + σ _{y}^{2} + C _{2}}

s (x, y) = \frac{σ _{x y} + C _{3}}{σ _{x} σ _{y} + C _{3}}

s (x, y) = \frac{σ _{x y} + C _{3}}{σ _{x} σ _{y} + C _{3}}

S S I M (x, y) = [l (x, y)^{α} \cdot c (x, y)^{β} \cdot s (x, y)^{γ}]

S S I M (x, y) = [l (x, y)^{α} \cdot c (x, y)^{β} \cdot s (x, y)^{γ}]

S S I M (x, y) = \frac{( 2 μ _{x} μ _{y} + C _{1} ) ( 2 σ _{x y} + C _{2} )}{( μ _{x}^{2} + μ _{y}^{2} + C _{1} ) ( σ _{x}^{2} + σ _{y}^{2} + C _{2} )}

S S I M (x, y) = \frac{( 2 μ _{x} μ _{y} + C _{1} ) ( 2 σ _{x y} + C _{2} )}{( μ _{x}^{2} + μ _{y}^{2} + C _{1} ) ( σ _{x}^{2} + σ _{y}^{2} + C _{2} )}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Probabilistic Neural-Network Based 2D Travel Time Tomography

Stephanie Earp

Department of Geosciences

University of Edinburgh

[email protected]

\AndAndrew Curtis

Department of Geosciences

University of Edinburgh

and

Institute of Geophysics

ETH Zurich

[email protected]

Abstract

Travel time tomography for the velocity structure of a medium is a highly non-linear and non-unique inverse problem. Monte Carlo methods are becoming increasingly common choices to provide probabilistic solutions to tomographic problems but those methods are computationally expensive. Neural networks can often be used to solve highly non-linear problems at a much lower computational cost when multiple inversions are needed from similar data types. We present the first method to perform fully non-linear, rapid and probabilistic Bayesian inversion of travel time data for 2D velocity maps using a mixture density network. We compare multiple methods to estimate probability density functions that represent the tomographic solution, using different sets of prior information and different training methodologies. We demonstrate the importance of prior information in such high dimensional inverse problems due to the curse of dimensionality: unrealistically informative prior probability distributions may result in better estimates of the mean velocity structure, however the uncertainties represented in the posterior probability density functions then contain less information than is obtained when using a less informative prior. This is illustrated by the emergence of uncertainty loops in posterior standard deviation maps when inverting travel time data using a less informative prior, which are not observed when using networks trained on prior information that includes (unrealistic) a priori smoothness constraints in the velocity models. We show that after an expensive program of training the networks, repeated high-dimensional, probabilistic tomography is possible on timescales of the order of a second on a standard desktop computer.

1 Introduction

Seismic travel time tomography is often used to reconstruct images of the interior of the Earth (Aki et al., 1977; Dziewonski and Woodhouse, 1987; Mordret et al., 2013), but is a significantly non-linear and non-unique inverse problem. To find solutions with minimal computation, the physics relating local wave speed to measured travel times is usually simplified by linearization (Rawlinson et al., 2010), but this creates large differences between linearized and true probabilistic solutions (Galetti et al., 2015). Increases in compute power now allow fully nonlinear Monte Carlo sampling solutions to be found without linearisation, to solve problems in 2D (Bodin and Sambridge, 2009; Galetti et al., 2015) and 3D (Hawkins and Sambridge, 2015; Piana Agostinetti et al., 2015; Zhang et al., 2018, 2019). Using Bayesian methods, such solutions provide samples (example tomographic models) that fit the data to within their measurement uncertainties, are consistent with available prior information, and are distributed according to the posterior probability density function (pdf) across the parameter space; this pdf constitutes the full solution of tomographic problems. Nevertheless, such solutions are acquired at significant expense, typically requiring weeks of compute time for realistic data sets and expensive storage of large sample sets.

An alternative approach to estimate the posterior pdf is to use prior sampling (Devilee et al., 1999; Käufl et al., 2016). In this case samples are created before inference using only available prior knowledge. The set of samples can then be interrogated for examples that are consistent with any particular data set (a method called resampling (Sambridge, 1999)) or used to parametrise a function that relates data to models which can then be used to solve the inverse or inference problem (Roth and Tarantola, 1994).

In this work we use a neural network-based method to perform the inversion. Neural networks (NNs) can approximate any nonlinear relationship between two parameter spaces, given a so-called training set of example pairs of dependent and independent parameter values under that relationship (Bishop, 1995). In travel time tomography the forward solution is known and calculable, but the inverse solution is highly non-linear and non-unique. In such cases the forward computation can be used to create the prior set of samples known as a training set, of random models drawn from the prior pdf; these can be used to train the neural networks to approximate the inverse mapping. The prior samples are only needed during the training process which needs only to be performed once - thereafter NNs can be evaluated relatively efficiently. This allows the inference step to be run rapidly for any new data set on standard desktop computers, and the overall cost of the method per tomographic problem decreases rapidly with the number of problems to be solved.

Neural network-based inversion methods have been applied to various tomography problems in the past. Roth and Tarantola (1994) first used NNs to estimate subsurface velocity structure from active source seismic waveforms, Moya and Irikura (2010) performed velocity inversion with a neural network using waveform data from earthquakes, and Araya-Polo et al. (2018) used semblance gathers as input to a network to invert for velocity structure. Gupta et al. (2018) used a convolutional network to learn an ensemble of simpler mappings in a low-dimensional space before reconstructing the image by combining the mappings. Dictionary learning methods (Mairal et al., 2014) create sparse representations of the data and can be used to create a set of representations of features. Bianco and Gerstoft (2018) performed linearized 2D surface wave travel time tomography using dictionary learning to regularise the inversion.

The methods mentioned above and in Kong et al. (2018) all provide only deterministic solutions to the inversion. Since the solution to tomographic problems is always non-unique, in order to assess the worth of any model estimate we require that neural networks produce full probabilistic information about the set of models in the inverse problem solution (the posterior pdf). Devilee et al. (1999) solved the first probabilistic geophysical inverse problem using NNs. They proposed a variety of methods to train NNs to provide discretised Bayesian posterior pdfs. Mixture density networks (MDNs) are a class of augmented neural networks that output a probability distribution that is defined as a sum of analytic pdf kernels such as Gaussians (Bishop, 1995). MDNs can be trained such that for any input data this distribution approximates the posterior pdf. These methods have been used at a global scale to invert surface wave velocities for global crustal thicknesses and seismic velocities (Meier et al., 2007a, b) and for water content in the mantle transition zone (Meier et al., 2009), at a reservoir scale to infer petrophysical parameters from velocities (Shahraeeni and Curtis, 2011; Shahraeeni et al., 2012), for earthquake source parameter estimation (Käufl et al., 2014, 2015) and to assess the uncertainty in model parameters of the Earth’s global average (1-dimensional) radial velocity structure from P-wave travel time curves (De Wit et al., 2013). They have also been used in conjunction with Markov random fields and other statistical and graphical models to solve geophysical inverse problems with spatially sophisticated prior information (Nawaz and Curtis, 2017, 2018, 2019). They have been used in conjunction with seismic gradiometry to perform near-real time 3D surface wave tomography (Cao et al., 2020). These studies demonstrated that the pdf obtained from an MDN is comparable to a Monte Carlo sampling solution but is obtained at much lower computational cost in the cases where similar inverse problems must be solved repeatedly with different data sets, and that at the moment of application MDNs provide probabilistic solutions almost instantaneously.

We show for the first time that MDNs can perform fully non-linear, rapid and probabilistic 2D tomography from travel time data. We compare different methods for creating the prior training set and performing the neural network inversion. The networks create approximate mean velocity models and estimates of the full marginal posterior pdf’s, virtually instantaneously. Thus, in return for accepting approximate posterior pdfs we obtain a significant computational saving compared to Monte Carlo methods.

2 METHOD

2.1 Bayesian Inference

We wish to solve tomographic inverse problems in a probabilistic framework to find the posterior distribution of velocity models m that fit some given data $\mathbf{d}$ , written as $p(\textbf{m}\mid\textbf{d})$ . This is defined as (Tarantola, 2005):

[TABLE]

where $p(\textbf{m})$ represents the prior probability density on the model space, $p(\textbf{d}\mid\textbf{m})$ represents the conditional probability of some data given the model (known as the likelihood) and $k$ is a normalisation constant. In multidimensional problems, where the dimensionality of m is greater than 1, we often need to make inferences about a single parameter with index $i$ and hence must calculate the marginal posterior distribution $p(m^{i}\mid\textbf{d})$ . This can be obtained by integrating over all parameters $j$ that are not of interest:

[TABLE]

In this study we focus on estimating marginal distributions $p(m^{i}\mid\textbf{d})$ , and posterior trade-offs between pairs of individual parameters.

2.2 Mixture Density Networks

Neural networks are essentially mathematical mappings that emulate the relationship between two parameter spaces. Given a set of $N$ data-model pairs $\{(\mathbf{d}_{i},\mathbf{m}_{i}):i=1,...,N\}$ , where $\mathbf{m}_{i}$ is the model used to generate the data $\mathbf{d}_{i}$ under some forward relation, NNs can be trained to model an arbitrary non-linear inverse function from $\mathbf{d}$ to some properties of the set of models $\mathbf{m}$ . In this paper we use a class of neural networks called mixture density networks, that can be trained to output the probability of any model $\mathbf{m}$ given some fixed (measured) data $\mathbf{d}$ , written as $p(\textbf{m}\mid\textbf{d})$ . The probability distribution is approximated using a sum (called a mixture) of Gaussians:

[TABLE]

where $\alpha_{i}$ is called the mixture parameter that attaches relative importance to each Gaussian kernel, $M$ is the number of Gaussians in the mixture, and $\Theta_{i}$ are here defined to be Gaussian kernels with a diagonal covariance matrix given by

[TABLE]

where $c$ is the dimensionality of m, $\mu_{ik}$ is the $k$ th element of the $i$ th kernel in the mixture, $\sigma_{ik}$ is the standard deviation of the $k$ th diagonal element of the $i$ th kernel in the mixture, and both $\mu_{ik}$ and $\sigma_{ik}$ are outputs of a trained NN. The network is trained by minimising the negative log likelihood of the pdf in Equation 4, equivalent to maximizing the likelihood of the pdf (Bishop, 1995). For a more comprehensive general introduction to MDNs we refer the reader to Bishop (1995), or to Meier et al. (2007a) and Shahraeeni and Curtis (2011) for detailed descriptions with applications in geophysics.

Network training is performed using gradient-based optimization of the network’s internal parameters. The particular trained NN obtained is therefore sensitive to the random parameter initialization and to the network configuration (internal structure). We train an ensemble of multiple networks with different configurations and combine them to give a group of networks - a so-called mixture of experts. In theory networks trained independently may make good predictions for different reasons and under different inputs (in our case, data vectors); using a combination of networks therefore often results in better generalisation of performance to unseen data and improves prediction accuracy (Dietterich, 2000). We construct the ensemble by a weighted average of network outputs, where each weight is determined by the performance of the associated network on the test data set (or simply test set). The posterior probability distribution is thus estimated by

[TABLE]

where $E_{i}$ is the negative exponential of the error on the test dataset of the $i$ th kernel. The final estimate of probability distribution $p(\textbf{m}\mid\textbf{d})$ contains $cM$ Gaussian kernels.

2.3 Model Parametrisation and Traveltime Data

We define the geometry of our tomography problem to be that shown in Figure 1. We fix the locations of 18 wave energy sources and receivers (shown in Figure 2), and parametrise the wave speed or velocity across the Model Volume within which the forward relationship predicts travel times of the first arriving energy between any source-receiver pair. Travel times $\textbf{d}_{i}$ between all possible source-receiver pairs are calculated using an eikonal raytracer (Rawlinson and Sambridge, 2004, 2005). The traveltimes from the 4 velocity models shown in Figure 2 are shown in Figure 3. Such travel times are used herein to image the velocity structure within the smaller Image Volume - wave speeds outside of that area are disregarded and thus constitute nuisance parameters. We use a larger volume to calculate the forward relationship to avoid raypaths travelling along the boundary of the model and causing misleading travel times.

We construct four separate training sets, each of 2.5 million discretised models where each model represents a 2D heterogeneous velocity structure. Two of these training sets are created on an 8 x 8 coarser grid of cells and two are created on a 16 x 16 finer grid of cells within the Image Volume (and the same resolution extends throughout the Model Volume). Each of the four datasets is created by selecting a random wave speed in each cell independently from the Uniform prior distribution $U(0.5km/s,2.5km/s)$ . All models in one finer data set and one coarser data set are then smoothed using a 2D averaging filter window which was square of size 5x5 cells for the finer model and 3x3 cells for the coarser model. Thereafter the velocities are normalised to the same absolute range as the original random models for ease of comparison of results. Then, the travel times between all source-receiver pairs are calculated for all models, in all four training sets (examples are given in Figure 3).

With this method we create training sets with two different amounts and types of prior information. The two sets of random unsmoothed velocity models have relatively weak prior information with no correlations between neighbouring cells. This has the advantage that any type of velocity contrast between neighbouring cells would be consistent with the prior pdf and hence can in principle be imaged using the associate trained network given sufficiently informative data (see below). This is demonstrated by the uniform distribution of the histogram in Figure 2c which shows the probability of the velocity of the adjacent cell given that the velocity of the central cell is 1.5km/s. On the other hand this implies that the prior pdf is Uniform over a 64- and 256- dimensional space for the coarser and finer training sets respectively; these spaces are therefore extremely sparsely sampled by the 2,500,000 training set models due to the curse of dimensionality (Curtis and Lomax, 2001). This implies that over most of these two spaces the prior pdf is entirely unrepresented by ‘proximal’ samples.

The two sets of smoothed velocity models embody stronger prior information as the speeds in neighbouring cells are correlated. This is demonstrated in Figure 2f where the distribution of possible velocities in adjacent cells given that the velocity of the central cell is 1.5km/s is approximately Gaussian. This means that models with larger velocity contrasts between neighbouring cells are not represented in the training data set and hence will be precluded from inversion results. This may or may not be advantageous depending on the true prior information about the form of the structure being imaged. However, it has the advantage that the effective space (manifold) of models consistent with the prior information is considerably smaller than that for the smoothed models, so that the finite-sized training set may better represent the form of the prior pdf.

3 RESULTS

3.1 Network Configurations

We train separate MDNs to predict the marginal probability distribution $p(m^{i}\mid\textbf{d})$ of velocity $m^{i}$ in cell $i$ in each of the two sizes of models. For the finer datasets we train 4 MDNs and for the coarser datasets we train 8 MDNs at each location $i$ . We use different configurations as well as randomly initialised internal network parameters (commonly referred to as weights and biases) for each network because diversity in the ensemble generally leads to better predictions (Dietterich, 2000). Appendix A outlines the different network configurations. For each network we use a Gaussian mixture consisting of 15 kernels. The precise number of kernels is not important as long as it is larger than the number required to represent the marginal posterior pdf in each model cell. The network can either reduce the amplitude of the mixture parameter $\alpha_{ij}$ to close to zero to remove unnecessary kernels, or can combine unnecessary kernels by giving them a similar $\mu$ and $\sigma$ to other kernels (Bishop, 1995). In practice we found the maximum number of kernels with significant weight used in any mixture was 8.

We also train networks to invert for the full model (velocities in all cells at once) using a single network. In this case we use a convolutional network with 3 convolutional layers followed by 3 fully connected layers and 15 kernels for the Gaussian mixture. We train 10 networks with 5 different network configurations (each configuration is trained twice with random weight initialisation). Layer sizes were selected using the python library hyperopt (Bergstra et al., 2015) and Appendix A gives further description of the networks used. The same network configurations were trained on all four training sets.

For every training run for each network configuration we use 85% of the training dataset to train the network, 10% of the dataset as a validation set during training, and 5% as a test set to evaluate the final network once training has finished. The training set is used in the optimisation of network parameters. The parameters are updated iteratively so that the output of the network best represents the training set sample distribution. To avoid over-fitting the network to the data the cost function is also periodically evaluated over the validation set; when the error on the validation set stops decreasing we end the training optimisation. Once all of the networks have been trained we evaluate the final network performance using the test set and sum the networks across the ensemble using equation 5.

3.2 Result Evaluation

We tested our trained networks by applying them to synthetic data sets calculated for velocity models created specifically to test the performance of each type of network. The quality of the mean of the inverted probability distributions of 2D velocity models (comprising 1D marginal posterior pdfs in each model cell in the cases where networks were trained for each cell individually) are compared against the true velocity model using the structural similarity index metric (SSIM). This metric is based on 3 relatively independent comparison measurements: luminance, contrast and structure (Appendix B). SSIM can assume values between -1 and 1: a value of 1 indicates the images are identical, 0 indicates no structural similarity and negative values occur when local structure is inverted. SSIM differs from other quality indicators such as mean squared error (MSE) in that it measures the quality of an image in structure and pixel value compared to a ground truth, rather than the absolute squared errors (which often do not mean much to someone who is trying to interpret the resulting images).

We compare the information gain between the prior $p(\textbf{m})$ and the posterior $p(\textbf{m}|\mathbf{d})$ distribution using the Kullback-Leibler (KL) divergence

[TABLE]

where a higher $D_{\mathrm{KL}}$ indicates that the posterior pdf has gained information over the prior and $D_{\mathrm{KL}}=0$ occurs when the two distributions are the same. This can be used as an indication of the effectiveness of the network: if $D_{\mathrm{KL}}$ is close to [math] then the network has been able to learn little, if anything at all, from the data.

3.3 Prior

To show the effect of the prior on our models we inverted synthetic data for the three velocity models shown in Figures 4a and 5a using networks trained with weak prior information (unsmoothed training models) in Figure 4 and those trained with stronger prior information (smoothed models) in Figure 5. The test models were defined on a grid finer than our training sets on a 32x32 grid, which is finer than either of our training sets; this ensures that we evaluate the networks using models that are outside the range of those used for training. For all test models it is clear that with stronger prior information the networks better resolve the velocity structure, shown generally by the much higher SSIM values in Figure 5b and 5c compared with the corresponding values in Figure 4b and 4c. This is true even though the test models contradict the stronger prior information: they all contain structures that are not smooth.

The velocity model in the left-hand column has a background velocity (cells surrounding the central anomaly) equal to the mean of the prior pdf and a circular low velocity, and is estimated well in both inversions using weaker prior information training sets (Figure 4). However, even a small increase in complexity in velocity models gives poor inversion results as shown by the central column of velocity models. For these, all the velocities are increased compared to the left column, and in particular the background velocity is increased away from the mean of the prior. In this case the networks with weaker prior information are unable to recover much, if any, of the true structure. If stronger prior information is included in the training set the networks accurately predict a larger variety of velocity models. The true structures of the two circular models in Figure 5 are closely reproduced in the inversion. Sharp contrasts in velocity in the true model are translated to more gradual changes in velocity in the estimates (for both grid sizes) due to the smoothness in the prior pdf. Despite this, the SSIM values show that results are very well correlated with the true model. For the more geologically reasonable model in the right column of Figure 5 which includes a structure that might be generated by a fault, networks trained using stronger prior information on both grid sizes produce models that are nearly identical to the true model. Even though the true model contains a sharp contrast boundary, the inverted models still contain a (slightly smoother) version and the overall structure of the true image is maintained.

The effect of stronger prior information is shown in the posterior pdfs in Figure 6. We display the posterior marginal pdfs at three locations indicated in the upper right hand model in Figure 4a: a location in the high velocity zone (triangle), the low velocity zone (circle), and at the edge of the sharp contrast where the inversion struggles to image correctly (star). The KL divergence values are shown above the corresponding posterior marginal pdf. The most striking feature is the much higher KL values for the networks trained with the stronger prior information (rows b and d) indicating a larger information gain in the posterior pdf compared to the prior pdf than is obtained when training with Uniformly random models. In fact, the low KL values for the latter cases imply that nearly no information was gained from the data, and even though a rough approximation of the mean can be found the uncertainties on those values remain large.

3.4 Model Resolution

Our networks are trained on two sizes of grid cell, a coarser 8 x 8 grid and a finer 16 x 16 grid. Figures 4 and 5 show the results for varying grid size. Training on the finer grid induces a factor of 4 more parameters to estimate from the same data. This means that a larger training set size would be needed to sample the increase in image dimensionality. It would be impossible to sample densely the 256-dimensional space spanned by a 16 x 16 grid, but as our examples show, the networks are still able to invert for some basic structural information (Figure 4c). When we train our networks with a stronger prior pdf we reduce the effective dimensionality of our problem by introducing a relationship between neighbouring pixels: essentially all prior models and hence most posterior models lie on a significantly lower dimensional manifold that is embedded with the 64- or 256- dimensional spaces. In that case we can obtain reasonable estimates of the true velocity models regardless of grid size (Figure 5c).

3.5 Type of network

For each of the four training sets we trained networks in two different ways. First we trained separate networks to estimate marginal pdf’s in each cell so that each network has fewer parameters ( $\alpha_{ij}$ , $\mu_{ij}$ , $\sigma_{ij}$ ) to estimate. Note that this does not reduce the dimensionality of the overall problem as each velocity cell in the model contributes to the travel time values, and the velocity in any cell depends on the cells surrounding it even if we do not directly invert for them within the same network. It is important to remember that in this case we do not obtain explicit information about trade-offs between neighbouring cells. Those trade-offs are already integrated into the marginal pdf’s in equation 2.

We also trained networks to invert for slowness in every cell of the model at once. This increases the number of parameters that the network must estimate but as a result the trade-off between velocity values in adjacent cells can be explored. Examples of the joint marginal pdfs from the central model in Figure 4a are shown in Figure 7: the 2D pdfs show few signs of non-linearity, and virtually no indication of the trade-offs that one would expect between velocities in neighbouring cells. This indicates that the results of these networks are unlikely to provide reliable uncertainties.

For models on a coarser grid (Figures 4 and 5 rows b and d), networks perform similarly when using the single cell networks or the full model networks. For models trained on a finer grid, the full model networks perform significantly better than the single cell network as shown in Figure 4. This is almost certainly because the dimensionality of the problem when training single-cell networks is too large, but by giving the network information about the velocities in neighbouring cells it can better resolve the velocities. This difference is less noticeable when using stronger prior information (Figure 5b and 5d).

3.6 Uncertainty Loops

A key problem in the field of nonlinear inversion is that there are no standard solutions to which estimated posterior pdf’s can be compared in order to verify their quality. In almost all papers that use synthetic tests to assess competing methodologies in high-dimensional problems, the main criterion applied is whether the mean or maximum-likelihood model fits the real (true synthetic) model that was used to generate the synthetic data. This provides no test at all on the rest of the pdf and indeed there is no reason why the mean should match the true model in unknown problems - the mean may even be a zero-probability solution (one precluded by the data) (Tarantola, 2005). The maximum likelihood (or maximum posterior probability) model is an alternative, but usually an extremely volatile statistic of pdf solutions since those solutions are necessarily formed by focusing across the whole pdf rather than simply on its modes. We therefore require some independent property of posterior pdfs, the existence of which we can use to assess their veracity.

Loops or halos of high uncertainty have been shown to exist in solutions to all travel time tomography problems around anomalies with a spatially sharp and strong contrast in velocity compared to their surroundings (Galetti et al., 2015). Uncertainty loops exist due to non-linear aspects of wave physics and represent uncertainty in the shape of such anomalies. They are observed most clearly in fully non-linearized tomographic inversion problems in which rays, velocities and travel times are all varied in concert for each sample considered. We can therefore use the existence of loops in posterior uncertainty as a criterion to check their quality in models with strong and spatially sharp contrasts.

Figure 8 shows the standard deviations (bottom row) for the results of networks trained on an 8 x 8 grid. Only the networks trained using the training set of Uniformly random velocities (Figure 8f and h) exhibit signs of an uncertainty loop. We include the mean (middle row) for comparison of the shape of the velocity anomaly to the loop that surrounds it. The difference between the two priors is clear when comparing Figures 8f, 8g and 8h: for a smoothed prior (Figure 8g) the maximum uncertainty is predicted to be in the centre of the anomaly as opposed to the other two images where the uncertainty is lowest at the centre of the anomaly and highest on the margins as expected. However, when inverting for the full-model in a single network (Figure 8h) the loop is not as well defined as in Figure 8f. Together with the lack of clear trade-off relations in Figure 7 this is evidence that the full-model inversions are less robust than single-cell inversions: as the networks invert for many more parameters at once, they appear not to have been trained so as to fully represent the correct physics of the tomography problem.

The separate-cell networks (one network trained for each cell in the velocity model) allow us to estimate the full marginal posterior probabilities for all cells in the model, and these posterior distributions show how the network represents uncertainty. We show the pdfs for 3 points in the model: inside the velocity anomaly (star), at the edge of the anomaly (triangle), and in the background velocity (circle), where the locations are shown in Figure 8a. We can see for the 8 x 8 model using the Uniformly random training set (Figure 9a and c) the posterior pdf at the edge of the anomaly has a larger uncertainty indicating that the range of possible velocities spans the velocity of the anomaly and that of the background velocity. This is expected at the edge of an anomaly, the boundaries of which are uncertain: the cells could either be inside or outside of the anomaly, and could therefore assume values of the anomaly (low velocity) or the background model (high velocity). This is the maximum range of velocities expected across the model, hence the largest uncertainties should be around anomaly edges (Galetti et al., 2015).

We do not see uncertainty loops in any model trained on the smoothed models. This makes sense because by imposing prior information that the model is relatively smooth we have removed the possibility to include the effect of spatially sharp contrasts between anomalies and the background velocity model, precluding the types of physical trade-offs that create uncertainty loops. This is represented in the pdfs (Figure 9b and d) where the uncertainty is much smaller than in (a) and (e) and where there is no noticeable increase in uncertainties at the boundary of the anomaly. Note that there is again a larger information gain for the results from the smooth training set as shown by the KL divergence values.

3.7 Realistic Velocity Models

Figures 10 and 11 show the results when applying the trained networks to other types of structures that might be encountered in geophysical or non-destructive testing applications. Figure 10 shows results using Uniformly random training set, whereas Figure 11 shows the equivalent results obtained using the smoothed training set. The models inverted on a coarser grid produce reasonable estimates of the velocity models using either prior pdf, however for the smoothed prior all the models, regardless of grid size, are recovered fairly well. Figure 12 shows the uncertainty maps for a coarse grid model trained using both types of prior information and inverted using the separate-cell MDN models. When inverting the models with a Uniformly random prior (Figure 12b) the uncertainty maps show a higher uncertainty at the anomaly interfaces (as expected by analogy with the uncertainty loops above), thus helping to define uncertainty in the model geometry, whilst the results from the smooth prior miss this extra information.

4 DISCUSSION

We compared different methods of mixture density network inversions to estimate tomographic posterior probability density functions. When using datasets with little prior information (Figure 2a and 2b) the networks struggle to estimate more than the simplest of velocity models: due to the curse of dimensionality it is simply not possible to provide a sufficient density of prior samples on which to train the MDN. Including stronger prior information in our examples by training on smoothed velocity models (Figure 2d and 2e) improves inversion results, although the networks are no longer able to image sharp velocity contrasts, nor estimate uncertainty in the shapes and locations of spatially sharp velocity anomalies, as information about such models is not contained in the training set. Our tests indicate that the prior pdf is the most important factor in improving a network performance since it restricts both the training set and inversion results to a more constrained (effectively lower-dimensional) manifold embedded within the high-dimensional parameter space. This manifold is more densely sampled than the full space thus improving network training and performance. All test models inverted using the stronger prior information give higher SSIM and KL divergence values compared to those using weaker priors, regardless of grid size or how many pixels were inverted with each network. Also, the two circular anomalies in Figure 5 are symmetrical and this symmetry is also shown in all of the smooth-prior inversion results which is not seen in the Uniform-prior results in Figure 4. Nevertheless, we show that when imposed prior information is false (if the true model is rough but the prior precludes such models) then uncertainty results will be compromised as in Figure 8g and 8i. It should be noted that neither of the training sets created in this study are fully representative of the true Earth. In reality actual geophyisical features of the Earth are neither uniformly random or smoothed, but dependant on geological charateristics that can include smooth variations or shar boundaries. the results in Figures 10 and 11 show that different structures not seen in the training set can be recovered using this neural network method. However, a clearly advantageous strategy for the future of neural network tomography is to invest effort in finding and using more sophisticated, and correct prior information (Curtis and Wood, 2004). Recent efforts in this direction include Walker and Curtis (2014a) who use expert elicitation to constrain prior multi-point geostatistics, Mosser et al. (2018) who use neural networks to parametrise geological prior information, and Nawaz and Curtis (2017, 2018, 2019) who use Markovian models and variational methods with embedded neural and mixture density networks to combine geological and geophysical information; these various directions appear to be strategically important for the future of this field.

We illustrate the differences in the KL divergence values in Figure 13. The top graph shows histograms of KL values obtained when networks are applied to all synthetic test data for the four different prior and network training types for the 8 x 8 grid model, and the bottom graph is similar but for 16 x 16 models. Both plots confirm that training with a stronger prior increases the information gain in the posterior as was indicated in Figure 6. Notice that this is not necessarily an intuitively obvious result: if prior information is weaker or less informative, we might expect the data to add relatively more information, compared to the case where prior information is stronger. We therefore suspect that this result indicates that we simply can not train the MDN’s in the case of weaker prior information and sparser training examples; even though by adding stronger prior information we should decrease the relative value of the data, this effect is out-weighed by the fact that we can better train the network and thereby extract more information from data.

The effect of increasing the number of cells in the model is also clearly highlighted: Figure 13a has higher KL values than Figure 13b. Interestingly, both plots show that training using a full-model inversion slightly increases the KL divergence, implying that the networks are making use of the relationship between adjacent pixels to better constrain the posterior pdfs.

4.1 Inference limits

When creating the training dataset we set hard bounds on the grid cell velocities, thus limiting the range of velocity models that should be found using the trained networks. Figure 14 shows the inversion of a model at the limits of all training datasets. The middle row shows results when using the Uniformly random training dataset: none of the inversions give reliable results. Although the network trained to invert the full model at once performs slightly better, all networks produce extremely poor results. This is expected as the velocity model has a background velocity at the lower limit of the training sets, $0.5km/s$ and an anomaly at the upper limit, $2.5km/s$ . This is an extreme example that is not likely to have proximal samples in the training set, therefore the results are expected to be poor.

The same model lies outwith the dataset with a stronger prior as well, but networks appear to recognise that there is a velocity anomaly. However, since the prior dataset used is smoothed, strong contrasts are precluded and none of the networks give accurate velocity information, despite being able to represent the geometry of the structure.

4.2 Inversion Speed

As this is a prior sampling method the training dataset must be created in advance. It took $t_{prior}=$ 11 hours, to create the training dataset of 2.5 million samples using 5 CPUs on a Dell PowerEdge R820. However, this only needs to be done once; even if more prior information becomes available we may be able update our prior using the prior replacement method of Walker and Curtis (2014b) or the resampling method of Sambridge (1999) rather than calculating entirely new training examples.

In this work each network took between 1-2 hours to train (converge) using 16GB of RAM over 2 NVIDIA TITAN X GPUs. For the 8 x 8 grid models with an ensemble of 8 networks when training the network for each grid cell separately, we required $8\times 8\times 8=512$ networks in total and for the 16 x 16 models with an ensemble of 4 networks we required $16\times 16\times 4=2048$ networks. However, the training of each network is independent of others so the process can easily be parallelised and using 50 cores a full training run for the larger 16 x 16 grid model takes $t_{train}=$ 80 hours of real clock time. For the full-model networks only one network is trained for all cells so the total training time is much lower: each network takes around 3 hours to train using 48GB of RAM oer 4 NVIDIA TITAN X GPUs so training 10 networks only takes 30 hours without running them in parallel. This process could be reduced to 3 hours by using only 10 cores and reduced further by training each network across cores. The advantage of an MDN is the speed of inversion after training: once a network is trained new inversions take a fraction of a second, even on a standard desktop computer. Computational efficiency is therefore gained only when the trained networks will be applied to many different data sets.

Monte Carlo methods are known to be computationally expensive (Bodin and Sambridge, 2009) and a fully non-linear Markov chain Monte Carlo (McMC) tomographic inversion can take weeks or months of compute time. Monte Carlo methods use posterior sampling so for every new inversion a new sample set must be performed. This is often a far less demanding sampling task than sampling with similar density of samples from the prior since high probability parts of the posterior pdf usually span a significantly smaller volume of parameter space. Nevertheless, neural network methods are advantageous over traditional Monte Carlo methods when $n$ repeated inversions of similar data types are to be performed provided that $n>\frac{(t_{prior}+t_{train})}{t_{MC}}$ , as the computationally expensive sampling step only needs to be performed once and the network-based inversion becomes faster. In a tomographic setting this could be useful for monitoring purposes, where data collected periodically from the same set of sources and receivers can be inverted with the same network(s) each time new data arrives. However it should be noted that despite the longer computation time Monte Carlo methods can be used to produce higher resolution 2D or 3D models (Galetti et al., 2017; Zhang et al., 2019). Mixture density networks have also been shown to give conservative uncertainty estimates compared to Monte Carlo methods (Käufl et al., 2016).

4.3 Training Flexibility

In this work we train networks assuming that the data (travel times) are recorded with exactly the same data acquisition geometry as was used for training. It would also be possible to train more flexible networks that account for missing data. For example, one could augment the training set with additional samples constructed from the same data-model pairs ${(\mathbf{d}_{i},\mathbf{m}_{i}):i=1,...,N}$ but with a certain number of travel time values in the dataset randomly set to [math], to indicate a missing value (De Wit et al., 2013). Then new datasets with a missing values (for example due to a noisy stations causing errors in travel times) can be inverted by the same network.

Data from new receivers added after training the network cannot be used. However we can create a new training set containing only the data from the added receiver station and fine tune the original network by using the original network parameter values as a starting point for training optimisation. This has the advantage that the training process will be much faster.

5 CONCLUSION

We present neural network-based, non-linear inversion methods applied to a 2D travel time tomography problem to estimate posterior probability density functions. The flexibility of mixture density networks mean that we can provide uncertainty estimates for 2D velocity maps. We show that the prior information used to create the training dataset is the most important factor in providing accurate velocity estimates and uncertainties as such information effectively reduces the dimensionality of the tomography problem. However, as with all Bayesian inversions if we impose false prior information we can lose important information about uncertainties. By training networks to invert for a full tomographic model at once, we can also understand the relationship between velocities in neighbouring pixels; however the number of parameters in the inversion increases substantially, and training for accurate models proves to be significantly more difficult. We compare the speed of neural network inversion to more standard Monte Carlo methods and determine that for many repeated inversions such as occur in monitoring situations, MDNs may out-perform Monte Carlo methods in terms of computational cost.

Acknowledgements

The authors thank the Edinburgh Interferometry Project sponsors (Equinor, Schlumberger Cambridge Research and Total) for supporting this research.

Appendix 1: Network configurations

The networks trained on individual cells used 4 fully connected layers (FC), where each node receives an input from every node in the previous layer. In between each node of the fully connected (FC) layers a rectified linear unit (ReLU) is used. The individual layer sizes and the total number of parameters to be trained in each networks is outlined in Table 1.

Networks trained on the whole model (all cells at once) used a convolutional network with 3 convolutional layers (Conv) and 4 fully connected layers. The sizes of each layer and the total number of parameters to be trained in each networks is outlined in Table 2.

Appendix 2: Structural Similarity Index Measure (SSIM)

We use the form of the SSIM metric described in Wang et al. (2004). Let $x$ and $y$ be a window of $N$ x $N$ size. We calculate the luminance $l(x,y)$ , contrast $c(x,y)$ and structure $s(x,y)$ defined as:

[TABLE]

where $\mu$ and $\sigma$ are the mean and variance of the windows $x$ or $y$ and $\sigma_{xy}$ is the covariance of $x$ and $y$ . To avoid instability in the division, constants $C_{1}$ , $C_{2}$ and $C_{3}$ are defined as $C_{1}=(k_{1}L)^{2}$ and $C_{2}=(k_{2}L)^{2}$ where $L$ is the dynamic range of the cell values while $k_{1}=0.01$ and $k_{2}=0.03$ , and $C_{3}=C_{2}/2$ . The three components are combined to give the full SSIM:

[TABLE]

where $\alpha$ , $\beta$ and $\gamma$ are weighting parameters. Setting $\alpha=\beta=\gamma=1$ we can simplify the expression to:

[TABLE]

We perform the calculation over sliding windows and take the mean of the resulting $SSIM(x,y)$ values. For the 8 x 8 models we use 3x3 windows and the 16 x 16 models use 7x7 windows, so that the windows cover a similar spatial area.

Bibliography45

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Aki et al. (1977) Aki, K., Christoffersson, A., and Husebye, E. S. (1977). Determination of the three-dimensional seismic structure of the lithosphere. Journal of Geophysical Research , 82 (2), 277–296.
2Araya-Polo et al. (2018) Araya-Polo, M., Jennings, J., Adler, A., and Dahlke, T. (2018). Deep-learning tomography. The Leading Edge , 37 (1), 58–66.
3Bergstra et al. (2015) Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., and Cox, D. D. (2015). Hyperopt: a python library for model selection and hyperparameter optimization. Computational Science & Discovery , 8 (1), 014008.
4Bianco and Gerstoft (2018) Bianco, M. J. and Gerstoft, P. (2018). Travel time tomography with adaptive dictionaries. IEEE Transactions on Computational Imaging , 4 (4), 499–511.
5Bishop (1995) Bishop, C. M. (1995). Neural Networks for Pattern Recognition . Oxford University Press.
6Bodin and Sambridge (2009) Bodin, T. and Sambridge, M. (2009). Seismic tomography with the reversible jump algorithm. Geophysical Journal International , 178 (3), 1411–1436.
7Cao et al. (2020) Cao, R., Earp, S., de Ridder, S. A., Curtis, A., and Galetti, E. (2020). Near-real-time near-surface 3d seismic velocity and uncertainty models by wavefield gradiometry and neural network inversion of ambient seismic noise. Geophysics , 85 (1), KS 13–KS 27.
8Curtis and Lomax (2001) Curtis, A. and Lomax, A. (2001). Prior information, sampling distributions, and the curse of dimensionality. Geophysics , 66 (2), 372–378.