Unsupervised Learning of Probabilistic Diffeomorphic Registration for   Images and Surfaces

Adrian V. Dalca; Guha Balakrishnan; John Guttag; Mert R. Sabuncu

arXiv:1903.03545·cs.CV·July 26, 2019

Unsupervised Learning of Probabilistic Diffeomorphic Registration for Images and Surfaces

Adrian V. Dalca, Guha Balakrishnan, John Guttag, Mert R. Sabuncu

PDF

2 Repos

TL;DR

This paper introduces an unsupervised, probabilistic learning framework for diffeomorphic registration of images and surfaces, combining classical and modern CNN-based methods to achieve fast, accurate, and topology-preserving registration with uncertainty estimates.

Contribution

It presents a novel probabilistic generative model and inference algorithm that unites classical registration principles with deep learning, enabling unsupervised, diffeomorphic registration with uncertainty quantification.

Findings

01

Achieves state-of-the-art accuracy in 3D brain registration

02

Provides very fast registration runtimes

03

Guarantees topology preservation through diffeomorphic constraints

Abstract

Classical deformable registration techniques achieve impressive results and offer a rigorous theoretical treatment, but are computationally intensive since they solve an optimization problem for each image pair. Recently, learning-based methods have facilitated fast registration by learning spatial deformation functions. However, these approaches use restricted deformation models, require supervised labels, or do not guarantee a diffeomorphic (topology-preserving) registration. Furthermore, learning-based registration tools have not been derived from a probabilistic framework that can offer uncertainty estimates. In this paper, we build a connection between classical and learning-based methods. We present a probabilistic generative model and derive an unsupervised learning-based inference algorithm that uses insights from classical registration methods and makes use of recent…

Tables3

Table 1. Table 1: Summary of results: mean Dice scores over all anatomical structures and subjects (higher is better), mean runtime; mean Jacobian determinant; and mean number of locations with a non-positive Jacobian determinants of each registration field (lower is better). All methods have comparable Dice scores, while our method and the other VoxelMorph variants are orders of magnitude faster than ANTs or NiftiReg. Only our presented method, VoxelMorph-diff, achieves both high accuracy and fast runtime while also having nearly zero non-negative Jacobian locations. All methods have mean Jacobian determinants close to 1, indicating smooth deformations. Each aspect of these results is studied in more details in the rest of the experiments and figures.

Method	Avg. Dice	GPU sec	CPU sec	mean $\| J_{Φ} \|$	$\| J_{Φ} \| \leq 0$
Affine only	0.584 (0.157)	0	0	1	0
ANTs (SyN)	0.749 (0.136)	-	9059 (2023)	1.001 (0.036)	7523 (4790)
NiftyReg (CC)	0.755 (0.143)	-	2347 (202)	1.072 (0.131)	33838 (8307)
VoxelMorph (CC)	0.753 (0.145)	0.45 (0.01)	57 (1.0)	1.032 (0.074)	19715 (3540)
Supervised-diff	0.730 (0.144)	0.35 (0.03)	82.6 (3.8)	1.088 (0.121)	0.05 (0.5)
`VoxelMorph-diff`	0.754 (0.139)	0.47 (0.01)	84.2 (0.1)	1.075 (0.124)	0.2 (1.0)

Table 2. Table 2: Regularity measures for image and surface models on the test set. Leveraging diffeomorphic aspect of our joint image and surface model, VoxelMorph-surf preserved very low numbers of folding voxels even when training with example surfaces.

Method	$\| J \| \leq 0$	$% of \| J \| \leq 0$
ANTs SyN (CC)	9060 (4445)	0.545 (0.267)
NiftyReg (CC)	40425 (9901)	2.431 (0.595)
VoxelMorph (CC)	19077 (5928)	1.147 (0.360)
VoxelMorph-diff	0.1 (1.2)	6.1e-6 (7.6e-5)
VoxelMorph-surf (cer.w.m.)	3.0 (6.2)	1.8e-4 (3.8e-4)
VoxelMorph-surf (cer.cor.)	3.4 (6.4)	2.0e-4 (3.9e-4)
VoxelMorph-surf (lat.ven.)	4.0 (8.0)	2.3e-4 (4.8e-4)
VoxelMorph-surf (thalamus)	4.3 (8.2)	2.6e-4 (4.9e-4)
VoxelMorph-surf (hip.)	2.7 (5.8)	1.6e-4 (3.5e-4)

Table 3. Table 3: Accuracy and deformation regularity for the different variational approximations and two dramatically different values for smoothing parameter λ 𝜆 \lambda . We find that for a given parameter value, the approximations lead to comparable accuracy and number of folding voxels.

	Dice		$\| J \| \leq 0$
$λ$	diagonal	extension	diagonal	extension
$1$	0.74 (0.01)	0.74 (0.01)	2934 (2007)	2720 (1593)
$20$	0.75 (0.01)	0.75 (0.01)	0.32 (0.96)	0.16 (0.57)

Equations56

\frac{\partial ϕ ^{(t)}}{\partial t} = v (ϕ^{(t)})

\frac{\partial ϕ ^{(t)}}{\partial t} = v (ϕ^{(t)})

p (z) = N (z; 0, Σ_{z}),

p (z) = N (z; 0, Σ_{z}),

p (f ∣ z; m)

p (f ∣ z; m)

ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z ∣ f; m)]

ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z ∣ f; m)]

= ψ min I E_{q} [lo g q_{ψ} (z ∣ f; m) - lo g p (z ∣ f; m)]

= ψ min I E_{q} [lo g q_{ψ} (z ∣ f; m) - lo g p (z, f; m)] + lo g p (f; m)

= ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z)] - I E_{q} [lo g p (f ∣ z; m)] + const,

q_{ψ} (z ∣ f; m) = N (z; μ_{z ∣ m, f}, Σ_{z ∣ m, f}),

q_{ψ} (z ∣ f; m) = N (z; μ_{z ∣ m, f}, Σ_{z ∣ m, f}),

L (ψ; f, m) = - I E_{q} [lo g p (f ∣ z; m)]

L (ψ; f, m) = - I E_{q} [lo g p (f ∣ z; m)]

+ KL [q_{ψ} (z ∣ f; m) ∣∣ p (z)]

= \frac{1}{2 σ ^{2} K} k \sum ∣∣ f - m \circ ϕ_{z_{k}} ∣ ∣^{2}

+ \frac{1}{2} [tr (λ D Σ_{z ∣ x; y} - lo g Σ_{z ∣ x; y}) + μ_{z ∣ m, f}^{T} Λ_{z} μ_{z ∣ m, f}]

+ const,

\hat{z}_{k}

\hat{z}_{k}

p (s_{f} ∣ z; s_{m})

p (s_{f} ∣ z; s_{m})

ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z ∣ f, s_{f}; m, s_{m})]

ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z ∣ f, s_{f}; m, s_{m})]

=

- I E_{q} [lo g p (s_{f} ∣ z; s_{m})],

L (ψ; f, s_{f}, m, s_{m})

L (ψ; f, s_{f}, m, s_{m})

= \frac{1}{2} [tr (λ D Σ_{z ∣ x; y} - lo g Σ_{z ∣ x; y}) + μ_{z ∣ m, f}^{T} Λ_{z} μ_{z ∣ m, f}]

+ \frac{1}{2 σ _{I}^{2} K} k \sum ∣∣ f - m \circ ϕ_{z_{k}} ∣ ∣^{2}

+ \frac{1}{2 σ _{s}^{2} K} k \sum ∣∣ s_{f} - s_{m} \circ ϕ_{z_{k}} ∣ ∣^{2} .

∥ s_{f} - s_{m} \circ ϕ_{z_{k}} ∥^{2}

∥ s_{f} - s_{m} \circ ϕ_{z_{k}} ∥^{2}

\approx \frac{1}{2} n \sum sd (s_{f} [n] \circ ϕ_{z}^{- 1}, s_{m}) + n \sum sd (s_{m} [n] \circ ϕ_{z}, s_{f}) .

L (ψ; m, f)

L (ψ; m, f)

= - I E_{q} [lo g N (m; z \circ f; σ^{2} I)] + KL [N (z; μ_{z ∣ m, f}, Σ_{z ∣ m, f}) ∣∣ N (z; \b 0, Λ_{z})]

= \frac{1}{2} I E_{q} [lo g 2 π σ^{2 d} + \frac{1}{σ ^{2}} ∥ m - f \circ ϕ_{z} ∥^{2}] + KL [N (z; μ_{z ∣ m, f}, Σ_{z ∣ m, f}) ∣∣ N (z; \b 0, Λ_{z})]

= \frac{1}{2 σ ^{2}} I E_{q} [∥ m - f \circ ϕ_{z_{k}} ∥^{2}] + \frac{1}{2} [lo g \frac{∣ Λ _{z}^{- 1} ∣}{∣ Σ _{z ∣ m, f} ∣} - 3 d + tr (Λ_{z} Σ_{z ∣ m, f}) + μ_{z ∣ m, f}^{T} Λ_{z} μ_{z ∣ m, f}] + const .

L (ψ; m, f)

L (ψ; m, f)

ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z ∣ f, s_{f}; m, s_{m})]

ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z ∣ f, s_{f}; m, s_{m})]

= ψ min I E_{q} [lo g q_{ψ} (z ∣ f; m) - lo g p (z ∣ f, s_{f}; m, s_{m})]

= ψ min I E_{q} [lo g q_{ψ} (z ∣ f; m) - lo g p (z, f, s_{f}; m, s_{m})] + lo g p (f, s_{f}; m, s_{m})

= ψ min I E_{q} [lo g q_{ψ} (z ∣ f; m) - lo g p (z) - lo g p (f, s_{f} ∣ z; m, s_{m})] + const

= ⋆ ψ min I E_{q} [lo g q_{ψ} (z ∣ f; m) - lo g p (z) - lo g p (f ∣ z; m) - lo g p (s_{f} ∣ z; s_{m})] + const

= ψ min KL [q_{ψ} (z ∣ f; m) ∣∣ p (z)] - I E_{q} [lo g p (f ∣ z; m)] - I E_{q} [lo g p (s_{f} ∣ z; s_{m})] + const,

I E_{q} [lo g p (s_{f} ∣ z; s_{m})]

I E_{q} [lo g p (s_{f} ∣ z; s_{m})]

= \frac{1}{2} I E_{q} [lo g 2 π σ_{s}^{2 d} + \frac{1}{σ _{s}^{2}} ∥ s_{f} - s_{m} \circ ϕ_{z_{k}} ∥^{2}]

= \frac{1}{2} I E_{q} [\frac{1}{σ _{s}^{2}} ∥ s_{f} - s_{m} \circ ϕ_{z_{k}} ∥^{2}] + const.

L (ψ; f, s_{f}, m, s_{m})

L (ψ; f, s_{f}, m, s_{m})

+ \frac{1}{2} [tr (λ D Σ_{z ∣ x; y} - lo g Σ_{z ∣ x; y}) + μ_{z ∣ m, f}^{T} Λ_{z} μ_{z ∣ m, f}] + const .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Unsupervised Learning

of Probabilistic Diffeomorphic Registration

for Images and Surfaces

Adrian V. Dalca

MIT and MGH

[email protected]

&Guha Balakrishnan

MIT

[email protected]

&John Guttag

MIT

[email protected] &Mert R. Sabuncu

Cornell University

[email protected]

Abstract

Classical deformable registration techniques achieve impressive results and offer a rigorous theoretical treatment, but are computationally intensive since they solve an optimization problem for each image pair. Recently, learning-based methods have facilitated fast registration by learning spatial deformation functions. However, these approaches use restricted deformation models, require supervised labels, or do not guarantee a diffeomorphic (topology-preserving) registration. Furthermore, learning-based registration tools have not been derived from a probabilistic framework that can offer uncertainty estimates.

In this paper, we build a connection between classical and learning-based methods. We present a probabilistic generative model and derive an unsupervised learning-based inference algorithm that uses insights from classical registration methods and makes use of recent developments in convolutional neural networks (CNNs). We demonstrate our method on a 3D brain registration task for both images and anatomical surfaces, and provide extensive empirical analyses. Our principled approach results in state of the art accuracy and very fast runtimes, while providing diffeomorphic guarantees. Our implementation is available at http://voxelmorph.csail.mit.edu.

K****eywords medical image registration $\cdot$ diffeomorphic registration $\cdot$ invertible registration $\cdot$ probabilistic modeling $\cdot$ convolutional neural networks $\cdot$ variational inference $\cdot$ machine learning

1 Introduction

Deformable registration computes a dense correspondence between two images, and is fundamental to many medical image analysis tasks. Classical registration techniques have been rigorously developed and studied, but require computationally intensive optimization for each image pair, often requiring tens of minutes to hours of compute time on a CPU. Recent, learning-based registration methods achieve fast runtimes by building on machine learning developments, but largely omit rigorous theoretical treatment of deformations and topology-preserving guarantees. In this work, we present an approach that builds on the strengths of both paradigms, and overcomes these shortcomings. We provide a rigorous connection between probabilistic generative models for deformations and learning algorithms based on convolutional neural networks (CNNs). We also demonstrate that the learning can be done end-to-end in an unsupervised fashion for this model. The resulting framework provides registration for a new image pair in under a second on a GPU, while maintaining guarantees developed for classical methods.

Our formulation casts registration as variational inference on a probabilistic generative model. This framework naturally results in an algorithm that leverages a collection of images to learn a global convolutional neural network with an intuitive cost function. Importantly, we introduce diffeomorphic integration layers combined with a spatial transform layer to enable unsupervised end-to-end learning for diffeomorphic registration. We demonstrate that our algorithm achieves state-of-the-art registration accuracy while providing diffeomorphic deformations and fast runtime, and can estimate of registration uncertainty. In our experiments we focus on the example of registering 3D MR brain scans, using a multi-study dataset of over 3,500 scans. However, the method is broadly applicable to many registration tasks.

This paper extends a preliminary version of this work presented at the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2018 conference [18]. We build on that work by providing theoretical extensions, new results, analysis, and discussion. We first expand the model, including a natural extension to anatomical surfaces. In our experiments, we add baselines, new experiments on registration of both images and surfaces, and provide an analysis of the effect of our diffeomorphic implementation on field regularity and runtime. We implement our method as part of the registration framework called VoxelMorph, which is available at http://voxelmorph.csail.mit.edu.

1.1 Related Works

1.1.1 Classical Registration Methods

Classical methods solve an optimization over the space of deformations [5, 7, 8, 11, 19, 28, 66, 69, 70]. Common representations are displacement vector fields, including elastic-type models [8, 21, 62], free-form deformations with b-splines [61], statistical parametric mapping [6], Demons [56, 66], and more recently discrete methods [19, 30, 28].

Constraining the allowable transformations to diffeomorphisms ensures certain desirable properties, such as preservation of topology. Diffeomorphic transforms have seen extensive methodological development, yielding state-of-the-art tools, such as Large Diffeomorphic Distance Metric Mapping (LDDMM) [11, 14, 15, 32, 37, 49, 55, 70], DARTEL [5], diffeomorphic Demons [67], and symmetric normalization (SyN) [7]. In general, these tools demand substantial time and computational resources for a given image pair.

Some recent GPU-based iterative algorithms use these frameworks to develop faster algorithms by requiring a GPU to be available for each registration [51, 50]. Recent learning-based registration methods have demonstrated that they can provide good initializations to iterative GPU methods [10] to further improve runtime.

Probabilitic image registration methods specify priors on the deformation between two images, and likelihood models that describe image intensities [63, 70, 31, 58, 3]. These formulations also lead to iterative optimization methods, but can yield distributions of deformation fields. In this paper, we build on these models by presenting a general variational inference strategy to optimize a global neural network that efficiently outputs distributions of deformations.

1.1.2 Learning-based Registration

Recent methods have proposed to train neural networks that map a pair of input images to an output deformation. Most earlier approaches demonstrated the feasibility of deep learning based registration, and required ground truth registration fields [13, 42, 59, 64, 68]. Such ground truth deformations are often derived via more conventional registration tools or simulations, sometimes limiting their applicability.

Building on the successful demonstration of these methods, several recent papers [9, 10, 23, 22, 44] explore unsupervised, or end-to-end, strategies. These methods employ a neural network that computes spatial transformation [36] to warp one image to another, enabling end-to-end training. A recent approach builds on these methods by learning a spatially-adaptive regularizer within a registration model [54] These approaches use machine learning techniques to achieve efficient training and fast runtimes, but build on classical registration development, such as probabilistic models and diffeomorphic theory. In our work, we bridge these two paradigms to offer classical guarantees within a machine learning approach. We note the contemporaneous development of a method that uses a conditional variational auto encoder (CVAE) to learn diffeomorphic representations [43, 41]. Similar to our method, this approach uses a variational strategy to learn a network to predict a stationary velocity field (SVF). However, the authors focus on representing the SVF through the manifold of the CVAE, and focus on the anatomical variation captured through this encoding

Recent methods proposed using segmentation-based cost functions, such as Dice [25], to replace the image-based similarity term when segmentations are available during training for multi-modal registration, such as T2w MRI and 3D ultrasound, within the same subject [35, 34]. We extend this line of work by showing that our generative probabilistic model naturally describes the deformation of surfaces, therefore enabling the use of segmentations during training within a single cohesive framework. The extended model results in a combination of segmentation (surface) and image-based training losses.

1.1.3 Surface-based Registration

In this paper, we also present an extension to our main contribution which enables alignment of surfaces. In medical image registration, surface matching methods often use surface coordinates or geometric features extracted from anatomical structures [2, 26, 47, 57]. Several methods treat surfaces as 3D point sets with shape descriptors, and often use Iterated Closest Point based optimization methods to find the shape correspondences [12]. Currents, defined as unconnected oriented points, have been used to register surfaces, for example using Matching Pursuit algorithms [26]. Some methods combine volume and surface registration, often using surface registrations to initialize dense volumetric transforms [57]. Similar to volume registration, these classical surface matching methods use iterative optimization strategies, requiring significant computational resources. Building on these methods, we use a 3D point representation jointly with images to achieve fast registration with neural networks, enabled by a differentiable surface distance function.

1.2 Background: Diffeomorphic Registration

Although the method presented in this paper applies to a multitude of deformable representations, we choose to work with diffeomorphisms, and in particular with a stationary velocity field representation [5]. Diffeomorphic deformations are differentiable and invertible, and thus preserve topology. Let $\boldsymbol{\phi}:\mathbb{R}^{3}\rightarrow\mathbb{R}^{3}$ represent the deformation that maps the coordinates from one image to coordinates in another image. In our implementation, the deformation field is defined through the following ordinary differential equation (ODE):

[TABLE]

where $\boldsymbol{\phi}^{(0)}=Id$ is the identity transformation and $t$ is time. We integrate the stationary velocity field $\boldsymbol{v}$ over $t=[0,1]$ to obtain the final registration field $\boldsymbol{\phi}^{(1)}$ [53].

While we implement and evaluate several numerical integration techniques, we find scaling and squaring to be most efficient, and we briefly review the technique here [4]. The integration of a stationary ODE represents a one-parameter subgroup of diffeomorphisms. In group theory, $\boldsymbol{v}$ is a member of the Lie algebra and is exponentiated to produce $\boldsymbol{\phi}^{(1)}$ , which is a member of the Lie group: $\boldsymbol{\phi}^{(1)}=\text{exp}(\boldsymbol{v})$ . From the properties of one-parameter subgroups, for any scalars $t$ and $t^{\prime}$ , $\exp((t+t^{\prime})\boldsymbol{v})=\exp(t\boldsymbol{v})\circ\exp(t^{\prime}\boldsymbol{v})$ , where $\circ$ is a composition map associated with the Lie group. Starting from $\boldsymbol{\phi}^{(1/2^{T})}=\boldsymbol{p}+\boldsymbol{v}(\boldsymbol{p})/2^{T}$ where $\boldsymbol{p}$ is a map of spatial locations, we use the recurrence $\boldsymbol{\phi}^{(1/2^{t-1})}=\boldsymbol{\phi}^{(1/2^{t})}\circ\boldsymbol{\phi}^{(1/2^{t})}$ to obtain $\boldsymbol{\phi}^{(1)}=\boldsymbol{\phi}^{(1/2)}\circ\boldsymbol{\phi}^{(1/2)}$ . $T$ is chosen so that $\boldsymbol{v}/2^{T}$ is very small.

2 Methods

We let $\boldsymbol{f}$ and $\boldsymbol{m}$ be 3D images, such as MRI volumes, and let $\boldsymbol{z}$ be a latent variable that parametrizes a transformation function $\boldsymbol{\phi}_{\boldsymbol{z}}:\mathbb{R}^{3}\rightarrow\mathbb{R}^{3}$ . We propose a generative model that describes the formation of $\boldsymbol{f}$ by warping $\boldsymbol{m}$ via $\boldsymbol{m}\circ\boldsymbol{\phi}_{\boldsymbol{z}}$ . We propose a variational inference approach that leverages a convolutional neural network with diffeomorphic integration and spatial transform layers. We learn network parameters in an unsupervised fashion, without access to ground truth registrations. We describe how the network yields fast diffeomorphic registration of a new image pair $(\boldsymbol{f},\boldsymbol{m})$ , in a probabilistic framework. We expand this treatment by including anatomical surface alignment, which enables training the network given (optional) anatomical segmentations.

2.1 Generative Model

We model the prior probability of the parametrization $\boldsymbol{z}$ as:

[TABLE]

where $\mathcal{N}(\cdot;\boldsymbol{\mu},\boldsymbol{\Sigma})$ is the multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ . Our work applies to a wide range of representations $\boldsymbol{z}$ . For example, $\boldsymbol{z}$ could be a dense displacement field, or a low-dimensional embedding of the displacement field. In this paper, we let $\boldsymbol{z}$ be a stationary velocity field that specifies a diffeomorphism through the ODE (1). We let $\boldsymbol{L}=\boldsymbol{D}-\boldsymbol{A}$ be the Laplacian of a neighborhood graph defined on the voxel grid, where $\boldsymbol{D}$ is the graph degree matrix, and $\boldsymbol{A}$ is a voxel neighbourhood adjacency matrix. We encourage spatial smoothness of the velocity field $\boldsymbol{z}$ by setting $\boldsymbol{\Sigma}_{z}^{-1}=\boldsymbol{\Lambda}_{z}=\lambda\boldsymbol{L}$ , where $\boldsymbol{\Lambda}_{z}$ is a precision matrix and $\lambda$ denotes a parameter controlling the scale of the velocity field $\boldsymbol{z}$ .

We let $\boldsymbol{f}$ be a noisy observation of warped image $\boldsymbol{m}$ :

[TABLE]

where $\sigma_{I}^{2}$ captures the variance of additive image noise.

We aim to estimate the posterior registration probability $p(\boldsymbol{z}|\boldsymbol{f};\boldsymbol{m})$ . Using this, we can obtain the most likely registration field $\boldsymbol{\phi}_{\boldsymbol{z}}$ for a new image pair $(\boldsymbol{f},\boldsymbol{m})$ via MAP estimation, along with an estimate of velocity field variance at each voxel. Figure 1 provides a graphical representation of our model.

2.2 Learning

Given our assumptions, computing the posterior probability $p(\boldsymbol{z}|\boldsymbol{f};\boldsymbol{m})$ is intractable. We use a variational approach, and introduce an approximate posterior probability $q_{\boldsymbol{\psi}}(\boldsymbol{z}|\boldsymbol{f};\boldsymbol{m})$ parametrized by $\boldsymbol{\psi}$ . We minimize the KL divergence

[TABLE]

which yields the negative of the variational lower bound of the model evidence [39]. We model the approximate posterior $q_{\boldsymbol{\psi}}(\boldsymbol{z}|\boldsymbol{f};\boldsymbol{m})$ as a multivariate normal:

[TABLE]

where we let $\boldsymbol{\Sigma}_{z|m,f}$ be diagonal. To understand the effects of this assumption, we explore a non-diagonal covariance in a later section. The statistics $\boldsymbol{\mu}_{z|m,f}$ and the diagonal of $\boldsymbol{\Sigma}_{z|m,f}$ can be interpreted as the voxel-wise mean and variance, respectively.

We estimate $\boldsymbol{\mu}_{z|m,f}$ , and $\boldsymbol{\Sigma}_{z|m,f}$ using a convolutional neural network $\text{def}_{\boldsymbol{\psi}}(\boldsymbol{f},\boldsymbol{m})$ parameterized by $\boldsymbol{\psi}$ , as described in the next section. We learn parameters $\boldsymbol{\psi}$ by optimizing the variational lower bound (4) using stochastic gradient methods. Specifically, for each image pair $(\boldsymbol{f},\boldsymbol{m})$ and sample $\boldsymbol{z}_{k}\sim q_{\psi}(\boldsymbol{z}|\boldsymbol{f};\boldsymbol{m})$ , we compute $\boldsymbol{m}\circ\boldsymbol{\phi}_{z_{k}}$ , with the resulting loss (detailed derivation in supplementary material):

[TABLE]

where $K$ is the number of samples used to approximate the expectation. The first term encourages image $\boldsymbol{f}$ to be similar to the warped image $\boldsymbol{m}\circ\boldsymbol{\phi}_{z_{k}}$ . The second term encourages the posterior to be close to the prior $p(\boldsymbol{z})$ . Although the variational covariance $\boldsymbol{\Sigma}_{z|m,f}$ is diagonal, the last term spatially smoothes the mean, which can be seen by expanding $\boldsymbol{\mu}_{z|m,f}^{T}\boldsymbol{\Lambda}_{z}\boldsymbol{\mu}_{z|m,f}=\frac{\lambda}{2}\sum\sum_{j\in N(I)}(\boldsymbol{\mu}[i]-\boldsymbol{\mu}[j])^{2}$ , where $N(i)$ are the neighbors of voxel $i$ . We treat $\sigma^{2}$ and $\lambda$ as fixed hyper-parameters that we investigate in our experiments, and use $K=1$ .

2.3 Neural Network Framework

We design the network $\text{def}_{\boldsymbol{\psi}}(\boldsymbol{f},\boldsymbol{m})$ that takes as input $\boldsymbol{f}$ and $\boldsymbol{m}$ and outputs $\boldsymbol{\mu}_{z|m,f}$ and $\boldsymbol{\Sigma}_{z|m,f}$ , based on a 3D UNet-style architecture [60]. The network includes a convolutional layer with 32 filters, four downsampling layers with 64 convolutional filters and a stride of two, and three upsampling convolutional layers with 64 filters. We only upsample three times to predict the velocity field (and following integration steps) at every two voxels, to enable these operations to fit in current GPU card memory.

To enable unsupervised learning of parameters $\boldsymbol{\psi}$ using (6), we must form $\boldsymbol{m}\circ\boldsymbol{\phi}_{z}$ and compute the data term. We first implement a layer that samples a new $\boldsymbol{z}_{k}\sim\mathcal{N}(\boldsymbol{\mu}_{z|m,f},\boldsymbol{\Sigma}_{z|m,f})$ using the “re-parameterization trick" [39]: $\boldsymbol{z}_{k}=\boldsymbol{\mu}_{z|m,f}+\sqrt{\boldsymbol{\Sigma}_{z|m,f}}\boldsymbol{r}$ , where $\boldsymbol{r}$ is a sample from the standard normal: $\boldsymbol{r}\sim\mathcal{N}(0,\boldsymbol{I})$ .

Given $\boldsymbol{z}_{k}$ , we need to compute $\boldsymbol{\phi}_{z_{k}}=\exp(\boldsymbol{z}_{k})$ as described in the introduction. We propose vector integration layers using scaling and squaring operations. Specifically, scaling and squaring operations involve compositions within the neural network architecture using a differentiable spatial transformation operation. Given two 3D vector fields $\boldsymbol{a}$ and $\boldsymbol{b}$ , for each voxel $p$ this operation computes $(\boldsymbol{a}\circ\boldsymbol{b})(p)=\boldsymbol{a}(\boldsymbol{b}(p))$ , a non-integer voxel location $\boldsymbol{b}(\boldsymbol{p})$ in $\boldsymbol{a}$ , using linear interpolation. Starting with $\boldsymbol{\phi}^{(1/2^{T})}=\boldsymbol{p}+\boldsymbol{z}_{k}/2^{T}$ , we compute $\boldsymbol{\phi}^{(1/2^{t-1})}=\boldsymbol{\phi}^{(1/2^{t})}\circ\boldsymbol{\phi}^{(1/2^{t})}$ recursively using these operations T times, leading to $\boldsymbol{\phi}^{(1)}\triangleq\boldsymbol{\phi}_{z_{k}}=\exp(\boldsymbol{z}_{k})$ . In our experiments, we extensively analyze the effect of the step size $T$ on the runtime of the network, the accuracy of the registration, and the regularity of the deformation. We also implement vector integration layers using quadrature and ODE solvers, and in the experiments show that these are significantly slower and can require significant memory.

Finally, we warp volume $\boldsymbol{m}$ according to the computed diffeomorphic field $\boldsymbol{\phi}_{z_{k}}$ using a spatial transform layer.

In summary, the network takes as input images $\boldsymbol{f}$ and $\boldsymbol{m}$ , computes statistics $\boldsymbol{\mu}_{z|m,f}$ and $\boldsymbol{\Sigma}_{z|m,f}$ , samples a new velocity field $\boldsymbol{z}_{k}\sim\mathcal{N}(\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k})$ , computes a diffeomorphic $\boldsymbol{\phi}_{z_{k}}$ and warps $\boldsymbol{m}$ . Since all the steps are designed to be differentiable, we learn the network parameters using stochastic gradient descent-based methods. This network results in three outputs, $\boldsymbol{\mu}_{z|m,f},\boldsymbol{\Sigma}_{z|m,f}$ and $\boldsymbol{m}\circ\boldsymbol{\phi}_{z_{k}}$ , which are used in the model loss (6). The framework is summarized in Figure 2.

2.4 Registration

Given learned parameters, we approximate registration of a new scan pair $(\boldsymbol{f},\boldsymbol{m})$ using $\boldsymbol{\phi}_{\hat{z}_{k}}$ . We first obtain the most likely velocity field $\hat{\boldsymbol{z}}_{k}$ using

[TABLE]

by evaluating the neural network $\text{def}_{\psi}(\boldsymbol{f},\boldsymbol{m})$ . We then compute $\boldsymbol{\phi}_{\hat{z}_{k}}$ using the scaling and squaring based integration, altogether requiring less than a second on a GPU. We highlight that at test time, the diagonal covariance $\boldsymbol{\Sigma}_{z|m,f}$ is not used, however it enables an estimation of the deformation uncertainty. Analysis of uncertainty is beyond the scope of this paper, and is an interesting avenue for future study.

Using a stationary velocity field representation, computing the inverse deformation field $\boldsymbol{\phi}^{-1}_{\boldsymbol{z}}$ can be achieved by integrating the negative of the velocity field: $\boldsymbol{\phi}^{-1}_{\boldsymbol{z}}=\boldsymbol{\phi}_{-\boldsymbol{z}}$ , since $\boldsymbol{\phi}_{\boldsymbol{z}}\circ\boldsymbol{\phi}_{-\boldsymbol{z}}=\exp(\boldsymbol{z})\circ\exp(-\boldsymbol{z})=\exp(\boldsymbol{z}-\boldsymbol{z})=Id$ [5, 52]. This enables the computation of both fields inside one efficient network when desired.

2.5 Implementation

We implement our method as part of the VoxelMorph package [9], available online at http://voxelmorph.csail.mit.edu, using neuron [20] and Keras [16] with a Tensorflow [1] backend. We use a learning rate of $1e-4$ for the Adam optimizer [38], a batch size of 1 due to memory constraints, and Glorot uniform initialization for the convolution weights. We use a single sample ( $K=1$ ), which has been shown to lead to useful gradients for optimization while maintaining the memory footprint and implementation complexity low [39]. For large volumes, the number of samples is often constrained by the available GPU memory.

3 Method Extensions

3.1 Surface-based Registration

In various instances, anatomical segmentation maps for specific structures of interest may also be available with some of the training images. Recent papers have demonstrated that the use of segmentations can help in registration [10, 34]. Here, we show that our proposed model naturally extends to handle surfaces, enabling the use of segmentations during training within the same principled framework.

We focus on the case where one anatomical structure is segmented in the image. Given a segmentation map where each voxel is assigned the desired anatomical label or background, we extract the anatomical surface and let $\boldsymbol{s}_{\boldsymbol{f}}$ represent the $N$ surface coordinates of the anatomical structure for image $\boldsymbol{f}$ , which can be stored as an $N\times 3$ matrix. Given the diffeomorphism $\boldsymbol{\phi}_{z}$ in the previous section, we model each surface location $\boldsymbol{s}_{f}[n]$ , as formed by displacing a matching surface location $\boldsymbol{s}_{m}[n]$ according to $\boldsymbol{\phi}_{z}$ , and adding (spatial) displacement noise:

[TABLE]

where the composition $\boldsymbol{s}_{\boldsymbol{m}}\circ\boldsymbol{\phi}_{\boldsymbol{z}}$ warps surface coordinates.

Given both images and segmentation maps during training, we extract surfaces of the desired structure and aim to estimate the posterior probability $p(\boldsymbol{z}|\boldsymbol{f},\boldsymbol{s}_{\boldsymbol{f}};\boldsymbol{m},\boldsymbol{s}_{\boldsymbol{m}})$ . As before, we use a variational approximation. Since segmentation maps, and hence surfaces, are usually derived from images, we assume that images are sufficient to approximate the posterior: $q(\boldsymbol{z}|\boldsymbol{f},\boldsymbol{s}_{\boldsymbol{f}};\boldsymbol{m},\boldsymbol{s}_{\boldsymbol{m}})=q_{\boldsymbol{\psi}}(\boldsymbol{z}|\boldsymbol{f};\boldsymbol{m})$ . As before, we minimize the KL divergence between the true and approximate posterior (derived in supplementary material):

[TABLE]

and arrive at the loss function:

[TABLE]

Compared to the original model loss (6), the additional third term encourages the deformation $\boldsymbol{\phi}_{z_{k}}$ to warp the moving surface close to the fixed surface $\boldsymbol{s}_{f}$ . As described in the generative model (8), this requires corresponding surface points in $\boldsymbol{s}_{\boldsymbol{f}}$ and $\boldsymbol{s}_{\boldsymbol{m}}$ . However, these correspondences are not available in practice, as segmentations are provided independently for each image. Therefore, the third term cannot be computed directly.

We propose an approximation of the surface term using surface distance transforms. Let $\text{sd}(x,\boldsymbol{s})$ be a surface distance function, which for location $x$ returns the Euclidean distance to the closest surface point in $\boldsymbol{s}$ (Figure 3Left).111Function $\text{sd}(x,\boldsymbol{s})$ is a generating function for a distance transform image for the surface $\boldsymbol{s}$ , by evaluating it at every grid point $x$ Noting that for two surfaces $\boldsymbol{a}$ and $\boldsymbol{b}$ , $\sum_{n}\text{sd}(\boldsymbol{a}[n],\boldsymbol{b})\neq\sum_{n}\text{sd}(\boldsymbol{b}[n],\boldsymbol{a})$ due to potential asymmetries in the surfaces (see Figure 3Right), we approximate the distance $\|\boldsymbol{s}_{\boldsymbol{f}}-\boldsymbol{s}_{\boldsymbol{m}}\circ\boldsymbol{\phi}_{z_{k}}\|^{2}$ by computing $\text{sd}(\cdot,\cdot)$ in both directions:

[TABLE]

We implement this function efficiently using distance transforms. Specifically, to compute $\text{sd}(\boldsymbol{s}_{\boldsymbol{m}}[n]\circ\boldsymbol{\phi}_{z},\boldsymbol{s}_{\boldsymbol{f}})$ , we first pre-compute distance transforms for the (fixed) given structure $\boldsymbol{s}_{f}$ . We then sample $100,000$ points along $\boldsymbol{s}_{m}$ , which we find to be sufficient to estimate accurate measures along the surface. We warp (move) them according to the deformation $\boldsymbol{\phi}_{z}$ , and compute the distance transform of $\boldsymbol{s}_{f}$ at these locations. We take advantage of our diffeomorphic representation that enables computing the inverse $\boldsymbol{\phi}_{z}^{-1}$ efficiently within the network to similarly compute $\text{sd}(\boldsymbol{s}_{\boldsymbol{f}}[n]\circ\boldsymbol{\phi}_{z}^{-1},\boldsymbol{s}_{\boldsymbol{m}})$ .

In summary, since to compute the posterior approximation $q_{\boldsymbol{\psi}}(\boldsymbol{z}|\boldsymbol{f};\boldsymbol{m})$ the neural network takes as input only the images $\boldsymbol{f}$ and $\boldsymbol{m}$ , images alone are required at test time. Given a diffeomorphism $\boldsymbol{\phi}_{z_{k}}$ , at training time the network uses both a warped image and a warped surface to evaluate the quality of the registration.

This model can also be used to register two surfaces when the images themselves are not available. The only modelling change required is removing the image likelihood terms and using the variational approximation $q_{\boldsymbol{\psi}}(\boldsymbol{z}|\boldsymbol{S}_{\boldsymbol{f}};\boldsymbol{S}_{\boldsymbol{m}})$ , which uses the segmentation maps $\boldsymbol{S}_{m}$ and $\boldsymbol{S}_{f}$ as input. Surface-only registration is beyond the scope of this paper, and we leave it for future work. However, registration with images and surfaces is described here as an example of possible extensions of the model, and surface-only registration is beyond the scope of this paper.

The complete neural network framework, including the surface loss, is illustrated in supplemental Figure 12.

3.2 Non-diagonal Covariance

Approximating the velocity field covariance $\boldsymbol{\Sigma}_{z|m,f}$ using a diagonal matrix is a strong assumption that ignores spatial smoothness. As seen in (6), the spatially-smooth prior $p(\boldsymbol{z})$ encourages a smooth mean velocity field $\boldsymbol{\mu}_{z|m,f}$ , but samples $\boldsymbol{z}_{k}\sim\mathcal{N}(\boldsymbol{\mu}_{z|m,f},\boldsymbol{\Sigma}_{z|m,f})$ might still be noisy. In this section, we investigate the effects of this restriction, by providing a model expansion that computes a less restrictive covariance. In our experiments below, we analyze the effects of these different approximations.

To evaluate the effects of the diagonal covariance, we explore a second approximation $\boldsymbol{\Sigma}_{z|m,f}=\boldsymbol{C}_{\sigma_{c}}\boldsymbol{G}\boldsymbol{G}^{T}\boldsymbol{C}_{\sigma_{c}}^{T}$ where $\boldsymbol{G}$ is a diagonal matrix returned by the neural network and $\boldsymbol{C}_{\sigma_{c}}$ is a fixed smoothing convolution matrix. Specifically, for each row of $\boldsymbol{C}_{\sigma_{c}}$ we create a flattened Gaussian smoothing kernel centered at a particular voxel, such that $\boldsymbol{C}_{\sigma_{c}}\boldsymbol{w}$ is equivalent to 3D convolution of image $\boldsymbol{w}$ by a gaussian filter with variance $\sigma_{c}^{2}$ . We choose $\sigma_{c}$ such that the smoothing operation matches the scale of the prior $p(\boldsymbol{z})$ determined by $\lambda$ : $\frac{1}{\sqrt{2\pi\sigma_{c}^{3/2}}}=(\lambda*6)^{-1}$ .

During training, sampling from the posterior is achieved using the reparametrization trick: $\boldsymbol{z}_{k}=\boldsymbol{\mu}_{z|m,f}+\boldsymbol{C}_{\sigma_{c}}\boldsymbol{D}\boldsymbol{r}$ , where $\boldsymbol{r}$ is a sample from the standard normal. Intuitively, compared to the diagonal $\boldsymbol{\Sigma}_{z|m,f}$ approximation, this sampling procedure smoothes the term $\boldsymbol{D}\boldsymbol{r}$ before adding the mean $\boldsymbol{\mu}_{z|m,f}$ .

In our experiments, we show that this approximation yields smoother velocity fields during training, and the effect diminishes with higher $\lambda$ values. However, the resulting deformation fields are diffeomorphic and accurate for both approximations, demonstrating that the diagonal covariance approximation is sufficient when working with diffeomorphisms.

4 Experiments

We perform a series of experiments demonstrating that the proposed probabilistic image registration framework achieves accuracy and runtime comparable to state-of-the-art methods while enabling diffeomorphic deformations. We also show the improvements enabled by the extended surface model, and analyze the effect of the various integration layers during test time.

We focus on atlas-based registration, a common task in population analysis. Specifically, we register each scan to an atlas computed using external data [27, 65]. Because we implement our algorithm as part of the VoxelMorph framework, we will refer to it as VoxelMorph-diff.

4.1 Experiment setup

4.1.1 Data and Preprocessing

We use a large-scale, multi-site dataset of 3731 T1-weighted brain MRI scans from eight publicly available datasets: OASIS [45], ABIDE [24], ADHD200 [48], MCIC [29], PPMI [46], HABS [17], and Harvard GSP [33]. Acquisition details, subject age ranges and health conditions are different for each dataset. We performed standard pre-processing steps on all scans, including resampling to $1$ mm isotropic voxels, affine spatial normalization and brain extraction for each scan using FreeSurfer [27]. We crop the final images to $160\times 192\times 224$ . Segmentation maps including 29 anatomical structures, obtained using FreeSurfer for each scan, are used in evaluating registration results. Each image contains roughly $\sim 1.6$ million brain voxels. We split the dataset into 3231, 250, and 250 volumes for train, validation, and test sets respectively, although we underscore that the training is unsupervised.

4.1.2 Evaluation Metrics

To evaluate a registration algorithm, we register each subject to an atlas, propagate the segmentation map using the resulting warp, and measure volume overlap using the Dice metric. For the surface experiments, we also employ the Euclidean surface distance, computed using the strategy described in (11).

We also evaluate the diffeomorphic property, a focus of our work. Specifically, the Jacobian matrix $J_{\phi}(p)=\nabla\boldsymbol{\phi}(p)\in\mathcal{R}^{3\times 3}$ captures the local properties of $\boldsymbol{\phi}$ around voxel $p$ . The local deformation is diffeomorphic, both invertible and orientation-preserving, only at locations for which $|J_{\phi}(p)|>0$ , where $|\cdot|$ is the determinant operator [5]. We count all other (folding) voxels, where $|J_{\phi}(p)|\leq 0$ .

4.1.3 Baseline Methods

We compare our approach with the popular ANTs software package using Symmetric Normalization (SyN) [7], a top-performing algorithm [40]. We found that the default ANTs settings are sub-optimal for our task, and performed a wide parameter and similarity metric search across several datasets. We used the default geodesic implementation of SyN, which is most faithful to theoretical diffeomorphic development. Other versions, such as greedy SyN, would yield a slightly faster runtime, while giving less diffeomorphic deformations. We identified and use top performing parameter values for the Dice metric using: the cross-correlation (CC) loss function, SyN step size of 0.25, Gaussian smoothing of (9, 0.2) and three scales of 201 iterations. We also test the NiftyReg package, for which we use a multi-threaded CPU implementation as a GPU implementation is not currently available.222We compiled the latest source code from March, 2018 (tree [4e4525]). We experimented with different parameter settings for improved behavior, and used the following setting: CC cost function, grid spacing of 5, and 500 iterations.

To compare with recent learning-based registration approaches, we also test our recent CNN-based method, VoxelMorph, which produces state-of-the-art fast and accurate registration, but does not yield diffeomorphic results [9, 10]. We sweep the regularization parameter using our validation set, and use the optimal regularization parameter of 1 in our results.

We also compute a supervised baseline by training a VoxelMorph-diff network using ground truth deformations. We build a ground truth dataset by registering over 650 atlas-MRI subject training pairs using NiftyReg with the described settings. We then train a neural network to predict the resulting deformation fields using a mean squared error (MSE) loss. We explored several variants, and found that doubling the model capacity by doubling the number of features at each layer, as well as penalizing the deformations fields only within the proximity of the atlas brain, yielded optimal results. To enable direct comparison, we used the VoxelMorph-diff architecture, but without sampling of the velocity field.

4.2 Image Registration

Table 1 provides a summary of the results on the held-out test set. Figure 5 and supplementary material Figure 13 show representative results. Figure 4 illustrates Dice results on several anatomical structures. For better visualization, we combine the same structures from the two hemispheres, such as the left and right hippocampus. Our algorithm, VoxelMorph-diff, achieve state-of-the-art Dice results and runtimes, but produces diffeomorphic registration fields (nearly no folding voxels per scan) in a probabilistic framework.

All methods achieve comparable Dice results on each structure and overall, except the supervised method. Despite training the latter on 650 subjects, we found that the supervised network leads to more diffeomorphic deformations than the training deformations, but results in a slight loss in Dice score. Learning-based methods require a fraction of the baseline runtimes to register two images: less than a second on a GPU, and less than a minute and a half on a CPU. Runtimes were computed for an NVIDIA TitanX GPU and a Intel Xeon (E5-2680) CPU, and exclude preprocessing common to all methods.

Our method outputs positive Jacobians at nearly all voxels, which we analyze in more detail in a later section. For VoxelMorph-diff, we find that for most scans, the deformation fields result in zero folding voxels. Very few volumes lead to a few or tens of grouped folding voxels, leading to a population average of less than a folding voxel per test scan. In contrast, the deformation fields resulting from the baseline methods contain a few thousand locations of non-positive Jacobians for each scan (Table 1), usually grouped in clusters. This may be alleviated with increased spatial regularization or more optimization iterations, but this in turn leads to a drop in performance on the Dice metric or even longer runtimes. The table also shows that, at the presented settings, all methods result in an average Jacobian determinant close to $1$ , with VoxelMorph-diff yielding smoothness statistics nearly identical to those given by NiftyReg, indicate smooth deformations.

4.3 Image and Surface Registration

In this section, we evaluate the generative surface model. We demonstrate the use of anatomical segmentation alongside images during training, and refer to this model as VoxelMorph-Surf. We focus on the setting where one structure of interest is available during training, and learn separate networks for the left white matter, gray matter, ventricle, thalamus and hippocampus. Our goal is to analyze how the additional surface model terms affect the accuracy and regularity of resulting deformations.

Figure 7 illustrates the behaviour of the model with respect to the hyper-parameter $\sigma_{s}$ on the validation set. For a range of $\sigma_{s}$ values, we find a significant improvement in terms of Dice for the desired structure. For very small values of $\sigma_{s}$ , the training becomes unstable leading to poor generalization. A very large $\sigma_{s}$ value leads to the model ignoring the surface term. Since the Dice scores are comparable in the range $\sigma_{c}\in[0.5,2]$ , for the rest of this section we use $\sigma_{s}=2$ , which exhibits slightly fewer folding voxels ( $\leq 5$ compared to $\sim 20$ for $\sigma_{c}=1$ ).

Figure 6 demonstrates the improvement on the test set in terms of Euclidean surface distance and Dice, compared to the image-only registration model VoxelMorph-diff. VoxelMorph-surf improves significantly in all measures for most desired structures. Additionally, Table 2 illustrates that with increased accuracy in both metrics, the number of folding voxels in the entire volume increases only very slightly (to an average 3.5 voxels per volume), which remains orders of magnitude fewer than the baseline methods (Table 1). Figure 14 in the supplementary material illustrates example results.

In summary, the principled joint diffeomorphic model enables the use of surfaces during training which dramatically improves registration near a given structure while preserving desired deformation properties. For example, given hippocampus surfaces at training, registration using VoxelMorph-Surf improves Dice by $\sim 9$ points over VoxelMorph-diff, improves maximum surface distance by more than three voxels, and preserving diffeomorphisms (less than three folding voxels per scan).

4.4 Analysis

4.4.1 Parameter Analysis

The two main hyper-parameters, smoothing precision $\lambda$ and image noise $\sigma_{I}^{2}$ , have physical meaning in our generative model. However, they share a single degree of freedom in the loss function. We set $\sigma_{I}^{2}=0.02$ , and vary the precision scale $\lambda$ between 0.5 and 100. Figure 9 shows average Dice scores for 50 validation set scans for different parameter values, showing that the results vary smoothly over a large range, with reasonable behavior even near $\lambda\sim 0$ . We use $\lambda=20$ in our experiments above.

4.4.2 C-shape Registration

We also perform analysis on controlled experiments with C-shape synthetic images with intensities in $[0,1]$ . Specifically, we train a VoxelMorph-diff network to learn to register a disk with a radius ranging from one third to one fifth of the image, to a C-shape with variable radius and thickness. The outer radius of the C shape is sampled uniformly in the range $[1/3.5,1/2.5]$ of the image size, whereas the inner radius is in the range $[1/6.5,1/5.5]$ . We increase hyper-parameter $\sigma_{s}=0.06$ to account for the increase in maximum intensity. Figure 8 illustrates representative images and deformation fields. To obtain the fields at intermediate time points between 0 and 1, we employ Tensorflow ODE solver. We find that all the deformation fields lead to accurate registration between disks and C shapes, and have no folding voxels. We also find that the deformation fields are invertible, bringing the grid back to identity when the transforms are composed.

4.4.3 Integration Steps

During training, we hold the number of scaling and squaring steps fixed. However, this number can be varied at test time, affecting aspects of the resulting deformation field. In this section, we analyze the effects of the number of steps on accuracy, runtime, field regularity and invertability. We perform this experiment using 50 validation subjects and the image registration network VoxelMorph-diff trained with $T=7$ integration steps and regularization parameter $\lambda=20$ . The velocity field is computed every two voxels, but all of the conclusions in this section are likely to apply to many reasonable field spacings.

Figure 10 summarizes the analysis results. The runtime increases modestly with the number of steps, and is overall significantly smaller than the cost of the rest of the network (i.e. the deformation network computation of the velocity field, and the spatial transform of the full moving image). After four scaling and squaring steps, the method achieved maximum Dice score. We observe a steep decline in the number of folding voxels (note the log-scale vertical axis), reaching less than five voxels after five scaling and squaring steps, compared to classical methods which can include thousands of such voxels (Table 1). Finally, we measure the average displacement error after inverting the deformation fields: $\Delta\boldsymbol{u}=|Id-\boldsymbol{\phi}\circ\boldsymbol{\phi}^{-1}|$ . We find that after five scaling and squaring steps, even the worst error is under a half voxel, indicating that five steps are sufficient to ensure invertible deformations.

In addition, we implemented the integration of the velocity field using Tensorflow ODE solvers and using standard quadrature, but found that these required significantly more runtime compared to the scaling and squaring strategy, consistent with literature findings [4, 5, 50]. Specifically, while five scaling and squaring operations required $0.06\pm 0.01$ seconds, equivalent quadrature integration required $64$ operations (occupying prohibitive amounts of memory) and $0.53\pm 0.01$ seconds, and ODE-solver based integration with default parameter required a single layer and $2.9\pm 0.1$ seconds. At comparable integration settings such as these, all three methods achieve similar Dice scores of $0.75\pm 0.01$ . While these alternative methods require significant resources, all three implementations are available in our source code for experimentation.

This analysis indicates that the proposed scaling and squaring network integration layer is efficient and accurate. Increasing the number of scaling and squaring layers incurs a negligible runtime cost while improving deformation field properties. We use $T=7$ squaring steps in the test experiments above.

4.4.4 Velocity Sampling and Uncertainty

We also evaluate the modeling assumptions of the variational covariance $\boldsymbol{\Sigma}_{z|m,f}$ . Figure 11 illustrates example samples of the velocity field $\boldsymbol{z}_{k}$ and voxel-wise empirical variance for the two $\boldsymbol{\Sigma}_{z|m,f}$ approximations: diagonal covariance and the extended approximation in Section 3.2 that smooths samples $\boldsymbol{z}_{k}$ . For under-regularized networks (very low values for hyper-parameter $\lambda$ ), the latter approximation yields smoother velocity fields. However, given a higher hyper-parameter $\lambda$ value, such as the one used in our experiments, the network learns smaller values for the diagonal $\boldsymbol{\Sigma}_{z|m,f}$ approximation, and yields smooth samples $\boldsymbol{z}_{k}$ with either method. Futhermore, despite the difference in smoothness of the velocity field samples $\boldsymbol{z}_{k}$ , the integration operation leads to equally regular and accurate deformations $\boldsymbol{\phi}_{z_{k}}$ for a given $\lambda$ (Table 3).

Therefore, although the diagonal covariance has the potential to add noise to velocity field samples, the loss function coupled with the integration operation lead to smooth and accurate deformation fields $\boldsymbol{\phi}_{z}$ at reasonable $\lambda$ values. Therefore, in the current setting, the diagonal and non-diagonal covariances give similar results. Nonetheless, in other applications the non-diagonal covariance might be important. For example, diagonal covariances would likely have negative effects in a different deformation model, for instance if $z$ was modelled as the displacement field itself.

5 Discussion and Conclusion

In this work, we build a principled connection between classical registration methods and recent learning-based approaches. We propose a probabilistic model for diffeomorphic image registration and derive a learning algorithm that leverages a convolutional neural network and unsupervised, end-to-end learning for fast runtime. To achieve diffeomorphic transforms, we integrate stationary velocity fields through novel scaling and squaring differentiable network operations, and provide implementation and analysis for other integration layers.

Although the simplifying diagonal approximation to the velocity covariance $\boldsymbol{\Sigma}_{z|m,f}$ adds voxel-independent noise to every velocity field sample $\boldsymbol{z}_{k}$ , the resulting deformation fields are well behaved because of our smoothing prior and diffeomorphic representation.

We also provide an anatomical surface deformation model. If image segmentations are available for a particular anatomical structure, the generative model incorporates them naturally in the same joint framework during training, while not requiring the surfaces at test time.

Our algorithm can infer the registration of new image pairs in under a second. Compared to traditional methods, our approach is significantly faster, and compared to recent learning based methods, our method offers diffeomorphic guarantees. We demonstrate that the surface extension to our model can help improve registration while preserving properties such as low runtime and diffeomorphisms.

Furthermore, several conclusions shown in recent papers apply to our method. For example, when only given very limited training data, deformation from VoxelMorph can still be used as initialization to a classical method, enabling faster convergence (Balakrishnan et al, 2019).

Our focus in this framework has been to present the technical connection between classical and learning paradigms, and show that diffeomorphisms are attainable in a very low runtime. Immediate extensions can enable other models and applications. For example, our derivation is generalizable to other formulations: $\boldsymbol{z}$ can be a low dimensional embedding representation of a deformation field, or the displacement field itself. Similarly, the variational covariance $\boldsymbol{\Sigma}_{z|m,f}$ enables an estimation of the uncertainty of the deformation field at each voxel, which can be informative in downstream tasks such as biomedical segmentation or population analysis. The model is also widely applicable to other applications, such as subject-to-subject registration, segmentation-only registration, or using multiple surfaces to improve image-based registration.

6 Acknowledgments

This research was funded by NIH grants R01LM012719, R01AG053949, and 1R21AG050122, NSF CAREER 1748377, NSF NeuroNex Grant 1707312, and Wistron Corporation. We thank John Ashburner for sharing code to simulate images for the C-shape controlled experiments.

Supplementary Material

Derivation of Main Loss

[TABLE]

Using the facts that $\log|\boldsymbol{\Lambda}_{z}|$ is constant, $\log|\boldsymbol{\Sigma}_{z|m,f}|=\text{tr}\log\boldsymbol{\Sigma}_{z|m,f}$ , and $\text{tr}(\boldsymbol{\Lambda}_{z}\boldsymbol{\Sigma}_{z|m,f})=\text{tr}((\lambda\boldsymbol{D}-\boldsymbol{A})\boldsymbol{\Sigma}_{z|m,f})=\text{tr}(\lambda\boldsymbol{D}\boldsymbol{\Sigma}_{z|m,f})$ , and approximating the expectation with $K$ samples $z_{k}\sim q_{z}$ , we obtain

[TABLE]

Derivation of Surface VLB and Loss

We derive the variational lower bound and loss for the generative surface model. We start by minimizing the KL divergence between the true and approximate posteriors:

[TABLE]

where in $\star$ we used the assumptions that the fixed image is independent of anatomical surfaces given the moving image and the deformation, and the fixed surface is independent of either image given the moving surface and the deformation. Following this variational lower bound, the loss follows the previous section closely (see (13)), with the additional term:

[TABLE]

Combining this term with (12), and approximating expectations with $k$ samples leads to the final loss (10):

[TABLE]

Overview figure with surface loss

Additional Figures

Bibliography70

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. ar Xiv preprint ar Xiv:1603.04467 , 2016.
2[2] Dror Aiger, Niloy J Mitra, and Daniel Cohen-Or. 4-points congruent sets for robust pairwise surface registration. In ACM transactions on graphics (TOG) , volume 27, page 85. Acm, 2008.
3[3] A Amir-Khalili, G Hamarneh, R Zakariaee, I Spadinger, and R Abugharbieh. Propagation of registration uncertainty during multi-fraction cervical cancer brachytherapy. Physics in Medicine & Biology , 62(20):8116, 2017.
4[4] Vincent Arsigny, Olivier Commowick, Xavier Pennec, and Nicholas Ayache. A log-euclidean framework for statistics on diffeomorphisms. In International Conference on Medical Image Computing and Computer-Assisted Intervention , pages 924–931. Springer, 2006.
5[5] J. Ashburner. A fast diffeomorphic image registration algorithm. Neuroimage , 38(1):95–113, 2007.
6[6] J. Ashburner and K. Friston. Voxel-based morphometry-the methods. Neuroimage , 11:805–821, 2000.
7[7] Brian B Avants, Charles L Epstein, Murray Grossman, and James C Gee. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis , 12(1):26–41, 2008.
8[8] R. Bajcsy and S. Kovacic. Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing , 46:1–21, 1989.