Leveraging SO(3)-steerable convolutions for pose-robust semantic segmentation in 3D medical data

Ivan Diaz; Mario Geiger; Richard Iain McKinley

arXiv:2303.00351·eess.IV·May 20, 2024

TL;DR

This paper introduces a new family of 3D segmentation networks using SO(3)-steerable convolutions based on spherical harmonics, enhancing robustness to unseen poses and improving efficiency in medical imaging tasks.

Contribution

It presents a novel segmentation network architecture employing SO(3)-steerable convolutions that do not require rotation augmentation, improving robustness and efficiency in 3D medical image segmentation.

Findings

01

Enhanced robustness to unseen data poses.

02

Reduced need for data augmentation during training.

03

Improved segmentation accuracy and parameter efficiency.

Abstract

Convolutional neural networks (CNNs) allow for parameter sharing and translational equivariance by using convolutional kernels in their linear layers. By restricting these kernels to be SO(3)-steerable, CNNs can further improve parameter sharing. These rotationally-equivariant convolutional layers have several advantages over standard convolutional layers, including increased robustness to unseen poses, smaller network size, and improved sample efficiency. Despite this, most segmentation networks used in medical image analysis continue to rely on standard convolutional kernels. In this paper, we present a new family of segmentation networks that use equivariant voxel convolutions based on spherical harmonics. These networks are robust to data poses not seen during training, and do not require rotation-based data augmentation during training. In addition, we demonstrate improved…

Tables (3)

Table 1: An example to illustrate the notation of Equation 2, where $F_{i}$ are the input channels and $F^{\prime}_{j}$ are the output channels. Each channel is an irrep-field. In the example shown here, the input field $F_{3}$ is a vector field (because $l_{3}=1$); it is therefore a function $\mathbb{R}^{3}\longrightarrow\mathbb{R}^{3}$. Similarly, $F_{4}$ is a function $\mathbb{R}^{3}\longrightarrow\mathbb{R}^{5}$.

Input features	Output features
$F_{1}$ with $l_{1} = 0$	$F_{1}^{'}$ with $l_{1} = 0$
$F_{2}$ with $l_{2} = 0$	$F_{2}^{'}$ with $l_{2} = 1$
$F_{3}$ with $l_{3} = 1$	$F_{3}^{'}$ with $l_{3} = 2$
$F_{4}$ with $l_{4} = 2$	$F_{4}^{'}$ with $l_{4} = 2$
	$F_{5}^{'}$ with $l_{5} = 2$

Table 2: Examples of reduced tensor products for the group SO(3). Some of these, in the context of the convolution, can be related to differential operators. Note, however, that differential operators are local while the convolution is non-local.

$\underset{0 \times 0 \to 0}{\otimes}$	the normal multiplication of scalars.
$\underset{0 \times 1 \to 1}{\otimes}$	scalar times vector, same signature as the gradient $\nabla f$.
$\underset{1 \times 0 \to 1}{\otimes}$	vector times scalar.
$\underset{1 \times 1 \to 0}{\otimes}$	dot product of vectors, same signature as the divergence $\nabla \cdot \vec{f}$.
$\underset{1 \times 1 \to 1}{\otimes}$	cross product of vectors, same signature as the curl $\nabla \times \vec{f}$.

Table 3: Dice score on the test set for the brain tumor segmentation task. nnUnet (da) denotes the non-equivariant reference network trained with data augmentation.

Model	Enhancing Tumor	Tumor Core	Whole Tumor
e3nn	$0.85 \pm 0.12$	$0.88 \pm 0.07$	$0.92 \pm 0.06$
nnUnet	$0.78 \pm 0.19$	$0.84 \pm 0.08$	$0.90 \pm 0.06$
nnUnet (da)	$0.76 \pm 0.02$	$0.83 \pm 0.07$	$0.90 \pm 0.06$

Equations

$$8.433573\,\operatorname{sus}(x+1)\,\operatorname{sus}(1-x) \qquad (1)$$

$$\operatorname{sus}(x)=\begin{cases}e^{-1/x} & x>0\\ 0 & x\leq 0\end{cases}$$

$$F^{\prime}_{j}(x)=\sum_{i\times l\to j}\int da\; F_{i}(x+a)\underset{l_{i}\times l\to l_{j}}{\otimes}Y^{l}\!\left(\frac{a}{\lVert a\rVert}\right)\sum_{k}b_{k}(\lVert a\rVert)\,w(k,i\times l\to j) \qquad (2)$$

$$\lvert l_{i}-l_{j}\rvert\leq l\leq l_{i}+l_{j} \qquad (3)$$

Code & Models

Repositories

scan-nrad/e3nn_unet (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Medical Image Segmentation Techniques · Radiomics and Machine Learning in Medical Imaging

Full text

An end-to-end SE(3)-equivariant segmentation network

Ivan Diaz
Support Center for Advanced Neuroimaging (SCAN), University Institute of Diagnostic and Interventional Neuroradiology, University of Bern, Inselspital, Bern University Hospital, Bern, Switzerland

Mario Geiger
Department of Research Laboratory of Electronics, Massachusetts Institute of Technology, Boston MA

Richard Iain McKinley
Support Center for Advanced Neuroimaging (SCAN), University Institute of Diagnostic and Interventional Neuroradiology, University of Bern, Inselspital, Bern University Hospital, Bern, Switzerland

Abstract

Convolutional neural networks (CNNs) allow for parameter sharing and translational equivariance by using convolutional kernels in their linear layers. By restricting these kernels to be SO(3)-steerable, CNNs can further improve parameter sharing and equivariance. These equivariant convolutional layers have several advantages over standard convolutional layers, including increased robustness to unseen poses, smaller network size, and improved sample efficiency. Despite this, most segmentation networks used in medical image analysis continue to rely on standard convolutional kernels. In this paper, we present a new family of segmentation networks that use equivariant voxel convolutions based on spherical harmonics, as well as equivariant pooling and normalization operations. These SE(3)-equivariant volumetric segmentation networks, which are robust to data poses not seen during training, do not require rotation-based data augmentation during training. In addition, we demonstrate improved segmentation performance in MRI brain tumor and healthy brain structure segmentation tasks, with enhanced robustness to reduced amounts of training data and improved parameter efficiency.

Code to reproduce our results, and to implement the equivariant segmentation networks for other tasks is available at http://github.com/SCAN-NRAD/e3nn_Unet.

Keywords: Image Segmentation, Rotation Equivariance, MRI, U-Net

1 Introduction

A symmetry of an object is a transformation of that object which leaves certain properties of that object unchanged. In the context of medical image segmentation, there are a number of obvious symmetries which apply to volumetric images and their voxel-level labels: namely translation, rotation, and (depending on the labels used) reflection across the body’s left-right axis of symmetry. In most cases patients are placed in an expected orientation within the scanner (with fetal imaging being a notable exception to this assumption), and deviations from the mean patient placement are usually moderate (typically up to 20 degrees). Nonetheless, given a small data set, the patient orientations seen may not be representative of the full range of poses seen in clinical practice.

An equivariant function is one where symmetries applied to an input lead to corresponding transformations of the output. The most prominent example of equivariance in deep learning is the translation-equivariance of the convolution operation. Equivariance should be contrasted to mere invariance, where a symmetry applied to an input leads to no change in a function’s output. The output of a segmentation model should not be invariant to symmetries of its input, but rather equivariant. Equivariance enables increased parameter sharing and enforces strong priors which can prevent overfitting and improve sample efficiency.

There have been numerous attempts to define convolutional feature extractors equivariant to rotational (and reflection) symmetry in three dimensional space. Since voxelized data (in contrast to point cloud data) only admits rotations through 90 degrees, an obvious place to start is the symmetries of the cube. Group equivariant convolutional networks (G-CNNs) (Cohen and Welling, 2016), in the context of 3D imaging, operate by applying transformed versions of a kernel according to a finite symmetry group $\mathcal{G}$ . This gives rise to an extra fiber/channel dimension with size $|\mathcal{G}|$ (24 in total if only considering orientation-preserving symmetries of the cube, or 48 if considering all symmetries), which permute under symmetries of the input. This results in an explosion in the number of convolutional operations and in the dimension of feature maps. G-pooling can be used to combat this explosion, by selecting the fiber channel which maximizes activation at each voxel. This reduces memory usage but comes at the cost of reducing the expressivity of the layer, potentially impacting performance (Cohen and Welling, 2016).
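The group sizes quoted above (24 orientation-preserving symmetries of the cube, 48 in total) can be checked by direct enumeration. The following is a minimal sketch, assuming the symmetries are represented as 3x3 signed permutation matrices; the function name `cube_symmetries` is illustrative and is not part of any G-CNN library.

```python
import itertools
import numpy as np

def cube_symmetries():
    """Enumerate all 3x3 signed permutation matrices (the symmetries of the cube)."""
    mats = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1, -1), repeat=3):
            m = np.zeros((3, 3), dtype=int)
            for row, (col, sign) in enumerate(zip(perm, signs)):
                m[row, col] = sign
            mats.append(m)
    return mats

all_syms = cube_symmetries()
# Orientation-preserving symmetries (rotations) have determinant +1.
rotations = [m for m in all_syms if round(np.linalg.det(m)) == 1]

print(len(all_syms), len(rotations))  # 48 24
```

Each of these matrices corresponds to one transformed copy of the kernel in a G-CNN layer, which is where the 24-fold or 48-fold growth of the fiber dimension comes from.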

Steerable convolutions with full rotational equivariance to infinite symmetry groups in three dimensions were first developed for point cloud data (Thomas et al., 2018), and have subsequently been adapted to operate voxel convolutions on data lying in regular 3D grids (Weiler et al., 2018). These convolutional layers have the benefit, over G-CNN layers, of being equivariant to any 3D rotation, rather than a discrete group of rotations: in particular, the rotations likely to arise as a result of patient placement in a scanner. They are also more efficient in terms of convolution operations and memory usage. The e3nn (Geiger et al., 2020) PyTorch library provides a flexible framework for building SE(3) equivariant (translation, rotation) as well as E(3) (translation, rotation and reflection) networks for both point cloud and voxel data, by providing implementations of SO(3) and O(3) steerable kernels. (Here E(3) refers to the Euclidean group in 3 dimensions, SE(3) to the special Euclidean group in 3 dimensions, O(3) to the orthogonal group in 3 dimensions and SO(3) to the special orthogonal group in 3 dimensions.) These kernels operate on irreducible representations (irreps), which provide a general description of equivariant features: any finite representation which transforms according to the group action of SO(3)/O(3) can be expressed as a direct sum of irreps.

Methods based on steerable filters have long been used in biomedical image analysis, but learnable steerable filters have not received much attention, despite the promised benefits. This may be because of perceived computational overheads, or the lack of available code for building such networks. Our goal in this paper is to show that the benefits of equivariance, sample efficiency and parameter efficiency can be made available in biomedical image analysis without sacrificing performance. To this end, we propose equivariant maxpooling and normalization layers and use them to define a recreation of a standard 3D Unet architecture in which each layer, and therefore the whole network, is equivariant. Our primary hypothesis is as follows: end-to-end rotation equivariant networks provide robustness to data orientations unseen during training without loss of performance on in-sample test data, beyond the robustness gained by using rotational data augmentation (Mei et al., 2021). We further hypothesize that equivariant networks have better sample efficiency than traditional Unets.

2 Building an equivariant segmentation network

In this paper we focus on SE(3)-equivariance. It is easy to extend our work to E(3) but we leave this for future work.

The Unet architecture (Ronneberger et al., 2015) consists of an encoding path and decoding path with multiple levels: on each level there are multiple convolutions and nonlinearities, followed by either a pooling or upsampling operation. To achieve SE(3) equivariance in a neural network it is necessary that each of these operations be equivariant. We use the steerable 3D convolution and gated nonlinearity described in (Weiler et al., 2018) and implemented in e3nn (Geiger et al., 2020) as the basis of our equivariant Unet. Here we describe how each layer in the UNet has been modified to be equivariant and we explain the details necessary to understand the application to voxelized 3D medical imaging data.

2.1 Irreducible Representations

Typical convolutional neural networks produce scalar-valued output from scalar-valued features. A scalar field $f:\mathbb{R}^{3}\to\mathbb{R}$ transforms in a very simple way under rotations: the field at the location $x$ after application of a rotation $r$ is given by $f(r^{-1}x)$ . However, an equivariant network based purely on scalar fields would have rather minimal representative power, suffering from similar problems as a G-CNN with G-pooling at every layer. Concretely, such a network would be unable to detect oriented edges. Learning expressive functions requires learning more general features with a richer law of transformation under rotations. For example, a vector field assigns a value in $\mathbb{R}^{3}$ to each point of Euclidean space: one such example is the gradient $\nabla f$ of a scalar field $f$ . Such features are expressive enough to detect oriented edges; here the orientation is explicit (the orientation of the gradient field). Under a rotation $r$ , a vector field $f$ transforms not as $f(r^{-1}x)$ but as $rf(r^{-1}x)$ .
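The difference between the two transformation laws can be checked numerically. The sketch below, assuming an arbitrary quadratic scalar field chosen only for illustration (the matrix `A`, vector `b`, rotation angle and test point are all hypothetical), compares the gradient of the rotated scalar field $f(r^{-1}x)$ with the vector law $r\,\nabla f(r^{-1}x)$.

```python
import numpy as np

def rotation_z(theta):
    """Rotation by angle theta about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Illustrative scalar field f(x) = x^T A x + b . x (A symmetric).
A = np.array([[1.0, 0.5, 0.0], [0.5, 2.0, -1.0], [0.0, -1.0, 3.0]])
b = np.array([0.3, -0.7, 1.1])

def f(x):
    return x @ A @ x + b @ x

def grad_f(x):
    return 2.0 * A @ x + b  # analytic gradient, valid because A is symmetric

def numerical_grad(func, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    g = np.zeros(3)
    for k in range(3):
        step = np.zeros(3)
        step[k] = eps
        g[k] = (func(x + step) - func(x - step)) / (2 * eps)
    return g

r = rotation_z(0.7)
x = np.array([0.2, -1.3, 0.5])

def f_rot(y):
    return f(r.T @ y)  # rotated scalar field; r^{-1} = r^T for a rotation

lhs = numerical_grad(f_rot, x)   # gradient of the rotated scalar field
rhs = r @ grad_f(r.T @ x)        # vector-field law r f(r^{-1} x) applied to grad f

print(np.allclose(lhs, rhs, atol=1e-6))
```

The agreement of `lhs` and `rhs` illustrates that the gradient of a scalar field is a vector field in the sense used here: its values are rotated along with the sampling location.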

Scalars and vectors are two well-known representations of SO(3), but there are many others. All finite representations of SO(3) can be broken down into a combination of simpler, indivisible representations known as "irreducible representations", as described in (Weiler et al., 2018) and (Thomas et al., 2018). In SO(3), each irrep is indexed by a non-negative integer $l=0,1,2,\dots$ and has dimension $d=2l+1$ . A major contribution of Weiler et al. (2018) was the formulation and solution of a constraint on kernels between irreps of order $l$ and $l^{\prime}$ , giving rise to a basis of all such kernels: this basis is implemented in the e3nn library. Networks defined using the operations of e3nn can have features valued in any irreps. For our experiments we consider features valued in scalars ( $l=0$ ), vectors ( $l=1$ ) and rank-2 tensors ( $l=2$ ).

2.2 Equivariant voxel convolution

Each layer of an equivariant network formulated in e3nn takes as input a direct sum of irreps and returns a direct sum of irreps (in our case, of orders $l=0$ , 1 or 2); see Fig. 1.

An equivariant convolutional kernel basis is described in Weiler et al. (2018): the basis functions are given by tensor products $\phi(\lVert x\rVert)Y^{l}(x/\lVert x\rVert)$ . Here $\phi:\mathbb{R}^{+}\to\mathbb{R}$ is an arbitrary continuous radial function describing how the kernel varies as a function of distance from the origin. $Y^{l}$ is the spherical harmonic of order $l$ , determining how the kernel varies within an orbit of SO(3) (a sphere centred on the origin). To enable learning of parameters, we characterise the radial function as a sum of smooth basis elements, see Figure 2. The equation is given by:

$$8.433573\,\operatorname{sus}(x+1)\,\operatorname{sus}(1-x) \qquad (1)$$

with $sus$ (soft unit step) defined as follows:

$$\operatorname{sus}(x)=\begin{cases}e^{-1/x} & x>0\\ 0 & x\leq 0\end{cases}$$

Equation 1 is a $C^{\infty}$ function and is strictly zero for $x$ outside the interval $[-1,1]$ . The prefactor $8.433573$ ensures proper normalization of the neural network and was obtained empirically.
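As a minimal sketch of the radial profile just described, the soft unit step and the bump of Equation 1 can be written directly in Python. The function name `bump` is illustrative and is not taken from the e3nn code base.

```python
import math

def sus(x):
    """Soft unit step: exp(-1/x) for x > 0, and 0 otherwise."""
    return math.exp(-1.0 / x) if x > 0 else 0.0

def bump(x):
    """Smooth bump of Equation 1, supported on the interval [-1, 1]."""
    return 8.433573 * sus(x + 1.0) * sus(1.0 - x)

for x in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5):
    print(f"{x:+.1f} -> {bump(x):.6f}")
```

The function is exactly zero at and beyond $x=\pm 1$ and peaks at $x=0$ with value $8.433573\,e^{-2}\approx 1.1414$.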

The equivariant convolution is described by Equation 2. We introduce the terms in the equation in Table 1. $F$ and $F^{\prime}$ are the input and output fields. Each of them has an irrep that determines its dimension and how it transforms under rotation. To calculate the output channel $F^{\prime}_{j}(x)$ we sum over the contributions from the input channels. Each contribution is characterized by an input channel index $i$ , an input irrep $l_{i}$ , an output channel index $j$ and its irrep $l_{j}$ and a spherical harmonic order $l$ satisfying the selection rule (Equation 3).

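$$F^{\prime}_{j}(x)=\sum_{i\times l\to j}\int da\; F_{i}(x+a)\underset{l_{i}\times l\to l_{j}}{\otimes}Y^{l}\!\left(\frac{a}{\lVert a\rVert}\right)\sum_{k}b_{k}(\lVert a\rVert)\,w(k,i\times l\to j) \qquad (2)$$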

Each incoming channel $F_{i}$ , and outgoing channel $F^{\prime}_{j}$ , has a specified irrep. In this notation, $\{{i\times l}\xrightarrow{}j\}$ denotes a "path" from an input channel $i$ to an output channel $j$ via a spherical harmonic of order $l$ . The orders $l$ allowed by the selection rule of the group SO(3) are the non-negative integers satisfying

$$\lvert l_{i}-l_{j}\rvert\leq l\leq l_{i}+l_{j} \qquad (3)$$

where $l_{i}$ , $l_{j}$ are the irreps of the input and output channels. These are the allowed "paths" between the input and output: all the ways in which a feature of irrep $l_{i}$ can yield a feature of irrep $l_{j}$ respecting SO(3) equivariance. The notation $\underset{l_{1}\times l_{2}\xrightarrow{}l_{3}}{\otimes}$ denotes the tensor product of irrep $l_{1}$ times irrep $l_{2}$ reduced into irrep $l_{3}$ : this is unique for the group $SO(3)$ (contrary to $SU(3)$ , the special unitary group of degree 3, for instance). Examples are listed in Table 2.
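To make the selection rule concrete, the following sketch enumerates the paths $i\times l\to j$ for the channel layout of Table 1. The helper names are illustrative, and parity is ignored, as is appropriate for SO(3) only.

```python
def allowed_orders(l_in, l_out):
    """Spherical harmonic orders l with |l_in - l_out| <= l <= l_in + l_out."""
    return list(range(abs(l_in - l_out), l_in + l_out + 1))

def enumerate_paths(in_irreps, out_irreps):
    """List every path (i, l, j) from input channel i to output channel j."""
    paths = []
    for i, l_i in enumerate(in_irreps, start=1):
        for j, l_j in enumerate(out_irreps, start=1):
            for l in allowed_orders(l_i, l_j):
                paths.append((i, l, j))
    return paths

in_irreps = [0, 0, 1, 2]       # orders of F_1 .. F_4 in Table 1
out_irreps = [0, 1, 2, 2, 2]   # orders of F'_1 .. F'_5 in Table 1
paths = enumerate_paths(in_irreps, out_irreps)

print(allowed_orders(1, 1))  # [0, 1, 2]
print(len(paths))            # 42
```

Each path carries its own set of learnable weights $w(k, i\times l\to j)$, one per radial basis function $b_{k}$, so the number of paths directly determines the parameter count of a layer.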

This calculation is implemented in e3nn by sampling the continuous kernel at the grid points of the voxel grid, yielding an ordinary kernel: this kernel is then convolved over the input irreps. This means that efficient CUDA implementations of convolutional layers can be used during training, and that at test time the (rather computationally expensive) tensor product operations can be avoided by precomputing an ordinary CNN from the equivariant network.

Since the radial basis functions all vanish at zero, the convolutional kernels yielded are necessarily zero at the origin: to account for this we also include, at each convolutional layer, a self connection layer, which is simply a pointwise weighted tensor product: this can be seen as the equivalent of a convolutional layer with 1 x 1 x 1 kernel. Our feature extractor is then the sum of the convolutional layer and the self connection layer, and it is this layer that we use to replace an ordinary convolution in the Unet architecture.
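The need for the self connection can be illustrated by sampling a radial basis on a 5x5x5 voxel grid. This is a sketch under stated assumptions rather than the library's implementation: the bump centres are assumed to be spaced evenly strictly inside $(0, r_{max})$, and the cutoff `r_max=2.5` and the function name `radial_basis` are hypothetical choices made for illustration.

```python
import numpy as np

def sus(x):
    """Vectorised soft unit step: exp(-1/x) for x > 0, else 0."""
    safe = np.where(x > 0, x, 1.0)  # avoid division by zero in the unused branch
    return np.where(x > 0, np.exp(-1.0 / safe), 0.0)

def radial_basis(r, r_max, num_basis):
    """Evaluate num_basis bumps (Equation 1) with centres strictly inside (0, r_max).

    The placement of the centres is an assumption made for this sketch.
    """
    centers = np.linspace(0.0, r_max, num_basis + 2)
    step = centers[1] - centers[0]
    centers = centers[1:-1]  # drop the endpoints 0 and r_max
    diff = (r[..., None] - centers) / step
    return 8.433573 * sus(diff + 1.0) * sus(1.0 - diff)

size = 5
axis = np.arange(size) - size // 2
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).astype(float)
r = np.linalg.norm(grid, axis=-1)                 # distance of each voxel from the centre

basis = radial_basis(r, r_max=2.5, num_basis=5)   # shape (5, 5, 5, 5)
center = basis[size // 2, size // 2, size // 2]

print(basis.shape, center)
```

Under these assumptions every basis function is zero at the central voxel, so any kernel built from them carries no weight at the origin; the pointwise self connection supplies the missing 1 x 1 x 1 term.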

2.3 Pooling, upsampling, non-linearities and normalization layers

The crucial observation in creating equivariant layers is that while scalar features can be treated pointwise, as in an ordinary network, the components of vectors and tensors must be transformed together, rather than treated as tuples of scalar values.

In line with Weiler et al. (2018) we use gated nonlinearities, in which an auxiliary scalar feature calculated from the irreducible feature is passed through a sigmoid nonlinearity, which is then multiplied with the feature to induce a nonlinear response.
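A minimal NumPy sketch of such a gate follows, assuming the gate scalars have already been computed by a preceding equivariant layer. The array shapes and the function name `gated_nonlinearity` are illustrative and do not reproduce the e3nn interface.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_nonlinearity(gate_scalar, feature):
    """Scale each irrep-valued feature by the sigmoid of its scalar gate.

    gate_scalar: array of shape (...,)
    feature:     array of shape (..., 2l+1)
    """
    return sigmoid(gate_scalar)[..., None] * feature

rng = np.random.default_rng(0)
gates = rng.normal(size=4)          # one scalar gate per location
vectors = rng.normal(size=(4, 3))   # one l=1 feature per location

c, s = np.cos(0.4), np.sin(0.4)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Rotating before or after the gate gives the same result.
print(np.allclose(gated_nonlinearity(gates, vectors @ R.T),
                  gated_nonlinearity(gates, vectors) @ R.T))
```

Because the gate is a scalar (rotation-invariant) and only rescales the feature, the nonlinearity commutes with rotations of the feature components, which a componentwise ReLU would not.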

For the encoding path, we apply maxpooling to the scalar valued feature components. For a vector or tensor valued component $v$ , we pool by keeping the vector with the greatest $l^{2}$ norm. Trilinear upsampling (used in the decoding path) is already an equivariant operation.
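The norm-based pooling of vector- or tensor-valued components can be sketched as follows, assuming a single feature channel stored as an array of shape (X, Y, Z, 2l+1) with spatial sizes divisible by the window size. The function name `norm_max_pool` is illustrative.

```python
import numpy as np

def norm_max_pool(v, k=2):
    """In each k x k x k window, keep the feature with the largest l2 norm.

    v: array of shape (X, Y, Z, C), with C = 2l+1 and X, Y, Z divisible by k.
    """
    X, Y, Z, C = v.shape
    blocks = v.reshape(X // k, k, Y // k, k, Z // k, k, C)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    blocks = blocks.reshape(X // k, Y // k, Z // k, k ** 3, C)
    norms = np.linalg.norm(blocks, axis=-1)   # (X/k, Y/k, Z/k, k^3)
    winner = norms.argmax(axis=-1)            # index of the largest-norm feature
    picked = np.take_along_axis(blocks, winner[..., None, None], axis=3)
    return picked[..., 0, :]

rng = np.random.default_rng(0)
field = rng.normal(size=(4, 4, 4, 3))   # a single l=1 channel
pooled = norm_max_pool(field)

print(pooled.shape)  # (2, 2, 2, 3)
```

Since the $l^{2}$ norm is unchanged by a rotation of the components, the same voxel wins in each window whether the components are rotated before or after pooling; pooling each component separately would not have this property. The sketch checks only the rotation of feature components, not the accompanying rotation of the voxel grid.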

We apply ordinary instance normalization (Ulyanov et al., 2016) to the scalar features. Similarly, to instance-normalize a vector- or tensor-valued feature $v$ we divide by the mean $l^{2}$ norm of that feature per instance: $\mathrm{norm}(v):=v/\mathbb{E}(\lVert v\rVert)$ .
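The normalization of a non-scalar feature can be sketched in a few lines. The small constant `eps` is an assumption added here for numerical safety and is not mentioned in the text; the function name is likewise illustrative.

```python
import numpy as np

def norm_instance_norm(v, eps=1e-8):
    """Divide a vector- or tensor-valued feature by its mean l2 norm.

    v: array of shape (X, Y, Z, C) holding one feature channel of one instance.
    eps: small constant guarding against division by zero (an assumption).
    """
    mean_norm = np.linalg.norm(v, axis=-1).mean()
    return v / (mean_norm + eps)

rng = np.random.default_rng(0)
feature = 5.0 * rng.normal(size=(4, 4, 4, 3))
normalized = norm_instance_norm(feature)

print(np.linalg.norm(normalized, axis=-1).mean())  # close to 1
```

Only a single rotation-invariant scale is divided out, so the direction of each vector is untouched and the operation commutes with rotations. Subtracting a per-component mean, as ordinary instance normalization does, would break this.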

2.4 Related Work

Previously published rotation-equivariant Unets have been restricted to 2D data and G-CNN layers (Chidester et al., 2019; Linmans et al., 2018; Pang et al., 2020; Winkens et al., 2018). A preprint describing a segmentation network based on e3nn filters applied to multiple sclerosis segmentation for the specific use case of six-dimensional diffusion MRI data is available (Müller et al., 2021): in this particular setting each voxel carries three-dimensional q-space data, with the network capturing equivariance in both voxel space and q-space. In contrast to the current paper, ordinary (non-equivariant) networks were unable to adequately perform the required segmentation task (lesion segmentation from diffusion data). This leaves open the question of whether equivariant networks have advantages over plain CNNs in the case of more typical 3D medical data. Here we show that end-to-end equivariant networks are indeed advantageous even when operating on scalar-valued inputs and outputs.

Other works in the application of equivariant networks to 3D data have focused on classification rather than segmentation, primarily using G-CNNs (Andrearczyk et al., 2019).

3 Methods

3.1 Model architectures

3.1.1 Irreducible representations

The design of equivariant architectures offers somewhat more freedom than that of their non-equivariant counterparts, insofar as we have more degrees of freedom in specifying the feature dimension at each layer: not just how many features, but how many of each irreducible order. To keep our experiments simple, we chose to fix a ratio 8:4:2 of order 0, 1 and 2 irreps in each layer other than input and output. We also include an equal number of odd and even irreps. In the notation of the e3nn library, this combination is denoted 8x0e + 4x1e + 2x2e, and corresponds to an ordinary feature depth of 30.
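The feature depth of 30 follows from the irrep dimension $2l+1$: $8\cdot 1+4\cdot 3+2\cdot 5=30$. The following hand-rolled sketch performs the same count for an irreps string; the parser `irreps_dim` is illustrative and is not the e3nn API (e3nn's `o3.Irreps` class offers its own `dim` attribute for this purpose).

```python
def irreps_dim(spec):
    """Total feature dimension of an irreps string such as '8x0e + 4x1e + 2x2e'."""
    total = 0
    for term in spec.split("+"):
        multiplicity, irrep = term.strip().split("x")
        order = int(irrep[:-1])          # drop the trailing parity letter (e or o)
        total += int(multiplicity) * (2 * order + 1)
    return total

print(irreps_dim("8x0e + 4x1e + 2x2e"))    # 30
print(irreps_dim("16x0e + 8x1e + 4x2e"))   # 60
```

The second call shows the effect of doubling the multiplicities while keeping the 8:4:2 ratio, which doubles the equivalent feature depth.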

3.1.2 Kernel dimension and radial basis functions

Aliasing effects mean that if we choose a kernel which is too small, higher spherical harmonics may not contribute (or contribute poorly) to learning. For this reason, we choose a larger kernel (5x5x5) than often used in segmentation networks.

In addition to specifying the size of the convolutional kernel we must also specify which and how many radial basis functions are used to parameterize the radial component of the convolutional filters. For each equivariant kernel we use five radial basis functions of the form given in Equation 1.

3.1.3 Reference and equivariant Unet architectures.

As a reference implementation of Unet we used the nnUnet library (Isensee et al., 2021), with $5^{3}$ convolutional kernels, instance normalization, and leaky ReLU activation after each convolutional layer. The network uses maxpooling layers for downsampling in the encoding path, trilinear upsampling in the decoding path, and has two convolutional blocks before every maxpooling layer and after every upsampling. The number of features doubles with every maxpooling and halves with every upsampling, in accordance with the usual Unet architecture.

We mirror this architecture in the equivariant Unet, simply replacing the ordinary convolutions with equivariant convolutions/self-connections (using the ratios of irreps specified above), equivariant instance normalization and gate activation after each convolution. The network uses equivariant maxpooling layers for downsampling in the encoding path, and trilinear upsampling in the decoding path, and the number of irreps of each order double at each maxpooling and halve with every upsampling.

4 Datasets and Experiments

We carried out a number of experiments to validate the hypothesis that equivariant Unet models are sample efficient, parameter efficient and robust to poses unseen during training. In all experiments, we used categorical cross entropy as loss function, with an Adam optimizer, a learning rate of $5\text{e-}3$ and early stopping on the validation loss with a patience of 25 epochs. Networks were trained on 128x128x128 voxel patches and prediction of the test volumes was performed using patch-wise prediction with overlapping patches and Gaussian weighting (Isensee et al., 2021). In all cases, we used the Dice similarity metric to compare the segmentation output of the network to the reference standard.
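For reference, the Dice similarity metric for a single label can be computed as in the sketch below. The convention of returning NaN when the label is absent from both segmentations is an assumption made for this example, not something stated in the text.

```python
import numpy as np

def dice_score(pred, ref, label):
    """Dice similarity 2|P ∩ R| / (|P| + |R|) for one label of two label maps."""
    p = pred == label
    r = ref == label
    denom = p.sum() + r.sum()
    if denom == 0:
        return float("nan")  # label absent from both maps (assumed convention)
    return 2.0 * np.logical_and(p, r).sum() / denom

pred = np.array([[1, 1, 0], [0, 2, 2]])
ref = np.array([[1, 0, 0], [0, 2, 2]])

print(dice_score(pred, ref, 1))  # 2 * 1 / (2 + 1) = 0.666...
print(dice_score(pred, ref, 2))  # 1.0
```

The same function applies unchanged to 3D label volumes, since it only compares boolean masks elementwise.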

4.1 Medical Image Decathlon: Brain Tumor segmentation

484 manually annotated volumes of multimodal imaging data (FLAIR, T1 weighted, T1 weighted postcontrast and T2 weighted imaging) of brain tumor patients were taken from the Medical Segmentation Decathlon (Antonelli et al., 2021) and randomly separated as 340 train, 95 validation and 49 test volumes. The four imaging contrasts are illustrated in Fig. 3. We trained both an equivariant Unet and an ordinary Unet, each with three downsampling/upsampling layers, for the task of segmenting the three subcompartments of the brain tumor. The basic Unet had a feature depth of 30 in the convolutions of the top layer, with the equivariant network having an equivalent depth of 30 features. The Unet was trained both with and without rotational data augmentation (rotation through an angle $\in(0,360)$ with bspline interpolation), on both the full training set and also subsets of the training set (number of training samples was $2^{n}$ for $n$ between 0 and 9, inclusive). With this we aim to study the sample efficiency of the two architectures.

We do not expect orientation cues to be helpful in segmenting brain tumors (which are largely isotropic) and therefore expect that both the ordinary and equivariant Unet will maintain performance under rotation of the input volume, and that data augmentation will be primarily useful where amounts of training data are small.

4.2 Mindboggle101 dataset: Healthy appearing brain structure segmentation

From the 20 manually annotated volumes of the Mindboggle101 dataset (Klein and Tourville, 2012), we selected 7 volumes for training, 3 volumes for validation and 10 volumes for testing. The Mindboggle101 labelling contains a very large set of labels, including both cortical regions and subcortical structures. We defined the following subset of structures as target volumes for segmentation: cerebellum, hippocampus, lateral ventricles, caudate, putamen, pallidum and brain stem. These structures are shown in Figure 4. Some of these structures (ventricles, cerebellum) can be easily identified by intensity or local texture, while others (caudate, putamen, pallidum) are difficult to distinguish except by spatial cues. As above, an equivariant Unet and an ordinary Unet, each with three downsampling/upsampling layers, were trained to segment these structures: again both networks had an equivalent feature depth (30 features). The ordinary Unet was trained without data augmentation and with full rotational data augmentation (rotation through an angle $\in(0,360)$ in either the axial, sagittal or coronal plane, with bspline interpolation). The ordinary Unet was trained a third time with a data augmentation scheme closer to that seen in usual practice (rotation through an angle $\in(-20,20)$ in either the axial, sagittal or coronal plane, with bspline interpolation). Once trained, these models were then applied to the testing set rotated through various angles $\in(0,180)$ , to test the sensitivity of the various models to variations in pose.

Finally, we trained variants of the Unet and equivariant Unet with increasing model capacity, as measured by the equivalent feature depth in the top layer, and applied these models to the unrotated test samples to assess the parameter efficiency of the two architectures. Here no data augmentation was employed.

5 Results

5.1 Brain Tumor segmentation

Table 3 shows the performance of the equivariant Unet (e3nn) versus a reference Unet (nnUnet) with and without data augmentation, over the 49 testing examples. In Figure 5 we show performance on the testing set for networks trained on subsets of the training volumes averaging over all compartments. The gap in performance between the equivariant and reference networks is largest where data is scarce, and as expected this is also where data augmentation has the largest effect on the performance of the ordinary Unet.

The performance of the equivariant network and the reference networks remained similar when the input images were rotated through an arbitrary angle. As expected, not only the equivariant network but also the reference network trained without data augmentation maintained good performance when fed orientations unseen during training, validating the hypothesis that orientation/location cues are not helpful for identifying brain tumors.

5.2 Healthy-appearing brain-structure segmentation

5.2.1 Performance on the test set

In Figure 6 we show the results of the segmentation of the various brain structures of the Mindboggle101 dataset. For a non-rotated version of the test set, the equivariant network (e3nn) and the unaugmented reference network (nnUnet) performed similarly well in segmenting all structures except the pallidum and putamen, where the equivariant network showed a higher performance. Moderate data augmentation with angles less than 20° (nnUnet da rot 20) had a slight negative effect on performance in some structures when applied to the reference network. The non-equivariant Unet trained with full rotational data augmentation performed on par with the equivariant network but underperformed on the hippocampus and pallidum. It is worth noting that this network (nnUnet da) required 2.5 more training epochs to learn with respect to the non-augmented nnUnet and the slightly augmented nnUnet (da rot $\pm$ 20°).

5.2.2 Performance on rotated inputs from the test set

When tested on rotated versions of the testing volumes, the non-equivariant reference network’s performance smoothly declines even for angles $<$ 20°. The equivariant network, as expected, is not affected by rotations: small fluctuations in Dice coefficient can be accounted for by interpolation artifacts in the input image. We only show the results of the rotation experiment when rotated in the axial plane but similar results were obtained when rotating in the coronal and sagittal plane (see appendix).

5.2.3 Performance as a function of number of parameters

We trained the SO(3) equivariant model and non-equivariant reference Unet with different numbers of input features. We used the dimension $dim=2l+1$ of each equivariant model’s top features to set the number of top features of the reference Unet. Figure 7 shows the Dice score versus the number of parameters for various numbers of top-level features of both models. The rotation-equivariant model has fewer parameters than any of the reference Unet implementations. We also included versions of the reference model trained with $3^{3}$ kernels, which is the kernel size generally used.

6 Conclusion

In this paper we have presented a variant of the Unet architecture designed to be used for any volumetric segmentation task in which the predicted label set is invariant to Euclidean rotations. The network can be used as a drop-in replacement for a regular 3D Unet without prior knowledge of the mathematics behind the equivariant convolutions, with equivalent or better performance on in-sample data, no need to train using (potentially computationally expensive) data augmentation and SO(3) or O(3) equivariance "for free". This equivariance mathematically guarantees good performance on data with orientations not seen during training. In our experiments this effect was dramatically superior to that of the usual data-augmentation strategies. As our experiments show, a small amount of augmentation may have no effect, a mild positive effect, or a mild negative effect. This may be due to competing effects: the addition of rotated examples to the training pool increases the total amount of information available to the classifier but may also introduce erroneous training examples owing to interpolation artifacts in the images, the labels, or both. This may explain the reduced performance of the baseline Unet with data augmentation in the case of brain tumor segmentation.

Our experiments also support the hypothesis that an equivariant network can learn from fewer training samples than a reference network, performs better in the segmentation of oriented structures, and has far fewer parameters than an equivalent non-equivariant model. While we have focused on a single architecture in this paper, the types and number of top-level features, the number of downsampling operations, the kernel size and the normalization can be easily customized in our library. The kind of symmetry enforced by the network is also customizable. The experiments in this paper focused on SO(3) rather than O(3) equivariance (which also enforces equivariance to inversions), but our implementation can easily create models with O(3) equivariance as well.

Limitations

We consider in this paper only one kind of equivariant feature extractor: SO(3)-equivariant kernels based on one author's previous work in Weiler et al. (2018). In particular, we do not compare to G-CNNs. A rich library for building G-CNNs on Euclidean data, based on a variety of subgroups and discretizations of SE(3) and E(3) (and indeed subgroups of E(n) for any n), is available, and we would expect a Unet based on these convolutions to have many of the benefits seen in our setting. However, one substantial drawback of any G-CNN is the multiplicity of channels needed to represent a group convolution: given a discretization/subgroup $\mathcal{G}$ of SO(3), each convolution requires computing and storing $|\mathcal{G}|$ feature maps. Based on the discretizations examined in Cesa et al. (2021), this may entail up to a 192-fold increase in computational cost and memory footprint over an ordinary CNN. This may be feasible for classification on input data such as Modelnet (30 x 30 x 30 volumes), but for 3D medical image segmentation this kind of equivariance would require an infeasible amount of GPU memory or a drastic reduction in the size of input patches.
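A rough back-of-the-envelope sketch of this memory argument is given below. It counts only the activations of a single layer for a single sample, stored as 32-bit floats. The 30 x 30 x 30 volume and the factor of 192 come from the text above; the 128 x 128 x 128 patch size and the 16 channels are assumptions chosen for illustration.

```python
def feature_map_bytes(spatial, channels, group_order=1, bytes_per_value=4):
    """Bytes for one layer's activations; a G-CNN stores |G| maps per channel."""
    depth, height, width = spatial
    return depth * height * width * channels * group_order * bytes_per_value


GIB = 1024**3

# Classification-sized input (30^3), 16 hypothetical channels, |G| = 192.
small = feature_map_bytes((30, 30, 30), 16, group_order=192)

# Hypothetical segmentation patch (128^3), same channels and group order.
patch = feature_map_bytes((128, 128, 128), 16, group_order=192)

print(f"30^3 volume: {small / GIB:.2f} GiB per layer")
print(f"128^3 patch: {patch / GIB:.2f} GiB per layer")
```

Under these assumptions the small volume needs roughly a third of a GiB per layer, while the segmentation-sized patch needs 24 GiB for one layer alone, before gradients or further layers are counted.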

We have limited ourselves to data with publicly available images and labels, in order to maximize reproducibility: in particular, the experiments on the Mindboggle dataset do not have sufficient statistical power to show a significant difference between the methods examined. Nonetheless, we believe the effects of equivariance on this publicly available data are compelling on their own and are confident that a reproduction on a much larger dataset (trained and evaluated on, for example, Freesurfer outputs) would show similar results, albeit in a somewhat less reproducible fashion.

We used a fixed learning rate and training strategy for each network: thorough hyperparameter tuning would almost certainly improve the performance of each network presented here. In particular, we did not adjust the ratio of $l=0$, $l=1$ and $l=2$ features at any point, nor did we include features with a higher $l$. Nonetheless, we believe the experiments here are sufficient to support our claims: that equivariant Unets can be used as a drop-in replacement for more commonly used Unets without loss of performance and with substantial advantages in data and parameter efficiency.

Remarks and Future Work

Code to build equivariant segmentation networks based on e3nn for other tasks is available at http://github.com/SCAN-NRAD/e3nn_Unet. This library supports not just scalar inputs and outputs, but also inputs and outputs valued in any irreducible representation. In the future it will be interesting to examine the possibility of using odd-order scalar network outputs to segment structures with bilateral symmetry using E(3)-equivariant networks, and to investigate whether equivariant networks with vector-valued outputs are more robust than ordinary convolutional networks in, for example, the task of finding diffeomorphic deformation fields (Balakrishnan et al., 2019).

Acknowledgments

This work was supported by Spark Grant CRSK-3_195801 and by the Research Fund of the Center for Artificial Intelligence in Medicine, University of Bern, for 2022-23.

Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.

Conflicts of Interest

We declare that we have no conflicts of interest.

7 Appendix

In the following figures, we show the Dice performance of the rotation-equivariant model and of the reference network with no data augmentation, with data augmentation up to 20∘, and with full data augmentation, on the test set rotated through angles from 0∘ to 20∘.

Bibliography

Andrearczyk et al. (2019) Vincent Andrearczyk, Julien Fageot, Valentin Oreiller, Xavier Montet, and Adrien Depeursinge. Exploring local rotation invariance in 3d cnns with steerable filters. In International Conference on Medical Imaging with Deep Learning, pages 15–26. PMLR, 2019.
Antonelli et al. (2021) Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, Bram van Ginneken, et al. The medical segmentation decathlon. arXiv preprint arXiv:2106.05735, 2021.
Balakrishnan et al. (2019) Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. Voxelmorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging, 38(8):1788–1800, 2019.
Cesa et al. (2021) Gabriele Cesa, Leon Lang, and Maurice Weiler. A program to build e(n)-equivariant steerable cnns. In International Conference on Learning Representations, 2021.
Chidester et al. (2019) Benjamin Chidester, That-Vinh Ton, Minh-Triet Tran, Jian Ma, and Minh N. Do. Enhanced rotation-equivariant u-net for nuclear segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1097–1104, 2019.
Cohen and Welling (2016) Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999. PMLR, 2016.
Geiger et al. (2020) Mario Geiger, Tess Smidt, Alby M., Benjamin Kurt Miller, Wouter Boomsma, Bradley Dice, Kostiantyn Lapchevskyi, Maurice Weiler, Michał Tyszkiewicz, Simon Batzner, Martin Uhrin, Jes Frellsen, Nuri Jung, Sophia Sanborn, Josh Rackers, and Michael Bailey. Euclidean neural networks: e3nn, 2020. URL https://doi.org/10.5281/zenodo.5292912.
Isensee et al. (2021) Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, February 2021. ISSN 1548-7091, 1548-7105.