Unsupervised Deep Learning for Structured Shape Matching

Jean-Michel Roufosse; Abhishek Sharma; Maks Ovsjanikov

arXiv:1812.03794·cs.GR·August 23, 2019

TL;DR

This paper introduces SURFMNet, an unsupervised deep learning approach for 3D shape correspondence that achieves state-of-the-art results without requiring ground truth data, and is faster and more general than previous methods.

Contribution

It presents a novel unsupervised learning framework for shape matching based on functional maps, eliminating the need for ground truth correspondences and improving efficiency.

Findings

1. Achieves state-of-the-art unsupervised shape matching results.
2. Comparable to supervised methods in accuracy.
3. Significantly faster and more general than existing approaches.

Tables

Table 1: Ablation study of penalty terms in our method on the FAUST benchmark.

Methods	E1+E2+E3+E4	E3	E1	E2	E4
Geodesic Error	0.020	0.073	0.083	0.152	0.252

Table 2: Runtime of different methods averaged over 190 shape pairs.

	Runtime
Methods	Pre-processing	Training	Testing	Post-processing	Total
FMNet	60s	1500s	0.3s	N/A	1650s
FMNet + PMF	60s	1500s	0.3s	30s	1680s
Fmap Basic	10s	N/A	60s	N/A	120s
BCICP	N/A	N/A	60s	180s	240s
SURFMNet	10s	25s	0.3s	N/A	35s
SURFMNet + ICP	10s	25s	0.3s	10s	45s

Table 3: Quantitative comparison on all three benchmark datasets for the shape correspondence problem.

(Results are $\times 10^{- 3}$ )	FAUST 7k			FAUST 5k			SCAPE 5k
Supervised Methods	Mean	95th Percentile	Maximum	Mean	95th Percentile	Maximum	Mean	95th Percentile	Maximum
FMNet	25.01	63.11	1207.8	112.8	451.8	1280.6	172.6	543.8	1399.6
SURFMNet Subset	19.83	52.11	1204.0	92.09	493.6	1279.4	60.32	329.8	1068.7
FMNet + PMF	2.98	14.10	1222.7	83.61	395.7	1576.4	63.00	159.8	1561.5
SURFMNet-sub + PMF	5.33	22.90	1302.4	74.80	408.5	1619.3	51.03	111.5	1555.6
FMNet + ICP	11.16	27.91	1206.8	47.53	237.3	1348.6	81.76	341.4	1226.5
SURFMNet-sub + ICP	11.79	35.76	1088.4	30.47	95.64	1277.3	23.00	54.76	73.18
GCNN	_	_	_	50.49	206.3	1578.2	71.85	374.2	1523.7
Unsupervised Methods
BCICP	15.46	53.27	572.4	31.08	64.51	1149.9	22.28	50.60	107.5
PMF (Gaussian Kernel)	29.42	83.80	1168.1	75.13	236.9	1632.7	54.68	156.9	465.1
PMF (Heat Kernel)	17.26	25.06	1168.1	31.08	64.51	1150.0	47.23	133.4	802.1
Fmap Basic	457.56	1171.4	1568.4	366.2	1159.0	1549.1	383.0	1043.7	1280.3
Fmap Ours Opt	9.75	30.02	420.2	20.19	53.24	1169.5	13.98	31.16	86.45
SURFMNet-all	7.89	26.01	572.4	18.56	50.25	1156.3	17.50	42.50	228.8

Table 4: Ablation study of penalty terms in our method and comparison with the supervised FMNet on the FAUST benchmark.

Methods	E1+E2+E3+E4	E3	E1+E2+E3	E1+E3+E4	E1	E2+E3+E4	E1+E2+E4	E2	E4	FMNet	Ours-Sub	Ours-all
Mean Geodesic Error	0.044	0.073	0.081	0.077	0.111	0.079	0.126	0.135	0.330	0.025	0.020	0.008

Equations

$$\mathbf{C}_{\text{opt}}=\operatorname*{arg\,min}_{\mathbf{C}_{12}}E_{\text{desc}}\big(\mathbf{C}_{12}\big)+\alpha E_{\text{reg}}\big(\mathbf{C}_{12}\big) \tag{1}$$

$$E_{\text{reg}}\big(\mathbf{C}_{12}\big)=\big\|\mathbf{C}_{12}\mathbf{\Lambda}_{1}-\mathbf{\Lambda}_{2}\mathbf{C}_{12}\big\|^{2} \tag{2}$$

$$\min_{T}\sum_{(S_{1},S_{2})\in\text{Train}}l_{F}\big(Soft(\mathbf{C}_{\text{opt}}),GT_{(S_{1},S_{2})}\big),\quad\text{where} \tag{3}$$

$$\mathbf{C}_{\text{opt}}=\operatorname*{arg\,min}_{\mathbf{C}}\big\|\mathbf{C}\mathbf{A}_{T(D_{1})}-\mathbf{A}_{T(D_{2})}\big\|. \tag{4}$$

$$\min_{T}\sum_{(S_{1},S_{2})}\sum_{i\in\text{penalties}}w_{i}E_{i}\big(\mathbf{C}_{12},\mathbf{C}_{21}\big),\quad\text{where} \tag{5}$$

$$\mathbf{C}_{12}=\operatorname*{arg\,min}_{\mathbf{C}}\big\|\mathbf{C}\mathbf{A}_{T(D_{1})}-\mathbf{A}_{T(D_{2})}\big\|, \tag{6}$$

$$\mathbf{C}_{21}=\operatorname*{arg\,min}_{\mathbf{C}}\big\|\mathbf{C}\mathbf{A}_{T(D_{2})}-\mathbf{A}_{T(D_{1})}\big\|. \tag{7}$$

$$E_{1}=\big\|\mathbf{C}_{12}\mathbf{C}_{21}-\mathbf{I}\big\|^{2}+\big\|\mathbf{C}_{21}\mathbf{C}_{12}-\mathbf{I}\big\|^{2}$$

$$E_{2}=\big\|\mathbf{C}_{12}^{\top}\mathbf{C}_{12}-\mathbf{I}\big\|^{2}+\big\|\mathbf{C}_{21}^{\top}\mathbf{C}_{21}-\mathbf{I}\big\|^{2}$$

$$E_{3}=\big\|\mathbf{C}_{12}\mathbf{\Lambda}_{1}-\mathbf{\Lambda}_{2}\mathbf{C}_{12}\big\|^{2}+\big\|\mathbf{C}_{21}\mathbf{\Lambda}_{2}-\mathbf{\Lambda}_{1}\mathbf{C}_{21}\big\|^{2}$$

$$E_{4}=\sum_{(f_{i},g_{i})\in\text{Descriptors}}\big\|\mathbf{C}_{12}\mathbf{M}_{f_{i}}-\mathbf{M}_{g_{i}}\mathbf{C}_{12}\big\|^{2}+\big\|\mathbf{C}_{21}\mathbf{M}_{g_{i}}-\mathbf{M}_{f_{i}}\mathbf{C}_{21}\big\|^{2},\quad\mathbf{M}_{f_{i}}=\Phi^{+}\text{Diag}(f_{i})\Phi,\quad\mathbf{M}_{g_{i}}=\Psi^{+}\text{Diag}(g_{i})\Psi.$$

Full text

Jean-Michel Roufosse

LIX, École Polytechnique

Abhishek Sharma

LIX, École Polytechnique

Maks Ovsjanikov

LIX, École Polytechnique

Abstract

We present a novel method for computing correspondences across 3D shapes using unsupervised learning. Our method computes a non-linear transformation of given descriptor functions, while optimizing for global structural properties of the resulting maps, such as their bijectivity or approximate isometry. To this end, we use the functional maps framework, and build upon the recent FMNet architecture for descriptor learning. Unlike that approach, however, we show that learning can be done in a purely unsupervised setting, without having access to any ground truth correspondences. This results in a very general shape matching method that we call SURFMNet for Spectral Unsupervised FMNet, and which can be used to establish correspondences within 3D shape collections without any prior information. We demonstrate on a wide range of challenging benchmarks, that our approach leads to state-of-the-art results compared to the existing unsupervised methods and achieves results that are comparable even to the supervised learning techniques. Moreover, our framework is an order of magnitude faster, and does not rely on geodesic distance computation or expensive post-processing.

1 Introduction

Shape matching is a fundamental problem in computer vision and geometric data analysis, with applications in deformation transfer [42] and statistical shape modeling [6] among other domains. During the past decades, a large number of techniques have been proposed for both rigid and non-rigid shape matching [44]. The latter case is both more general and more challenging since the shapes can potentially undergo arbitrary deformations (see Figure 1), which are not easy to characterize by purely axiomatic approaches. As a result, several recent learning-based techniques have been proposed for addressing the shape correspondence problem, e.g. [10, 25, 26, 51] among many others. Most of these approaches are based on the idea that the underlying correspondence model can be learned from data, typically given in the form of ground truth correspondences between some shape pairs. In the simplest case, this can be formulated as a labeling problem, where different points, e.g., in a template shape, correspond to labels to be predicted [51, 27]. More recently, several methods have been proposed for structured map prediction, aiming to infer an entire map, rather than labeling each point independently [10, 23]. These techniques are based on learning pointwise descriptors, but, crucially, impose a penalty on the entire map, obtained using these descriptors, resulting in higher quality, globally consistent correspondences. Nevertheless, while learning-based methods have achieved impressive performance, their utility is severely limited by requiring the presence of high-quality ground truth maps between a sufficient number of training examples. This makes it difficult to apply such approaches to new shape classes for which ground truth data is not available.

In our paper, we show that this limitation can be lifted and propose a purely unsupervised strategy, which combines the accuracy of learning-based methods with the generality of axiomatic techniques for shape correspondence. The key to our approach is a bi-level optimization scheme, which optimizes for descriptors on the shapes, but imposes a penalty on the entire map, inferred from them. For this, we use the recently proposed FMNet architecture [23], which exploits the functional map representation [30]. However, rather than penalizing the deviation of the map from the ground truth, we enforce structural properties on the map, such as its bijectivity or approximate isometry. This results in a shape matching method that achieves state-of-the-art accuracy among unsupervised methods and, perhaps surprisingly, achieves comparable performance even to supervised techniques.

2 Related Work

Computing correspondences between 3D shapes is a very well-studied area of computer vision and computer graphics. Below we only review the most closely related methods and refer the interested readers to recent surveys including [46, 44, 5] for more in-depth discussions.

Functional Maps

Our method is built on the functional map representation, which was originally introduced in [30] for solving non-rigid shape matching problems, and then extended significantly in follow-up works, including [2, 21, 20, 9, 15, 36] among many others (see also [31] for a recent overview).

One of the key benefits of this framework is that it allows us to represent maps between shapes as small matrices, which encode relations between basis functions defined on the shapes. Moreover, as observed by several works in this domain [30, 40, 21, 36, 9], many natural properties of the underlying pointwise correspondences can be expressed as objectives on functional maps. This includes orthonormality of functional maps, which corresponds to the local area-preservation nature of pointwise correspondences [30, 21, 40]; commutativity with the Laplacian operators, which corresponds to intrinsic isometries [30]; preservation of inner products of gradients of functions, which corresponds to conformal maps [40, 9, 50]; preservation of pointwise products of functions, which corresponds to functional maps arising from point-to-point correspondences [29, 28]; and the slanted diagonal structure of functional maps in the context of partial shapes [36, 24] among others.

Similarly, several other regularizers have been proposed, including exploiting the relation between functional maps in different directions [14], the map adjoint [18], and powerful cycle-consistency constraints [17] in shape collections to name a few. More recently, constraints on functional maps have been introduced to promote map continuity [35, 34], along with kernel-based techniques for extracting more information from given descriptors [49] among others. All these methods, however, are based on combining first-order penalties that arise from enforcing descriptor preservation constraints with these additional desirable structural properties of functional maps. As a result, any artefact or inconsistency in the pre-computed descriptors will inevitably lead to severe map estimation errors. Several methods have suggested using robust norms [21, 20], which can help reduce the influence of certain descriptors but still do not control the global map consistency properties.

Most recently, a powerful map optimization technique, BCICP, was introduced in [35]; it combines a large number of functional constraints with sophisticated post-processing and careful descriptor selection. As we show below, our method is simpler, more efficient, and achieves superior accuracy even compared to this recent approach.

Learning-based Methods

To overcome the inherent difficulty of axiomatic techniques, several methods have been introduced to learn the correct deformation model from data. Early approaches in this direction either learned optimal parameters of spectral descriptors [25] or exploited random forests [38] or metric learning [11] to learn optimal constraints given some ground truth matches.

More recently, with the advent of deep learning methods, several approaches have been proposed to learn transformations in the context of non-rigid shape matching. Most of the proposed methods either use Convolutional Neural Networks (CNNs) on depth maps, e.g. for dense human body correspondence [51] or exploit extensions of CNNs directly to curved surfaces, either using the link between convolution and multiplication in the spectral domain [7, 12], or directly defining local parametrizations, for example via the exponential map, which allows convolution in the tangent plane of a point, e.g. [26, 8, 27, 33] among others.

These methods have been applied to non-rigid shape matching, in most cases modeling it as a label prediction problem, with points corresponding to different labels. Although successful in the presence of sufficient training data, such approaches typically do not impose global consistency, and can lead to artefacts, such as outliers, requiring post-processing to achieve high-quality maps.

Learning for Structured Prediction

Most closely related to our approach are recent works that apply learning for structured map prediction [10, 23]. These methods learn a transformation of given input descriptors, while optimizing for the deviation of the map computed from them using the functional map framework, from ground truth correspondences. By imposing a penalty on entire maps, and thus evaluating the ultimate use of the descriptors, these methods have led to significant accuracy improvements in practice. We note that concurrent to our work, Halimi et al. [16] also proposed an unsupervised deep learning method that computes correspondences without using the ground truth. This approach is similar to ours, but is based on computation of geodesic distances, while our method operates purely in the spectral domain making it extremely efficient.

Contribution

Unlike these existing methods, we propose an unsupervised learning-based approach that transforms given input descriptors, while optimizing for structural map properties, without any knowledge of the ground truth or geodesic distances. Our method, which can be seen as a bi-level optimization strategy, allows us to explicitly control the interaction between pointwise descriptors and global map consistency, computed via the functional map framework. As a result, our technique is scalable with respect to shape complexity, leads to significant improvement compared to the standard unsupervised methods, and achieves comparable performance even to supervised approaches.

3 Background & Motivation

3.1 Shape Matching and Functional Maps

Our work is based on the functional map framework and representation. For completeness, we briefly review the basic notions and pipeline for estimating functional maps, and refer the interested reader to a recent course [31] for a more in-depth discussion.

Basic Pipeline

Given a pair of shapes, $S_{1},S_{2}$ represented as triangle meshes, and containing, respectively, $n_{1}$ and $n_{2}$ vertices, the basic pipeline for computing a map between them using the functional map framework consists of the following main steps (see Chapter 2 in [31]):

1. Compute a small set of $k_{1},k_{2}$ basis functions on each shape, e.g. by taking the first few eigenfunctions of the respective Laplace-Beltrami operators.

2. Compute a set of descriptor functions on each shape that are expected to be approximately preserved by the unknown map. For example, a descriptor function can correspond to a particular dimension (e.g. choice of time parameter of the Heat Kernel Signature [43]) computed at every point. Store their coefficients in the respective bases as columns of matrices $\mathbf{A}_{1},\mathbf{A}_{2}$.

3. Compute the optimal functional map $\mathbf{C}$ by solving the following optimization problem:

$$\mathbf{C}_{\text{opt}}=\operatorname*{arg\,min}_{\mathbf{C}_{12}}E_{\text{desc}}\big(\mathbf{C}_{12}\big)+\alpha E_{\text{reg}}\big(\mathbf{C}_{12}\big), \tag{1}$$

where the first term aims at the descriptor preservation: $E_{\text{desc}}\big(\mathbf{C}_{12}\big)=\big\|\mathbf{C}_{12}\mathbf{A}_{1}-\mathbf{A}_{2}\big\|^{2}$, whereas the second term regularizes the map by promoting the correctness of its overall structural properties. The simplest approach penalizes the failure of the unknown functional map to commute with the Laplace-Beltrami operators:

$$E_{\text{reg}}\big(\mathbf{C}_{12}\big)=\big\|\mathbf{C}_{12}\mathbf{\Lambda}_{1}-\mathbf{\Lambda}_{2}\mathbf{C}_{12}\big\|^{2} \tag{2}$$

where $\mathbf{\Lambda}_{1}$ and $\mathbf{\Lambda}_{2}$ are diagonal matrices of the Laplace-Beltrami eigenvalues on the two shapes.

4. Convert the functional map $\mathbf{C}$ to a point-to-point map, for example using nearest neighbor search in the spectral embedding, or using other more advanced techniques [37, 15].

One of the strengths of this pipeline is that typically Eq. (1) leads to a simple (e.g., least squares) problem with $k_{1}k_{2}$ unknowns, independent of the number of points on the shapes. This formulation has been extended using e.g. manifold optimization [22], descriptor preservation constraints via commutativity [29] and, more recently, with kernelization [49] among many others (see also Chapter 3 in [31]).
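As an illustration of why Eq. (1) stays cheap, the following is a minimal NumPy sketch, not the authors' implementation. It assumes descriptor coefficients are stored as columns of `A1` and `A2`, that the Laplace-Beltrami eigenvalues are given as vectors `evals1` and `evals2`, and that `alpha` is a user-chosen weight; the function name is hypothetical. Because $\mathbf{\Lambda}_{1}$ and $\mathbf{\Lambda}_{2}$ are diagonal, the regularizer acts entrywise and each row of $\mathbf{C}$ can be found from a small $k_{1}\times k_{1}$ linear system.

```python
import numpy as np

def solve_functional_map(A1, A2, evals1, evals2, alpha=1e-3):
    """Minimise ||C A1 - A2||^2 + alpha * ||C L1 - L2 C||^2 (hypothetical helper).

    A1: (k1, d) descriptor coefficients on shape 1 (one descriptor per column)
    A2: (k2, d) descriptor coefficients on shape 2
    evals1, evals2: Laplace-Beltrami eigenvalues, shapes (k1,) and (k2,)
    Returns the functional map C of shape (k2, k1).
    """
    k1, k2 = A1.shape[0], A2.shape[0]
    gram = A1 @ A1.T      # (k1, k1), shared by every row of C
    rhs = A1 @ A2.T       # (k1, k2), column i is the right-hand side for row i
    C = np.zeros((k2, k1))
    for i in range(k2):
        # (C L1 - L2 C)[i, j] = C[i, j] * (evals1[j] - evals2[i]),
        # so the regulariser is a diagonal weight on row i.
        weight = alpha * (evals1 - evals2[i]) ** 2
        C[i] = np.linalg.solve(gram + np.diag(weight), rhs[:, i])
    return C
```

The cost depends only on $k_{1}$, $k_{2}$ and the number of descriptors, not on the number of mesh vertices.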

3.2 Deep Functional Maps

Despite its simplicity and efficiency, the functional map estimation pipeline described above is fundamentally dependent on the initial choice of descriptor functions. To alleviate this dependence, several approaches have been proposed to learn the optimal descriptors from data [10, 23]. In our work, we build upon a recent deep learning-based framework, called FMNet, introduced by Litany et al. [23] that aims to transform a given set of descriptors so that the optimal map computed using them is as close as possible to some ground truth map given during training.

Specifically, the approach proposed in [23] assumes, as input, a set of shape pairs for which ground truth point-wise maps are known, and aims to solve the following problem:

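$$\min_{T}\sum_{(S_{1},S_{2})\in\text{Train}}l_{F}\big(Soft(\mathbf{C}_{\text{opt}}),GT_{(S_{1},S_{2})}\big),\quad\text{where} \tag{3}$$

$$\mathbf{C}_{\text{opt}}=\operatorname*{arg\,min}_{\mathbf{C}}\big\|\mathbf{C}\mathbf{A}_{T(D_{1})}-\mathbf{A}_{T(D_{2})}\big\|. \tag{4}$$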

Here $T$ is a non-linear transformation, in the form of a neural network, to be applied to some input descriptor functions $D$ , Train is the set of training pairs for which ground truth correspondence $GT_{(S_{1},S_{2})}$ is known, $l_{F}$ is the soft error loss, which penalizes the deviation of the computed functional map $\mathbf{C}_{\text{opt}}$ , after converting it to a soft map $Soft(\mathbf{C}_{\text{opt}})$ from the ground truth correspondence, and $\mathbf{A}_{T(D_{1})}$ denotes the transformed descriptors $D_{1}$ written in the basis of shape $1$ . In other words, the FMNet framework [23] aims to learn a transformation $T$ of descriptors, so that the transformed descriptors $T(D_{1})$ , $T(D_{2})$ , when used within the functional map pipeline result in a soft map that is as close as possible to some known ground truth correspondence. Unlike methods based on formulating shape matching as a labeling problem this approach evaluates the quality of the entire map, obtained using the transformed descriptors, which as shown in [23] leads to significant improvement compared to several strong baselines.

Motivation

Similarly to other supervised learning methods, although FMNet [23] can result in highly accurate correspondences, its applicability is limited to shape classes for which high-quality ground truth maps are available. Moreover, perhaps less crucially, the soft map loss in FMNet is based on the knowledge of geodesic distances between all pairs of points, making it computationally expensive. Our goal, therefore, is to show that a similar approach can be used more widely, without any training data, while working purely in the spectral domain.

4 SURFMNet

4.1 Overview

In this paper, we introduce a novel approach, which we call SURFMNet for Spectral Unsupervised FMNet. Our method aims to optimize for non-linear transformations of descriptors, in order to obtain high-quality functional, and thus pointwise maps. For this, we follow the general strategy proposed in FMNet [23].

However, crucially, rather than penalizing the deviation of the computed map from the known ground truth correspondence, we evaluate the structural properties of the inferred functional maps, such as their bijectivity or orthogonality. Importantly, we express all these desired properties, and thus the penalties during optimization, purely in the spectral domain, which allows us to avoid the conversion of functional maps to soft maps during optimization as was done in [23]. Thus, in addition to being purely unsupervised, our approach is also more efficient since it does not require pre-computation of geodesic distance matrices or expensive manipulation of large soft map matrices during training.

To achieve these goals, we build on the FMNet model, described in Eq. (3) and (4) in several ways: first, we propose to consider functional maps in both directions, i.e. by treating the two shapes as both source and target; second, we remove the conversion from functional to soft maps; and, most importantly, third, we replace the soft map loss with respect to ground truth with a set of penalties on the computed functional maps, which are described in detail below. Our optimization problem can be written as:

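$$\min_{T}\sum_{(S_{1},S_{2})}\sum_{i\in\text{penalties}}w_{i}E_{i}\big(\mathbf{C}_{12},\mathbf{C}_{21}\big),\quad\text{where} \tag{5}$$

$$\mathbf{C}_{12}=\operatorname*{arg\,min}_{\mathbf{C}}\big\|\mathbf{C}\mathbf{A}_{T(D_{1})}-\mathbf{A}_{T(D_{2})}\big\|, \tag{6}$$

$$\mathbf{C}_{21}=\operatorname*{arg\,min}_{\mathbf{C}}\big\|\mathbf{C}\mathbf{A}_{T(D_{2})}-\mathbf{A}_{T(D_{1})}\big\|. \tag{7}$$

Here, similarly to Eq. (3) above, $T$ denotes a non-linear transformation in the form of a neural network, $(S_{1},S_{2})$ ranges over the pairs of shapes in a given collection, $w_{i}$ are scalar weights, and $E_{i}$ are the penalties, described below. Thus, we aim to optimize for a non-linear transformation of input descriptor functions, such that the functional maps computed from the transformed descriptors possess certain desirable structural properties, which we express as penalties to be minimized. Figure 2 illustrates our proposed method where we denote the total sum of all penalty terms in Eq. (5) as $E_{\text{global}}$ and back-propagation via grey dashed lines.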
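The inner problems that define the two functional maps are ordinary least-squares fits. The sketch below is an illustration in NumPy under the assumption that the transformed descriptor coefficients are stored as columns of `A1` and `A2`; the function name is hypothetical, and the actual implementation uses a differentiable TensorFlow solve instead.

```python
import numpy as np

def functional_maps_from_descriptors(A1, A2):
    """Least-squares functional maps in both directions (illustrative helper).

    A1: (k1, d) coefficients of the transformed descriptors on shape 1
    A2: (k2, d) coefficients of the transformed descriptors on shape 2
    Returns C12 of shape (k2, k1) and C21 of shape (k1, k2).
    """
    # min_C ||C A1 - A2||  is equivalent to  min_X ||A1^T X - A2^T||  with X = C^T.
    C12 = np.linalg.lstsq(A1.T, A2.T, rcond=None)[0].T
    C21 = np.linalg.lstsq(A2.T, A1.T, rcond=None)[0].T
    return C12, C21
```

Both maps depend only on the descriptors, which is why the network parameters are the only unknowns of the outer problem.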

When deriving the penalties used in our approach, we exploit the links between properties of functional maps and associated pointwise maps, that have been established in several previous works [30, 40, 14, 29]. Unlike all these methods, however, we decouple the descriptor preservation constraints from structural map properties. This allows us to optimize for descriptor functions, and thus, gain a very strong resilience in the presence of noisy or uninformative descriptors, while still exploiting the compactness and efficiency of the functional map representation.

4.2 Deep Functional Map Regularization

In our work, we propose to use four regularization terms, by including them as penalties in the objective function, all inspired by desirable map properties.

Bijectivity

Given a pair of shapes and the functional maps in both directions, perhaps the simplest requirement is for them to be inverses of each other, which can be enforced by penalizing the difference between their composition and the identity map. This penalty, used for functional map estimation in [14], can be written simply as:

$$E_{1}=\big\|\mathbf{C}_{12}\mathbf{C}_{21}-\mathbf{I}\big\|^{2}+\big\|\mathbf{C}_{21}\mathbf{C}_{12}-\mathbf{I}\big\|^{2}$$
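A minimal NumPy sketch of the bijectivity penalty follows. It assumes the norm is the squared Frobenius norm, that `C12` has shape $(k_{2},k_{1})$ and `C21` has shape $(k_{1},k_{2})$; the function name is hypothetical.

```python
import numpy as np

def bijectivity_penalty(C12, C21):
    """E1: both compositions of the two maps should equal the identity."""
    I2 = np.eye(C12.shape[0])   # identity on the spectral space of shape 2
    I1 = np.eye(C21.shape[0])   # identity on the spectral space of shape 1
    return np.sum((C12 @ C21 - I2) ** 2) + np.sum((C21 @ C12 - I1) ** 2)
```

The penalty vanishes exactly when the two maps are mutual inverses.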

Orthogonality

As observed in several works [30, 40] a point-to-point map is locally area preserving if and only if the corresponding functional map is orthonormal. Thus, for shape pairs, approximately satisfying this assumption, a natural penalty in our unsupervised pipeline is:

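$$E_{2}=\big\|\mathbf{C}_{12}^{\top}\mathbf{C}_{12}-\mathbf{I}\big\|^{2}+\big\|\mathbf{C}_{21}^{\top}\mathbf{C}_{21}-\mathbf{I}\big\|^{2}$$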
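The orthogonality penalty can be sketched in the same way, again assuming squared Frobenius norms and a hypothetical function name.

```python
import numpy as np

def orthogonality_penalty(C12, C21):
    """E2: each functional map should have orthonormal columns."""
    I1 = np.eye(C12.shape[1])   # C12^T C12 is (k1, k1)
    I2 = np.eye(C21.shape[1])   # C21^T C21 is (k2, k2)
    return np.sum((C12.T @ C12 - I1) ** 2) + np.sum((C21.T @ C21 - I2) ** 2)
```

An orthogonal matrix yields zero, while a uniformly scaled map such as $2\mathbf{I}$ is penalized, reflecting the fact that it does not preserve area.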

Laplacian commutativity

Similarly, it is well-known that a pointwise map is an intrinsic isometry if and only if the associated functional map commutes with the Laplace-Beltrami operator [39, 30]. This has motivated using the lack of commutativity as a regularizer for functional map computations, as mentioned in Eq. (2). In our work, we use it to introduce the following penalty:

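$$E_{3}=\big\|\mathbf{C}_{12}\mathbf{\Lambda}_{1}-\mathbf{\Lambda}_{2}\mathbf{C}_{12}\big\|^{2}+\big\|\mathbf{C}_{21}\mathbf{\Lambda}_{2}-\mathbf{\Lambda}_{1}\mathbf{C}_{21}\big\|^{2}$$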

where $\mathbf{\Lambda}_{1}$ and $\mathbf{\Lambda}_{2}$ are diagonal matrices of the Laplace-Beltrami eigenvalues on the two shapes.
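Since both eigenvalue matrices are diagonal, the commutator has the entrywise form $(\mathbf{C}_{12}\mathbf{\Lambda}_{1}-\mathbf{\Lambda}_{2}\mathbf{C}_{12})_{ij}=(\mathbf{C}_{12})_{ij}(\lambda^{1}_{j}-\lambda^{2}_{i})$. The sketch below uses this to evaluate the penalty without any matrix products; it assumes eigenvalues are passed as vectors, and the function name is hypothetical.

```python
import numpy as np

def laplacian_commutativity_penalty(C12, C21, evals1, evals2):
    """E3 with diagonal eigenvalue matrices, evaluated entrywise."""
    # (C12 L1 - L2 C12)[i, j] = C12[i, j] * (evals1[j] - evals2[i])
    gap12 = evals1[None, :] - evals2[:, None]   # shape (k2, k1)
    # (C21 L2 - L1 C21)[i, j] = C21[i, j] * (evals2[j] - evals1[i])
    gap21 = evals2[None, :] - evals1[:, None]   # shape (k1, k2)
    return np.sum((C12 * gap12) ** 2) + np.sum((C21 * gap21) ** 2)
```

Entries of the map that couple eigenfunctions with very different eigenvalues are penalized most, which pushes the maps towards a near-diagonal structure for near-isometric pairs.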

Descriptor preservation via commutativity

The previous three penalties capture desirable properties of pointwise correspondences when expressed as functional maps. Our last penalty promotes functional maps that arise from point-to-point maps, rather than more general soft correspondences. To achieve this, we follow the approach proposed in [29] based on preservation of pointwise products of functions. Namely, it is known that a non-trivial linear transformation $\mathcal{T}$ across function spaces corresponds to a point-to-point map if and only if $\mathcal{T}(f\odot h)=\mathcal{T}(f)\odot\mathcal{T}(h)$ for any pair of functions $f,h$ . Here $\odot$ denotes the pointwise product between functions [41], i.e. $(f\odot h)(x)=f(x)h(x)$ . When $f$ is a descriptor function on the source and $g$ is the corresponding descriptor on the target, the authors of [29] demonstrate that this condition can be rewritten in the reduced basis as follows: $\mathbf{C}\mathbf{M}_{f}=\mathbf{M}_{g}\mathbf{C}$ , where $\mathbf{M}_{f}=\Phi^{+}\text{Diag}(f)\Phi,$ and $\mathbf{M}_{g}=\Psi^{+}\text{Diag}(g)\Psi$ . This leads to the following penalty, in our setting:

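$$E_{4}=\sum_{(f_{i},g_{i})\in\text{Descriptors}}\big\|\mathbf{C}_{12}\mathbf{M}_{f_{i}}-\mathbf{M}_{g_{i}}\mathbf{C}_{12}\big\|^{2}+\big\|\mathbf{C}_{21}\mathbf{M}_{g_{i}}-\mathbf{M}_{f_{i}}\mathbf{C}_{21}\big\|^{2},\quad\mathbf{M}_{f_{i}}=\Phi^{+}\text{Diag}(f_{i})\Phi,\quad\mathbf{M}_{g_{i}}=\Psi^{+}\text{Diag}(g_{i})\Psi.$$

In this expression, $f_{i}$ and $g_{i}$ are the optimized descriptors on the source and target shapes, obtained by the neural network and expressed in the full (hat) basis, whereas $\Phi,\Psi$ are the fixed basis functions on the two shapes, and ${+}$ denotes the Moore-Penrose pseudoinverse.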
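The following NumPy sketch illustrates one way to evaluate this penalty. It assumes the bases are stored as `Phi` $(n_{1}\times k_{1})$ and `Psi` $(n_{2}\times k_{2})$, the descriptors as columns of `F` $(n_{1}\times d)$ and `G` $(n_{2}\times d)$, and it uses `np.linalg.pinv` for the pseudoinverse; both function names are hypothetical. The multiplication operators are built with a tensor contraction so that no $n\times n$ diagonal matrix is ever formed.

```python
import numpy as np

def multiplication_operators(Phi, F):
    """Stack of M_f = Phi^+ Diag(f) Phi for every column f of F, shape (d, k, k)."""
    Phi_pinv = np.linalg.pinv(Phi)                        # (k, n)
    # Contract over vertices n without materialising Diag(f).
    return np.einsum("kn,nd,nj->dkj", Phi_pinv, F, Phi)

def descriptor_commutativity_penalty(C12, C21, Phi, Psi, F, G):
    """E4 summed over all descriptor pairs (columns of F and G)."""
    Mf = multiplication_operators(Phi, F)                 # (d, k1, k1)
    Mg = multiplication_operators(Psi, G)                 # (d, k2, k2)
    term12 = C12 @ Mf - Mg @ C12                          # broadcasts to (d, k2, k1)
    term21 = C21 @ Mg - Mf @ C21                          # broadcasts to (d, k1, k2)
    return np.sum(term12 ** 2) + np.sum(term21 ** 2)
```

When the two shapes, bases and descriptors coincide and both maps are the identity, every commutator vanishes and the penalty is zero.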

4.3 Optimization

As mentioned in Section 4.1, we incorporate these four penalties into the energy in Eq. (5). Importantly, the only unknowns in this optimization are the parameters of the neural network applied to the descriptor functions. The functional maps $\mathbf{C}_{12}$ and $\mathbf{C}_{21}$ are fully determined by the optimized descriptors via the solution of the optimization problems in Eq. (6) and Eq. (7). Note that although stated as optimization problems, both Eq. (6) and Eq. (7) reduce to solving a linear system of equations. This is easily differentiable using the well-known closed-form expression for derivatives of matrix inverses [32]. Moreover, the functionality of differentiating a linear system of equations is implemented in TensorFlow [1] and we use it directly, in the same way as it was used in the original FMNet work. Finally, all of the penalties $E_{1},E_{2},E_{3},E_{4}$ are differentiable with respect to the functional maps $\mathbf{C}_{12},\mathbf{C}_{21}$ . This means that the gradient of the total energy can be back-propagated to the neural network $T$ in Eq. (5), allowing us to optimize for the descriptors while penalizing the structural properties of the functional maps.
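To make the differentiability argument concrete, recall that for $\mathbf{X}=\mathbf{A}^{-1}\mathbf{B}$ a perturbation $d\mathbf{A}$ changes the solution by $d\mathbf{X}=-\mathbf{A}^{-1}\,d\mathbf{A}\,\mathbf{X}$. The NumPy sketch below evaluates this directional derivative; it is an illustration of the closed-form rule only (the function name is hypothetical), whereas the actual pipeline relies on TensorFlow's automatic differentiation.

```python
import numpy as np

def solve_with_directional_derivative(A, B, dA):
    """Return X = A^{-1} B and its first-order change dX when A moves along dA."""
    X = np.linalg.solve(A, B)
    # d(A^{-1}) = -A^{-1} dA A^{-1}, hence dX = -A^{-1} (dA X).
    dX = -np.linalg.solve(A, dA @ X)
    return X, dX
```

Comparing `dX` with a finite difference `(solve(A + eps * dA, B) - X) / eps` for a small `eps` is a simple way to check the formula numerically.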

5 Implementation & Parameters

Implementation details

We implemented our method in TensorFlow [1] by adapting the open-source implementation of FMNet [23]; our code is available at https://github.com/LIX-shape-analysis/SURFMNet. Thus, the neural network $T$ used for transforming descriptors in our approach, in Eq. (5), is identical to that used in FMNet, as mentioned in Eq. (3). Namely, this network is based on a residual architecture, consisting of 7 fully connected residual layers with exponential linear units, without dimensionality reduction. Please see Section 5 in [23] for more details.

Following the approach of FMNet [23], we also sub-sample a random set of 1500 points at each training step, for efficiency. However, unlike their method, sub-sampling is done independently on each shape, without enforcing consistency. Note that our network is fully connected on the dimensions of the descriptors, not across vertices themselves. For example, the first layer has $352\times 352$ weights (not $1500\times 352$ weights), where $352$ and 1500 are the dimension of the SHOT descriptors and the number of sampled vertices, respectively. Indeed, in exactly the same way as in FMNet, our network is applied on the descriptors of each point independently, using the same (learned) weights, and different points on the shape only communicate through the functional map estimation layer, and not in the MLP layers. This ensures invariance to permutation of shape vertices. We also randomly sub-sample 20% of the optimized descriptors for our penalty $E_{4}$ at each training step to avoid manipulating a large set of operators. We observed that this sub-sampling not only helps to gain speed but also robustness during optimization. Importantly, we do not form large diagonal matrices explicitly, but rather define the multiplicative operators $\mathbf{M}$ in objective $E_{4}$ directly via pointwise products and summation using contraction between tensors.
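The per-point, shared-weight structure can be illustrated with a small NumPy sketch. The exact form of each residual layer is not specified here, so the block `X + elu(X W + b)` below is an assumption made for illustration, as are the function names; only the fact that the same weights act on each vertex's descriptor independently is taken from the text.

```python
import numpy as np

def elu(x):
    """Exponential linear unit."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def residual_layer(X, W, b):
    """One hypothetical residual layer; X is (n_points, dim), W is (dim, dim)."""
    return X + elu(X @ W + b)

def transform_descriptors(X, params):
    """Apply the same stack of layers to the descriptor of every point."""
    for W, b in params:
        X = residual_layer(X, W, b)
    return X
```

Because the weights have shape `(dim, dim)` and never mix rows of `X`, reordering the vertices simply reorders the output rows, which is the permutation property mentioned above.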

Finally, we convert functional maps to pointwise ones with nearest neighbor search in the spectral domain, following the original approach [30].
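A brute-force version of this conversion is sketched below in NumPy. Direction conventions for functional maps vary between papers, so the sketch states its own assumption: `C12` carries coefficients from shape 1 to shape 2 and is taken to be the pull-back of a pointwise map from shape 2 to shape 1, so that row $y$ of $\Psi\,\mathbf{C}_{12}$ should match row $T(y)$ of $\Phi$. The function name is hypothetical, and a k-d tree would replace the dense distance matrix on large meshes.

```python
import numpy as np

def pointwise_map_from_functional_map(C12, Phi1, Phi2):
    """For each vertex of shape 2, return the index of its match on shape 1.

    C12: (k2, k1) functional map, Phi1: (n1, k1), Phi2: (n2, k2).
    """
    query = Phi2 @ C12                                   # (n2, k1) spectral embedding
    # Squared Euclidean distances between every query row and every row of Phi1.
    d2 = (
        np.sum(query ** 2, axis=1)[:, None]
        - 2.0 * query @ Phi1.T
        + np.sum(Phi1 ** 2, axis=1)[None, :]
    )
    return np.argmin(d2, axis=1)
```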

Parameters

Our method takes two types of inputs: the input descriptors, and the scalar weights $w_{i}$ in Eq. (5). In all experiments below, we used the same SHOT [45] descriptors as in FMNet [23] with the same parameters, which leads to a 352-dimensional vector per point, or equivalently, 352 descriptor functions on each shape. For the scalar weights, $w_{i}$ , we used the same four fixed values for all experiments below (namely, $w_{1}=10^{3}$ , $w_{2}=10^{3}$ , $w_{3}=1$ and $w_{4}=10^{5}$ ), which were obtained by examining the relative penalty values obtained throughout the optimization on a small set of shapes, and setting the weights inversely proportionally to those values. We train our network with a batch size of 10 for 10 000 iterations using a learning rate of 0.001 and ADAM optimizer [13].
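For reference, the weighted sum in Eq. (5) with the weights quoted above can be written as a short Python sketch; the dictionary keys and the function name are hypothetical, and the penalty values are assumed to have been computed elsewhere.

```python
# Scalar weights reported in the text for the four penalties.
WEIGHTS = {"E1": 1e3, "E2": 1e3, "E3": 1.0, "E4": 1e5}

def total_energy(penalties, weights=WEIGHTS):
    """Weighted sum of penalty values; `penalties` maps names such as "E1" to floats."""
    return sum(weights[name] * value for name, value in penalties.items())
```

Passing only a subset of the penalties, for example `{"E3": value}`, gives the kind of reduced objective considered in the ablation study.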

6 Results

Datasets

We evaluate our method on the following datasets: the original FAUST dataset [6] containing 100 human shapes in 1-1 correspondence and the remeshed versions of the SCAPE [3] and FAUST [6] datasets, made publicly available recently by Ren et al. [35]. These datasets were obtained by independently re-meshing each shape to approximately 5000 vertices using the LRVD re-meshing method [52], while keeping track of the ground truth maps within each collection. This results in meshes that are no longer in 1-1 correspondence, and indeed can have different numbers of vertices. The re-meshed datasets therefore offer significantly more variability in terms of shape structures, including e.g. point sampling density, making them more challenging for existing algorithms. Let us note also that the SCAPE dataset is slightly more challenging since the shapes are less regular (e.g., there are often reconstruction artefacts on hands and feet) and have fewer features than those in FAUST.

We stress that although we also evaluated on the original FAUST dataset, we view the remeshed datasets as more realistic, providing a more faithful representation of the accuracy and generalization power of different techniques.

Ablation study

We first evaluated the relative importance of the different penalties in our method on the FAUST shape dataset [6]. We evaluated the average correspondence geodesic error with respect to the ground truth maps.
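For concreteness, the error measure can be sketched as below, assuming a precomputed matrix of pairwise geodesic distances on the target shape. The optional division by the square root of the surface area is a common normalization and is an assumption here, not a detail stated in the text; the function name and the toy "shape" are hypothetical.

```python
import numpy as np

def geodesic_errors(geo_dist, pred, gt, area=None):
    """Per-point geodesic error of a predicted map.

    geo_dist: (n_tgt, n_tgt) pairwise geodesic distances on the target shape.
    pred, gt: predicted and ground-truth target indices for each source point.
    area: optional total surface area; if given, errors are divided by sqrt(area).
    """
    errors = geo_dist[np.asarray(pred), np.asarray(gt)]
    if area is not None:
        errors = errors / np.sqrt(area)
    return errors

# Toy target "shape": four points on a line with unit spacing.
idx = np.arange(4)
geo_dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
pred = np.array([0, 2, 2, 1])
gt = np.array([0, 1, 2, 3])

errs = geodesic_errors(geo_dist, pred, gt)
mean_err = errs.mean()
```

Averaging these per-point errors over all test pairs gives a single number per method of the kind reported in an ablation table.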

Table 1 summarizes the quality of the computed correspondences between shapes in the test set, using different combinations of penalties. We observe that the combination of all four penalties significantly outperforms any other subset (an average geodesic error of 0.020, compared to 0.073 for the best single penalty). Moreover, among the penalties used individually, the Laplacian commutativity gives the best result. For more combinations of penalty terms, we refer to the more detailed ablation study in the supplementary material.

Baselines

We compared our method to several techniques, both supervised and fully automatic. For conciseness, we refer to SURFMNet as Ours in the following text. For a fair comparison with FMNet, we evaluate our method in two settings: Ours-sub and Ours-all. For Ours-sub, we split each dataset into training and test sets containing 80 and 20 shapes respectively, as done in [23]. For Ours-all, we optimize over the entire dataset and apply the optimized network to the same test set as before. We stress that unlike FMNet, our method does not use any ground truth in either setting. We use the notation Ours-sub only to emphasize the split of the dataset into training and test sets: the “training set” was used only for descriptor optimization with the functional map penalties introduced above, without any ground truth.

Since the original FMNet work [23] already showed very strong improvement compared to existing supervised learning methods, we primarily compare to this approach. For reference, we also compare to the Geodesic Convolutional Neural Networks (GCNN) method of [26] on the remeshed datasets, which were not considered in [23]. GCNN is a representative supervised method based on local shape parameterization and, like FMNet, assumes as input ground truth maps between a subset of the training shapes. For supervised methods, we always split the datasets into 80 (resp. 60) shapes for training and 20 (resp. 10) for testing in the FAUST and SCAPE datasets respectively.

Among fully automatic methods, we use the Product Manifold Filter method with the Gaussian kernel [48] (PMF Gauss) and its variant with the Heat kernel [47] (PMF Heat). We also compare to the recently proposed BCICP [35], which achieved state-of-the-art results among axiomatic methods. With a slight abuse of notation, we denote these non-learning methods as Unsupervised in Figure 4 since none of these methods use ground truth. Finally, we also evaluated the basic functional map approach, based on directly optimizing the functional maps as outlined in Section 3.1, but using all four of our energies for regularization. This method, which we call “Fmap Basic” can be viewed as a combination of the approaches of [14] and [28], as it incorporates functional map coupling (via energy $E_{1}$ ) and descriptor commutativity (via $E_{4}$ ). Unlike our technique, however, it operates on fixed descriptor functions, and uses descriptor preservation constraints with the original and noisy descriptors.

For fairness of comparison, we used SHOT descriptors [45] as input to all methods, except BCICP [35], which uses carefully curated WKS [4] descriptors. Furthermore, we consider the results of FMNet [23] before and after applying the PMF-based post-processing as suggested in the original article. We also report results with ICP post-processing introduced in [30]. Besides the accuracy plots shown in Figures 3 and 4, we also include statistics such as maximum and 95th percentile in supplementary material.

6.1 Evaluation and Results

Figure 3 summarizes the accuracy obtained by supervised methods on the three datasets, whereas Figure 4 compares with unsupervised methods, using the evaluation protocol introduced in [19]. Note that in all cases, our network SURFMNet (Ours-all), when optimized on all shapes, achieves the best results, even compared to the recent state-of-the-art method in [35]. Furthermore, our method is comparable even to the supervised learning techniques GCNN [26] and FMNet [23], despite being purely unsupervised.
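Accuracy plots of this kind show, for each error threshold, the fraction of correspondences whose geodesic error falls below it. A minimal sketch of such a curve, together with summary statistics such as the 95th percentile and maximum, is given below. The per-point errors are hypothetical values, not results from the paper, and the function names are illustrative.

```python
import numpy as np

def cumulative_curve(errors, thresholds):
    """Fraction of correspondences whose error is at most each threshold."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    return (errors[None, :] <= thresholds[:, None]).mean(axis=1)

def summary_stats(errors):
    """Mean, 95th percentile and maximum of the per-point errors."""
    errors = np.asarray(errors, dtype=float)
    return {
        "mean": float(errors.mean()),
        "p95": float(np.percentile(errors, 95)),
        "max": float(errors.max()),
    }

# Hypothetical per-point geodesic errors (illustrative only).
errors = np.array([0.0, 0.01, 0.02, 0.05, 0.20])
thresholds = np.array([0.0, 0.02, 0.10, 0.25])

curve = cumulative_curve(errors, thresholds)
stats = summary_stats(errors)
```

A curve that rises faster and saturates earlier indicates a larger fraction of correspondences with small error, which is how such plots are typically read.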

Note that the remeshed datasets are significantly harder for both supervised and unsupervised methods, since the shapes are no longer identically meshed and in 1-1 correspondence. We also observed this difficulty when training the supervised FMNet and GCNN techniques, which converged very slowly. On both of these datasets, our approach achieves the lowest average error, as reported in Figures 3 and 4. Note that on the remeshed FAUST dataset, as shown in Figure 3, only GCNN [26] produces a similarly large fraction of correspondences with a small error. However, this method is supervised. On the remeshed SCAPE dataset, our method leads to the best results across all measures, despite being purely unsupervised.

Postprocessing Results

As shown in Figures 3 and 4, our method can often obtain high quality results even without any post-processing. Nevertheless, in challenging cases such as the SCAPE remeshed dataset, when trained on a subset of shapes, it can also benefit from an efficient ICP-based refinement. This refinement does not require computing geodesic distances and does not require the shapes to have the same number of points, thus maintaining the flexibility and efficiency of our pipeline.
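A spectral ICP refinement of this kind can be sketched as follows. This is a simplified illustration rather than the exact procedure of [30]: it assumes the convention `phi_tgt @ C ≈ P @ phi_src`, alternates a nearest neighbor search in the spectral embedding with a least-squares update of `C`, and projects `C` onto orthogonal matrices, which presumes a near-isometric map. All names and the toy data are hypothetical. Only the eigenbases enter the computation, so no geodesic distances are needed and the two shapes may have different numbers of points.

```python
import numpy as np

def nearest_rows(query, reference):
    """Index of the nearest row of `reference` for every row of `query`."""
    d2 = (
        np.sum(query**2, axis=1)[:, None]
        - 2.0 * query @ reference.T
        + np.sum(reference**2, axis=1)[None, :]
    )
    return np.argmin(d2, axis=1)

def spectral_icp(C, phi_src, phi_tgt, n_iter=5):
    """Refine a functional map C and return it with the final point map."""
    for _ in range(n_iter):
        # 1) current pointwise map: target point -> nearest source point
        match = nearest_rows(phi_tgt @ C, phi_src)
        # 2) least-squares functional map consistent with that point map
        C, *_ = np.linalg.lstsq(phi_tgt, phi_src[match], rcond=None)
        # 3) projection onto orthogonal matrices (near-isometry assumption)
        U, _, Vt = np.linalg.svd(C)
        C = U @ Vt
    return C, nearest_rows(phi_tgt @ C, phi_src)

# Toy check: start from a slightly perturbed ground-truth map.
rng = np.random.default_rng(1)
n, k = 50, 6
phi_src = rng.standard_normal((n, k))
perm = rng.permutation(n)
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
phi_tgt = phi_src[perm] @ Q.T
C0 = Q + 1e-3 * rng.standard_normal((k, k))

C_ref, match = spectral_icp(C0, phi_src, phi_tgt)
```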

Correlation with actual Geodesic loss

We further investigated whether there is a correlation between the value of our loss and the quality of correspondence. Specifically, we asked whether minimizing our loss function, which mainly consists of regularization terms on the estimated functional maps, corresponds to minimizing the geodesic loss with respect to the unknown ground truth map. We found a strong correlation between the two and include a plot in the supplementary material.
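One simple way to quantify such a relationship is the Pearson correlation coefficient between the loss values and the average geodesic errors recorded at the same training steps. The sketch below uses hypothetical numbers that are not taken from the paper, and the text does not state which correlation measure was used, so the choice of Pearson correlation is an assumption.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical values recorded at five checkpoints (illustrative only).
loss_values = [5.0, 3.0, 2.0, 1.5, 1.0]
geo_errors = [0.20, 0.12, 0.08, 0.05, 0.03]

r = pearson_correlation(loss_values, geo_errors)
```

A coefficient close to 1 would indicate that decreasing the unsupervised loss goes together with decreasing the error against the ground truth, even though the ground truth is never used during optimization.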

Qualitative and Runtime Comparison

Figures 5 and 6 show example shape pairs and the maps obtained between them using different methods, visualized via texture transfer. Note the continuity and quality of the maps obtained using our method, compared to other techniques (more results in the supplementary material). One further advantage of our method is its efficiency, since we do not rely on the computation of geodesic matrices and operate entirely in the spectral domain. Table 2 compares the run-time of the best performing methods on an Intel Xeon 2.10GHz machine with an NVIDIA Titan X GPU. Note that our method is over an order of magnitude faster than FMNet and significantly faster than BCICP, currently the best unsupervised method.

7 Conclusion & Future Work

We presented an unsupervised method for computing correspondences between shapes. Key to our approach is a bi-level optimization formulation, aimed at optimizing descriptor functions while promoting the structural properties of the entire map obtained from them via the functional maps framework. Remarkably, our approach achieves performance similar, and in some cases superior, to that of supervised correspondence techniques.

In the future, we plan to incorporate other penalties on functional maps, e.g., those arising from recently proposed kernelization approaches [49] or those promoting orientation-preserving maps [35], and also to incorporate cycle consistency constraints [17]. Finally, it would be interesting to extend our method to partial and non-isometric shapes and to matching other modalities, such as images or point clouds, since it opens the door to linking the properties of local descriptors to global map consistency.

Acknowledgements Parts of this work were supported by the ERC Starting Grant StG-2017-758800 (EXPROTEA), KAUST OSR Award No. CRG-2017-3426, and a gift from Nvidia. We are grateful to Jing Ren, Or Litany, Emanuele Rodolà and Adrien Poulenard for their help in performing quantitative comparisons and producing qualitative results.

8 Supplement

A Correlation with actual geodesic loss

To support the claim made in the subsection “Evaluation and Results”, we include a plot here to visualize the correlation between our loss and the actual geodesic loss. As is evident in Figure 7, there is a strong correlation between our loss value and the quality of correspondence as measured by the average geodesic error.

B Detailed Tabular Quantitative Comparison

Besides the average geodesic error reported for quantitative comparison in Figures 3 and 4, we provide detailed statistics in Table 3. Note that Table 3 also includes “Fmap Ours Opt”, which is equivalent to “Fmap Basic” but uses the learned descriptors instead of the original ones. Its competitive performance across all datasets demonstrates quantitatively the utility of learning descriptors. Figures 13 and 14 illustrate this further. For completeness, in Table 4, we also provide a detailed ablation study with different combinations of penalties.

C Sensitivity to number of basis functions

Figure 8 shows the sensitivity of our network SURFMNet on the SCAPE remeshed dataset as the number of eigenfunctions is varied from 20 to 150. We train the network each time for 10 000 mini-batch steps. As is evident, we obtain the best result with 120 eigenfunctions. However, when training on an individual dataset and testing on a different one, we see over-fitting when using a large eigenbasis. We attribute this phenomenon to the initialization of our descriptors with SHOT, which is a very local descriptor and is not robust to very strong mesh variability. Over-fitting is minimal, however, when we train jointly on a relatively larger subset of SCAPE and FAUST and test on a different subset of shapes from both datasets, with a smaller eigenbasis.

D More Qualitative Comparison

In Figures 9 and 12, we provide more qualitative comparisons of SURFMNet on the FAUST remeshed dataset, whereas Figures 10 and 11 provide a comparison on the SCAPE remeshed dataset. In all cases, our method produces the highest quality maps.

Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Yonathan Aflalo, Anastasia Dubrovina, and Ron Kimmel. Spectral generalized multi-dimensional scaling. International Journal of Computer Vision, 118(3):380–392, 2016.
[3] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: Shape Completion and Animation of People. In ACM Transactions on Graphics (TOG), volume 24, pages 408–416. ACM, 2005.
[4] Mathieu Aubry, Ulrich Schlickewei, and Daniel Cremers. The wave kernel signature: A quantum mechanical approach to shape analysis. 31(4), Nov. 2011.
[5] Silvia Biasotti, Andrea Cerri, A Bronstein, and M Bronstein. Recent trends, applications, and perspectives in 3d shape similarity assessment. In Computer Graphics Forum, volume 35, pages 87–119, 2016.
[6] Federica Bogo, Javier Romero, Matthew Loper, and Michael J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Piscataway, NJ, USA, June 2014. IEEE.
[7] Davide Boscaini, Jonathan Masci, Simone Melzi, Michael M Bronstein, Umberto Castellani, and Pierre Vandergheynst. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. In Computer Graphics Forum, volume 34, pages 13–23. Wiley Online Library, 2015.
[8] Davide Boscaini, Jonathan Masci, Emanuele Rodola, and Michael M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Proc. NIPS, pages 3189–3197, 2016.