Assessing local deformation and computing scalar curvature with nonlinear conformal regularization of decoders

Benjamin Cou\'eraud; Vikram Sunkara; Christof Sch\"utte

arXiv:2508.20413·cs.LG·August 29, 2025

Assessing local deformation and computing scalar curvature with nonlinear conformal regularization of decoders

Benjamin Cou\'eraud, Vikram Sunkara, Christof Sch\"utte

PDF

Open Access

TL;DR

This paper introduces a nonlinear conformal regularization method for autoencoders that quantifies local deformation and enables scalar curvature computation of the learned manifold, enhancing geometric understanding of data representations.

Contribution

It proposes a novel geometric regularization for autoencoders that allows local deformation analysis and scalar curvature computation of the learned data manifold.

Findings

01

Regularization enables local deformation quantification.

02

Method allows scalar curvature computation of the learned manifold.

03

Experiments demonstrate effectiveness on Swiss roll and CelebA datasets.

Abstract

One aim of dimensionality reduction is to discover the main factors that explain the data, and as such is paramount to many applications. When working with high dimensional data, autoencoders offer a simple yet effective approach to learn low-dimensional representations. The two components of a general autoencoder consist first of an encoder that maps the observed data onto a latent space; and second a decoder that maps the latent space back to the original observation space, which allows to learn a low-dimensional manifold representation of the original data. In this article, we introduce a new type of geometric regularization for decoding maps approximated by deep neural networks, namely nonlinear conformal regularization. This regularization procedure permits local variations of the decoder map and comes with a new scalar field called conformal factor which acts as a quantitative…

Tables2

Table 1. Table 1 : Results for the conditions numbers κ 𝗃𝖺𝖼 \kappa_{\mathsf{jac}} of the decoder’s Jacobian, and κ 𝗉𝖻𝗆 \kappa_{\mathsf{pbm}} of the pullback metric, respectively (mean ± \pm standard deviation), across the whole latent space. The vanilla autoencoder together ℒ 𝗅𝗈𝖼𝗂𝗌𝗈 \mathcal{L}_{\mathsf{lociso}} or ℒ ~ 𝖼𝗈𝗇𝖿 \widetilde{\mathcal{L}}_{\mathsf{conf}} consistently outperform the vanilla autoencoder alone or with the regularizer ℒ 𝗀𝗅𝗈𝖻𝗂𝗌𝗈 \mathcal{L}_{\mathsf{globiso}} , and we observed that the difference is even more important upon increasing the number of samples in Hutchinson’s estimator.

	$ℒ_{𝗋𝖾𝖼𝗈𝗇}$	$ℒ_{𝗀𝗅𝗈𝖻𝗂𝗌𝗈}$
$κ_{𝗃𝖺𝖼}$	$2.33 \pm 1.26$	$4.22 \pm 1.05$
$κ_{𝗉𝖻𝗆}$	$7.03 \pm 13.32$	$18.91 \pm 12.54$
	$ℒ_{𝗅𝗈𝖼𝗂𝗌𝗈}$	${\tilde{ℒ}}_{𝖼𝗈𝗇𝖿}$
$κ_{𝗃𝖺𝖼}$	$1.99 \pm 0.65$	$2.05 \pm 0.90$
$κ_{𝗉𝖻𝗆}$	$4.39 \pm 3.23$	$5.02 \pm 5.29$

Table 2. Table 2 : Results for κ 𝗃𝖺𝖼 \kappa_{\mathsf{jac}} and κ 𝗉𝖻𝗆 \kappa_{\mathsf{pbm}} on the CelebA dataset: the vanilla autoencoder together ℒ 𝗅𝗈𝖼𝗂𝗌𝗈 \mathcal{L}_{\mathsf{lociso}} or ℒ ~ 𝖼𝗈𝗇𝖿 \widetilde{\mathcal{L}}_{\mathsf{conf}} consistently outperform the vanilla autoencoder alone or with the regularizer ℒ 𝗀𝗅𝗈𝖻𝗂𝗌𝗈 \mathcal{L}_{\mathsf{globiso}} .

	$ℒ_{𝗋𝖾𝖼𝗈𝗇}$	$ℒ_{𝗀𝗅𝗈𝖻𝗂𝗌𝗈}$
$κ_{𝗃𝖺𝖼}$	$1.91 \pm 0.58$	$33.84 \pm 10.10$
$κ_{𝗉𝖻𝗆}$	$3.97 \pm 2.93$	$1246.92 \pm 651.46$
	$ℒ_{𝗅𝗈𝖼𝗂𝗌𝗈}$	${\tilde{ℒ}}_{𝖼𝗈𝗇𝖿}$
$κ_{𝗃𝖺𝖼}$	$2.06 \pm 0.59$	$1.74 \pm 0.44$
$κ_{𝗉𝖻𝗆}$	$4.59 \pm 3.37$	$3.23 \pm 1.65$

Equations43

x \in X \subset R^{n} E z \in Z \subset R^{m} D \overset{x}{^} \in X \subset R^{n} .

x \in X \subset R^{n} E z \in Z \subset R^{m} D \overset{x}{^} \in X \subset R^{n} .

\mathcal{L}_{\mathsf{recon}}(\operatorname{\mathsf{E}},\operatorname{\mathsf{D}})=\frac{1}{N}\sum_{i=1}^{N}\Big{\|}x_{i}-\operatorname{\mathsf{D}}\big{(}\operatorname{\mathsf{E}}(x_{i};\theta_{\mathsf{enc}});\theta_{\mathsf{dec}}\big{)}\Big{\|}^{2}_{2},

\mathcal{L}_{\mathsf{recon}}(\operatorname{\mathsf{E}},\operatorname{\mathsf{D}})=\frac{1}{N}\sum_{i=1}^{N}\Big{\|}x_{i}-\operatorname{\mathsf{D}}\big{(}\operatorname{\mathsf{E}}(x_{i};\theta_{\mathsf{enc}});\theta_{\mathsf{dec}}\big{)}\Big{\|}^{2}_{2},

\displaystyle\begin{split}\mathcal{L}_{\mathsf{globiso}}&(\operatorname{\mathsf{D}})\\ &=\sum_{z_{1},z_{2}\in Z}\Big{|}d_{Z}(z_{1},z_{2})-d_{X}(\operatorname{\mathsf{D}}(z_{1}),\operatorname{\mathsf{D}}(z_{2}))\Big{|},\end{split}

\displaystyle\begin{split}\mathcal{L}_{\mathsf{globiso}}&(\operatorname{\mathsf{D}})\\ &=\sum_{z_{1},z_{2}\in Z}\Big{|}d_{Z}(z_{1},z_{2})-d_{X}(\operatorname{\mathsf{D}}(z_{1}),\operatorname{\mathsf{D}}(z_{2}))\Big{|},\end{split}

\mathcal{L}_{\mathsf{lociso}}(\operatorname{\mathsf{D}})=\sum_{z\in Z}\Big{\|}J_{\operatorname{\mathsf{D}}}(z)^{\mathsf{T}}J_{\operatorname{\mathsf{D}}}(z)-I_{m}\Big{\|}_{F},

\mathcal{L}_{\mathsf{lociso}}(\operatorname{\mathsf{D}})=\sum_{z\in Z}\Big{\|}J_{\operatorname{\mathsf{D}}}(z)^{\mathsf{T}}J_{\operatorname{\mathsf{D}}}(z)-I_{m}\Big{\|}_{F},

h_{f(x)}\big{(}\mathrm{d}f_{x}(u),\mathrm{d}f_{x}(v)\big{)}=c(x)g_{x}(u,v),

h_{f(x)}\big{(}\mathrm{d}f_{x}(u),\mathrm{d}f_{x}(v)\big{)}=c(x)g_{x}(u,v),

J_{f}(x)^{\mathsf{T}}H\big{(}f(x)\big{)}J_{f}(x)=c(x)G(x).

J_{f}(x)^{\mathsf{T}}H\big{(}f(x)\big{)}J_{f}(x)=c(x)G(x).

L_{lociso} (D) = \frac{1}{m} i = 1 \sum m \int_{Z} Φ (λ_{i} (z)) d ν,

L_{lociso} (D) = \frac{1}{m} i = 1 \sum m \int_{Z} Φ (λ_{i} (z)) d ν,

L_{conf} (D) = \frac{1}{m} i = 1 \sum m \int_{Z} Φ (\frac{λ _{i} ( z )}{σ ( λ _{1} ( z ) , \dots , λ _{m} ( z ))}) d ν,

L_{conf} (D) = \frac{1}{m} i = 1 \sum m \int_{Z} Φ (\frac{λ _{i} ( z )}{σ ( λ _{1} ( z ) , \dots , λ _{m} ( z ))}) d ν,

L_{conf} (D) = \frac{1}{m} i = 1 \sum m \int_{Z} Φ (\frac{1}{I} λ_{i} (z)) d ν,

L_{conf} (D) = \frac{1}{m} i = 1 \sum m \int_{Z} Φ (\frac{1}{I} λ_{i} (z)) d ν,

Φ (\frac{λ _{i} ( z )}{σ ( λ _{1} ( z ) , \dots , λ _{m} ( z ))}) = 0

Φ (\frac{λ _{i} ( z )}{σ ( λ _{1} ( z ) , \dots , λ _{m} ( z ))}) = 0

L_{conf} (D) = \frac{m}{2} E_{z \sim ν} [\frac{Tr R ^{2} ( z )}{( Tr R ( z ) ) ^{2}}] - \frac{1}{2},

L_{conf} (D) = \frac{m}{2} E_{z \sim ν} [\frac{Tr R ^{2} ( z )}{( Tr R ( z ) ) ^{2}}] - \frac{1}{2},

c (z) = \frac{1}{m} Tr R (z) .

\begin{split}\widetilde{\mathcal{L}}_{\mathrm{conf}}(\operatorname{\mathsf{D}})=\frac{1}{2m}\operatorname{\mathbb{E}}_{z\sim\nu}\Bigg{[}\sum_{i=1}^{m}&\frac{m^{2}\lambda_{i}(z)^{2}}{\big{(}\sum_{j=1}^{m}\lambda_{j}(z)\big{)}^{2}}\\ &-\sum_{i=1}^{m}\frac{2m\lambda_{i}(z)}{\sum_{j=1}^{m}\lambda_{j}(z)}+m\Bigg{]},\end{split}

\begin{split}\widetilde{\mathcal{L}}_{\mathrm{conf}}(\operatorname{\mathsf{D}})=\frac{1}{2m}\operatorname{\mathbb{E}}_{z\sim\nu}\Bigg{[}\sum_{i=1}^{m}&\frac{m^{2}\lambda_{i}(z)^{2}}{\big{(}\sum_{j=1}^{m}\lambda_{j}(z)\big{)}^{2}}\\ &-\sum_{i=1}^{m}\frac{2m\lambda_{i}(z)}{\sum_{j=1}^{m}\lambda_{j}(z)}+m\Bigg{]},\end{split}

d_{\mathsf{geo}}^{X}\big{(}\operatorname{\mathsf{D}}(z_{1}),\operatorname{\mathsf{D}}(z_{2})\big{)}\approx c(z)d_{\mathsf{geo}}^{Z}(z_{1},z_{2}),

d_{\mathsf{geo}}^{X}\big{(}\operatorname{\mathsf{D}}(z_{1}),\operatorname{\mathsf{D}}(z_{2})\big{)}\approx c(z)d_{\mathsf{geo}}^{Z}(z_{1},z_{2}),

L_{lociso} (D) = \frac{1}{2 m} E_{z \sim ν} [Tr R^{2} (z)] - \frac{1}{m} E_{z \sim ν} [Tr R (z)] + \frac{1}{2} .

L_{lociso} (D) = \frac{1}{2 m} E_{z \sim ν} [Tr R^{2} (z)] - \frac{1}{m} E_{z \sim ν} [Tr R (z)] + \frac{1}{2} .

L_{conf} (D) = \frac{m}{2} \frac{E _{z \sim ν} [ Tr R ^{2} ( z ) ]}{E _{z \sim ν} [ Tr R ( z ) ] ^{2}} - \frac{1}{2},

L_{conf} (D) = \frac{m}{2} \frac{E _{z \sim ν} [ Tr R ^{2} ( z ) ]}{E _{z \sim ν} [ Tr R ( z ) ] ^{2}} - \frac{1}{2},

(D^{⋆} g_{eucl (n)}) (z) = c (z) g_{eucl (2)} (z),

(D^{⋆} g_{eucl (n)}) (z) = c (z) g_{eucl (2)} (z),

\displaystyle\begin{split}S(\mathrm{e}^{2f}&g)=\\ &\mathrm{e}^{-2f}\big{[}S(g)-2(m-1)\Delta^{g}f\\ &\quad\quad\quad-(m-2)(m-1)\|\mathrm{d}f\|_{g}^{2}\big{]},\end{split}

\displaystyle\begin{split}S(\mathrm{e}^{2f}&g)=\\ &\mathrm{e}^{-2f}\big{[}S(g)-2(m-1)\Delta^{g}f\\ &\quad\quad\quad-(m-2)(m-1)\|\mathrm{d}f\|_{g}^{2}\big{]},\end{split}

S (c g_{eucl (2)}) = - \frac{1}{c} Δ lo g c,

S (c g_{eucl (2)}) = - \frac{1}{c} Δ lo g c,

Tr R \approx \frac{1}{N} i = 1 \sum N v_{i}^{T} J^{T} J v_{i} = \frac{1}{N} i = 1 \sum N ∥ J v_{i} ∥^{2} = \frac{1}{N} i = 1 \sum N j = 1 \sum m (J v_{i})_{j}^{2},

Tr R \approx \frac{1}{N} i = 1 \sum N v_{i}^{T} J^{T} J v_{i} = \frac{1}{N} i = 1 \sum N ∥ J v_{i} ∥^{2} = \frac{1}{N} i = 1 \sum N j = 1 \sum m (J v_{i})_{j}^{2},

D (ξ, η) = ⎩ ⎨ ⎧ x = ξ cos ξ y = η z = ξ sin ξ .

D (ξ, η) = ⎩ ⎨ ⎧ x = ξ cos ξ y = η z = ξ sin ξ .

R (ξ, η) = J_{D} (ξ, η)^{T} J_{D} (ξ, η) = [1 + ξ^{2} 0 01],

R (ξ, η) = J_{D} (ξ, η)^{T} J_{D} (ξ, η) = [1 + ξ^{2} 0 01],

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDigital Image Processing Techniques · Distributed and Parallel Computing Systems · Geological Modeling and Analysis

Full text

Assessing local deformation and computing scalar curvature with nonlinear conformal regularization of decoders

Benjamin Couéraud111Corresponding author, please address all communications to [email protected].

Zuse Institute Berlin

Vikram Sunkara

Zuse Institute Berlin

Christof Schütte

Zuse Institute Berlin

Abstract

One aim of dimensionality reduction is to discover the main factors that explain the data, and as such is paramount to many applications. When working with high dimensional data, autoencoders offer a simple yet effective approach to learn low-dimensional representations. The two components of a general autoencoder consist first of an encoder that maps the observed data onto a latent space; and second a decoder that maps the latent space back to the original observation space, which allows to learn a low-dimensional manifold representation of the original data. In this article, we introduce a new type of geometric regularization for decoding maps approximated by deep neural networks, namely nonlinear conformal regularization. This regularization procedure permits local variations of the decoder map and comes with a new scalar field called conformal factor which acts as a quantitative indicator of the amount of local deformation sustained by the latent space when mapped into the original data space. We also show that this regularization technique allows the computation of the scalar curvature of the learned manifold. Implementation and experiments on the Swiss roll and CelebA datasets are performed to illustrate how to obtain these quantities from the architecture.

Keywords: Nonlinear dimensionality reduction, manifold learning, representation learning, autoencoders, deep learning, conformal map, regularization, Hutchinson’s estimator, deformation, scalar curvature.

Introduction

Dimensionality reduction, also known as manifold learning in certain contexts, aims at representing high-dimensional datasets by a smaller number of latent factors. Principal Component Analysis (PCA), is a well-known method that achieves that aim, but is not satisfactory for highly nonlinear datasets. Starting approximately two decades ago, numerous methods have been developed to address nonlinearity in dimension reduction. Popular among these methods are Locally Linear Embedding (LLE) [21], Multidimensional Distance Scaling (MDS) [23], Laplacian eigenmaps [1], just to name a few. These methods rely on performing a Singular Value Decomposition (SVD) at some stage and as such are not directly translatable to large and high-dimensional datasets. For an in-depth overview of nonlinear dimensionality reduction, see [2]. In the last decade, deep neural networks (DNNs) have been used in the analysis of such highly nonlinear datasets with much success. DNNs help in circumventing the need for SVD by optimizing with respect to a reconstruction loss (possibly amended with regularizers), on randomly selected points, to learn both local and global properties of the data.

We begin by establishing the mathematical notations used in this article. Let us denote the observation (data) space as $X\subset\mathbb{R}^{n}$ , with $n$ usually very large, representing the number of features of the dataset. After performing dimension reduction via an encoder map $\operatorname{\mathsf{E}}$ , we end up in the latent space $Z\subset\mathbb{R}^{m}$ , with $m\ll n$ . To reconstruct this reduced data, we map it back with a decoder map $\operatorname{\mathsf{D}}$ to the original space $X$ , which is usually denoted by $\widehat{X}$ to emphasize the reconstruction process. Therefore, dimensionality reduction is represented by the maps:

[TABLE]

The manifold that is learned is $\mathcal{D}=\operatorname{\mathsf{D}}(Z)\subset\mathbb{R}^{n}$ ; it can be seen as embedded in $\mathbb{R}^{n}$ although the existence of an embedding (in the mathematical sense) is not guaranteed. In practice, the latent space $Z$ and the decoder $\operatorname{\mathsf{D}}$ constitute a parametrization of the learned manifold $\mathcal{D}$ . When the encoder $\operatorname{\mathsf{E}}$ and decoder $\operatorname{\mathsf{D}}$ are neural networks, the structure above is knows as an autoencoder. In this context the encoder is parameterized by weights $\theta_{\mathsf{enc}}$ and the decoder is parameterized by weights $\theta_{\mathsf{dec}}$ , and the optimal weights are found by minimizing the reconstruction loss

[TABLE]

with respect to the weights $\theta_{\mathsf{enc}}$ and $\theta_{\mathsf{dec}}$ , and where $x_{i}$ denotes one of the $N$ data samples ( $\|\cdot\|_{2}$ denotes the Euclidean norm). In what follows we will omit the reference to these weights for simplicity.

While using the reconstruction loss guarantees that the data manifold $\mathcal{D}$ is close to the original data found in $X$ , neighboring data points in $X$ can me mapped to points in $Z$ (called codes) which are far away from each other in the latent space $Z$ , and similarly nearby codes in $Z$ can be reconstructed without preserving the distance between them, resulting in encoders and decoders that exhibit high variations. In order to obtain better reduced representations of the data in the latent space, various geometric regularization terms have been proposed for either the encoder or the decoder, in order to not tear apart the latent representation and preserve neighborhoods (topology), or the distances in a certain sense (geometry – which practitioners may want to use to make statements about their datasets)222Regularizers can be seen as constraints in a given optimization problem and serve many purposes, such as imposing sparsity or smoothness. In this article regularization is used to enforce geometric properties.. For instance, to avoid high variations of the encoder, the norm of the Jacobian of the encoder is used as a regularizer (see [19]), which encourages the variations to be globally small. In another direction, a global isometry can be encouraged for the decoder by using a regularization term of the form:

[TABLE]

where $d_{Z}$ denotes a distance on the latent space $Z$ and $d_{X}$ denotes a distance on the data space $X$ (or similarly for the encoder; see [16] for an application to encoders where $d_{X}$ is the (approximated) geodesic distance). However, this regularization term is strong, as distances are encouraged to be preserved for any pair of codes, possibly not near each other. A weaker possibility is to encourage the decoder to be a local isometry, meaning that distances are preserved only between mapped neighborhoods. If both the data space and latent space are equipped with Euclidean metrics, the regularizer for the decoder is then

[TABLE]

where $J_{\operatorname{\mathsf{D}}}(z)$ denotes the Jacobian of the decoder at a point $z\in Z$ (see [7] for such an approach), $I_{m}$ the identity matrix of size $m\times m$ , and $\|\cdot\|_{F}$ the Frobenius norm. Local distance preservation has also been explored in [14], without resorting to Jacobians but by using an auxiliary neighborhood graph and enforcing a local version of (3) between consecutive layers of the encoder. Another direction for geometric regularization makes use of curvature minimization, see for instance [13].

Nonlinear conformal decoding

In [12], it was noted that (4) is not a coordinate-invariant expression. The authors then proposed a new coordinate-invariant regularizer that ensures that the learned decoder is as closed as possible to a (local) isometry, and extended this approach (in a slightly different way) to conformal mappings possessing a constant conformal factor (see also [5] and [17] for former approaches). Conformal maps locally preserve distances up to a (possibly nonlinear) factor, and therefore preserve angles. Such maps are well-known in cartography (for cartographic projections, akin to what an encoder does) as well as physics and engineering (for reformulating problems into equivalent yet simpler ones). In generative artificial intelligence, conformal maps have also been used as normalizing flows (see [20]). We extend this approach to the case a nonlinear conformal factor and show its benefits.

Definition 1:

Let $(M,g)$ and $(N,h)$ be two Riemannian manifolds and $f:M\to N$ be a smooth map. The map $f$ is called conformal if there exists a smooth map $c:M\to\mathbb{R}$ everywhere strictly positive such that $f^{\star}h=cg$ , which means that for any point $x\in M$ , $u$ , $v\in T_{x}M$ , we have:

[TABLE]

or alternatively in terms of the corresponding matrices:

[TABLE]

The function $c$ (or sometimes $\sqrt{c}$ ) is known as the conformal factor).

We now would like to encourage a decoder $\operatorname{\mathsf{D}}:Z\to\widehat{X}$ to be a nonlinear conformal map. In many practical cases, the Riemannian metric on the data space $X$ (or $\widehat{X}$ ) is assumed to be Euclidean, and we will assume so as well: it amounts to say that the matrix $H$ above is everywhere the identity matrix. Then from (5), the condition we are trying to enforce reads $J_{\operatorname{\mathsf{D}}}(z)^{\mathsf{T}}J_{\operatorname{\mathsf{D}}}(z)G(z)^{-1}=c(z)I_{m}$ , where $G(z)$ is the matrix of the Riemannian metric on the latent space $Z$ (which is invertible) and $c(z)$ is the conformal factor for the decoder $\operatorname{\mathsf{D}}$ . Following [12], since eigenvalues of a matrix are invariant upon a coordinate change, we need to formulate a regularization term that involves eigenvalues of the ratio matrix $R(z):=J_{\operatorname{\mathsf{D}}}(z)^{\mathsf{T}}J_{\operatorname{\mathsf{D}}}(z)G(z)^{-1}$ , which we will denote by $\lambda_{i}(z)$ , $0\leq i\leq m$ . A natural expression that doesn’t favor any eigenvalue is:

[TABLE]

where $\Phi:\mathbb{R}\to\mathbb{R}$ denotes a smooth, positive and convex function that reaches its minimum at $1$ , and $\nu$ denotes a probability measure on the latent space $Z$ 333The latent space will always be assumed to be compact.. This expression can be used to encourage the decoder to be a local isometry; we will explain how briefly hereafter. However, for the purpose of encouraging the decoder $\operatorname{\mathsf{D}}$ to be a nonlinear conformal map, we propose the following regularization term:

[TABLE]

where $\sigma:\mathbb{R}^{m}\to\mathbb{R}$ denotes any symmetric function such that $\sigma(1,\dots,1)=1$ and such that $\sigma$ is homogeneous of degree $1$ , meaning $\sigma(\alpha x_{1},\cdots,\alpha x_{m})=\alpha\sigma(x_{1},\dots,x_{m})$ for any nonzero real number $\alpha$ . Note that such a function $\sigma$ may still favor certain eigenvalues, but as a priori there is no reason to do so, we will specify a neutral $\sigma$ later on. This new regularization term is a direct generalization of:

[TABLE]

with $I=\int_{Z}\sigma(\lambda_{1}(z),\dots,\lambda_{m}(z))\ \mathrm{d}z$ (see [12, section 3.3]), with the advantage that we will have access to a statistical estimator of the local deformation factor, see (9). Indeed, the conformal factor at a point $z$ can be seen as the ratio of the area of a ball centered at $z$ to the area of the preimage by the decoder of the same ball, giving a quantitative way to measure the stretching sustained while decoding.

Proposition 1:

Given a decoder $\operatorname{\mathsf{D}}:Z\to\widehat{X}$ , a function $\Phi:\mathbb{R}\to\mathbb{R}$ and a function $\sigma:\mathbb{R}^{m}\to\mathbb{R}$ satisfying the properties mentioned above, we have:

1

$\widetilde{\mathcal{L}}_{\mathsf{conf}}(\operatorname{\mathsf{D}})\geq 0$ , 2. 2

$\widetilde{\mathcal{L}}_{\mathsf{conf}}(\operatorname{\mathsf{D}})=0$ is equivalent to the existence of a smooth map $c:Z\to\mathbb{R}$ such that $\lambda_{i}(z)=c(z)$ for all $0\leq i\leq m$ .

Also, given two decoders $\operatorname{\mathsf{D}}_{1}:Z\to\widehat{X}$ and $\operatorname{\mathsf{D}}_{2}:Z\to\widehat{X}$ , we have:

3

If there exists a smooth map $c:Z\to\mathbb{R}$ such that $R_{1}(z)=c(z)R_{2}(z)$ , then $\widetilde{\mathcal{L}}_{\mathsf{conf}}(\operatorname{\mathsf{D}}_{1})=\widetilde{\mathcal{L}}_{\mathsf{conf}}(\operatorname{\mathsf{D}}_{2})$ .

Proof:

The first point is trivial. Concerning the second point, the implication from right to left is clear thanks to the properties of $\sigma$ . The converse comes from the fact that $h$ is positive, so

[TABLE]

for all $0\leq i\leq m$ , but then $\Phi$ reaches its unique minimum at $1$ , hence setting $c(z):=\sigma(\lambda_{1}(z),\dots,\lambda_{m}(z))$ we obtain $\lambda_{i}(z)=c(z)$ for all $0\leq i\leq m$ . The third point is clear using the homogeneity of $\sigma$ after remarking that the eigenvalues of the matrix $R_{1}$ are the ones of $R_{2}$ multiplied by the conformal factor $c$ , at any point $z$ . $\Box$

The first property ensures that the minimum of the regularizer is zero, while the second ensures that this minimum is reached precisely when the decoder is a conformal map, with the conformal factor given by the map $c:Z\to\mathbb{R}$ . The third property simply says that conformal-equivalent decoders yield the same value of the regularization term; such property reduces the search space.

Proposition 2:

Choosing $\Phi(x)=\frac{1}{2}(x-1)^{2}$ and $\sigma(x_{1},\cdots,x_{m}):=\frac{1}{m}(x_{1}+\cdots+x_{m})$ so as not to favor any particular eigenvalue, we obtain:

[TABLE]

Proof:

From the previous proof, we note that the conformal factor is given by $c(z)=\sigma(\lambda_{1}(z),\dots,\lambda_{m}(z))=\frac{1}{m}\sum_{i=1}^{m}\lambda_{i}(z)$ , hence we obtain the second formula. For the first, after rewriting the integral in (7) as an expectation with respect to the probability distribution $\nu$ on $Z$ , we compute:

[TABLE]

which after simplification yields the first formula, since thanks to the invariance by similarity (change of basis) property of the trace we have $\operatorname{Tr}R^{2}(z)=\sum_{i=1}^{m}\lambda_{i}(z)^{2}$ . $\Box$

Remark 1:

If we denote by $d_{\mathsf{geo}}^{X}$ the geodesic distance on the data space $X$ and the geodesic distance on the latent space $Z$ by $d_{\mathsf{geo}}^{Z}$ , we have:

[TABLE]

for any $z_{1}$ , $z_{2}\in Z$ in a neighborhood of a common $z\in Z$ . Indeed, using the Riemannian exponential map $\exp:T_{\operatorname{\mathsf{D}}(z)}X\to X$ at the point $\operatorname{\mathsf{D}}(z)$ we have $d_{\mathsf{geo}}^{X}(\operatorname{\mathsf{D}}(z_{1}),\operatorname{\mathsf{D}}(z_{2}))=d_{\mathsf{geo}}^{X}(\mathrm{e}^{\mathrm{d}\operatorname{\mathsf{D}}_{z_{1}}(u_{1})},\mathrm{e}^{\mathrm{d}\operatorname{\mathsf{D}}_{z_{2}}(u_{2})})$ for some tangent vectors $u_{1}$ and $u_{2}$ . Then by using [6, chapter 5, proposition 2.7] which approximates the geodesic distance by the Riemannian distance $h$ on $X$ , we obtain that the last term is approximately equal to $\|\mathrm{d}\operatorname{\mathsf{D}}_{z_{1}}(u_{1})-\mathrm{d}\operatorname{\mathsf{D}}_{z_{2}}(u_{2})\|_{h(\operatorname{\mathsf{D}}(z))}$ (to first order) which in turn is equal to $c(z)\|u_{1}-u_{2}\|_{g(z)}$ since $\operatorname{\mathsf{D}}$ is conformal444Given $(M,g)$ a Riemannian manifold, $\|u\|^{2}_{g(x)}:=g(x)(u,u)$ (or $u^{\mathsf{T}}G(x)u$ in matrix form) for any tangent vector $u\in T_{x}M$ .. Then using the exponential $\exp:T_{z}Z\to Z$ and the aforementioned proposition once more, we obtain $c(z)d_{\mathsf{geo}}^{Z}(z_{1},z_{2})$ .

Remark 2:

The regularization term (6) can be used to encourage the decoder $\operatorname{\mathsf{D}}$ to be local isometry. Using the same function $\Phi$ as before, and computations similar to the ones in proposition 2 shows that we obtain a regularization term of the form:

[TABLE]

See also [9] for a similar approach but that does not make use of neural networks. In [12], constant conformal maps were introduced; they correspond to the coordinate-invariant regularization term:

[TABLE]

and the associated constant conformal factor then can be seen to be $c\equiv\frac{1}{m}\int_{Z}\operatorname{Tr}R(z)\ \mathrm{d}\nu$ . Notice also the difference with $\widetilde{\mathcal{L}}_{\mathsf{conf}}(\operatorname{\mathsf{D}})$ regarding the use of the expectation operator $\operatorname{\mathbb{E}}_{\nu}$ .

Computing scalar curvature

In Riemannian geometry, scalar curvature [11] is the tensor of lowest rank that one can associate to the original Riemann curvature tensor. It is a smooth function $S$ on the manifold (or on the latent space, in case the manifold is embedded like in this article) that measures the curvature at that point, independently from a choice a local chart (or parametrization in our case). Computing different measures of the curvature of the learned manifold gives an intrinsic piece information about this manifold, and is of particular interest in the case of RNAseq data, see [24] and [22] for instance. With the help of the regularizer (7), in the case the latent space is of dimension $2$ (which is usually the case for visualization) we are encouraging the decoder to be conformal, meaning that at each latent variables $z\in Z$ we are trying to impose the relationship:

[TABLE]

where $g_{\mathsf{eucl}(n)}$ denotes the Euclidean metric on $\mathbb{R}^{n}$ . Therefore, the scalar curvature on the latent space is the scalar curvature of the metric $cg_{\mathsf{eucl}(m)}$ , with $m=2$ . However, the change of scalar curvature $S(g)$ (on a manifold of dimension $m$ ) under a conformal change of metric $g\mapsto\mathrm{e}^{2f}g$ is well known in Riemannian geometry (see [3]):

[TABLE]

where $\Delta^{g}$ denotes the Laplace-Beltrami operator, generalizing the Laplacian to manifolds (see [11]). In our case $m=2$ and $g=g_{\mathsf{eucl}(2)}$ so we obtain:

[TABLE]

which is independent of the dimension $n$ of the space in which the data manifold $\mathcal{D}=\operatorname{\mathsf{D}}(Z)\subset\mathbb{R}^{n}$ is embedded. Once the decoder $\operatorname{\mathsf{D}}$ has been learned during the training phase, we can compute the conformal factor $c$ with the formula (10): to each input data point corresponds a latent code $z$ , and accordingly a value $c(z)$ of the conformal factor at that point. To estimate the scalar curvature of $\mathcal{D}$ , a simple approach is then to construct a $k$ -nearest neighbors graph from the latent codes, a weight matrix $W$ (and its associated degree matrix $D$ ) with exponential weights with respect to the Euclidean distances between latent codes, and form the graph Laplacian $L=D-W$ . Then if $\mathbf{c}=\{c(z_{i})\}_{1\leq i\leq N}$ represents the array of estimated values for the conformal factor, the scalar curvature is given by the array $-\frac{1}{\mathbf{c}}L\log\mathbf{c}$ , where the division and the logarithm are component-wise operations.

Implementation

In order to implement the different regularizers, we use Hutchinson’s trace estimator [8]. Namely, given a $n\times n$ matrix $A$ , the trace $\operatorname{Tr}A$ can be Monte-Carlo estimated thanks to the relation $\operatorname{Tr}A=\operatorname{\mathbb{E}}_{v\sim\mathcal{N}(0,I_{n})}[v^{\mathsf{T}}Av]$ , which has the advantage to only require matrix-vector products. Stacking $n$ samples from a Rademacher distribution instead of using one sample from a $n$ -dimensional multivariate Gaussian, one obtains an unbiased estimator with minimal variance (see [8, proposition 1])555Notice though that Hutchinson’s estimator may exhibit a great amount of Monte-Carlo variance. To remedy to this problem, variance reduction techniques can be employed, see this web page for an exhaustive review and insights. We give an example of this estimator for $\operatorname{Tr}R$ with $R=J^{\mathsf{T}}J$ , $J$ denoting the Jacobian of some map (a decoder in particular). We have the formula

[TABLE]

which we used to implement the regularization term (9) with PyTorch666Matrix-vector products of the form $Jv$ are computed with the help of the function torch.func.jvp, while matrix-vector products such as $J^{\mathsf{T}}$ (that appear in $R^{2})$ are computed with the help of the function torch.func.vjp., and where $v_{i}$ denotes a stacked Rademacher sample as described before. Also, to implement $\operatorname{\mathbb{E}}_{\nu}$ , we take $\nu$ to be the empirical measure.

Furthermore when the full Jacobian of the decoder is needed for a batch of data, we make use of the the torch.func.vmap and torch.func.jacfwd functions. The optimizer used is the weighted Adam’s algorithm, with additions in the case of the Celebrities dataset (see section 5.2).

All code was implemented using PyTorch and was run on standard laptop-grade Intel CPU hardware (i5-11400H), then with a standard NVIDIA GPU hardware (GeForce RTX 3500), as well as on a computing cluster equipped with NVIDIA H100 NVL GPU hardware, giving roughly a tenfold performance increase on the experiments to follow. The full code can be found at this URL.

Remark 3:

Concerning reproducibility, although we made sure that all seeds (Python, NumPy, PyTorch and CUDA ones) are set before running the experiments, we couldn’t obtain the same values for different runs. According to PyTorch’s documentation, this is due to CUDA choosing at runtime what is the best algorithm to run the code on the GPU, and that randomness is involved in this choice. Even with fully deterministic CUDA algorithms we couldn’t obtain the same values for different runs. Running multiple times the same experiment and averaging the results was ruled out because it proved to be too computationally intensive. We invite the reader to consult PyTorch’s official documentation on this matter.

Experiments

In order to assess the regularization terms $\mathcal{L}_{\textsf{lociso}}$ (11) and $\widetilde{\mathcal{L}}_{\textsf{conf}}$ (12) proposed in this article, we created experiments based on several datasets. A vanilla autoencoder with reconstruction loss $\mathcal{L}_{\textsf{recon}}$ (2) serves as the basis for comparison (with no geometric regularizer), with layers of different types according to the dataset used. Then, the different regularization terms were added to the reconstruction loss $\mathcal{L}_{\mathrm{recon}}$ , starting with the well-known $\mathcal{L}_{\textsf{gloiso}}$ (3) and then considering $\mathcal{L}_{\textsf{lociso}}$ and $\widetilde{\mathcal{L}}_{\textsf{conf}}$ .

Swiss roll dataset

The Swiss roll is a classical toy dataset in nonlinear dimensionality reduction. It can be found for example in scikit-learn. In this package it is parametrized as follows: $Z=[\frac{3\pi}{2},\frac{9\pi}{2}]\times[0,21]$ , a latent variable is written $z=(\xi,\eta)$ , and the parametrization is given by:

[TABLE]

The notations $Z$ and $\operatorname{\mathsf{D}}$ are precisely chosen so as to give a parallel with the autoencoder structure (1) (with $X=\mathbb{R}^{3}$ , $n=3$ and $m=2$ ). We are now given samples from this dataset and the aim of the encoder is to obtain a latent space $Z$ (not necessarily the same as the one above), while the decoder will give us a new parametrization of the Swiss roll. Adding the regularization terms means that we are adding constraints on the way the Swiss roll is embedded in $\mathbb{R}^{3}$ through its parametrization $\operatorname{\mathsf{D}}$ . Using this parametrization we compute:

[TABLE]

therefore the parametrization above is not conformal, and the regularizer we propose will find one. The dataset has been normalized as follows: to each data point the average among all samples is subtracted, and we divide the result by the standard deviation among the samples. The vanilla autoencoder is composed of a sequence of fully connected layers with dimensions $\mathbf{3}\rightarrow\mathbf{50}\rightarrow\mathbf{50}\rightarrow\mathbf{50}\rightarrow\mathbf{2}$ for the encoder, and reversed for the decoder, together with ReLU activation functions. The same architecture is held throughout the experiments, for all regularizers. The standard mean squared error (MSE) loss is used to implement the reconstruction loss.

The scalar curvature of the Swiss roll, which is twice its Gaussian curvature, can be computed from the parametrization above and we find $S\equiv 0$ as expected as it is an embedding of a plane in $\mathbb{R}^{3}$ (see [18] for how to compute the Gaussian curvature from the first and second fundamental forms). The vanilla autoencoder, together with the conformal regularizer $\widetilde{\mathcal{L}}_{\textsf{conf}}$ and the computation of the graph Laplacian on the latent space allows to recover this fact as shown by figure 3.

On the validation dataset we also measured the condition number (with respect to the $L^{2}$ norm) $\kappa_{\textsf{jac}}$ of the Jacobian of the decoder, $J_{\operatorname{\mathsf{D}}}(z)$ , and the condition number $\kappa_{\textsf{pbm}}$ of the pull-back metric of the decoder, $J_{\operatorname{\mathsf{D}}}(z)^{\mathsf{T}}J_{\operatorname{\mathsf{D}}}(z)$ with $z\in Z\subset\mathbb{R}^{2}$ . The first measure shows how the linearized decoder around one latent code amplifies variations in its input (latent variables), and the second summarizes the pullback metric, specifically the anisotropy of the pullback metric, or how far the decoder is from being a local isometry. These condition numbers are computed with torch.linalg.cond at each validation point in the latent space. Their spatial distributions are visualized in figure 4 and summarized in the table 1, for all geometric regularizers. Although no experiment repetition and results averaging has been performed, it has been observed clearly that nonlinear conformal regularization (and local isometric regularization) consistently learn a better conditioned and anisotropic decoder.

An important hyperparameter to consider when one wants to quantitatively assess the benefits of a geometric regularizer is the intensity with which the constraint is imposed in the total loss. One rule of thumb is to first adjust it so that the regularizer values times this intensity are in the same range as the values of the reconstruction loss $\mathcal{L}_{\mathsf{recon}}$ . Then, from that stage, if we increase this intensity it may happen that the reconstruction loss does not decrease anymore, while decreasing the intensity will accordingly decrease the desired effect of the geometric regularizer. Finding a good balance is a challenge, especially because the regularizer values change during the training process.

Celebrities dataset

The Celebrities dataset (also widely known as CelebA, see [15]) is a well-known dataset that comprises more than 200,000 faces of celebrities (with 40 annotated attributes), each of size $178\times 218$ pixels. With the help of the torchvision.transforms submodule, each image in the dataset is center cropped to $150\times 150$ pixels, resized to $32\times 32$ pixels, and normalized per channel so that all values are in $[-1,1]$ . Therefore, with regards to the structure (1), $n=3072$ and $m=2$ . For this experiment the vanilla autoencoder uses convolutional layers with kernel size $3$ , stride $2$ and padding $1$ , together with leaky ReLU activation functions, except for the last layer of the decoder which uses an hyperbolic tangent in order to normalize the values in $[-1,1]$ and recreate an image. There is also a fully connected layer after the last convolution layer to map the last feature map onto the latent space. The dimensions are as follows: $\mathbf{3\times 32\times 32}\rightarrow\mathbf{32\times 16\times 16}\rightarrow\mathbf{64\times 8\times 8}\rightarrow\mathbf{128\times 4\times 4}\xrightarrow{\text{flatten}}\mathbf{2048}\rightarrow\mathbf{2}$ . In order to reduce overfitting, for this dataset we used weight decay and a learning rate scheduler that reduces the learning rate upon encountering a plateau (starting with a learning rate of $10^{-4}$ , with batch size 64, attaining convergence at 100 epochs). Note that since the data set is high dimensional and the compression is maximal (latent space has dimension 2), it is very difficult to reconstruct images and the reconstruction loss stabilizes around $0.1$ only. The table 2 shows once more that the local isometry and nonlinear conformal regularizers added to the vanilla autoencoder learn better conditioned decoders than the global isometry regularizer.

Conclusion

In this article, we applied a new type of geometric regularization to decoders, namely nonlinear conformal regularization. Conformal maps do not exactly preserve distances, only up to a certain factor that varies with the codes in the latent space, called the conformal factor, which allows to make an assessment on the local deformation that occurs upon reconstructing the observed data from latent codes. The regularizer is formulated in a coordinate-invariant way, in the spirit of [9], and is implemented as a Hutchinson’s Monte-Carlo estimator, whose quality increases with the number of stacked Rademacher samples. The architecture proposed automatically learns the conformal factor as well as the scalar curvature of the manifold after construction of a graph Laplacian on the latent space. In addition, it is more flexible than a global isometry regularizer in the sense that local variations of the decoder are allowed, and the regularizers allow the architecture to learn better conditioned decoders.

This work opens further questions on geometric regularization in the context of autoencoders. One venue of research would be to integrate this new geometric regularizer into a variational autoencoder (VAE), and study the interplay between the geometry brought back onto the latent space with the help of the pullback metric, and the information geometry of the latent space as given bye the VAE (several works are already investigating this direction, such as [4] and [10]), and how conformal maps are beneficial in that case. Moreover, VAEs tend to use less samples than standard autoencoders, which would be beneficial to study RNAseq datasets with the presented framework, possibly opening the way to make statements based on stretched distances and scalar curvature as visualized in the latent space. Also, it would be interesting to study variance reduction techniques for the conformal regularizer, with the aim of improving both the condition number of the decoder’s Jacobian and the condition number of the pullback metric.

Acknowledgments: The author would like to thank Enikő Regényi for her explanations regarding RNAseq data and for the preparation of experiments regarding several RNAseq datasets (unpublished).

Bibliography24

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems , volume 14. MIT Press, 2001.
2[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. , 35(8):1798–1828, 2013.
3[3] Arthur L Besse. Einstein Manifolds . Classics in Mathematics. Springer, Berlin, Germany, 2007.
4[4] Clément Chadebec and Stéphanie Allassonnière. A geometric perspective on variational autoencoders. In Proceedings of the 36th International Conference on Neural Information Processing Systems , NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.
5[5] Nutan Chen, Alexej Klushyn, Francesco Ferroni, Justin Bayer, and Patrick Van Der Smagt. Learning flat latent manifolds with VA Es. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning , volume 119 of Proceedings of Machine Learning Research , pages 1587–1596. PMLR, 2020.
6[6] F. Flaherty and M.P. do Carmo. Riemannian Geometry . Mathematics: Theory & Applications. Birkhäuser Boston, 2013.
7[7] Amos Gropp, Matan Atzmon, and Yaron Lipman. Isometric autoencoders, 2020.
8[8] M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation , 18(3):1059–1076, 1989.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Assessing local deformation and computing scalar curvature with nonlinear conformal regularization of decoders

Abstract

Introduction

Nonlinear conformal decoding

Definition 1**:**

Proposition 1**:**

Proof**:**

Proposition 2**:**

Proof**:**

Remark 1**:**

Remark 2**:**

Computing scalar curvature

Implementation

Remark 3**:**

Experiments

Swiss roll dataset

Celebrities dataset

Conclusion

Definition 1:

Proposition 1:

Proof:

Proposition 2:

Proof:

Remark 1:

Remark 2:

Remark 3: