Entropic Regularisation of Robust Optimal Transport

Rozenn Dahyot; Hana Alghamdi; Mairead Grogan

arXiv:1905.12678·cs.CV·July 15, 2019

TL;DR

This paper reinterprets a recent colour transfer method as a robust optimal transport framework with entropy regularisation, providing a new perspective on the approach.

Contribution

It introduces a novel robust optimal transport formulation with entropy regularisation over marginals, unifying and extending previous colour transfer methods.

Findings

01 Reinterprets a colour transfer method within a robust optimal transport framework

02 Demonstrates the effectiveness of entropy regularisation in robust optimal transport

03 Provides theoretical insights connecting colour transfer and optimal transport

Abstract

Grogan et al. [11, 12] have recently proposed a solution to colour transfer by minimising the Euclidean distance L2 between two probability density functions capturing the colour distributions of two images (palette and target). It was shown to be very competitive with alternative solutions based on Optimal Transport for colour transfer. We show that Grogan et al.'s formulation can also be understood as a new robust Optimal Transport based framework with entropy regularisation over marginals.

Equations

$$\hat{\phi}=\arg\min_{\phi}\ \mathcal{C}(\phi)$$

$$\begin{array}{llr}\mathcal{C}(\phi)= & +\frac{1}{n^{2}}\sum_{j_{1}=1}^{n}\sum_{j_{2}=1}^{n}\mathcal{N}(0;y^{(j_{1})}-y^{(j_{2})},2h^{2}\mathrm{I}) & (\mathcal{T}_{0})\\ & +\frac{1}{\tilde{n}^{2}}\sum_{i_{1}=1}^{\tilde{n}}\sum_{i_{2}=1}^{\tilde{n}}\mathcal{N}(0;\tilde{y}^{(i_{1})}-\tilde{y}^{(i_{2})},2h^{2}\mathrm{I}) & (\mathcal{T}_{1})\\ & -\frac{2}{n\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j=1}^{n}\mathcal{N}(0;\tilde{y}^{(i)}-y^{(j)},2h^{2}\mathrm{I}) & (\mathcal{T}_{2})\\ & -\lambda_{1}\frac{1}{\tilde{\tilde{n}}}\sum_{k=1}^{\tilde{\tilde{n}}}\mathcal{N}(0;\tilde{y}^{(k)}-y^{(k)},2h^{2}\mathrm{I}) & (\mathcal{T}_{3})\\ & +\lambda_{2}\ \mathcal{P}(\tilde{y}) & (\mathcal{T}_{4})\\ & +\lambda_{3}\ \mathcal{P}(\phi) & (\mathcal{T}_{5})\end{array}$$

$$\mu(y)=\frac{1}{n}\sum_{y^{(j)}\in\mathcal{S}}\mathcal{N}(y;y^{(j)},h^{2})$$

$$\tilde{\mu}(y|\phi)=\frac{1}{\tilde{n}}\sum_{\tilde{y}^{(i)}\in\widetilde{\mathcal{S}}}\mathcal{N}(y;\tilde{y}^{(i)},h^{2})$$

$$\mathcal{L}_{2}(\mu,\tilde{\mu})=\|\mu-\tilde{\mu}\|^{2}=\int\left(\mu(y)-\tilde{\mu}(y|\phi)\right)^{2}dy=\underbrace{\|\mu\|^{2}}_{\mathcal{T}_{0}}\ \underbrace{-2\langle\mu|\tilde{\mu}\rangle}_{\mathcal{T}_{2}}+\underbrace{\|\tilde{\mu}\|^{2}}_{\mathcal{T}_{1}}$$

$$\mu(y)=\frac{1}{\tilde{\tilde{n}}}\sum_{y^{(k_{1})}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;y^{(k_{1})},h^{2}\mathrm{I})\quad\text{and}\quad\tilde{\mu}(y)=\frac{1}{\tilde{\tilde{n}}}\sum_{\tilde{y}^{(k_{2})}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;\tilde{y}^{(k_{2})},\tilde{h}^{2}\mathrm{I})$$

$$\langle\mu|\tilde{\mu}\rangle=\frac{1}{\tilde{\tilde{n}}^{2}}\sum_{y^{(k_{1})}\in\widetilde{\widetilde{\mathcal{S}}}}\ \sum_{\tilde{y}^{(k_{2})}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y^{(k_{1})};\tilde{y}^{(k_{2})},(h^{2}+\tilde{h}^{2})\mathrm{I})$$

$$W(\mu,\tilde{\mu})=\min_{\gamma}\left\{\iint c(y,\tilde{y})\,\gamma(y,\tilde{y})\,dy\,d\tilde{y}=\langle c|\gamma\rangle\right\}$$

$$\gamma_{u}(y,\tilde{y}|\phi)=\frac{1}{n}\sum_{y^{(j)}\in\mathcal{S}}\mathcal{N}(y;y^{(j)},h^{2}\mathrm{I})\times\frac{1}{\tilde{n}}\sum_{\tilde{y}^{(i)}\in\widetilde{\mathcal{S}}}\mathcal{N}(\tilde{y};\tilde{y}^{(i)},\tilde{h}^{2}\mathrm{I})$$

$$\mu_{u}(y)=\frac{1}{n}\sum_{y^{(j)}\in\mathcal{S}}\mathcal{N}(y;y^{(j)},h^{2}\mathrm{I})$$

$$\tilde{\mu}_{u}(\tilde{y}|\phi)=\frac{1}{\tilde{n}}\sum_{\tilde{y}^{(i)}\in\widetilde{\mathcal{S}}}\mathcal{N}(\tilde{y};\tilde{y}^{(i)},\tilde{h}^{2}\mathrm{I})$$

$$\gamma_{s}(y,\tilde{y}|\phi)=\frac{1}{\tilde{\tilde{n}}}\sum_{(y^{(k)},\tilde{y}^{(k)})\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;y^{(k)},h^{2}\mathrm{I})\ \mathcal{N}(\tilde{y};\tilde{y}^{(k)},\tilde{h}^{2}\mathrm{I})$$

$$\mu_{s}(y)=\frac{1}{\tilde{\tilde{n}}}\sum_{y^{(k)}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;y^{(k)},h^{2}\mathrm{I})$$

$$\tilde{\mu}_{s}(\tilde{y}|\phi)=\frac{1}{\tilde{\tilde{n}}}\sum_{\tilde{y}^{(k)}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(\tilde{y};\tilde{y}^{(k)},\tilde{h}^{2}\mathrm{I})$$

$$\gamma_{s+u}(y,\tilde{y}|\phi)=(1-\lambda)\ \gamma_{u}(y,\tilde{y}|\phi)+\lambda\ \gamma_{s}(y,\tilde{y}|\phi)$$

$$\mu_{s+u}(y)=(1-\lambda)\ \mu_{u}(y)+\lambda\ \mu_{s}(y)$$

$$\tilde{\mu}_{s+u}(\tilde{y}|\phi)=(1-\lambda)\ \tilde{\mu}_{u}(\tilde{y}|\phi)+\lambda\ \tilde{\mu}_{s}(\tilde{y}|\phi)$$

$$c_{G}(y,\tilde{y})=A-\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})$$

$$\langle c_{G}|\gamma\rangle=A-\iint\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})\ \gamma(y,\tilde{y})\,dy\,d\tilde{y}$$

$$\rho_{G}(\epsilon)=1-\exp\left(-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right)^{2}\right)$$

$$\rho_{G}(\epsilon)=1-\exp\left(-\frac{\epsilon^{2}}{2\sigma^{2}}\right)\sim\frac{\epsilon^{2}}{2\sigma^{2}}\quad\text{for }\epsilon\ll\sigma$$

$$\langle c_{G}|\gamma_{u}\rangle=-\frac{1}{n\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j=1}^{n}\mathcal{N}(0;\tilde{y}^{(i)}-y^{(j)},(h^{2}+\tilde{h}^{2}+h_{c}^{2})\mathrm{I})$$

$$\langle c_{G}|\gamma_{s}\rangle=-\frac{1}{\tilde{\tilde{n}}}\sum_{k=1}^{\tilde{\tilde{n}}}\mathcal{N}(0;\tilde{y}^{(k)}-y^{(k)},(h^{2}+\tilde{h}^{2}+h_{c}^{2})\mathrm{I})$$

$$\langle c_{G}|\gamma_{s+u}\rangle=(1-\lambda)\ \langle c_{G}|\gamma_{u}\rangle+\lambda\ \langle c_{G}|\gamma_{s}\rangle\qquad(\mathcal{T})$$

$$\hat{\phi}=\arg\min_{\phi}\left\{W(\mu_{s+u},\tilde{\mu}_{s+u})=\langle c_{G}|\gamma_{s+u}\rangle\right\}$$

$$\mathcal{T}_{5}=\lambda_{3}\int\left\|\frac{\partial^{2}\phi(x)}{\partial x^{2}}\right\|^{2}dx$$

$$\hat{\phi}=\arg\max_{\phi}\iint\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})\ \gamma_{\phi}(y,\tilde{y})\,dy\,d\tilde{y}$$

$$\hat{\phi}=\arg\max_{\phi}\ \langle\gamma_{m}|\gamma_{\phi}\rangle$$

Full text

Rozenn Dahyot, Hana Alghamdi and Mairead Grogan

School of Computer Science and Statistics

Trinity College Dublin, Ireland

Abstract

Grogan et al. [11, 12] have recently proposed a solution to colour transfer by minimising the Euclidean distance $\mathcal{L}_{2}$ between two probability density functions capturing the colour distributions of two images (palette and target). It was shown to be very competitive with alternative solutions based on Optimal Transport for colour transfer. We show that Grogan et al.'s formulation can also be understood as a new robust Optimal Transport based framework with entropy regularisation over marginals.

Keywords: M-estimation, $\mathcal{L}_{2}E$ estimator, Optimal Transport, Colour Transfer

1 Introduction

Optimal transport (OT) [16] has been successfully used to define cost functions for optimisation when performing colour transfer [17] and, more recently, in machine learning [4, 16]. The optimal transport cost (e.g. the Wasserstein distance) is itself also used as a similarity metric for retrieval [18]. For colour transfer (see Fig. 1; the images used to design Fig. 1 were extracted from the video https://youtu.be/FfrdyKMBVRc, a demo for [12]), Grogan et al. [11, 12] have recently proposed an alternative approach for designing the cost function based on the $\mathcal{L}_{2}$ divergence (see Sec. 2). This $\mathcal{L}_{2}$ based cost function is a weighted sum of multiple terms (terms $\mathcal{T}_{0,1,2,3}$ in Eq. 1). It can take into account correspondences between images when these are available (via term $\mathcal{T}_{3}$ in Eq. 1), as well as the unsupervised scenario when no correspondence is available (via term $\mathcal{T}_{2}$ in Eq. 1). In addition, $\mathcal{L}_{2}$ includes entropies (terms $\mathcal{T}_{0}$ and $\mathcal{T}_{1}$ in Eq. 1). To further constrain the cost function when estimating the colour transformation $\phi$ , additional penalties can be added to prevent colours from exceeding a certain range or to force the estimated solution $\phi$ to be smooth (resp. terms $\mathcal{T}_{4}$ and $\mathcal{T}_{5}$ in Eq. 1). The estimate $\hat{\phi}$ is computed as

$$\hat{\phi}=\arg\min_{\phi}\ \mathcal{C}(\phi)\tag{1}$$

with:

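$$\begin{array}{llr}\mathcal{C}(\phi)= & +\frac{1}{n^{2}}\sum_{j_{1}=1}^{n}\sum_{j_{2}=1}^{n}\mathcal{N}(0;y^{(j_{1})}-y^{(j_{2})},2h^{2}\mathrm{I}) & (\mathcal{T}_{0})\\ & +\frac{1}{\tilde{n}^{2}}\sum_{i_{1}=1}^{\tilde{n}}\sum_{i_{2}=1}^{\tilde{n}}\mathcal{N}(0;\tilde{y}^{(i_{1})}-\tilde{y}^{(i_{2})},2h^{2}\mathrm{I}) & (\mathcal{T}_{1})\\ & -\frac{2}{n\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j=1}^{n}\mathcal{N}(0;\tilde{y}^{(i)}-y^{(j)},2h^{2}\mathrm{I}) & (\mathcal{T}_{2})\\ & -\lambda_{1}\frac{1}{\tilde{\tilde{n}}}\sum_{k=1}^{\tilde{\tilde{n}}}\mathcal{N}(0;\tilde{y}^{(k)}-y^{(k)},2h^{2}\mathrm{I}) & (\mathcal{T}_{3})\\ & +\lambda_{2}\ \mathcal{P}(\tilde{y}) & (\mathcal{T}_{4})\\ & +\lambda_{3}\ \mathcal{P}(\phi) & (\mathcal{T}_{5})\end{array}$$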

with $\mathcal{N}(z;a,\Sigma)$ indicating a normal distribution for random vector $z$ with expectation $a$ and covariance matrix $\Sigma$ . $\mathrm{I}$ is the identity matrix, $h$ is a user defined bandwidth and $\lambda_{1,2,3}$ are weights. This paper proposes an OT formulation for the terms $\mathcal{T}_{2}$ and $\mathcal{T}_{3}$ (see Sec. 3) as an alternative to $\mathcal{L}_{2}$ (presented in Sec. 2). In particular, we show that these terms correspond to robust Wasserstein distances where the bandwidth $h$ (Eq. 1) enables seamless control of the level of robustness, in a similar fashion to the scale parameter controlling M-estimators [13]. This reformulation allows the following contributions: first, to extend OT to supervised and semi-supervised scenarios, and second, to propose a robust Wasserstein cost (Sec. 3). We start by explaining in more detail the notations used and the $\mathcal{L}_{2}$ cost function.

2 $\mathcal{L}_{2}$ divergence

We consider that the following are available:

• a dataset $\mathcal{S}=\left\{y^{(j)}\right\}_{j=1,\cdots,n}$ : the term $\mathcal{T}_{0}$ (Eq. 1) uses the samples from this dataset.

• a dataset $\widetilde{\mathcal{S}}=\left\{\tilde{y}^{(i)}=\phi(x^{(i)})\right\}_{i=1,\cdots,\tilde{n}}$ computed using a transfer (or mapping) function $\phi$ on data points $\{x^{(i)}\}_{i=1,\cdots,\tilde{n}}$ . The term $\mathcal{T}_{1}$ (Eq. 1) uses the samples from this dataset.

• a dataset of correspondences $\widetilde{\widetilde{\mathcal{S}}}=\left\{(y^{(k)},\tilde{y}^{(k)}=\phi(x^{(k)}))\right\}_{k=1,\cdots,\tilde{\tilde{n}}}$ : the term $\mathcal{T}_{3}$ (Eq. 1) uses the samples from this dataset.

All data points have the same dimension (i.e. $\dim(y^{(l_{1})})=\dim(\tilde{y}^{(l_{2})})$ ) for any samples taken from $\mathcal{S}$ , $\widetilde{\mathcal{S}}$ or $\widetilde{\widetilde{\mathcal{S}}}$ . Figure 2 shows an illustration of our datasets in the context of colour transfer (images from the video posted at https://twitter.com/gabrielpeyre/status/979605863295053826 have been used for designing Fig. 2).

In this $\mathcal{L}_{2}$ framework [11], only one random vector (r.v.) $y$ is defined. Using $\mathcal{S}$ and $\widetilde{\mathcal{S}}$ , two probability density functions denoted $\mu(y)$ and $\tilde{\mu}(y|\phi)$ respectively are computed for r.v. $y$ as kernel density estimates with a Normal kernel (or Gaussian Mixture Models):

$$\mu(y)=\frac{1}{n}\sum_{y^{(j)}\in\mathcal{S}}\mathcal{N}(y;y^{(j)},h^{2})$$

and

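$$\tilde{\mu}(y|\phi)=\frac{1}{\tilde{n}}\sum_{\tilde{y}^{(i)}\in\widetilde{\mathcal{S}}}\mathcal{N}(y;\tilde{y}^{(i)},h^{2})$$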

The unknown mapping function $\phi$ transforms the samples in $\widetilde{\mathcal{S}}$ that act as the means of the normal kernels in the mixture $\tilde{\mu}(y|\phi)$ . Hence, $\tilde{\mu}$ can be warped onto $\mu$ by finding the appropriate function $\phi$ . The function $\phi$ can be chosen as the minimiser of the Euclidean $\mathcal{L}_{2}$ distance between $\mu$ and $\tilde{\mu}$ , defined as [14]:

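$$\mathcal{L}_{2}(\mu,\tilde{\mu})=\|\mu-\tilde{\mu}\|^{2}=\int\left(\mu(y)-\tilde{\mu}(y|\phi)\right)^{2}dy=\underbrace{\|\mu\|^{2}}_{\mathcal{T}_{0}}\ \underbrace{-2\langle\mu|\tilde{\mu}\rangle}_{\mathcal{T}_{2}}+\underbrace{\|\tilde{\mu}\|^{2}}_{\mathcal{T}_{1}}\tag{2}$$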

from which terms $\mathcal{T}_{0,1,2}$ in the cost function $\mathcal{C}(\phi)$ originate (Eq. 1). Such a formulation of $\mathcal{L}_{2}$ has been used for colour transfer [11] and shape registration [14, 1]. The connection between $\mathcal{L}_{2}$ and robust M-estimators has also been shown [3, 19, 14].
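The closed-form terms $\mathcal{T}_{0}$, $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper names (`gauss`, `l2_terms`, `kde`) and the 1-D toy samples are assumptions made for illustration, and the brute-force grid integration only serves as a sanity check of the closed form.

```python
import numpy as np

def gauss(diff, var):
    """Isotropic normal density N(0; diff, var*I); the last axis of diff is the dimension."""
    dim = diff.shape[-1]
    sq_norm = np.sum(diff * diff, axis=-1)
    return np.exp(-0.5 * sq_norm / var) / (2.0 * np.pi * var) ** (dim / 2.0)

def l2_terms(y, y_tilde, h):
    """Closed-form T0, T1, T2 for two Gaussian KDEs sharing the bandwidth h."""
    var = 2.0 * h ** 2  # integrating a product of two kernels adds their variances
    t0 = gauss(y[:, None, :] - y[None, :, :], var).mean()
    t1 = gauss(y_tilde[:, None, :] - y_tilde[None, :, :], var).mean()
    t2 = -2.0 * gauss(y_tilde[:, None, :] - y[None, :, :], var).mean()
    return t0, t1, t2

# Hypothetical 1-D samples standing in for S and for the mapped set S-tilde.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=(40, 1))
y_tilde = rng.normal(0.5, 1.2, size=(30, 1))
h = 0.3

t0, t1, t2 = l2_terms(y, y_tilde, h)
l2_closed = t0 + t1 + t2

# Brute-force check: integrate (mu - mu_tilde)^2 on a fine grid (Riemann sum).
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]

def kde(points):
    """Kernel density estimate with bandwidth h evaluated on the grid."""
    return gauss(grid[:, None, None] - points[None, :, :], h ** 2).mean(axis=1)

l2_numeric = np.sum((kde(y) - kde(y_tilde)) ** 2) * dx
print(l2_closed, l2_numeric)
```

The two printed values should agree closely, which illustrates why the double sums of Eq. 1 use the covariance $2h^{2}\mathrm{I}$.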

Removing $\mathcal{T}_{0}$ from the cost function $\mathcal{C}$ .

$\mathcal{T}_{0}$ does not depend on $\phi$ and can be discarded, shortening $\mathcal{L}_{2}$ into $\mathcal{L}_{2}E$ [19] for estimating $\phi$ . Both $\mathcal{T}_{0}$ and $\mathcal{T}_{1}$ correspond to entropies since $-\log\left(\|\mu\|^{2}\right)$ and $-\log\left(\|\tilde{\mu}\|^{2}\right)$ are the quadratic Renyi entropies of $\mu$ and $\tilde{\mu}$ respectively [11].

Using correspondences.

The term $\mathcal{T}_{3}$ , which accounts for correspondences in $\widetilde{\widetilde{\mathcal{S}}}$ , is explained intuitively with the notation $-2\langle\mu|\tilde{\mu}\rangle$ by Grogan et al. [12], where this time $\mu$ and $\tilde{\mu}$ are likewise kernel density estimates (with Normal kernel) using only observations in the dataset of correspondences $\widetilde{\widetilde{\mathcal{S}}}$ :

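$$\mu(y)=\frac{1}{\tilde{\tilde{n}}}\sum_{y^{(k_{1})}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;y^{(k_{1})},h^{2}\mathrm{I})\quad\text{and}\quad\tilde{\mu}(y)=\frac{1}{\tilde{\tilde{n}}}\sum_{\tilde{y}^{(k_{2})}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;\tilde{y}^{(k_{2})},\tilde{h}^{2}\mathrm{I})$$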

and the scalar product $\langle\mu|\tilde{\mu}\rangle$ then corresponds to:

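$$\langle\mu|\tilde{\mu}\rangle=\frac{1}{\tilde{\tilde{n}}^{2}}\sum_{y^{(k_{1})}\in\widetilde{\widetilde{\mathcal{S}}}}\ \sum_{\tilde{y}^{(k_{2})}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y^{(k_{1})};\tilde{y}^{(k_{2})},(h^{2}+\tilde{h}^{2})\mathrm{I})\tag{3}$$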

Hence the notation $\langle\mu|\tilde{\mu}\rangle$ is not mathematically correct to explain $\mathcal{T}_{3}$ (note the single sum for $\mathcal{T}_{3}$ in Eq. 1 versus the double sum appearing in Eq. 3). So even if the intuition for $\mathcal{T}_{3}$ is sound and proves to be efficient in practice against state of the art techniques for colour transfer [12, 11], its origin cannot be explained mathematically with $\mathcal{L}_{2}$ . We provide next a better explanation for $\mathcal{T}_{3}$ based on Optimal Transport.
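The mismatch between the single sum of $\mathcal{T}_{3}$ and the double sum of the scalar product can be illustrated with a small experiment. This is only a sketch under assumptions: the function names (`double_sum`, `single_sum`) and the toy colour samples are hypothetical, and $h=\tilde{h}$ is assumed so that both sums use the same covariance. Shuffling the order of the mapped samples destroys the pairing: the double sum cannot see this, while the single sum drops sharply.

```python
import numpy as np

def gauss(diff, var):
    """Isotropic normal density N(0; diff, var*I); the last axis of diff is the dimension."""
    dim = diff.shape[-1]
    sq_norm = np.sum(diff * diff, axis=-1)
    return np.exp(-0.5 * sq_norm / var) / (2.0 * np.pi * var) ** (dim / 2.0)

def double_sum(y, y_tilde, h, h_tilde):
    """Scalar product <mu|mu_tilde>: every y^(k1) is compared with every y_tilde^(k2)."""
    diff = y[:, None, :] - y_tilde[None, :, :]
    return gauss(diff, h ** 2 + h_tilde ** 2).mean()

def single_sum(y, y_tilde, h):
    """Sum inside T3 (without the factor -lambda_1): only paired samples are compared."""
    return gauss(y_tilde - y, 2.0 * h ** 2).mean()

rng = np.random.default_rng(1)
y = rng.uniform(0.0, 1.0, size=(50, 3))        # hypothetical target colours y^(k)
y_tilde = y + 0.01 * rng.normal(size=y.shape)  # mapped colours, close to their pairs
y_tilde_shuffled = y_tilde[rng.permutation(len(y_tilde))]  # same points, pairing lost
h = 0.05

print("double sum :", double_sum(y, y_tilde, h, h), double_sum(y, y_tilde_shuffled, h, h))
print("single sum :", single_sum(y, y_tilde, h), single_sum(y, y_tilde_shuffled, h))
```

Only the single sum rewards a mapping that sends each $x^{(k)}$ close to its own $y^{(k)}$, which is the behaviour expected from a correspondence term.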

3 Optimal Transport

We propose to reformulate both $\mathcal{T}_{2}$ and $\mathcal{T}_{3}$ from an OT perspective. OT aims at choosing $\phi$ with the minimum transport (displacement) cost between two random vectors denoted $y$ and $\tilde{y}$ . The OT cost function is expressed here with the Wasserstein distance [16] as follows:

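$$W(\mu,\tilde{\mu})=\min_{\gamma}\left\{\iint c(y,\tilde{y})\,\gamma(y,\tilde{y})\,dy\,d\tilde{y}=\langle c|\gamma\rangle\right\}\tag{4}$$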

where $c$ is a cost often chosen as $c(y,\tilde{y})=\|y-\tilde{y}\|^{2}$ (quadratic Wasserstein distance), and $\gamma$ is the joint probability density function of $y$ and $\tilde{y}$ having $\mu$ and $\tilde{\mu}$ for marginals respectively i.e. $\int\gamma(y,\tilde{y})\ dy=\tilde{\mu}(\tilde{y})$ and $\int\gamma(y,\tilde{y})\ d\tilde{y}=\mu(y)$ . We first present our choices for these distributions (Sec. 3.1) and then propose a new robust cost (in Sec. 3.2). An alternative OT based explanation for terms $\mathcal{T}_{2}$ and $\mathcal{T}_{3}$ then emerges (Sec. 3.3).

3.1 Models for $\gamma_{\phi}$ , $\mu$ and $\tilde{\mu}_{\phi}$

Kernel density estimates with Normal kernels are used as joint density functions $\gamma_{\phi}$ and using the datasets available, three estimates of $\gamma_{\phi}\in\{\gamma_{u},\gamma_{s},\gamma_{s+u}\}$ can be proposed:

• using independent sets $\mathcal{S}$ and $\widetilde{\mathcal{S}}$ (unsupervised scenario, i.e. without correspondences):

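$$\gamma_{u}(y,\tilde{y}|\phi)=\frac{1}{n}\sum_{y^{(j)}\in\mathcal{S}}\mathcal{N}(y;y^{(j)},h^{2}\mathrm{I})\times\frac{1}{\tilde{n}}\sum_{\tilde{y}^{(i)}\in\widetilde{\mathcal{S}}}\mathcal{N}(\tilde{y};\tilde{y}^{(i)},\tilde{h}^{2}\mathrm{I})$$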

with the marginals

$$\mu_{u}(y)=\frac{1}{n}\sum_{y^{(j)}\in\mathcal{S}}\mathcal{N}(y;y^{(j)},h^{2}\mathrm{I})$$

and

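$$\tilde{\mu}_{u}(\tilde{y}|\phi)=\frac{1}{\tilde{n}}\sum_{\tilde{y}^{(i)}\in\widetilde{\mathcal{S}}}\mathcal{N}(\tilde{y};\tilde{y}^{(i)},\tilde{h}^{2}\mathrm{I})$$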

• using the set of correspondences $\widetilde{\widetilde{\mathcal{S}}}$ (supervised):

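$$\gamma_{s}(y,\tilde{y}|\phi)=\frac{1}{\tilde{\tilde{n}}}\sum_{(y^{(k)},\tilde{y}^{(k)})\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;y^{(k)},h^{2}\mathrm{I})\ \mathcal{N}(\tilde{y};\tilde{y}^{(k)},\tilde{h}^{2}\mathrm{I})$$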

providing the marginals

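$$\mu_{s}(y)=\frac{1}{\tilde{\tilde{n}}}\sum_{y^{(k)}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(y;y^{(k)},h^{2}\mathrm{I})$$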

and:

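$$\tilde{\mu}_{s}(\tilde{y}|\phi)=\frac{1}{\tilde{\tilde{n}}}\sum_{\tilde{y}^{(k)}\in\widetilde{\widetilde{\mathcal{S}}}}\mathcal{N}(\tilde{y};\tilde{y}^{(k)},\tilde{h}^{2}\mathrm{I})$$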

• using all datasets, the following mixture can be considered (semi-supervised):

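$$\gamma_{s+u}(y,\tilde{y}|\phi)=(1-\lambda)\ \gamma_{u}(y,\tilde{y}|\phi)+\lambda\ \gamma_{s}(y,\tilde{y}|\phi)$$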

where $0\leq\lambda\leq 1$ is a parameter controlling the importance between the estimates $\gamma_{u}$ and $\gamma_{s}$ . In this case, the marginals are:

$$\mu_{s+u}(y)=(1-\lambda)\ \mu_{u}(y)+\lambda\ \mu_{s}(y)$$

and

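$$\tilde{\mu}_{s+u}(\tilde{y}|\phi)=(1-\lambda)\ \tilde{\mu}_{u}(\tilde{y}|\phi)+\lambda\ \tilde{\mu}_{s}(\tilde{y}|\phi)$$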

Note that these models, denoted $\gamma_{\phi}\in\{\gamma_{u},\gamma_{s},\gamma_{s+u}\}$ , are parameterized by $\phi$ via the samples $\tilde{y}^{(l)}$ in $\widetilde{\mathcal{S}}$ and $\widetilde{\widetilde{\mathcal{S}}}$ . The bandwidths $h$ and $\tilde{h}$ are user defined and using $h=\tilde{h}=0$ enables the recovery of the empirical pdf estimates with Dirac kernels.
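The three joint models of this section can be evaluated on a grid for scalar $y$ and $\tilde{y}$, which makes it easy to check that the semi-supervised mixture has the mixture of marginals stated above. The sketch below is illustrative only: the function names, the toy samples and the value of $\lambda$ are assumptions, and the integrals are approximated by Riemann sums.

```python
import numpy as np

def normal_pdf(z, mean, h):
    """Univariate normal density N(z; mean, h^2)."""
    return np.exp(-0.5 * ((z - mean) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def kde(grid, points, h):
    """Kernel density estimate on the grid (a marginal such as mu_u or mu_s)."""
    return normal_pdf(grid[:, None], points[None, :], h).mean(axis=1)

def gamma_u(grid, S, S_tilde, h, h_tilde):
    """Unsupervised joint: product of the two independent marginals."""
    return np.outer(kde(grid, S, h), kde(grid, S_tilde, h_tilde))

def gamma_s(grid, pairs_y, pairs_yt, h, h_tilde):
    """Supervised joint: one 2-D kernel per correspondence (y^(k), y_tilde^(k))."""
    ky = normal_pdf(grid[:, None], pairs_y[None, :], h)          # shape (G, K)
    kyt = normal_pdf(grid[:, None], pairs_yt[None, :], h_tilde)  # shape (G, K)
    return ky @ kyt.T / len(pairs_y)                             # axis 0: y, axis 1: y_tilde

# Hypothetical 1-D datasets (values chosen only for illustration).
S = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])
S_tilde = np.array([-1.0, 0.2, 0.9, 1.4])
pairs_y = np.array([-1.2, 0.1, 1.0])
pairs_yt = np.array([-1.0, 0.3, 1.1])
h = h_tilde = 0.2
lam = 0.3

grid = np.linspace(-5.0, 5.0, 1001)
step = grid[1] - grid[0]

g_su = (1.0 - lam) * gamma_u(grid, S, S_tilde, h, h_tilde) \
       + lam * gamma_s(grid, pairs_y, pairs_yt, h, h_tilde)

marginal_y = g_su.sum(axis=1) * step  # integrate the joint over y_tilde
expected_y = (1.0 - lam) * kde(grid, S, h) + lam * kde(grid, pairs_y, h)
total_mass = g_su.sum() * step ** 2

print(np.max(np.abs(marginal_y - expected_y)), total_mass)
```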

3.2 Robust cost $c_{G}(y,\tilde{y})$

Concave functions $g$ to define costs $c$ of the form $c(y,\tilde{y})=g(|y-\tilde{y}|)$ have been suggested for robustness [8]. Here, we go further by proposing the following robust cost:

$$c_{G}(y,\tilde{y})=A-\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})$$

where $A$ is a constant that can be added if one needs to enforce a positive cost $c_{G}$ . Our cost $c_{G}$ is convex near the origin $\|y-\tilde{y}\|\sim 0$ and then becomes concave as the difference $\|y-\tilde{y}\|$ increases. We also note that:

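$$\langle c_{G}|\gamma\rangle=A-\iint\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})\ \gamma(y,\tilde{y})\,dy\,d\tilde{y}$$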

since $\gamma$ integrates to 1 by definition. In practice, for the estimation of $\phi$ that minimizes this cost, the constant $A$ does not matter and can be set to $A=0$ .
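The change from convex to concave can be checked numerically on the radial profile of the cost. The sketch below assumes the scalar case and writes $c_{G}$ as a function of $\epsilon=\|y-\tilde{y}\|$; the function names and the value $h_{c}=0.5$ are hypothetical. For this profile the second derivative is $\mathcal{N}(\epsilon;0,h_{c}^{2})\,(1/h_{c}^{2}-\epsilon^{2}/h_{c}^{4})$, so the sign is expected to change at $\epsilon=h_{c}$.

```python
import numpy as np

def c_G(eps, h_c, dim=1, A=0.0):
    """Robust cost as a function of the distance eps = ||y - y_tilde||."""
    norm = (2.0 * np.pi * h_c ** 2) ** (dim / 2.0)
    return A - np.exp(-0.5 * (eps / h_c) ** 2) / norm

def second_derivative(f, eps, step=1e-4):
    """Central finite-difference estimate of f''(eps)."""
    return (f(eps + step) - 2.0 * f(eps) + f(eps - step)) / step ** 2

h_c = 0.5  # hypothetical bandwidth

def cost(eps):
    return c_G(eps, h_c)

curv_near = second_derivative(cost, 0.1)       # eps < h_c: convex region
curv_at_scale = second_derivative(cost, h_c)   # eps = h_c: inflection point
curv_far = second_derivative(cost, 1.0)        # eps > h_c: concave region

print(curv_near, curv_at_scale, curv_far)
```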

3.2.1 Relation to M-estimators

With the more familiar notation for error $\epsilon=\|y-\tilde{y}\|$ , our robust cost $c_{G}$ is proportional to the Welsch-Leclerc loss $\rho_{G}$ [2]:

$$\rho_{G}(\epsilon)=1-\exp\left(-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right)^{2}\right)$$

which is a well-known hard redescending M-estimating function with scale parameter $\sigma=h_{c}$ [13, 15, 7, 2]. The less the chosen function $\rho$ penalises large errors $\epsilon$ , the more robust it is to outliers. See for instance in Fig. 3(a) how the hard redescending functions $\rho_{GM}$ (for Geman-McClure loss [6, 2]) and $\rho_{G}$ have a finite upper limit (equal to 1) when $\epsilon\rightarrow+\infty$ and thus prevent high residuals (outliers) from contributing too much when estimating $\hat{\phi}$ . The non-robust Least Squares function $\rho_{LS}$ is also shown; it corresponds here to the quadratic Wasserstein cost $c(y,\tilde{y})=\|y-\tilde{y}\|^{2}$ that is not robust to gross errors.
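The behaviour of the three losses for growing residuals can be tabulated directly. The sketch below is illustrative: the normalisations chosen for $\rho_{LS}$ and $\rho_{GM}$ are assumptions (one common convention each, picked so that $\rho_{GM}$ tends to 1), and `psi_g` is the derivative of $\rho_{G}$, shown to make the redescending behaviour visible.

```python
import numpy as np

def rho_ls(eps, sigma=1.0):
    """Least squares loss (assumed normalisation eps^2 / (2 sigma^2)): unbounded."""
    return 0.5 * (eps / sigma) ** 2

def rho_g(eps, sigma=1.0):
    """Welsch-Leclerc loss: bounded above by 1."""
    return 1.0 - np.exp(-0.5 * (eps / sigma) ** 2)

def rho_gm(eps, sigma=1.0):
    """Geman-McClure loss (assumed normalisation with upper limit 1)."""
    r2 = (eps / sigma) ** 2
    return r2 / (1.0 + r2)

def psi_g(eps, sigma=1.0):
    """Derivative of rho_g: the influence of a residual on the estimate."""
    return (eps / sigma ** 2) * np.exp(-0.5 * (eps / sigma) ** 2)

eps = np.array([0.1, 1.0, 3.0, 10.0, 100.0])
for name, rho in [("rho_LS", rho_ls), ("rho_G", rho_g), ("rho_GM", rho_gm)]:
    print(name, np.round(rho(eps), 4))
print("psi_G", psi_g(eps))
```

The influence `psi_g` grows for small residuals and then decays towards zero, so a gross outlier has almost no effect on the estimate, whereas its contribution to the least squares loss keeps growing.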

3.2.2 Relation of robust cost $c_{G}$ to Wasserstein distance

When the bandwidth $h_{c}$ (or scale parameter $\sigma$ ) is very large compared to $\epsilon$ , a Taylor approximation of the cost shows that (cf. Fig. 3(b)):

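$$\rho_{G}(\epsilon)=1-\exp\left(-\frac{\epsilon^{2}}{2\sigma^{2}}\right)\sim\frac{\epsilon^{2}}{2\sigma^{2}}\quad\text{for }\epsilon\ll\sigma$$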

making our cost $c_{G}$ proportional to the one used in the quadratic Wasserstein distance. The bandwidth $h_{c}$ allows for the modulation of the cost from the non robust Euclidean distance ( $h_{c}\rightarrow\infty$ ) to a more robust cost ( $h_{c}$ small) for penalising high differences $\|y-\tilde{y}\|$ (or outliers).
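The proportionality constant can be made explicit with a short worked expansion. This is a sketch under one assumption that is not in the text above: the constant $A$ is set to the peak value of the kernel, so that $c_{G}\geq 0$ and $c_{G}$ vanishes when $y=\tilde{y}$.

```latex
% Assumption: A = N(y; y, h_c^2 I) = (2 pi h_c^2)^{-d/2}, with d = dim(y).
\begin{align*}
c_{G}(y,\tilde{y}) &= A-\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I}),
  \qquad A=(2\pi h_{c}^{2})^{-d/2}\\
 &= A\left(1-\exp\left(-\frac{\epsilon^{2}}{2h_{c}^{2}}\right)\right)
  = A\,\rho_{G}(\epsilon),
  \qquad \epsilon=\|y-\tilde{y}\|,\ \sigma=h_{c}\\
\exp(-x) &= 1-x+\frac{x^{2}}{2}-\cdots
  \qquad\text{with } x=\frac{\epsilon^{2}}{2h_{c}^{2}}\\
c_{G}(y,\tilde{y}) &= A\left(\frac{\epsilon^{2}}{2h_{c}^{2}}
  -\frac{\epsilon^{4}}{8h_{c}^{4}}+\cdots\right)
  \approx\frac{A}{2h_{c}^{2}}\,\|y-\tilde{y}\|^{2}
  \qquad\text{for }\epsilon\ll h_{c}
\end{align*}
```

The first neglected term is smaller than the leading one by a factor $\epsilon^{2}/(4h_{c}^{2})$, which quantifies how large $h_{c}$ has to be for the quadratic approximation to hold.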

3.3 OT perspective for terms $\mathcal{T}_{2}$ and $\mathcal{T}_{3}$

Using the definitions of our cost $c_{G}=-\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})$ and our joint probability density functions $\gamma_{\phi}\in\{\gamma_{u},\gamma_{s},\gamma_{s+u}\}$ (cf. Sec. 3.1), we note that:

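$$\langle c_{G}|\gamma_{u}\rangle=-\frac{1}{n\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j=1}^{n}\mathcal{N}(0;\tilde{y}^{(i)}-y^{(j)},(h^{2}+\tilde{h}^{2}+h_{c}^{2})\mathrm{I})$$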

hence it is equivalent to the term $\mathcal{T}_{2}$ (since the bandwidths are user defined). Likewise we note

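$$\langle c_{G}|\gamma_{s}\rangle=-\frac{1}{\tilde{\tilde{n}}}\sum_{k=1}^{\tilde{\tilde{n}}}\mathcal{N}(0;\tilde{y}^{(k)}-y^{(k)},(h^{2}+\tilde{h}^{2}+h_{c}^{2})\mathrm{I})$$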

which is equivalent to $\mathcal{T}_{3}$ (Eq. 1) introduced by Grogan et al. to take advantage of correspondences [11]. Since the weight $\lambda_{1}$ was chosen in an ad hoc fashion, we can propose a more elegant alternative form combining $\mathcal{T}_{2}$ and $\mathcal{T}_{3}$ into a new term $\mathcal{T}$ using the estimate $\gamma_{s+u}$ :

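$$\langle c_{G}|\gamma_{s+u}\rangle=(1-\lambda)\ \langle c_{G}|\gamma_{u}\rangle+\lambda\ \langle c_{G}|\gamma_{s}\rangle\qquad(\mathcal{T})$$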

With the OT formulation (Eq. 4), Grogan et al’s estimation (terms $\mathcal{T}_{2}$ and $\mathcal{T}_{3}$ , Eq. 1) can be rewritten:

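$$\hat{\phi}=\arg\min_{\phi}\left\{W(\mu_{s+u},\tilde{\mu}_{s+u})=\langle c_{G}|\gamma_{s+u}\rangle\right\}$$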

to which entropic terms on the marginals $\mu$ and $\tilde{\mu}$ ( $\mathcal{T}_{0}$ and $\mathcal{T}_{1}$ ) can be added along with other constraints on $\phi$ (e.g. $\mathcal{T}_{4}$ and $\mathcal{T}_{5}$ ).

When setting $h=\tilde{h}=0$ for simplicity (i.e. using empirical pdf estimates with Dirac kernels, Sec. 3.1), Grogan et al’s terms $\mathcal{T}_{2}$ and $\mathcal{T}_{3}$ are robust OT distances where the parameter $h_{c}$ in the robust cost $c_{G}$ controls the influence of outliers when performing estimation of the mapping function $\phi$ in the same way as the scale parameter for M-estimation.
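The effect of $h_{c}$ on outliers can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes $A=0$, Dirac kernels ($h=\tilde{h}=0$), a hypothetical 1-D translation model $\phi(x)=x+t$ fitted by grid search, and toy correspondences in which a fifth of the pairs are gross outliers. The function names `cost_u`, `cost_s` and `cost_su` are made up for the example.

```python
import numpy as np

def gauss(diff, var):
    """Isotropic normal density N(0; diff, var*I); the last axis of diff is the dimension."""
    dim = diff.shape[-1]
    sq_norm = np.sum(diff * diff, axis=-1)
    return np.exp(-0.5 * sq_norm / var) / (2.0 * np.pi * var) ** (dim / 2.0)

def cost_u(y, y_tilde, h_c, h=0.0, h_tilde=0.0):
    """<c_G|gamma_u> with A = 0: all pairs between the two unpaired sets."""
    var = h ** 2 + h_tilde ** 2 + h_c ** 2
    return -gauss(y_tilde[:, None, :] - y[None, :, :], var).mean()

def cost_s(y, y_tilde, h_c, h=0.0, h_tilde=0.0):
    """<c_G|gamma_s> with A = 0: only corresponding pairs are compared."""
    var = h ** 2 + h_tilde ** 2 + h_c ** 2
    return -gauss(y_tilde - y, var).mean()

def cost_su(y, y_tilde, pairs_y, pairs_yt, h_c, lam):
    """Semi-supervised term T: convex combination of the two costs."""
    return (1.0 - lam) * cost_u(y, y_tilde, h_c) + lam * cost_s(pairs_y, pairs_yt, h_c)

# Toy correspondences: true shift of 2.0, with 5 of the 25 pairs grossly corrupted.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(25, 1))
y = x + 2.0 + 0.05 * rng.normal(size=(25, 1))
y[:5] += 10.0

# Robust estimate: grid search over the shift t minimising <c_G|gamma_s>.
shifts = np.linspace(-5.0, 15.0, 2001)
robust_costs = [cost_s(y, x + t, h_c=0.5) for t in shifts]
t_robust = shifts[int(np.argmin(robust_costs))]

# Quadratic cost: the minimiser of the mean squared residual is the mean residual.
t_ls = float(np.mean(y - x))

mixed = cost_su(y, x + t_robust, y, x + t_robust, h_c=0.5, lam=0.5)
print(t_robust, t_ls, mixed)
```

With the robust cost the estimated shift stays close to the shift of the inliers, while the quadratic cost is pulled towards the outliers. Increasing `h_c` moves the robust estimate towards the least squares one.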

3.4 Parametric Modelling of the transfer function $\phi$

In practice, a parametric form of $\phi$ is used: Thin Plate Splines (TPS) have been used for colour transfer and shape registration [14, 11, 12]. The term $\mathcal{T}_{5}$ in Eq. 1 corresponds to a smoothness constraint on the TPS solution [14, 11]:

$$\mathcal{T}_{5}=\lambda_{3}\int\left\|\frac{\partial^{2}\phi(x)}{\partial x^{2}}\right\|^{2}dx$$

However TPS is not a convenient formulation when modelling transfer functions in high dimensional spaces and Deep Neural Networks are now providing more powerful formulations for $\phi$ .

3.5 Interpretation and Generalization of the cost $c_{G}$

Our formulation of OT is equivalent to:

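$$\hat{\phi}=\arg\max_{\phi}\iint\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})\ \gamma_{\phi}(y,\tilde{y})\,dy\,d\tilde{y}$$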

where more generally $\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})$ can be understood as a conditional pdf ( $y$ given $\tilde{y}$ or vice versa since the Normal distribution is symmetric w.r.t. its mean). Using a flat prior for $\tilde{y}$ (e.g. $\tilde{y}\sim\mathcal{N}(\tilde{y};0,a\mathrm{I})$ with bandwidth $a$ very large to approximate a flat prior), then a model for the joint probability density function is available $\gamma_{m}(y,\tilde{y})=\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})\times\mathcal{N}(\tilde{y};0,a\mathrm{I})$ and our OT formulation (Eq. 17) is equivalent to:

$$\hat{\phi}=\arg\max_{\phi}\ \langle\gamma_{m}|\gamma_{\phi}\rangle$$

which has the same form as the cross product $\langle\mu|\tilde{\mu}\rangle$ appearing in $\mathcal{L}_{2}$ (cf. Eq. 2): as indicated in [11], the main difference between the two frameworks lies in the modelling of one r.v. ( $y$ in $\mathcal{L}_{2}$ , with notation $\langle.|.\rangle$ indicating integration over this one vector) or two r.v. ( $y$ and $\tilde{y}$ in OT, $\langle.|.\rangle$ indicating integration over these two vectors). These scalar products between probability density functions (joint, marginals or conditionals) are frequent in robust estimation, including for instance the Hough transform widely used in image processing [5, 7, 9]. While some robust costs can be identified as a negative log likelihood [6, 2], we instead identify our robust cost $c_{G}$ directly as a negative Multivariate Normal distribution.

4 Final remarks

We have proposed a new generic formulation for Optimal Transport with the following advantages:

• It is robust: our new robust cost $c_{G}(y,\tilde{y})=-\mathcal{N}(y;\tilde{y},h_{c}^{2}\mathrm{I})$ is parameterised by a bandwidth $h_{c}$ that acts like the scale parameter of M-estimators. This bandwidth enables the control of the level of robustness and, when chosen very large, it makes our cost converge towards the standard (non robust) quadratic Wasserstein distance.

• Our formulation can seamlessly consider various scenarios, e.g. unsupervised, supervised (with correspondences) or semi-supervised, depending on the dataset(s) available.

• Grogan et al. [11] propose the use of entropy terms for the marginals (e.g. $\tilde{\mu}$ ) that can be used in addition to (or instead of) an entropy on the joint pdf $\gamma$ [16].

• More generally, we have shown the commonality of these formulations ( $\mathcal{L}_{2}$ and OT) in using scalar products between two p.d.f.s. The main difference between $\mathcal{L}_{2}$ and OT is then in the number of random vectors used in the formulation of this scalar product. We believe this thinking extends to the Gromov-Wasserstein formulation which defines 4 random vectors [20].

Beyond the impact of our formulation for colour transfer [12, 11], future work will investigate shape registration with correspondences (e.g. for user interactions) and with kernels other than Gaussian better suited to directional data [10].

Acknowledgments

This work is partly supported by a scholarship from Umm Al-Qura University, Saudi Arabia, and in part by research grants from Science Foundation Ireland (SFI) (Grant Number 15/RP/2776), and the ADAPT Centre for Digital Content Technology (www.adaptcentre.ie) that is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Bibliography

[1] C. Arellano and R. Dahyot. Robust ellipse detection with Gaussian mixture models. Pattern Recognition, 58:12–26, 2016.
[2] J. T. Barron. A more general robust loss function. CoRR, abs/1701.03077, 2017.
[3] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
[4] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, Sep. 2017.
[5] R. Dahyot. Statistical Hough transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8):1502–1509, Aug. 2009.
[6] R. Dahyot, P. Charbonnier, and F. Heitz. A Bayesian approach to object detection using probabilistic appearance-based models. Pattern Analysis and Applications, 7(3):317–332, Dec. 2004.
[7] R. Dahyot and J. Ruttle. Generalised relaxed Radon transform (GR2T) for robust inference. Pattern Recognition, 46(3):788–794, 2013.
[8] J. Delon, J. Salomon, and A. Sobolevski. Local matching indicators for transport problems with concave costs. SIAM Journal on Discrete Mathematics, 26(2):801–827, 2012.
