(Martingale) Optimal Transport And Anomaly Detection With Neural Networks: A Primal-dual Algorithm

Pierre Henry-Labordere

arXiv:1904.04546·math.OC·April 12, 2019

TL;DR

This paper presents a primal-dual algorithm for martingale optimal transport problems, with applications to anomaly detection and financial data generation, leveraging cost functions similar to those used in training GANs.

Contribution

It introduces a novel primal-dual algorithm for martingale optimal transport problems with practical applications in anomaly detection and financial data synthesis.

Findings

01

Effective algorithm for martingale optimal transport

02

Successful application to anomaly detection tasks

03

Generates realistic financial data samples

Abstract

In this paper, we introduce a primal-dual algorithm for solving (martingale) optimal transportation problems, with cost functions satisfying the twist condition, close to the one that has been used recently for training generative adversarial networks. As some additional applications, we consider anomaly detection and automatic generation of financial data.

Equations

$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\sup_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2})]$$

$$\mathrm{P}_{c}:=\sup_{\mathbb{P}\in{\cal M}}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2},\dots,S_{n})]$$

$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\inf_{u_{1}\in\mathrm{L}^{1}(\mu^{1}),\,u_{2}\in\mathrm{L}^{1}(\mu^{2}),\,h\in C_{b}({\mathbb{R}}^{d},{\mathbb{R}}^{d})}{\mathbb{E}}^{\mu^{1}}[u_{1}]+{\mathbb{E}}^{\mu^{2}}[u_{2}]$$

$$u_{1}(s_{1})+u_{2}(s_{2})+h(s_{1}).(s_{2}-s_{1})\geq c(s_{1},s_{2})$$

$$\mathrm{MK}^{\epsilon}_{c}(\mu^{1},\mu^{2}):=\sup_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2})]-\epsilon H(\mathbb{P}|\mathbb{P}^{0})$$

$$\mathrm{MK}^{\epsilon}_{c}(\mu^{1},\mu^{2}):=\inf_{u_{1}\in\mathrm{L}^{1}(\mu^{1}),\,u_{2}\in\mathrm{L}^{1}(\mu^{2}),\,h\in C_{b}({\mathbb{R}}^{d},{\mathbb{R}}^{d})}$$

$$e^{-\frac{u_{1}(s_{1})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{2}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}(s_{2})-h(s_{1}).(s_{2}-s_{1})\right)}$$

$$e^{-\frac{u_{2}(s_{2})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{1}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{1}(s_{1})-h(s_{1}).(s_{2}-s_{1})\right)}$$

$$\int p_{0}(s_{1},s_{2})\,ds_{2}\,(s_{2}-s_{1})\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}(s_{2})-h(s_{1}).(s_{2}-s_{1})\right)}$$

$$e^{-\frac{u_{1}^{(n)}(s_{1})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{2}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}^{(n-1)}(s_{2})-h^{(n-1)}(s_{1}).(s_{2}-s_{1})\right)}$$

$$h(s_{1}):=\theta\ \text{ s.t. }\ \int p_{0}(s_{1},s_{2})\,ds_{2}\,(s_{2}-s_{1})\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}^{(n-1)}(s_{2})-\theta.(s_{2}-s_{1})\right)}$$

$$e^{-\frac{u_{2}^{(n)}(s_{2})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{1}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{1}^{(n)}(s_{1})-h^{(n)}(s_{1}).(s_{2}-s_{1})\right)}$$

$$\mathrm{MK}^{\epsilon}_{c}(\mu^{1},\mu^{2})$$

$$\mathrm{MK}^{\gamma}_{c}(\mu^{1},\mu^{2}):$$

$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\sup_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2})]$$

$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):$$

$$\mathbb{P}^{*}(ds_{1},ds_{2})=\mu^{1}(ds_{1})\,\delta(s_{2}-T(s_{1}))\,ds_{2}$$
$$\mathrm{MK}_{c}(\mu^{1},\mu^{2})=\int_{0}^{1}c\left(F_{1}^{-1}(u),F_{2}^{-1}(u)\right)du$$
$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\inf_{u\in\mathrm{L}^{1}(\mu^{2})}\ \sup_{T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}}{\mathbb{E}}^{\mu^{1}}[c(S_{1},T(S_{1}))-u(T(S_{1}))]+{\mathbb{E}}^{\mu^{2}}[u(S_{2})]$$

$$\mathrm{MK}^{t,u}_{c}(\mu^{1},\mu^{2}):=\min_{\theta\in{\mathbb{R}}^{u}}\ \max_{\omega\in{\mathbb{R}}^{t}}{\mathbb{E}}^{\mu^{1}}[c(S_{1},T_{\omega}(S_{1}))-u_{\theta}(T_{\omega}(S_{1}))]+{\mathbb{E}}^{\mu^{2}}[u_{\theta}(S_{2})]$$

$$\left({\cal W}_{p}(\mu^{1},\mu^{2})\right)^{p}:=\inf_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[|S_{2}-S_{1}|^{p}]$$

$$\mathrm{P}$$

$$\mathrm{P}=\sup_{u\in\mathrm{L}^{1}(\mu^{\mathrm{real}})}\ \inf_{\hat{T}:{\mathbb{R}}^{l}\mapsto{\mathbb{R}}^{d},\,T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}}{\mathbb{E}}^{\mu^{\mathrm{real}}}[c(S_{1},T(S_{1}))-u(T(S_{1}))]+{\mathbb{E}}^{\mu^{0}}[u(\hat{T}(S_{0}))]$$

$$\mathrm{P}=\sup_{u\in\mathrm{L}^{1}(\mu^{\mathrm{real}})}\ \inf_{\hat{T}:{\mathbb{R}}^{l}\mapsto{\mathbb{R}}^{d},\,T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}}{\mathbb{E}}^{\mu^{\mathrm{real}}}[|S_{1}-T(S_{1})|-u(T(S_{1}))]+{\mathbb{E}}^{\mu^{0}}[u(\hat{T}(S_{0}))]$$

$$\mathrm{P}=\sup_{u\in\mathrm{Lip}_{1}}\ \inf_{\hat{T}:{\mathbb{R}}^{l}\mapsto{\mathbb{R}}^{d}}-{\mathbb{E}}^{\mu^{\mathrm{real}}}[u(S_{1})]+{\mathbb{E}}^{\mu^{0}}[u(\hat{T}(S_{0}))]$$

$$\mathrm{P}$$

$$\mathrm{P}^{\gamma}$$

$$\hat{T}_{\#}\mu^{0}(x_{\mathrm{anomaly}})\leq\lambda$$

$$\min_{\theta\in{\mathbb{R}}^{u}}\ \max_{\omega\in{\mathbb{R}}^{t}}\frac{1}{N_{\mathrm{MC}}}\sum_{i=1}^{N_{\mathrm{MC}}}J_{i}(\theta,\omega)$$

$$J_{i}(\theta,\omega):=c(S_{1}^{i},T_{\omega}(S_{1}^{i}))-u_{\theta}(T_{\omega}(S_{1}^{i}))+u_{\theta}(S_{2}^{i})$$

$$\theta_{n+1}$$

Full text

(Martingale) Optimal Transport and anomaly detection with neural networks: a primal-dual algorithm

Pierre Henry-Labordère

Société Générale, Global markets Quantitative Research

CMAP, Ecole Polytechnique

Key words and phrases:

(Martingale) optimal transport, Arrow-Hurwicz’s algorithm, generative adversarial networks, anomaly detection

1. Introduction

We introduce a primal-dual algorithm for solving (martingale) optimal transport problems (MOT for short), potentially large-scale, using neural networks. The martingale optimal transport problem, first introduced in [2] and in a continuous-time setting in [11], can be defined in a discrete-time setting as the following infinite-dimensional linear program:

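$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\sup_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2})]\tag{1}$$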

where ${\cal M}(\mu^{1},\mu^{2}):=\{\mathbb{P}\in{\cal P}({\mathbb{R}}^{d}\times{\mathbb{R}}^{d})\;:\;S_{1}\overset{\mathbb{P}}{\sim}\mu^{1},\quad S_{2}\overset{\mathbb{P}}{\sim}\mu^{2},\quad{\mathbb{E}}^{\mathbb{P}}[S_{2}|S_{1}]=S_{1}\}$ is a weakly compact convex set and ${\cal P}({\mathbb{R}}^{d}\times{\mathbb{R}}^{d})$ is the set of probability measures on ${\mathbb{R}}^{d}\times{\mathbb{R}}^{d}$ (or ${\mathbb{R}}_{+}^{d}\times{\mathbb{R}}_{+}^{d}$ if the random variables $S_{1}$ and $S_{2}$ are interpreted as financial asset prices). A similar definition applies by replacing the supremum over ${\cal M}(\mu^{1},\mu^{2})$ by an infimum. $\mathrm{MK}_{c}(\mu^{1},\mu^{2})$ is a number which depends on a cost function $c:{\mathbb{R}}^{d}\times{\mathbb{R}}^{d}\mapsto{\mathbb{R}}$ and two marginal distributions $\mu^{1}$ and $\mu^{2}$ defined on ${\mathbb{R}}^{d}$ . In comparison with classical OT, we have an additional martingale constraint ${\mathbb{E}}^{\mathbb{P}}[S_{2}|S_{1}]=S_{1}$ and the linear program is well-posed if and only if $\mu^{1}\leq\mu^{2}$ in the convex order. In mathematical finance, $\mathrm{MK}_{c}(\mu^{1},\mu^{2})$ can then be interpreted as the model-independent arbitrage-free optimal upper bound for a payoff $c(S_{1},S_{2})$ depending on an asset $S_{\cdot}\in{\mathbb{R}}^{d}$ evaluated at two maturities $t_{1}<t_{2}$ , i.e., $S_{1}:=S_{t_{1}}$ , $S_{2}:=S_{t_{2}}$ , which is consistent with the prices (at $t=0$ ) of $t_{1}$ and $t_{2}$ ( $d$ -dimensional) European basket options (see [15] for an extensive introduction to MOT and its relevance in arbitrage-free pricing). Our algorithm, described in Section 3, can also be applied to more general linear programs of the form:

$$\mathrm{P}_{c}:=\sup_{\mathbb{P}\in{\cal M}}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2},\dots,S_{n})]$$

where ${\cal M}$ is a weak-compact convex subset of ${\cal P}(({\mathbb{R}}^{d})^{n})$ , see for example the multi-marginals (M)OT. However, our algorithm will be applicable only to cost functions satisfying a (martingale) twist condition. Although the extension of our algorithm to this more general setting is straightforward, we prefer for the sake of simplicity to focus on (martingale) OT as defined by (1). Most of the numerical schemes of (M)OT, that we will describe, rely strongly on the dual Monge-Kantorovich formulation in which $\mathrm{MK}_{c}(\mu^{1},\mu^{2})$ can be written as (see [2] for a proof in the context of MOT):

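$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\inf_{u_{1}\in\mathrm{L}^{1}(\mu^{1}),\,u_{2}\in\mathrm{L}^{1}(\mu^{2}),\,h\in C_{b}({\mathbb{R}}^{d},{\mathbb{R}}^{d})}{\mathbb{E}}^{\mu^{1}}[u_{1}]+{\mathbb{E}}^{\mu^{2}}[u_{2}]\tag{2}$$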

such that for all $(s_{1},s_{2})\in{\mathbb{R}}^{d}\times{\mathbb{R}}^{d}$

$$u_{1}(s_{1})+u_{2}(s_{2})+h(s_{1}).(s_{2}-s_{1})\geq c(s_{1},s_{2})\tag{3}$$

By definition, $h(s_{1}).(s_{2}-s_{1}):=\sum_{i=1}^{d}h_{i}(s_{1})(s^{i}_{2}-s^{i}_{1})$ .

2. Numerical algorithms: A short overview

In this section, we review three numerical algorithms for solving (martingale) optimal transport and highlight their main drawbacks (footnote 1: we acknowledge G. Peyré for useful discussions). These algorithms will be compared to our primal-dual method in Section 4.

2.1. Simplex and cutting-plane

The problem (2) (resp. (1)) defines a linear program that can be solved using a simplex algorithm. In the context of MOT, this has been explored in [16]. By discretizing the measures $\mu^{1}$ and $\mu^{2}$ on a large grid $G_{\infty}$ in ${\mathbb{R}}^{d}\times{\mathbb{R}}^{d}$ , we obtain a finite-dimensional linear program. Due to the large number $N:=\mathrm{card}(G_{\infty})$ of linear constraints (3), one can use a cutting-plane algorithm, see [16] for extensive details. This consists in first solving the linear program on a small grid $G_{0}\subset G_{\infty}$ ( $\mathrm{card}(G_{0})\ll\mathrm{card}(G_{\infty})$ ). The optimal bound $\mathrm{MK}^{(0)}_{c}(\mu^{1},\mu^{2})$ is attained by the dual variables $(u_{1}^{(0)},u_{2}^{(0)},h^{(0)})$ . Then we check on the full grid $G_{\infty}$ whether this optimal dual solution violates the linear constraints (3). The points of $G_{\infty}$ where the linear constraints are not satisfied are then added to the grid $G_{0}$ , defining a new refined grid $G_{1}$ . By construction, we obtain $\mathrm{MK}_{c}(\mu^{1},\mu^{2})\geq\mathrm{MK}^{(1)}_{c}(\mu^{1},\mu^{2})\geq\mathrm{MK}^{(0)}_{c}(\mu^{1},\mu^{2})$ as $G_{0}\subset G_{1}\subset G_{\infty}$ . The procedure is then iterated until the optimal dual solution $(u_{1}^{(n)},u_{2}^{(n)},h^{(n)})$ at step $(n)$ satisfies all the constraints on $G_{\infty}$ , at which point we can conclude that we have converged towards the true solution. Despite its simplicity, this algorithm does not extend to high dimensions, as the number of constraints explodes with the dimension. For example, the complexity of the Hungarian/auction algorithms is $O(N^{3})$ .

2.2. Entropic relaxation

Another approach is to introduce an entropy penalization (or more generally an $f$ -divergence):

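$$\mathrm{MK}^{\epsilon}_{c}(\mu^{1},\mu^{2}):=\sup_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2})]-\epsilon H(\mathbb{P}|\mathbb{P}^{0})$$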

where $H(\mathbb{P}|\mathbb{P}^{0}):={\mathbb{E}}^{\mathbb{P}}[\left(\ln{d\mathbb{P}\over d\mathbb{P}^{0}}-1\right)]$ is the relative entropy with respect to a prior probability measure $\mathbb{P}^{0}\in{\cal P}({\mathbb{R}}^{d}\times{\mathbb{R}}^{d})$ and $\epsilon$ is a positive parameter taken to be small. In particular, $\lim_{\epsilon\rightarrow 0}\mathrm{MK}^{\epsilon}_{c}(\mu^{1},\mu^{2})=\mathrm{MK}_{c}(\mu^{1},\mu^{2})$ . The problem $\mathrm{MK}^{\epsilon}_{c}(\mu^{1},\mu^{2})$ can be dualized using the Fenchel-Rockafellar theorem into a strictly convex optimization problem [16]:

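$$\mathrm{MK}^{\epsilon}_{c}(\mu^{1},\mu^{2}):=\inf_{u_{1}\in\mathrm{L}^{1}(\mu^{1}),\,u_{2}\in\mathrm{L}^{1}(\mu^{2}),\,h\in C_{b}({\mathbb{R}}^{d},{\mathbb{R}}^{d})}{\mathbb{E}}^{\mu^{1}}[u_{1}]+{\mathbb{E}}^{\mu^{2}}[u_{2}]+\epsilon\,{\mathbb{E}}^{\mathbb{P}^{0}}\left[e^{\frac{1}{\epsilon}\left(c(S_{1},S_{2})-u_{1}(S_{1})-u_{2}(S_{2})-h(S_{1}).(S_{2}-S_{1})\right)}\right]\tag{4}$$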

2.2.1. Sinkhorn’s algorithm

By computing the gradients with respect to $u_{1}$ , $u_{2}$ and $h$ , we obtain the first-order optimality conditions:

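$$\mu^{1}(s_{1})=e^{-\frac{u_{1}(s_{1})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{2}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}(s_{2})-h(s_{1}).(s_{2}-s_{1})\right)}\tag{5}$$

$$\mu^{2}(s_{2})=e^{-\frac{u_{2}(s_{2})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{1}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{1}(s_{1})-h(s_{1}).(s_{2}-s_{1})\right)}\tag{6}$$

$$0=\int p_{0}(s_{1},s_{2})\,ds_{2}\,(s_{2}-s_{1})\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}(s_{2})-h(s_{1}).(s_{2}-s_{1})\right)}\tag{7}$$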

For the sake of simplicity, we have assumed here that $\mathbb{P}^{0}$ , $\mu^{1}$ and $\mu^{2}$ are absolutely-continuous with respect to the Lebesgue measure. The Sinkhorn algorithm can be then described by the following steps:

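(1) Set $n:=1$ and set $u^{(0)}_{1}:=0$ , $u^{(0)}_{2}:=0$ , $h^{(0)}:=0$ for convenience. We approximate the measures $\mu^{1}$ and $\mu^{2}$ by Dirac masses supported on $N$ points $(s_{1}^{i})_{1\leq i\leq N}$ and $(s_{2}^{i})_{1\leq i\leq N}$ .

(2) Compute $u_{1}^{(n)}(s_{1})$ for all $(s_{1}^{i})_{1\leq i\leq N}$ using

$$\mu^{1}(s_{1})=e^{-\frac{u_{1}^{(n)}(s_{1})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{2}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}^{(n-1)}(s_{2})-h^{(n-1)}(s_{1}).(s_{2}-s_{1})\right)}$$

(3) Compute $h^{(n)}(s_{1})$ for all $(s_{1}^{i})_{1\leq i\leq N}$ by finding the (unique) zero $\theta\in{\mathbb{R}}^{d}$ of

$$\int p_{0}(s_{1},s_{2})\,ds_{2}\,(s_{2}-s_{1})\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{2}^{(n-1)}(s_{2})-\theta.(s_{2}-s_{1})\right)}$$

(4) Compute $u_{2}^{(n)}(s_{2})$ for all $(s_{2}^{i})_{1\leq i\leq N}$ using

$$\mu^{2}(s_{2})=e^{-\frac{u_{2}^{(n)}(s_{2})}{\epsilon}}\int p_{0}(s_{1},s_{2})\,ds_{1}\,e^{\frac{1}{\epsilon}\left(c(s_{1},s_{2})-u_{1}^{(n)}(s_{1})-h^{(n)}(s_{1}).(s_{2}-s_{1})\right)}$$

(5) Set $n:=n+1$ and iterate steps (2), (3) and (4) until convergence.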

The use of the Sinkhorn algorithm for solving OT problems was introduced in [6], and in [14], [7] in the context of MOT (see also [8] for an application to the construction of arbitrage-free implied volatility surfaces). Again, this algorithm does not scale well with the dimension, as at each Sinkhorn iteration $u_{1}^{(n)}(s_{1}),h^{(n)}(s_{1}),u_{2}^{(n)}(s_{2})$ must be computed on a grid whose cardinality explodes with the dimension $d$ . The overall complexity is $O(N^{2}\ln N)$ .
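To make the iteration concrete, the following minimal Python sketch alternates the two potential updates for classical OT on discrete measures. It is an illustration under simplifying assumptions, not the implementation used in [6] or [14]: the martingale step is dropped ( $h\equiv 0$ ), the prior is taken to be the product measure $\mathbb{P}^{0}=\mu^{1}\otimes\mu^{2}$ , and the function name `sinkhorn_ot` and its arguments are hypothetical.

```python
import math


def sinkhorn_ot(mu1, mu2, cost, eps, n_iter=500):
    """Entropic OT (sup convention) on discrete measures, without martingale step.

    mu1, mu2 : lists of positive weights summing to one
    cost     : cost[i][j] = c(s1_i, s2_j)
    eps      : entropic regularisation parameter
    Assumption: the prior P0 is the product measure mu1 x mu2.
    """
    n, m = len(mu1), len(mu2)
    p0 = [[mu1[i] * mu2[j] for j in range(m)] for i in range(n)]
    u1 = [0.0] * n
    u2 = [0.0] * m
    for _ in range(n_iter):
        # Update u1 so that the first marginal of the plan equals mu1.
        for i in range(n):
            s = sum(p0[i][j] * math.exp((cost[i][j] - u2[j]) / eps) for j in range(m))
            u1[i] = eps * math.log(s / mu1[i])
        # Update u2 so that the second marginal of the plan equals mu2.
        for j in range(m):
            s = sum(p0[i][j] * math.exp((cost[i][j] - u1[i]) / eps) for i in range(n))
            u2[j] = eps * math.log(s / mu2[j])
    plan = [
        [p0[i][j] * math.exp((cost[i][j] - u1[i] - u2[j]) / eps) for j in range(m)]
        for i in range(n)
    ]
    return u1, u2, plan


# Toy example with the cost c(s1, s2) = -(s1 - s2)^2.
points1, points2 = [0.0, 1.0], [0.0, 1.0, 2.0]
weights1, weights2 = [0.3, 0.7], [0.2, 0.5, 0.3]
toy_cost = [[-(a - b) ** 2 for b in points2] for a in points1]
_, _, toy_plan = sinkhorn_ot(weights1, weights2, toy_cost, eps=0.5)
```

Each sweep loops over the full $N\times N$ kernel, which is where the quadratic cost per iteration comes from.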

2.3. and neural networks…

In [19], the optimization (4) is solved by approximating the potentials $u_{1},u_{2}$ (and $h$ ) by some neural networks and then the training is achieved using a stochastic gradient descent algorithm. Similarly, by using Equation (5), the problem (4) can be converted into an equivalent form which involves only the potentials $u_{2}$ and $h$ :

[TABLE]

which is solved similarly. In [12], instead of using neural networks, the authors expand the dual variables in a reproducing kernel Hilbert space. Although this algorithm scales properly with the dimension in practice, we will illustrate in our numerical experiments that the computations are unstable when $\epsilon$ becomes small. This has also been reported in [12].

2.4. Penalization

In [10], the optimization $\mathrm{MK}_{c}(\mu^{1},\mu^{2})$ is approximated by

[TABLE]

where $\gamma$ is a large parameter. This ensures that by taking $\gamma$ large, the optimal dual solution $(u_{1}^{*},u_{2}^{*},h^{*})$ will satisfy the linear constraints (3) and therefore $\lim_{\gamma\rightarrow\infty}\mathrm{MK}^{\gamma}_{c}(\mu^{1},\mu^{2})=\mathrm{MK}_{c}(\mu^{1},\mu^{2})$ . As above, the potentials $u_{1}$ , $u_{2}$ and $h$ are approximated by neural networks. This is a classical technique for solving linear programs by penalization, and in practice the parameter $\gamma_{t}$ is chosen to increase to a large value as the learning rate $\eta_{t}$ used in the stochastic gradient descent decreases. In our numerical experiments, we will illustrate that this algorithm is unstable when the parameter $\gamma$ is chosen large in order to converge to the true solution. Finally, let us remark that the penalization method can be obtained by replacing the entropy penalization $H(\mathbb{P}|\mathbb{P}^{0})$ by the $\mathrm{L}^{2}$ -divergence $f(\mathbb{P}|\mathbb{P}^{0}):={\mathbb{E}}^{\mathbb{P}^{0}}[\left({d\mathbb{P}\over d\mathbb{P}^{0}}\right)^{2}]$ .

3. A primal-dual algorithm

3.1. A saddle-point formulation

For the sake of clarity, we explain our algorithm in the case of the classical OT problem which consists in solving

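$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\sup_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[c(S_{1},S_{2})]$$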

where ${\cal M}(\mu^{1},\mu^{2}):=\{\mathbb{P}\in{\cal P}({\mathbb{R}}^{d},{\mathbb{R}}^{d})\;:\;S_{1}\overset{\mathbb{P}}{\sim}\mu^{1},\quad S_{2}\overset{\mathbb{P}}{\sim}\mu^{2}\}$ . By introducing the Lagrange multipliers $u_{1}$ and $u_{2}$ associated to the two marginal constraints, this problem can be written as a minimax (relaxed) optimization problem:

[TABLE]

where ${\cal M}_{+}$ denotes the space of positive measures on ${\mathbb{R}}^{d}\times{\mathbb{R}}^{d}$ .

3.2. Using Brenier’s theorem

Definition 3.1 (Twist condition).

A function $c\in C({\mathbb{R}}^{d}\times{\mathbb{R}}^{d})$ differentiable with respect to $s_{1}$ is said to be twisted if $\forall s_{0}\in{\mathbb{R}}^{d}$ , the map $s_{2}\in{\mathbb{R}}^{d}\mapsto\nabla_{s_{1}}c(s_{0},s_{2})$ is one-to-one.

We recall the Brenier theorem (see e.g. [20]):

Theorem 3.2 (Brenier’s theorem).

By assuming that $\mu^{1}$ is absolutely continuous with respect to the Lebesgue measure and the cost function $c$ satisfies the twist condition, the optimal probability measure $\mathbb{P}^{*}$ , solution of the above saddle-point problem (8), is supported on a unique map $T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}$ :

$$\mathbb{P}^{*}(ds_{1},ds_{2})=\mu^{1}(ds_{1})\,\delta(s_{2}-T(s_{1}))\,ds_{2}$$

Note that the constraints $S_{1}\overset{\mathbb{P}^{*}}{\sim}\mu^{1}$ and $S_{2}\overset{\mathbb{P}^{*}}{\sim}\mu^{2}$ imply the requirement $T_{\#}\mu^{1}=\mu^{2}$ where $T_{\#}\mu^{1}$ denotes the push-forward of the measure $\mu^{1}$ by the map $T$ . $T$ can be characterized as the unique solution of a Monge-Ampère-like equation. More precisely, in the case of the quadratic cost function, $T$ is the gradient of a convex function solving the Monge-Ampère PDE (see e.g. [20]).

Remark 3.3 (Fréchet-Hoeffding $d=1$ ).

Under the (twist) condition $\partial_{s_{1}s_{2}}c\geq 0$ in $d=1$ , the optimal transport can be solved analytically and it is given by the Fréchet-Hoeffding solution:

$$\mathrm{MK}_{c}(\mu^{1},\mu^{2})=\int_{0}^{1}c\left(F_{1}^{-1}(u),F_{2}^{-1}(u)\right)du\tag{9}$$

The map is then $T(s)=F_{2}^{-1}\circ F_{1}(s)$ with $F_{i}$ the cumulative distribution of $\mu^{i}$ .
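For two empirical measures with the same number of equally weighted points, the Fréchet-Hoeffding coupling $T=F_{2}^{-1}\circ F_{1}$ simply pairs the sorted samples. The short Python sketch below illustrates this and compares the resulting value with a brute-force supremum over all permutation couplings. The function names and the toy data are hypothetical, and the brute-force search is only feasible for a handful of points.

```python
from itertools import permutations


def frechet_hoeffding_value(x, y, cost):
    """Average of c over the comonotone pairing of two equal-size samples."""
    xs, ys = sorted(x), sorted(y)
    return sum(cost(a, b) for a, b in zip(xs, ys)) / len(xs)


def brute_force_sup(x, y, cost):
    """Supremum of the average cost over all one-to-one pairings (tiny n only)."""
    n = len(x)
    return max(
        sum(cost(x[i], y[j]) for i, j in enumerate(perm)) / n
        for perm in permutations(range(n))
    )


sample1 = [3.0, 1.0, 2.0]
sample2 = [10.0, 30.0, 20.0]


def cost_sum_squared(a, b):
    # c(s1, s2) = (s1 + s2)^2 has a non-negative cross derivative.
    return (a + b) ** 2


def cost_minus_quadratic(a, b):
    # c(s1, s2) = -(s1 - s2)^2 also has a non-negative cross derivative.
    return -((a - b) ** 2)


value_sum_squared = frechet_hoeffding_value(sample1, sample2, cost_sum_squared)
```

For both costs the cross derivative $\partial_{s_{1}s_{2}}c$ equals $2$ , so the sorted pairing attains the supremum.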

Under the twist condition, the above minimax optimization (8) can therefore be simplified as

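$$\mathrm{MK}_{c}(\mu^{1},\mu^{2}):=\inf_{u\in\mathrm{L}^{1}(\mu^{2})}\ \sup_{T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}}{\mathbb{E}}^{\mu^{1}}[c(S_{1},T(S_{1}))-u(T(S_{1}))]+{\mathbb{E}}^{\mu^{2}}[u(S_{2})]\tag{10}$$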

Note that as $S_{1}\overset{\mathbb{P}^{*}}{\sim}\mu^{1}$ , the potential $u_{1}$ has disappeared and the minimax optimization involves now only the potential $u:=u_{2}$ and the Brenier map $T$ .

3.3. and neural networks…

We then approximate the two unknowns $u:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}$ and $T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}$ with two neural networks depending respectively on some weights $\theta\in{\mathbb{R}}^{u}$ and $\omega\in{\mathbb{R}}^{t}$ . $\mathrm{MK}_{c}(\mu^{1},\mu^{2})$ can then be approximated by

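$$\mathrm{MK}^{t,u}_{c}(\mu^{1},\mu^{2}):=\min_{\theta\in{\mathbb{R}}^{u}}\ \max_{\omega\in{\mathbb{R}}^{t}}{\mathbb{E}}^{\mu^{1}}[c(S_{1},T_{\omega}(S_{1}))-u_{\theta}(T_{\omega}(S_{1}))]+{\mathbb{E}}^{\mu^{2}}[u_{\theta}(S_{2})]\tag{11}$$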

In particular, from the universal approximation property of neural networks, we have $\lim_{t,u\rightarrow\infty}\mathrm{MK}^{t,u}_{c}=\mathrm{MK}_{c}$ .

3.4. Link with Wasserstein generative adversarial networks

The $p$ -Wasserstein distance ${\cal W}_{p}(\mu^{1},\mu^{2})$ corresponds to an OT problem with a $\mathrm{L}^{p}$ -cost in ${\mathbb{R}}^{d}$ , $c(s_{1},s_{2}):=|s_{2}-s_{1}|^{p}$ :

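$$\left({\cal W}_{p}(\mu^{1},\mu^{2})\right)^{p}:=\inf_{\mathbb{P}\in{\cal M}(\mu^{1},\mu^{2})}{\mathbb{E}}^{\mathbb{P}}[|S_{2}-S_{1}|^{p}]$$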

${\cal W}_{p}$ then defines a distance which metrizes the space ${\cal P}({\mathbb{R}}^{d})$ (see e.g. [20]). If we consider a probability measure $\mu^{\mathrm{real}}$ in ${\mathbb{R}}^{d}$ corresponding to some real data, one would like to reconstruct this density using a mapping $\hat{T}:{\mathbb{R}}^{l}\mapsto{\mathbb{R}}^{d}$ with $l\ll d$ , such that the push-forward by $\hat{T}$ of a prior density $\mu^{0}$ supported on ${\mathbb{R}}^{l}$ (e.g. a uniform or Gaussian density for the sake of simplicity) is as close as possible to $\mu^{\mathrm{real}}$ with respect to the Wasserstein distance. The mapping $\hat{T}$ is then chosen to be the solution of

[TABLE]

Note that $H(\hat{T}_{\#}\mu^{0}|\mu^{\mathrm{real}})=+\infty$ , which is why it is not possible to use the relative entropy as in the case of maximum likelihood estimation. Using the saddle-point formulation of the Wasserstein distance (the $\mathrm{L}^{p}$ -cost satisfies the twist condition) explained in the previous section, this is equivalent to the following minimax optimization:

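$$\mathrm{P}=\sup_{u\in\mathrm{L}^{1}(\mu^{\mathrm{real}})}\ \inf_{\hat{T}:{\mathbb{R}}^{l}\mapsto{\mathbb{R}}^{d},\,T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}}{\mathbb{E}}^{\mu^{\mathrm{real}}}[c(S_{1},T(S_{1}))-u(T(S_{1}))]+{\mathbb{E}}^{\mu^{0}}[u(\hat{T}(S_{0}))]$$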

This problem is similar to (10) and therefore as described in Section 3.6, our algorithm is close in spirit to the one used for training Wasserstein generative adversarial networks [1] (see also [13]).

Specializing to $p=1$ , we get

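$$\mathrm{P}=\sup_{u\in\mathrm{L}^{1}(\mu^{\mathrm{real}})}\ \inf_{\hat{T}:{\mathbb{R}}^{l}\mapsto{\mathbb{R}}^{d},\,T:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}^{d}}{\mathbb{E}}^{\mu^{\mathrm{real}}}[|S_{1}-T(S_{1})|-u(T(S_{1}))]+{\mathbb{E}}^{\mu^{0}}[u(\hat{T}(S_{0}))]$$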

This should be compared with the dual formulation of the $1$ -Wasserstein distance used in [1]

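$$\mathrm{P}=\sup_{u\in\mathrm{Lip}_{1}}\ \inf_{\hat{T}:{\mathbb{R}}^{l}\mapsto{\mathbb{R}}^{d}}-{\mathbb{E}}^{\mu^{\mathrm{real}}}[u(S_{1})]+{\mathbb{E}}^{\mu^{0}}[u(\hat{T}(S_{0}))]$$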

where the supremum is over all $1$ -Lipschitz functions. In [1], the Lipschitz constraint is enforced by brute force through weight clipping.

Starting from the primal formula of OT and using the Brenier theorem, $\mathrm{P}$ can also be written as

[TABLE]

This was done in [4] although the Brenier result is not mentioned. The constraint $T_{\#}\mu^{\mathrm{real}}=\hat{T}_{\#}\mu^{0}$ is then implemented by adding a penalty term $\gamma D(\cdot|\cdot)$ with $\gamma$ large:

[TABLE]

One obtains the Wasserstein-VAE formulation.

3.5. Anomaly detector and data generator

Let us consider some real data generated by a density $\mu^{\mathrm{real}}$ and let us choose a prior density $\mu^{0}$ supported on a low-dimensional manifold. As outlined above, we find the density $\hat{T}_{\#}\mu^{0}$ such that the $p$ -Wasserstein distance ${\cal W}_{p}(\mu^{\mathrm{real}},\hat{T}_{\#}\mu^{0})$ is minimized. Then, a data point $x_{\mathrm{anomaly}}$ will be considered as an anomaly if $\hat{T}_{\#}\mu^{0}(x_{\mathrm{anomaly}})$ is below a certain threshold $\lambda$ :

$$\hat{T}_{\#}\mu^{0}(x_{\mathrm{anomaly}})\leq\lambda$$

Similarly, a new data point $x_{\mathrm{new}}$ can be generated by drawing a random variable $Z$ distributed according to $\mu^{0}$ and setting $x_{\mathrm{new}}=\hat{T}(Z)$ .
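The thresholding rule can be sketched in a few lines of Python. The text does not specify how the push-forward density $\hat{T}_{\#}\mu^{0}$ is evaluated, so the sketch below assumes a one-dimensional setting and a Gaussian kernel density estimate built from generated samples. The generator `toy_generator`, the bandwidth and the threshold are hypothetical stand-ins for a trained map $\hat{T}$ and a calibrated $\lambda$ .

```python
import math
import random


def make_density_estimate(generator, n_samples=2000, bandwidth=0.05, seed=0):
    """Gaussian kernel estimate of the density of generator(Z), Z ~ N(0, 1)."""
    rng = random.Random(seed)
    samples = [generator(rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    norm = 1.0 / (n_samples * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def is_anomaly(x, density, threshold):
    """Flag x when the estimated push-forward density is at most the threshold."""
    return density(x) <= threshold


def toy_generator(z):
    # Hypothetical stand-in for a trained map: a log-normal transform of Z.
    return math.exp(-0.02 + 0.2 * z)


def generate_new_point(generator, rng):
    """Draw Z from the prior and return x_new = generator(Z)."""
    return generator(rng.gauss(0.0, 1.0))


toy_density = make_density_estimate(toy_generator)
x_new = generate_new_point(toy_generator, random.Random(1))
```

In higher dimension the kernel estimate would have to be replaced by a density estimator suited to the dimension; the decision rule itself is unchanged.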

3.6. Arrow-Hurwicz algorithm: recipe

We simulate $\mu^{1}$ and $\mu^{2}$ by Monte-Carlo with $N_{\mathrm{MC}}$ paths $(S^{i}_{1},S^{i}_{2})_{1\leq i\leq N_{\mathrm{MC}}}$ and for large $N_{\mathrm{MC}}$ , our optimization (11) consists in solving:

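$$\min_{\theta\in{\mathbb{R}}^{u}}\ \max_{\omega\in{\mathbb{R}}^{t}}\frac{1}{N_{\mathrm{MC}}}\sum_{i=1}^{N_{\mathrm{MC}}}J_{i}(\theta,\omega)$$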

where

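$$J_{i}(\theta,\omega):=c(S_{1}^{i},T_{\omega}(S_{1}^{i}))-u_{\theta}(T_{\omega}(S_{1}^{i}))+u_{\theta}(S_{2}^{i})$$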

The average functional can be optimized by using a stochastic Arrow-Hurwicz algorithm which consists in doing sequentially the two iterations at each step $n$ : Draw a uniform r.v. $I\in[[1,N_{\mathrm{MC}}]]$ and compute

[TABLE]

where $\eta$ is the learning rate. In practice, the gradients are computed by back-propagation where

[TABLE]

We could also use a predictor-corrector scheme (which gives similar results in our numerical experiments):

[TABLE]
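As a toy illustration of the stochastic descent-ascent iteration, the Python sketch below uses one-parameter linear models $T_{\omega}(s)=\omega s$ and $u_{\theta}(s)=\theta s$ in $d=1$ with the cost $c(s_{1},s_{2})=-(s_{1}-s_{2})^{2}$ , so that $J_{i}(\theta,\omega)=-(1-\omega)^{2}(S_{1}^{i})^{2}-\theta\omega S_{1}^{i}+\theta S_{2}^{i}$ and the gradients are available in closed form. The update order (descent in $\theta$ first, then ascent in $\omega$ with the updated $\theta$ ), the step size and the function name are assumptions made for this sketch. The iterates are expected to hover around the saddle point $\omega^{*}={\mathbb{E}}[S_{2}]/{\mathbb{E}}[S_{1}]$ , $\theta^{*}=2(1-\omega^{*}){\mathbb{E}}[S_{1}^{2}]/{\mathbb{E}}[S_{1}]$ .

```python
import random


def arrow_hurwicz_linear(s1, s2, eta=0.002, n_steps=60000, seed=0):
    """Stochastic descent in theta / ascent in omega for the toy linear model.

    Model (assumption): T_omega(s) = omega * s, u_theta(s) = theta * s,
    cost c(a, b) = -(a - b)^2, samples paired as (s1[i], s2[i]).
    """
    rng = random.Random(seed)
    theta, omega = 0.0, 1.0
    n = len(s1)
    for _ in range(n_steps):
        i = rng.randrange(n)  # uniform draw of one Monte-Carlo path
        a, b = s1[i], s2[i]
        # Descent step in theta: dJ_i/dtheta = -omega * a + b.
        theta -= eta * (b - omega * a)
        # Ascent step in omega, using the updated theta:
        # dJ_i/domega = 2 * (1 - omega) * a^2 - theta * a.
        omega += eta * (2.0 * (1.0 - omega) * a * a - theta * a)
    return theta, omega


# Toy data with S2 = 2 * S1, so the saddle point is omega* = 2, theta* = -14/3.
theta_hat, omega_hat = arrow_hurwicz_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With a constant step size the iterates fluctuate around the saddle point rather than settling exactly on it; a decreasing step size would shrink these fluctuations.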

3.7. Convergence

By using one layer for the approximation of the two unknowns $T_{\omega}$ and $u_{\theta}$ with a linear activation function (a drift can also be included without loss of generality):

[TABLE]

the problem (11) can be written as

[TABLE]

and it is of the form

[TABLE]

where $K$ is a linear operator. As shown by [5], the stochastic Arrow-Hurwicz algorithm converges if $F$ is concave, $G$ is convex and $||K||\eta^{2}<1$ . Our program (14) is clearly convex in $\theta$ , since it is linear in $\theta$ , and is concave in $\omega$ if and only if $D^{2}_{s_{2}}c\leq 0$ . This implies that our algorithm converges (in the case of one layer) if we impose that $D^{2}_{s_{2}}c\leq 0$ . Additionally, $c$ should satisfy the twist condition, as we have used the Brenier theorem.

Let us remark that if we consider the new cost function $\bar{c}(s_{1},s_{2})=c(s_{1},s_{2})-U(s_{2})$ , then we have for all $U\in\mathrm{L}^{1}(\mu^{2})$ :

[TABLE]

Using this property, we can apply our algorithm to the cost function $\bar{c}$ where $U$ is chosen such that (footnote 2: we are grateful to our master-degree students Y. Chen and F. Jiang at Ecole Polytechnique for pointing out this remark to us)

[TABLE]

Example 3.4.

For $c(x,y)=-(x-y)^{2}$ , we can take $U(y)=0$ . For $c(x,y)=(x+y)^{2}$ , we can take $U(y)=2y^{2}$ .
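The two choices of $U$ in this example can be checked numerically. The Python sketch below estimates the second derivative in $s_{2}$ of the shifted cost $\bar{c}(s_{1},s_{2})=c(s_{1},s_{2})-U(s_{2})$ by central finite differences on a small grid. The helper names, the grid and the tolerance are hypothetical choices for illustration.

```python
def second_derivative_s2(f, s1, s2, h=1e-3):
    """Central finite-difference estimate of the second derivative of f in s2."""
    return (f(s1, s2 + h) - 2.0 * f(s1, s2) + f(s1, s2 - h)) / (h * h)


def shifted_cost_is_concave(c, U, grid, tol=1e-6):
    """Check that c(s1, s2) - U(s2) has non-positive curvature in s2 on the grid."""

    def c_bar(s1, s2):
        return c(s1, s2) - U(s2)

    return all(second_derivative_s2(c_bar, a, b) <= tol for a in grid for b in grid)


check_grid = [0.5, 1.0, 1.5, 2.0]


def cost_a(x, y):
    return -((x - y) ** 2)


def cost_b(x, y):
    return (x + y) ** 2


def zero_shift(y):
    return 0.0


def quadratic_shift(y):
    return 2.0 * y * y
```

For $c(x,y)=(x+y)^{2}$ the curvature in $y$ is $2$ , and subtracting $U(y)=2y^{2}$ brings it to $2-4=-2\leq 0$ , whereas $U=0$ leaves it positive.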

Using the result in [5], we conclude:

Proposition 3.5 (Convergence).

Let us assume that $c$ satisfies the twist condition and $D^{2}_{s_{2}}c-D^{2}_{s_{2}}U(s_{2})\leq 0$ for some twice differentiable function $U$ in $\mathrm{L}^{1}(\mu^{2})$ , then the Arrow-Hurwicz algorithm (with one layer) (12-13) converges for $\eta$ small enough.

Note that a similar conclusion appears if we expand $T_{\omega}$ and $u_{\theta}$ in terms of a reproducing kernel Hilbert space.

3.8. The case of MOT

For $d=1$ , under the (martingale) twist condition $\partial_{s_{1}}\partial_{s_{2}}^{2}c\geq 0$ , the optimal probability measure $\mathbb{P}^{*}$ is shown to be supported not on a single map $T$ but on two maps $T_{d}(x)\leq x\leq T_{u}(x)$ [3, 17]:

[TABLE]

This leads to the following minimax optimization:

[TABLE]

Note that the martingale condition leads explicitly to $q(x):={x-T_{d}(x)\over T_{u}(x)-T_{d}(x)}$ but we do not use this equation in order to preserve the concavity-convexity property with respect to the neural network weights (in the case of one layer). The algorithm is then similar to the one presented for OT except that now we have five (instead of two) neural networks for the potentials $h,u,q$ and the two maps $T_{u}$ and $T_{d}$ .
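The closed-form expression for $q$ can be checked directly: if $q(x)$ is read as the conditional probability of moving from $x$ to $T_{u}(x)$ (and $1-q(x)$ that of moving to $T_{d}(x)$ ), then $q(x)T_{u}(x)+(1-q(x))T_{d}(x)=x$ . This reading of $q$ as the weight on the upper map is an assumption inferred from the formula, and the function names in the Python sketch below are hypothetical.

```python
def up_probability(x, t_d, t_u):
    """Weight on the upper map implied by the martingale condition."""
    if not (t_d <= x <= t_u) or t_u <= t_d:
        raise ValueError("expected t_d <= x <= t_u with t_d < t_u")
    return (x - t_d) / (t_u - t_d)


def conditional_mean(x, t_d, t_u):
    """Mean of the two-point law q * delta_{t_u} + (1 - q) * delta_{t_d}."""
    q = up_probability(x, t_d, t_u)
    return q * t_u + (1.0 - q) * t_d


q_example = up_probability(1.0, 0.5, 2.0)
mean_example = conditional_mean(1.0, 0.5, 2.0)
```

In the algorithm described above $q$ is instead left as a free network, so this identity is only recovered at the optimum.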

For $d\geq 2$ , one can characterize the cost functions for which the optimal probability measure $\mathbb{P}^{*}$ is supported on $n$ maps $T_{i}$ [9]. The above optimization becomes therefore:

[TABLE]

where $q_{n}:=1-\sum_{i=1}^{n-1}q_{i}$ . In practice, the number of maps $n$ can be seen as a hyperparameter that can be optimized.

4. Numerical examples

4.1. OT in $d=1$

We first check our algorithm described in Section 3.6 on an OT problem in $d=1$ . We consider the two cost functions $c(s_{1},s_{2})=(s_{1}+s_{2})^{2}$ and $c(s_{1},s_{2})=-(s_{1}-s_{2})^{2}$ satisfying the conditions in Proposition 3.5 (see Figures 1 and 2). $\mu^{1}$ and $\mu^{2}$ are chosen to be two log-normal distributions in ${\mathbb{R}}^{+}$ centered at $S_{0}=1$ and with variances $0.2^{2}$ and $0.2^{2}\times 1.5$ . They are simulated using $2^{13}$ Monte-Carlo paths. For each neural network, we have used $2$ hidden layers of dimension $4$ . We have also used an Adam stochastic gradient descent [18] with $64$ minibatches for the computation of the online gradients, and our algorithm has been written from scratch in C++. The exact solution has been computed using formula (9) and performing a 1d numerical integration. We have compared our algorithm with the entropy relaxation and the penalization methods outlined in Sections 2.2-2.4. We observe that our primal-dual algorithm converges faster (to the exact solution). On the one hand, the choice of the factor $\gamma$ in the penalization method is tricky: a small value of $\gamma$ results in convergence towards a wrong solution, and a large $\gamma$ gives noisy results. On the other hand, the entropy relaxation needs more iterations to converge. We have used in all our numerical experiments at most $10^{6}$ iterations. Every $10^{4}$ iterations, i.e. at iteration $10^{4}\times n$ where $n$ ranges from $1$ up to $10^{2}$ , we have computed the functional $J(\theta_{n},\omega_{n})$ by averaging over our recorded $2^{13}$ Monte-Carlo paths. We have also plotted the map found by our algorithm (denoted “NN”) and compared it with the Fréchet-Hoeffding solution $T(s)=F_{2}^{-1}\circ F_{1}(s)$ . We found a perfect match (the blue and red curves coincide).

4.2. $2$ -Wasserstein distance in ${\mathbb{R}}^{d}$ , $d=2,10,20$

Next, we compute the $2$ -Wasserstein distance in ${\mathbb{R}}^{d}$ . In our notation, this corresponds to the payoff $c(s_{1},s_{2})=-\sum_{i=1}^{d}(s^{i}_{1}-s^{i}_{2})^{2}$ with a minus sign. We have first considered $d=2$ (see Figure 3–left). We have compared the entropy relaxation method against our primal-dual algorithm. As in $d=1$ , our algorithm converges faster, and the entropy relaxation method is unstable with respect to the choice of $\epsilon$ . For large $\epsilon$ , the Wasserstein distance is underestimated, and for small $\epsilon$ , the SGD is noisy and therefore the result cannot be trusted. As a consequence, the entropy relaxation method as presented could not be used for computing the Wasserstein distance. The convergence is very fast for our primal-dual method. Here $\mu^{1}$ and $\mu^{2}$ are chosen to be two uncorrelated normal distributions in ${\mathbb{R}}^{d}$ with variances $1$ and $2$ , for which the exact $2$ -Wasserstein distance in ${\mathbb{R}}^{d}$ is ${\cal W}_{2}(\mu^{1},\mu^{2})^{2}=d(\sqrt{2}-\sqrt{1})^{2}$ . Then, we consider only our primal-dual algorithm and take $d=10$ and $d=20$ (see Figure 3–right). For each neural network, we have used $1$ hidden layer of dimension $50$ .
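The closed-form benchmark quoted above can be cross-checked in one dimension by integrating the squared difference of the two Gaussian quantile functions, since for $d$ independent coordinates the squared distance is $d$ times the one-dimensional value. The Python sketch below does this with a midpoint rule. The function names and the number of quadrature nodes are hypothetical choices, and zero means are assumed for both distributions.

```python
from statistics import NormalDist


def w2_squared_isotropic_gaussians(d, var1, var2):
    """Closed form d * (sqrt(var2) - sqrt(var1))^2 for centred isotropic Gaussians."""
    return d * (var2 ** 0.5 - var1 ** 0.5) ** 2


def w2_squared_1d_by_quantiles(var1, var2, n=20000):
    """Midpoint-rule integral of the squared quantile difference in d = 1."""
    standard = NormalDist()
    sd1, sd2 = var1 ** 0.5, var2 ** 0.5
    total = 0.0
    for k in range(n):
        u = (k + 0.5) / n
        z = standard.inv_cdf(u)  # standard normal quantile at level u
        total += (sd1 * z - sd2 * z) ** 2
    return total / n


exact_d2 = w2_squared_isotropic_gaussians(2, 1.0, 2.0)
numeric_d1 = w2_squared_1d_by_quantiles(1.0, 2.0)
```

For $d=2$ the benchmark is $2(\sqrt{2}-1)^{2}\approx 0.343$ , and for $d=10$ and $d=20$ it scales linearly to about $1.716$ and $3.431$ .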

4.3. MOT in $d=1$

A similar test has been performed in the case of MOT in $d=1$ with a cost $c(s_{1},s_{2})=(s_{1}+s_{2})^{3}$ for which the martingale twist condition $\partial_{s_{1}}\partial_{s_{2}}^{2}c>0$ is satisfied. Our optimization converges towards the exact solution obtained using a simplex algorithm (see Figure 4).

4.4. Anomaly detection in $d=2$

As a final simple numerical example, we consider our anomaly detection algorithm outlined in Section 3.5. We have used $2$ hidden layers of dimension $10$ with a linear output activation. We take for $\mu^{\mathrm{real}}$ a two-dimensional uncorrelated log-normal distribution with mean $-0.02$ and variance $0.04$ , and for $\mu^{\mathrm{0}}$ a two-dimensional uncorrelated normal distribution. They are simulated using $2^{13}$ Monte-Carlo paths. Note that the stochastic Arrow-Hurwicz iterations over $u_{\theta}$ and $T_{\omega}$ are performed and, every $1000$ iterations, a stochastic gradient descent minimization over $\hat{T}_{\hat{\omega}}$ is done. We have plotted in Figure 5 the $2$ -Wasserstein distance ${\cal W}_{2}(\mu^{\mathrm{real}},{\hat{T}_{\hat{\omega}}}\#\mu^{0})$ every $10^{4}$ iterations; it converges, as expected, to zero. Once the mapping $\hat{T}:{\mathbb{R}}^{2}\mapsto{\mathbb{R}}^{2}$ is constructed by optimization, we generate some “anomalies” $\hat{T}_{\hat{\omega}}(G+3\times\mathrm{sign}(G))$ by drawing some normal variables $G\in\mathrm{N}(0,I_{2})$ in ${\mathbb{R}}^{2}$ and adding an anomaly factor $3\times\mathrm{sign}(G)$ . The “normal” variables $\hat{T}_{\hat{\omega}}(G)$ are generated without introducing this anomaly factor. The two-dimensional “normal” and “abnormal” variables generated are then displayed in Figure 6. As expected, the “abnormal” data live on the edge of the two-dimensional uncorrelated log-normal distribution $\mu^{\mathrm{real}}$ , which is close to $\hat{T}_{\#}\mu^{0}$ with respect to the $2$ -Wasserstein distance (see Figure 5).

Bibliography

[1] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN, arXiv:1701.07875.
[2] Beiglböck, M., Henry-Labordère, P., Penkner, F.: Model-independent Bounds for Option Prices: A Mass-Transport Approach, Finance and Stochastics, July 2013, Volume 17, Issue 3, pp. 477–501.
[3] Beiglböck, M., Juillet, N.: On a problem of optimal transport under marginal martingale constraints, Ann. Probab., Volume 44, Number 1 (2016), 42–106.
[4] Bousquet, O., Gelly, S., Tolstikhin, I., Simon-Gabriel, C-J., Schölkopf, B.: From Optimal Transport to Generative Modeling: the VEGAN cookbook, arXiv:1705.07642.
[5] Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of Mathematical Imaging and Vision, 2011.
[6] Cuturi, M.: Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances, Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
[7] De March, A.: Entropic resolution for multi-dimensional optimal transport, arXiv:1812.11104.
[8] De March, A., Henry-Labordère, P.: Building arbitrage-free implied volatility: Sinkhorn’s algorithm and variants, arXiv:1902.04456.
