Sparse Optimization on Measures with Over-parameterized Gradient Descent

Lenaic Chizat (CNRS; LMO)

arXiv:1907.10300·math.OC·November 4, 2020·Math. Program.

TL;DR

This paper introduces a global optimization algorithm for sparse measure minimization problems using over-parameterized gradient descent, achieving logarithmic complexity in accuracy under certain conditions.

Contribution

It demonstrates that discretized non-convex gradient descent can efficiently solve measure-based convex problems with sparsity penalties, with complexity scaling as log(1/ε).

Findings

1. Algorithm achieves complexity scaling as log(1/ε).

2. Global convergence is established under non-degeneracy assumptions.

3. Bounds involve exponential dependence on the dimension d.

Abstract

Minimizing a convex function of a measure with a sparsity-inducing penalty is a typical problem arising, e.g., in sparse spikes deconvolution or two-layer neural networks training. We show that this problem can be solved by discretizing the measure and running non-convex gradient descent on the positions and weights of the particles. For measures on a $d$-dimensional manifold and under some non-degeneracy assumptions, this leads to a global optimization algorithm with a complexity scaling as $\log(1/\epsilon)$ in the desired accuracy $\epsilon$, instead of $\epsilon^{-d}$ for convex methods. The key theoretical tools are a local convergence analysis in Wasserstein space and an analysis of a perturbed mirror descent in the space of measures. Our bounds involve quantities that are exponential in $d$, which is unavoidable under our assumptions.

Equations

$$J^{*}\coloneqq\min_{\nu\in\mathcal{M}_{+}(\Theta)}J(\nu),\qquad J(\nu)\coloneqq R\Big(\int_{\Theta}\phi(\theta)\,\mathrm{d}\nu(\theta)\Big)+\lambda\,\nu(\Theta)$$

$$W_{p}(\mu_{1},\mu_{2})=\Big(\min_{\gamma\in\Pi(\mu_{1},\mu_{2})}\int\operatorname{dist}(x_{1},x_{2})^{p}\,\mathrm{d}\gamma(x_{1},x_{2})\Big)^{1/p}$$

$$F_{m}((r_{1},\theta_{1}),\dots,(r_{m},\theta_{m}))\coloneqq R\Big(\frac{1}{m}\sum_{i=1}^{m}h(r_{i})\phi(\theta_{i})\Big)+\frac{\lambda}{m}\sum_{i=1}^{m}h(r_{i}),$$

$$J^{\prime}_{\nu}(\theta)=\Big\langle\phi(\theta),\nabla R\Big(\int_{\Theta}\phi(\theta)\,\mathrm{d}\nu(\theta)\Big)\Big\rangle_{\mathcal{F}}+\lambda,$$

$$\langle(\delta r_{1},\delta\theta_{1}),(\delta r_{2},\delta\theta_{2})\rangle_{(r,\theta)}=\alpha(r)^{-1}\,\delta r_{1}\,\delta r_{2}+\beta(r)^{-1}\langle\delta\theta_{1},\delta\theta_{2}\rangle_{\theta}$$

$$\begin{cases}\nabla_{r_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=\alpha(r_{i})\,h^{\prime}(r_{i})\,J^{\prime}_{\nu}(\theta_{i})\\ \nabla_{\theta_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=\beta(r_{i})\,h(r_{i})\,\nabla J^{\prime}_{\nu}(\theta_{i})\end{cases}$$

$$F^{\star}=\min_{\mu\in\mathcal{P}_{2}(\Omega)}F(\mu)\quad\text{where}\quad F(\mu)\coloneqq R\Big(\int_{\Omega}h(r)\phi(\theta)\,\mathrm{d}\mu(r,\theta)\Big)+\lambda\int_{\Omega}h(r)\,\mathrm{d}\mu(r,\theta).$$

$$\int_{\Theta}\varphi(\theta)\,\mathrm{d}(\mathsf{h}\mu)(\theta)=\int_{\Omega}h(r)\varphi(\theta)\,\mathrm{d}\mu(r,\theta)$$

$$x^{\prime}(t)=-\nabla F_{m}(x(t))$$

$$g_{\nu}(r,\theta)=\big(\alpha(r)h^{\prime}(r)J^{\prime}_{\nu}(\theta),\,\beta(r)h(r)\nabla J^{\prime}_{\nu}(\theta)\big)\in\mathbb{R}\times T_{\theta}\Theta.$$

$$\partial_{t}\mu_{t}=\operatorname{div}(\mu_{t}\,g_{\mathsf{h}\mu_{t}})$$

$$\begin{cases}\nabla_{r_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=2\alpha\,r_{i}\,J^{\prime}_{\nu}(\theta_{i})\\ \nabla_{\theta_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=\beta\,\nabla J^{\prime}_{\nu}(\theta_{i})\end{cases}$$

$$g_{\nu}(r,\theta)=\big(2\alpha rJ^{\prime}_{\nu}(\theta),\,\beta\nabla J^{\prime}_{\nu}(\theta)\big)\in\mathbb{R}\times T_{\theta}\Theta,\quad\forall(r,\theta)\in\Omega.$$

$$\partial_{t}\nu_{t}=-4\alpha\,\nu_{t}J^{\prime}_{\nu_{t}}+\beta\operatorname{div}(\nu_{t}\nabla J^{\prime}_{\nu_{t}}).$$

$$\frac{\mathrm{d}}{\mathrm{d}t}\Big(\int\varphi\,\mathrm{d}\nu_{t}\Big)=-\int\langle\nabla(\mathsf{h}^{*}\varphi),g_{\mathsf{h}\mu_{t}}\rangle_{(r,\theta)}\,\mathrm{d}\mu_{t}=-\int\big(4\alpha\,\varphi J^{\prime}_{\nu_{t}}+\beta\nabla\varphi\cdot\nabla J^{\prime}_{\nu_{t}}\big)\,\mathrm{d}\nu_{t},$$

$$F(\mu)=R\Big(\int_{\mathbb{R}^{d+1}}\psi(u)\,\mathrm{d}\mu(u)\Big)+\lambda\int_{\mathbb{R}^{d+1}}\|u\|_{2}^{2}\,\mathrm{d}\mu(u)$$

$$\mu_{k+1}=(T_{k})_{\#}\mu_{k}$$

$$(r_{i}^{(k+1)},\theta_{i}^{(k+1)})\leftarrow\mathrm{Ret}_{(r_{i}^{(k)},\theta_{i}^{(k)})}\big(-2\alpha\,r_{i}^{(k)}J^{\prime}_{\nu^{(k)}}(\theta_{i}^{(k)}),\,-\beta\nabla J^{\prime}_{\nu^{(k)}}(\theta_{i}^{(k)})\big)\quad\text{for }i\in\{1,\dots,m\}$$

$$\nu_{k+1}=(T_{k}^{\theta})_{\#}\big((T_{k}^{r})^{2}\nu_{k}\big).$$

$$\int\psi\,\mathrm{d}\nu_{k+1}$$

$$J(\nu_{k+1})-J(\nu_{k})\leq-\frac{1}{2}\|g_{\nu_{k}}\|_{L^{2}(\nu_{k})}^{2}$$

$$\int\psi\,\mathrm{d}(\nu_{k+1}-\nu_{k})=-\int\big(4\alpha\,\psi\cdot J^{\prime}_{\nu_{k}}+\beta\nabla\psi\cdot\nabla J^{\prime}_{\nu_{k}}\big)\,\mathrm{d}\nu_{k}+\|\psi\|_{C^{2}}\|g_{\nu_{k}}\|_{L^{2}(\mu_{k})}^{2}\,O(\max\{\alpha,\beta\}).$$

$$\Big\|\int\phi\,\mathrm{d}(\nu_{k+1}-\nu_{k})\Big\|^{2}=\sup_{\|f\|\leq 1}\Big\|\int\psi_{f}\,\mathrm{d}(\nu_{k+1}-\nu_{k})\Big\|^{2}=O\big(\max\{\alpha,\beta\}\|g_{\nu_{k}}\|^{2}_{L^{2}(\mu_{k})}\big).$$

$$J(\nu_{k+1})-J(\nu_{k})$$

$$+\lambda\int(\mathrm{d}\nu_{k+1}-\mathrm{d}\nu_{k})+O\big(\max\{\alpha,\beta\}\|g_{\nu_{k}}\|_{L^{2}(\mu_{k})}^{2}\big)$$

$$=\int J^{\prime}_{\nu_{k}}\,\mathrm{d}(\nu_{k+1}-\nu_{k})+O\big(\max\{\alpha,\beta\}\|g_{\nu_{k}}\|_{L^{2}(\mu_{k})}^{2}\big)$$

$$=\big(-1+O(\max\{\alpha,\beta\})\big)\|g_{\nu_{k}}\|_{L^{2}(\mu_{k})}^{2}.$$

$$K_{(i,j),(i^{\prime},j^{\prime})}\coloneqq\big\langle r_{i}\bar{\nabla}_{j}\phi(\theta_{i}),\,r_{i^{\prime}}\bar{\nabla}_{j^{\prime}}\phi(\theta_{i^{\prime}})\big\rangle_{\mathrm{d}^{2}R_{f^{\star}}}$$

$$H_{i}\coloneqq\beta^{2}\nabla^{2}J^{\prime}_{\nu^{\star}}(\theta_{i})$$

$$H_{(i,j),(i,j^{\prime})}=\begin{cases}\beta^{2}\nabla^{2}_{j,j^{\prime}}J^{\prime}_{\nu^{\star}}(\theta_{i})&\text{if }i=i^{\prime}\text{ and }j,j^{\prime}\geq 1,\\ 0&\text{if }j=0\text{ or }j^{\prime}=0.\end{cases}$$

$$W_{2}(\nu_{1},\nu_{2})\coloneqq\min\big\{W_{2}(\mu_{1},\mu_{2})\;;\;(\mu_{1},\mu_{2})\in\mathcal{P}_{2}(\Omega)^{2}\text{ satisfy }(\mathsf{h}\mu_{1},\mathsf{h}\mu_{2})=(\nu_{1},\nu_{2})\big\}$$

Code & Models

Repositories

lchizat/2019-sparse-optim-measures (official)

Full text

Sparse Optimization on Measures with Over-parameterized Gradient Descent

Lénaïc Chizat CNRS, Laboratoire de Mathématiques d’Orsay, Université Paris-Saclay, 91405, Orsay, France.

Abstract

Minimizing a convex function of a measure with a sparsity-inducing penalty is a typical problem arising, e.g., in sparse spikes deconvolution or two-layer neural networks training. We show that this problem can be solved by discretizing the measure and running non-convex gradient descent on the positions and weights of the particles. For measures on a $d$ -dimensional manifold and under some non-degeneracy assumptions, this leads to a global optimization algorithm with a complexity scaling as $\log(1/\epsilon)$ in the desired accuracy $\epsilon$ , instead of $\epsilon^{-d}$ for convex methods. The key theoretical tools are a local convergence analysis in Wasserstein space and an analysis of a perturbed mirror descent in the space of measures. Our bounds involve quantities that are exponential in $d$ which is unavoidable under our assumptions.

1 Introduction

Finding parsimonious descriptions of complex observations is an important problem in machine learning and signal processing. In its simplest form, this task boils down to searching for an element in a Hilbert space $\mathcal{F}$ that is close to a certain $f_{0}\in\mathcal{F}$ — the observations — and that is a linear combination of a few elements from a parameterized set $\{\phi(\theta)\}_{\theta\in\Theta}\subset\mathcal{F}$ — the parsimonious description. This can be formulated as a minimization problem where the linear combination is expressed through an unknown measure $\nu$ and the distance to $f_{0}$ is quantified using a smooth convex loss function $R:\mathcal{F}\to\mathbb{R}$ , such as the square loss $R(f)=\frac{1}{2}\|f-f_{0}\|^{2}_{\mathcal{F}}$ . The problem to solve is then

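$$J^{*}\coloneqq\min_{\nu\in\mathcal{M}_{+}(\Theta)}J(\nu),\qquad J(\nu)\coloneqq R\Big(\int_{\Theta}\phi(\theta)\,\mathrm{d}\nu(\theta)\Big)+\lambda\,\nu(\Theta)\tag{1}$$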

where $\mathcal{M}_{+}(\Theta)$ is the set of nonnegative measures $\nu$ on the parameter space $\Theta$ with finite total mass $\nu(\Theta)<\infty$ and $\lambda>0$ is the regularization strength. This formulation also covers minimization over signed measures with total variation regularization, by replacing $\Theta$ with the disjoint union of two copies of $\Theta$ where $\phi$ takes opposite values, see Appendix A. A large body of research has exhibited the favorable properties of minimizers of such problems [4, 23, 43] with a statistical or variational viewpoint, showing in particular that $\lambda$ favors sparser solutions and increases stability as it gets larger, at the expense of introducing a stronger bias. The present paper deals with the optimization aspect: our goal is to design algorithms that return $\epsilon$ -accurate solutions with a guaranteed computational complexity. When the set $\Theta$ is a finite set, this is a finite dimensional convex optimization problem that is well understood [9, 5]. However, convex approaches are generally inefficient when $\Theta$ is a continuous space, such as a $d$ -dimensional manifold, where the need to discretize the space leads to a complexity scaling as $\epsilon^{-d}$ in the accuracy $\epsilon$ . We consider the following setting:

(A1) $\Theta$ is a compact $d$ -dimensional Riemannian manifold without boundaries. The functions $\phi:\Theta\to\mathcal{F}$ and $R:\mathcal{F}\to\mathbb{R}_{+}$ are twice Fréchet differentiable, with locally Lipschitz second-order derivatives, and $\nabla R$ is bounded on sublevel sets.

The algorithm that we analyze in this paper is simple to describe: initialize with a discrete measure and run gradient descent on the positions and weights of the particles. We will see that when the problem (1) admits sparse solutions and is non-degenerate, this over-parameterized non-convex gradient descent has a complexity scaling as $\log(1/\epsilon)$ in the accuracy $\epsilon$ . We make the following contributions:

– In Section 2, we introduce the conic particle gradient descent algorithm to solve optimization problems in the space of measures and discuss several of its interpretations.

– In Section 3, we show under certain non-degeneracy assumptions that there is a sublevel set of $J$ starting from which this algorithm converges exponentially fast to minimizers.

– In Section 4, we show that for suitable choices of gradient and initialization, this algorithm converges to global minimizers. The proof combines the result of Section 3 with an analysis of a perturbed mirror descent in the space of measures. The number of iterations required to reach an accuracy $\epsilon$ is polynomial in the characteristics of the problem and logarithmic in $\epsilon$. In contrast, the required number of particles depends exponentially on the dimension $d$, which is unavoidable under our assumptions.

– We report results of numerical experiments in Section 5, where the various insights brought by our analysis about local and global behaviors are investigated.

1.1 Examples of applications

As the problem of finding the simplest linear decomposition over a continuous dictionary is a very natural one, problems of the form (1) appear in a large variety of situations, see [8] for an extensive list. In this paper, our numerical illustrations focus on two applications, chosen for their practical importance and because they illustrate the variety of behaviors that can be encountered. We also mention a third example to emphasize the extreme generality, and thus the intrinsic limits, of our analysis. These three cases are illustrated in Figure 1.

Sparse deconvolution.

In this application, we want to recover a signal that consists of a mixture of spikes/impulses on $\Theta$ given a noisy and filtered observation $f_{0}$ in the space $\mathcal{F}=L^{2}(\Theta)$ of square-integrable real-valued functions on $\Theta$. When $\phi(\theta):x\mapsto\psi(x-\theta)$ is defined as the translation by $\theta$ of the filter impulse response $\psi$ and $R$ is the square loss, solving (1) makes it possible to reconstruct the mixture of impulses with some guarantees, see e.g. [28, 23, 50]. In this typically low-dimensional application, solving (1) to a high accuracy is crucial. Both the signed and the nonnegative cases have practical motivations (see Appendix A for how to handle the signed case). Figure 1-(a) illustrates the behavior of particle gradient descent for the signed case on the $1$-torus, where the observed signal is shown in orange. Figure 2 illustrates the unsigned case on the $2$-torus.

Two-layer neural networks.

Here the goal is to select, within a specific class, a function that maps features in $\mathbb{R}^{d-1}$ to labels in $\mathbb{R}$ from the observation of a joint distribution of features and labels. This corresponds to $\mathcal{F}$ being the space of real-valued functions on $\mathbb{R}^{d-1}$ which are square-integrable under the distribution of features, $R$ being e.g., the quadratic or the logistic loss function, and $\phi(\theta):x\mapsto\sigma(\sum_{i=1}^{d-1}\theta_{i}x_{i}+\theta_{d})$ with an activation function $\sigma:\mathbb{R}\to\mathbb{R}$ . Common choices are the sigmoid function or the rectified linear unit [35, 33]. In this application, $d$ is typically large and it is not clear yet how to verify the non-degeneracy assumptions a priori, so our global convergence bounds are not useful. Still, the local analysis in Section 3 gives insights on the local behavior in the over-parameterized regularized setting and explains well the behavior observed in numerical experiments. With the ReLU activation, the method we analyze boils down to the classical gradient descent algorithm, see the remark in Section 2.2 about the $2$ -homogeneous case. Figure 1-(b) illustrates this case, by plotting the trajectories of $|a_{i}|\cdot b_{i}\in\mathbb{R}^{2}$ where $a_{i}\in\mathbb{R}$ is the output weight of neuron $i$ and $b_{i}\in\mathbb{R}^{2}$ its hidden weights (the color represents the sign of $a_{i}$ ).

Non-convex optimization.

Lastly, the minimization of any smooth function $\phi:\Theta\to\mathbb{R}$ on a manifold is covered by (1), as proved in Appendix B. For this problem, our algorithm is analogous to running several independent gradient-based minimizations with diverse initializations, because the particles simply follow the gradient field of $\phi$ and only interact through their masses. This case is illustrated in Figure 1-(c), where the function to minimize (here on the $1$-torus) is plotted in orange. We recover the standard fact that random search has to be complemented with local search if one wants a complexity that is reasonable in the precision. We stress that this is not the situation that motivates our analysis. Instead, we are interested in the case of general interactions between the particles, which is when we obtain novel insights.

1.2 Related work

Sparse optimization on measures.

Problems with the structure (1) have a long history in optimization when $\Theta$ is discrete, and are typically solved with ISTA [22], mirror descent [47, 6] or variants of those algorithms. When $\Theta$ is continuous, the one-dimensional case can sometimes be handled with specific algorithms [13, 15]. In higher dimensions, the classical algorithms are conditional gradient algorithms (also known as Frank-Wolfe) [11, 25, 8], moment methods [24, 14, 27] and adaptive sampling/exchange algorithms [30, 29]. Often, these algorithms are complemented with non-convex updates on the particle positions, which considerably improves their behavior. Given an initial condition that is close to the optimum and with the same structure (i.e. without over-parameterization), the local convergence of non-convex gradient descent is studied in [56, 29].

Wasserstein gradient flows for optimization.

The dynamics of two-layer neural networks optimization when the number of hidden units grows unbounded is studied in [48, 17, 45, 52, 54]. This line of work has led to various insights related to stochastic fluctuations and global convergence. The present paper can be seen as a quantitative counterpart to [17], although we consider a more restrictive setting (the algorithm we study in this paper corresponds to the “$2$-homogeneous case” in [17]; moreover, [17] allows non-smooth regularizers and does not require non-degeneracy). A global rate of convergence is obtained in [60] but for modified dynamics where particles are re-sampled at each iteration. Instead, we focus on the basic case where particles are only sampled once at the beginning of the algorithm. It should be mentioned that our analysis is different from the line of research on lazy over-parameterized models [18] initiated by [26, 36], which does not apply to the regularized case or to the unsigned case. Finally, in the parametric case where the unknown measure is assumed to belong to a finite-dimensional probability model, Wasserstein natural gradient [2, 41, 16] or accelerated versions [58] have been proposed. Our analysis is however of a non-parametric nature because the number of parameters is not fixed a priori in the analysis.

Related techniques.

Our framework involves the theory of optimization on manifolds [1] and of Wasserstein gradient flows [3]. Some inspiration and interpretations of the algorithm under consideration come from unbalanced optimal transport theory [42, 38, 19] and in particular, from the lifting construction in [42]. Finally, our local analysis includes a functional and a gradient Łojasiewicz inequality of order $2$ in Wasserstein space. Such inequalities were studied in [34, 7] for displacement convex functions, which does not cover our setting.

1.3 Notation

The set of signed (resp. nonnegative) finite Borel measures on a metric space $(\mathcal{X},\operatorname{dist})$ is denoted by $\mathcal{M}(\mathcal{X})$ (resp. $\mathcal{M}_{+}(\mathcal{X})$ ). The relative entropy, a.k.a. Kullback-Leibler divergence, is defined for $\nu_{1},\nu_{2}\in\mathcal{M}_{+}(\mathcal{X})$ as $\mathcal{H}(\nu_{1},\nu_{2})=\int_{\mathcal{X}}\log(\mathrm{d}\nu_{1}/\mathrm{d}\nu_{2})\mathrm{d}\nu_{1}-\nu_{1}(\mathcal{X})+\nu_{2}(\mathcal{X})$ if $\nu_{1}$ is absolutely continuous w.r.t. $\nu_{2}$ , and $+\infty$ otherwise. The $p$ -Wasserstein distance on the set $\mathcal{P}_{p}(\mathcal{X})$ of probability measures with finite $p$ -th moment is defined, for $\mu_{1},\mu_{2}\in\mathcal{P}_{p}(\mathcal{X})$ as

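$$W_{p}(\mu_{1},\mu_{2})=\Big(\min_{\gamma\in\Pi(\mu_{1},\mu_{2})}\int\operatorname{dist}(x_{1},x_{2})^{p}\,\mathrm{d}\gamma(x_{1},x_{2})\Big)^{1/p}$$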

where $\Pi(\mu_{1},\mu_{2})$ is the set of measures on $\mathcal{X}\times\mathcal{X}$ with marginals $\mu_{1}$ and $\mu_{2}$ . The distance $W_{\infty}$ between compactly supported probabilities is defined as the limit of $W_{p}$ as $p\to\infty$ and can be directly defined as $W_{\infty}(\mu_{1},\mu_{2})=\inf_{\gamma\in\Pi(\mu_{1},\mu_{2})}\max_{(x_{1},x_{2})\in\operatorname{spt}\gamma}\operatorname{dist}(x_{1},x_{2})$ [53]. We also define the Bounded-Lipschitz norm for a continuous function $\psi:\mathcal{X}\to\mathbb{R}$ as $\|\psi\|_{\mathrm{BL}}=\|\psi\|_{\infty}+\mathrm{Lip}(\psi)$ where $\mathrm{Lip}(\psi)$ is the Lipschitz constant of $\psi$ and its dual norm on $\mathcal{M}(\mathcal{X})$ as $\|\nu\|^{*}_{\mathrm{BL}}\coloneqq\sup_{\|\varphi\|_{\mathrm{BL}}\leq 1}\int\varphi\,\mathrm{d}\nu$ . For a Riemannian manifold $\Theta$ , we denote by $T_{\theta}\Theta$ the tangent space of $\Theta$ at $\theta$ and by $\langle\cdot,\cdot\rangle_{\theta}:(T_{\theta}\Theta)^{2}\to\mathbb{R}_{+}$ the metric at $\theta$ .
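As an illustration of these definitions, the following sketch computes $W_{p}$ and $W_{\infty}$ in the simplest possible setting: two uniform empirical measures with the same number of atoms on the real line with the usual distance, where pairing the sorted atoms is an optimal coupling for $p\geq 1$. The function name `wasserstein_1d` and the sample values are illustrative choices and do not come from the paper.

```python
import numpy as np

def wasserstein_1d(x, y, p=2.0):
    """W_p between two uniform empirical measures on the real line.

    Assumes both measures have the same number of atoms, so that the
    monotone (sorted) pairing is an optimal coupling for p >= 1.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch requires the same number of atoms")
    gaps = np.abs(x - y)              # transport distance of each pair
    if np.isinf(p):
        return float(gaps.max())      # W_infinity: largest displacement
    return float(np.mean(gaps ** p) ** (1.0 / p))

x = [0.0, 1.0, 3.0]
y = [0.5, 1.0, 5.0]
w1 = wasserstein_1d(x, y, p=1)
w2 = wasserstein_1d(x, y, p=2)
w_inf = wasserstein_1d(x, y, p=np.inf)
print(w1, w2, w_inf)
```

On this example the values are nondecreasing in $p$, consistent with $W_{\infty}$ being the limit of $W_{p}$ as $p\to\infty$.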

2 Particle gradient descent

2.1 General case

Consider a smooth increasing bijection $h:\mathbb{R}_{+}\to\mathbb{R}_{+}$ (such as a power function $r\mapsto r^{p}$ ) and a number of particles $m\in\mathbb{N}^{*}$ . The idea behind particle gradient-based algorithms is to parameterize the unknown measure $\nu$ as $\frac{1}{m}\sum_{i=1}^{m}h(r_{i})\delta_{\theta_{i}}$ and to perform gradient-based optimization on the corresponding objective

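$$F_{m}((r_{1},\theta_{1}),\dots,(r_{m},\theta_{m}))\coloneqq R\Big(\frac{1}{m}\sum_{i=1}^{m}h(r_{i})\phi(\theta_{i})\Big)+\frac{\lambda}{m}\sum_{i=1}^{m}h(r_{i}),\tag{2}$$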

where the parameters $(r_{i},\theta_{i})$ of each particle belong to $\Omega\coloneqq\mathbb{R}_{+}\times\Theta$ endowed with a specific choice of metric. Clearly, if $J$ admits a minimizer that is a mixture of $m^{\star}$ atoms with $m^{\star}\leq m$, then it is sufficient to minimize $F_{m}$ from Eq. (2) for solving (1). While (2) is finite dimensional, it is typically non-convex with possibly some strict local minima. Still, when $R$ is convex and $h(r)=r^{p}$ with $p\in\{1,2\}$, the message from [17] (see Theorem 2.2) is that solving (2) to global optimality with first-order methods is still possible by using over-parameterization, i.e. choosing $m$ much larger than $m^{\star}$. Such a method involves several key hyper-parameters whose roles are discussed throughout the paper. They include (i) the choice of the function $h$, (ii) the choice of the metric on $\Omega^{m}$ and (iii) the choice of the initialization.

Expression of the gradient.

Under (A1), the objective $J$ , seen as a function on the space $\mathcal{M}(\Theta)$ endowed with the total variation norm, is Fréchet-differentiable. Its differential at $\nu\in\mathcal{M}(\Theta)$ can be represented by the function $J^{\prime}_{\nu}:\Theta\to\mathbb{R}$ given by

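$$J^{\prime}_{\nu}(\theta)=\Big\langle\phi(\theta),\nabla R\Big(\int_{\Theta}\phi(\theta)\,\mathrm{d}\nu(\theta)\Big)\Big\rangle_{\mathcal{F}}+\lambda,\tag{3}$$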

in the sense that for any $\sigma\in\mathcal{M}(\Theta)$ , it holds $\frac{d}{d\epsilon}J(\nu+\epsilon\sigma)|_{\epsilon=0}=\int_{\Theta}J^{\prime}_{\nu}(\theta)\mathrm{d}\sigma(\theta)$ . Now, consider a metric on $(\Omega^{*})^{m}$ that is the average $(1/m)\sum_{i=1}^{m}\langle\cdot,\cdot\rangle_{(r_{i},\theta_{i})}$ of metrics on each factor $\Omega^{*}\coloneqq\mathbb{R}_{+}^{*}\times\Theta$ , where $\mathbb{R}_{+}^{*}$ is the set of positive real numbers, of the form

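$$\langle(\delta r_{1},\delta\theta_{1}),(\delta r_{2},\delta\theta_{2})\rangle_{(r,\theta)}=\alpha(r)^{-1}\,\delta r_{1}\,\delta r_{2}+\beta(r)^{-1}\langle\delta\theta_{1},\delta\theta_{2}\rangle_{\theta}\tag{4}$$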

where $\alpha$ and $\beta$ are smooth functions $\mathbb{R}_{+}^{*}\to\mathbb{R}_{+}^{*}$ to be specified (the extension of the metric and gradients to the whole of $\Omega$ can be made on a case-by-case basis, see Section 2.2), $(r,\theta)\in\Omega^{*}$, $\delta r_{1},\delta r_{2}\in\mathbb{R}$ and $\delta\theta_{1},\delta\theta_{2}\in T_{\theta}\Theta$. Using the fact that gradients are characterized by the relation $\mathrm{d}F_{m}(x)(\delta x)=\langle\nabla F_{m}(x),\delta x\rangle$, we get that the gradient of $F_{m}$ is given, in components, by

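$$\begin{cases}\nabla_{r_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=\alpha(r_{i})\,h^{\prime}(r_{i})\,J^{\prime}_{\nu}(\theta_{i})\\ \nabla_{\theta_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=\beta(r_{i})\,h(r_{i})\,\nabla J^{\prime}_{\nu}(\theta_{i})\end{cases}\tag{5}$$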
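The component formulas $\nabla_{r_{i}}F_{m}=\alpha(r_{i})h^{\prime}(r_{i})J^{\prime}_{\nu}(\theta_{i})$ and $\nabla_{\theta_{i}}F_{m}=\beta(r_{i})h(r_{i})\nabla J^{\prime}_{\nu}(\theta_{i})$, with $\nu=\frac{1}{m}\sum_{i}h(r_{i})\delta_{\theta_{i}}$, can be checked numerically. The sketch below is only a sanity check under assumptions of ours: $\Theta$ is the $1$-torus, $\mathcal{F}$ is replaced by $\mathbb{R}^{n}$ with a discretized $L^{2}$ inner product, $R$ is the square loss, and the feature map `phi`, the metric weights `alpha`, `beta` and all constants are hypothetical choices.

```python
import numpy as np

n, m, lam, kappa = 64, 5, 0.1, 4.0        # illustrative sizes and constants
grid = np.arange(n) / n                   # F = R^n stands in for L^2 of the 1-torus
rng = np.random.default_rng(0)

def inner(f, g):
    return np.dot(f, g) / n               # discretized L^2 inner product

def phi(theta):                           # smooth periodic bump centered at theta
    return np.exp(kappa * (np.cos(2 * np.pi * (grid - theta)) - 1.0))

def dphi(theta):                          # derivative of phi with respect to theta
    return phi(theta) * kappa * 2 * np.pi * np.sin(2 * np.pi * (grid - theta))

h = lambda r: r ** 2                      # mass parameterization
dh = lambda r: 2 * r
alpha = lambda r: 0.5 + 0.0 * r           # metric weights: arbitrary positive functions
beta = lambda r: 1.0 / (1.0 + r ** 2)

f0 = phi(0.3) + 0.5 * phi(0.7)            # synthetic observation

def F_m(r, theta):
    f = sum(h(ri) * phi(ti) for ri, ti in zip(r, theta)) / m
    return 0.5 * inner(f - f0, f - f0) + lam * np.mean(h(r))

def riemannian_grad(r, theta):
    f = sum(h(ri) * phi(ti) for ri, ti in zip(r, theta)) / m
    grad_R = f - f0                       # gradient of the square loss at f
    Jp = np.array([inner(phi(t), grad_R) + lam for t in theta])   # J'_nu(theta_i)
    dJp = np.array([inner(dphi(t), grad_R) for t in theta])       # grad J'_nu(theta_i)
    return alpha(r) * dh(r) * Jp, beta(r) * h(r) * dJp

def finite_diff(r, theta, eps=1e-6):
    # Euclidean partial derivatives of F_m by central differences
    gr, gt = np.zeros(m), np.zeros(m)
    for i in range(m):
        e = np.zeros(m)
        e[i] = eps
        gr[i] = (F_m(r + e, theta) - F_m(r - e, theta)) / (2 * eps)
        gt[i] = (F_m(r, theta + e) - F_m(r, theta - e)) / (2 * eps)
    return gr, gt

r = rng.uniform(0.5, 1.5, size=m)
theta = rng.uniform(0.0, 1.0, size=m)
grad_r, grad_theta = riemannian_grad(r, theta)
fd_r, fd_theta = finite_diff(r, theta)
# The metric is an average over particles (hence the factor m) and it
# rescales the Euclidean partials by alpha(r_i) and beta(r_i).
err_r = np.max(np.abs(grad_r - m * alpha(r) * fd_r))
err_theta = np.max(np.abs(grad_theta - m * beta(r) * fd_theta))
print(err_r, err_theta)
```

The factor $m$ in the comparison comes from the averaging $(1/m)\sum_{i}$ in the metric on $(\Omega^{*})^{m}$, which is what makes the gradient components independent of $m$.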

Lifted problem in Wasserstein space.

Assume now that $h$ has at most quadratic growth, and that the metric is defined on the whole of $\Omega$ . One can then see the discrete problem (2) as a discretization of a problem on the space $\mathcal{P}_{2}(\Omega)$ of probability measures on $\Omega$ with finite second moment endowed with the Wasserstein- $2$ metric given by

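$$F^{\star}=\min_{\mu\in\mathcal{P}_{2}(\Omega)}F(\mu)\quad\text{where}\quad F(\mu)\coloneqq R\Big(\int_{\Omega}h(r)\phi(\theta)\,\mathrm{d}\mu(r,\theta)\Big)+\lambda\int_{\Omega}h(r)\,\mathrm{d}\mu(r,\theta).\tag{6}$$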

This point of view leads to insights on the properties of $F_{m}$ that are independent of $m$ , which is crucial for our theoretical analysis. For a measure $\mu\in\mathcal{P}_{2}(\Omega)$ , we define following [42] the homogeneous projection operator $\mathsf{h}:\mathcal{P}_{2}(\Omega)\to\mathcal{M}_{+}(\Theta)$ where $\mathsf{h}\mu$ is characterized by

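$$\int_{\Theta}\varphi(\theta)\,\mathrm{d}(\mathsf{h}\mu)(\theta)=\int_{\Omega}h(r)\varphi(\theta)\,\mathrm{d}\mu(r,\theta)$$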

for any continuous function $\varphi:\Theta\to\mathbb{R}$ . With this operator, we simply have $F(\mu)=J(\mathsf{h}\mu)$ .
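For discrete measures the homogeneous projection is explicit: $\mu=\frac{1}{m}\sum_{i}\delta_{(r_{i},\theta_{i})}$ is sent to $\mathsf{h}\mu=\frac{1}{m}\sum_{i}h(r_{i})\delta_{\theta_{i}}$. The sketch below, with $h(r)=r^{2}$ and $\Theta$ taken to be the $1$-torus as illustrative assumptions, checks the characterizing identity on one test function and shows that $\mathsf{h}$ is not injective. The helper names are ours.

```python
import numpy as np

def homogeneous_projection(r, theta, h=lambda s: s ** 2):
    """Weights and atoms of h(mu) for mu = (1/m) sum_i delta_{(r_i, theta_i)}."""
    r = np.asarray(r, dtype=float)
    return h(r) / len(r), np.asarray(theta, dtype=float)

def integrate(varphi, weights, atoms):
    """Integral of varphi against the discrete measure sum_i weights_i delta_{atoms_i}."""
    return float(np.sum(weights * varphi(atoms)))

varphi = lambda t: 1.0 + np.cos(2 * np.pi * t)   # a continuous test function on the 1-torus

# Two different lifted measures on Omega ...
r_a, theta_a = np.array([1.0, 1.0]), np.array([0.25, 0.25])
r_b, theta_b = np.array([np.sqrt(2.0), 0.0]), np.array([0.25, 0.80])

# ... with the same projection: a unit mass at theta = 0.25
# (the second particle of mu_b is "dead" since r = 0).
nu_a = homogeneous_projection(r_a, theta_a)
nu_b = homogeneous_projection(r_b, theta_b)
int_a = integrate(varphi, *nu_a)
int_b = integrate(varphi, *nu_b)

# Right-hand side of the characterization: integral of h(r) varphi(theta) d mu
lifted_a = float(np.mean(r_a ** 2 * varphi(theta_a)))
print(int_a, int_b, lifted_a)
```

Since $F(\mu)=J(\mathsf{h}\mu)$, the two lifted measures in this example have the same objective value.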

Gradient flow.

There are various ways to optimize (2) with first order methods. Instead of directly focusing on a specific method, we first consider the gradient flow of $F_{m}$ , as it is known that (stochastic) gradient descent [32, 40] approximates this dynamics. Let us call $x=(r_{i},\theta_{i})_{i=1}^{m}\in\Omega^{m}$ the variable of $F_{m}$ . A gradient flow of $F_{m}$ is an absolutely continuous curve $(x(t))_{t\geq 0}$ in $\Omega^{m}$ that satisfies

$$x^{\prime}(t)=-\nabla F_{m}(x(t))$$

for $t\geq 0$, with the gradient given in Eq. (5). Note that if $h^{\prime}(r)\alpha(r)^{-1}$ does not tend to $0$ as $r\to 0$, then the non-negativity constraint on $r$ should be explicitly enforced, which requires the notion of subgradient flows, see [17] for details in our setting.

Wasserstein gradient flow.

It is also possible to directly study the optimization dynamics in the space $\mathcal{P}_{2}(\Omega)$ for the functional $F$ of Eq. (6). For a measure $\nu\in\mathcal{M}_{+}(\Theta)$ , consider the vector field on $\Omega$ with expression

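$$g_{\nu}(r,\theta)=\big(\alpha(r)h^{\prime}(r)J^{\prime}_{\nu}(\theta),\,\beta(r)h(r)\nabla J^{\prime}_{\nu}(\theta)\big)\in\mathbb{R}\times T_{\theta}\Theta.$$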

We refer to $g_{\mathsf{h}\mu}$ as the Wasserstein gradient of $F$ at $\mu$ (this notation emphasizes that it only depends on $\mu$ through $\mathsf{h}\mu$ ). Gradient flows of $F_{m}$ are particular cases of Wasserstein gradient flows of $F$ . The latter are defined as the absolutely continuous curves $(\mu_{t})_{t\geq 0}$ in $\mathcal{P}_{2}(\Omega)$ that satisfy

$$\partial_{t}\mu_{t}=\operatorname{div}(\mu_{t}\,g_{\mathsf{h}\mu_{t}})\tag{7}$$

in the weak sense, which means that for any differentiable function $\varphi:\Omega\to\mathbb{R}$ , it holds $\frac{\mathrm{d}}{\mathrm{d}t}\left(\int\varphi\,\mathrm{d}\mu_{t}\right)=-\int\nabla\varphi\cdot g_{\mathsf{h}\mu_{t}}\mathrm{d}\mu_{t}$ , for almost every $t\geq 0$ , see [53]. This is a proper extension of the notion of gradient flow for $F_{m}$ in the sense that if $x(t)=(r_{i}(t),\theta_{i}(t))_{i=1}^{m}$ is a gradient flow of $F_{m}$ then it can be directly checked that $t\mapsto\mu_{t}=\frac{1}{m}\sum_{i=1}^{m}\delta_{(r_{i}(t),\theta_{i}(t))}$ is a Wasserstein gradient flow of $F$ .

2.2 The conic case

As seen in Eq. (5), the choices of the homogeneity degree and of the metric on $\Omega$ determine a specific way to combine the vertical and the spatial components of the gradient (along the variables $r$ and $\theta$, respectively). From now on, we focus on what we refer to as the conic case, which corresponds to the following assumption:

(A2) The mass parameterization is $h(r)=r^{2}$ and the metric on $\Omega^{*}$ is of the form Eq. (4) with $(\alpha(r),\beta(r))=(\alpha,\beta/r^{2})$ for some $\alpha,\beta>0$.

The corresponding geodesic distance is $\operatorname{dist}((r_{1},\theta_{1}),(r_{2},\theta_{2}))^{2}=r_{1}^{2}+r_{2}^{2}-2r_{1}r_{2}\cos_{\pi}(\operatorname{dist}(\theta_{1},\theta_{2}))$ where $\cos_{\pi}(z)=\cos(\min\{\pi,z\})$. This metric extends to a proper metric on $\widetilde{\Omega}$, defined as the set $\Omega$ where the subset $\{0\}\times\Theta$ is identified with a single point. It is known as the cone metric and is the canonical way to define a metric on $\widetilde{\Omega}$ [12]. In our context, identifying $\{0\}\times\Theta$ with a single point is desirable because a particle located in this set is a “dead” particle carrying no mass.

Plugging the metric into Eq. (5) gives the gradient (extended by continuity to $\{0\}\times\Theta$ )

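$$\begin{cases}\nabla_{r_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=2\alpha\,r_{i}\,J^{\prime}_{\nu}(\theta_{i})\\ \nabla_{\theta_{i}}F_{m}((r_{i},\theta_{i})_{i=1}^{m})=\beta\,\nabla J^{\prime}_{\nu}(\theta_{i})\end{cases}$$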

and the Wasserstein gradient is represented by the vector field

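$$g_{\nu}(r,\theta)=\big(2\alpha rJ^{\prime}_{\nu}(\theta),\,\beta\nabla J^{\prime}_{\nu}(\theta)\big)\in\mathbb{R}\times T_{\theta}\Theta,\quad\forall(r,\theta)\in\Omega.$$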

Existence of Wasserstein gradient flows under (A1-2), for any initialization in $\mathcal{P}_{2}(\Omega)$, can be proved along the same lines as in [17], see details in Appendix C.1. Abstracting away its geometric derivation, the important aspects of our choice of gradient (8) are that it leads to updates in $r$ that are multiplicative and updates in $\theta$ that are independent of $r$. These two properties are crucial for our local convergence analysis (Section 3). Moreover, multiplicative updates enjoy favorable convergence rates (Section 4). The resulting structure and dynamics admit several interpretations.
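To make the two structural properties concrete, here is a minimal sketch of a discrete-time version of the conic gradient $(2\alpha r_{i}J^{\prime}_{\nu}(\theta_{i}),\beta\nabla J^{\prime}_{\nu}(\theta_{i}))$ on a toy deconvolution problem. It rests on assumptions of ours and is not the paper's implementation: $\Theta$ is the $1$-torus, $\mathcal{F}$ is replaced by $\mathbb{R}^{n}$, the update is a plain explicit Euler step rather than a specific retraction, and the feature map `phi` and all constants are hypothetical.

```python
import numpy as np

n, m = 64, 20                             # grid size for F = R^n, number of particles
lam, alpha, beta = 0.01, 0.05, 0.005      # illustrative values, not from the paper
kappa = 4.0
grid = np.arange(n) / n

def inner(f, g):
    return np.dot(f, g) / n               # discretized L^2 inner product

def phi(theta):                           # smooth periodic bump centered at theta
    return np.exp(kappa * (np.cos(2 * np.pi * (grid - theta)) - 1.0))

def dphi(theta):                          # derivative of phi with respect to theta
    return phi(theta) * kappa * 2 * np.pi * np.sin(2 * np.pi * (grid - theta))

f0 = 1.0 * phi(0.3) + 0.5 * phi(0.7)      # synthetic observation made of two spikes

def model(r, theta):
    # f = integral of phi d(nu) with nu = (1/m) sum_i r_i^2 delta_{theta_i}
    return sum(ri ** 2 * phi(ti) for ri, ti in zip(r, theta)) / m

def J(r, theta):
    res = model(r, theta) - f0
    return 0.5 * inner(res, res) + lam * np.mean(r ** 2)

def step(r, theta):
    res = model(r, theta) - f0                                   # gradient of R at f
    Jp = np.array([inner(phi(t), res) + lam for t in theta])     # J'_nu(theta_i)
    dJp = np.array([inner(dphi(t), res) for t in theta])         # grad J'_nu(theta_i)
    r_new = r * (1.0 - 2.0 * alpha * Jp)      # multiplicative update of the weights
    theta_new = (theta - beta * dJp) % 1.0    # position update does not involve r
    return r_new, theta_new

r = np.ones(m)
theta = (np.arange(m) + 0.5) / m              # particles spread over the torus
values = [J(r, theta)]
for _ in range(200):
    r, theta = step(r, theta)
    values.append(J(r, theta))
print(values[0], values[-1])
```

In this sketch the weight update multiplies $r_{i}$ by $1-2\alpha J^{\prime}_{\nu}(\theta_{i})$, so a particle with $r_{i}=0$ stays at zero, and the displacement of $\theta_{i}$ does not depend on $r_{i}$.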

Transport-growth interpretation.

First, the projection $\nu_{t}=\mathsf{h}\mu_{t}$ of the gradient flow solves an advection-reaction equation. Importantly, these dynamics depend on $(\mu_{t})_{t\geq 0}$ only through the projected initialization $\mathsf{h}\mu_{0}$, which is a property specific to the conic setting.

Proposition 2.1.

Under (A1-2), let $(\mu_{t})_{t\geq 0}$ be a Wasserstein gradient flow for $F$ , with $\mu_{0}\in\mathcal{P}_{2}(\Omega)$ . Then $\nu_{t}=\mathsf{h}\mu_{t}$ satisfies (in the weak sense)

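$$\partial_{t}\nu_{t}=-4\alpha\,\nu_{t}J^{\prime}_{\nu_{t}}+\beta\operatorname{div}(\nu_{t}\nabla J^{\prime}_{\nu_{t}}).\tag{9}$$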

Proof.

For any differentiable function $\varphi:\Theta\to\mathbb{R}$ , since $\mu_{t}$ is a Wasserstein gradient flow it holds

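$$\frac{\mathrm{d}}{\mathrm{d}t}\Big(\int\varphi\,\mathrm{d}\nu_{t}\Big)=-\int\langle\nabla(\mathsf{h}^{*}\varphi),g_{\mathsf{h}\mu_{t}}\rangle_{(r,\theta)}\,\mathrm{d}\mu_{t}=-\int\big(4\alpha\,\varphi J^{\prime}_{\nu_{t}}+\beta\nabla\varphi\cdot\nabla J^{\prime}_{\nu_{t}}\big)\,\mathrm{d}\nu_{t},$$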

which is the definition of weak solutions for (9). ∎
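The step from the integral against $\mu_{t}$ to the integral against $\nu_{t}$ is a short chain-rule computation. A sketch of it, assuming the conic choices $h(r)=r^{2}$ and $g_{\nu}=(2\alpha rJ^{\prime}_{\nu},\beta\nabla J^{\prime}_{\nu})$ of this section and writing $\mathsf{h}^{*}\varphi$ for the function $(r,\theta)\mapsto r^{2}\varphi(\theta)$ on $\Omega$ (our reading of the notation):

```latex
\begin{align*}
\mathrm{d}(\mathsf{h}^{*}\varphi)_{(r,\theta)}(\delta r,\delta\theta)
  &= 2r\,\varphi(\theta)\,\delta r
   + r^{2}\,\langle\nabla\varphi(\theta),\delta\theta\rangle_{\theta},\\
\langle\nabla(\mathsf{h}^{*}\varphi),g_{\nu}\rangle_{(r,\theta)}
  &= \mathrm{d}(\mathsf{h}^{*}\varphi)_{(r,\theta)}
     \big(2\alpha r J^{\prime}_{\nu}(\theta),\,\beta\nabla J^{\prime}_{\nu}(\theta)\big)\\
  &= r^{2}\,\big(4\alpha\,\varphi(\theta)\,J^{\prime}_{\nu}(\theta)
   + \beta\,\langle\nabla\varphi(\theta),\nabla J^{\prime}_{\nu}(\theta)\rangle_{\theta}\big).
\end{align*}
```

Integrating against $\mu_{t}$ and using the characterization of $\mathsf{h}$ with $h(r)=r^{2}$ turns the factor $r^{2}$ into an integral against $\nu_{t}=\mathsf{h}\mu_{t}$, which gives the right-hand side of the weak formulation.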

When $\beta=0$ , we recover the gradient flow of $J$ for the Fisher-Rao (or Hellinger) metric, which also corresponds to continuous time mirror descent on $\mathcal{M}_{+}(\Theta)$ for the entropy mirror map [39]. When $\alpha=0$ , this is the gradient flow of $J$ for the Wasserstein metric [3]. When $\alpha,\beta>0$ , this is the gradient flow of the functional $J$ for the Wasserstein-Fisher-Rao metric, a.k.a. Hellinger-Kantorovich metric, see e.g. [31]. Under Assumption (A2), the dynamics (7) and (9) are directly related by Proposition 2.1. In the rest of this paper, we present the statements in terms of the projected dynamics $\nu_{t}$ , although they also could be stated in terms of $\mu_{t}$ . Note that an alternative discretization of the dynamic (9) was proposed in [51] using particle birth-death.

Spherical coordinates interpretation.

Consider the case when $\Theta=\mathbb{S}^{d}$ is the $d$-dimensional sphere in $\mathbb{R}^{d+1}$. Then, the space $\widetilde{\Omega}$ endowed with the cone metric and $\mathbb{R}^{d+1}$ are isometric, through the spherical to Euclidean change of coordinates $(r,\theta)\mapsto r\theta$. Identifying $\widetilde{\Omega}$ with $\mathbb{R}^{d+1}$ through this isometry, the class of functions of the form $r^{p}\phi(\theta)$ on $\widetilde{\Omega}$, for $p>0$, is simply the class of $p$-homogeneous functions on $\mathbb{R}^{d+1}$.

It follows that the conic setting we consider boils down, when $\Theta=\mathbb{S}^{d}$ and $p=2$, to objectives defined on $\mathcal{P}_{2}(\mathbb{R}^{d+1})$ of the form

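$$F(\mu)=R\Big(\int_{\mathbb{R}^{d+1}}\psi(u)\,\mathrm{d}\mu(u)\Big)+\lambda\int_{\mathbb{R}^{d+1}}\|u\|_{2}^{2}\,\mathrm{d}\mu(u)$$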

with $\psi:\mathbb{R}^{d+1}\to\mathcal{F}$ positively $2$ -homogeneous. Moreover, the Wasserstein gradient on $\mathcal{P}_{2}(\widetilde{\Omega})$ with the cone metric can be identified with the Wasserstein gradient on $\mathcal{P}_{2}(\mathbb{R}^{d+1})$ with the Euclidean metric. One can thus understand our choice of conic metric and $p=2$ as a way to emulate the structure of $2$ -homogeneous problems on $\mathbb{R}^{d+1}$ in more general situations.
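The isometry behind this interpretation is easy to test numerically. The sketch below, for the $2$-sphere in $\mathbb{R}^{3}$, compares the cone distance with the Euclidean distance between the points $r\theta$, and checks that a function of the form $r^{2}\phi(\theta)$ is $2$-homogeneous. The particular `phi` used (the first coordinate of $\theta$) and the sampling ranges are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vector(dim):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def cone_distance(r1, t1, r2, t2):
    # On the unit sphere, the geodesic distance is the angle between the points.
    angle = np.arccos(np.clip(np.dot(t1, t2), -1.0, 1.0))
    sq = r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * np.cos(min(np.pi, angle))
    return np.sqrt(max(sq, 0.0))

def psi(u):
    # r^2 * phi(theta) in Euclidean coordinates u = r * theta, with phi(theta) = theta[0]
    r = np.linalg.norm(u)
    return r ** 2 * (u[0] / r)

max_gap = 0.0
for _ in range(100):
    t1, t2 = random_unit_vector(3), random_unit_vector(3)
    r1, r2 = rng.uniform(0.0, 3.0, size=2)
    euclidean = np.linalg.norm(r1 * t1 - r2 * t2)
    max_gap = max(max_gap, abs(cone_distance(r1, t1, r2, t2) - euclidean))

u, s = rng.normal(size=3), 2.5
homog_gap = abs(psi(s * u) - s ** 2 * psi(u))   # positive 2-homogeneity of psi
print(max_gap, homog_gap)
```

Both gaps vanish up to rounding because $\|r_{1}\theta_{1}-r_{2}\theta_{2}\|^{2}=r_{1}^{2}+r_{2}^{2}-2r_{1}r_{2}\,\theta_{1}\cdot\theta_{2}$ and the angle between two unit vectors never exceeds $\pi$.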

Asymptotic global convergence.

Let us recall the global convergence result of [17, Thm. 3.3], in our setting and notations. We give in Appendix C.1 a simplified proof, enabled by our stronger smoothness assumptions.

Theorem 2.2.

Under (A1-2), assume that $R$ is convex, that $\phi$ is $d$ -times continuously differentiable, that $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ has full support and that the projected gradient flow $(\nu_{t})_{t\geq 0}$ converges weakly to some $\nu_{\infty}\in\mathcal{M}_{+}(\Theta)$ . Then $\nu_{\infty}$ is a global minimizer of $J$ .

This theorem can be understood as a consistency result for conic particle gradient descent. It also raises several questions: under which conditions does $\nu_{\infty}$ exist? Can we guarantee a convergence rate? Can we relax the full support condition on the initialization? In this paper, we answer these questions positively in the particular case of non-degenerate sparse problems.

2.3 Conic particle gradient descent algorithm

Cone compatible retractions.

The definition of discrete gradient descent in a Riemannian setting requires introducing the notion of a retraction. In general, a retraction on a Riemannian manifold $\mathcal{M}$ with tangent bundle $T\mathcal{M}$ is a smooth map $\mathrm{Ret}:T\mathcal{M}\to\mathcal{M}$ such that its restriction $\mathrm{Ret}_{x}$ to $T_{x}\mathcal{M}$ satisfies $\mathrm{Ret}_{x}(0)=x$ and $\mathrm{d}\mathrm{Ret}_{x}(0)=\mathrm{id}_{T_{x}\mathcal{M}}$ , see [1, Def. 4.1.1]. In our case, we need to slightly adapt the definition to deal with the cone structure.

Definition 2.3.

We say that $\mathrm{Ret}:\Omega\times(\mathbb{R}\times T\Theta)\to\Omega$ is a retraction compatible with the cone structure, if it satisfies the following:

(i) (Retraction property) It is a proper retraction on $\Omega^{*}\coloneqq\mathbb{R}_{+}^{*}\times\Theta$ . It is not necessarily defined everywhere but there exists $C>0$ such that $\mathrm{Ret}_{(r,\theta)}(\delta r,\delta\theta)$ is defined as long as $\max\{|\delta r|/r,\|\delta\theta\|_{\theta}\}<C$ .

(ii) (Zero preserving) It satisfies $\mathrm{Ret}_{(0,\theta)}(\delta r,\delta\theta)=(0,f(\theta,\delta\theta))$ for some arbitrary measurable $f$ .

(iii) (Homogeneity) For any $r,\tilde{r}\in\mathbb{R}_{+}^{*}$ , $\theta\in\Theta$ , $\delta r\in\mathbb{R}$ and $\delta\theta\in T_{\theta}\Theta$ satisfying $\max\{|\delta r|/r,\|\delta\theta\|_{\theta}\}<C$ , denoting $(r_{1},\theta_{1})=\mathrm{Ret}_{(r,\theta)}(r\delta r,\delta\theta)$ and $(r_{2},\theta_{2})=\mathrm{Ret}_{(\tilde{r},\theta)}(\tilde{r}\delta r,\delta\theta)$ , then $\theta_{1}=\theta_{2}$ and $\tilde{r}\cdot r_{1}=r\cdot r_{2}$ .

These properties are satisfied in the following examples, where $\widetilde{\mathrm{Ret}}$ denotes any retraction defined on $\Theta$ (we give them names for future reference):

– the canonical retraction $\mathrm{Ret}_{(r,\theta)}(\delta r,\delta\theta)=(r+\delta r,\widetilde{\mathrm{Ret}}_{\theta}(\delta\theta))$ (here $C=1$ );

– the mirror retraction $\mathrm{Ret}_{(r,\theta)}(\delta r,\delta\theta)=(r\exp(\delta r/r),\widetilde{\mathrm{Ret}}_{\theta}(\delta\theta))$ , which recovers a version of mirror descent when $\delta\theta=0$ (here $C=+\infty$ );

– the induced retraction when $\Theta$ is the $d$ -sphere, which is the retraction induced by the isometric embedding into $\mathbb{R}^{d+1}$ , see Section 2.2. It is defined as $\mathrm{Ret}_{(r,\theta)}(\delta r,\delta\theta)=(\|u\|,u/\|u\|)$ where $u=r\theta+\theta\delta r+r\delta\theta\in\mathbb{R}^{d+1}$ (here $C=1$ ). With this retraction, the iterates of gradient descent on $\Omega$ with the cone metric can be identified with the iterates of (Euclidean) gradient descent in $\mathbb{R}^{d+1}$ .
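The three retractions listed above can be sketched on the unit sphere, together with a numerical check of the homogeneity property (iii) of Definition 2.3. This is a minimal illustration under assumptions: the projection map `sphere_retraction` is one standard choice for $\widetilde{\mathrm{Ret}}$ (the text allows any retraction on $\Theta$ ), and all function names are introduced here for illustration only.

```python
import numpy as np

def sphere_retraction(theta, dtheta):
    """Projection retraction on the unit sphere (an assumed choice for Ret~)."""
    v = theta + dtheta
    return v / np.linalg.norm(v)

def canonical(r, theta, dr, dtheta):
    return r + dr, sphere_retraction(theta, dtheta)

def mirror(r, theta, dr, dtheta):
    # Only evaluated for r > 0 here; the case r = 0 is covered by property (ii).
    return r * np.exp(dr / r), sphere_retraction(theta, dtheta)

def induced(r, theta, dr, dtheta):
    u = r * theta + theta * dr + r * dtheta
    norm_u = np.linalg.norm(u)
    return norm_u, u / norm_u

def homogeneity_defect(ret, r, r_tilde, theta, dr, dtheta):
    """Largest violation of theta_1 = theta_2 and r_tilde * r_1 = r * r_2."""
    r1, th1 = ret(r, theta, r * dr, dtheta)
    r2, th2 = ret(r_tilde, theta, r_tilde * dr, dtheta)
    return max(abs(r_tilde * r1 - r * r2), np.max(np.abs(th1 - th2)))

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
theta /= np.linalg.norm(theta)
dtheta = rng.normal(size=3)
dtheta -= theta * (theta @ dtheta)   # project onto the tangent space at theta
dtheta *= 0.1
defects = {f.__name__: homogeneity_defect(f, 0.5, 2.0, theta, -0.3, dtheta)
           for f in (canonical, mirror, induced)}
```

All three defects vanish up to rounding, which reflects that in each case the radial output scales linearly with $r$ while the angular output does not depend on $r$ .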

Gradient descent in $\mathcal{P}_{2}(\Omega)$ .

Given a retraction $\mathrm{Ret}$ compatible with the cone structure, we define the gradient descent as follows. Let $\mu_{0}\in\mathcal{P}_{2}(\Omega)$ and for $k\in\mathbb{N}$ define recursively

$$\mu_{k+1}=(T_{k})_{\#}\mu_{k}\tag{11}$$

where $T_{k}(r,\theta)=\mathrm{Ret}_{(r,\theta)}(-2\alpha rJ^{\prime}_{\nu_{k}}(\theta),-\beta\nabla J^{\prime}_{\nu_{k}}(\theta))$ and $\nu_{k}=\mathsf{h}\mu_{k}$ . The notation $\#$ stands for the pushforward operator: the pushforward measure $T_{\#}\mu$ is characterized by $\int\psi\mathrm{d}(T_{\#}\mu)=\int(\psi\circ T)\mathrm{d}\mu$ for any continuous function $\psi$ . When $\mu_{0}$ is a finite discrete probability measure with uniform weights, this gives Algorithm 1, which is a gradient descent for $F_{m}$ in the cone metric.
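For a discrete measure with uniform weights, one step of this recursion with the mirror retraction can be sketched as follows. This is an illustrative toy, not the paper's Algorithm 1: it assumes $\Theta$ is the circle, a hypothetical bump-shaped feature map `phi`, the square loss, and the first variation $J^{\prime}_{\nu}(\theta)=\langle\phi(\theta),\nabla R(\int\phi\,\mathrm{d}\nu)\rangle+\lambda$ . All numerical values (`n`, `m`, `lam`, `kappa`, step-sizes) are arbitrary.

```python
import numpy as np

# Toy instance (illustrative assumptions): Theta = circle [0, 2*pi),
# features phi(theta) in R^n, square loss R(f) = ||f - y||^2 / (2n).
n, m, lam, kappa = 64, 20, 0.1, 10.0
grid = np.linspace(0.0, 2 * np.pi, n, endpoint=False)

def phi(theta):
    """Periodic bump centred at each theta, sampled on the grid. Shape (len(theta), n)."""
    return np.exp(kappa * (np.cos(theta[:, None] - grid[None, :]) - 1.0))

def dphi(theta):
    """Derivative of phi with respect to theta. Shape (len(theta), n)."""
    return -kappa * np.sin(theta[:, None] - grid[None, :]) * phi(theta)

# Synthetic observations generated by two atoms (hypothetical ground truth).
y = 1.0 * phi(np.array([1.0]))[0] + 0.5 * phi(np.array([4.0]))[0]

def objective(r, theta):
    f = (r**2) @ phi(theta) / len(r)            # f = integral of phi against nu
    return 0.5 * np.mean((f - y) ** 2) + lam * np.mean(r**2)

def step(r, theta, alpha, beta):
    f = (r**2) @ phi(theta) / len(r)
    grad_R = (f - y) / n                        # gradient of the square loss at f
    J_prime = phi(theta) @ grad_R + lam         # first variation at each particle
    dJ_prime = dphi(theta) @ grad_R             # its derivative in theta
    r_new = r * np.exp(-2.0 * alpha * J_prime)  # mirror retraction in r
    theta_new = np.mod(theta - beta * dJ_prime, 2 * np.pi)
    return r_new, theta_new

r = np.ones(m)
theta = np.linspace(0.0, 2 * np.pi, m, endpoint=False)
values = [objective(r, theta)]
for _ in range(200):
    r, theta = step(r, theta, alpha=0.05, beta=0.05)
    values.append(objective(r, theta))
```

With these small step-sizes the recorded objective values are non-increasing. Particles where $J^{\prime}_{\nu_{k}}$ is negative gain mass and the others lose mass, while positions drift along $-\nabla J^{\prime}_{\nu_{k}}$ .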

Transport-growth interpretation.

Just like the continuous-time gradient flow, the discrete time gradient descent has a corresponding projected dynamics in $\mathcal{M}_{+}(\Theta)$ . Here the equivalence also relies on the properties of compatible retractions.

Proposition 2.4.

Under (A1-2), let $\mathrm{Ret}$ be a retraction compatible with the cone structure and let $\mu_{k+1}=(T_{k})_{\#}\mu_{k}$ for some $\mu_{k}\in\mathcal{P}_{2}(\Omega)$ . Let $(T_{k}^{r}(\theta),T_{k}^{\theta}(\theta))\coloneqq T_{k}(1,\theta)$ . Then, the projected iterates $(\nu_{k+1},\nu_{k})\coloneqq(\mathsf{h}\mu_{k+1},\mathsf{h}\mu_{k})$ satisfy

$$\nu_{k+1}=(T_{k}^{\theta})_{\#}\big((T_{k}^{r})^{2}\nu_{k}\big).\tag{12}$$

Proof.

First, remark that by Property (i) of Definition 2.3, $T_{k}$ is well-defined if $\max\{\alpha,\beta\}$ is small enough and that $T_{k}\in L^{2}(\mu_{k};\Omega)$ so $\mu_{k+1}\in\mathcal{P}_{2}(\Omega)$ . For any continuous function $\psi:\Theta\to\mathbb{R}$ , using Properties (ii)-(iii) of Definition 2.3, we get

[TABLE]

which proves the claim. ∎
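As a worked instance of this argument, consider a discrete lifted measure. The display below is a sketch that assumes the projection $\mathsf{h}$ acts as $\int\psi\,\mathrm{d}(\mathsf{h}\mu)=\int r^{2}\psi(\theta)\,\mathrm{d}\mu(r,\theta)$ , consistently with the form of $\nu^{\star}$ used later in (A4).

```latex
\begin{align*}
\mu_k &= \frac1m\sum_{i=1}^m \delta_{(r_i,\theta_i)},
&\nu_k &= \mathsf{h}\mu_k = \frac1m\sum_{i=1}^m r_i^2\,\delta_{\theta_i},\\
T_k(r_i,\theta_i) &= \big(r_i\,T_k^r(\theta_i),\,T_k^\theta(\theta_i)\big)
&&\text{(homogeneity (iii) with $\tilde r = 1$)},\\
\nu_{k+1} &= \mathsf{h}\big((T_k)_\#\mu_k\big)
= \frac1m\sum_{i=1}^m r_i^2\,T_k^r(\theta_i)^2\,\delta_{T_k^\theta(\theta_i)}
&&= (T_k^\theta)_\#\big((T_k^r)^2\,\nu_k\big).
\end{align*}
```

Each atom is thus transported by $T_{k}^{\theta}$ and its mass is multiplied by $(T_{k}^{r})^{2}$ , which is the transport-growth reading of the update.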

Descent property of conic particle gradient descent.

The following lemma shows that, for sufficiently small step-sizes, the iterates (11) are well-defined and monotonically decrease the objective. As usual in optimization, this property is useful to convert results on gradient flows into results on gradient descent.

Lemma 2.5 (Descent property).

Assume (A1-2) and let $\mathrm{Ret}$ be a retraction compatible with the cone structure (Definition 2.3). For any $J_{\max}\geq J^{\star}$ , there exists $\eta_{\max}>0$ such that if $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ satisfies $J(\nu_{0})\leq J_{\max}$ then the gradient descent iteration with $\max\{\alpha,\beta\}\leq\eta_{\max}$ is well defined for all $k\geq 0$ and satisfies

$$J(\nu_{k+1})-J(\nu_{k})\leq-\frac{1}{2}\|g_{\nu_{k}}\|^{2}_{L^{2}(\nu_{k})}.$$

Proof.

Let us first look at one step starting from $\nu_{k}\in\mathcal{M}_{+}(\Theta)$ . By Property (i) of Definition 2.3, there exists $\eta_{\max}>0$ such that this iteration is well-defined as long as $\max\{\alpha,\beta\}\leq\eta_{\max}$ . We first consider $\nu_{k}(\Theta)$ and $\|J^{\prime}_{\nu_{k}}\|_{\mathcal{C}^{2}}$ as constants, where $\|\phi\|_{\mathcal{C}^{2}}=\max\{\|\phi\|_{\infty},\|\nabla\phi\|_{\infty},\|\nabla^{2}\phi\|_{\infty}\}$ (we will see later that they can be upper bounded independently of the iteration $k$ ). With the notations of Proposition 2.4, we have $\nu_{k+1}=(T^{\theta}_{k})_{\#}((T^{r}_{k})^{2}\nu_{k})$ where $T_{k}^{r}(\theta)=1-2\alpha J^{\prime}_{\nu_{k}}(\theta)+O(\alpha^{2}J^{\prime}_{\nu_{k}}(\theta)^{2})$ and, in normal coordinates, $T_{k}^{\theta}(\theta)=\theta-\beta\nabla J^{\prime}_{\nu_{k}}(\theta)+O(\beta^{2}\|\nabla J^{\prime}_{\nu_{k}}(\theta)\|^{2})$ where the hidden constants are uniform in $\theta$ . It follows that for any twice continuously differentiable $\psi\in\mathcal{C}^{2}(\Theta;\mathbb{R})$ , it holds

[TABLE]

In particular, using this expression with $\psi_{f}(\theta)=\langle\phi(\theta),f\rangle$ where $\|f\|\leq 1$ (which have uniformly bounded norms $\|\psi_{f}\|_{\mathcal{C}^{2}}$ under our assumptions), we get that

[TABLE]

By a first order expansion of $R$ , we have for $f,f^{\prime}\in\mathcal{F}$ , $R(f^{\prime})-R(f)=\langle f^{\prime}-f,\nabla R(f)\rangle+O(\|f^{\prime}-f\|^{2})$ . Thus, using the expression of $J^{\prime}_{\nu}$ from Eq. (3), it follows

[TABLE]

So there exists $\eta_{\max}$ such that if $\max\{\alpha,\beta\}\leq\eta_{\max}$ , we have $J(\nu_{k+1})-J(\nu_{k})\leq-\frac{1}{2}\|g_{\nu_{k}}\|^{2}_{L^{2}(\nu_{k})}$ . Finally, since we have assumed that $\lambda>0$ and $\nabla R$ is bounded on sublevel sets, the quantities $\sup_{J(\nu)\leq J(\nu_{k})}\nu(\Theta)$ and $\sup_{J(\nu)\leq J(\nu_{k})}\|J^{\prime}_{\nu}\|_{\mathcal{C}^{2}}$ are finite. By the decrease property we just proved, these quantities do not increase after one iteration if $\max\{\alpha,\beta\}\leq\eta_{\max}$ . So $\eta_{\max}$ , which depends on these quantities, can be chosen independently of $k\geq 0$ . ∎

3 Exponential local convergence

We now proceed to the theoretical analysis of the projected gradient flow (9) and projected gradient descent (12) in the conic setting. In light of Propositions 2.1 and 2.4, these dynamics correspond to the gradient flow and gradient descent of $F$ , seen through the projection operator $\mathsf{h}$ .

3.1 Non-degeneracy assumptions

In order to derive global optimality conditions, we assume the following.

(A3)

The loss $R$ is convex.

Commonly used losses that satisfy the smoothness and convexity conditions are the square loss and the logistic loss. Under this assumption, we have existence of minimizers and a global optimality condition.

Proposition 3.1 (Optimality condition).

Under (A1) and (A3), problem (1) admits minimizers. Moreover, a measure $\nu^{\star}\in\mathcal{M}_{+}(\Theta)$ is a minimizer if and only if it holds $J^{\prime}_{\nu^{\star}}(\theta)\geq 0$ for all $\theta\in\Theta$ and $J^{\prime}_{\nu^{\star}}(\theta)=0$ whenever $\theta$ in the support of $\nu^{\star}$ .

Proof.

As $\lambda$ is assumed positive, the sublevel sets of $J$ on $\mathcal{M}_{+}(\Theta)$ are bounded in total variation, and are thus weakly pre-compact. It follows that any minimizing sequence for $J$ admits at least one weak limit point $\nu^{\star}$ , which is a minimizer of (1) since $J$ is weakly continuous. The stated optimality condition is equivalent to having $\int J^{\prime}_{\nu^{\star}}\mathrm{d}(\nu-\nu^{\star})\geq 0$ for all $\nu\in\mathcal{M}_{+}(\Theta)$ . The latter is a sufficient optimality condition since by convexity of $J$ , $J(\nu)-J(\nu^{\star})\geq\int J^{\prime}_{\nu^{\star}}\mathrm{d}(\nu-\nu^{\star})$ . It is also necessary since it holds $\frac{d}{d\epsilon}J((1-\epsilon)\nu^{\star}+\epsilon\nu)|_{\epsilon=0^{+}}=\int J^{\prime}_{\nu^{\star}}\mathrm{d}(\nu-\nu^{\star})$ . ∎
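The optimality condition of Proposition 3.1 is easy to test numerically on a candidate measure. The sketch below uses a hypothetical one-dimensional instance (scalar feature $\phi=\cos$ on the circle, square loss with target $y=2$ , $\lambda=1/2$ ), for which $\nu=\frac{3}{2}\delta_{0}$ gives $J^{\prime}_{\nu}(\theta)=\frac{1}{2}(1-\cos\theta)$ . It assumes the first variation has the form $J^{\prime}_{\nu}(\theta)=\langle\phi(\theta),\nabla R(\int\phi\,\mathrm{d}\nu)\rangle+\lambda$ .

```python
import numpy as np

lam, y = 0.5, 2.0
phi = np.cos                                   # scalar feature on the circle (toy choice)

def first_variation(weights, atoms, theta):
    """J'_nu(theta) = phi(theta) * R'(f) + lam, with R(f) = (f - y)^2 / 2."""
    f = np.sum(weights * phi(atoms))           # f = integral of phi against nu
    return phi(theta) * (f - y) + lam

# Candidate minimizer nu = 1.5 * delta_0.
weights, atoms = np.array([1.5]), np.array([0.0])
theta_grid = np.linspace(-np.pi, np.pi, 2001)

min_on_grid = first_variation(weights, atoms, theta_grid).min()   # should be >= 0
value_on_support = first_variation(weights, atoms, atoms)         # should be 0
```

Both conditions hold here, so this candidate is a global minimizer of the toy problem. On a grid, the check is of course only as fine as the grid resolution.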

Sparse minimizer.

Our local analysis requires sparsity of the minimizers of the objective $J$ , which can be guaranteed a priori in several settings (e.g. [28, 10]).

(A4)

Problem (1) admits a unique global minimizer on $\mathcal{M}_{+}(\Theta)$ which is of the form $\nu^{\star}=\sum_{i=1}^{m^{\star}}r_{i}^{2}\delta_{\theta_{i}}$ with $\nu^{\star}(\Theta)>0$ . We denote $f^{\star}\coloneqq\int\phi\mathrm{d}\nu^{\star}=\sum_{i=1}^{m^{\star}}r_{i}^{2}\phi(\theta_{i})$ .

Without loss of generality, we assume $r_{i}>0$ for all $i$ and $\theta_{i}\neq\theta_{i^{\prime}}$ whenever $i\neq i^{\prime}$ , so that $(r_{i},\theta_{i})_{i=1}^{m^{\star}}$ is uniquely defined, up to re-ordering. Let us fix from now on normal coordinate frames on the neighborhood of each $\theta_{i}$ . This allows us to identify tensors at $\theta_{i}$ with their expression in coordinates and also induces a set of coordinates on the direct sum of the tangent spaces $T_{\theta_{i}}\Theta$ , which is of dimension $m^{\star}\times d$ .

Kernels and non-degeneracy.

We define the global kernel $K\in\mathbb{R}^{(m^{\star}\times(1+d))^{2}}$ by

[TABLE]

where $\bar{\nabla}\phi\coloneqq(2\alpha\phi,\beta\nabla\phi)$ can be interpreted as the gradient of $\mathsf{h}\phi$ at $(1,\theta)$ . Remark that $K$ is defined via the quadratic form associated to the Hessian of $R$ at $f^{\star}$ . This interaction kernel $K$ appears naturally in the various statistical and optimization analyses of the minimization problem under consideration [28, 56]. We also use the following notation for the local kernels, for $i\in\{1,\dots,m^{\star}\}$ ,

[TABLE]

expressed in local coordinates. In order to simplify notations, we concatenate these matrices in a large matrix $H\in\mathbb{R}^{(m^{\star}\times(1+d))^{2}}$ of the same size as $K$ defined as

[TABLE]

where here and in the proofs, we use [math] to label the $r$ coordinate. The local analysis will be carried out under the following non-degeneracy assumptions.

(A5)

The minimizer $\nu^{\star}$ is non-degenerate in the sense that $\nabla^{2}R(f^{\star})$ is positive definite and, calling $\sigma_{\min}(A)$ the smallest singular value of a linear operator $A$ , we have global curvature $\sigma_{\min}(K)>0$ , local curvature $\sigma_{\min}(H)=\min_{i}\sigma_{\min}(H_{i})>0$ , and strict slackness, i.e. the only points where $J^{\prime}_{\nu^{\star}}$ vanishes are $(\theta_{i})_{i=1}^{m^{\star}}$ .

The first property is always satisfied if $R$ is strictly convex. The second property is satisfied when the kernel associated to the feature function $\bar{\nabla}\phi$ is positive definite. The last two assumptions unfortunately depend on an a priori unknown object $J^{\prime}_{\nu^{\star}}$ , but are often required to perform analysis of Problem (1) [29, 28]. Yet, in some cases, they can be guaranteed to hold, see e.g. [50, 55]. In spite of this drawback, the local analysis leads to interesting qualitative insights on the dynamics in practice, see Section 5.

3.2 Convergence in $\mathcal{M}_{+}(\Theta)$

A first consequence of these assumptions is that convergence in value implies convergence to minimizers. The distance on $\mathcal{M}_{+}(\Theta)$ that naturally appears in the analysis is the Wasserstein-Fisher-Rao, a.k.a. Hellinger-Kantorovich metric $\widehat{W}_{2}$ , which is the extension of the Wasserstein $W_{2}$ metric to unnormalized measures. It admits many equivalent definitions [42, 38, 20], the most suitable to our context being [42, Thm. 7.20]

[TABLE]

where the Wasserstein distance on $\Omega$ is defined relative to the cone metric (in this paragraph, with $\alpha=\beta=1$ ). The proof of the following result involves the construction of a transport map in the lifted space $\mathcal{P}_{2}(\Omega)$ and is postponed to Appendix D.4.

Proposition 3.2.

Under (A1-5), for all $J_{\max}\geq J^{\star}$ , there exist $C,C^{\prime}>0$ such that if $\nu\in\mathcal{M}_{+}(\Theta)$ satisfies $J(\nu)\leq J_{\max}$ then $\|\nu-\nu^{\star}\|^{*}_{\mathrm{BL}}\leq C\widehat{W}_{2}(\nu,\nu^{\star})\leq C^{\prime}(J(\nu)-J^{\star})^{\frac{1}{2}}$ .

3.3 Sharpness of the objective

Our first main result is a lower bound on the squared norm of the gradient in terms of the sub-optimality gap, an inequality known as sharpness, or Polyak-Łojasiewicz inequality [49, 37], which is a special case of Łojasiewicz gradient inequality. It involves the $L^{2}(\nu)$ norm of the gradient, which we denote for $\nu=\mathsf{h}\mu$ by

[TABLE]

Theorem 3.3 (Sharpness).

Under (A1-5), there exists $J_{0}>J^{\star}$ and $\kappa_{0}>0$ , such that for all $\nu\in\mathcal{M}_{+}(\Theta)$ satisfying $J(\nu)\leq J_{0}$ and $\alpha,\beta>0$ , one has

[TABLE]

While the objective is non-convex in the Wasserstein geometry and typically has infinitely many bad stationary points, this inequality guarantees exponential convergence to global minimizers of various gradient-based dynamics as long as their initialization $\nu_{0}$ has a small enough objective value. Crucially, the specific structure of $\nu$ does not matter, beyond the fact that it is close enough to optimality: it applies indifferently to discrete and absolutely continuous measures. Once Theorem 3.3 is established, it is straightforward to prove exponential convergence of gradient flow and gradient descent.

Corollary 3.4 (Local convergence of gradient flow).

Under (A1-5), let $J_{0}$ and $\kappa_{0}$ be given by Theorem 3.3. Consider $(\nu_{t})_{t\geq 0}$ a projected gradient flow for $J$ as in Eq. (9). If $J(\nu_{0})\leq J_{0}$ then

[TABLE]

Proof.

By Theorem 3.3 and direct computations, one has

[TABLE]

and the result follows by Grönwall’s lemma. ∎

Corollary 3.5 (Local convergence of gradient descent).

Assume (A1-5), let $J_{0}$ and $\kappa_{0}$ be given by Theorem 3.3, and let $\mathrm{Ret}$ be a retraction compatible with the cone structure (Definition 2.3). There exists $\eta_{\max}>0$ such that for any projected gradient descent $(\nu_{k})_{k\geq 0}$ for $J$ following recursion (11), if $J(\nu_{0})\leq J_{0}$ and $\max\{\alpha,\beta\}\leq\eta_{\max}$ , then

$$J(\nu_{k})-J^{\star}\leq(J(\nu_{0})-J^{\star})\,(1-\kappa_{0}\min\{\alpha,\beta\})^{k}.$$

Proof.

By Lemma 2.5, there exists $\eta_{\max}$ such that if $\max\{\alpha,\beta\}\leq\eta_{\max}$ , then $J(\nu_{k+1})-J(\nu_{k})\leq-\frac{1}{2}\|g_{\nu_{k}}\|^{2}_{L^{2}(\nu_{k})}$ . Combining this inequality with Theorem 3.3, one has $J(\nu_{k+1})-J(\nu_{k})\leq-\kappa_{0}\min\{\alpha,\beta\}(J(\nu_{k})-J^{\star})$ . Rearranging the terms, we get $J(\nu_{k+1})-J^{\star}\leq(1-\kappa_{0}\min\{\alpha,\beta\})(J(\nu_{k})-J^{\star})$ and the result follows by recursion. ∎
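Unrolling this recursion gives the iteration count behind the logarithmic complexity announced in the abstract. The following short derivation is a sketch: it assumes $0<\kappa_{0}\min\{\alpha,\beta\}\leq 1$ and uses the elementary bound $1-x\leq e^{-x}$ , with $\varepsilon>0$ denoting a target accuracy (a symbol introduced here).

```latex
\begin{align*}
J(\nu_k)-J^\star
 &\le (1-\kappa_0\min\{\alpha,\beta\})^{k}\,\big(J(\nu_0)-J^\star\big)
 \le \exp\!\big(-\kappa_0\min\{\alpha,\beta\}\,k\big)\,\big(J(\nu_0)-J^\star\big),\\
J(\nu_k)-J^\star \le \varepsilon
 \quad&\text{as soon as}\quad
 k \ \ge\ \frac{1}{\kappa_0\min\{\alpha,\beta\}}\,
 \log\frac{J(\nu_0)-J^\star}{\varepsilon}.
\end{align*}
```

The number of iterations therefore grows like $\log(1/\varepsilon)$ , with a prefactor governed by $\kappa_{0}$ and the smaller of the two step-sizes.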

3.4 Proof strategy for the sharpness theorem

The proof of Theorem 3.3, in Appendix D, is based on a local expansion of $J(\nu)$ in terms of some local moments of $\nu$ . For a radius $\tau>0$ (that shall be fixed at some small enough value in the course of the proof), we define the sets for $i\in\{1,\dots,m^{\star}\}$ ,

[TABLE]

We assume that $\tau$ is smaller than $1$ and small enough so that these sets together with $\Theta_{0}\coloneqq\Theta\setminus\cup_{i=1}^{m^{\star}}\Theta_{i}$ form a partition of $\Theta$ and that the exponential map at $\theta_{i}$ has injectivity radius larger than $\tau$ , for $i\in\{1,\dots,m^{\star}\}$ . We then say that $\tau$ is an admissible radius.

Definition 3.6 (Local moments).

Given an admissible radius $\tau>0$ and a measure $\nu\in\mathcal{M}_{+}(\Theta)$ , we define for $i\in\{0,\dots,m^{\star}\}$ the local masses $\bar{r}_{i}^{2}=\nu(\Theta_{i})$ and the local means $\bar{\theta}_{i}\coloneqq\frac{1}{\bar{r}_{i}^{2}}\int_{\Theta_{i}}\theta\mathrm{d}\nu(\theta)$ if $\nu(\Theta_{i})>0$ and $\bar{\theta}_{i}=\theta_{i}$ otherwise. Finally, we define for $i\in\{1,\dots,m^{\star}\}$ the weighted biases

[TABLE]

and the weighted covariances $\Sigma_{i}\coloneqq\frac{1}{\bar{r}_{i}^{2}\beta^{2}}\int_{\Theta_{i}}(\theta-\bar{\theta}_{i})(\theta-\bar{\theta}_{i})^{\intercal}\mathrm{d}\nu(\theta)$ .

If $\nu$ has only $1$ atom in each $\Theta_{i}$ then its spatial coordinate is $\bar{\theta}_{i}$ and $\Sigma_{i}=0$ . When moreover $\nu(\Theta_{0})=0$ , the optimization reduces to a more classical gradient flow in $\Omega^{m^{\star}}$ whose local behavior has already been studied [56, 29], but obtaining measures of this form is typically almost as hard as solving the original problem. This decomposition is reminiscent of proof techniques used to study log-Sobolev inequalities (another type of sharpness inequality in Wasserstein space [7]) in the small temperature regime [46].
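For a discrete measure, the local masses, means and weighted covariances of Definition 3.6 can be computed directly. The sketch below makes several assumptions: it works in flat (normal) coordinates, it takes each $\Theta_{i}$ to be the open ball of radius $\tau$ around $\theta_{i}$ , and it sets the covariance to zero when a cell is empty (a convention chosen here). The weighted biases are not computed, and the function name `local_moments` is introduced for illustration.

```python
import numpy as np

def local_moments(weights, atoms, centers, tau, beta=1.0):
    """Local (mass, mean, covariance) of nu = sum_j weights[j] * delta_{atoms[j]}.

    Assumes flat coordinates and Theta_i = open ball of radius tau around centers[i].
    """
    out = []
    for c in centers:
        mask = np.linalg.norm(atoms - c, axis=1) < tau
        w, x = weights[mask], atoms[mask]
        mass = w.sum()
        if mass > 0:
            mean = (w[:, None] * x).sum(axis=0) / mass
            dev = x - mean
            cov = np.einsum("j,ja,jb->ab", w, dev, dev) / (mass * beta**2)
        else:
            # Empty cell: the mean defaults to the centre, as in Definition 3.6.
            mean, cov = c.copy(), np.zeros((len(c), len(c)))
        out.append((mass, mean, cov))
    return out

centers = np.array([[0.0, 0.0], [1.0, 0.0]])
atoms = np.array([[0.05, 0.0], [-0.05, 0.0], [1.02, 0.01]])
weights = np.array([1.0, 1.0, 2.0])
moments = local_moments(weights, atoms, centers, tau=0.2)
```

In this example the first cell contains two atoms and has a nonzero covariance, while the second contains a single atom, so its local mean is the atom itself and its covariance vanishes, as noted above.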

It turns out that the local moments of Definition 3.6 are sufficient to characterize the behavior of $J$ near optimality. In particular, we have the following approximations for $J$ and its gradient around optimality. These formulas are obtained as an intermediate step in the proof of Theorem 3.3 and follow by combining the bounds of Proposition D.4 and Proposition D.5 with Lemma D.3.

Proposition 3.7 (Local expansion).

Assuming (A1-5), for any $\nu\in\mathcal{M}_{+}(\Theta)$ it holds

[TABLE]

where $\|g_{\nu}\|^{2}_{L^{2}(\nu|_{\Theta_{0}})}=\int_{\Theta_{0}}(\alpha|J^{\prime}_{\nu}|^{2}+\beta\|\nabla J^{\prime}_{\nu}\|^{2}_{\theta})\mathrm{d}\nu$ and $\mathrm{err}(\tau,\Delta)=O(\Delta\tau+\Delta^{\frac{3}{2}}\tau^{-6})$ .

3.5 Discussion on the local behavior

Let us now explore what the expansion from Proposition 3.7 teaches us about the local behavior of the dynamics. In order to simplify the discussion, let us fix a small admissible radius $\tau_{0}$ and ignore the error terms in Proposition 3.7.

Effect of over-parameterization.

When there is no over-parameterization ( $m=m^{\star}$ ) and we have a single particle in the neighborhood $\Theta_{i}$ of each optimal particle, then there is no local variance: $\Sigma_{i}=0$ for $i=1,\dots,m^{\star}$ . In this case, we recover the Taylor expansion of $F_{m^{\star}}$ around its minimizer

[TABLE]

and the local convergence rate is dictated by the conditioning of $(K+H)$ . Now, for an arbitrary over-parameterization i.e. $\nu\in\mathcal{M}_{+}(\Theta)$ but with the support of the solution approximately identified, i.e. $\nu(\Theta_{0})=0$ , the objective is still entirely characterized locally by the local moments of $\nu$ , since

[TABLE]

This expression gives a clear picture of the energy landscape, so let us comment on it. If we think of the particles in $\Theta_{i}$ as a cluster, then the first term consists in a global interaction between the clusters, which only depends on the biases of each cluster relative to their respective ground truth particles. The two other terms are local interactions within each cluster, which are due to the local curvature of $J^{\prime}_{\nu^{\star}}$ at each $\theta_{i}$ . Note in particular that the only term in this expansion that penalizes the variance of each cluster $\Sigma_{i}$ consists of local interactions.

Effect of the regularization parameter.

In this paper, the assumption that $\lambda$ is non-zero is not crucial as such. Instead the crucial assumption for the local analysis is (A5). Still, this assumption is intimately connected to the regularization: in the signed case (detailed in Appendix A), it is necessary that $\lambda>0$ to have (A5), because with $\lambda=0$ , the minimizer $\nu^{\star}$ is a global minimizer in the space of signed measures and thus the global optimality condition $J^{\prime}_{\nu^{\star}}=0$ holds. In fact, a finer analysis of the behavior as $\lambda\to 0$ is possible in the signed case: it can be shown that one has (emphasizing the dependency in $\lambda$ in the notation):

[TABLE]

for some $K_{0}$ , $H_{0}$ and $J^{\prime}_{0}$ [28, Prop. 1 and Thm. 2] (where the result is proved for $R$ being the square loss but can be directly generalized to $R$ smooth and strongly convex around the minimizer). Under the assumption that $J^{\prime}_{0}$ is non-degenerate in the sense of (A5), as soon as $\nu(\Theta_{0})>0$ or $\Sigma_{i}\neq 0$ for some $i\in\{1,\dots,m^{\star}\}$ , the local rate $\kappa_{0}$ is thus of order $\lambda$ and for $\lambda=0$ , the exponential convergence rate is lost. This shows that regularization is necessary for fast local convergence in the signed case, and in particular – remembering the previous paragraph – for the variance of each cluster of particles to vanish quickly. Note that it is an open question to even show local convergence when (A5) does not hold.

Choice of the metric and conditioning.

While our statements, in particular Corollary 3.5, seem to imply that it is best to choose $\alpha=\beta$ , this is in fact just an artefact of the way the upper bounds are presented, with some hidden constants. Instead, these parameters should be chosen, as usual, so as to make the local expression of $J$ above well-conditioned. Without additional information, a possible heuristic is to make the block diagonal matrix $\operatorname{diag}(K+H,H)$ well-conditioned by choosing $(\alpha,\beta)$ satisfying $2\alpha\|\phi\|_{\infty}\approx\beta\mathrm{Lip}(\phi)$ .
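The balancing heuristic $2\alpha\|\phi\|_{\infty}\approx\beta\mathrm{Lip}(\phi)$ can be turned into a small helper when $\Theta$ is one-dimensional. This is a rough sketch under assumptions: both constants are estimated on a uniform grid of the circle (finite differences for the Lipschitz constant), the feature map `phi` is a hypothetical example, and `balanced_beta` is a name introduced here.

```python
import numpy as np

def balanced_beta(phi, alpha, num=10000):
    """Return beta such that 2 * alpha * sup|phi| = beta * Lip(phi), estimated on a grid."""
    thetas = np.linspace(0.0, 2 * np.pi, num + 1)
    values = np.stack([phi(t) for t in thetas])          # shape (num + 1, dim of F)
    sup_norm = np.linalg.norm(values, axis=1).max()
    h = thetas[1] - thetas[0]
    lipschitz = (np.linalg.norm(np.diff(values, axis=0), axis=1) / h).max()
    return 2.0 * alpha * sup_norm / lipschitz

# Hypothetical feature map with sup norm 1 and Lipschitz constant 3.
phi = lambda t: np.array([np.cos(3 * t), np.sin(3 * t)])
beta = balanced_beta(phi, alpha=0.3)
```

For this feature map the estimate is close to $\beta=2\alpha/3=0.2$ . In practice such a value would only serve as a starting point for a parameter search.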

Polynomial dependency.

It can be seen from the proof of Theorem 3.3 that $(J_{0}-J^{\star})^{-1}$ and $\kappa_{0}^{-1}$ depend polynomially on the characteristics of the problem, which are the regularization $\lambda$ , the regularity parameters of $\phi$ and $R$ , the ratio $\max_{i}r_{i}/\min_{i}r_{i}$ , the inverses of $\sigma_{\min}(\nabla^{2}R(f^{\star}))$ , $\sigma_{\min}(H)$ , $\sigma_{\min}(K)$ and finally the quantity $v^{\star}$ that quantifies the strict slackness assumption, in the following sense: $v^{\star}>0$ is such that for any local minimum $\theta$ of $J^{\prime}_{\nu^{\star}}$ , either $\theta=\theta_{i}$ for some $i\in\{1,\dots,m^{\star}\}$ or $J^{\prime}_{\nu^{\star}}(\theta)\geq v^{\star}$ .

4 Quantitative global convergence

There are several convex optimization-based algorithms that are known to return approximate minimizers of $J$ which are mixtures of atoms (with typically $m>m^{\star}$ ) with a guaranteed complexity, see Section 1.2. Starting from any such approximate minimizer, the results of the previous section imply that conic particle gradient descent converges exponentially fast to minimizers of $J$ . However, such a “two-algorithms” approach comes with a drawback: one has to decide when to switch from one algorithm to another. In this section, we show that it is possible to reach global optimality by only performing non-convex gradient descent. This is true under two main conditions: (i) the initialization samples $\Theta$ densely enough, and (ii) the ratio $\beta/\alpha$ is small, at least in the early stages of the algorithm.

4.1 Statement of the main results

In order to state the condition on the initialization, we first choose a reference measure $\rho\in\mathcal{M}_{+}(\Theta)$ with a smooth positive density, also denoted by $\rho$ , which represents our prior knowledge about the solution $\nu^{\star}$ . We introduce the quantity (analogous to a log-likelihood)

[TABLE]

It quantifies how good $\rho$ is as a prior for the unknown minimizer $\nu^{\star}$ and we will see that our convergence bounds are better when $\bar{\mathcal{H}}(\nu^{\star},\rho)$ is smaller. If nothing is known about the optimal positions $\theta_{i}$ , we should choose $\rho$ as a uniform density $\alpha\!\operatorname{vol}$ over $\Theta$ for some $\alpha>0$ . Minimizing $\bar{\mathcal{H}}(\nu^{\star},\alpha\operatorname{vol})$ in $\alpha$ suggests choosing $\alpha=\frac{\nu^{\star}(\Theta)}{\operatorname{vol}(\Theta)}$ .

To obtain an implementable algorithm, we then discretize $\rho$ and consider an initialization $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ which is close to $\rho$ in the $W_{\infty}$ distance (our statements do not require $\nu_{0}$ to be discrete but this is necessary to obtain an implementable algorithm). We now state our main theorem.

Theorem 4.1 (Global convergence of gradient flow).

Under (A1-5), let $J_{0}$ and $\kappa_{0}$ be given by Theorem 3.3, let $\rho=\rho\!\operatorname{vol}\in\mathcal{M}_{+}(\Theta)$ be an absolutely continuous reference measure with $\log\rho$ $L$ -Lipschitz and let $B_{\nu_{0}}=\sup_{J(\nu)\leq J(\nu_{0})}\|J^{\prime}_{\nu}\|_{\mathrm{BL}}$ , where $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ is the initialization. For any $0<\epsilon\leq 1/2$ , there exists $C_{\epsilon}>0$ that only depends on $\epsilon$ and bounds on the curvature of $\Theta$ such that if it holds $\beta/\alpha\leq(4B_{\nu_{0}}/\max\{1,L\})^{2}$ ,

[TABLE]

then the projected gradient flow $(\nu_{t})_{t\geq 0}$ initialized with $\nu_{0}$ converges to the global minimizer $\nu^{\star}$ . Denoting $t_{0}=1/\sqrt{\alpha\beta}$ it satisfies, for $t\geq t_{0}$ ,

[TABLE]

We also state a similar result for gradient descent, but without tracking the constants. The proof follows the same lines as that of Theorem 4.1 and is given in Appendix F.

Theorem 4.2 (Global convergence of gradient descent).

Under (A1-5), let $J_{0}$ and $\kappa_{0}$ be given by Theorem 3.3 and $\rho=\rho\!\operatorname{vol}\in\mathcal{M}_{+}(\Theta)$ be an absolutely continuous reference measure with $\log\rho$ Lipschitz. For any $0<\epsilon\leq 1/2$ and $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ , there exist $C,C^{\prime}>0$ that depend on the characteristics of the problem and increasingly on $\bar{\mathcal{H}}(\nu^{\star},\nu_{0})$ and $1/\epsilon$ , such that if

[TABLE]

then the projected gradient descent $(\nu_{k})_{k\in\mathbb{N}}$ initialized with $\nu_{0}$ converges to the global optimum $\nu^{\star}$ . Denoting $k_{0}=C/(J_{0}-J^{*})^{2+\epsilon}$ it satisfies, for $k\geq k_{0}$ ,

[TABLE]

We can make the following comments:

– The non-asymptotic convergence rate does not appear explicitly in Theorem 4.1, because the result is obtained by trading off various error terms. In the idealized setting where $\nu_{0}=\rho$ and $\beta=0$ , a direct consequence of Lemma 4.3 and Lemma E.1 is that $J(\nu_{t})-J^{\star}$ decreases as $O(\log(t)/t)$ for the gradient flow and as $O(\log(k)/\sqrt{k})$ for the gradient descent in general. For the specific case of the mirror retraction, we show in Appendix G that a faster rate in $O(\log(k)/k)$ holds.

– The condition on the initialization can be achieved by taking $\nu_{0}$ a weighted empirical distribution of $m$ samples from $\rho$ (typically the normalized volume measure), and it is known that the rate of convergence in $W_{\infty}$ of such an approximation is $\tilde{O}(m^{-1/d})$ , see [57]. Unfortunately, this exponential dependence in the dimension is unavoidable when approximating densities in Wasserstein distances [59]. This corresponds to a quantitative version of the condition on the initialization in Theorem 2.2. Also, note that $J_{0}-J^{\star}$ gets smaller as the problem becomes more difficult, in which case the overparameterization $m$ must increase, and the convergence speed slows down. In particular, the necessary condition $m\geq m^{\star}$ is implied by our assumptions.

– The fact that the sublevel $J_{0}$ from Theorem 3.3 does not depend on the metric parameters $(\alpha,\beta)$ is crucial to prove these theorems. However, the local exponential rate of convergence in Theorem 4.2 may be deceptively bad if $\beta/\alpha$ is extremely small. A natural fix is to start with a small ratio $\beta/\alpha$ as required by Theorem 4.2, and to increase this ratio at each iteration so as to improve the conditioning of $J$ near optimality. The interest of Theorem 4.2 lies mostly in the qualitative insights it brings. In practice, we would advise choosing $W_{\infty}(\nu_{0},\rho)$ , $\alpha$ and $\beta$ via heuristics or parameter search rather than trying to derive the constants of Theorem 4.2, which could be deceptively conservative.
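To get a feel for the $\tilde{O}(m^{-1/d})$ rate mentioned in the comments above, one can invert it and count how many particles are needed for a given $W_{\infty}$ accuracy. The snippet is only an order-of-magnitude sketch: constants and logarithmic factors are ignored, and `particles_needed` and `delta` are names introduced here.

```python
import math

def particles_needed(delta, d):
    """Smallest integer m with m**(-1/d) <= delta (constants and log factors ignored)."""
    return math.ceil((1.0 / delta) ** d)

# Particle counts for a fixed target accuracy delta = 0.25 in growing dimension d.
counts = {d: particles_needed(0.25, d) for d in (1, 2, 5, 10)}
```

The count is multiplied by $1/\delta$ each time the dimension increases by one, which is the exponential dependence in $d$ discussed above.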

4.2 Proof of global convergence for gradient flows

The proof of Theorem 4.1 mostly relies on the following general lemma, which applies to any type of initialization or any structure of minimizers. It gives an upper bound on the optimality gap along gradient flows in terms of a mirror rate function $\mathcal{Q}_{\nu_{0},\nu^{\star}}:\mathbb{R}^{*}_{+}\to\mathbb{R}_{+}$ defined for $\nu^{\star},\nu_{0}\in\mathcal{M}_{+}(\Theta)$ and $\tau>0$ as

[TABLE]

This is a continuous and decreasing function of $\tau$ that satisfies

[TABLE]

which is [math] if and only if $\operatorname{spt}(\nu^{\star})\subset\operatorname{spt}(\nu_{0})$ . When $\beta=0$ , this function directly controls the rate of convergence of this mirror descent dynamics hence the name mirror rate function.

Lemma 4.3.

Assume ${\sf(A1-3)}$ and that $J$ admits a minimizer $\nu^{\star}\in\mathcal{M}_{+}(\Theta)$ . Then for all $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ , denoting $B_{\nu_{0}}\coloneqq\sup_{J(\nu)\leq J(\nu_{0})}\|J^{\prime}_{\nu}\|_{\mathrm{BL}}$ , it holds for $t\geq 0$ ,

[TABLE]

A direct consequence of this lemma is that $\lim_{t\to\infty}J(\nu_{t})-J^{\star}$ is guaranteed to be small as $\beta$ gets smaller and as $\operatorname{spt}\nu_{0}$ gets closer to $\operatorname{spt}\nu^{\star}$ . In Appendix E we give an upper bound on $\mathcal{Q}$ for the situation of interest here, leading to explicit convergence rates when combined with Lemma 4.3.

Proof.

Let $\nu^{\epsilon}_{0}\in\mathcal{M}_{+}(\Theta)$ be a measure to be specified later that satisfies $\mathcal{H}(\nu^{\epsilon}_{0},\nu_{0})<+\infty$ , and let $\nu^{\epsilon}_{t}$ satisfy $\partial_{t}\nu^{\epsilon}_{t}=\mathrm{div}(\beta\nu^{\epsilon}_{t}\nabla J^{\prime}_{\nu_{t}})$ for $t\geq 0$ weakly (this is a continuity equation with a smooth velocity field which admits a unique weak solution). Differentiating the relative entropy with respect to its second argument and using the invariance of the relative entropy under diffeomorphisms, it holds, for $t\geq 0$ ,

[TABLE]

where the first term comes from the convexity of $J$ and the second from the definition of $\|\cdot\|_{\mathrm{BL}}^{*}$ . After integrating in time and rearranging the terms we get

[TABLE]

For the last integral term, we use the triangle inequality

[TABLE]

where the last term is obtained by bounding the integrated flow of the velocity field $(\nabla J^{\prime}_{\nu_{t}})_{t\geq 0}$ . Since $\mathcal{H}(\nu_{\epsilon},\nu_{t})\geq 0$ and $J(\nu_{s})$ is decreasing, it follows that

[TABLE]

Proof of Theorem 4.1 (gradient flow).

By Lemma E.1, we have for $\tau\geq L=\mathrm{Lip}(\log\rho)$ , by writing $\bar{\mathcal{H}}\coloneqq\bar{\mathcal{H}}(\nu^{\star},\rho\operatorname{vol})$ ,

[TABLE]

Combining this bound with Lemma 4.3, we get that for $t\geq L/(4\alpha B_{\nu_{0}})$ ,

[TABLE]

In particular, for $t=(\alpha\beta)^{-\frac{1}{2}}$ , we get

[TABLE]

Since this is valid only when $t\geq L/(4\alpha B_{\nu_{0}})$ , we require $(\alpha\beta)^{-\frac{1}{2}}\geq L/(4\alpha B_{\nu_{0}})$ which leads to the first condition on $\beta/\alpha$ . Now, we want the right-hand side of (16) to be smaller than $\Delta_{0}\coloneqq J_{0}-J^{\star}$ so that we can conclude with Corollary 3.4. To this end, we require, on the one hand $W_{\infty}(\nu_{0},\rho\operatorname{vol})\leq\Delta_{0}/(2B_{\nu_{0}}\nu^{\star}(\Theta))$ . On the other hand, we use the bound $\log(u)\leq C_{\epsilon}u^{\epsilon}$ for $\epsilon\in{]0,1/2]}$ , require $4B_{\nu_{0}}\sqrt{\alpha/\beta}\geq 1$ and obtain the condition

[TABLE]

This leads to the second condition on $\beta/\alpha$ in the theorem. ∎

4.3 Fully non-convex gradient descent

The results in the previous section require setting $\beta/\alpha$ to a small initial value. This might appear undesirable because the asymptotic convergence result of Theorem 2.2 holds irrespective of the choice of $\beta/\alpha$ . Also, in practice, this condition does not seem to be required, at least in the examples that we have considered (see Section 5). While the proof technique from Section 4.2 fails without controlling $\beta/\alpha$ , the question of whether it is possible to obtain convergence rates for any ratio $\beta/\alpha$ is a natural one.

For such a result, the key challenge is to obtain a convergence rate for the gradient flow dynamics (9) when initialized with a positive density, without conditions on $(\alpha,\beta)$ . While we were not able to prove such a result, in order to point out the theoretical difficulty, we show in Appendix H, with a proof technique inspired by [60], that a convergence rate in objective value in $O(1/\sqrt{\eta t})$ holds as long as the density $\nu_{t}$ is lower bounded by some $\eta>0$ (at least on a certain subset of $\Theta$ ).

Proposition 4.4.

Under (A1-3), for any $J_{\max}\geq J^{\star}$ , there exists $C>0$ such that for any $\eta,t>0$ and $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ satisfying $J(\nu_{0})\leq J_{\max}$ , if the projected gradient flow (9) satisfies for $0\leq s\leq t$ ,

[TABLE]

where $S_{t}=\{\theta\in\Theta\;;\;J^{\prime}_{\nu_{s}}(\theta)\leq 0\text{ for some }s\in{[0,t]}\}$ , then $J(\nu_{t})-J^{\star}\leq\frac{C}{\sqrt{\alpha\eta t}}.$

Unfortunately, this result is not sufficient to obtain a convergence rate because the lower bound on the density may decrease too fast. When this happens, the gradient flow may stagnate for an a priori unbounded time in neighborhoods of saddle points, although it is guaranteed to eventually escape by Lemma C.1. Note that the result above does not require $\mathcal{F}$ to be finite dimensional nor $\lambda=0$ , while this would be needed for a proof based on the positive definiteness of the tangent kernel [36].

5 Numerical experiments

All experiments can be reproduced with the Julia code available online (https://github.com/lchizat/2019-sparse-optim-measures). Our goal here is not to demonstrate the superiority of Algorithm 1 over other algorithms, but rather to illustrate the insights obtained by the analysis. We consider the following problems introduced in Section 1.1:

–

(Sparse deconvolution) We consider the Dirichlet low-pass filter of order $n_{f}\in\mathbb{N}_{*}$ on the $d$ -torus with values in $L^{2}(\mathbb{T}^{d})$ i.e. $\phi(\theta):x\mapsto\sum_{k=-n_{f}}^{n_{f}}\exp(k\sqrt{-1}(x-\theta))$ when $d=1$ . We use the square-loss and solve problem (1) with conic particle gradient descent (Algorithm 1) with the “mirror retraction” from Section 2.3.

–

(Two-layer neural net) We consider the function $\phi(w):x\mapsto\max\left\{\sum_{j=1}^{d}x_{j}w_{j}\cdot|w_{j}|,0\right\}$ which is $2$ -homogeneous on $\mathbb{R}^{d+1}$ with $d+1=20$ . We use the square loss and solve problem (1) with stochastic gradient descent with a small fixed step-size for an input data distribution uniform on the sphere $\mathbb{S}^{d}$ . This corresponds to a stochastic version of Algorithm 1 with the “induced retraction” from Section 2.3. For our purposes, the advantage of this architecture over classical ReLU neural networks (as presented in Section 1.1) is that here $\phi$ is differentiable on $\mathbb{S}^{d}$ (see, e.g. [17, Lem. D.5]).
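The 2-homogeneity of the unit in the second item can be checked numerically. The snippet below is a minimal sketch under assumptions: the helper names `phi` and `sample_sphere` are hypothetical, the sum is taken over all coordinates of the input vector, and inputs are drawn uniformly on the unit sphere by normalizing Gaussian vectors.

```python
import numpy as np

def phi(w, x):
    """Unit from the text: max(sum_j x_j * w_j * |w_j|, 0). Names are illustrative."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    return max(float(np.dot(x, w * np.abs(w))), 0.0)

def sample_sphere(n, dim, rng):
    """Draw n points uniformly on the unit sphere of R^dim (normalized Gaussians)."""
    g = rng.standard_normal((n, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
dim = 20                      # ambient dimension d + 1 = 20, as in the text
w = rng.standard_normal(dim)  # one particle (hidden unit)
x = sample_sphere(1, dim, rng)[0]

c = 3.0
lhs = phi(c * w, x)           # value after rescaling the parameter by c > 0
rhs = c**2 * phi(w, x)        # 2-homogeneity predicts a factor c^2
```

Since $w_{j}|w_{j}|$ scales as $c^{2}$ when $w$ is multiplied by $c>0$ , the two values `lhs` and `rhs` agree up to rounding.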

We focus in both cases on the “teacher-student” setting without noise with the square loss, because it guarantees that even the unregularized problem ( $\lambda=0$ ) has sparse solutions, in spite of $\mathcal{F}$ being infinite dimensional. We thus have $R(f)=\frac{1}{2}\|f-f^{\star}\|^{2}_{\mathcal{F}}$ where $f^{\star}=\sum_{i=1}^{m_{0}}r_{i}^{2}\phi(\theta_{i})$ and $m_{0}\in\mathbb{N}^{*}$ is the number of atoms for the teacher.
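For the sparse deconvolution case with $d=1$ , the teacher-student objective can be evaluated in Fourier space. The following is a minimal sketch under assumptions that are not taken from the paper's code: the function names are hypothetical, the $L^{2}(\mathbb{T})$ norm is taken with the normalized Lebesgue measure (so that Parseval's identity gives a plain sum over frequencies), and the student measure is $\sum_{i}r_{i}^{2}\delta_{\theta_{i}}$ without a $1/m$ factor.

```python
import numpy as np

def fourier_features(theta, n_f):
    """Fourier coefficients exp(-i k theta), k = -n_f..n_f, of the Dirichlet filter.

    Returns an array of shape (2 * n_f + 1, len(theta)).
    """
    k = np.arange(-n_f, n_f + 1)[:, None]
    return np.exp(-1j * k * np.atleast_1d(theta)[None, :])

def objective(r, theta, r_star, theta_star, n_f, lam):
    """J = 0.5 * ||f_nu - f_star||^2 + lam * sum(r^2), computed on Fourier coefficients."""
    r = np.asarray(r, dtype=float)
    r_star = np.asarray(r_star, dtype=float)
    f_nu = fourier_features(theta, n_f) @ (r ** 2)              # student signal
    f_star = fourier_features(theta_star, n_f) @ (r_star ** 2)  # teacher signal
    return 0.5 * np.sum(np.abs(f_nu - f_star) ** 2) + lam * np.sum(r ** 2)

# Teacher with m_0 = 3 atoms (illustrative values).
theta_star = np.array([1.0, 2.5, 4.0])
r_star = np.array([1.0, 0.8, 1.2])

J_at_teacher = objective(r_star, theta_star, r_star, theta_star, n_f=4, lam=0.0)
J_perturbed = objective(r_star, theta_star + 0.1, r_star, theta_star, n_f=4, lam=0.0)
```

With $\lambda=0$ the objective vanishes at the teacher, which is the sparse solution of the unregularized problem mentioned above, and it is positive after shifting the positions.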

Local convergence rate.

We observe on Figure 3 the effect of the regularization parameter $\lambda$ and of the over-parameterization parameter $m$ on the local convergence rates (in $\widehat{W}_{2}$ distance – approximated by mapping each particle to its final position/mass – or in optimality gap). In accordance with the expansion of Proposition 3.7, we observe exponential convergence whenever $\lambda>0$ , with a rate that improves as $\lambda$ increases. For sparse deconvolution, we observe fast exponential convergence when $m=m_{0}=3$ , which is explained by only the first term in the local expansion (13) being non-zero. By adding just a single particle, the second term comes into play and the behavior is qualitatively similar to that with $20$ particles. For Figure 3-(c), the initialization is random and $m_{0}=5$ . Here the behavior for $m=m_{0}$ follows that of $m>m_{0}$ , which suggests that the first term in the local expansion of Eq. (13) dominates.
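The approximation of the $\widehat{W}_{2}$ distance by mapping each particle to its final position/mass can be sketched as follows for the 1D torus. This is only an illustration under assumptions: the function names are hypothetical, the squared cone distance is taken in the standard form $r_{1}^{2}+r_{2}^{2}-2r_{1}r_{2}\cos(\min\{\mathrm{dist}(\theta_{1},\theta_{2}),\pi\})$ , and particles are weighted uniformly by $1/m$ . Because it uses one particular coupling, the result is a proxy that upper bounds the distance between the corresponding particle measures rather than the exact value.

```python
import numpy as np

def cone_dist_sq(r1, theta1, r2, theta2):
    """Squared cone distance between (r1, theta1) and (r2, theta2) on the 1D torus.

    Assumes the standard cone metric; angles are compared modulo 2*pi.
    """
    d = np.abs((theta1 - theta2 + np.pi) % (2 * np.pi) - np.pi)
    return r1**2 + r2**2 - 2 * r1 * r2 * np.cos(np.minimum(d, np.pi))

def w2_hat_to_final(r_traj, theta_traj):
    """Proxy distance to the final iterate, one value per iteration.

    r_traj, theta_traj: arrays of shape (n_iterations, m), one column per particle.
    Each particle is matched with its own final state (r, theta).
    """
    r_traj = np.asarray(r_traj, dtype=float)
    theta_traj = np.asarray(theta_traj, dtype=float)
    sq = cone_dist_sq(r_traj, theta_traj, r_traj[-1], theta_traj[-1])
    return np.sqrt(np.mean(sq, axis=1))

# A single particle whose mass vanishes: the proxy equals its initial radius.
curve = w2_hat_to_final([[2.0], [0.0]], [[0.3], [1.0]])
```

Plotting such a curve against the iteration number on a logarithmic scale is one way to read off the exponential rates discussed above.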

Global convergence.

We observe on Figure 4 the effect on the success/failure of optimization of the two main parameters that appear in Theorem 4.1: the over-parameterization parameter $m$ (used to decrease the $W_{\infty}$ criterion) and the ratio of the vertical/spatial step-sizes $\beta/\alpha$ . In both (a) and (b) we have $m_{0}=5$ and $\lambda>0$ , and the final loss is averaged over $5$ random experiments. Unsurprisingly, minimizers cannot be reached when $m$ is too small. It is also observed that increasing $m$ increases the chances of success even when $m\geq m_{0}$ . In contrast, these experiments do not reveal a clear role for $\beta/\alpha$ , beyond a change in the convergence speed (see Section 4.3).

Comparison of vertical geometries.

Finally, we compare on Figure 5 the behavior of mirror descent against that of Euclidean descent (here integrated with ISTA algorithm [22]). This corresponds respectively to $h(r)=r^{2}$ and $h(r)=r$ in Eq. 2 and $\beta=0$ . We consider the problem of recovering a single spike ( $m_{0}=1$ ) for 1D and 2D sparse deconvolution, starting from the uniform measure on $\Theta$ densely sampled on a grid ( $m=100$ ). We report the behavior in early stages of optimization, before the effect of the discretization comes into play. We observe that mirror descent outperforms Euclidean descent and enjoys a convergence rate of order $\sim 1/k$ around iteration number $k=100$ . This is in accordance with the result of Appendix G, where we show a convergence rate for mirror descent with continuous densities in $O(\log(k)/k)$ , independent of the dimension. The difference in behavior is illustrated on Figure 5-(c) where we plot $\nu_{1000}$ (in the setting of panel (a)).
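The two vertical geometries can be contrasted on a small 1D instance with fixed grid positions ( $\beta=0$ ). The code below is a sketch under assumptions and is not the paper's Julia implementation: "mirror descent" is implemented as the standard multiplicative (entropic) update on the grid weights, "Euclidean descent" as an ISTA step for nonnegative weights, and the step-sizes, $n_{f}=2$ , $\lambda=0.01$ and the spike location are illustrative choices.

```python
import numpy as np

def run(update, n_iter=200, m=100, n_f=2, lam=0.01, theta_star=1.0):
    """Recover a single unit spike (m_0 = 1) with weights on a fixed grid of size m."""
    grid = 2 * np.pi * np.arange(m) / m
    k = np.arange(-n_f, n_f + 1)[:, None]
    A = np.exp(-1j * k * grid[None, :])       # Fourier features of the grid points
    y = np.exp(-1j * k[:, 0] * theta_star)    # observations generated by the teacher
    nu = np.full(m, 1.0 / m)                  # uniform initial measure, total mass 1
    J = lambda v: 0.5 * np.sum(np.abs(A @ v - y) ** 2) + lam * v.sum()
    values = [J(nu)]
    for _ in range(n_iter):
        # First variation J'_nu evaluated on the grid (gradient w.r.t. the weights).
        grad = np.real(A.conj().T @ (A @ nu - y)) + lam
        nu = update(nu, grad)
        values.append(J(nu))
    return nu, np.array(values)

def mirror_step(nu, grad, eta=0.05):
    """Multiplicative update: weights stay positive automatically."""
    return nu * np.exp(-eta * grad)

def ista_step(nu, grad, eta=0.01):
    """Gradient step followed by thresholding at zero (ISTA for nonnegative weights)."""
    return np.maximum(nu - eta * grad, 0.0)

nu_md, J_md = run(mirror_step)
nu_ista, J_ista = run(ista_step)
```

Comparing `J_md` and `J_ista` over the first iterations gives a small-scale analogue of the comparison described above; the sketch makes no claim about reproducing the reported rates.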

6 Conclusion

In this paper, we have studied particle gradient descent for sparse convex optimization on measures and obtained complexity guarantees under non-degeneracy assumptions. One central idea underlying our analysis is to directly study the iterates in Wasserstein space. We believe that this approach, at the crossroads between analysis and optimization, may lead to other insights for over-parameterized and non-convex gradient descent.

An avenue for future research is to study the unregularized case. This may require exploiting finer properties of the problem than mere smoothness and could improve our understanding of the implicit bias of over-parameterized gradient descent. Another important question is to find theoretical explanations for the favorable behavior observed in high dimensions for two-layer neural network optimization.

Acknowledgments

The author thanks Francis Bach for fruitful discussions related to this work and the anonymous referees for their thorough reading and suggestions.

Appendix A Dealing with signed measures

Let us show that problems over signed measures with total variation regularization are covered by problem (1), after a suitable reformulation. Consider a function $\tilde{\phi}:\tilde{\Theta}\to\mathcal{F}$ and the functional on signed measures $\tilde{J}:\mathcal{M}(\tilde{\Theta})\to\mathbb{R}$ defined as

[TABLE]

where $|\mu|(\tilde{\Theta})$ is the total variation of $\mu$ . This is a continuous version of the LASSO problem, known as BLASSO [23]. Define $\Theta$ as the disjoint union of two copies $\tilde{\Theta}_{+}$ and $\tilde{\Theta}_{-}$ of $\tilde{\Theta}$ and define the symmetrized function $\phi:\Theta\to\mathcal{F}$ as

[TABLE]

With this choice of $\phi$ , minimizing (17) or minimizing (1) are equivalent, in a sense made precise in Proposition A.1. This symmetrization procedure, also suggested in [17], is simple to implement in practice: in Algorithm 1, we fix at initialization the sign attributed to each particle — depending on whether it belongs to $\tilde{\Theta}_{+}$ or $\tilde{\Theta}_{-}$ — and do not change it throughout the iterations.

Proposition A.1.

The infima of (17) and (1) are the same and:

(i)

if $\tilde{\mu}$ is a minimizer of $\tilde{J}$ and $\tilde{\mu}=\tilde{\mu}_{+}-\tilde{\mu}_{-}$ is its Jordan decomposition, then the measure whose restriction to $\tilde{\Theta}_{+}$ (resp. $\tilde{\Theta}_{-}$ ) coincides with $\tilde{\mu}_{+}$ (resp. $\tilde{\mu}_{-}$ ) is a minimizer of $J$ ;

(ii)

reciprocally, if $\mu$ is a minimizer of $J$ then $\mu_{+}-\mu_{-}$ where $\mu_{+}$ (resp. $\mu_{-}$ ) is the restriction of $\mu$ to $\tilde{\Theta}_{+}$ (resp. $\tilde{\Theta}_{-}$ ) is a minimizer of $\tilde{J}$ .

Proof.

We recall that for any decomposition of a signed measure as the difference of nonnegative measures $\tilde{\mu}=\tilde{\mu}_{+}-\tilde{\mu}_{-}$ , it holds $|\tilde{\mu}|(\Theta)\leq\tilde{\mu}_{+}(\Theta)+\tilde{\mu}_{-}(\Theta)$ , with equality if and only if $(\tilde{\mu}_{+},\tilde{\mu}_{-})$ is the Jordan decomposition of $\tilde{\mu}$ [21, Sec. 4.1]. It follows that starting from any $\tilde{\mu}\in\mathcal{M}(\tilde{\Theta})$ , the construction in (i) yields a measure $\mu\in\mathcal{M}_{+}(\Theta)$ satisfying $\tilde{J}(\tilde{\mu})=J(\mu)$ . Also, starting from any $\mu\in\mathcal{M}_{+}(\Theta)$ , the construction in (ii) yields a measure $\tilde{\mu}\in\mathcal{M}(\tilde{\Theta})$ satisfying $\tilde{J}(\tilde{\mu})\leq J(\mu)$ , with equality if and only if $(\mu_{+},\mu_{-})$ is a Jordan decomposition. The conclusion follows. ∎
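The inequality between the total variation and the mass of a decomposition can be illustrated on discrete signed measures. The snippet is a minimal sketch with hypothetical helper names: atoms are given as (position, signed mass) pairs, and the "lifted" mass is the total mass of the nonnegative measure on the two copies obtained by keeping each particle's sign fixed.

```python
from collections import defaultdict

def total_variation(positions, signed_masses):
    """|mu|(Theta) for a discrete signed measure: merge atoms at equal positions first."""
    net = defaultdict(float)
    for theta, mass in zip(positions, signed_masses):
        net[theta] += mass
    return sum(abs(v) for v in net.values())

def lifted_mass(signed_masses):
    """mu_+(Theta) + mu_-(Theta) for the decomposition given by the particles' signs."""
    return sum(abs(mass) for mass in signed_masses)

# Two atoms of opposite sign share the position 0.1: not a Jordan decomposition.
positions = [0.1, 0.1, 0.5]
signed = [1.0, -0.4, 2.0]
tv = total_variation(positions, signed)   # |1.0 - 0.4| + |2.0|
lift = lifted_mass(signed)                # 1.0 + 0.4 + 2.0

# Disjoint supports: the decomposition is the Jordan one and both quantities agree.
tv_disjoint = total_variation([0.1, 0.5], [1.0, -0.4])
lift_disjoint = lifted_mass([1.0, -0.4])
```

In the first case the lifted mass strictly exceeds the total variation, so the lifted measure pays a larger penalty than needed, which is why minimizers correspond to Jordan decompositions.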

Appendix B Generic non-convex minimization

In this section, we show that any smooth optimization problem on a manifold is equivalent to solving a problem of the form (1). This corresponds to the case of a scalar-valued $\phi$ .

Proposition B.1.

Let $\phi:\Theta\to\mathbb{R}$ be a smooth function with minimum $\phi^{\star}<0$ that admits a global minimizer, and let

[TABLE]

where $0<\lambda<-2\phi^{\star}$ . Then $\emptyset\neq\operatorname{spt}\nu^{\star}\subset\arg\min\phi$ so minimizers of $\phi$ can be built from $\nu^{\star}$ . Reciprocally, from a minimizer of $\phi$ , one can build a minimizer for (18).

Proof.

For a measure $\nu\in\mathcal{M}_{+}(\Theta)$ , we define $f_{\nu}\coloneqq\int_{\Theta}\phi(\theta)\mathrm{d}\nu(\theta)\in\mathbb{R}$ . It holds

[TABLE]

Now suppose that $\nu$ is a global minimizer of $J$ . Then the optimality condition in Proposition 3.1 implies that

[TABLE]

Solving for $f_{\nu}$ is possible if $\lambda\nu(\Theta)<1$ and leads to $f_{\nu}=\sqrt{1-\lambda\nu(\Theta)}-1$ . We also deduce from the fact that $f_{\nu}>-1$ that $\arg\min J^{\prime}_{\nu}=\arg\min\phi$ , and so $\operatorname{spt}\nu\subset\arg\min\phi$ . It remains to find under which condition $\nu(\Theta)>0$ . We use the fact that $f_{\nu}=\phi^{\star}\nu(\Theta)$ in Equation (19), and get

[TABLE]

which in particular satisfies $\lambda\nu(\Theta)<1$ . Thus, as long as $-2\phi^{\star}>\lambda$ , we have $\nu(\Theta)>0$ . Finally, we verify that global minimizers exist, so that the above reasoning makes sense. If $-2\phi^{\star}-\lambda\leq 0$ , then $\nu=0$ satisfies the global optimality conditions. Otherwise, choose $\theta^{\star}$ a minimizer of $\phi$ and define $\nu=\nu(\Theta)\delta_{\theta^{\star}}$ with the value above for $\nu(\Theta)$ , which also satisfies the global optimality conditions. ∎

Appendix C Wasserstein gradient flow

In this section, we recall and adapt some results and proofs from [17], for the sake of completeness.

C.1 Existence

For this result, we assume (A1-2). For a compactly supported initial condition $\mu_{0}\in\mathcal{P}_{2}(\Omega)$ , the proof of existence for Wasserstein gradient flows (Eq. (7)) in [17] goes through, as it is simply based on a compactness argument which can be directly translated to this Riemannian setting (more precisely, we apply here the Arzelà-Ascoli compactness criterion for curves in the Wasserstein space on the cone of $\Theta$ , which is a complete metric space [42]). Note that these arguments do not require convexity of $R$ , but in order to guarantee global existence in time, we need to assume that $\nabla R$ is bounded in sub-level sets of $F$ .

For the existence of solutions for projected dynamics on $\Theta$ for any $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ , consider a measure $\mu_{0}\in\mathcal{M}_{+}(\Omega)$ such that $\mathsf{h}\mu_{0}=\nu_{0}$ (see [42] for such a construction) and the corresponding Wasserstein gradient flow $(\mu_{t})_{t\geq 0}$ for $F$ . Then $\mathsf{h}\mu_{t}$ is a solution to (9).

For the existence of Wasserstein gradient flows (Eq. (7)) for $F$ when $\mu_{0}$ is not compactly supported, proceed as follows: there exists a Wasserstein-Fisher-Rao gradient flow $\nu_{t}$ satisfying $\nu_{0}=\mathsf{h}\mu_{0}$ . Now we can simply define $\mu_{t}$ as the solution to $\partial_{t}\mu_{t}=\mathrm{div}(\mu_{t}J^{\prime}_{\nu_{t}})$ . It can be directly checked that $\mathsf{h}\mu_{t}=\nu_{t}$ for $t\geq 0$ and thus $\mu_{t}$ is a solution to Eq. (7).

We do not attempt to show uniqueness in the present work. Note that it is proved in [17] for the case where $\Theta$ is a sphere, by applying the theory developed in [3].

C.2 Asymptotic global convergence

In this section, we give a short proof of Theorem 2.2, adapted from [17]. The next lemma is the crux of the global convergence proof. It gives a criterion to escape from the neighborhood of measures which are not minimizers.

Lemma C.1 (Criterion to escape local minima).

Under (A1-3), let $\nu\in\mathcal{M}_{+}(\Theta)$ be such that $v^{\star}\coloneqq\min_{\theta\in\Theta}J^{\prime}_{\nu}(\theta)<0$ . Then there exists $v\in[2v^{\star}/3,v^{\star}/3]$ and $\epsilon>0$ such that if $(\nu_{t})_{t\geq 0}$ is a projected gradient flow of $J$ satisfying $\|\nu-\nu_{t_{0}}\|_{\mathrm{BL}}^{*}<\epsilon$ for some $t_{0}\geq 0$ and $\nu_{t_{0}}((J^{\prime}_{\nu})^{-1}(]-\infty,v]))>0$ then there exists $t_{1}>t_{0}$ such that $\|\nu-\nu_{t_{1}}\|_{\mathrm{BL}}^{*}\geq\epsilon$ .

Proof.

We first assume that $J^{\prime}_{\nu}$ takes nonnegative values and let $v\in[2v^{\star}/3,v^{\star}/3]$ be a regular value of $J^{\prime}_{\nu}$ , i.e. such that $\|\nabla J^{\prime}_{\nu}\|$ does not vanish on the $v$ level-set of $J^{\prime}_{\nu}$ . Such a $v$ is guaranteed to exist thanks to Morse-Sard’s lemma and our assumption that $\phi$ is $d$ -times continuously differentiable, which implies the same regularity for $J^{\prime}_{\nu}$ . Let $K_{v}=(J^{\prime}_{\nu})^{-1}(]-\infty,v])\subset\Theta$ be the corresponding sublevel set. By the regular value theorem, the boundary $\partial K_{v}$ of $K_{v}$ is a differentiable orientable compact submanifold of $\Theta$ and is orthogonal to $\nabla J^{\prime}_{\nu}$ . By construction, it holds for all $\theta\in K_{v}$ , $J^{\prime}_{\nu}(\theta)\leq v^{\star}/3$ and, for some $u>0$ , by the regular value property, $\nabla J^{\prime}_{\nu}(\theta)\cdot\vec{n}_{\theta}>u$ for all $\theta\in\partial K_{v}$ where $\vec{n}_{\theta}$ is the unit normal vector to $\partial K_{v}$ pointing outwards. Since the map $\nu\mapsto J^{\prime}_{\nu}$ is locally Lipschitz as a map $(\mathcal{M}_{+}(\Theta),\|\cdot\|_{\mathrm{BL}}^{*})\to(\mathcal{C}^{1}(\Theta),\|\cdot\|_{\mathrm{BL}})$ , there exists $\epsilon>0$ such that if $\nu_{t}\in\mathcal{M}_{+}(\Theta)$ satisfies $\|\nu_{t}-\nu\|_{\mathrm{BL}}^{*}<\epsilon$ , then

[TABLE]

Now let us consider a projected gradient flow $(\nu_{t})_{t\geq 0}$ such that $\|\nu_{0}-\nu\|_{\mathrm{BL}}^{*}<\epsilon$ and let $t_{1}>0$ be the first time such that $\|\nu_{t_{1}}-\nu\|_{\mathrm{BL}}^{*}\geq\epsilon$ , which might a priori be infinite. For $t\in{[t_{0},t_{1}[}$ , it holds

[TABLE]

where the first inequality can be seen by using the “characteristic” representation of solutions to (9), see [44]. It follows by Grönwall’s lemma that $\nu_{t}(K_{v})\geq\exp(\alpha v^{\star}t)\nu_{0}(K_{v})$ which implies that $t_{1}$ is finite. Finally, if we had not assumed that [math] is in the range of $J^{\prime}_{\nu}$ in the first place, then we could simply take $K=\Theta$ and conclude by similar arguments. ∎

Proof of Theorem 2.2.

Let $\nu_{\infty}\in\mathcal{M}(\Theta)$ be the weak limit of $(\nu_{t})_{t}$ . It satisfies the stationary point condition $\int|J^{\prime}_{\nu_{\infty}}|^{2}\mathrm{d}\nu_{\infty}=0$ . Then by the optimality conditions in Proposition 3.1, either $\nu_{\infty}$ is a minimizer of $J$ , or $J^{\prime}_{\nu_{\infty}}$ is not nonnegative. For the sake of contradiction, assume the latter. Let $\epsilon$ be given by Lemma C.1 and let $t_{0}=\sup\{t\geq 0\;;\;\|\nu_{t}-\nu_{\infty}\|_{\mathrm{BL}}^{*}\geq\epsilon\}$ which is finite since we have assumed that $\nu_{t}$ weakly converges to $\nu_{\infty}$ . But $\nu_{t_{0}}$ has full support since it can be written as the pushforward of a rescaled version of $\nu_{0}$ by a diffeomorphism, see [44, Eq. (1.3)] (note that this step is considerably simplified here by the fact that we do not have a potentially non-smooth regularizer, unlike in [17] where topological degree theory comes into play). Then the conclusion of Lemma C.1 contradicts the definition of $t_{0}$ . ∎

Appendix D Proof of the gradient inequality

In this whole section, we consider without loss of generality $\alpha=\beta=1$ (we explain in Section D.7 how to adapt the results to arbitrary $\alpha,\beta$ ). For simplicity, we only track the dependencies in $\nu$ and $\tau$ . Any quantity that is independent of $\nu$ and $\tau$ is treated as a constant and represented by $C,C^{\prime},C^{\prime\prime}>0$ , and the quantity these symbols refer to can change from line to line.

D.1 Bound on the transport distance to minimizers

Given a measure $\nu\in\mathcal{M}_{+}(\Theta)$ , we consider the local centered moments introduced in Definition 3.6 and in addition, for $i\in\{1,\dots,m^{\star}\}$ ,

[TABLE]

Finally, we will quantify errors with the following quantity

[TABLE]

which also controls the $\widehat{W}_{2}$ distance (introduced in Section 3.1) to the minimizer $\nu^{\star}$ of $J$ , as shown in the next lemma.

Lemma D.1.

It holds $\widehat{W}_{2}(\nu,\nu^{\star})\leq W_{\tau}(\nu)(1+O(\tau^{2})+O(W_{\tau}(\nu)^{2}))$ .

Proof.

Note that for $W_{\tau}(\nu)$ small enough, it holds $\nu(\Theta_{i})>0$ for $i\in\{1,\dots,m^{\star}\}$ . Let $\mu\in\mathcal{P}_{2}(\Omega)$ be such that $\mathsf{h}\mu=\nu$ and consider the transport map $T:\Omega\to\Omega$ defined as

[TABLE]

By construction, it holds $\mathsf{h}(T_{\#}\mu)=\nu^{\star}$ . Let us estimate the transport cost associated to this map

[TABLE]

The geodesic distance associated to the cone metric is

[TABLE]

Now, if we only consider points $\theta\in\Theta_{i}$ with $\tilde{\theta}$ their coordinates in a normal frame centered at $\theta_{i}$ (note that in all other proofs, we do not need to distinguish between $\theta$ and $\tilde{\theta}$ ), we have the approximation

[TABLE]

Let us decompose $T(r,\theta)$ as $(rT^{r}(\theta),T^{\theta}(\theta))$ and estimate the two contributions forming $\mathcal{T}$ separately. On the one hand, we have

[TABLE]

On the other hand, we have

[TABLE]

As a consequence, we have $\mathcal{T}=W_{\tau}(\nu)(1+O(W_{\tau}(\nu)^{2})+O(\tau^{2}))$ . Remark that this estimate does not depend on the chosen lifting $\mu$ satisfying $\mathsf{h}\mu=\nu$ . We then conclude by using the characterization in [42, Thm. 7.20] for the distance $\widehat{W}_{2}$ :

[TABLE]

Thus $\widehat{W}_{2}(\nu,\nu^{\star})^{2}\leq W_{2}(\mu,T_{\#}(\mu))^{2}\leq\mathcal{T}$ , and the result follows. ∎

D.2 Local expansion lemma

Lemma D.2 (Expansion around $\nu^{\star}$ ).

Let $\psi$ be any (vector or real-valued) smooth function on $\Theta$ and $\nu\in\mathcal{M}_{+}(\Theta)$ . If $\tau>0$ is an admissible radius, then the following first and second-order expansions hold

[TABLE]

where $M_{k,\psi}(\theta_{i},\theta)$ is the remainder in the $(k-1)$ -th order Taylor expansion of $\psi$ around $\theta_{i}$ in local coordinates (and we recall that $\bar{\nabla}\psi:=(2\psi,\nabla\psi)$ ).

Proof.

By a Taylor expansion of $\psi$ around $\theta_{i}$ for $i\in\{1,\dots,m^{\star}\}$ , it holds

[TABLE]

and subtracting $\int_{\Theta_{i}}\psi\mathrm{d}\nu^{\star}=r_{i}^{2}\psi(\theta_{i})$ yields

[TABLE]

where we have used a bias-variance decomposition for the quadratic term. The result follows by summing the integrals over each $\Theta_{i}$ and using the expression of $b$ . ∎

D.3 Bound on the distance to minimizers

In the next lemma, we globally bound the quantity $W_{\tau}(\nu)$ from Eq. (20) in terms of the function values. It involves the quantity $v^{\star}>0$ which is such that for any local minimum $\theta$ of $J^{\prime}_{\nu^{\star}}$ , either $\theta=\theta_{i}$ for some $i\in\{1,\dots,m^{\star}\}$ or $J^{\prime}_{\nu^{\star}}(\theta)\geq v^{\star}$ (which is non-zero under (A5)). We also recall that $\tilde{b}^{\theta}_{i}=\bar{r}_{i}\delta\theta_{i}$ , as defined in Section D.1.

Lemma D.3 (Global distance bound).

Under (A1-5), let $\tau_{\mathrm{adm}}$ be an admissible radius $\tau$ as in Definition 3.6, fix some $J_{\max}>0$ and let

[TABLE]

Then there exist $C,C^{\prime}>0$ such that for all $\tau\leq\tau_{0}$ and $\nu\in\mathcal{M}_{+}(\Theta)$ such that $J(\nu)\leq J_{\max}$ , it holds

[TABLE]

Proof.

Let us write $f_{\nu}\coloneqq\int\phi\mathrm{d}\nu$ and $f^{\star}=\int\phi\mathrm{d}\nu^{\star}$ . By strong convexity of $R$ at $f^{\star}$ , and optimality of $\nu^{\star}$ , there exists $C>0$ such that for all $\nu\in\mathcal{M}_{+}(\Theta)$ it holds

[TABLE]

To prove the first claim, we thus have to bound $W_{\tau}(\nu)$ using the terms in the right-hand side of (21).

Step 1.

By a Taylor expansion, one has for $\theta\in\Theta_{i}$ for $i\in\{1,\dots,m^{\star}\}$ ,

[TABLE]

Thus, if $\|\theta-\theta_{i}\|\leq 3\sigma_{\min}(H)/(2\mathrm{Lip}(\nabla^{2}J^{\prime}_{\nu^{\star}}))$ , then $J^{\prime}_{\nu^{\star}}(\theta)\geq\frac{1}{4}(\theta-\theta_{i})^{\intercal}H_{i}(\theta-\theta_{i})$ for $\theta\in\Theta_{i}$ . Decomposing the integral of this quadratic term into bias and variance, we get

[TABLE]

and we deduce a first bound by summing the terms for $i\in\{1,\dots,m^{\star}\}$ ,

[TABLE]
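The bias-variance decomposition used in Step 1 is the identity $\int(\theta-\theta_{i})^{\intercal}H(\theta-\theta_{i})\mathrm{d}\nu=\nu(\Theta_{i})\big((\bar{\theta}-\theta_{i})^{\intercal}H(\bar{\theta}-\theta_{i})+\operatorname{tr}(\Sigma H)\big)$ for the mean $\bar{\theta}$ and covariance $\Sigma$ of the normalized restriction of $\nu$ . A minimal numerical sketch for a weighted particle cloud follows; the function and variable names are hypothetical and the points are treated as flat local coordinates.

```python
import numpy as np

def quadratic_moment(weights, thetas, center, H):
    """Direct evaluation of sum_j w_j (theta_j - center)^T H (theta_j - center)."""
    diff = thetas - center
    return float(np.sum(weights * np.einsum("ji,ik,jk->j", diff, H, diff)))

def bias_variance(weights, thetas, center, H):
    """Same quantity written as mass * (bias^T H bias + tr(Sigma H))."""
    mass = weights.sum()
    mean = (weights[:, None] * thetas).sum(axis=0) / mass
    centered = thetas - mean
    # Weighted covariance of the normalized particle cloud.
    Sigma = np.einsum("j,ji,jk->ik", weights, centered, centered) / mass
    bias = mean - center
    return float(mass * (bias @ H @ bias + np.trace(Sigma @ H)))

rng = np.random.default_rng(0)
weights = rng.uniform(0.1, 1.0, size=50)          # particle masses
thetas = 0.1 * rng.standard_normal((50, 2))       # local coordinates near the center
center = np.array([0.02, -0.01])                  # plays the role of theta_i
H = np.array([[2.0, 0.3], [0.3, 1.0]])            # symmetric positive definite matrix

direct = quadratic_moment(weights, thetas, center, H)
split = bias_variance(weights, thetas, center, H)
```

The two evaluations `direct` and `split` coincide up to rounding, since the cross term vanishes once the cloud is centered at its own mean.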

Step 2.

In order to lower bound the integral over $\Theta_{0}$ , we first derive a lower bound for $J^{\prime}_{\nu^{\star}}$ on $\Theta_{0}$ . This is a continuously differentiable and nonnegative function on a closed domain $\Theta_{0}$ so its minimum is attained either at a local minimum in the interior of $\Theta_{0}$ or on its boundary. Using the quadratic lower bound from the previous paragraph, it follows that for $\theta\in\Theta_{0}$ ,

[TABLE]

Thus, if we also assume that $\tau\leq 2\sqrt{v^{\star}/\sigma_{\min}(H)}$ then $J^{\prime}_{\nu^{\star}}(\theta)\geq\tau^{2}\sigma_{\min}(H)/4$ for $\theta\in\Theta_{0}$ and it follows that

[TABLE]

Using inequality (21) we have shown so far that

[TABLE]

Notice that $\tilde{W}_{\tau}(\nu)$ is similar to $W_{\tau}(\nu)$ but it does not contain the terms controlling the deviations of mass $|\bar{r}_{i}-r_{i}|$ . These quantities can be controlled by using the coercivity of $R$ , i.e. the last term in (21), as we do now.

Step 3.

Using the first order expansion of Lemma D.2 then squaring gives

[TABLE]

Since we have assumed that $K$ is positive definite, it follows

[TABLE]

and thus, after rearranging the terms

[TABLE]

It follows that $\|b\|\leq C\|f_{\nu}-f^{\star}\|+C\tilde{W}_{\tau}(\nu)^{2}$ . Also, by inequality (21), if $J(\nu)\leq J_{\max}$ , then $\|f_{\nu}-f^{\star}\|^{2}\leq C(J(\nu)-J^{\star})$ . Moreover, by inequality (22), we get

[TABLE]

We finally combine with the bound on $\tilde{W}_{\tau}(\nu)$ to conclude since $W_{\tau}(\nu)^{2}\leq\tilde{W}_{\tau}(\nu)^{2}+\|b\|^{2}$ . ∎

D.4 Proof of the distance inequality (Proposition 3.2)

By Lemma D.1, it holds

[TABLE]

Moreover, by Lemma D.3, there exist $\tau_{0}>0$ and $C>0$ such that

[TABLE]

Combining these two lemmas, it follows that for some $C^{\prime}>0$ , we have

[TABLE]

This also implies a control on the Bounded-Lipschitz distance since it holds $(\|\nu-\nu^{\star}\|_{\mathrm{BL}}^{*})^{2}\leq(2+\pi^{2}/2)(\nu(\Theta)+\nu^{\star}(\Theta))\widehat{W}_{2}(\nu,\nu^{\star})^{2}$ , see [42, Prop. 7.18].

D.5 Local estimate of the objective

We now prove a local expansion formula for $J$ .

Proposition D.4 (Local expansion).

It holds

[TABLE]

where $\mathop{\mathrm{err}}(\tau,\nu)=O(\tau(\|\tilde{b}^{\theta}\|^{2}+\|s\|^{2})+W_{\tau}(\nu)^{3})$ . In particular, if $\tau$ is fixed small enough,

[TABLE]

Proof.

Let us write $f_{\nu}\coloneqq\int\phi\mathrm{d}\nu$ and $f^{\star}=\int\phi\mathrm{d}\nu^{\star}$ . By a second order Taylor expansion of $R$ around $f^{\star}$ , we have

[TABLE]

Using the first order expansion of Lemma D.2 for $\phi$ , we get $\|f_{\nu}-f^{\star}\|^{2}_{\star}=b^{\intercal}Kb+O(W_{\tau}(\nu)^{3})$ . Also, using the second order expansion of Lemma D.2 for $J^{\prime}_{\nu^{\star}}$ and using the fact that $J^{\prime}_{\nu^{\star}}$ and its gradient vanish for all $\theta_{i}$ , we get

[TABLE]

and the expansion follows. Notice also that in the expression of $J(\nu)$ , $\bar{r}_{i}$ and $r_{i}$ are interchangeable up to introducing higher order error, since $|r_{i}-\bar{r}_{i}|=O(|b^{r}_{i}|)$ (and also $\|\tilde{b}^{\theta}\|=\|b^{\theta}\|(1+O(W_{\tau}(\nu)))$ ). ∎

D.6 Local estimate of the gradient norm

Proposition D.5 (Gradient estimate).

For $\nu\in\mathcal{P}_{2}(\Omega)$ , it holds

[TABLE]

where $\mathop{\mathrm{err}}(\tau,\nu)\lesssim\tau(\|\tilde{b}^{\theta}\|^{2}+\|s\|^{2})+W_{\tau}(\nu)^{3}$ . In particular, if $\tau$ is fixed small enough

[TABLE]

Proof.

For this proof, we write $f_{\nu}-f^{\star}=\delta f_{0}+\delta f_{b}+\delta f_{\mathrm{err}}$ where

[TABLE]

where the decomposition follows from Lemma D.2. The expression for the norm of the gradient is as follows:

[TABLE]

where $\bar{\nabla}J=(2J,\nabla J)$ . We start with the following decomposition for $\theta\in\Theta_{i}$ (recall that $J^{\prime}_{\nu}(\theta)=\langle\phi(\theta),\nabla R(\int\phi\mathrm{d}\nu)\rangle+\lambda$ ):

[TABLE]

Here we use the notation $\langle\cdot,\cdot\rangle_{\star}$ to denote the quadratic form associated to $\nabla^{2}R(f^{\star})$ . Thanks to the optimality conditions $\bar{\nabla}J^{\prime}_{\nu^{\star}}(\theta_{i})=0$ for $i\in\{1,\dots,m\}$ , we get

[TABLE]

where $N$ collects the higher order terms and is defined as

[TABLE]

where $\|\bar{\nabla}_{j}M_{\phi,3}(\theta_{i},\theta)\|=O(\|\theta-\theta_{i}\|^{2})$ if $j>0$ and $O(\|\theta-\theta_{i}\|^{3})$ if $j=0$ . Expanding the square gives the following ten terms:

[TABLE]

Terms (I) and (II) are the main terms in the expansion, while the other terms are higher order. The term (I) is a local curvature term and can be expressed as $\mathrm{(I)}=\sum_{i=1}^{m}\bar{r}_{i}^{2}\operatorname{tr}\Sigma_{i}H_{i}^{2}$ . The term (II) is a global interaction term that writes

[TABLE]

where the entries of $\bar{K}$ and $\bar{H}$ differ from those of $K$ and $H$ by a factor $\bar{r}_{i}/r_{i}$ . More precisely,

[TABLE]

and similarly for $\bar{H}-H$ . Since $|\bar{r}_{i}/r_{i}-1|=O(|b^{r}_{i}|)$ we have $\sigma_{\max}(\bar{K}-K)=O(W_{\tau}(\nu))$ . It follows, by expanding the square, that

[TABLE]

The remaining terms are error terms, that we estimate directly in terms of $W_{\tau}(\nu)$ and $\tau$ . We use in particular the fact that by Hölder’s inequality, $\int_{\Theta_{i}}\|\theta-\bar{\theta}_{i}\|\mathrm{d}\nu(\theta)=O(\bar{r}_{i}^{2}\operatorname{tr}\Sigma_{i}^{\frac{1}{2}})$ . One has

•

$\mathrm{(III)}=O\left(\sum_{i=1}^{m}\bar{r}_{i}^{2}\bar{r}_{0}^{4}\right)=O(W_{\tau}^{4}(\nu))$ ;

•

$\mathrm{(IV)}=\mathrm{(V)}=0$ because the integral of the terms $H_{i}(\theta-\bar{\theta}_{i})$ vanishes;

•

$\mathrm{(VI)}=O\left(\sum_{i=1}^{m}\bar{r}_{i}^{2}(\|b\|+\|\delta\theta_{i}\|)\cdot\bar{r}_{0}^{2}\right)=O(W_{\tau}^{3}(\nu))$ ;

•

$\mathrm{(VII)}=O(\tau(\|\tilde{b}^{\theta}\|^{2}+\|s\|^{2}))+O(W_{\tau}(\nu)^{3})$ ;

•

$\mathrm{(VIII)}=O(W_{\tau}^{3}(\nu))$ ;

•

$\mathrm{(IX)}=O(W_{\tau}^{4}(\nu))$ ;

•

$\mathrm{(X)}=O(\tau^{2}(\|\tilde{b}^{\theta}\|^{2}+\|s\|^{2}))+O(W_{\tau}(\nu)^{4})$ .

It follows that overall, the error term is in $O(\tau(\|\tilde{b}^{\theta}\|^{2}+\|s\|^{2})+W_{\tau}(\nu)^{3})$ . It remains to lower bound the norm of the gradient over $\Theta_{0}$ , which can be done as follows. As seen in the proof of Lemma D.3, if $\tau$ is small enough then $J^{\prime}_{\nu^{\star}}(\theta)\geq\tau^{2}\sigma_{\min}(H)/4$ for $\theta\in\Theta_{0}$ . Considering only the first component of the gradient, it holds

[TABLE]

Using the expansion $J^{\prime}_{\nu}(\theta)=J^{\prime}_{\nu^{\star}}(\theta)+\langle\phi(\theta),M_{\nabla R,1}(f^{\star},f_{\nu})\rangle$ , we get

[TABLE]

The result follows by collecting all the estimates above. ∎

D.7 Proof of the sharpness inequality (Theorem 3.3)

By Proposition D.4 we have that for $\tau>0$ small enough

[TABLE]

where $C=\sigma_{\max}(K+H)+\|J^{\prime}_{\nu^{\star}}\|_{\infty}$ .

Similarly, by Proposition D.5, for $\tau$ small enough, it holds

[TABLE]

where $C^{\prime}=\frac{1}{8}\sigma_{\min}(H)^{2}\tau^{4}$ . Now fix $\tau>0$ satisfying the hypothesis of Lemma D.3 and the two previous inequalities. By Lemma D.3, $W_{\tau}(\nu)=O((J(\nu)-J^{\star})^{\frac{1}{2}})$ . We deduce that there exists $J_{0}>J^{\star}$ and $\kappa_{0}>0$ , such that whenever $\nu\in\mathcal{M}_{+}(\Theta)$ satisfies $J(\nu)<J_{0}$ , one has

[TABLE]

Finally, notice that if different metric factors $(\alpha,\beta)\neq(1,1)$ are introduced, one can always lower bound the new gradient squared norm as

[TABLE]

which proves the statement for any $(\alpha,\beta)$ . Note however that if one wants to make a more quantitative bound, then there are values $(\alpha_{0},\beta_{0})$ that would lead to a better conditioning and potentially higher values for $J_{0}$ . In this case, the factor appearing in the sharpness inequality should rather be $\min\{\alpha/\alpha_{0},\beta/\beta_{0}\}$ .

Appendix E Estimation of the mirror rate function

We provide an upper bound for the mirror rate function $\mathcal{Qq}$ in the situation that is of interest to us, with $\nu^{\star}$ sparse. Note that this approach could be generalized to arbitrary $\nu^{\star}$ .

Lemma E.1.

Under (A1), there exists $C_{\Theta}>0$ that only depends on the curvature of $\Theta$ , such that for all $\nu^{\star},\nu_{0}\in\mathcal{M}_{+}(\Theta)$ where $\nu^{\star}=\sum_{i=1}^{m^{\star}}r_{i}^{2}\delta_{\theta_{i}}$ and $\nu_{0}=\rho\operatorname{vol}$ where $\log\rho$ is $L$ -Lipschitz, then

[TABLE]

Moreover, for any other $\hat{\nu}_{0}\in\mathcal{M}_{+}(\Theta)$ , it holds $\mathcal{Qq}_{\nu^{\star},\hat{\nu}_{0}}(\tau)\leq\mathcal{Qq}_{\nu^{\star},\nu_{0}}(\tau)+\nu^{\star}(\Theta)\cdot W_{\infty}(\nu_{0},\hat{\nu}_{0}).$

In the context of Lemma E.1, we introduce the quantity,

[TABLE]

which measures how much $\rho$ is a good prior for the (a priori unknown) minimizer $\nu^{\star}$ . With this quantity, the conclusion of Lemma E.1 reads, for $\tau\geq L$ ,

[TABLE]

Proof.

Let us build $\nu_{\epsilon}$ in such a way that the quantity defining $\mathcal{Qq}_{\nu^{\star},\nu_{0}}(\tau)$ in Eq. (15) is small. For this, consider a radius $\epsilon>0$ and the measure $\nu_{\epsilon}$ defined as the normalized volume measure on each geodesic ball of radius $\epsilon$ around each $\theta_{i}$, with mass $r_{i}^{2}$ on this ball, and vanishing everywhere else. Using the transport map that maps these balls to their centers $\theta_{i}$, we get, if $\Theta$ is flat,

[TABLE]

where $V^{(d)}(\epsilon)$ is the volume of a ball of radius $\epsilon$ in $\mathbb{R}^{d}$ , that scales as $\epsilon^{d}$ . Using an integration by parts, it follows

[TABLE]

thus $W_{1}(\nu_{\epsilon},\nu^{\star})\leq\nu^{\star}(\Theta)\epsilon$ . In the general case where $\Theta$ is a potentially curved manifold, this upper bound also depends on the curvature of $\Theta$ around each $\theta_{i}$ , a dependency that we hide in the multiplicative constant so $W_{1}(\nu_{\epsilon},\nu^{\star})\leq C\nu^{\star}(\Theta)\epsilon$ . Let us now control the entropy term. Writing $\rho_{\epsilon}=\mathrm{d}\nu_{\epsilon}/\mathrm{d}{\operatorname{vol}}$ and $\Theta_{i}$ for the geodesic ball of radius $\epsilon$ around $\theta_{i}$ , it holds

[TABLE]

The integral term can be estimated as follows,

[TABLE]

Recalling that $-\log V^{(d)}(\epsilon)\leq-d\log(\epsilon)+C$ for some $C$ that only depends on the curvature of $\Theta$ , we get that the right-hand side of (15) is bounded by

[TABLE]

Let us fix $\epsilon>0$ by minimizing $C\nu^{\star}(\Theta)\epsilon-\nu^{\star}(\Theta)d\log(\epsilon)/\tau$ , which gives $\epsilon=d/(C\tau)$ . The first claim follows by plugging this value for $\epsilon$ in the expression above.
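The choice of $\epsilon$ above can be checked numerically. The following is a minimal sketch, not part of the original proof: the function names `bound` and `optimal_eps` and the numerical values of $\tau$, $d$, $C$ and $\nu^{\star}(\Theta)$ (called `mass`) are illustrative assumptions. It confirms that $\epsilon=d/(C\tau)$ minimizes the $\epsilon$-dependent part of the bound and that the minimum equals $\nu^{\star}(\Theta)\frac{d}{\tau}(1+\log(C\tau/d))$, which is of order $\log(\tau)/\tau$.

```python
import math

def bound(eps, tau, d, C=1.0, mass=1.0):
    """eps-dependent part of the bound: C*mass*eps - mass*d*log(eps)/tau."""
    return C * mass * eps - mass * d * math.log(eps) / tau

def optimal_eps(tau, d, C=1.0):
    """Stationary point: the derivative mass*(C - d/(tau*eps)) vanishes here."""
    return d / (C * tau)

# Illustrative values (assumptions, not taken from the paper).
tau, d, C, mass = 50.0, 3, 2.0, 1.5
eps_star = optimal_eps(tau, d, C)

# The function is convex in eps (second derivative mass*d/(tau*eps^2) > 0),
# so a grid search around eps_star should return eps_star itself.
grid = [eps_star * (0.5 + 0.01 * j) for j in range(101)]
best = min(grid, key=lambda e: bound(e, tau, d, C, mass))

# Value at the optimum in closed form.
closed_form = mass * d / tau * (1.0 + math.log(C * tau / d))
print(eps_star, best, bound(eps_star, tau, d, C, mass), closed_form)
```

The closed-form value makes the $\log(\tau)/\tau$ scaling of the first claim explicit once the remaining terms of the bound are added back.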

For the second claim of the statement, let us build a suitable candidate $\hat{\nu}_{\epsilon}$ in order to upper bound the infimum that defines $\mathcal{Qq}_{\nu^{\star},\hat{\nu}_{0}}(\tau)$ . Let $T$ be an optimal transport map from $\nu_{0}$ to $\hat{\nu}_{0}$ for $W_{\infty}$ , i.e. a measurable map $T:\Theta\to\Theta$ satisfying $T_{\#}\nu_{0}=\hat{\nu}_{0}$ and $\max\{\operatorname{dist}(\theta,T(\theta))\;;\;\theta\in\operatorname{spt}\nu_{0}(=\Theta)\}=W_{\infty}(\nu_{0},\hat{\nu}_{0})$ (see [53, Sec. 3.2], the absolute continuity of $\nu_{0}$ is sufficient for such a map to exist). Now we define $\hat{\nu}_{\epsilon}=T_{\#}\nu_{\epsilon}$ where $\nu_{\epsilon}$ is such that $\mathcal{H}(\nu_{\epsilon},\nu_{0})<\infty$ . Since the relative entropy is non-increasing under pushforwards, it holds $\mathcal{H}(\hat{\nu}_{\epsilon},\hat{\nu}_{0})\leq\mathcal{H}(\nu_{\epsilon},\nu_{0})$ . Moreover, it holds $\|\nu_{\epsilon}-\hat{\nu}_{\epsilon}\|_{\mathrm{BL}}^{*}\leq W_{1}(\nu_{\epsilon},\hat{\nu}_{\epsilon})\leq\nu^{\star}(\Theta)W_{\infty}(\nu_{\epsilon},\hat{\nu}_{\epsilon})$ . Thus we have

[TABLE]

The claim follows by noticing that, by construction, $W_{\infty}(\nu_{\epsilon},\hat{\nu}_{\epsilon})\leq W_{\infty}(\nu_{0},\hat{\nu}_{0})$ and then by taking the infimum in $\nu_{\epsilon}$ . ∎
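The monotonicity of the relative entropy under pushforwards, used in the second claim, can be illustrated on a finite set. This is only a sketch under stated assumptions: the finite state space stands in for $\Theta$, the map `T` and the weights are made up for illustration, and `rel_entropy` assumes the usual extension of the Kullback-Leibler divergence to unnormalized measures, $\mathcal{H}(\mu,\nu)=\int\log(\mathrm{d}\mu/\mathrm{d}\nu)\,\mathrm{d}\mu-\mu(\Theta)+\nu(\Theta)$.

```python
import math

def rel_entropy(mu, nu):
    """Relative entropy between nonnegative weight vectors (mu << nu assumed)."""
    total = 0.0
    for m, n in zip(mu, nu):
        if m > 0.0:
            total += m * math.log(m / n)
    return total - sum(mu) + sum(nu)

def pushforward(weights, T, n_targets):
    """Pushforward of a discrete measure by the map i -> T[i]."""
    out = [0.0] * n_targets
    for i, w in enumerate(weights):
        out[T[i]] += w
    return out

# Hypothetical data: a reference measure nu0, a candidate nu_eps, and a map T
# that merges points {0, 1} -> 0, {2} -> 1, {3, 4} -> 2.
nu0 = [0.4, 0.1, 0.3, 0.2, 0.5]
nu_eps = [0.0, 0.6, 0.1, 0.0, 0.3]
T = [0, 0, 1, 2, 2]

h_before = rel_entropy(nu_eps, nu0)
h_after = rel_entropy(pushforward(nu_eps, T, 3), pushforward(nu0, T, 3))
print(h_before, h_after)
```

Merging points can only lose information about how the two measures differ, which is why the entropy of the pushed-forward pair is no larger.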

Appendix F Global convergence for gradient descent

In the following result, we study the non-convex gradient descent updates $\mu_{k+1}=(T_{k})_{\#}\mu_{k}$ and $\nu_{k}=\mathsf{h}\mu_{k}$ where

[TABLE]

with step-sizes $\alpha,\beta>0$ . When $\beta=0$ , we recover mirror descent updates in $\mathcal{M}_{+}(\Theta)$ with the entropy mirror map (more specifically, this is true when $\mathrm{Ret}$ is the “mirror” retraction defined in Section 2.3).
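To make the update concrete, here is a minimal sketch of one possible implementation on a toy one-dimensional deconvolution problem. Everything specific is an assumption made for illustration: $\Theta$ is taken to be the real line (flat, so curvature plays no role), $\phi(\theta)$ is a Gaussian bump sampled on a grid `XS`, $R$ is a quadratic loss, and the map $T_{k}$ is assumed to act as $r\mapsto r\exp(-2\alpha J^{\prime}_{\nu_{k}}(\theta))$, $\theta\mapsto\theta-\beta\nabla J^{\prime}_{\nu_{k}}(\theta)$, which matches the mirror retraction to first order in $\alpha$. The measure is written $\nu=\sum_{i}r_{i}^{2}\delta_{\theta_{i}}$, with any normalization absorbed into $r_{i}$.

```python
import math

SIGMA, LAM = 0.1, 0.01                      # kernel width and penalty (assumed)
XS = [(j + 0.5) / 50 for j in range(50)]    # observation grid on [0, 1]

def phi(theta):
    """Feature vector phi(theta): a Gaussian bump sampled on XS."""
    return [math.exp(-(x - theta) ** 2 / (2 * SIGMA ** 2)) for x in XS]

def dphi(theta):
    """Derivative of phi with respect to theta."""
    return [(x - theta) / SIGMA ** 2 * p for x, p in zip(XS, phi(theta))]

def model(rs, thetas):
    """f_nu = sum_i r_i^2 phi(theta_i)."""
    f = [0.0] * len(XS)
    for r, t in zip(rs, thetas):
        for j, p in enumerate(phi(t)):
            f[j] += r * r * p
    return f

def objective(rs, thetas, y):
    """J(nu) = R(f_nu) + lambda * nu(Theta), with a quadratic loss R."""
    f = model(rs, thetas)
    data = sum((a - b) ** 2 for a, b in zip(f, y)) / (2 * len(XS))
    return data + LAM * sum(r * r for r in rs)

def step(rs, thetas, y, alpha, beta):
    """One simultaneous update of all particles (assumed mirror retraction)."""
    f = model(rs, thetas)
    grad_R = [(a - b) / len(XS) for a, b in zip(f, y)]
    new_rs, new_thetas = [], []
    for r, t in zip(rs, thetas):
        J1 = sum(p * g for p, g in zip(phi(t), grad_R)) + LAM   # J'_nu(theta)
        dJ1 = sum(p * g for p, g in zip(dphi(t), grad_R))       # gradient of J'_nu
        new_rs.append(r * math.exp(-2 * alpha * J1))            # multiplicative step
        new_thetas.append(t - beta * dJ1)                       # position step
    return new_rs, new_thetas

# Ground truth with two spikes, over-parameterized initialization with m = 20.
y = model([1.0, 1.0], [0.3, 0.7])
m = 20
rs = [0.1] * m
thetas = [(i + 0.5) / m for i in range(m)]
history = [objective(rs, thetas, y)]
for _ in range(300):
    rs, thetas = step(rs, thetas, y, alpha=0.05, beta=0.01)
    history.append(objective(rs, thetas, y))
print(history[0], history[-1])
```

Setting `beta=0.0` in `step` freezes the positions and leaves only the multiplicative weight update, which is the mirror descent regime mentioned above.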

Lemma F.1.

Assume ${\sf(A1-3)}$ and that $J$ admits a minimizer $\nu^{\star}\in\mathcal{M}_{+}(\Theta)$. Then there exist $C,\eta_{\max}>0$ such that for all $\nu_{0}\in\mathcal{M}_{+}(\Theta)$, denoting $B=\sup_{J(\nu)\leq J(\nu_{0})}\|J^{\prime}_{\nu}\|_{\mathrm{BL}}$, if $\max\{\alpha,\beta\}<\eta_{\max}$, it holds

[TABLE]

Proof.

As in the proof of Lemma 2.5, we define $(T_{k}^{r}(\theta),T_{k}^{\theta}(\theta))\coloneqq T_{k}(1,\theta)$ and we define recursively $\nu^{\epsilon}_{k+1}=(T^{\theta}_{k})_{\#}\nu^{\epsilon}_{k}$ where $\nu^{\epsilon}_{0}$ is such that $\mathcal{H}(\nu^{\epsilon}_{0},\nu_{0})<\infty$ . Using the invariance of the relative entropy under diffeomorphisms (indeed, $T^{\theta}_{k}$ is a diffeomorphism of $\Theta$ for $\beta$ small enough), and doing a first order expansion of $T^{r}_{k}=1-2\alpha J^{\prime}_{\nu_{k}}+O(\alpha^{2})$ it holds for $\beta$ small enough

[TABLE]

where the term in $O(\alpha)$ originates from a first order approximation of the retraction. Now, taking $\max\{\alpha,\beta\}$ small enough to ensure decrease of $(J(\nu_{k}))_{k}$ (by Lemma 2.5) so that $C$ above can be chosen independently of $k$ , it follows

[TABLE]

by bounding each term $\|\nu_{k^{\prime}}^{\epsilon}-\nu_{0}^{\epsilon}\|$ by $B\beta k^{\prime}$ . ∎

Proof of Theorem 4.2 (gradient descent).

The proof follows closely that of Theorem 4.1 but we do not track the “constants” (this would be more tedious). By Lemma E.1, there exists $C>0$ (that depends on $\bar{\mathcal{H}}$ , the curvature of $\Theta$ and $\nu^{\star}(\Theta)$ ) such that $\mathcal{Qq}_{\nu^{\star},\hat{\nu}_{0}}(\tau)\leq C(\log\tau)/\tau+\nu^{\star}(\Theta)W_{\infty}(\nu_{0},\hat{\nu}_{0})$ . Combining this with Lemma F.1, we get that when $\max\{\alpha,\beta\}\leq\eta_{\max}$ ,

[TABLE]

Our goal is to choose $k_{0},\alpha,\beta$ and $W_{\infty}(\nu_{0},\hat{\nu}_{0})$ so that this quantity is smaller than $\Delta_{0}\coloneqq J_{0}-J^{\star}$. With $\alpha=1/\sqrt{k}$ and $\beta=\beta_{0}/k$ we get

[TABLE]

Then, using a bound $\log(u)\leq C_{\epsilon}u^{\epsilon}$ , we may choose $k\gtrsim\Delta_{0}^{-2-\epsilon}$ , $\beta_{0}\leq\frac{1}{3}\Delta_{0}/B^{2}$ and $W_{\infty}(\nu_{0},\hat{\nu}_{0})\leq\frac{1}{3}\Delta_{0}/(B\nu^{\star}(\Theta))$ in order to have $J(\nu_{k})-J^{\star}\leq\Delta_{0}$ . This gives $\alpha\lesssim\Delta_{0}^{1+\epsilon/2}$ , $\beta\lesssim\Delta_{0}^{3+\epsilon}$ and the regime of exponential convergence kicks off after $k=\Delta_{0}^{-2-\epsilon}$ iterations. ∎
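The arithmetic behind these parameter choices can be spelled out in a few lines. This is a rough sketch only: the hidden constant in $k\gtrsim\Delta_{0}^{-2-\epsilon}$ is set to a hypothetical `c_k = 1`, the inequalities for $\beta_{0}$ and $W_{\infty}(\nu_{0},\hat{\nu}_{0})$ are taken with equality, and the function name `schedule` and the input values are made up for illustration.

```python
import math

def schedule(delta0, B, mass, eps=0.1, c_k=1.0):
    """Parameter choices from the proof, with the hidden constant c_k assumed."""
    k = math.ceil(c_k * delta0 ** (-2.0 - eps))   # number of iterations
    alpha = 1.0 / math.sqrt(k)                    # alpha = 1/sqrt(k)
    beta0 = delta0 / (3.0 * B ** 2)               # beta0 <= Delta0 / (3 B^2)
    beta = beta0 / k                              # beta = beta0 / k
    w_inf = delta0 / (3.0 * B * mass)             # allowed W_infinity distance
    return {"k": k, "alpha": alpha, "beta": beta, "w_inf": w_inf}

# Illustrative values for Delta0, B and nu_star(Theta).
delta0, B, mass, eps = 0.1, 2.0, 1.5, 0.1
params = schedule(delta0, B, mass, eps)
print(params)
```

With `c_k = 1`, the returned values satisfy $\alpha\leq\Delta_{0}^{1+\epsilon/2}$ and $\beta\leq\Delta_{0}^{3+\epsilon}/(3B^{2})$, matching the scalings stated in the proof.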

Appendix G Faster rate for mirror descent

In this section, we show that for a specific choice of retraction, the convergence rate of $O(\log(t)/t)$ for the gradient flow is preserved for the gradient descent.

Proposition G.1 (Mirror flow, fast rate).

Assume (A1-4) and consider the infinite dimensional mirror descent update

[TABLE]

which corresponds to the so-called mirror retraction in Section 2.3 and $\beta=0$ . For any $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ , there exists $\alpha_{\max}>0$ such that for $\alpha\leq\alpha_{\max}$ it holds, denoting $B_{\nu_{0}}=\sup_{J(\nu)\leq J(\nu_{0})}\|J^{\prime}_{\nu}\|_{\mathrm{BL}}$ ,

[TABLE]

In particular, combining with Lemma E.1, if $\nu_{0}=\rho\!\operatorname{vol}$ has a smooth positive density, then $J(\nu_{k})-J^{\star}=O(\log(k)/k)$ .

Proof.

Consider $\nu_{\epsilon}\in\mathcal{M}_{+}(\Theta)$ such that $\mathcal{H}(\nu_{\epsilon},\nu_{0})<\infty$ . It holds

[TABLE]

where the first equality is obtained by rearranging terms in the definition of $\mathcal{H}$ , and the second one is specific to the mirror retraction. Let us estimate the two terms in the right-hand side. Using convexity inequalities, we get

[TABLE]

Here the term in $O(\alpha\|g_{\nu_{k}}\|^{2}_{L^{2}(\nu_{k})})$ comes from the proof of Lemma 2.5 (note that the iterates remain in a sublevel of $J$ for $\alpha$ small enough). As for the relative entropy term, we have, using the convexity inequality $\exp(u)\geq 1+u$ ,

[TABLE]

We use this inequality in place of the strong convexity of the mirror function used in the usual proof of mirror descent (because there is no Pinsker inequality on $\mathcal{M}_{+}(\Theta)$ ). Coming back to the first equality we have derived, it holds,

[TABLE]

Thus for $\alpha$ small enough, it holds

[TABLE]

Summing over $K$ iterations and dividing by $K$ , we get

[TABLE]

Since for $\alpha$ small enough $(J(\nu_{k}))_{k\geq 1}$ is decreasing (by Lemma 2.5), the result follows. ∎

Appendix H Convergence rate for lower bounded densities

In this section, we justify the claim made in Section 4.3 about the convergence without condition on $\beta/\alpha$ . Let us recall the result that we want to prove.

Proposition H.1.

Under (A1-3), for any $J_{\max}>J^{\star}$ , there exists $C>0$ such that for any $\eta,t>0$ and $\nu_{0}\in\mathcal{M}_{+}(\Theta)$ satisfying $J(\nu_{0})\leq J_{\max}$ , if the projected gradient flow (9) satisfies for $0\leq s\leq t$ ,

[TABLE]

where $S_{t}=\{\theta\in\Theta\;;\;J^{\prime}_{\nu_{s}}(\theta)\leq 0\text{ for some }s\in{[0,t]}\}$ , then $J(\nu_{t})-J^{\star}\leq\frac{C}{\sqrt{\alpha\eta t}}.$

Proof.

Following [60], we start with the convexity inequality

[TABLE]

Let us control these two terms separately. On the one hand, one has by Jensen’s inequality

[TABLE]

Using the fact that on sublevels of $J$ , $\nu(\Theta)$ and $\|g_{\nu}\|^{2}_{L^{2}(\nu)}$ are bounded, we have, for some $C>0$ ,

[TABLE]

On the other hand, we have

[TABLE]

where the last equality defines $v_{t}\leq 0$. Using the gradient flow structure, let us show that a non-zero $v_{t}$ and a lower bound $\eta$ on the density of $\nu_{t}$ (at least on the set $\{J^{\prime}_{\nu_{t}}\leq 0\}$) guarantee a decrease of the objective. Indeed, letting $\Theta_{t}=\{\theta\in\Theta\;;\;J^{\prime}_{\nu_{t}}(\theta)\leq v_{t}/2\}$ (which could be empty), we get

[TABLE]

Moreover, the Lipschitz regularity of $J^{\prime}_{\nu}$ is bounded on sublevels of $J$ , and thus along gradient flow trajectories, so there exists $C^{\prime}>0$ such that $\operatorname{vol}(\Theta_{t})\geq C^{\prime}\cdot|v_{t}|$ . It follows

[TABLE]

Coming back to our first inequality, we have

[TABLE]

for some $C^{\prime\prime}>0$ that, given $J(\nu_{0})$, is independent of $\alpha,\eta$ and $\nu_{t}$. It remains to remark that a continuously differentiable and positive function $h$ that satisfies $h(t)\leq C^{-1/3}\cdot(-h^{\prime}(t))^{1/3}$ also satisfies $C\leq-h^{\prime}(t)/h(t)^{3}=\frac{1}{2}\frac{\mathrm{d}}{\mathrm{d}t}(h(t)^{-2})$ and, after integrating between $0$ and $t$, $h(t)\leq\left(2Ct+h(0)^{-2}\right)^{-1/2}\leq\frac{1}{\sqrt{2Ct}}$. We conclude by taking $h(t)=J(\nu_{t})-J^{\star}$ and $C\propto\alpha\eta$. ∎
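The final integration step can be sanity-checked numerically in the extremal case where the differential inequality holds with equality, that is $h^{\prime}(t)=-Ch(t)^{3}$. The sketch below is an illustration only: the values of $C$, $h(0)$ and the time horizon are arbitrary, and the solver is a standard fourth-order Runge-Kutta scheme rather than anything used in the paper.

```python
import math

def closed_form(t, C, h0):
    """Solution of h' = -C h^3 with h(0) = h0: (2 C t + h0^-2)^(-1/2)."""
    return (2.0 * C * t + h0 ** -2) ** -0.5

def integrate(C, h0, t_end, n_steps):
    """Fourth-order Runge-Kutta integration of h' = -C h^3 on [0, t_end]."""
    h, dt = h0, t_end / n_steps
    f = lambda v: -C * v ** 3
    for _ in range(n_steps):
        k1 = f(h)
        k2 = f(h + 0.5 * dt * k1)
        k3 = f(h + 0.5 * dt * k2)
        k4 = f(h + dt * k3)
        h += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return h

# Arbitrary illustrative values.
C, h0, t_end = 0.3, 2.0, 10.0
h_num = integrate(C, h0, t_end, 10000)
h_exact = closed_form(t_end, C, h0)
envelope = 1.0 / math.sqrt(2.0 * C * t_end)   # the bound 1/sqrt(2 C t)
print(h_num, h_exact, envelope)
```

Any $h$ satisfying the inequality decays at least as fast as this extremal solution, which is what yields the $1/\sqrt{t}$ rate.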

Bibliography

[1] P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
[3] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
[4] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
[5] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
[6] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[7] Adrien Blanchet and Jérôme Bolte. A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. Journal of Functional Analysis, 275(7):1650–1673, 2018.
[8] Nicholas Boyd, Geoffrey Schiebinger, and Benjamin Recht. The alternating descent conditional gradient method for sparse inverse problems. SIAM Journal on Optimization, 27(2):616–639, 2017.


