Particle-based Online Bayesian Sampling

Yifan Yang; Chang Liu; Zheng Zhang

arXiv:2302.14796·cs.LG·March 1, 2023


TL;DR

This paper introduces an online particle-based variational inference algorithm that effectively tracks dynamic posterior distributions in streaming data scenarios, with theoretical guarantees and superior empirical performance.

Contribution

It proposes a novel online Bayesian sampling method using particles and a variance reduction technique, with theoretical analysis and improved results over existing methods.

Findings

01

Achieves better tracking of dynamic posteriors in experiments.

02

Provides theoretical analysis using Wasserstein gradient flow.

03

Outperforms naive Bayesian sampling in online settings.

Abstract

Online optimization has gained increasing interest due to its capability of tracking real-world streaming data. Although online optimization methods have been widely studied in the setting of frequentist statistics, few works have considered online optimization with the Bayesian sampling problem. In this paper, we study an Online Particle-based Variational Inference (OPVI) algorithm that uses a set of particles to represent the approximating distribution. To reduce the gradient error caused by the use of stochastic approximation, we include a sublinear increasing batch-size method to reduce the variance. To track the performance of the OPVI algorithm with respect to a sequence of dynamically changing target posterior, we provide a detailed theoretical analysis from the perspective of Wasserstein gradient flow with a dynamic regret. Synthetic and Bayesian Neural Network experiments show…

Tables

Table 1: Results on a BNN classification task on the Kin8nm dataset, averaged over 20 tries.

Methods	Avg. RMSE	Avg. LL	Time
OPVI $B_{t} = t^{0.55}$	$.127 \pm .008$	$.653 \pm .060$	2.4
OPVI $B = 20$	$.145 \pm .003$	$.516 \pm .021$	2.4
SVGD $B = 20$	$.144 \pm .003$	$.525 \pm .019$	2.4
SVGD $B = 10 k$	$.112 \pm .002$	$.783 \pm .017$	5.8
LD $B = 20$	$.159 \pm .004$	$.425 \pm .024$	1.7
LD $B = 10 k$	$.143 \pm .002$	$.527 \pm .015$	5.2

Equations

$$\max_{w_{t}\in\mathcal{W}}\sum_{t=1}^{T}\left(\sum_{k=1}^{N_{T}}\log p(d_{k}\mid w_{t})+\eta_{t}\log p_{0}(w_{t})\right),$$

$$\min_{w_{t}\in\mathcal{W}}\sum_{t=1}^{T}c_{t}(w_{t})+\eta_{t}c_{0}(w_{t})$$

$$\mathcal{R}(T)=\sum_{t=1}^{T}c_{t}(w_{t})+\eta_{t}c_{0}(w_{t})-c_{t}(w_{t}^{*})-\eta_{t}c_{0}(w_{t}^{*}).$$

$$e_{t}=\frac{1}{B_{t}}\sum_{k\in\mathcal{B}_{t}}\nabla c_{t}^{k}(w_{t})-\nabla c_{t}(w_{t}),$$

$$E_{T}:=\sum_{t=1}^{T}\epsilon_{t}$$

$$w_{t}=\begin{cases}w_{1}\in\mathcal{W}&t=1\\ \Pi_{\mathcal{W}}\left[w_{t-1}-\alpha v_{t}(w_{t-1})\right]&t>1,\end{cases}$$

where $v_{t}(w_{t-1})=\nabla\hat{c}_{t-1}(w_{t-1})+\eta_{t}\nabla c_{0}(w_{t-1})$

$$\|\nabla c_{t}(w_{1})+\eta_{t}\nabla c_{0}(w_{1})-\nabla c_{t}(w_{2})-\eta_{t}\nabla c_{0}(w_{2})\|\leq L\|w_{1}-w_{2}\|\quad t\in[1,T].$$

$$\mathbb{E}[\mathcal{R}(T)]\leq\mathcal{O}(\max(1,E_{T},V_{T})).$$

$$\mathbb{E}[\|e_{t}\|^{2}]=\frac{N_{T}-B_{t}}{N_{T}B_{t}}\Lambda^{2},$$

$$\frac{1}{N_{T}-1}\sum_{i=1}^{N_{T}}\left\|\nabla c_{t}^{i}(w)-\nabla c_{t}(w)\right\|^{2}\leq\Lambda^{2}\quad w\in\mathcal{W}$$

$$E_{T}=\sum_{t=1}^{T}\epsilon_{t}\leq\sum_{t=1}^{T}\frac{1}{\sqrt{t^{\rho}}}\leq\frac{2}{2-\rho}T^{1-\frac{\rho}{2}}$$

$$E_{T}=\sum_{t=1}^{T}\epsilon_{t}=\sum_{t=1}^{T}\sqrt{\frac{1}{B}-\frac{1}{N_{T}}}\leq\sum_{t=1}^{T}\sqrt{\frac{1}{B}\left(1-\frac{1}{T}\right)}\leq\mathcal{O}(T)$$

$$\operatorname{grad}F(q_{t}):=\max\cdot\operatorname*{argmax}_{v:\|v\|_{T_{q_{t}}\mathcal{P}_{2}}=1}\left.\frac{\mathrm{d}}{\mathrm{d}\varepsilon}F\left((\mathrm{id}+\varepsilon v)_{\#}q_{t}\right)\right|_{\varepsilon=0},$$

$$v_{t}=-\nabla_{q_{t}}\mathrm{KL}_{p}(q_{t})=\nabla\log p-\nabla\log q_{t},$$

$$v_{\mathcal{H}}^{\text{SVGD}}(\cdot):=\nabla\log p(x)k(x,\cdot)+\nabla k(x,\cdot)$$

$$v_{\mathcal{H}}^{\text{SVGD}}=\max\cdot\operatorname*{argmax}_{v\in\mathcal{H},\|v\|_{\mathcal{H}}=1}\left\langle v_{\mathcal{L}_{q_{t}}^{2}}^{\text{SVGD}},v\right\rangle_{\mathcal{L}_{q_{t}}^{2}}$$

$$\mathrm{KL}_{p}(q_{t})=\mathbb{E}_{q_{t}}[\log p]-\mathbb{E}_{q_{t}}[\log q_{t}]$$

=

$$\text{O-KL}_{p_{t}}(q_{t})$$

=

$$x_{t+1}^{(i)}=x_{t}^{(i)}+\alpha v_{t}^{\text{OPVI-}\mathcal{H}},$$

$$v_{t}^{\text{OPVI-}\mathcal{H}}(\cdot)=\mathbb{E}_{q(x)}\left[K(x,\cdot)\nabla\sum_{k=1}^{B_{t}}\log p(d_{k}\mid x)+\eta_{t}K(x,\cdot)p_{0}(x)+\nabla K(x,\cdot)\right],$$

$$x_{t+1}^{(i)}=x_{t}^{(i)}+\alpha v_{t}^{\text{OPVI-}\mathcal{H}}(x_{t}^{(i)})$$

$$v_{t}^{\text{OPVI-}\mathcal{H}}(x_{t}^{(i)})=\mathbb{E}_{q(x)}\left[K(x,x_{t}^{(i)})\nabla\sum_{k=1}^{B_{t}}\log p(d_{k}\mid x)+\eta_{t}K(x,x_{t}^{(i)})\nabla p_{0}(x)+\nabla K(x,x_{t}^{(i)})\right]$$

$$v_{t}^{\text{OPVI-}\mathcal{L}^{2}}(q_{t})=-\left(c_{t}(q_{t})+e_{t}+c_{t}^{0}(q_{t})\right)$$

$$q_{t+1}=\operatorname{Exp}_{q_{t}}\left(-\alpha\left(c_{t}(q_{t})+e_{t}+c_{t}^{0}(q_{t})\right)\right)$$

$$d_{K}(q_{1},q_{2})\leq 1+R$$

$$\left|\nabla c_{t}(q_{1})+\nabla c_{t}^{0}(q_{1})-\nabla c_{t}(q_{2})-\nabla c_{t}^{0}(q_{1})\right|\leq L\cdot d(q_{1},q_{2}),\quad\forall q_{1},q_{2}\in\mathcal{P}_{2}(\mathcal{W}),$$

$$\mathbb{E}[d(q_{t+1},q_{t}^{*})]\leq\mathbb{E}[d(q_{t},q_{t}^{*})]$$


Taxonomy

Topics: Advanced Bandit Algorithms Research · Sparse and Compressive Sensing Techniques · Advanced Adaptive Filtering Techniques

Methods: Variational Inference

Full text


Abstract

Online learning has gained increasing interest due to its capability of tracking real-world streaming data. Although it has been widely studied in the setting of frequentist statistics, few works have considered online learning with the Bayesian sampling problem. In this paper, we study an Online Particle-based Variational Inference (OPVI) algorithm that updates a set of particles to gradually approximate the Bayesian posterior. To reduce the gradient error caused by the use of stochastic approximation, we include a sublinear increasing batch-size method to reduce the variance. To track the performance of the OPVI algorithm with respect to a sequence of dynamically changing target posterior, we provide a detailed theoretical analysis from the perspective of Wasserstein gradient flow with a dynamic regret. Synthetic and Bayesian Neural Network experiments show that the proposed algorithm achieves better results than naively applying existing Bayesian sampling methods in the online setting.

Machine Learning, ICML

1 Introduction

Online learning is an indispensable paradigm for problems in the real world, as a machine learning system is often expected to adapt to newly arrived data and respond in real-time. The key challenge in this setting is that the model cannot be updated with all data in history each time, which grows linearly and would make the system unsustainable. There are quite a few online optimization methods developed over the decades that address the challenge by only taking the last arrived batch of data for each update and by using a shrinking step size to control the increase of error. They have been successfully applied to a wide range of tasks like online ranking, network scheduling and portfolio selection (Yu et al., 2017; Pang et al., 2022).

Online optimization methods can be applied directly to models that are fully specified by a single value of their parameters. Beyond such models, there is another class known as Bayesian models, which treat the parameters as random variables and therefore produce an output that is also a random variable (the expectation is often taken as the final output, on par with the conventional case). This stochasticity enables Bayesian models to provide diverse outputs, characterize prediction uncertainty, and be more robust to adversarial attacks (Hernández-Lobato and Adams, 2015; Li and Gal, 2017; Yoon et al., 2018; Zhang et al., 2019; Tolpin et al., 2021; Wagner et al., 2023). Hence Bayesian models are receiving increasing attention in research and practice, and an online learning method for them is highly desirable.

Nevertheless, the learning procedure of Bayesian models is different from conventional models, which poses a challenge in directly applying online optimization methods in an online setting. This is because a Bayesian model is characterized by the distribution of its parameters but not a single value, and the learning task, a.k.a. Bayesian inference, is to approximate the posterior distribution of the parameters given received data. A tractable solution is Variational Inference (VI) (Jordan et al., 1999; Blundell et al., 2015), which approaches the posterior using a parameterized approximating distribution, which enables optimization methods again (Hoffman et al., 2010a; Broderick et al., 2013a; Foti et al., 2014; Chérief-Abdellatif et al., 2019). However, the accuracy is restricted by the expressiveness of the approximating distribution which is not systematically improvable.

A more accurate method is Monte Carlo, which aims to draw samples from the posterior. As the posterior is only known through an unnormalized density function, direct sampling is intractable, and Markov chain Monte Carlo (MCMC) is employed. While MCMC makes sampling tractable, it suffers from poor sample efficiency due to the correlation among the samples. Recently, a new class of Bayesian inference methods has been developed, known as particle-based variational inference (ParVI) (Liu and Wang, 2016; Chen et al., 2018; Liu et al., 2019; Zhu et al., 2020; Zhang et al., 2020; Korba et al., 2021; Liu and Zhu, 2022). These methods approximate the posterior using a set of particles (i.e., samples) of a given size, which are iteratively updated to minimize the difference between the particle distribution and the posterior. The accuracy of the method can be systematically improved with more particles, and because the number of particles is limited, sample efficiency is enforced so as to minimize the difference. While ParVI methods have been successfully applied to the full-batch and mini-batch settings, to our knowledge there is no online version of ParVI.

In this work, we develop an Online Particle-based Variational Inference (OPVI) method to meet this desideratum and also provide an analysis of its regret bound, which can achieve a sublinear order in the number of iterations. The method and analysis are inspired by the distribution optimization view of ParVI on the Wasserstein space, under which we can leverage techniques and theory of conventional online optimization methods. To do this, we first extend existing Maximum a Posteriori (MAP) methods to better handle the prior term, and give the regret bound analysis for the online MAP algorithm. We then extend the results to the Wasserstein space as an online sampling method by leveraging the Riemannian structure of the space. Notably, we leverage techniques from online optimization that improve upon naively applying existing ParVI methods in an online setting. Here, we bound the dynamic regret on the Wasserstein space by using a trigonometric distance inequality for the inexact gradient descent method. We study the empirical performance of the method on a 2-dimensional synthetic setting, which allows easy visualization, and on real-world applications using Bayesian neural networks for image classification. The results suggest better posterior approximation and classification accuracy than naive online ParVI methods and online MCMC methods, and are even comparable to full-batch results.

2 Related Work

Since Cesa-Bianchi and Lugosi (2006) studied the online properties of VI, several works have shown that online VI performs well in practice (Hoffman et al., 2010b, 2013; Broderick et al., 2013b). Furthermore, Chérief-Abdellatif et al. (2019) derive theoretical results on the generalization properties of the online VI algorithm. Although online VI is well studied, few papers address the problem of online MCMC. The exceptions (Chopin, 2002; Kantas et al., 2009; Christensen et al., 2012) study a series of sequential Monte Carlo methods that combine importance sampling with Monte Carlo schemes to track the changing distribution. Unfortunately, no previous work considers an online MCMC method from the perspective of optimization methods, let alone the theory behind it.

Our method employs a gradient descent-based optimization strategy to update particles toward the target posterior. However, the target posterior changes dynamically as streaming data arrive in the system, which makes the optimal solutions change as well. To handle this, we consider a performance metric called dynamic regret in our analysis. Previous research has shown that algorithms that achieve low regret under the traditional (static) regret may perform poorly in dynamic environments (Besbes et al., 2015), and that it is impossible to achieve a sublinear dynamic regret for an arbitrary sequence of loss functions (Yang et al., 2016). To achieve a sublinear regret, researchers propose different constraints on the sequence of loss functions, such as the functional variation (Besbes et al., 2015), gradient variation (Rakhlin and Sridharan, 2013; Yang et al., 2014) and path variation (Yang et al., 2016; Bedi et al., 2018; Cesa-Bianchi et al., 2012). However, even though this dynamic problem is essential to the analysis of Bayesian inference algorithms, no previous paper has considered it. As a result, existing theoretical guarantees for online VI (e.g., Chérief-Abdellatif et al., 2019) may be insufficient in a dynamically changing online environment.

The stochastic gradient descent algorithm is widely used as an incremental gradient algorithm that offers inexpensive iterations by approximating the gradient with a mini-batch of observations. Over the past decade, it has been used in a wide variety of problems with different variations, such as network optimization (Pang et al., 2022; Zhou et al., 2022), reinforcement learning (Liu et al., 2021b, a), federated learning (Sun and Wei, 2022) and recommendation systems (Yang et al., 2020). However, this method also incurs a gradient error when approximating the gradient. In most sampling methods, diverse solutions are obtained by injecting diffusion noise, e.g., Langevin Dynamics (LD) (Neal et al., 2011) and Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011), which makes this type of algorithm sensitive to the noise. For Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016), a similar instability is observed in experiments, even though the reason is still unknown. This instability makes reducing the stochastic gradient error important.

To reduce the gradient error, researchers have studied multiple variance reduction methods, such as using adaptive learning rates and increasing the batch size. In previous work, an adaptive learning rate was used to adapt the optimization to the most informative features with Adagrad (Ward et al., 2019) and to estimate the momentum for Adam (Kingma and Ba, 2014). Compared with the adaptive methods, increasing batch size methods offer greater parallelism and shorter training times (Smith et al., 2017). They have also been studied in offline and online cases (Friedlander and Schmidt, 2012; Zhou et al., 2018), where they are important for achieving a practical convergence rate and a sublinear regret bound. In particular, Bedi et al. (2018) and Yang et al. (2016) give algorithms that consider the more general case of optimization with an inexact gradient.

3 The Online Maximum a Posteriori on Euclidean Space $\mathcal{W}$

In this section, we first introduce an online MAP algorithm on Euclidean decision space $\mathcal{W}$ with gradient descent method, which helps the reader to understand our OPVI sampling method on Wasserstein space. Here, we give some prior knowledge about the online MAP problem and the dynamic regret metric. Then, we give a detailed policy using an online stochastic gradient descent algorithm to solve the online MAP problem and a detailed theoretical analysis based on the dynamic regret metric.

3.1 Preliminaries

For an online MAP algorithm run over time slots $t\in[1,T]$, let $\mathcal{W}\subseteq\mathbb{R}^{d}$ denote a convex set, let $w_{t}\in\mathcal{W}$ be the parameter of interest, and let $\mathcal{N}_{T}=\{d_{1},\cdots,d_{N_{T}}\}$ be the set of i.i.d. observations. In a typical MAP problem, we aim to maximize a target posterior $p(w):=p_{0}(w)\prod_{k=1}^{N_{T}}p(d_{k}\mid w)$, where we usually take the logarithm on both sides to simplify the computation as $\log p(w)=\log p_{0}(w)+\sum_{k}\log p(d_{k}\mid w)$.

Different from offline MAP, we assign the prior an adaptive weight $\eta_{t}=\frac{6}{\pi^{2}t^{2}}$ in our online setting, which spreads the whole prior across the updates so that $\sum_{t=1}^{T}\eta_{t}\rightarrow 1$ as $T\rightarrow\infty$. The goal of the online MAP problem on $\mathcal{W}$ is then to find parameters $w_{t}$ that maximize the cumulative sum of the log-likelihood and the weighted log-prior, which can be given as:

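$$\max_{w_{t}\in\mathcal{W}}\sum_{t=1}^{T}\left(\sum_{k=1}^{N_{T}}\log p(d_{k}\mid w_{t})+\eta_{t}\log p_{0}(w_{t})\right),$$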
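As a quick numerical check of the prior weighting described above, the following sketch sums $\eta_{t}=6/(\pi^{2}t^{2})$ over $t$ and shows that the partial sums approach 1 from below. The function names are illustrative choices and do not come from the paper.

```python
import math


def prior_weight(t: int) -> float:
    """Prior weight eta_t = 6 / (pi^2 * t^2) used at round t (t starts at 1)."""
    return 6.0 / (math.pi ** 2 * t ** 2)


def cumulative_prior_weight(T: int) -> float:
    """Total share of the prior that has been applied after T rounds."""
    return sum(prior_weight(t) for t in range(1, T + 1))


# The partial sums increase toward 1, so the whole prior is spent only in the limit.
for T in (1, 10, 1000, 100000):
    print(T, cumulative_prior_weight(T))
```

Most of the prior is applied in the first few rounds, which matches the intuition that the prior matters most while little data has been seen.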

To simplify the notation, we use $c_{t}^{k}(w_{t}):=-\log p(d_{k}\mid w_{t})$ to denote the negative log-likelihood of data $d_{k}$ and $c_{0}(w_{t}):=-\log p_{0}(w_{t})$ to denote the negative log-prior. Here $c$ is called the cost function in the optimization literature, and we take the negative logarithm because we want the cost function to be positive at all times. We denote by $c_{t}(w_{t})=\sum_{k=1}^{N_{T}}c_{t}^{k}(w_{t})$ the true negative log-likelihood over all data in the dataset. We can then recast the goal in eq. (1) as an optimization problem with $c_{t}+\eta_{t}c_{0}$ as the objective function:

$$\min_{w_{t}\in\mathcal{W}}\sum_{t=1}^{T}c_{t}(w_{t})+\eta_{t}c_{0}(w_{t})$$

As mentioned in Section 2, the target posterior changes dynamically with new observations, so we use dynamic regret as the performance metric for our problem. It is defined as the difference between the total cost incurred over the time slots and the cost of a sequence of optimal solutions $\{w_{t}^{*}\}$ in hindsight, i.e.,

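$$\mathcal{R}(T)=\sum_{t=1}^{T}c_{t}(w_{t})+\eta_{t}c_{0}(w_{t})-c_{t}(w_{t}^{*})-\eta_{t}c_{0}(w_{t}^{*}).$$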
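To make the dynamic regret concrete, here is a minimal sketch on a toy problem. It assumes each round's combined objective $c_{t}+\eta_{t}c_{0}$ is passed in as a single callable, and the quadratic costs, the lagging learner, and all names are hypothetical illustrations rather than the paper's setup.

```python
def dynamic_regret(costs, decisions, optima):
    """Sum over rounds of cost_t(w_t) - cost_t(w_t_star).

    costs[t] is assumed to be the full round-t objective (likelihood cost plus
    weighted prior cost), decisions[t] the learner's choice, and optima[t] the
    per-round minimizer used as the comparator.
    """
    return sum(c(w) - c(w_star) for c, w, w_star in zip(costs, decisions, optima))


# Toy drifting target: the round-t cost is (w - theta_t)^2, minimized at theta_t.
targets = [0.0, 0.5, 1.0, 1.5]
costs = [lambda w, th=th: (w - th) ** 2 for th in targets]

# A learner that always plays the previous round's optimum lags by one step.
decisions = [0.0, 0.0, 0.5, 1.0]

print(dynamic_regret(costs, decisions, targets))  # prints 0.75
```

Unlike static regret, the comparator changes every round, so a learner that merely tracks the best fixed decision can still accumulate regret when the optimum drifts.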

In this paper, instead of using all data in $\mathcal{N}_{T}$, we use a mini-batch $\mathcal{B}_{t}$ to approximate the gradient $\nabla c_{t}(w_{t})$ by $\nabla\hat{c}_{t}$. This approximation leads to a gradient error $e_{t}:=\nabla\hat{c}_{t}(w_{t})-\nabla c_{t}(w_{t})$ that can be calculated by:

$$e_{t}=\frac{1}{B_{t}}\sum_{k\in\mathcal{B}_{t}}\nabla c_{t}^{k}(w_{t})-\nabla c_{t}(w_{t}),$$

where $B_{t}=|\mathcal{B}_{t}|$ is the batch size.

Note that the gradient error $e_{t}$ can be deterministic or stochastic, depending on how the mini-batch is set up. In this paper, we select the samples for mini-batch $\mathcal{B}_{t}$ arbitrarily from $\mathcal{N}_{T}$, which makes the gradient error stochastic. As a result, the expectation of $\|e_{t}\|$ can be bounded by some time-varying variable $\epsilon_{t}$ as $\mathbb{E}[\|e_{t}\|]\leq\epsilon_{t}$. We then introduce an error bound $E_{T}$ to measure the cumulative gradient error caused by the stochastic gradient approximation over $t\in[1,T]$, which is given by:

$$E_{T}:=\sum_{t=1}^{T}\epsilon_{t}$$

We will show that a sublinearly increasing batch size is enough to keep $E_{T}$ growing sublinearly, which enables the online MAP algorithm to enjoy a sublinear dynamic regret.

3.2 Dynamic Algorithm for Online Maximum a Posteriori

It is well known that the online gradient descent algorithm can be used to solve online optimization problems (Zinkevich, 2003; Besbes et al., 2015; Yang et al., 2022). Here, we give an online stochastic gradient descent algorithm with increasing batch size for the online MAP problem in the following updating policy:

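$$w_{t}=\begin{cases}w_{1}\in\mathcal{W}&t=1\\ \Pi_{\mathcal{W}}\left[w_{t-1}-\alpha v_{t}(w_{t-1})\right]&t>1,\end{cases}$$

with $v_{t}(w_{t-1})=\nabla\hat{c}_{t-1}(w_{t-1})+\eta_{t}\nabla c_{0}(w_{t-1})$,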

where $\Pi_{\mathcal{W}}$ is the projection back to the convex set $\mathcal{W}$ .
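The updating policy can be sketched in a few lines for a one-dimensional toy problem. Everything below is an assumption made for illustration: the costs are quadratic ($c^{k}(w)=\tfrac{1}{2}(w-d_{k})^{2}$, $c_{0}(w)=\tfrac{1}{2}w^{2}$), the convex set is an interval, the batch size is $\lceil t^{\rho}\rceil$, the likelihood gradient is averaged over a fresh batch drawn in the current round, and the function names are hypothetical. It is not the paper's implementation.

```python
import math
import random


def project(w, radius):
    """Projection onto the interval [-radius, radius], a 1-D convex set."""
    return max(-radius, min(radius, w))


def online_map(sample, T, alpha=0.5, rho=0.55, radius=10.0, w_init=0.0):
    """Projected online SGD with a sublinearly increasing batch size.

    sample() returns one new observation d_k from the stream.
    Returns the list of iterates [w_0, w_1, ..., w_T].
    """
    w = w_init
    trajectory = [w]
    for t in range(1, T + 1):
        batch_size = math.ceil(t ** rho)            # B_t grows like t^rho
        batch = [sample() for _ in range(batch_size)]
        eta_t = 6.0 / (math.pi ** 2 * t ** 2)       # prior weight for round t
        # Mini-batch estimate of the likelihood gradient for c^k(w) = (w - d_k)^2 / 2.
        grad_likelihood = sum(w - d for d in batch) / batch_size
        grad_prior = w                              # gradient of c_0(w) = w^2 / 2
        w = project(w - alpha * (grad_likelihood + eta_t * grad_prior), radius)
        trajectory.append(w)
    return trajectory


rng = random.Random(0)
trajectory = online_map(lambda: rng.gauss(2.0, 0.5), T=200)
print(trajectory[-1])
```

Because the batch grows with $t$, the noise in each step shrinks over time, so the iterate settles near the data mean even with a constant step size.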

We will illustrate the relationship between $e_{t}$ and $B_{t}$ in the following analysis. Next, we first introduce some widely used assumptions required for the theoretical analysis.

Assumption 1.

(Bounded Convex Set) For any two decisions $w_{1},w_{2}\in\mathcal{W}$ , we have $d(w_{1},w_{2})\leq R$ .

Assumption 2.

(Convexity and Lipschitz smoothness) The function $c_{t}+\eta_{t}c_{0}$ is convex and Lipschitz smooth, i.e., its gradient is Lipschitz continuous with constant $L$: for any $w_{1},w_{2}\in\mathcal{W}$, we have:

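$$\|\nabla c_{t}(w_{1})+\eta_{t}\nabla c_{0}(w_{1})-\nabla c_{t}(w_{2})-\eta_{t}\nabla c_{0}(w_{2})\|\leq L\|w_{1}-w_{2}\|\quad t\in[1,T].$$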

Assumption 3.

(Vanishing gradient) We assume the optimal solutions $w^{*}_{t}$ lie in the interior of the convex set $\mathcal{W}$, i.e., there exists $w^{*}_{t}$ such that $\nabla c_{t}(w^{*}_{t})+\eta_{t}\nabla c_{0}(w^{*}_{t})=0$.

We give a sublinear regret upper bound in the next subsection, which means that $\|w_{t}-w_{t}^{*}\|$ is decreasing and the parameter of interest $w_{t}$ can converge to the dynamically changing optimal solutions $w_{t}^{*}$ when $T$ is large enough. This indicates that we can obtain a promising MAP result with the policy in eq. (15). Furthermore, we also analyze how the increasing batch-size setting influences the performance of the algorithm.

3.3 Theoretical Analysis for Online MAP

In this section, we begin with the proof for the online MAP algorithm following the policy in eq. (15) over the Euclidean space $\mathcal{W}$. As mentioned in Section 2, it is impossible to achieve a sublinear regret bound for an arbitrary sequence of cost functions. To solve this problem, we consider a path variation budget $V_{T}$ for the sequence of optimal solutions $\{w_{t}^{*}\}$, which bounds the cumulative path length of the optimal solutions as $V_{T}:=\sum_{t=1}^{T}\|w_{t}^{*}-w^{*}_{t-1}\|$.
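The path variation budget can be computed directly from a sequence of optima. The sketch below is a scalar illustration with hypothetical names, and it assumes the supplied list starts at $w_{0}^{*}$. It contrasts a slowly drifting sequence, whose path length grows sublinearly, with a linearly drifting one.

```python
import math


def path_variation(optima):
    """Cumulative path length: sum of |w*_t - w*_{t-1}| over consecutive optima."""
    return sum(abs(b - a) for a, b in zip(optima, optima[1:]))


T = 100
slow_drift = [math.sqrt(t) for t in range(T + 1)]   # w*_t = sqrt(t)
fast_drift = [float(t) for t in range(T + 1)]       # w*_t = t

print(path_variation(slow_drift))  # about sqrt(T), sublinear in T
print(path_variation(fast_drift))  # equals T, linear in T
```

Only the first kind of sequence satisfies a sublinear budget, which is the regime in which the regret bound below is informative.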

We give the following Theorem following the proof of (Bedi et al., 2018, Theorem 2). Note that following eq. 3, we use true gradient $\nabla c_{t}(w_{t})$ and a gradient error $e_{t}$ to represent the approximated gradient $\nabla\hat{c}_{t}(w_{t})$ to highlight the influence of the gradient error and simplify the proof. The result is summarized in Theorem 4, which gives the sublinear bound for the dynamic regret $\mathcal{R}(T)$ .

Theorem 4.

(Regret Bound under $\mathcal{W}$ (Bedi et al., 2018, Theorem 2)) Under Assumptions 1-3, given a sequence of optimal solutions $\{w_{t}^{*}\}$, a variational budget $V_{T}$ and a gradient error bound $E_{T}$, and following the updating policy in eq. (15) on the Euclidean space $\mathcal{W}\subseteq\mathbb{R}^{n}$, the dynamic regret satisfies:

$$\mathbb{E}[\mathcal{R}(T)]\leq\mathcal{O}(\max(1,E_{T},V_{T})).$$

Proof.

Details of the proof can be found in Appendix A. ∎

To relate $E_{T}$ to $B_{t}$ and thereby bound $E_{T}$, we analyze the gradient error caused by stochastic batch sampling with the sublinearly increasing batch size, following Theorem 4. Based on Section 2.8 in (Lohr, 2021), we have:

$$\mathbb{E}[\|e_{t}\|^{2}]=\frac{N_{T}-B_{t}}{N_{T}B_{t}}\Lambda^{2},$$

where $N_{T}$ is the total number of data samples we have and $\Lambda$ is a bound on the sample variance of the gradients, which is defined by:

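$$\frac{1}{N_{T}-1}\sum_{i=1}^{N_{T}}\left\|\nabla c_{t}^{i}(w)-\nabla c_{t}(w)\right\|^{2}\leq\Lambda^{2}\quad w\in\mathcal{W}$$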
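The finite-population factor $\frac{N_{T}-B_{t}}{N_{T}B_{t}}$ can be verified exactly on a small example by enumerating every mini-batch drawn without replacement. The sketch below uses scalar stand-ins for the per-sample gradients, compares the batch mean with the population mean (an assumption about the normalization under which the identity holds), and replaces the bound $\Lambda^{2}$ by the exact sample variance. The names are hypothetical.

```python
from itertools import combinations


def expected_squared_error(grads, batch_size):
    """Exact mean of (batch mean - population mean)^2 over all batches of a given size."""
    n = len(grads)
    population_mean = sum(grads) / n
    errors = [
        (sum(batch) / batch_size - population_mean) ** 2
        for batch in combinations(grads, batch_size)
    ]
    return sum(errors) / len(errors)


def finite_population_formula(grads, batch_size):
    """(N - B) / (N * B) times the sample variance computed with the 1 / (N - 1) factor."""
    n = len(grads)
    population_mean = sum(grads) / n
    sample_variance = sum((g - population_mean) ** 2 for g in grads) / (n - 1)
    return (n - batch_size) / (n * batch_size) * sample_variance


toy_grads = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0]
for b in range(1, len(toy_grads) + 1):
    print(b, expected_squared_error(toy_grads, b), finite_population_formula(toy_grads, b))
```

The error vanishes when the batch covers the whole dataset and shrinks roughly like $1/B_{t}$ when the batch is small relative to $N_{T}$.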

To fulfill the requirement of $\mathbb{E}[\|e_{t}\|^{2}]$ in eq. (6), we assume $\epsilon_{t}=\sqrt{\frac{1}{B_{t}}-\frac{1}{N_{T}}}$ and the sublinear increasing batch-size as $B_{t}=\frac{N_{T}t^{\rho}}{N_{T}+t^{\rho}}\quad\rho>0$ . Then, we can bound $E_{T}$ as:

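$$E_{T}=\sum_{t=1}^{T}\epsilon_{t}\leq\sum_{t=1}^{T}\frac{1}{\sqrt{t^{\rho}}}\leq\frac{2}{2-\rho}T^{1-\frac{\rho}{2}}$$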

We can see that when the batch size $B_{t}$ grows sublinearly, the gradient error bound $E_{T}$ is sublinear. Thus, if the variational budget $V_{T}$ is constrained to be sublinear, the regret bound is sublinear. Note that in the regret analysis, we set a static stepsize $\alpha$ for convenience. The algorithm can also achieve a sublinear regret bound when the stepsize is set to be degressive, like $\alpha_{t}=t^{0.55}$. Next, we illustrate why a static batch size fails to achieve a sublinear regret.

Remark: Set the batch size to be static as $B$. The agent updates $w_{t}$ over $T$ rounds and uses a total of $N_{T}$ data samples, where $N_{T}$ can be calculated by $N_{T}=\sum_{t=1}^{T}B=BT$. Following a similar setting to eq. (7), we bound the gradient error over $t\in[1,T]$ as:

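$$E_{T}=\sum_{t=1}^{T}\epsilon_{t}=\sum_{t=1}^{T}\sqrt{\frac{1}{B}-\frac{1}{N_{T}}}\leq\sum_{t=1}^{T}\sqrt{\frac{1}{B}\left(1-\frac{1}{T}\right)}\leq\mathcal{O}(T)$$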

which gives a linearly increasing gradient error bound $E_{T}\leq\mathcal{O}(T)$. That makes it impossible to give a sublinear regret bound, which is necessary to ensure that the algorithm can eventually converge to the optimal solutions.
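The contrast between the two batch-size schedules can be checked numerically. The sketch below assumes $\epsilon_{t}=\sqrt{1/B_{t}-1/N_{T}}$ as in the text, uses $B_{t}=\frac{N_{T}t^{\rho}}{N_{T}+t^{\rho}}$ for the growing schedule and $N_{T}=BT$ for the static one. The function names and the chosen constants are illustrative.

```python
import math


def eps_growing(t, n_total, rho):
    """Per-round error bound with B_t = N t^rho / (N + t^rho); equals t^(-rho / 2)."""
    b_t = n_total * t ** rho / (n_total + t ** rho)
    return math.sqrt(1.0 / b_t - 1.0 / n_total)


def cumulative_error_growing(T, n_total, rho):
    """E_T under the sublinearly increasing batch size."""
    return sum(eps_growing(t, n_total, rho) for t in range(1, T + 1))


def cumulative_error_static(T, batch):
    """E_T under a static batch size B, with N_T = B * T samples in total."""
    n_total = batch * T
    return T * math.sqrt(1.0 / batch - 1.0 / n_total)


def sublinear_bound(T, rho):
    """Closed-form bound 2 / (2 - rho) * T^(1 - rho / 2), valid for 0 < rho < 2."""
    return 2.0 / (2.0 - rho) * T ** (1.0 - rho / 2.0)


for T in (10, 100, 1000):
    print(
        T,
        cumulative_error_growing(T, n_total=10 ** 6, rho=0.55),
        sublinear_bound(T, rho=0.55),
        cumulative_error_static(T, batch=20),
    )
```

Doubling $T$ roughly doubles the static-batch total, while the growing-batch total increases by a factor close to $2^{1-\rho/2}$, which is the sublinear behaviour the analysis relies on.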

4 Online Particle-based Variational Inference on Wasserstein Space $\mathcal{P}_{2}(\mathcal{W})$

In this section, we propose the OPVI algorithm on $\mathcal{P}_{2}(\mathcal{W})$, which reformulates the online MAP problem in Section 3 as an online sampling method from the perspective of Wasserstein gradient flow. To begin with, we introduce some preliminary knowledge about the $2$-Wasserstein space $\mathcal{P}_{2}(\mathcal{W})$, as well as its Riemannian structure and the gradient flow on it. Then, we give a brief introduction to a well-known ParVI method called SVGD (Liu and Wang, 2016) and take it as an example to illustrate how to simulate a ParVI problem as a gradient flow on $\mathcal{P}_{2}(\mathcal{W})$. Based on this idea, we give the theoretical analysis for OPVI as a distribution optimization flow on $\mathcal{P}_{2}(\mathcal{W})$ to show a sublinear dynamic regret. For convenience, we only consider the Wasserstein space supported on the Euclidean space $\mathcal{W}$ in our analysis.

Here, we first clarify the notation used in this section. We use $\mathcal{C}_{c}^{\infty}$ to denote the set of compactly supported $\mathbb{R}^{D}$-valued functions on $\mathcal{W}$ and use $C_{c}^{\infty}$ to denote the scalar-valued functions in $\mathcal{C}_{c}^{\infty}$. Besides the Euclidean space $\mathcal{W}$ and the Wasserstein space $\mathcal{P}_{2}(\mathcal{W})$ just mentioned, we consider two other types of space in this paper: the Hilbert space $\mathcal{L}^{2}_{q}$ and the vector-valued Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}^{D}$ of a kernel $K$. The Hilbert space $\mathcal{L}^{2}_{q}$ is a space of $\mathbb{R}^{D}$-valued functions $\left\{u:\mathbb{R}^{D}\rightarrow\mathbb{R}^{D}\mid\int\|u(x)\|_{2}^{2}\,\mathrm{d}q<\infty\right\}$ with inner product $\langle u,v\rangle_{\mathcal{L}_{q}^{2}}:=\int u(x)\cdot v(x)\,\mathrm{d}q$. The RKHS $\mathcal{H}$ is a kernel version of the Hilbert space $\mathcal{L}^{2}_{q}$, which is the closure of the linear span $\left\{f:f(w)=\sum_{i=1}^{m}a_{i}k\left(w,w_{i}\right),a_{i}\in\mathbb{R},m\in\mathbb{N},w_{i}\in\mathcal{W}\right\}$ equipped with the inner product $\langle f,g\rangle_{\mathcal{H}}=\sum_{ij}a_{i}b_{j}k\left(w_{i},w_{j}\right)$ for $g(w)=\sum_{i}b_{i}k\left(w,w_{i}\right)$.

4.1 The Wasserstein Space $\mathcal{P}_{2}(\mathcal{W})$ , its Riemannian Structure and the Gradient Flow

Generally, the Wasserstein space is a metric space equipped with the Wasserstein distance $d(\cdot,\cdot)$. Let $P(\mathcal{W})$ be the space of probability measures on the Euclidean support space $\mathcal{W}$. The $2$-Wasserstein space on $\mathcal{W}$ can be defined as $\mathcal{P}_{2}(\mathcal{W}):=\{\mu\in P(\mathcal{W}):\int_{\mathcal{W}}\|w\|^{2}d\mu(w)<\infty\}$. Since the Riemannian structure of the Wasserstein space was discovered (Otto, 2001; Benamou and Brenier, 2000), several interesting quantities have been defined on it, such as the gradient and the inner product.

To define the gradient of a smooth curve $(q_{t})_{t}$ on $\mathcal{P}_{2}(\mathcal{W})$ , we can set a time-dependent vector field $v_{t}(w)$ on $\mathcal{W}$ , such that for a.e. $t\in\mathbb{R},\partial_{t}q_{t}+\nabla\cdot\left(v_{t}q_{t}\right)=0$ and $v_{t}\in\overline{\left\{\nabla\varphi:\varphi\in C_{c}^{\infty}\right\}}^{\mathcal{L}_{q_{t}}^{2}}$ , where the overline means closure (Villani, 2009). Note that the vector field $v_{t}$ here is the so-called tangent vector of the curve $(q_{t})_{t}$ at $q_{t}$ and the closure is denoted as tangent space $T_{q_{t}}\mathcal{P}_{2}$ at $q_{t}$ , whose elements are the tangent vectors for the curves passing through the point $q_{t}$ . The relation between $T_{q_{t}}\mathcal{P}_{2}$ , $v_{t}$ and $\mathcal{P}_{2}(\mathcal{W})$ can be found in Fig. 1. The inner product in the tangent space $T_{q_{t}}\mathcal{P}_{2}$ is defined on $\mathcal{L}^{2}_{q}$ , which defines the Riemannian structure on $\mathcal{P}_{2}(\mathcal{W})$ and is consistent with the Wasserstein distance due to the Benamou-Brenier formula (Benamou and Brenier, 2000).

An important role of the vector field representation is that we can approximate the change of distribution $q_{t}$ within a distribution curve $(q_{t})_{t}$ . For a single update in each time slot, we can set $(\operatorname{id}+\varepsilon v_{t})_{\#}q_{t}$ as a first-order approximation of the updated distribution $q_{t+1}$ in the next time slot (Ambrosio et al., 2005). Therefore, for a set of particles $\{x_{t}^{(i)}\}_{i}$ that obey distribution $q_{t}$ at time $t$ , we can update these particles with a stepsize of $\varepsilon$ as $\{x_{t}^{(i)}+\varepsilon v_{t}(x_{t}^{(i)})\}_{i}$ , to approximate distribution $q_{t+1}$ in time $t+1$ , when $\varepsilon$ is small. We show this approximation as a red arrow in Fig. 1.

Another important concept on $\mathcal{P}_{2}(\mathcal{W})$ is the gradient flow. Given a function $F$, the gradient flow can be described as the family of descending curves $\{(q_{t})_{t}\}$ that maximize the rate of decrease of $F$. In $\mathcal{P}_{2}(\mathcal{W})$, the tangent vector of the gradient flow $(q_{t})_{t}$ can be defined by the gradient of $F$ at $q_{t}$, which is given by:

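$$\operatorname{grad}F(q_{t}):=\max\cdot\operatorname*{argmax}_{v:\|v\|_{T_{q_{t}}\mathcal{P}_{2}}=1}\left.\frac{\mathrm{d}}{\mathrm{d}\varepsilon}F\left((\mathrm{id}+\varepsilon v)_{\#}q_{t}\right)\right|_{\varepsilon=0},$$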

where we define a measurable transformation $\mathcal{T}:\mathcal{W}\rightarrow\mathcal{W}$ and denote by $\mathcal{T}_{\#}q_{t}$ the $\mathcal{T}$-transformed distribution of $q_{t}$.

In the task of Bayesian inference, our goal is to minimize the KL-divergence between the current estimated distribution $q_{t}$ and the target posterior $p$, $\mathrm{KL}_{p}(q_{t}):=\int_{\mathcal{W}}\log(q_{t}/p)\,\mathrm{d}q_{t}$, whose gradient flow $(q_{t})_{t}$ has as its tangent vector the vector field:

$$v_{t}=-\nabla_{q_{t}}\mathrm{KL}_{p}(q_{t})=\nabla\log p-\nabla\log q_{t},$$

4.2 Particle-based Variational Inference Methods

In this section, we first use SVGD as an example to illustrate the ParVI methods. Then we show how to simulate SVGD as the gradient flow on the Wasserstein space $\mathcal{P}_{2}(\mathcal{W})$, which helps the analysis of OPVI in the following subsection. For SVGD, let $\{x^{(i)}_{t}\}_{i=1}^{n}$ be a set of particles that obey an empirical measure of distribution $q_{t}$. We initialize $q_{t}$ as some simple distribution $q_{0}$, then use a vector field $v$ to update these particles toward the target posterior $p$: $x_{t+1}^{(i)}=x_{t}^{(i)}+\varepsilon v(x_{t}^{(i)})$, where $v$ should be chosen to maximize the rate of decrease of the KL-divergence $-\left.\frac{\mathrm{d}}{\mathrm{d}\varepsilon}\mathrm{KL}_{p}\left((\mathrm{id}+\varepsilon v)_{\#}q\right)\right|_{\varepsilon=0}$. In SVGD, the vector field is optimized over the RKHS $\mathcal{H}$, with a closed-form solution:

[TABLE]

Note that the SVGD particle update is actually an approximation of the $\mathcal{P}_{2}(\mathcal{W})$ gradient flow that takes $\mathcal{H}$ as its tangent space instead of $\mathcal{L}^{2}_{q_{t}}$ , since a function in $\mathcal{H}$ is roughly a kernel-smoothed function in $\mathcal{L}^{2}_{q_{t}}$ (Liu and Zhang, 2019). Thus, the vector field $v_{\mathcal{H}}^{\text{SVGD}}$ in eq. (8) can be used to approximate the vector field $v_{\mathcal{L}^{2}_{q_{t}}}^{\text{SVGD}}$ in $\mathcal{L}^{2}_{q_{t}}$ on $\mathcal{P}_{2}(\mathcal{W})$ (Liu et al., 2019, Theorem 2), where the solution gives:

[TABLE]

This allows us to use $v_{\mathcal{L}^{2}_{q_{t}}}^{\text{SVGD}}$ to approximate the vector field $v_{\mathcal{H}}^{\text{SVGD}}$ on $\mathcal{P}_{2}(\mathcal{W})$ in the following analysis, which is like a projection from $\mathcal{H}$ to $\mathcal{L}^{2}_{q_{t}}$ .
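The SVGD particle update can be sketched in one dimension as follows. This is an illustrative implementation of the standard SVGD direction of Liu and Wang (2016), a kernel-weighted score plus a kernel-gradient term, with the kernel $K(x,x^{\prime})=\exp(-\frac{1}{h}(x-x^{\prime})^{2})$ . The target $\mathcal{N}(3,1)$ , the bandwidth, the stepsize and all function names are assumptions chosen for the example.

```python
import math


def rbf(x, y, h):
    """Exponential kernel K(x, y) = exp(-(x - y)^2 / h)."""
    return math.exp(-((x - y) ** 2) / h)


def svgd_direction(particles, grad_log_p, h=1.0):
    """Empirical SVGD direction evaluated at every particle."""
    n = len(particles)
    direction = []
    for xi in particles:
        total = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            drive = k * grad_log_p(xj)            # kernel-weighted score term
            repulse = (2.0 / h) * (xi - xj) * k   # gradient of K(xj, xi) w.r.t. xj
            total += drive + repulse
        direction.append(total / n)
    return direction


def svgd_step(particles, grad_log_p, eps=0.1, h=1.0):
    """One update x_i <- x_i + eps * v(x_i)."""
    phi = svgd_direction(particles, grad_log_p, h)
    return [x + eps * v for x, v in zip(particles, phi)]


# Hypothetical target N(3, 1), whose score is grad log p(x) = -(x - 3).
particles = [-1.0 + 0.1 * i for i in range(20)]
for _ in range(500):
    particles = svgd_step(particles, lambda x: -(x - 3.0))
mean = sum(particles) / len(particles)
```

The score term drags the particles toward the high-density region of the target, while the kernel-gradient term keeps them from collapsing onto a single point.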

4.3 Online Particle-based Variational Inference on $\mathcal{P}_{2}(\mathcal{W})$

In this section, we develop an online sampling method on $\mathcal{P}_{2}(\mathcal{W})$ and propose the OPVI algorithm. We first illustrate the policy of OPVI over the RKHS $\mathcal{H}$ . Then, we interpret OPVI as a gradient flow on $\mathcal{P}_{2}(\mathcal{W})$ and conduct the theoretical analysis by transferring the proof in Section 3 from the Euclidean space $\mathcal{W}$ to the Wasserstein space $\mathcal{P}_{2}(\mathcal{W})$ . Note that we use $v_{t}^{\text{OPVI-}\mathcal{H}}$ to denote the vector field on the RKHS $\mathcal{H}$ and $v_{t}^{\text{OPVI-}\mathcal{L}^{2}}$ to denote the vector field on $\mathcal{L}_{q_{t}}^{2}$ .

We begin with reviewing the KL-divergence in an offline setting, which is given as:

[TABLE]

where $N_{T}$ is the number of data samples in the dataset. Following a similar idea to the online MAP algorithm, we set an adaptive weight $\eta_{t}=\frac{6}{\pi^{2}t^{2}}$ for the prior in our online setting and use a mini-batch of size $B_{t}$ to approximate the likelihood. Thus, we give an online stochastic version of the KL-divergence between $q_{t}$ and the dynamically changing posterior $p_{t}$ as:

[TABLE]
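A short numerical check of the prior weight $\eta_{t}=\frac{6}{\pi^{2}t^{2}}$ is sketched below. Since $\sum_{t\geq 1}t^{-2}=\pi^{2}/6$ , the weights sum to one over an infinite horizon. One possible reading of this choice, which is our interpretation rather than a statement from the analysis, is that the prior is counted once in total across the stream. The function name is illustrative.

```python
import math


def prior_weight(t):
    """Adaptive prior weight eta_t = 6 / (pi^2 t^2) for time slot t >= 1."""
    return 6.0 / (math.pi ** 2 * t ** 2)


# Partial sums of the weights for a few horizons T.
partial_sums = {
    T: sum(prior_weight(t) for t in range(1, T + 1))
    for T in (1, 10, 100, 10000)
}
```

The partial sums increase monotonically toward 1, and the weight of the prior in any single time slot decays quadratically in $t$ .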

Similar to SVGD, we first draw a set of particles $\{x^{(i)}_{0}\}_{i=1}^{n}$ that obey some simple initial distribution $q_{0}$ . Then, we update these particles with a gradient descent updating scheme with step size $\alpha$ :

[TABLE]

where $v_{t}^{\text{OPVI-}\mathcal{H}}$ is the vector field on $\mathcal{H}$ that maximizes the rate of decrease of the online stochastic KL-divergence, $-\frac{\mathrm{d}}{\mathrm{d}\alpha}\text{O-KL}_{p_{t}}((\mathrm{id}+\alpha v_{t})_{\#}q)|_{\alpha=0}$ , which gives the closed-form solution:

[TABLE]

where the kernel $K(x,x^{\prime})$ can be a commonly used kernel such as the exponential kernel $K(x,x^{\prime})=\exp(-\frac{1}{h}\|x-x^{\prime}\|^{2}_{2})$ . The general workflow of the OPVI algorithm is summarized in Alg. 1.

4.4 Proof of Dynamic Regret Bound under $\mathcal{P}_{2}(\mathcal{W})$

To begin with, we formulate the updating rule in Alg. 1 as a Wasserstein gradient flow. Here, we ignore the kernel smoothing used in the implementation of the algorithm by approximating the vector field $v_{t}^{\text{OPVI-}\mathcal{H}}$ on the RKHS $\mathcal{H}$ with the vector field $v_{t}^{\text{OPVI-}\mathcal{L}^{2}}$ on the Hilbert space $\mathcal{L}^{2}$ . To simplify the proof, we denote $c_{t}^{k}(q_{t})=-\mathbb{E}_{q_{t}}[\log p(d_{k}|\cdot)]$ and $c_{t}^{0}(q_{t})=-\eta_{t}\mathbb{E}_{q_{t}}[\log p_{0}]+\mathbb{E}_{q_{t}}[\log q_{t}]$ in eq. (4.3) and follow eq. (3) to represent the stochastic approximation as the sum of the true gradient and a gradient error $e_{t}$ , which gives:

[TABLE]

Then, the updating of the particles can be formulated as an optimal transport for distribution $q_{t}$ over $\mathcal{P}_{2}(\mathcal{W})$ as:

[TABLE]

Before we give the proof of the regret bound, we first restate the assumptions on $\mathcal{P}_{2}(\mathcal{W})$ .

Assumption 5.

(Bounded geodesically-convex (g-convex) set on $\mathcal{P}_{2}(\mathcal{W})$ ) Assume $\mathcal{K}$ to be a g-convex set on some Wasserstein space $\mathcal{P}_{2}(\mathcal{W})$ supported on $\mathcal{W}$ . From Theorem 2 of (Gibbs and Su, 2002), we can establish a bound for the maximum Wasserstein distance in a bounded support space with $\operatorname{diam}(\mathcal{W})<R$ . Then $\forall q_{1},q_{2}\in\mathcal{P}_{2}(\mathcal{W})$ , we have:

[TABLE]

which bounds the geodesically convex set $\mathcal{K}$ .

Assumption 6.

(Geodesically-L-Lipschitz (g-L-Lipschitz)). Similar to the definition over $\mathcal{W}$ , we assume that $c_{t}(q_{1})+c_{t}^{0}(q_{1})$ is a g-convex function with a geodesically $L$ -Lipschitz continuous gradient on $\mathcal{P}_{2}(\mathcal{W})$ , i.e., there exists a constant $L>0$ such that:

[TABLE]

where $d(a,b)$ denotes a Wasserstein distance between $a$ and $b$ .

Compared with the proof on $\mathcal{W}$ , the key difference is the way to obtain Lemma 9. Instead of updating a set of parameters of interest over $\mathcal{W}$ , we update the distribution $q_{t}$ by optimal transport over $\mathcal{P}_{2}(\mathcal{W})$ .

Lemma 7.

Suppose that $\mathcal{P}_{2}(\mathcal{W})$ is a Wasserstein space supported on Euclidean space $\mathcal{W}$ with the sectional curvature lower bounded by $-\kappa$ $(\kappa>0)$ . Under Assumptions 3, 5, and 6, for any $q_{t}\in\mathcal{K}$ , following the updating rule in eq. (11), we have:

[TABLE]

where $\Phi=2\alpha-3L\alpha^{2}\zeta(\kappa,R)$ .

Proof.

The proof can be found in Appendix B. ∎

Using Lemma 7 and the definition of dynamic regret in eq. (2), we give the dynamic regret bound on $\mathcal{P}_{2}(\mathcal{W})$ in the following Theorem.

Theorem 8.

(Regret Bound over $\mathcal{P}_{2}(\mathcal{W})$ ) Under Assumptions 3, 5, and 6, given a sequence of optimal solutions $\{q_{t}^{*}\}$ , define the variational budget $V_{T}:=\sum_{t=1}^{T}d(q^{*}_{t},q^{*}_{t+1})$ and the error bound $E_{T}$ . Following the updating rule in eq. (11), we have the dynamic regret bound:

[TABLE]

Proof.

The detail of the proof can be found in Appendix C. ∎

Different from the proof for inexact gradient descent on Euclidean space, we include the trigonometric distance inequality introduced in (Zhang et al., 2016) and give the first dynamic regret bound for inexact infinitesimal gradient descent methods over $\mathcal{P}_{2}(\mathcal{W})$ . Note that the regret bound here depends on a curvature bound $\kappa$ , which we treat as a constant since it is not the focus of this paper.
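The role of the curvature can be illustrated with the factor $\zeta(\kappa,c)=\frac{\sqrt{|\kappa|}c}{\tanh(\sqrt{|\kappa|}c)}$ from Appendix B and the coefficient $\Phi=2\alpha-3L\alpha^{2}\zeta(\kappa,R)$ from Lemma 7. The sketch below assumes that the bound is informative only when $\Phi>0$ , which is equivalent to $\alpha<\frac{2}{3L\zeta(\kappa,R)}$ ; the numerical values of $L$ , $\kappa$ and $R$ are illustrative.

```python
import math


def zeta(kappa, c):
    """Curvature factor sqrt(|kappa|) c / tanh(sqrt(|kappa|) c); equals 1 in the flat limit."""
    z = math.sqrt(abs(kappa)) * c
    return 1.0 if z == 0.0 else z / math.tanh(z)


def phi(alpha, lipschitz, kappa, radius):
    """Coefficient Phi = 2 alpha - 3 L alpha^2 zeta(kappa, R)."""
    return 2.0 * alpha - 3.0 * lipschitz * alpha ** 2 * zeta(kappa, radius)


def max_step_size(lipschitz, kappa, radius):
    """Largest alpha with Phi >= 0, obtained by solving Phi = 0 for alpha > 0."""
    return 2.0 / (3.0 * lipschitz * zeta(kappa, radius))


# Illustrative constants, not taken from the experiments.
L_CONST, KAPPA, RADIUS = 1.0, 1.0, 2.0
alpha_max = max_step_size(L_CONST, KAPPA, RADIUS)
phi_inside = phi(0.5 * alpha_max, L_CONST, KAPPA, RADIUS)
phi_outside = phi(1.5 * alpha_max, L_CONST, KAPPA, RADIUS)
```

Since $\zeta\geq 1$ and $\zeta$ grows with $\sqrt{|\kappa|}R$ , a more strongly curved or larger domain shrinks the admissible range of stepsizes.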

Since the gradient error is defined in $\mathbb{R}^{D}$ , we can follow the same analysis as in Section 3.3 to bound the gradient error term $E_{T}$ , which gives a sublinear error bound. As a result, by imposing a sublinearly increasing constraint on the variational budget $V_{T}$ , we can ensure that $R_{\mathcal{P}_{2}(\mathcal{W})}(T)$ grows sublinearly. This means that the OPVI method can track the dynamically changing target posterior $p_{t}$ when $T$ is large enough.
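As an illustration of a sublinear variational budget, the sketch below evaluates $V_{T}$ for a hypothetical sequence of one-dimensional Gaussian optima $q_{t}^{*}=\mathcal{N}(\sqrt{t},1)$ , using the closed-form 2-Wasserstein distance between Gaussians on the real line. The drifting sequence and the function names are assumptions made for the example.

```python
import math


def w2_gauss(m1, s1, m2, s2):
    """2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) on the real line."""
    return math.hypot(m1 - m2, s1 - s2)


def variational_budget(optima):
    """V_T = sum_t d(q*_t, q*_{t+1}) for a list of (mean, std) pairs of length T + 1."""
    return sum(w2_gauss(*a, *b) for a, b in zip(optima, optima[1:]))


# Hypothetical optima whose mean drifts as sqrt(t) while the variance stays fixed.
T = 1000
optima = [(math.sqrt(t), 1.0) for t in range(1, T + 2)]
V_T = variational_budget(optima)
```

For this sequence the sum telescopes to $V_{T}=\sqrt{T+1}-1$ , so $V_{T}/T\rightarrow 0$ and the budget satisfies the sublinear constraint.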

In SVGD, the authors did not consider this gradient error in their algorithm. However, since the gradient error can be viewed as part of the noise added to the updating process, we should not use the whole diffusion term $\nabla K(x,\cdot)$ in eq. (8). In the experiments, we set the diffusion term to $0.1\cdot\nabla K(x,\cdot)$ for OPVI. We observe that this trick gives a substantial improvement in performance, especially in high-dimensional tasks such as image classification.
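The effect of scaling the diffusion term can be sketched by splitting a kernelized direction into its score part and its kernel-gradient part. The code below is not the OPVI vector field itself; it only illustrates, for a hypothetical target $\mathcal{N}(0,1)$ and three particles, how a factor of $0.1$ on $\nabla K$ changes the direction. All names are assumptions.

```python
import math


def split_direction(particles, grad_log_p, h=1.0):
    """Return the score (drive) and kernel-gradient (repulse) parts separately."""
    n = len(particles)
    drive, repulse = [], []
    for xi in particles:
        d = r = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / h)
            d += k * grad_log_p(xj)            # kernel-weighted score
            r += (2.0 / h) * (xi - xj) * k     # gradient of K(xj, xi) w.r.t. xj
        drive.append(d / n)
        repulse.append(r / n)
    return drive, repulse


def damped_direction(particles, grad_log_p, scale=0.1, h=1.0):
    """Combine the two parts with a scaled diffusion (kernel-gradient) term."""
    drive, repulse = split_direction(particles, grad_log_p, h)
    return [d + scale * r for d, r in zip(drive, repulse)]


particles = [-0.5, 0.0, 0.5]


def score(x):
    return -x   # score of the hypothetical target N(0, 1)


full = damped_direction(particles, score, scale=1.0)
damped = damped_direction(particles, score, scale=0.1)
```

In this toy configuration the full diffusion term pushes the outer particles apart, whereas the damped version lets the score term dominate, so the outer particles move toward the mode.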

5 Experiments

In this section, we test the performance of the proposed OPVI algorithm and compare it with two well-known Bayesian sampling methods, LD (Welling and Teh, 2011) and SVGD (Liu and Wang, 2016). We run these methods with three types of batch settings: mini-batch with increasing batch size, mini-batch with static batch size, and full batch. To make the comparison fair, we set a Fixed Iterations and Total Data Samples (FITDS) policy for experiments under the mini-batch setting, which means that the total number of data samples $N_{T}$ and the total number of time slots $T$ are the same for each experiment.

Except for the full-batch methods, all algorithms follow the FITDS policy. For a dataset of nearly 10k data samples, we run all methods for 500 rounds and set $B=20$ for the static batch size methods and $B_{t}=t^{0.55}$ for the increasing batch size methods to keep $N_{T}$ the same. For full batch methods, we use all 10k data samples in each round to show the best possible results. All experiments are run under the same setting unless otherwise stated. Code for these experiments is available at https://github.com/yifanycc/OPVI.
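The FITDS budget can be checked with a few lines of arithmetic. The sketch below compares the total number of samples consumed by the static schedule $B=20$ and by the increasing schedule $B_{t}=t^{0.55}$ over 500 rounds. Rounding each batch size to an integer is our assumption, since the text does not state how fractional sizes are handled.

```python
def increasing_batch_sizes(num_rounds, power=0.55):
    """Real-valued batch sizes B_t = t^power for t = 1, ..., num_rounds."""
    return [t ** power for t in range(1, num_rounds + 1)]


T = 500
static_total = 20 * T                               # static schedule with B = 20
schedule = increasing_batch_sizes(T)
increasing_total = sum(schedule)                    # real-valued total
rounded_total = sum(max(1, round(b)) for b in schedule)   # assumed integer rounding
```

Both schedules consume roughly 10k samples in total, which is what makes the comparison between the static and increasing batch size methods fair under FITDS.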

5.1 Synthetic Experiments

The synthetic experiments follow the setting in (Welling and Teh, 2011), which considers a simple example with two parameters based on a Gaussian mixture distribution:

[TABLE]

where $\sigma_{1}^{2}=10$ , $\sigma_{2}^{2}=1$ and $\sigma_{x}=2$ .

Here, we draw approximately 10,000 data samples from the above distribution with $\theta_{1}=0$ and $\theta_{2}=1$ . Fig. 2 shows the results for OPVI, SVGD, and LD with 100 particles, where the true posteriors are shown as contours and the inference results are represented by the particles.
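A data-generation sketch for this experiment is given below. It assumes the two-component likelihood used by Welling and Teh (2011), $x\sim\frac{1}{2}\mathcal{N}(\theta_{1},\sigma_{x}^{2})+\frac{1}{2}\mathcal{N}(\theta_{1}+\theta_{2},\sigma_{x}^{2})$ , and treats $\sigma_{x}=2$ as a standard deviation. The function name and the seed are illustrative.

```python
import random


def sample_mixture(n, theta1=0.0, theta2=1.0, sigma_x=2.0, seed=0):
    """Draw n points from the assumed equal-weight two-component Gaussian mixture."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Pick one of the two component means with probability 1/2 each.
        mean = theta1 if rng.random() < 0.5 else theta1 + theta2
        data.append(rng.gauss(mean, sigma_x))
    return data


data = sample_mixture(10_000)
sample_mean = sum(data) / len(data)
```

Under these assumptions the population mean is $\theta_{1}+\theta_{2}/2=0.5$ , which the sample mean should approximately reproduce.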

As we can observe from the results, the proposed increasing batch size OPVI gives a better result than the static batch size OPVI, which we attribute to the increasing batch size acting as a variance reduction method. Compared with SVGD and LD, the OPVI method tracks the posterior much better. This is likely because the gradient noise interferes with the noise injection process of the LD method, whereas OPVI uses a smaller diffusion term to offset the gradient error. In the last two figures, we can see that the performance of OPVI approaches or even exceeds that of the full batch methods.

5.2 Bayesian Neural Network (BNN) Experiment

In this subsection, we further compare our work with SVGD and LD on a Bayesian Neural Network (BNN) task. We follow the experimental setting in (Liu and Tao, 2015), which uses a single-hidden-layer BNN with 50 hidden units. We use a Gamma(1, 0.1) distribution in the prior, take Kin8nm as the dataset, and randomly split it into 90% for training and 10% for testing. For all methods, we set the number of particles to 20.

All ParVI methods use the same stepsize, while LD uses a smaller stepsize tuned to be the best possible. We report the Root Mean Squared Error (RMSE) and the test Log-Likelihood (LL). The experimental results are shown in Table 1. The OPVI algorithm achieves improvements of 11.8% and 20.1% over SVGD and LD, respectively, with the same total number of data samples $N_{T}$ and the same total number of time slots $T$ . This result is even comparable to the full batch SVGD algorithm. Note that the running time of OPVI is the same as that of the SVGD algorithm, which is less than half of that of the full batch methods.

5.3 Image Classification Task

Finally, we conduct experiments to test the performance of the proposed algorithm on a high-dimensional image classification problem. We use the MNIST dataset, which contains 60,000 training cases and 10,000 test cases. We consider a two-layer BNN model with 100 hidden variables, a sigmoid input layer and a softmax output layer. All experiments use 20 particles. The comparison result is shown in Fig. 3. As we can see from the figure, except for the full batch LD algorithm, the OPVI algorithm with an increasing batch size achieves the best result. However, the full batch LD method uses much more time (30 times) and many more data samples (500 times) for a similar result. We can observe that the noise of the increasing batch size OPVI decreases as $t$ increases, which verifies our analysis of the gradient error. Interestingly, SVGD performs poorly in this high-dimensional task, which may be caused by an inaccurate approximation of the diffusion term with a limited number of particles. In contrast, OPVI uses the modified diffusion term, which alleviates this problem.

6 Conclusion

In this paper, we consider the OPVI algorithm as a possible sampling method for the intractable posterior under the online setting. To reduce the variance, we include an increasing batch size scheme and analyze the influence of the choice of batch size on the performance of the algorithm. Furthermore, we develop a detailed analysis by understanding the algorithm as a Wasserstein gradient flow. Experiments show the proposed algorithm outperforms other naive online particle-based VI and online MCMC methods.

Appendix A Proof of Theorem 4

The main idea of this proof follows [Bedi et al., 2018, Theorem 2]. We start by providing a Lemma that gives the relationship between the distance $d(w_{t+1},w^{*}_{t})$ and the quantity $c_{t}(w_{t})+\eta_{t}c_{0}(w_{t})-c_{t}(w_{t}^{*})-\eta_{t}c_{0}(w_{t}^{*})$ . Then, we use this Lemma to prove Theorem 4.

Lemma 9.

Under Assumptions 1 - 3, given a sequence of optimal solutions $\{w_{t}^{*}\}$ , gradient error $\mathbb{E}[e_{t}]\leq\epsilon_{t}$ and the updating policy eq. (15), the online MAP algorithm adheres to the following inequality:

[TABLE]

where $\xi:=2\alpha-4L\alpha^{2}$ .

Proof.

We start by proving a fact with Assumption 2. For any $w\in\mathcal{W}$ , by the smoothness and convexity of $c_{t}(w)+\eta_{t}c_{0}(w)$ , in [Zhou, 2018, Lemma 4] we have:

[TABLE]

Here, we set a specific value for $w$ as $w=w^{\prime}_{t}:=w_{t}-\frac{1}{L}(\nabla c_{t}(w_{t})+\eta_{t}\nabla c_{0}(w_{t}))$ in the above inequality and get:

[TABLE]

On the other hand, by the convexity of $c_{t}(w)$ and $c_{0}(w)$ and the vanishing gradient assumption $\nabla c_{t}(w^{*}_{t})+\eta_{t}\nabla c_{0}(w^{*}_{t})=0$ in Assumption 3, we have:

[TABLE]

which leads to the following inequality of interest:

[TABLE]

which is equivalent to:

[TABLE]

Following eq. (3), we represent the approximated gradient $\nabla\hat{c}_{t}(w_{t})$ as the sum of the true gradient $\nabla c_{t}(w_{t})$ and a gradient error $e_{t}$ , which highlights the influence of the gradient error and simplifies the updating policy eq. (15) to:

[TABLE]

Then, we bound the left-hand side of Lemma 9 by substituting the updating policy eq. (15) into $\|w_{t+1}-w_{t}^{*}\|^{2}$ , which gives:

[TABLE]

where (a) can be obtained by expanding the squared term, and (b) follows from the convexity property $f(b)-f(a)\geq\nabla f(a)^{\top}(b-a)$ and eq. (14). Next, we bound the last two terms in eq. (21) separately. We take the expectation over the sequence $\{e_{t}\}$ , which is denoted by $\mathbb{E}_{e_{t}}$ , and get:

[TABLE]

where (a) is obtained by the definition of the stochastic gradient error in Section 3.1, (b) is given by using the fact $2ab\leq a^{2}+b^{2}$ for the first term and (c) follows eq. (14). Taking expectations for eq. (21) and combining the above inequalities lead to:

[TABLE]

where $\xi:=2\alpha-4L\alpha^{2}$ .

Finally, we take the full expectation (over both $\{e_{t}\}$ and $\{w_{t}^{*}\}$ ), denoted by $\mathbb{E}$ , take the square root of both sides of the above inequality, and obtain:

[TABLE]

where the inequality follows from $\|w_{t}-w_{t}^{*}\|\leq R$ and the fact $\sqrt{a^{2}-b+c^{2}}\leq a-\frac{b}{2a}+c$ , proved as follows:

[TABLE]

where $a,b,c$ are all positive and $a^{2}>b$ . ∎
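The elementary inequality used above can also be checked numerically. It follows from two steps: $\sqrt{a^{2}-b+c^{2}}\leq\sqrt{a^{2}-b}+c$ by subadditivity of the square root, and $\sqrt{a^{2}-b}\leq a-\frac{b}{2a}$ because $(a-\frac{b}{2a})^{2}=a^{2}-b+\frac{b^{2}}{4a^{2}}$ . The sketch below samples random admissible triples; the sampling ranges and the seed are arbitrary choices.

```python
import math
import random


def lhs(a, b, c):
    """Left-hand side sqrt(a^2 - b + c^2)."""
    return math.sqrt(a * a - b + c * c)


def rhs(a, b, c):
    """Right-hand side a - b / (2a) + c."""
    return a - b / (2.0 * a) + c


rng = random.Random(0)
worst_gap = float("inf")
for _ in range(10_000):
    a = rng.uniform(0.1, 10.0)
    b = rng.uniform(0.0, a * a)    # keeps a^2 >= b so the square root is real
    c = rng.uniform(0.0, 10.0)
    worst_gap = min(worst_gap, rhs(a, b, c) - lhs(a, b, c))
```

The smallest observed gap stays nonnegative up to floating-point error, and it approaches zero only when both $b$ and $c$ are close to zero.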

Note that if we take a summation over $t\in[1,T]$ of the second term on the right side of eq. (9), we can get the regret $\mathcal{R}(T)$ . However, we can divide the left side of eq. (9) into two parts with the triangle inequality for a tighter bound, which gives the proof of Theorem 4 as follows.

First, we apply the triangle inequality to the quantity $\mathbb{E}[\|w_{t+1}-w_{t+1}^{*}\|]$ , which gives:

[TABLE]

Rearranging the above inequality and taking the summation over $t\in[1,T]$ , by the definition of the dynamic regret in eq. (2) we have:

[TABLE]

where (a) is obtained by using Lemma 9 and (b) follows Assumption 1 and the definition of variational budget $V_{T}$ . The second term in the last inequality can be bounded by:

[TABLE]

where (a) can be obtained by Hölder’s inequality as $\|\bm{v}\|_{1}\leq\sqrt{T}\|\bm{v}\|_{2}$ for a vector $\bm{v}=\{\sqrt{\epsilon_{1}},\cdots,\sqrt{\epsilon_{T}}\}$ . Substituting this back, we get the dynamic regret of:

[TABLE]
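The norm inequality used in step (a) above, $\sum_{t}\sqrt{\epsilon_{t}}\leq\sqrt{T\sum_{t}\epsilon_{t}}$ , can be verified numerically as sketched below. The error sequence $\epsilon_{t}=t^{-0.55}$ is a hypothetical choice that mimics an error bound decaying with an increasing batch size; it is not taken from the analysis.

```python
import math


def l1_norm(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)


def l2_norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


T = 500
eps = [1.0 / t ** 0.55 for t in range(1, T + 1)]   # hypothetical error bounds
v = [math.sqrt(e) for e in eps]

left = l1_norm(v)                    # sum_t sqrt(eps_t)
right = math.sqrt(T) * l2_norm(v)    # sqrt(T) * sqrt(sum_t eps_t)
```

Because $\|\bm{v}\|_{2}^{2}=\sum_{t}\epsilon_{t}$ , a sublinear growth of $\sum_{t}\epsilon_{t}$ translates into a sublinear growth of the corresponding term in the regret bound.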

Appendix B Proof of Lemma 7

Proof.

We start from a fact proved in Lemma 6 of [Zhang and Sra, 2016], which gives an inequality for a geodesic triangle with curvature bounded by $\kappa$ . Let the side lengths of the triangle be $a$ , $b$ , $c$ and let $A$ be the angle between sides $b$ and $c$ , then:

[TABLE]

In our work, we map our problem onto a triangle whose vertices are three decisions in our problem: the current decision $q_{t}$ , the next decision $q_{t+1}$ , and the optimal solution at the current step $q_{t}^{*}$ . Denote by $d(a,b)$ the Wasserstein distance between two distributions $a$ and $b$ over $\mathcal{P}_{2}(\mathcal{W})$ . As a result, the three sides of the triangle are $a=d(q_{t+1},q_{t}^{*})$ , $b=d(q_{t},q_{t+1})$ and $c=d(q_{t},q_{t}^{*})$ . Based on the updating rule, we have $d(q_{t},q_{t+1})=\alpha\|\nabla c_{t}(q_{t})+e_{t}+\nabla c^{0}_{t}(q_{t})\|$ when $\alpha$ is small enough, and $d(q_{t},q_{t+1})d(q_{t},q_{t}^{*})\cos(\angle q_{t+1}q_{t}q_{t}^{*})=\langle-\alpha(\nabla c_{t}(q_{t})+e_{t}+\nabla c^{0}_{t}(q_{t})),\operatorname{Exp}^{-1}_{q_{t}}(q_{t}^{*})\rangle$ . Substituting these quantities into the above inequality, we have:

[TABLE]

where (a) follows [Zhang and Sra, 2016, Lemma 6] with $\zeta(\kappa,d(q_{t},q_{t}^{*}))=\frac{\sqrt{|\kappa|}d(q_{t},q_{t}^{*})}{\tanh(\sqrt{|\kappa|}d(q_{t},q_{t}^{*}))}$ , (b) follows Assumption 5, and (c) follows the fact proved in eq. (14) and the convexity of $c_{t}(q_{t})+c_{t}^{0}(q_{t})$ . Then, we bound the last two terms in the above inequality in expectation over the sequence $\{e_{t}\}$ , denoted by $\mathbb{E}_{e_{t}}$ , and get:

[TABLE]

where (a) is obtained by the fact $2ab\leq a^{2}+b^{2}$ . Taking the above inequality back gives:

[TABLE]

where $\Phi=2\alpha-3L\alpha^{2}\zeta(\kappa,R)$ .

Finally, using the fact $\sqrt{a^{2}-b+c^{2}}\leq a-\frac{b}{2a}+c$ and full expectation $\mathbb{E}$ , we finish the proof:

[TABLE]

∎

Appendix C Proof of Theorem 8

Proof.

The proof starts from Lemma 7 and uses the triangle inequality, which gives:

[TABLE]

Rearranging the inequality, taking the summation over $t\in[1,T]$ , and invoking the definition of the regret, we have:

[TABLE]

where the second term can be simplified as:

[TABLE]

Then, we finally prove the regret bound as:

[TABLE]

∎

Bibliography

Ambrosio et al. [2005] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
Bedi et al. [2018] Amrit Singh Bedi, Paban Sarma, and Ketan Rajawat. Tracking Moving Agents via Inexact Online Gradient Descent Algorithm. IEEE Journal of Selected Topics in Signal Processing, 12(1):202–217, February 2018.
Benamou and Brenier [2000] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
Besbes et al. [2015] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary Stochastic Optimization. page 52, 2015.
Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015.
Broderick et al. [2013a] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming variational Bayes. Advances in Neural Information Processing Systems, 26, 2013a.
Broderick et al. [2013b] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming Variational Bayes. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013b.
Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
