An Online Sample Based Method for Mode Estimation using ODE Analysis of Stochastic Approximation Algorithms

Chandramouli Kamanchi; Raghuram Bharadwaj Diddigi; Prabuchandran K.J.; Shalabh Bhatnagar

arXiv:1902.03806·cs.LG·June 4, 2019

TL;DR

This paper introduces an online, sample-based iterative algorithm for estimating the mode of a unimodal density, using ODE analysis to prove convergence and stability, suitable for real-time applications with unknown density functions.

Contribution

It presents a novel, computationally efficient online mode estimation method that does not require prior knowledge of the density's analytical form or batch processing.

Findings

1. The algorithm converges asymptotically to the true mode.

2. Stability of the mode estimates is proven via regularization.

3. Experimental results confirm effectiveness in practical scenarios.

Tables

Table I: Examples of kernels

Name of the Kernel	Analytical Form
Gaussian	$\frac{1}{\sqrt{2\pi}} e^{-x^{2}/2}$
Cauchy	$\frac{1}{\pi (1 + x^{2})}$
Fejer	$\frac{\sin^{2}(x)}{\pi x^{2}}$
Multivariate Gaussian	$\frac{1}{(2\pi)^{n/2}} e^{-\frac{x^{T} x}{2}}$

Table II: Performance of our proposed algorithm on standard distributions

Distribution	Actual Mode	Estimated Mode $\pm$ Std. Dev.
Normal	10	9.971783 $\pm$ 0.333930
Gamma	5	5.182626 $\pm$ 0.306337
Exponential	0	0.697886 $\pm$ 0.007722
Weibull	0	0.697192 $\pm$ 0.007544
Beta	1	0.900645 $\pm$ 0.001356
Bivariate Normal	[20; 15]	[20.030044; 15.015614 ] $\pm$ [0;0]
Dirichlet	[0.5; 0.5]	[0.498404; 0.501579] $\pm$ [0;0]

Equations

(u * v)(x) := \int_{\mathbb{R}^{p}} u(x - t)\, v(t)\, dt \quad (x \in \mathbb{R}^{p}).

K_{\epsilon}(x) := \epsilon^{-p} K\Big(\frac{x}{\epsilon}\Big).

\int_{\mathbb{R}^{p}} K_{\epsilon} = \frac{1}{\epsilon^{p}} \int_{\mathbb{R}^{p}} K\Big(\frac{x}{\epsilon}\Big)\, dx = \int_{\mathbb{R}^{p}} K(y)\, dy = \int_{\mathbb{R}^{p}} K.

\int_{\|x\| > \delta} |K_{\epsilon}(x)|\, dx = \int_{\|y\| > \delta/\epsilon} |K(y)|\, dy.

|u_{\epsilon}(x) - u(x)| = \Bigg| \int u(x - t) K_{\epsilon}(t)\, dt - u(x) \int K_{\epsilon}(t)\, dt \Bigg|

= \Bigg| \int_{\|t\| < \delta} \big(u(x - t) - u(x)\big) K_{\epsilon}(t)\, dt + \int_{\|t\| \geq \delta} \big(u(x - t) - u(x)\big) K_{\epsilon}(t)\, dt \Bigg|

\leq \eta \int_{\|t\| < \delta} |K_{\epsilon}(t)|\, dt + \int_{\|t\| \geq \delta} |u(x - t) - u(x)|\, |K_{\epsilon}(t)|\, dt

\leq \eta \|K\|_{1} + \int_{\|t\| \geq \delta} |u(x - t)|\, |K_{\epsilon}(t)|\, dt + |u(x)| \int_{\|t\| \geq \delta} |K_{\epsilon}(t)|\, dt.

\int_{\|t\| \geq \delta} |u(x - t)|\, |K_{\epsilon}(t)|\, dt

=

\leq

\leq

x_{k+1} = x_{k} + a_{k} [\nabla h(x_{k}) + N_{k+1}].

a_{k} > 0 \ \forall k, \quad \sum_{k} a_{k} = \infty, \quad \sum_{k} a_{k}^{2} < \infty.

\mathcal{F}_{k} := \sigma\{x_{0}, N_{1}, \dots, N_{k}\}, \quad k \geq 0.

\mathbb{E}[N_{k+1} \mid \mathcal{F}_{k}] = 0 \ \text{a.s.}

\mathbb{E}[\|N_{k+1}\|^{2} \mid \mathcal{F}_{k}] \leq C (1 + \|x_{k}\|^{2}) \ \text{a.s.}

\dot{x}(t) = \nabla h_{\infty}(x(t))

\dot{x}(t) = \nabla h(x(t)).

m_{n+1} = m_{n} + a_{n} \nabla f(m_{n}),

g(m) = f(m) - \frac{1}{2} \lambda \|m\|^{2},

m_{n+1} = m_{n} + a_{n} (\nabla f(m_{n}) - \lambda m_{n}).

\hat{m}(\lambda) := \arg\max_{m} g(m)

m^{*} := \arg\max_{m} f(m).

\hat{m}(0) = m^{*}.

\hat{m}(\lambda) \to \hat{m}(0) = m^{*}.

\nabla f(m) \approx \nabla f_{\epsilon}(m) = \int_{\mathbb{R}^{p}} \nabla f(m - t) K_{\epsilon}(t)\, dt.

\int_{\mathbb{R}^{p}} \nabla f(m - t) K_{\epsilon}(t)\, dt = \mathbb{E}_{X \sim f} [\nabla K_{\epsilon}(m - X)].

m_{n+1} = m_{n} + a_{n} (\nabla K_{\epsilon}(m_{n} - X_{n+1}) - \lambda m_{n}),

m_{k+1} = m_{k} + a_{k} d_{k+1} = m_{k} + a_{k} (\mathbb{E}[d_{k+1} \mid \mathcal{F}_{k}] + N_{k+1}),

\mathbb{E}[\|N_{k+1}\|^{2} \mid \mathcal{F}_{k}] \leq C (1 + \|m_{k}\|^{2}),

Full text

An Online Sample Based Method for Mode Estimation using ODE Analysis of Stochastic Approximation Algorithms

Chandramouli Kamanchi (1), Raghuram Bharadwaj Diddigi (2), Prabuchandran K.J. (3), Shalabh Bhatnagar (4, 5)

(1) Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India.
(2) Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India.
(3) Amazon-IISc Postdoctoral fellow, Indian Institute of Science, Bangalore, India.
(4) Department of Computer Science and Automation and Department of Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science, Bangalore, India.
(5) Supported by RBCCPS, IISc and a grant from the Department of Science and Technology, India.

Abstract

The mode is a popular measure of central tendency that can provide a better representation of the data, and more interesting insights, than other measures such as the mean and the median. If the analytical form of the density function is known, the mode is an argument of the maximum of the density function, and one can apply optimization techniques to find it. In many practical applications, however, the analytical form of the density is not known and only samples from the distribution are available. Most of the techniques proposed in the literature for estimating the mode from samples assume that all the samples are available beforehand. Moreover, some of these techniques employ computationally expensive operations such as sorting. In this work we provide a computationally efficient, online iterative algorithm that estimates the mode of a unimodal smooth density given only samples generated from the density. We establish asymptotic convergence of the proposed algorithm using an ordinary differential equation (ODE) based analysis. We also prove the stability of the estimates by utilizing the concept of regularization. Experimental results further demonstrate the effectiveness of the proposed algorithm.

Index Terms:

Statistical learning, Optimization algorithms, Machine learning.

I Introduction

Many metrics are used to represent the central tendency of data. Among them, the popular ones are the mean, the median and the mode. The mean is extensively studied due to its simplicity, linearity and ease of estimation via sample averages. However, the mean is susceptible to outliers. For example, when we estimate the mean from a finite number of samples, one bad outlier can shift the estimate far away from the true mean. Also, in some applications, the mean may not be the desired quantity to analyze. For example, it is often more interesting to know the income that the majority of the population in a country earns than the average income of the country, which is a skewed quantity.

In this work, we focus on finding the mode of a density whose analytical form is not known. That is, we are given only samples from the distribution and we need to estimate the mode from these samples. We utilize stochastic approximation techniques to solve this problem. Stochastic approximation is a popular paradigm that has been applied successfully to analyze random iterative models [1, 2, 3].

We first discuss some of the works reported in the literature on estimating the mode. The problem of estimating the mode of a unimodal density was first considered in [4], where a sequence of density functions is iteratively constructed from the samples and the respective modes are calculated as maximum likelihood estimates. It is shown that these estimates of the mode converge in probability to the actual mode.

We can broadly classify the solution techniques for the mode estimation problem into two groups. The first group comprises a non-parametric way of estimating the mode where the mode is estimated directly from the sample data without constructing the density function. The second group of methods comprises the parametric way of estimation in which the density function is approximately constructed and the mode is computed using optimization techniques.

In [5], a non-parametric estimator (popularly known as Grenander’s estimate) for estimating the mode directly from the samples is proposed. Later mode estimation methods are based on the idea that the mode is situated at the center of an interval of selected length that contains the majority of the observed points. A sequence of nested intervals and intermediate mode estimates is constructed based on this idea, and the mode is taken to be the point to which these intermediate estimates converge. Different ways of selecting the interval lengths are studied in [6, 7] along with their convergence properties. A variant of this idea selects the shortest interval that contains some desired number of points instead of fixing the interval length. The estimation methods in [8, 9, 10] are based on this idea. A detailed survey of these techniques, along with their robustness, is given in [11, 12].

In [13], a parametric method of estimating the mode is proposed. The idea here is to fit the samples to a normal distribution. Then the mode is estimated by calculating the mean of this fitted normal distribution. This idea was recently extended to find the multivariate mode in [14]. In [15], a multivariate mean shift algorithm is proposed for estimating the multivariate mode from the samples. The idea here is to iteratively shift the estimated mode towards the actual mode using Gaussian kernels. In [16], a minimum volume peeling method is proposed to estimate the multivariate mode from the sample data. The idea here is to iteratively construct subsets of the set of samples with minimum volume and discard the remaining points. The mode is then calculated by averaging the points in the constructed subset. This is based on the observation that the mode is generally situated in the minimum volume set of a given fixed number of samples. An effective way of selecting the subset of points is discussed in [16].

Most of the algorithms considered in the literature so far assume that all the samples are available upfront. These techniques cannot be extended to the case of streaming data, where the samples arrive online one at a time. Also, the non-parametric techniques (see [12]) require the samples to be in sorted order.

Our proposed algorithm is fundamentally different from the above algorithms in the sense that ours is an online algorithm that works with the data as it becomes available. This enables us to work with online samples without storing them in the memory. Also, we do not resort to any computationally expensive operations like sorting. In addition, our algorithm works for both univariate and multivariate distributions. We provide a convergence analysis of our proposed technique and show the robustness of our technique using simulation results.

Our work is closest to [17]. In [17], a gradient-like recursive procedure for estimating the mode is proposed and convergence is established using the theory of martingales. In our work, to mitigate the lack of an analytical form of the density, we construct a kernel based representation (see Section II) of the density function and use stochastic gradient ascent to calculate the mode. Our work differs from [17] in the following ways.

• Our proposed algorithm is based on assumptions different from those of [17]. Moreover, we prove the stability of the mode estimates by introducing the concept of regularization [18].

• We demonstrate the effectiveness of our algorithm by providing empirical evaluation on well-known distributions.

• Our convergence proof utilizes the well-known ODE based analysis of stochastic approximation algorithms. To the best of our knowledge, ours is the first work that makes use of ODE based analysis in the context of mode estimation.

II Background and Preliminaries

To begin, suppose we have a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and a random vector $X:\Omega\to\mathbb{R}^{p}$ with a smooth density function $f(x).$ A mode of the random vector $X$ is defined as an argument of the maximum of $f.$ Suppose the mode is unique, i.e., $f$ has a unique maximizer; then the mode is a measure of central tendency of the distribution of $X.$ A natural problem that arises is the estimation of the mode given a sequence of independent and identically distributed samples of $X$ denoted by $X_{1},X_{2},\cdots,X_{n}.$ In what follows we provide the necessary formal definitions and prove some properties that are used in the motivation and the convergence analysis of our proposed algorithm for this problem.

II-A Approximation of Identity

Definition 1

The convolution of two functions $u$ and $v$ on $\mathbb{R}^{p}$ is defined as

$$(u*v)(x):=\int_{\mathbb{R}^{p}}u(x-t)\,v(t)\,dt\quad(x\in\mathbb{R}^{p}).$$

Definition 2

Given a function $K:\mathbb{R}^{p}\to\mathbb{R}$ and $\epsilon>0$ we define

$$K_{\epsilon}(x):=\epsilon^{-p}K\Big(\frac{x}{\epsilon}\Big).$$

For a given function $K$ as above, the family of functions $\{K_{\epsilon}|\epsilon>0\}$ is called an approximation of the identity.

Lemma 1

Suppose $\int_{\mathbb{R}^{p}}|K(t)|dt<\infty$ . Then, given any $\epsilon>0$ :

1. $\int_{\mathbb{R}^{p}}K_{\epsilon}=\int_{\mathbb{R}^{p}}K.$

2. $\int_{||x||>\delta}|K_{\epsilon}|\to 0$ as $\epsilon\to 0$ , for any fixed $\delta>0.$

Proof:

For statement 1 choose $y=\frac{x}{\epsilon}.$ So $dx=\epsilon^{p}\ dy.$ Then we have

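$$\int_{\mathbb{R}^{p}}K_{\epsilon}=\frac{1}{\epsilon^{p}}\int_{\mathbb{R}^{p}}K\Big(\frac{x}{\epsilon}\Big)\,dx=\int_{\mathbb{R}^{p}}K(y)\,dy=\int_{\mathbb{R}^{p}}K.$$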

Again for statement 2, let $y=x/\epsilon$ and choose $\delta>0.$ Now

$$\int_{\|x\|>\delta}|K_{\epsilon}(x)|\,dx=\int_{\|y\|>\delta/\epsilon}|K(y)|\,dy.$$

Since $\int_{\mathbb{R}^{p}}|K(t)|dt<\infty$ and $\delta/\epsilon\to\infty$ as $\epsilon\to 0$ , the proof is complete. ∎
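As a numerical illustration of Lemma 1, the following minimal sketch takes the one-dimensional Gaussian kernel from Table I (so $p=1$) and checks that the total mass of $K_{\epsilon}$ stays at 1 while the mass outside $[-\delta,\delta]$ shrinks as $\epsilon$ decreases. The function names, the midpoint-rule integrator and the truncation at $\pm 12\epsilon$ are illustrative assumptions, not part of the paper.

```python
import math


def gaussian_kernel(x):
    """Standard Gaussian kernel K on the real line (p = 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def scaled_kernel(x, eps):
    """K_eps(x) = eps^{-p} K(x / eps) with p = 1."""
    return gaussian_kernel(x / eps) / eps


def integrate(func, lo, hi, n=20000):
    """Composite midpoint rule on [lo, hi] (illustrative helper)."""
    h = (hi - lo) / n
    return h * sum(func(lo + (i + 0.5) * h) for i in range(n))


def total_mass(eps):
    """Integral of K_eps; truncating at +-12*eps drops a negligible tail."""
    return integrate(lambda x: scaled_kernel(x, eps), -12.0 * eps, 12.0 * eps)


def tail_mass(eps, delta):
    """Integral of |K_eps| over |x| > delta, in closed form for the Gaussian."""
    return math.erfc(delta / (eps * math.sqrt(2.0)))


if __name__ == "__main__":
    for eps in (1.0, 0.5, 0.1):
        print(eps, total_mass(eps), tail_mass(eps, delta=0.5))
```

The closed form in `tail_mass` is exactly the substitution $y=x/\epsilon$ used in the proof: the tail of $K_{\epsilon}$ beyond $\delta$ equals the tail of $K$ beyond $\delta/\epsilon$.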

Theorem 1

Let $u_{\epsilon}:=u*K_{\epsilon}$ , where $\int_{\mathbb{R}^{p}}K=1,\ \|u\|_{1}<\infty$ and $K(x)=o(\|x\|^{-p})$ as $\|x\|\to\infty$ . Then $u_{\epsilon}\to u$ as $\epsilon\to 0$ at each point of continuity of $u.$

Proof:

Let $u$ be continuous at $x.$ By the definition of continuity, given $\eta>0,$ there exists $\delta>0$ such that $|u(x-t)-u(x)|<\eta$ if $||t||<\delta.$ Since $\int_{\mathbb{R}^{p}}K_{\epsilon}=\int_{\mathbb{R}^{p}}K=1$ from Lemma 1 and hypothesis we have,

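$$\begin{aligned}|u_{\epsilon}(x)-u(x)|&=\Bigg|\int u(x-t)K_{\epsilon}(t)\,dt-u(x)\int K_{\epsilon}(t)\,dt\Bigg|\\
&=\Bigg|\int_{\|t\|<\delta}\big(u(x-t)-u(x)\big)K_{\epsilon}(t)\,dt+\int_{\|t\|\geq\delta}\big(u(x-t)-u(x)\big)K_{\epsilon}(t)\,dt\Bigg|\\
&\leq\eta\int_{\|t\|<\delta}|K_{\epsilon}(t)|\,dt+\int_{\|t\|\geq\delta}|u(x-t)-u(x)|\,|K_{\epsilon}(t)|\,dt\\
&\leq\eta\|K\|_{1}+\int_{\|t\|\geq\delta}|u(x-t)|\,|K_{\epsilon}(t)|\,dt+|u(x)|\int_{\|t\|\geq\delta}|K_{\epsilon}(t)|\,dt.\end{aligned}$$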

The third term approaches zero with $\epsilon$ by Lemma 1. It is enough to show that the second term approaches zero. From the hypothesis $|K(x)|=\mu(x)||x||^{-p}$ for some non-negative $\mu(x)$ where $\mu(x)\to 0$ as $||x||\to\infty.$ We have

[TABLE]

Again from the hypothesis $\mu\big{(}\frac{t}{\epsilon}\big{)}\to 0$ as $|\frac{t}{\epsilon}|\to\infty.$ So $\sup\displaylimits_{||t||\geq\delta}\mu\big{(}\frac{t}{\epsilon}\big{)}\to 0$ as $\epsilon\to 0$ . This concludes the proof. This theorem is utilized to obtain an approximate analytical form for the gradient of density (refer Section IV). ∎
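The bound on the second term can be sketched as follows, using only Definition 2 and the hypothesis $|K(x)|=\mu(x)\|x\|^{-p}$. This is one standard route to the estimate and may differ in detail from the authors' display.

```latex
\begin{aligned}
\int_{\|t\|\geq\delta} |u(x-t)|\,|K_{\epsilon}(t)|\,dt
  &= \int_{\|t\|\geq\delta} |u(x-t)|\,\epsilon^{-p}\,
     \mu\Big(\frac{t}{\epsilon}\Big)\Big\|\frac{t}{\epsilon}\Big\|^{-p}\,dt \\
  &= \int_{\|t\|\geq\delta} |u(x-t)|\,\mu\Big(\frac{t}{\epsilon}\Big)\,\|t\|^{-p}\,dt \\
  &\leq \delta^{-p}\,\sup_{\|t\|\geq\delta}\mu\Big(\frac{t}{\epsilon}\Big)
     \int_{\|t\|\geq\delta} |u(x-t)|\,dt \\
  &\leq \delta^{-p}\,\|u\|_{1}\,\sup_{\|t\|\geq\delta}\mu\Big(\frac{t}{\epsilon}\Big).
\end{aligned}
```

Since $\delta$ is fixed and $\|u\|_{1}<\infty$, the right-hand side tends to zero exactly when the supremum does, which is the statement used to close the proof.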

Corollary 1

Let $\nabla u_{\epsilon}=\nabla u*K_{\epsilon}$ , where $\int_{\mathbb{R}^{p}}K=1,\ \|\nabla u\|_{1}<\infty$ and $K(x)=o(\|x\|^{-p})$ as $\|x\|\to\infty$ . Then $\nabla u_{\epsilon}\to\nabla u$ as $\epsilon\to 0$ at each point of continuity of $\nabla u.$ Here the convolution, $\nabla u*K_{\epsilon}$ , is performed component wise.

Proof:

The result is obtained by applying Theorem 1 to each component of $\nabla u$ . ∎

II-B Stochastic Gradient Ascent

Stochastic gradient ascent [19] deals with the study of iterative algorithms of the type

$$x_{k+1}=x_{k}+a_{k}[\nabla h(x_{k})+N_{k+1}].\tag{1}$$

Here $x_{k}\in\mathbb{R}^{p},\ k\geq 0$ are the parameters that are updated according to (1). The function $h:\mathbb{R}^{p}\rightarrow\mathbb{R}$ is the underlying objective function whose maximum we are interested in finding. Also, $a_{k},k\geq 0$ is a prescribed step-size sequence. Further, $N_{k+1},k\geq 0$ constitute the noise terms. We state here a theorem that is utilized in the convergence analysis of our algorithm. Consider the following assumptions [19, 20].

A1.

The step-sizes $a_{k}$ , $k\geq 0$ satisfy the requirements:

$$a_{k}>0\ \forall k,\quad\sum_{k}a_{k}=\infty,\quad\sum_{k}a_{k}^{2}<\infty.$$

A2.

The sequence $N_{k},k\geq 0$ is a martingale difference sequence with respect to the following increasing sequence of sigma fields:

$$\mathcal{F}_{k}:=\sigma\{x_{0},N_{1},\dots,N_{k}\},\quad k\geq 0.$$

Thus, in particular, $\forall k\geq 0$ ,

$$\mathbb{E}[N_{k+1}\mid\mathcal{F}_{k}]=0\ \text{a.s.}$$

Further $N_{k},k\geq 0$ are square integrable and

$$\mathbb{E}[\|N_{k+1}\|^{2}\mid\mathcal{F}_{k}]\leq C(1+\|x_{k}\|^{2})\ \text{a.s.}$$

for a given constant $C>0.$

A3.

The function $\nabla h:\mathbb{R}^{p}\to\mathbb{R}^{p}$ is Lipschitz continuous.

A4.

The functions $\nabla h_{c}(x):=\frac{\nabla h(cx)}{c},\ c\geq 1,x\in\mathbb{R}^{p}$ , satisfy $\nabla h_{c}(x)\rightarrow\nabla h_{\infty}(x)$ as $c\rightarrow\infty$ , uniformly on compacts. Furthermore, the ODE

$$\dot{x}(t)=\nabla h_{\infty}(x(t))$$

has the origin as the unique globally asymptotically stable equilibrium.

Consider the ordinary differential equation

$$\dot{x}(t)=\nabla h(x(t)).\tag{3}$$

Let $H$ denote the compact set of asymptotically stable equilibrium points of the ODE (3).

Theorem 2

Under (A1)-(A4), $\sup_{n}\|x_{n}\|<\infty$ (stability) a.s. Further $x_{k}\rightarrow H$ almost surely as $k\rightarrow\infty$ .

Proof:

Follows as a consequence of Theorem 2 in chapter 2 and Theorem 7 in chapter 3 of [19]. ∎
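To make recursion (1) concrete, here is a minimal sketch on a toy problem. The objective $h(x)=-(x-3)^{2}/2$, the Gaussian noise and the step sizes $a_{k}=1/(k+1)$ are illustrative assumptions chosen so that A1 to A4 are easy to check: $\nabla h(x)=-(x-3)$ is Lipschitz, and $\nabla h_{c}(x)=-(cx-3)/c\to -x$, whose ODE has the origin as globally asymptotically stable equilibrium.

```python
import random


def grad_h(x):
    """Gradient of the toy objective h(x) = -(x - 3)^2 / 2 (an assumption)."""
    return -(x - 3.0)


def stochastic_gradient_ascent(grad, x0, num_steps, noise_std=1.0, seed=0):
    """Iterate x_{k+1} = x_k + a_k * (grad(x_k) + N_{k+1}) with a_k = 1 / (k + 1)."""
    rng = random.Random(seed)
    x = x0
    for k in range(num_steps):
        a_k = 1.0 / (k + 1)                 # satisfies the step-size conditions in A1
        noise = rng.gauss(0.0, noise_std)   # zero-mean, independent of the past
        x = x + a_k * (grad(x) + noise)
    return x


if __name__ == "__main__":
    print(stochastic_gradient_ascent(grad_h, x0=-5.0, num_steps=10000))
```

For this particular toy choice the iterate after $n$ steps equals 3 plus the average of the first $n$ noise terms, so it settles near the maximizer $x=3$, the unique equilibrium of $\dot{x}=\nabla h(x)$.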

III Motivation and Algorithm

In this section we motivate and present our iterative algorithm for estimating the mode of a unimodal density. The idea of computing the mode is described below. Let $f$ denote the unimodal density function. As mode is the maximizer of the density function, we can estimate the mode by gradient ascent as follows:

$$m_{n+1}=m_{n}+a_{n}\nabla f(m_{n}),\tag{4}$$

where $a_{n}$ and $m_{n}$ are the step-size and current mode estimate, respectively, at time $n$ .

We introduce a function $g$ defined as follows:

$$g(m)=f(m)-\frac{1}{2}\lambda\|m\|^{2},$$

where $\lambda>0$ is the regularization coefficient [18]. The idea is to find an $m$ that maximizes the function $g(m)$ . This is done to maintain the stability of the estimates in our algorithm (see Proposition 3 in Section IV). Therefore the gradient ascent update is performed as follows:

$$m_{n+1}=m_{n}+a_{n}(\nabla f(m_{n})-\lambda m_{n}).\tag{7}$$

It remains to be shown that the solution obtained using the update equation (7) converges to the mode obtained using the update equation (4) as $\lambda\to 0$ . Let

$$\hat{m}(\lambda):=\arg\max_{m}g(m)$$

and

$$m^{*}:=\arg\max_{m}f(m).$$

It is easy to see that

$$\hat{m}(0)=m^{*}.$$

From the continuity of $\arg\max(.)$ function given by the Maximum Theorem [21] we have as $\lambda\rightarrow 0$ ,

$$\hat{m}(\lambda)\to\hat{m}(0)=m^{*}.$$
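The effect of the regularization coefficient on the maximizer can be illustrated numerically. The sketch below assumes a known $N(10,1)$ density purely for illustration (the algorithm itself never uses the analytical form) and locates the stationary point of $g$ near the mode by bisection on $g'(m)=f'(m)-\lambda m$. The bracket $[9,10]$, the values of $\lambda$ and all function names are assumptions.

```python
import math

MU, SIGMA = 10.0, 1.0  # illustrative density N(MU, SIGMA^2); its mode is MU


def density_grad(m):
    """f'(m) for the N(MU, SIGMA^2) density."""
    z = (m - MU) / SIGMA
    f = math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))
    return -(m - MU) / SIGMA ** 2 * f


def regularized_stationary_point(lam, lo=9.0, hi=10.0, iters=60):
    """Bisection on g'(m) = f'(m) - lam * m, assuming a sign change on [lo, hi]."""
    def g_grad(m):
        return density_grad(m) - lam * m

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g_grad(mid) > 0.0:
            lo = mid   # g still increasing: the maximizer lies to the right
        else:
            hi = mid   # g decreasing: the maximizer lies to the left
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for lam in (1e-3, 1e-4, 1e-5):
        print(lam, regularized_stationary_point(lam))
```

In this example the regularized maximizer sits slightly below the mode, pulled toward the origin, and the gap shrinks as $\lambda$ decreases.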

The update equation (7), however, needs the value of $\nabla f(m_{n})$ , which is not known. We therefore make use of the ideas in Section II to estimate $\nabla f(m)$ as follows. To simplify notation, we write $m$ for $m_{n}$ and derive an approximation of $\nabla f(m).$ Applying Corollary 1 to $\nabla f$ with the kernel $K$ we get for small $\epsilon>0$ ,

$$\nabla f(m)\approx\nabla f_{\epsilon}(m)=\int_{\mathbb{R}^{p}}\nabla f(m-t)K_{\epsilon}(t)\,dt.$$

By the properties of convolution

$$\int_{\mathbb{R}^{p}}\nabla f(m-t)K_{\epsilon}(t)\,dt=\mathbb{E}_{X\sim f}[\nabla K_{\epsilon}(m-X)].$$

Note that there are several valid choices for the function $K$ (also called the kernel) to obtain an approximation of the identity.
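The identity between the smoothed gradient and the expectation of $\nabla K_{\epsilon}(m-X)$ can be checked by simulation in a case where both sides are available. The sketch below assumes $f$ is the $N(\mu,\sigma^{2})$ density and $K$ is the one-dimensional Gaussian kernel, so that $f_{\epsilon}$ is the $N(\mu,\sigma^{2}+\epsilon^{2})$ density. These choices, the sample size and the function names are assumptions made for illustration.

```python
import math
import random


def grad_gaussian_kernel(x, eps):
    """Derivative of K_eps(x) = K(x / eps) / eps for the 1-D Gaussian kernel K."""
    z = x / eps
    return -z * math.exp(-0.5 * z * z) / (eps ** 2 * math.sqrt(2.0 * math.pi))


def smoothed_density_grad(m, mu, sigma, eps):
    """Closed form of (f * K_eps)'(m) when f is the N(mu, sigma^2) density."""
    var = sigma ** 2 + eps ** 2
    f_eps = math.exp(-0.5 * (m - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return -(m - mu) / var * f_eps


def monte_carlo_grad(m, mu, sigma, eps, n=200000, seed=0):
    """Sample average of grad K_eps(m - X) with X drawn from N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += grad_gaussian_kernel(m - rng.gauss(mu, sigma), eps)
    return total / n


if __name__ == "__main__":
    print(smoothed_density_grad(8.0, 10.0, 1.0, 1.0))
    print(monte_carlo_grad(8.0, 10.0, 1.0, 1.0))
```

The point of the identity is that the right-hand side needs only samples of $X$ and the known kernel, which is what makes a sample-based update possible.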

Now, (7) can be re-written using stochastic gradient ascent as follows:

$$m_{n+1}=m_{n}+a_{n}(\nabla K_{\epsilon}(m_{n}-X_{n+1})-\lambda m_{n}),$$

where $X_{n+1}$ is the sample obtained at time $n+1$ .

Table I lists some of the popular kernels [22].

It is easily verified that the kernels defined in Table I satisfy the hypotheses of Corollary 1. The full algorithm for estimating the mode from online streaming data is described in Algorithm 1.

Let $m_{0}$ denote an initial mode estimate and $\epsilon$ , a small constant. The algorithm works as follows. At time $n+1$ , the algorithm takes as input the current mode estimate $m_{n}$ and the sample $X_{n+1}$ . It then computes the direction $d_{n+1}$ and updates current approximation of the mode $m_{n}$ along $d_{n+1}$ as shown in step 2. The output of the algorithm is the updated mode estimate computed from samples obtained so far, i.e., $X_{1},\ldots,X_{n+1}$ . We prove the convergence of the algorithm in the next section.
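The update described above can be sketched for the univariate case as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Gaussian kernel from Table I, step sizes $a_{n}=1/n$, and hypothetical function names; the defaults $\epsilon=1$ and $\lambda=10^{-5}$ are borrowed from the experimental setup in Section V.

```python
import math
import random


def grad_gaussian_kernel(x, eps):
    """Derivative of K_eps(x) = K(x / eps) / eps for the 1-D Gaussian kernel K."""
    z = x / eps
    return -z * math.exp(-0.5 * z * z) / (eps ** 2 * math.sqrt(2.0 * math.pi))


def online_mode_update(m, x, step, eps=1.0, lam=1e-5):
    """One update: m + step * (grad K_eps(m - x) - lam * m)."""
    direction = grad_gaussian_kernel(m - x, eps) - lam * m
    return m + step * direction


def estimate_mode(samples, m0, eps=1.0, lam=1e-5):
    """Consume samples one at a time with step sizes a_n = 1 / n."""
    m = m0
    for n, x in enumerate(samples, start=1):
        m = online_mode_update(m, x, 1.0 / n, eps, lam)
    return m


if __name__ == "__main__":
    rng = random.Random(0)
    stream = (rng.gauss(10.0, 1.0) for _ in range(20000))
    print(estimate_mode(stream, m0=9.0))
```

Each sample is used once and then discarded, so memory use is constant in the number of samples. The sign of $\nabla K_{\epsilon}(m-x)$ moves the estimate toward the incoming sample, with a pull that fades for samples far from the current estimate.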

IV Convergence Analysis

Let $\mathcal{F}_{k}=\sigma(m_{j},0\leq j\leq k;X_{j},0<j\leq k),k\geq 0$ be a sequence of sigma fields. Observe that $\{\mathcal{F}_{k}\}$ forms a filtration. Let $d_{k+1}=\nabla K_{\epsilon}(m_{k}-X_{k+1})-\lambda m_{k}.$ Note that $d_{k}$ is $\mathcal{F}_{k}$ -measurable. Moreover $d_{k}$ is integrable i.e. $\mathbb{E}[\|d_{k}\|]<\infty$ under the assumption that $\nabla K_{\epsilon}$ is integrable. Now the basic algorithm can be written as

$$m_{k+1}=m_{k}+a_{k}d_{k+1}=m_{k}+a_{k}(\mathbb{E}[d_{k+1}\mid\mathcal{F}_{k}]+N_{k+1}),\tag{11}$$

where $N_{k+1}=d_{k+1}-\mathbb{E}[d_{k+1}|\mathcal{F}_{k}]$ is a mean zero term. Also, $\{N_{k+1}\}$ constitutes a martingale difference sequence (see proof of Proposition 1). Here $\mathbb{E}[.]$ is the expectation with respect to the density $f.$

Our convergence analysis rests on Theorem 2. Our algorithm is in the form of the general iterative scheme (1) with $\nabla h(m_{k})=\mathbb{E}[d_{k+1}|\mathcal{F}_{k}]=\nabla f_{\epsilon}(m_{k})-\lambda m_{k}$ and $N_{k+1}=d_{k+1}-\mathbb{E}[d_{k+1}|\mathcal{F}_{k}]$ . We choose the Gaussian kernel for our analysis of the algorithm. A similar analysis can be carried out for other choices of kernel. We first validate the assumptions of Theorem 2 below.

The choice $a_{k}=1/k,\ k\geq 1$ satisfies assumption A1. The following proposition validates assumption A2.

Proposition 1

$(N_{k},\mathcal{F}_{k}),\ k\geq 0$ , is a martingale difference sequence with

$$\mathbb{E}[\|N_{k+1}\|^{2}\mid\mathcal{F}_{k}]\leq C(1+\|m_{k}\|^{2}),$$

for all $k\geq 0$ and for some $C>0$ .

Proof:

It is easy to see that $\mathbb{E}[N_{k+1}|\mathcal{F}_{k}]=\mathbb{E}[d_{k+1}-\mathbb{E}[d_{k+1}|\mathcal{F}_{k}]|\mathcal{F}_{k}]=0.$ From the foregoing, $N_{k}$ is $\mathcal{F}_{k}-$ measurable and integrable $\forall k\geq 0$ . So clearly $(N_{k},\mathcal{F}_{k})$ is a martingale difference sequence. Now

[TABLE]

The first inequality follows from the simple inequality $(a-b)^{2}\leq 2(a^{2}+b^{2})$ , while the second inequality follows from a simple application of Jensen’s inequality. Since the higher derivatives, and in particular the Hessian, of $K_{\epsilon}$ are bounded, it follows that $\nabla K_{\epsilon}$ is Lipschitz continuous. We have

$$\|\nabla K_{\epsilon}(m)-\nabla K_{\epsilon}(0)\|\leq L\|m\|,$$

where $L>0$ is the Lipschitz constant. Hence for all $m$ ,

$$\|\nabla K_{\epsilon}(m)\|\leq\|\nabla K_{\epsilon}(0)\|+L\|m\|\leq C_{0}(1+\|m\|),$$

where $C_{0}=\max\{\|\nabla K_{\epsilon}(0)\|,L\}.$ Therefore,

[TABLE]

where $C_{1}=\mathbb{E}[\|X_{k+1}\|^{2}]$ and $C=8\max\{2C_{0}^{2}+4C_{0}^{2}C_{1},4C_{0}^{2}+\lambda^{2}\}.$ This completes the proof. ∎
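One way to chain the estimates of this proof, consistent with the stated constants $C_{0}$, $C_{1}$ and $C$, is sketched below. It uses $\|a-b\|^{2}\leq 2(\|a\|^{2}+\|b\|^{2})$, Jensen's inequality, the linear growth bound $\|\nabla K_{\epsilon}(y)\|\leq C_{0}(1+\|y\|)$ and the independence of $X_{k+1}$ from $\mathcal{F}_{k}$. The intermediate steps are a reconstruction and may differ from the authors' displays.

```latex
\begin{aligned}
\mathbb{E}[\|N_{k+1}\|^{2}\mid\mathcal{F}_{k}]
 &\leq 2\,\mathbb{E}[\|d_{k+1}\|^{2}\mid\mathcal{F}_{k}]
     + 2\,\|\mathbb{E}[d_{k+1}\mid\mathcal{F}_{k}]\|^{2}
  \leq 4\,\mathbb{E}[\|d_{k+1}\|^{2}\mid\mathcal{F}_{k}] \\
 &\leq 8\,\mathbb{E}[\|\nabla K_{\epsilon}(m_{k}-X_{k+1})\|^{2}\mid\mathcal{F}_{k}]
     + 8\lambda^{2}\|m_{k}\|^{2} \\
 &\leq 8\,\mathbb{E}[\,2C_{0}^{2}(1+\|m_{k}-X_{k+1}\|^{2})\mid\mathcal{F}_{k}]
     + 8\lambda^{2}\|m_{k}\|^{2} \\
 &\leq 8\big(2C_{0}^{2}+4C_{0}^{2}\|m_{k}\|^{2}+4C_{0}^{2}C_{1}\big)
     + 8\lambda^{2}\|m_{k}\|^{2} \\
 &= 8\big[(2C_{0}^{2}+4C_{0}^{2}C_{1})+(4C_{0}^{2}+\lambda^{2})\|m_{k}\|^{2}\big]
  \leq C\,(1+\|m_{k}\|^{2}).
\end{aligned}
```

The last line shows where the two arguments of the maximum in $C$ come from: one bounds the constant term and the other bounds the coefficient of $\|m_{k}\|^{2}$.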

The following lemma is useful in proving assumption A3 (see Proposition 2).

Lemma 2

$\mathbb{E}[d_{k+1}|\mathcal{F}_{k}]=\nabla f_{\epsilon}(m_{k})-\lambda m_{k}.$

Proof:

Now $\mathbb{E}[d_{k+1}|\mathcal{F}_{k}]=\mathbb{E}[\nabla K_{\epsilon}(m_{k}-X_{k+1})|\mathcal{F}_{k}]-\lambda m_{k}.$ Also $\nabla K_{\epsilon}$ is analytic and has a power series expansion around $m_{k}.$ Using power series of $\nabla K_{\epsilon}$ , linearity of the expectation and independence of $X_{k+1}$ from $\mathcal{F}_{k}$ we obtain $\mathbb{E}[\nabla K_{\epsilon}(m_{k}-X_{k+1})|\mathcal{F}_{k}]=\mathbb{E}[\nabla K_{\epsilon}(m_{k}-X)]=\nabla f_{\epsilon}(m_{k}).$ ∎

Owing to Lemma 2 our iterative update (11) transforms into $m_{k+1}=m_{k}+a_{k}(\nabla f_{\epsilon}(m_{k})-\lambda m_{k}+N_{k+1})$ and we validate assumption A3 below.

Proposition 2

$\nabla f_{\epsilon}(m)-\lambda m$ is Lipschitz continuous.

Proof:

Now for any $x$ and $y$

[TABLE]

where $L$ is the Lipschitz constant of $\nabla K_{\epsilon}.$ ∎
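A sketch of the Lipschitz estimate, assuming only the representation $\nabla f_{\epsilon}(m)=\mathbb{E}[\nabla K_{\epsilon}(m-X)]$ and the Lipschitz constant $L$ of $\nabla K_{\epsilon}$, is given below. It is offered as one way to obtain the bound and may not match the authors' display line by line.

```latex
\begin{aligned}
\|(\nabla f_{\epsilon}(x)-\lambda x)-(\nabla f_{\epsilon}(y)-\lambda y)\|
 &\leq \|\mathbb{E}[\nabla K_{\epsilon}(x-X)-\nabla K_{\epsilon}(y-X)]\|
     + \lambda\|x-y\| \\
 &\leq \mathbb{E}\big[\|\nabla K_{\epsilon}(x-X)-\nabla K_{\epsilon}(y-X)\|\big]
     + \lambda\|x-y\| \\
 &\leq (L+\lambda)\,\|x-y\|.
\end{aligned}
```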

The following proposition proves assumption A4.

Proposition 3

The ODE $\dot{m}=h_{\infty}(m)$ has the origin as its unique globally asymptotically stable equilibrium point.

Proof:

From the definition of $h_{\infty}(m)$ , see assumption A4, we have

[TABLE]

Here

[TABLE]

where the facts that $e^{\frac{-\|cm-x\|^{2}}{\epsilon^{2}}}\leq 1$ and $\int_{\mathbb{R}^{p}}\|x\|f(x)\,dx<\infty$ are utilized. By the application of the Dominated Convergence Theorem [22] we have

[TABLE]

So we have that $h_{\infty}(m)=-\lambda m.$

Now for the system $\dot{m}=h_{\infty}(m)=-\lambda m$ , clearly the origin is an equilibrium point. Also for any initial point $m_{0}$ , $m(t)=m_{0}\exp(-\lambda t)$ is the solution of the system and $m(t)\rightarrow 0$ as $t\rightarrow\infty$ . Therefore the origin is the unique globally asymptotically stable equilibrium point of the system. This concludes the proof. ∎
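The limit $h_{\infty}(m)=-\lambda m$ can also be seen through a shorter argument, sketched here as an alternative to the dominated convergence step. It assumes the Gaussian kernel, whose gradient is bounded, and writes $B$ for that bound; $B$ is notation introduced only for this sketch.

```latex
\begin{aligned}
h_{\infty}(m)
 &= \lim_{c\to\infty}\frac{\nabla f_{\epsilon}(cm)-\lambda\,cm}{c}
  = \lim_{c\to\infty}\frac{\nabla f_{\epsilon}(cm)}{c}-\lambda m, \\
\|\nabla f_{\epsilon}(cm)\|
 &= \|\mathbb{E}[\nabla K_{\epsilon}(cm-X)]\|
  \leq \sup_{y\in\mathbb{R}^{p}}\|\nabla K_{\epsilon}(y)\| =: B < \infty, \\
\Big\|\frac{\nabla f_{\epsilon}(cm)}{c}\Big\|
 &\leq \frac{B}{c}\;\longrightarrow\;0 \quad\text{as } c\to\infty .
\end{aligned}
```

The bound $B/c$ does not depend on $m$, so the convergence is uniform, which in particular gives the uniform convergence on compacts required in A4.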

Remark 1

Note that the regularization coefficient $\lambda$ plays a key role in establishing the stability of the mode estimates. To see the effect of the regularization term consider the iterates

$$m_{n+1}=m_{n}+a_{n}\nabla K_{\epsilon}(m_{n}-X_{n+1}).$$

The $h_{\infty}(.)$ corresponding to this update equation is identically 0 thereby violating assumption A4.

Consider now the following ODE:

$$\dot{m}(t)=\nabla f_{\epsilon}(m(t))-\lambda m(t).\tag{13}$$

Let $\bar{H}$ be the set of asymptotically stable equilibrium points of (13).

Remark 2

From our assumptions $\nabla f_{\epsilon}(x)\to\nabla f(x)$ as $\epsilon\to 0$ for every point of continuity $x$ and $\bar{H}\rightarrow\{m^{*}\}$ as $\lambda\rightarrow 0$ .

We have the following as our main result.

Theorem 3

$\{m_{n}\},\ n\geq 0$ , obtained from Algorithm 1 satisfies $m_{n}\to\bar{H}$ a.s.

Proof:

The result follows from the foregoing and Theorem 2. ∎

V Experiments

In this section, we discuss the numerical performance of our algorithm. We implement our algorithm on well-known distributions. We collect $10^{6}$ samples from a known distribution and apply our algorithm to estimate the mode. The initial mode estimate is selected as the average of the first 1000 points. We use the Gaussian kernel for univariate distributions and the multivariate Gaussian kernel for the bivariate normal and Dirichlet distributions (see Table I) in our experiments. The regularization coefficient $\lambda$ is chosen to be $10^{-5}$ and $\epsilon$ is set to 1. We perform 100 runs of the experiment, and the estimated mode is calculated as the mean of the modes obtained over the 100 runs. Table II illustrates the performance of our algorithm (estimated mode) on standard distributions. We have also indicated the actual mode of each distribution in Table II. The code for our experiments can be found at https://github.com/raghudiddigi/Mode-Estimation.
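The protocol described above can be sketched at reduced scale as follows. This is not the authors' code (see the repository linked above for that). It assumes a Normal(10, 1) source, that the first 1000 samples are used only to initialize the estimate, that the step index then starts at $n=1$, and much smaller defaults (20 runs of 5000 samples) than the $10^{6}$ samples and 100 runs reported in the text. All function names are hypothetical.

```python
import math
import random
import statistics


def grad_gaussian_kernel(x, eps):
    """Derivative of K_eps(x) = K(x / eps) / eps for the 1-D Gaussian kernel K."""
    z = x / eps
    return -z * math.exp(-0.5 * z * z) / (eps ** 2 * math.sqrt(2.0 * math.pi))


def run_once(sampler, num_samples, warmup=1000, eps=1.0, lam=1e-5):
    """Initialize with the mean of the first `warmup` samples, then iterate."""
    m = statistics.fmean([sampler() for _ in range(warmup)])
    for n in range(1, num_samples - warmup + 1):
        x = sampler()
        m += (1.0 / n) * (grad_gaussian_kernel(m - x, eps) - lam * m)
    return m


def summarize(num_runs=20, num_samples=5000, seed=0):
    """Mean and standard deviation of the final estimates over several runs."""
    rng = random.Random(seed)

    def sampler():
        return rng.gauss(10.0, 1.0)  # assumed source distribution

    estimates = [run_once(sampler, num_samples) for _ in range(num_runs)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    mean_estimate, std_estimate = summarize()
    print(mean_estimate, std_estimate)
```

Reporting the mean and standard deviation over runs mirrors the layout of Table II; the numbers produced by this reduced-scale sketch are not expected to reproduce the table.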

It is interesting to note that, although the Exponential and Weibull densities are not smooth and do not satisfy our assumptions, the estimated mode obtained by our algorithm is still within about 0.7 of the actual mode (see Table II).

In Figure 1, we show the performance of our algorithm with different initial points. For this purpose, we select the Normal distribution with mean 10. We run our algorithm with initial points 5, 10 and 15 and plot the estimated mode over the first 50,000 iterations. We observe that the estimates of the mode in all three cases converge towards the actual mode, 10, as the number of iterations increases. This shows that the proposed algorithm is not very sensitive to the initial mode estimate. These results thus confirm the practical utility of our algorithm.

VI Conclusions and Future Work

In this paper, we proposed an online computationally efficient algorithm for computing the mode from the samples of an unknown density. We have provided the proofs for the stability of the iterates and convergence of our algorithm. Next, we showed results of experiments on standard distributions that demonstrate the effectiveness of our algorithm in practice.

In the future, we wish to propose second order algorithms based on Newton’s method in place of gradient ascent. Newton’s method is known to converge faster than the gradient ascent method. Another interesting future direction would be to obtain finite sample error bounds and the rate of convergence for our algorithm by utilizing the central limit theorem for stochastic approximation.

Bibliography

[1] T. Jaakkola, M. I. Jordan, and S. P. Singh, “Convergence of stochastic iterative dynamic programming algorithms,” in Advances in Neural Information Processing Systems, 1994, pp. 703–710.
[2] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. W. Glynn, “Stochastic mirror descent in variationally coherent optimization problems,” in Advances in Neural Information Processing Systems, 2017, pp. 7040–7049.
[3] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. Glynn, “Mirror descent in non-convex stochastic programming,” arXiv preprint arXiv:1706.05681, 2017.
[4] E. Parzen, “On estimation of a probability density function and mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
[5] U. Grenander, “Some direct estimates of the mode,” The Annals of Mathematical Statistics, pp. 131–138, 1965.
[6] H. Chernoff, “Estimation of the mode,” Annals of the Institute of Statistical Mathematics, vol. 16, no. 1, pp. 31–41, 1964.
[7] E. J. Wegman, “A note on the estimation of the mode,” The Annals of Mathematical Statistics, pp. 1909–1915, 1971.
[8] T. Dalenius, “The mode–a neglected statistical parameter,” Journal of the Royal Statistical Society. Series A (General), pp. 110–117, 1965.