A Bayesian Stochastic Approximation Method

Jin Xu; Cui Xiong; Rongji Mu

arXiv:1705.02069·stat.ME·May 8, 2017


TL;DR

This paper introduces a Bayesian stochastic approximation method that enhances small sample estimation of regression roots through adaptive modeling, demonstrating superior finite-sample performance over traditional procedures.

Contribution

It presents a novel Bayesian approach with adaptive local modeling and nonrecursive iteration, improving efficiency and consistency in root estimation tasks.

Findings

01 Superior finite-sample performance compared to Robbins–Monro procedures

02 Strong consistency of the Bayesian estimator established

03 Extensions to extremum searching and multivariate quantiles included

Abstract

Motivated by the goal of improving the efficiency of small sample design, we propose a novel Bayesian stochastic approximation method to estimate the root of a regression function. The method features adaptive local modelling and nonrecursive iteration. Strong consistency of the Bayes estimator is obtained. Simulation studies show that our method is superior in finite-sample performance to Robbins–Monro type procedures. Extensions to searching for extrema and a version of generalized multivariate quantile are presented.

Equations

$y_{n}=M(x_{n})+\varepsilon_{n},\quad n=1,2,\dots$

$x_{n+1}=x_{n}-a_{n}y_{n},$

$x_{n+1}=x_{n}-(nb_{n})^{-1}y_{n},$

$x_{n+1}=x_{n}-a_{n}(y_{n}-\alpha).$

$x_{n+1}=x_{n}-a_{n}(y_{n}-\alpha_{n})$

$a_{n}=\frac{\beta\tau_{n}^{2}}{\alpha_{n}(1-\alpha_{n})(1+\beta^{2}\tau_{n}^{2})^{1/2}}\,\phi\left\{\frac{\Phi^{-1}(\alpha)}{(1+\beta^{2}\tau_{n}^{2})^{1/2}}\right\},\quad \alpha_{n}=\Phi\left\{\frac{\Phi^{-1}(\alpha)}{(1+\beta^{2}\tau_{n}^{2})^{1/2}}\right\},$

$F(x)=\alpha+\beta(x-\theta),\quad x\in(v_{0},v_{1}).$

$\rho_{0}=F(v_{0})=\alpha+s\widetilde{\beta}(v_{0}-\theta)$ and $\rho_{1}=F(v_{1})=\alpha+s\widetilde{\beta}(v_{1}-\theta).$

$\widetilde{\beta}=\rho_{1}-\rho_{0}$ and $\theta=\frac{\rho_{1}-\alpha}{\rho_{1}-\rho_{0}}v_{0}+\frac{\alpha-\rho_{0}}{\rho_{1}-\rho_{0}}v_{1}.$

$h(\rho_{0},\rho_{1})=\frac{2I(\rho_{L}<\rho_{0}<\rho_{1}<\rho_{U})}{(\rho_{U}-\rho_{L})^{2}},$

$h(\theta,\widetilde{\beta})=\frac{2s\widetilde{\beta}\,I(\rho_{L}<\alpha+s\widetilde{\beta}(v_{0}-\theta)<\alpha+s\widetilde{\beta}(v_{1}-\theta)<\rho_{U})}{(\rho_{U}-\rho_{L})^{2}},$

$h(\theta,\widetilde{\beta})=\frac{2s\widetilde{\beta}\,I(0<\widetilde{\beta}<\eta(\theta))}{(\rho_{U}-\rho_{L})^{2}}$

$\eta(\theta)=\frac{(\rho_{U}-\alpha)I(\theta\leq\theta_{0})}{s(v_{1}-\theta)}+\frac{(\alpha-\rho_{L})I(\theta>\theta_{0})}{s(\theta-v_{0})},\quad \theta_{0}=\frac{(\rho_{U}-\alpha)v_{0}+(\alpha-\rho_{L})v_{1}}{\rho_{U}-\rho_{L}}.$

$h_{0}(\theta)=\frac{s\eta^{2}(\theta)}{c_{0}(\rho_{U}-\rho_{L})^{2}},$

$L_{i}(\theta,\widetilde{\beta})=F(x_{i})^{y_{i}}\{1-F(x_{i})\}^{1-y_{i}}=a_{i}+b_{i}(\theta)\widetilde{\beta},$

$a_{i}=\alpha^{y_{i}}(1-\alpha)^{1-y_{i}}=1-y_{i}+(2y_{i}-1)\alpha,\quad b_{i}(\theta)=s(2y_{i}-1)(x_{i}-\theta).$

$h(\theta,\widetilde{\beta})\prod_{j=1}^{m}L_{i_{j}}(\theta,\widetilde{\beta})=\frac{2s\widetilde{\beta}\,I\{0<\widetilde{\beta}<\eta(\theta)\}}{(\rho_{U}-\rho_{L})^{2}}\prod_{j=1}^{m}\{a_{i_{j}}+b_{i_{j}}(\theta)\widetilde{\beta}\}.$

$d_{m,r}(\theta)=\sum_{B\in\Omega_{m,r}}\prod_{t\in B^{c}}a_{i_{t}}\prod_{k\in B}b_{i_{k}}(\theta),$

$h_{m}(\theta)=\frac{2s}{c_{m}(\rho_{U}-\rho_{L})^{2}}\sum_{r=0}^{m}\frac{d_{m,r}(\theta)\eta^{r+2}(\theta)}{r+2}=\frac{2c_{0}h_{0}(\theta)}{c_{m}}\sum_{r=0}^{m}\frac{d_{m,r}(\theta)\eta^{r}(\theta)}{r+2},$

$d_{m,r}(\theta)=d_{m-1,r}(\theta)a_{i_{m}}+d_{m-1,r-1}(\theta)b_{i_{m}}(\theta),\quad r=0,\dots,m,$

$c_{m}h_{m}(\theta)=c_{m-1}h_{m-1}(\theta)\{a_{i_{m}}+b_{i_{m}}(\theta)\eta(\theta)R_{m-1}(\theta)\},$

$x_{n+1}=\mbox{\rm E}_{h_{m}}(\theta).$

$L_{i}(\rho_{0},\rho_{1})=1-y_{i}+(2y_{i}-1)q_{i}\rho_{0}+(2y_{i}-1)(1-q_{i})\rho_{1}.$

$h_{m}(\rho_{0})=\frac{2}{c_{m}^{*}(\rho_{U}-\rho_{L})^{2}}\sum_{r=0}^{m}\frac{d_{m,r}(\rho_{0})(\rho_{U}^{r+1}-\rho_{0}^{r+1})}{r+1},$

$a_{i}=1-y_{i}+(2y_{i}-1)q_{i}\rho_{0},\quad b_{i}=(2y_{i}-1)(1-q_{i}),$

$h_{m}(\rho_{1})=\frac{2}{c_{m}^{**}(\rho_{U}-\rho_{L})^{2}}\sum_{r=0}^{m}\frac{d_{m,r}(\rho_{1})(\rho_{1}^{r+1}-\rho_{L}^{r+1})}{r+1},$

$a_{i}=1-y_{i}+(2y_{i}-1)(1-q_{i})\rho_{1},\quad b_{i}=(2y_{i}-1)q_{i},$

$h(\widetilde{\beta},\theta)=\frac{2s\widetilde{\beta}\,I\{0<\widetilde{\beta}<\rho_{U}-\rho_{L},\ \ell(\widetilde{\beta})<\theta<u(\widetilde{\beta})\}}{(\rho_{U}-\rho_{L})^{2}},$

$\ell(\widetilde{\beta})=v_{1}-\frac{\rho_{U}-\alpha}{s\widetilde{\beta}},\quad u(\widetilde{\beta})=v_{0}+\frac{\alpha-\rho_{L}}{s\widetilde{\beta}},$

$g_{0}(\widetilde{\beta})=\frac{2(\rho_{U}-\rho_{L}-\widetilde{\beta})}{\widetilde{c}_{0}(\rho_{U}-\rho_{L})^{2}},$


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Statistical Methods and Inference · Neural Networks and Applications

Full text

A Bayesian Stochastic Approximation Method

Jin Xu¹, Cui Xiong and Rongji Mu

¹ Corresponding author: School of Statistics, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China, e-mail: [email protected]

School of Statistics

East China Normal University

Shanghai 200241, China


Key words: adaptive local modelling; Kiefer–Wolfowitz process; nonrecursive iteration; Robbins–Monro process; stochastic approximation.

1 Introduction

We consider the problem of finding the unique root $\theta$ of an unknown function $M$ in the regression model

$$y_{n}=M(x_{n})+\varepsilon_{n},\qquad n=1,2,\dots, \tag{1}$$

where $\varepsilon_{n}$ is an unobservable random error. The stochastic approximation approach uses a sequential design strategy to successively choose $x_{n}$ at which the response $y_{n}$ is observed with mean $M(x_{n})$ , so that $x_{n}$ converges to $\theta$ in some sense. This response-adaptive feature is attractive and can often be more efficient than a fixed sample design. Over the years, stochastic approximation and its variants have found broad applications in design of experiments, clinical trials, dynamic programming and sequential learning, to name just a few (Finney, 1978; Kushner and Yin, 1997; Spall, 2003).

Here we give a brief review, which is by no means complete but covers some major developments. In their fundamental paper, Robbins and Monro (1951) proposed a recursive design of the form

$$x_{n+1}=x_{n}-a_{n}y_{n}, \tag{2}$$

where $a_{n}$ are positive constants, and showed that $x_{n}$ converges to $\theta$ in probability when $\sum_{n=1}^{\infty}a_{n}=\infty$ and $\sum_{n=1}^{\infty}a_{n}^{2}<\infty$ , assuming $M$ satisfies some regularity conditions. It is a stochastic analogue of the deterministic Newton’s method, where $x_{n+1}=x_{n}-M(x_{n})/M^{\prime}(x_{n})$ (the prime denotes the first derivative), and is referred to as the Robbins–Monro procedure. The almost sure convergence was later proved through different approaches (Dvoretzky, 1956; Gladyshev, 1965; Robbins and Siegmund, 1971). Inspired by the Liapounov functions in the stability theory of ordinary differential equations, Sacks (1958) established the asymptotic normality of $x_{n}$ and showed that under certain regularity conditions the asymptotically optimal choice of $a_{n}$ in (2) is $a_{n}=(n\beta^{*})^{-1}$ where $\beta^{*}=M^{\prime}(\theta)$ . (See also Chung (1954), Burkholder (1956), Hodges and Lehmann (1956).)
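To make recursion (2) concrete, the following is a minimal Python sketch using the step size $a_{n}=c/n$ , which satisfies both summability conditions. The names `robbins_monro`, `make_linear_observer`, `observe` and the constant `c` are illustrative assumptions and do not come from the paper.

```python
import random


def robbins_monro(observe, x1, n_steps, c=1.0):
    """Run x_{n+1} = x_n - a_n * y_n with a_n = c / n and return the path."""
    x = x1
    path = [x]
    for n in range(1, n_steps + 1):
        y = observe(x)        # noisy response with mean M(x)
        x = x - (c / n) * y   # a_n = c/n: sum a_n diverges, sum a_n^2 converges
        path.append(x)
    return path


def make_linear_observer(root, slope, sd, seed=0):
    """Hypothetical test function M(x) = slope * (x - root) plus Gaussian noise."""
    rng = random.Random(seed)
    return lambda x: slope * (x - root) + rng.gauss(0.0, sd)
```

Choosing `c` close to $1/\beta^{*}$ corresponds to the asymptotically optimal step size mentioned above; in practice $\beta^{*}$ is unknown, which is what motivates the adaptive variants discussed next.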

Since then, much effort has been devoted to estimating $\beta^{*}$ . Lai and Robbins (1979, 1981) proposed an adaptive Robbins–Monro procedure of the form

$$x_{n+1}=x_{n}-(nb_{n})^{-1}y_{n}, \tag{3}$$

where $b_{n}$ is a truncated version of the least square estimate of the regression slope given by $\widehat{\beta}_{n}=\sum_{i=1}^{n}y_{i}(x_{i}-\overline{x}_{n})/\sum_{i=1}^{n}(x_{i}-\overline{x}_{n})^{2}$ and $\overline{x}_{n}=n^{-1}\sum_{i=1}^{n}x_{i}$ . Strong consistency of $b_{n}$ was established (Lai and Robbins, 1981, 1982). We refer the readers to Venter (1967), Anderson and Taylor (1976), Anbar (1978) and Anderson and Taylor (1979) for some related versions. An excellent review about these variants is given by Lai (2003).

In a different route, Ruppert (1988) and Polyak and Juditsky (1992) proposed using averaged trajectories of (2), $\overline{x}_{n}$ , to estimate the root and demonstrated the almost sure convergence when $a_{n}$ satisfies the condition of being sufficiently slowly decreasing in the sense of $a_{n}\rightarrow 0$ and $(a_{n}-a_{n+1})/a_{n}=\mbox{\rm o}(a_{n})$ .

An important case of (1) is when $M$ is a distribution function and $y_{n}$ is a binary response. Then, the Robbins–Monro procedure for finding the $\alpha$ -quantile of $M$ , assuming it is unique, is given by

$$x_{n+1}=x_{n}-a_{n}(y_{n}-\alpha). \tag{4}$$

The corresponding adaptive version is $x_{n+1}=x_{n}-(nb_{n})^{-1}(y_{n}-\alpha)$ .

The rationale of these procedures is clear. When observing a ‘success’ at the $n$ th step (such as an explosion in a sensitivity experiment or the occurrence of an adverse event in a dose-finding clinical trial), reduce the current level for the next design point; when observing a ‘failure’, increase the current level for the next design point. As the number of iterations increases, the magnitude of change converges to zero. This type of scheme is similar in spirit to the ‘up-and-down’ method (Dixon and Mood, 1948; Dixon, 1965) for estimating the median in sensitivity experiments. To estimate $\beta^{*}$ in the binary data case, Wu (1985) proposed fitting a two-parameter logit model to the available data to obtain an initial maximum likelihood estimate (MLE) of $x_{n+1}$ . Some initial runs are required so that the condition for the existence and uniqueness of this MLE is met. (See also Sitter and Wu (1993).) An important contribution by Joseph (2004) is the proposal of an efficient Robbins–Monro procedure which entails the recursion

$$x_{n+1}=x_{n}-a_{n}(y_{n}-\alpha_{n}) \tag{5}$$

where

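$$a_{n}=\frac{\beta\tau_{n}^{2}}{\alpha_{n}(1-\alpha_{n})(1+\beta^{2}\tau_{n}^{2})^{1/2}}\,\phi\left\{\frac{\Phi^{-1}(\alpha)}{(1+\beta^{2}\tau_{n}^{2})^{1/2}}\right\},\qquad \alpha_{n}=\Phi\left\{\frac{\Phi^{-1}(\alpha)}{(1+\beta^{2}\tau_{n}^{2})^{1/2}}\right\},$$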

$\tau_{n+1}^{2}=\tau_{n}^{2}-\alpha_{n}(1-\alpha_{n})a_{n}^{2}$ , $\beta=M^{\prime}(\theta)/\phi(\Phi^{-1}(\alpha))$ , and $\Phi$ and $\phi$ are the distribution function and density of the standard normal variable, respectively. The introduction of the constant sequence $\alpha_{n}\rightarrow\alpha$ helps reduce the oscillation of $x_{n}$ at early steps. It is shown to converge faster than the usual Robbins–Monro procedure when $\alpha$ takes extreme values. Wu and Tian (2014) proposed a three-phase design that combines some initial design and Joseph’s efficient modification to obtain a more stable method. Recently, Toulis and Airoldi (2015) proposed an implicit stochastic approximation method which improves the classic Robbins–Monro procedure by a stochastic fixed-point equation. It requires running many additional experiments at every step of (2) and thus may not be feasible for a small sample design.
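The step-size formulas of Joseph's recursion (5) can be evaluated directly. The sketch below is only an illustration of those formulas as written above; the function name `joseph_step` and its argument names are assumptions, and $\beta$ and the initial $\tau_{1}^{2}$ are treated as given inputs.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: Phi = cdf, phi = pdf


def joseph_step(alpha, beta, tau2):
    """Return (a_n, alpha_n, tau2_next) for the current value tau2 = tau_n^2."""
    scale = (1.0 + beta ** 2 * tau2) ** 0.5
    z = _STD_NORMAL.inv_cdf(alpha) / scale
    alpha_n = _STD_NORMAL.cdf(z)
    a_n = beta * tau2 / (alpha_n * (1.0 - alpha_n) * scale) * _STD_NORMAL.pdf(z)
    tau2_next = tau2 - alpha_n * (1.0 - alpha_n) * a_n ** 2
    return a_n, alpha_n, tau2_next
```

For an extreme target such as $\alpha=0.9$ , the returned `alpha_n` lies between 0.5 and $\alpha$ and approaches $\alpha$ as `tau2` shrinks, which is the mechanism that damps the early oscillation.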

Other model-based designs for quantal response focus on estimation of the coefficients of a parametric model (Wu, 1986; Chaloner and Larntz, 1989; Chaudhuri and Mykland, 1993; Neyer, 1994; Dror and Steinberg, 2006, 2008; Hung and Joseph, 2014). The advantage of this approach is that one can use a single design to estimate the global response curve that includes all quantiles. The disadvantages are that i) it needs to make assumptions (about the model and/or hyperparameters); and ii) the designs usually require initial data to start with, which can be as many as ten or more observations. Hung and Joseph (2014) proposed a simple Bayesian version of Wu (1985)’s logit-MLE method, which makes the design fully sequential from $n=1$ . It postulates independent informative priors on the parameters of a logistic model for $M$ given by $F(x)=[1+\exp\{-(x-\mu)/\sigma\}]^{-1}$ with $\mu\sim N(\mu_{0},\tau^{2})$ and $\sigma\sim\exp(\xi)$ . The sequential design then estimates the $\alpha$ -quantile by $x_{n+1}=\hat{\mu}_{n}+\hat{\sigma}_{n}\log\{\alpha/(1-\alpha)\}$ , where $\hat{\mu}_{n}$ and $\hat{\sigma}_{n}$ are the maximum-a-posteriori (MAP) estimates of $(\mu,\sigma)$ after $n$ samples.

In this paper, we limit our study to the root finding problem. We point out several limitations associated with the Robbins–Monro type procedures. First, for algorithm-based procedures such as (2), the averaged trajectory of (2) and (5), adaptation through only the last experimental data $(x_{n},y_{n})$ via recursion is inadequate. Experiments at points in a neighborhood also carry useful information about $\theta$ , especially in the early stage. Second, large oscillation caused by these up-or-down recursions in early iterations can be harmful and inefficient. Third, for procedures such as (3) that rely heavily on the estimation of $\beta^{*}$ , as $x_{n}$ clusters around $\theta$ , little information is gained to estimate $\beta^{*}$ directly. So even for a consistent estimator, the finite-sample performance can still be far from satisfactory from a practical point of view.

On the other hand, the Bayesian paradigm is known to be suitable for such adaptive learning problems. Some applications in the closely related problem of dose-finding in clinical trials have been reported (Cheung, 2010; Thall, 2010). As in Hung and Joseph (2014)’s method, Bayesian models are used there to update the underlying distribution globally. Little work has been reported on solving for the local root, i.e. the $\alpha$ -quantile, directly. Using martingale theory, Hu (1998) established the strong consistency of the Bayes estimator under a general setting of a nonlinear regression model. We will make use of this result for later development.

Motivated by the aforementioned drawbacks of the Robbins–Monro type methods and the advantage of Bayesian approach, we propose a novel model-based stochastic approximation procedure that circumvents direct estimation of $\beta^{*}$ through integration. Specifically, the new method builds a local linear model for $M$ around $x_{n}$ and obtains the Bayes estimator as a nonrecursive solution for $x_{n+1}$ . Strong consistency is obtained. These constitute the main contents of Section 2. In Section 3, we give a few important remarks and insights of the proposed method that lead to more efficient algorithm. More importantly, in Section 4 we demonstrate by simulation that the proposed method yields a smooth search path and results in a superior finite-sample performance to the competing methods. In Section 5, we present applications of the new method to the general root-finding problem in (1) and Kiefer–Wolfowitz procedure (Kiefer and Wolfowitz, 1952) to find the minimum of an unknown function. In Section 6, we extend the proposed method to estimate a version of generalized multivariate quantile. Section 7 concludes the paper with some discussions.

2 Method

We begin with the problem of quantile estimation with binary responses under the setting in (4).

First, we introduce two preliminary steps before the sequential experiment. (i) Scale the search domain of $x$ to the interval $(0,1)$ . This can be done easily once we have some general idea of the range of $x$ . (ii) Divide the interval $(0,1)$ equally into $s$ subintervals. We will provide guidelines for the selection of $s$ in Section 3.4.

Denote the (scaled) data up to the $n$ th step by $\mathcal{D}_{n}=\{(x_{i},y_{i}):i=1,\ldots,n\}$ . Next, we construct a local Bayesian model based on the current point $x_{n}$ . Observe that $x_{n}$ is contained in the subinterval $(v_{0},v_{1})$ , where $v_{0}=(\lceil x_{n}s\rceil-1)/s$ , $v_{1}=\lceil x_{n}s\rceil/s$ , and $\lceil\cdot\rceil$ is the ceiling function. Approximate $M(x)$ in $(v_{0},v_{1})$ by the segment of a line through the point $(\theta,\alpha)$ with positive slope $\beta$ given by

$$F(x)=\alpha+\beta(x-\theta),\qquad x\in(v_{0},v_{1}). \tag{6}$$

Note that $\theta$ itself is not necessarily in $(v_{0},v_{1})$ .

For the convenience of later calculation, denote $\widetilde{\beta}=\beta(v_{1}-v_{0})\ (=\beta/s)$ . Let

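$$\rho_{0}=F(v_{0})=\alpha+s\widetilde{\beta}(v_{0}-\theta)\quad\mbox{and}\quad\rho_{1}=F(v_{1})=\alpha+s\widetilde{\beta}(v_{1}-\theta). \tag{7}$$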

Then, $\widetilde{\beta}$ and $\theta$ are 1-1 connected with $\rho_{0}$ and $\rho_{1}$ through

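$$\widetilde{\beta}=\rho_{1}-\rho_{0}\quad\mbox{and}\quad\theta=\frac{\rho_{1}-\alpha}{\rho_{1}-\rho_{0}}v_{0}+\frac{\alpha-\rho_{0}}{\rho_{1}-\rho_{0}}v_{1}. \tag{8}$$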
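The subinterval lookup and the one-to-one map between $(\theta,\widetilde{\beta})$ and $(\rho_{0},\rho_{1})$ can be sketched in a few lines of Python. The function names `subinterval`, `to_rho` and `from_rho` are hypothetical helpers introduced for illustration only.

```python
import math


def subinterval(x, s):
    """Return (v0, v1), the cell of width 1/s that contains x in (0, 1)."""
    k = math.ceil(x * s)
    return (k - 1) / s, k / s


def to_rho(theta, beta_t, alpha, v0, v1, s):
    """Map (theta, beta~) to the end-point values (rho0, rho1)."""
    rho0 = alpha + s * beta_t * (v0 - theta)
    rho1 = alpha + s * beta_t * (v1 - theta)
    return rho0, rho1


def from_rho(rho0, rho1, alpha, v0, v1):
    """Inverse map: recover (theta, beta~) from (rho0, rho1)."""
    beta_t = rho1 - rho0
    theta = ((rho1 - alpha) * v0 + (alpha - rho0) * v1) / beta_t
    return theta, beta_t
```

For instance, with $s=7$ the point $x=0.25$ falls in the cell $(1/7,2/7)$ , and a round trip through `to_rho` and `from_rho` returns the original pair up to floating-point error.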

Assume that the joint prior of $(\rho_{0},\rho_{1})$ is uniform with density

$$h(\rho_{0},\rho_{1})=\frac{2I(\rho_{L}<\rho_{0}<\rho_{1}<\rho_{U})}{(\rho_{U}-\rho_{L})^{2}}, \tag{9}$$

where $0\leq\rho_{L}<\alpha<\rho_{U}\leq 1$ are two given constants, and $I(\cdot)$ is the indicator function. For example, the constants $\rho_{L}=0$ and $\rho_{U}=1$ are considered to be noninformative. We discuss the determination of $\rho_{L}$ and $\rho_{U}$ further in Section 3.1. It should be noted that under this prior, $\theta$ can take any value in $(-\infty,\infty)$ through (8) by linear extrapolation. For later development, we restrict the calculation of the posterior distribution of $\theta$ to the domain $(0,1)$ by truncation. We will introduce another prior that meets the restriction $\theta\in(0,1)$ in Section 3.5.

The subsequent development for finding the posterior distribution of $\theta$ is standard. After accounting for the Jacobian from (7), the joint prior density of $(\theta,\widetilde{\beta})$ is

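$$h(\theta,\widetilde{\beta})=\frac{2s\widetilde{\beta}\,I(\rho_{L}<\alpha+s\widetilde{\beta}(v_{0}-\theta)<\alpha+s\widetilde{\beta}(v_{1}-\theta)<\rho_{U})}{(\rho_{U}-\rho_{L})^{2}},$$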

which can be expressed as

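$$h(\theta,\widetilde{\beta})=\frac{2s\widetilde{\beta}\,I(0<\widetilde{\beta}<\eta(\theta))}{(\rho_{U}-\rho_{L})^{2}} \tag{10}$$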

with

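$$\eta(\theta)=\frac{(\rho_{U}-\alpha)I(\theta\leq\theta_{0})}{s(v_{1}-\theta)}+\frac{(\alpha-\rho_{L})I(\theta>\theta_{0})}{s(\theta-v_{0})},\qquad\theta_{0}=\frac{(\rho_{U}-\alpha)v_{0}+(\alpha-\rho_{L})v_{1}}{\rho_{U}-\rho_{L}}. \tag{11}$$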

Note that $0<\eta(\theta)<\rho_{U}-\rho_{L}$ . Integrating out $\widetilde{\beta}$ in (10) and imposing the restriction that $0<\theta<1$ , we obtain the prior density of $\theta$ as

$$h_{0}(\theta)=\frac{s\eta^{2}(\theta)}{c_{0}(\rho_{U}-\rho_{L})^{2}},$$

where $c_{0}=\int_{0}^{1}s\eta^{2}(\theta)/(\rho_{U}-\rho_{L})^{2}d\theta$ is the normalization constant.
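As a numerical illustration of the prior of $\theta$ , the sketch below evaluates $\eta(\theta)$ from (11) and normalises $s\eta^{2}(\theta)/(\rho_{U}-\rho_{L})^{2}$ on a grid over $[0,1]$ . The grid-based trapezoidal normalisation and the names `eta` and `prior_theta` are assumptions made for this example; the paper itself defines $c_{0}$ as an exact integral.

```python
def eta(theta, alpha, v0, v1, s, rho_l=0.0, rho_u=1.0):
    """Upper limit eta(theta) of beta~ in the joint prior, as in (11)."""
    theta0 = ((rho_u - alpha) * v0 + (alpha - rho_l) * v1) / (rho_u - rho_l)
    if theta <= theta0:
        return (rho_u - alpha) / (s * (v1 - theta))
    return (alpha - rho_l) / (s * (theta - v0))


def prior_theta(alpha, v0, v1, s, rho_l=0.0, rho_u=1.0, n_grid=2001):
    """Return (grid, h0): prior density of theta on [0, 1], grid-normalised."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    raw = [s * eta(t, alpha, v0, v1, s, rho_l, rho_u) ** 2 / (rho_u - rho_l) ** 2
           for t in grid]
    step = grid[1] - grid[0]
    c0 = step * (sum(raw) - 0.5 * (raw[0] + raw[-1]))  # trapezoidal estimate of c_0
    return grid, [r / c0 for r in raw]
```

The resulting density peaks at $\theta_{0}$ , where $\eta(\theta_{0})=\rho_{U}-\rho_{L}$ , and decays on both sides like an inverse square.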

Next, we will only use the design points contained in $(v_{0},v_{1})$ to update the Bayesian model. This idea of using most recent design points is also seen in Anbar (1978) to estimate $\beta^{*}$ .

Denote the subsequence of $x_{n}$ in $(v_{0},v_{1})$ by $x_{i_{1}},\ldots,x_{i_{m}}$ . Clearly, $1\leq m\leq n$ since at least $x_{n}$ is in $(v_{0},v_{1})$ . Denote the likelihood function of $(\theta,\widetilde{\beta})$ at point $(x_{i},y_{i})$ by $L_{i}$ , which is expressed as

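$$L_{i}(\theta,\widetilde{\beta})=F(x_{i})^{y_{i}}\{1-F(x_{i})\}^{1-y_{i}}=a_{i}+b_{i}(\theta)\widetilde{\beta}, \tag{12}$$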

where

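$$a_{i}=\alpha^{y_{i}}(1-\alpha)^{1-y_{i}}=1-y_{i}+(2y_{i}-1)\alpha,\qquad b_{i}(\theta)=s(2y_{i}-1)(x_{i}-\theta).$$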

By (10) and (12), the posterior distribution of $(\theta,\widetilde{\beta})$ is proportional to

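$$h(\theta,\widetilde{\beta})\prod_{j=1}^{m}L_{i_{j}}(\theta,\widetilde{\beta})=\frac{2s\widetilde{\beta}\,I\{0<\widetilde{\beta}<\eta(\theta)\}}{(\rho_{U}-\rho_{L})^{2}}\prod_{j=1}^{m}\{a_{i_{j}}+b_{i_{j}}(\theta)\widetilde{\beta}\}. \tag{14}$$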

For $r=0,1,\ldots,m$ , express the coefficient of $\widetilde{\beta}^{r}$ in $\prod_{j=1}^{m}\{a_{i_{j}}+b_{i_{j}}(\theta)\widetilde{\beta}\}$ as

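$$d_{m,r}(\theta)=\sum_{B\in\Omega_{m,r}}\prod_{t\in B^{c}}a_{i_{t}}\prod_{k\in B}b_{i_{k}}(\theta), \tag{15}$$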

where $\Omega_{m,r}$ is the collection of $m$ -choose- $r$ distinct subsets of $r$ indices out of $\{1,\ldots,m\}$ and $B^{c}=\{1,\ldots,m\}\backslash B$ . We emphasize that $d_{m,r}(\theta)$ only depends on data observed in the subinterval.

Integrating out $\widetilde{\beta}$ in (14), we get the posterior distribution of $\theta$ as

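$$h_{m}(\theta)=\frac{2s}{c_{m}(\rho_{U}-\rho_{L})^{2}}\sum_{r=0}^{m}\frac{d_{m,r}(\theta)\eta^{r+2}(\theta)}{r+2}=\frac{2c_{0}h_{0}(\theta)}{c_{m}}\sum_{r=0}^{m}\frac{d_{m,r}(\theta)\eta^{r}(\theta)}{r+2}, \tag{16}$$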

where $c_{m}$ is the normalization constant. A few points are worth noting. First, $h_{m}(\theta)$ is a two-piecewise homogeneous polynomial of order $-2$ and is differentiable everywhere except at $\theta_{0}$ . Second, $h_{m}(\theta)$ is invariant to the permutation of the points in the subsequence. Third, the modification of the prior $h_{0}$ into $h_{m}$ takes place in a multiplicative fashion. The weighted summand $d_{m,r}(\theta)\eta^{r}(\theta)$ can be viewed as the $r$ th order interaction of the points in the subinterval. Moreover, we can write $d_{m,r}(\theta)$ recursively as

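$$d_{m,r}(\theta)=d_{m-1,r}(\theta)a_{i_{m}}+d_{m-1,r-1}(\theta)b_{i_{m}}(\theta),\qquad r=0,\dots,m, \tag{17}$$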

where $d_{0,0}=1$ , $d_{m-1,-1}=d_{m-1,m}=0$ . It provides a simple way to obtain $d_{m,r}$ successively. Based on (17), we express $c_{m}h_{m}(\theta)$ in a recursive form as
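The recursion (17) amounts to multiplying out the polynomial $\prod_{j}\{a_{i_{j}}+b_{i_{j}}(\theta)\widetilde{\beta}\}$ one factor at a time. A minimal sketch, for a fixed $\theta$ so that the $b$ values are plain numbers, is given below; `d_coeffs` and `d_bruteforce` are illustrative names, and the brute-force version follows definition (15) and is included only as a cross-check.

```python
from itertools import combinations
from math import prod


def d_coeffs(a, b):
    """Return [d_{m,0}, ..., d_{m,m}] built successively with recursion (17)."""
    d = [1.0]                          # d_{0,0} = 1
    for a_m, b_m in zip(a, b):
        prev = d + [0.0]               # pad with d_{m-1,m} = 0
        d = [prev[r] * a_m + (prev[r - 1] if r > 0 else 0.0) * b_m
             for r in range(len(prev))]
    return d


def d_bruteforce(a, b, r):
    """Direct subset sum of definition (15); exponential cost, for checking."""
    m = len(a)
    total = 0.0
    for subset in combinations(range(m), r):
        chosen = set(subset)
        total += (prod(b[k] for k in subset)
                  * prod(a[t] for t in range(m) if t not in chosen))
    return total
```

The recursive version costs $O(m^{2})$ operations per value of $\theta$ , whereas enumerating the subsets in (15) grows exponentially in $m$ .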

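$$c_{m}h_{m}(\theta)=c_{m-1}h_{m-1}(\theta)\{a_{i_{m}}+b_{i_{m}}(\theta)\eta(\theta)R_{m-1}(\theta)\}, \tag{18}$$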

where $R_{m-1}(\theta)=\sum_{r=0}^{m-1}(r+3)^{-1}d_{m-1,r}(\theta)\eta^{r}(\theta)/\sum_{r=0}^{m-1}(r+2)^{-1}d_{m-1,r}(\theta)\eta^{r}(\theta)$ .

We summarize the above results in the following proposition.

Proposition 1.

Assume that the joint prior of $(\rho_{0},\rho_{1})$ associated with the subinterval $(v_{0},v_{1})$ is uniform with density (9). Then, the posterior distribution of $\theta$ restricted in $(0,1)$ is given in (16) satisfying a recursion in (18).

Finally, we set the next point to be the Bayes estimator with respect to $h_{m}$ , i.e.

$$x_{n+1}=\mbox{\rm E}_{h_{m}}(\theta). \tag{19}$$

Since $h_{m}(\theta)$ or $c_{m}h_{m}(\theta)$ is completely determined in (16), $x_{n+1}$ can be easily calculated up to a desired precision. We can also easily obtain an equal tail credible interval for $\theta$ based on $h_{m}$ .
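One way to compute the Bayes estimator numerically is to evaluate the unnormalised posterior (16) on a grid over $(0,1)$ and take the weighted mean. The sketch below does this with a midpoint rule; the grid size, the midpoint rule and the name `posterior_mean` are assumptions of this example rather than the authors' implementation. The arguments `xs` and `ys` are meant to hold only the design points inside $(v_{0},v_{1})$ and their binary responses.

```python
def posterior_mean(xs, ys, alpha, v0, v1, s, rho_l=0.0, rho_u=1.0, n_grid=2000):
    """Approximate E_{h_m}(theta) on (0, 1) by a midpoint rule."""
    theta0 = ((rho_u - alpha) * v0 + (alpha - rho_l) * v1) / (rho_u - rho_l)
    num = den = 0.0
    for i in range(n_grid):
        t = (i + 0.5) / n_grid                      # grid midpoint for theta
        if t <= theta0:                             # eta(theta) as in (11)
            eta = (rho_u - alpha) / (s * (v1 - t))
        else:
            eta = (alpha - rho_l) / (s * (t - v0))
        d = [1.0]                                   # d_{0,0} = 1
        for x, y in zip(xs, ys):                    # recursion (17)
            a = 1 - y + (2 * y - 1) * alpha
            b = s * (2 * y - 1) * (x - t)
            prev = d + [0.0]
            d = [prev[r] * a + (prev[r - 1] if r > 0 else 0.0) * b
                 for r in range(len(prev))]
        # Unnormalised h_m(theta); the common factor 2s/(rho_U - rho_L)^2 cancels.
        w = sum(d_r * eta ** (r + 2) / (r + 2) for r, d_r in enumerate(d))
        num += t * w
        den += w
    return num / den
```

As a sanity check, with $\alpha=0.5$ , $s=1$ and a single observation at $x=0.5$ , the estimates for $y=1$ and $y=0$ are mirror images about 0.5, and a ‘success’ pulls the estimate below 0.5.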

When $M$ is linear as $F$ in (6), it is clear that the random error for the binary response $y_{n}$ satisfies the conditions $\mbox{\rm E}(\varepsilon_{n}\mid\varepsilon_{1},\ldots,\varepsilon_{n-1})=0$ and $\mbox{\rm E}(\varepsilon_{n}^{2})<\infty$ . Then, by Theorem 1 of Hu (1998), we have the following result about the consistency of the procedure.

Proposition 2.

For binary response with mean value given by the model (6), the Bayesian stochastic approximation procedure given by (19) is strongly consistent.

When $M$ is nonlinear, by Taylor expansion $M(x)$ differs from $F(x)$ by a quantity bounded by $\sup_{x\in(v_{0},v_{1})}|M^{\prime\prime}(x)|/(2s^{2})$ , where $M^{\prime\prime}$ denotes the second derivative assuming it exists. As $n$ increases, we can increase $s$ so that the local linear approximation is well maintained. Thus, we expect the consistency of the procedure to hold. We demonstrate its superb finite-sample performance in Section 4.

Besides the Bayes estimator, we can also use the posterior mode, i.e. the maximum a posteriori (MAP) estimator, for the next point. We illustrate the procedure by an example.

Example 1.

Let $M_{1}(x)=\Phi(6x-3)$ for $x\in(0,1)$ . Consider estimating the median of $M_{1}$ . Set $x_{1}=0.25$ and $s=7$ , and set $\rho_{L}=0$ and $\rho_{U}=1$ for all subintervals. Figures 1 and 2 demonstrate one search path up to 30 steps and the evolution of the corresponding posterior distributions $h_{(n)}$ (which equals $h_{m}$ for some $m$ in the associated subinterval) obtained by the proposed method using the Bayes estimate and the MAP estimate, respectively. Notice that $c_{m}/(c_{m-1}a_{i_{m}})\rightarrow 1$ . To illustrate the shape of $h_{m}$ , we multiply $h_{m}$ by $c_{m}/\prod_{j=1}^{m}a_{i_{j}}$ to put the amplified $h_{m}$ ’s on a comparable scale. It is seen that both sequences move across three subintervals and gradually converge to the median 0.5. The Bayes estimate appears to converge faster than the MAP estimate as it is more aggressive in moving across a subinterval. In contrast, the MAP estimate tends to yield a conservative movement and a smoother path. These patterns are consistent with the properties of the mean and the mode with respect to the skewness of a distribution.
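As a rough worked illustration (added here, not a computation from the paper), the local linear approximation bound $\sup|M^{\prime\prime}|/(2s^{2})$ quoted before Example 1 can be evaluated for $M_{1}$ . Writing $z=6x-3$ and using $\phi^{\prime}(z)=-z\phi(z)$ ,

$$M_{1}^{\prime\prime}(x)=36\,\phi^{\prime}(z)=-36\,z\,\phi(z),\qquad\sup_{x\in(0,1)}|M_{1}^{\prime\prime}(x)|=36\,\phi(1)\approx 8.71,$$

since $|z\phi(z)|$ is maximised at $|z|=1$ . Hence

$$\frac{\sup|M_{1}^{\prime\prime}|}{2s^{2}}\approx\frac{8.71}{98}\approx 0.089\ \ (s=7),\qquad\frac{8.71}{882}\approx 0.0099\ \ (s=21),$$

where $s=21$ is a hypothetical larger value chosen only to show how quickly the bound shrinks as $s$ grows.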

3 Remarks

In this section, we give a few important remarks on and insights into the proposed method that can lead to a more efficient algorithm.

3.1 Posterior distributions of $\rho_{0}$ and $\rho_{1}$

By (6) and (7), we have linear interpolation for $x_{i}\in(v_{0},v_{1})$ as $F(x_{i})=q_{i}\rho_{0}+(1-q_{i})\rho_{1}$ with $q_{i}=(v_{1}-x_{i})/(v_{1}-v_{0})$ . Express the individual likelihood in (12) in terms of $(\rho_{0},\rho_{1})$ as

$$L_{i}(\rho_{0},\rho_{1})=1-y_{i}+(2y_{i}-1)q_{i}\rho_{0}+(2y_{i}-1)(1-q_{i})\rho_{1}.$$

Then, following the same routine as in Section 2 for $\theta$ , we obtain the marginal posterior distributions of $\rho_{0}$ and $\rho_{1}$ as follows.

Proposition 3.

Assume that the joint prior of $(\rho_{0},\rho_{1})$ associated with the subinterval $(v_{0},v_{1})$ is uniform with density (9). Then, the posterior distribution of $\rho_{0}$ is

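$$h_{m}(\rho_{0})=\frac{2}{c_{m}^{*}(\rho_{U}-\rho_{L})^{2}}\sum_{r=0}^{m}\frac{d_{m,r}(\rho_{0})(\rho_{U}^{r+1}-\rho_{0}^{r+1})}{r+1},$$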

where $d_{m,r}$ is defined in the same form as (15) with

$$a_{i}=1-y_{i}+(2y_{i}-1)q_{i}\rho_{0},\qquad b_{i}=(2y_{i}-1)(1-q_{i}),$$

and $c_{m}^{*}$ is the normalization constant. And the posterior distribution of $\rho_{1}$ is

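$$h_{m}(\rho_{1})=\frac{2}{c_{m}^{**}(\rho_{U}-\rho_{L})^{2}}\sum_{r=0}^{m}\frac{d_{m,r}(\rho_{1})(\rho_{1}^{r+1}-\rho_{L}^{r+1})}{r+1},$$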

where $d_{m,r}$ is defined in the same form as (15) with

$$a_{i}=1-y_{i}+(2y_{i}-1)(1-q_{i})\rho_{1},\qquad b_{i}=(2y_{i}-1)q_{i},$$

and $c_{m}^{**}$ is the normalization constant.

The recursion in (17) also holds for $d_{m,r}(\rho_{0})$ and $d_{m,r}(\rho_{1})$ . Like $h_{m}(\theta)$ in Section 2, $h_{m}(\rho_{0})$ and $h_{m}(\rho_{1})$ are completely determined given the data.

When $x_{n}$ enters a subinterval, either for the first time or on a re-visit, we can use the posterior distributions obtained from the previous subinterval to update $\rho_{L}$ or $\rho_{U}$ for the uniform prior of the current subinterval. More specifically, suppose that $x_{n}$ moves forward from the $t$ th subinterval to the $(t+1)$ th subinterval. Then we can set the fifth percentile of the posterior distribution of $\rho_{1}$ of the $t$ th subinterval as $\rho_{L}$ for the $(t+1)$ th subinterval. Similarly, suppose that $x_{n}$ moves backward from the $t$ th subinterval to the $(t-1)$ th subinterval. Then we can set the 95th percentile of the posterior distribution of $\rho_{0}$ of the $t$ th subinterval as $\rho_{U}$ for the $(t-1)$ th subinterval. In this way, the information from the neighboring subinterval is used for the new local model. We will use this strategy in the subsequent numerical study. As seen in simulation, these lower or upper fifth percentiles can narrow the range of the uniform prior significantly as data accumulate.

3.2 Posterior distribution of $\widetilde{\beta}$

The joint prior $h(\theta,\widetilde{\beta})$ in (10) can also be written as

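$$h(\widetilde{\beta},\theta)=\frac{2s\widetilde{\beta}\,I\{0<\widetilde{\beta}<\rho_{U}-\rho_{L},\ \ell(\widetilde{\beta})<\theta<u(\widetilde{\beta})\}}{(\rho_{U}-\rho_{L})^{2}},$$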

with

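$$\ell(\widetilde{\beta})=v_{1}-\frac{\rho_{U}-\alpha}{s\widetilde{\beta}},\qquad u(\widetilde{\beta})=v_{0}+\frac{\alpha-\rho_{L}}{s\widetilde{\beta}},$$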

which indicates that $\theta$ given $\widetilde{\beta}$ is uniform. Note that without further restriction of $\widetilde{\beta}$ , the interval $(\ell(\widetilde{\beta}),u(\widetilde{\beta}))$ can be as wide as $(-\infty,\infty)$ as pointed out before. To impose the conditions $\ell(\widetilde{\beta})\geq 0$ and $u(\widetilde{\beta})\leq 1$ requires $\widetilde{\beta}\geq\widetilde{\beta}_{0}$ where $\widetilde{\beta}_{0}=\max\{\frac{\rho_{U}-\alpha}{sv_{1}},\frac{\alpha-\rho_{L}}{s(1-v_{0})}\}$ . Then, the marginal prior of $\widetilde{\beta}$ is

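$$g_{0}(\widetilde{\beta})=\frac{2(\rho_{U}-\rho_{L}-\widetilde{\beta})}{\widetilde{c}_{0}(\rho_{U}-\rho_{L})^{2}},$$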

where $\widetilde{c}_{0}$ is the normalization constant (over $(\widetilde{\beta}_{0},\rho_{U}-\rho_{L})$ ).

Secondly, express $L_{i}(\theta,\widetilde{\beta})$ in (12) as

[TABLE]

where

[TABLE]

Following the same steps in (15) and (16), we get

Proposition 4.

Assume that the joint prior of $(\rho_{0},\rho_{1})$ associated with the subinterval $(v_{0},v_{1})$ is uniform with density (9). Then, the posterior distribution of $\widetilde{\beta}$ is

[TABLE]

where $d_{m,r}$ is defined in the same form as (15) with $a_{i}$ and $b_{i}$ replaced by $\widetilde{a}_{i}$ and $\widetilde{b}_{i}$ in (20) respectively, and $\widetilde{c}_{m}$ is the normalization constant.

3.3 Investigation of $x_{2}$

We present a detailed investigation of $x_{2}$ to reveal some features of the proposed procedure.

By (16), we have

[TABLE]

For simplicity, fix $\rho_{L}=0$ and $\rho_{U}=1$ in (11) for $\eta$ .

To examine the connection between $x_{2}$ and $x_{1}$ , we first consider the MAP estimate for $x_{2}$ . By solving $h_{1}^{\prime}(\theta)=0$ and checking the sign of $h_{1}^{\prime}(\theta)$ for cases of $\theta<\theta_{0}$ and $\theta>\theta_{0}$ where $\theta_{0}$ is defined in (11), we obtain that

[TABLE]

where $t_{0}=3^{-1}(2+\alpha)v_{0}+3^{-1}(1-\alpha)v_{1}$ and $t_{1}=3^{-1}\alpha v_{0}+(1-3^{-1}\alpha)v_{1}$ which divide $(v_{0},v_{1})$ into subintervals $(v_{0},t_{0})$ , $(t_{0},t_{1})$ and $(t_{1},v_{1})$ with fractions of $(1-\alpha)/3$ , $2/3$ and $\alpha/3$ , respectively. And $\theta_{0}$ falls in these subintervals depending on $\alpha$ value in $(0,1/4)$ , $[1/4,3/4]$ , $(3/4,1)$ respectively.
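The stated fractions and the position of $\theta_{0}$ can be checked directly from the definitions (a short verification added for the reader, using $\rho_{L}=0$ and $\rho_{U}=1$ so that $\theta_{0}=(1-\alpha)v_{0}+\alpha v_{1}$ ):

$$t_{0}-v_{0}=\frac{1-\alpha}{3}(v_{1}-v_{0}),\qquad v_{1}-t_{1}=\frac{\alpha}{3}(v_{1}-v_{0}),\qquad t_{1}-t_{0}=\frac{2}{3}(v_{1}-v_{0}),$$

$$\theta_{0}-v_{0}=\alpha(v_{1}-v_{0}),\qquad\theta_{0}<t_{0}\iff\alpha<\frac{1-\alpha}{3}\iff\alpha<\frac{1}{4},\qquad\theta_{0}>t_{1}\iff\alpha>1-\frac{\alpha}{3}\iff\alpha>\frac{3}{4}.$$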

A few interesting properties of the MAP estimate can be seen from (21). First, when $t_{0}<x_{1}<t_{1}$ , $x_{2}=\theta_{0}$ regardless of whether $y_{1}=1$ or 0. This outcome allows the search path to remain unchanged (with $1/4\leq\alpha\leq 3/4$ ) when the evidence for moving is not convincing. Second, when $1/4\leq\alpha\leq 3/4$ , the values of $x_{2}$ under the first two situations of (21) are rather counterintuitive. For instance, when $y_{1}=1$ with $x_{1}<t_{0}$ , we have $x_{1}<x_{2}$ . It would have been $x_{1}>x_{2}$ under a Robbins–Monro type procedure. However, the procedure does yield $x_{2}<\theta_{0}$ . Similarly, when $y_{1}=0$ with $x_{1}>t_{1}$ , we get $\theta_{0}<x_{2}<x_{1}$ , which would have been $x_{2}>x_{1}$ under a Robbins–Monro type procedure. This seemingly irrational move can actually avoid unnecessary oscillation of the search points in the absence of enough evidence and lead to a smooth path, as seen in Figures 1 and 2, in contrast to the zig-zag path of a Robbins–Monro type procedure. Third, $x_{2}$ can take values outside $(v_{0},v_{1})$ . For example, when $\alpha<1/4$ , $x_{1}<t_{0}$ and $y_{1}=1$ , we get $x_{2}<v_{0}$ ; and when $\alpha>3/4$ , $x_{1}>t_{1}$ and $y_{1}=0$ , we get $x_{2}>v_{1}$ . This results in the search point moving into the neighboring subinterval and consequently starting a new local Bayesian model.

The explicit expression of the MAP for $x_{3}$ can also be derived based on $h_{2}$ . It depends on $(x_{1},y_{1})$ and $(x_{2},y_{2})$ and is very complicated.

Next, by straightforward calculation, the Bayes estimate for $x_{2}$ is obtained as

[TABLE]

We can hardly interpret the connection of $x_{2}$ with $x_{1}$ from this analytic expression except that $x_{2}$ is a linear function of $x_{1}$ . However, numerical analysis shows that $x_{2}$ also possesses features similar to those described for the MAP estimate.

At last, inspired by the above investigation of $x_{2}$ , we find the proposed procedure is conservative in the sense of moving in large steps. So instead of choosing $x_{1}$ arbitrarily, we set $x_{1}=0.5$ , the middle of the search domain, as the starting point to begin cumulating information.

3.4 Choice of $s$

The number of subintervals $s$ determines the size of the neighborhood upon which a local model is built. When $\alpha$ is around the middle range, say $0.4\sim 0.6$, an integer in the range of $3\sim 10$ usually yields quick convergence in a moderate number of iterations. When $\alpha$ is close to the extreme values, implying that a ‘success’ or ‘failure’ is a rare event in the experiment, we want the search sequence to be conservative and move in small steps, especially in the early iterations. Therefore, a moderately large value of $s$ is recommended, say 20. For the same reason, we recommend using the MAP estimator instead of the Bayes estimator.

Second, to get a more efficient approximation and faster convergence, we recommend a two-stage procedure: first set $s$ to a small number to quickly reach the vicinity of the target, and then increase $s$ to a larger number for refined approximation. We provide a guideline for the choice of $s$ in Table 1. The odd numbers are chosen to avoid possible invalid denominators in $\eta$ in (11) during numerical calculation.

Third, if during the search updated information about the range of $\theta$ becomes available, one can re-define the search domain and use the available data after rescaling.

3.5 Alternative choice of prior $h(\rho_{0},\rho_{1})$

As seen in Section 2, the uniform prior of (9) leads to simple calculations in the derivation, but induces a distribution of $\theta$ that extends outside $(0,1)$. Alternatively, one can use another prior, such as

[TABLE]

or an even more informative prior to guarantee $\theta\in(0,1)$. Then a simple or explicit form of the posterior distribution may not be available. In this case, we can resort to a Markov chain Monte Carlo method, e.g. Gibbs sampling, to obtain the empirical posterior distribution of $\theta$ after (8) and (14). However, a preliminary numerical study shows that, because of the scarcity of the data and the simulation error, the resulting estimates are not as precise as those based on the exact distribution.

4 Numerical comparisons

We compare the proposed Bayesian stochastic approximation method using Bayes estimator (denoted by BSA-Bayes) and MAP estimator (denoted by BSA-MAP) with the classic Robbins–Monro procedure in (4) (denoted by RM), the efficient Robbins–Monro procedure in (5) (denoted by RMJ), the averaged trajectory method by Ruppert (1988) and Polyak and Juditsky (1992) (denoted by RPJ), and the Bayesian version of Wu’s logit-MLE method by Hung and Joseph (2014) (denoted by Wu-MAP).

Consider the following six functions adopted from Joseph (2004),

[TABLE]

which represent a shifted version of normal, uniform, logistic, extreme value, skewed logistic, and Cauchy distributions respectively with a common root at zero for all $\alpha$ -quantiles.

Since the RM, RMJ, RPJ and Wu-MAP procedures are not designed to search within $(0,1)$, we map the points in $(0,1)$ to $(-3,3)$ by the linear map $6x-3$ and map the resulting points back to $(0,1)$ for comparison on the same scale. For RM in (4), the optimal $a_{n}=\{nM^{\prime}(\theta)\}^{-1}$ is used. For RMJ in (5), the optimal $\beta=M^{\prime}(\theta)/\phi(\Phi^{-1}(\alpha))$ and $\tau_{1}=1$ are used as in Joseph (2004). For RPJ, we set $a_{n}=n^{-2/3}$ as recommended by Polyak and Juditsky (1992). For Wu-MAP, we set the hyperparameters $\mu_{0}=0$ and $\tau=\xi=3$ to cover a wide range of priors. For BSA, we set $s=17$ to represent a moderate number of sliced subintervals.
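The rescaling and the RM step size described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and `slope` stands for $M^{\prime}(\theta)$, which is assumed known only because the simulation uses the optimal step size.

```python
def to_wide(x):
    """Map a point in (0, 1) to (-3, 3) via the linear map 6x - 3."""
    return 6.0 * x - 3.0


def to_unit(z):
    """Inverse map from (-3, 3) back to (0, 1)."""
    return (z + 3.0) / 6.0


def rm_step(z_n, y_n, n, alpha, slope):
    """One step of recursion (4), x_{n+1} = x_n - a_n (y_n - alpha),
    with the step size a_n = 1 / (n * slope), where slope plays the role of M'(theta)."""
    a_n = 1.0 / (n * slope)
    return z_n - a_n * (y_n - alpha)
```

The competing procedures would run on the $(-3,3)$ scale, with `to_unit` applied to their output before errors are compared with those of BSA.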

Throughout, we set $x_{1}=0.5$ (corresponding to the starting point zero in $(-3,3)$) and $n=20$ to estimate $\theta$. For $\alpha$ taking values in $0.1,0.2,\ldots,0.9$, we compute the empirical root mean squared error (RMSE) of $x_{21}$ over 1,000 replications for every procedure.
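To make the evaluation criterion concrete, the sketch below computes an empirical RMSE of $x_{21}$ for the RM baseline only (not for BSA). It rests on a loud assumption: the response curve is a shifted logistic, $M(z)=\{1+e^{-(z+\operatorname{logit}\alpha)}\}^{-1}$, used here as a stand-in because the six test functions are not reproduced in this text. All function names are hypothetical.

```python
import math
import random


def rmse(estimates, theta):
    """Root mean squared error of a list of estimates around the target theta."""
    return math.sqrt(sum((e - theta) ** 2 for e in estimates) / len(estimates))


def rm_final_point(alpha, n_steps, rng):
    """Run recursion (4) on the (-3, 3) scale and return x_{n_steps+1} on the (0, 1) scale.

    Assumption (not from the paper): M(z) is a shifted logistic curve with
    M(0) = alpha, so the root is 0 and M'(0) = alpha * (1 - alpha).
    """
    shift = math.log(alpha / (1.0 - alpha))
    slope = alpha * (1.0 - alpha)
    z = 0.0  # corresponds to the starting point x_1 = 0.5
    for n in range(1, n_steps + 1):
        p = 1.0 / (1.0 + math.exp(-(z + shift)))
        y = 1 if rng.random() < p else 0  # binary response at the current point
        z = z - (y - alpha) / (n * slope)
    return (z + 3.0) / 6.0


def empirical_rmse(alpha, n_steps=20, replications=1000, seed=0):
    """Empirical RMSE of the final design point over independent replications."""
    rng = random.Random(seed)
    finals = [rm_final_point(alpha, n_steps, rng) for _ in range(replications)]
    return rmse(finals, 0.5)  # the root 0 maps to 0.5 on the (0, 1) scale
```

Calling `empirical_rmse(a)` for `a` in $0.1,\ldots,0.9$ gives one RMSE curve of the kind compared in this section, under the stated stand-in model.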

Figure 3 shows the empirical RMSE of $x_{21}$ obtained by the six competing methods. The findings are summarized as follows. (i) Under model 2, the RMJ and RPJ methods perform similarly. Both are superior to the RM and Wu-MAP methods, especially for extreme values of $\alpha$. The proposed method with the Bayes estimator is uniformly superior to the RM, RMJ and RPJ methods for $\alpha=0.2,\ldots,0.8$. For extreme values of $\alpha$ such as 0.1 or 0.9, the response curve is nearly flat at $\theta$ and the performance of BSA-Bayes deteriorates, as expected. In this case, the proposed method with the MAP estimator is the best, owing to the starting point advantage and its conservatism of movement as pointed out in Example 1. (ii) Under models 3 to 7, the results are similar to those under model 2. To save space, we defer them to the supplementary material. (iii) Under model 1, the $\alpha$-quantiles of the standard normal are spread across the search domain. The performances of RM, RMJ, RPJ and Wu-MAP are similar to those under model 2. The proposed method with the Bayes estimator outperforms these four methods for all values of $\alpha$. The RMSE of BSA-MAP is smallest for median estimation and increases with the distance between the root and the starting point, again because of its conservative movement.

Further simulation shows that the proposed method with other moderate numbers of subintervals, say $s=15\sim 25$, yields similarly superior results. The implicit stochastic approximation method given by (20) of Toulis and Airoldi (2015) was also run and found to be much inferior to the RMJ and RPJ methods in the small sample case. The results are omitted.

Finally, we note that the proposed method requires the uniqueness of $\theta$. When $M^{\prime}(\theta)$ is very close to zero, such as at $M^{-1}(0.9)$, $M^{-1}(0.99)$ or $M^{-1}(0.999)$, the proposed method can be inferior to the algorithm-based methods RMJ or RPJ. In that case, a hybrid method that switches to RMJ or RPJ after a moderate number of iterations of the proposed method can be used.

5 Applications

We present two applications of the proposed Bayesian stochastic approximation method for binary responses in this section.

5.1 Search for the root of a monotonic continuous function

For the original problem in (1), first convert $y_{n}\in\mathbb{R}$ to a response in $(0,1)$ through a sigmoid function $y_{n}^{*}=(1+e^{-by_{n}})^{-1}$ , where $b$ is a known scale parameter such that $y_{n}^{*}$ spreads well in $(0,1)$ . For example, if $y_{n}$ has a known range in $(-C,C)$ for some $C>0$ , we can set $b=3/C$ .

Second, approximate $y_{n}^{*}$ by a fraction represented by $a$ ones and $q-a$ zeros such that $a/q$ is closest to $y_{n}^{*}$ for some integer $q\geq 1$ . These $q$ binaries are then treated as independent responses at the same point $x_{n}$ . The minimum value of $q=1$ corresponds to the dichotomization of $y_{n}$ by its sign. Usually a number as small as $q=3$ is adequate for the approximation.
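The two conversion steps above can be sketched in Python as follows. The function names are hypothetical, and the tie-breaking rule (rounding half up when two fractions are equally close) is an assumption, since the text does not specify one.

```python
import math


def squash(y, b):
    """Sigmoid transform y* = 1 / (1 + exp(-b * y)) of a real-valued response."""
    return 1.0 / (1.0 + math.exp(-b * y))


def to_binaries(y_star, q):
    """Represent y* in (0, 1) by a ones and q - a zeros with a / q closest to y*.

    Ties are rounded up (an assumption; the source does not state a rule).
    """
    a = math.floor(y_star * q + 0.5)
    a = min(max(a, 0), q)  # guard against values at the ends of (0, 1)
    return [1] * a + [0] * (q - a)
```

For a response known to lie in $(-C,C)$, `squash(y, 3 / C)` keeps $y_{n}^{*}$ roughly between 0.05 and 0.95, and `to_binaries(squash(y, b), 1)` returns a single 1 for positive $y$ and a single 0 for negative $y$, matching the dichotomization by sign mentioned above.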

Based on the generated binary responses, the problem reduces to searching for the median of a distribution. We can then use the proposed method with the Bayes estimator in Section 2. More specifically, we set $s=5$ for the first ten steps and set $s=9$ for the subsequent steps as used in Section 4.

Example 2.

Consider the regression model $y_{n}=200(x_{n}-0.3)^{3}+\varepsilon_{n}$, where the $\varepsilon_{n}$ are independent standard normal variables. We applied the proposed method above with $b=1$, $q=2$ and $x_{1}=0.5$. Panel (a) of Figure 4 shows the empirical RMSEs (over 1,000 replications) of $x_{n}$ up to 30 steps in comparison with those obtained by applying the (scaled) RMJ procedure (with the same starting point) to the binaries obtained from the signs of $y_{n}$. It is seen that the proposed method dominates the RMJ procedure.

5.2 Search for a minimum of a convex function

Suppose that $\varphi(x)$ is a convex function. We seek a sequential design for finding the minimizer $\theta$ of $\varphi(x)$. This is equivalent to finding $\theta$ such that $G(\theta)=0$, where $G(x)=\lim_{c\rightarrow 0}\{\varphi(x+c)-\varphi(x-c)\}/(2c)$. The Kiefer–Wolfowitz procedure (Kiefer and Wolfowitz, 1952) entails the recursion

[TABLE]

where $y_{n1}$ and $y_{n2}$ are two independent responses at $x_{n}+c_{n}$ and $x_{n}-c_{n}$ with mean $\varphi(x_{n}+c_{n})$ and $\varphi(x_{n}-c_{n})$ respectively, $\gamma_{n}$ and $c_{n}$ are two positive constant sequences decreasing to zero and satisfying $\sum\gamma_{n}=\infty$ , $\sum\gamma_{n}c_{n}<\infty$ , and $\sum\gamma_{n}^{2}c_{n}^{-2}<\infty$ . For example, $\gamma_{n}=n^{-1}$ and $c_{n}=n^{-1/3}$ as recommended by Kiefer and Wolfowitz (1952).

Let $\widetilde{y}_{n}=(y_{n1}-y_{n2})/c_{n}$ . We apply the previous procedure in Section 5.1 to $(x_{n},\widetilde{y}_{n})$ to approximate the root of $G$ .
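For orientation, the sketch below shows how the finite-difference response $\widetilde{y}_{n}$ is formed and how it would drive a Kiefer–Wolfowitz baseline. Because the displayed recursion is not reproduced here, the update $x_{n+1}=x_{n}-\gamma_{n}\widetilde{y}_{n}$ is an assumed standard form, not a quotation of the paper's equation. The function name is hypothetical, and the proposed method would instead feed $(x_{n},\widetilde{y}_{n})$ into the procedure of Section 5.1.

```python
import random


def kiefer_wolfowitz(phi, x1, n_steps, noise_sd, rng):
    """Kiefer-Wolfowitz-style search with gamma_n = 1/n and c_n = n^(-1/3).

    Assumed update (standard form, not quoted from the paper):
        x_{n+1} = x_n - gamma_n * y_tilde_n,
    where y_tilde_n = (y_{n1} - y_{n2}) / c_n as defined in the text.
    """
    x = x1
    path = [x]
    for n in range(1, n_steps + 1):
        gamma_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        y_plus = phi(x + c_n) + rng.gauss(0.0, noise_sd)   # response at x_n + c_n
        y_minus = phi(x - c_n) + rng.gauss(0.0, noise_sd)  # response at x_n - c_n
        y_tilde = (y_plus - y_minus) / c_n
        x = x - gamma_n * y_tilde
        path.append(x)
    return path
```

With a steep objective the early steps $\gamma_{n}\widetilde{y}_{n}$ can be very large, which is one practical reason to rescale responses before running such a recursion.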

Example 3.

Consider the regression model $y_{n}=200(x_{n}-0.3)^{2}+\varepsilon_{n}$, where the $\varepsilon_{n}$ are independent standard normal variables. We conducted a similar comparison by applying the competing methods in Example 2 to $\widetilde{y}_{n}$, with $c_{n}$ defined as above. Panel (b) of Figure 4 shows that the proposed method outperforms the method based on the RMJ procedure in terms of RMSE.

6 Multi-dimensional extension

6.1 Method

We extend the proposed method for quantile estimation to the multi-dimensional case.

Let $M(\mathbf{x})$ be the distribution function of a $p$ -dimensional continuous random vector $\mathbf{x}=(x_{1},\ldots,x_{p})^{\top}$ with the domain scaled in the unit hypercube $(0,1]^{p}$ . The goal is to find the generalized multivariate quantile defined by

[TABLE]

where $U(\mathbf{x})$ is a known function. This is a special case of the notion of generalized multivariate quantiles introduced by Einmahl and Mason (1992). As in the univariate case, assume that $\boldsymbol{\theta}$ is unique.

The idea of the extension is to use a conditional approach to reduce the problem to the univariate case along each coordinate so that the proposed method in Section 2 can be applied.

First, we introduce some notation. Divide $(0,1]$ equally into $s$ subintervals along each coordinate. For any $\mathbf{x}\in(0,1]^{p}$, let $t_{j}=\lceil x_{j}s\rceil$ for $j=1,\ldots,p$ and $\mathbf{t}=(t_{1},\ldots,t_{p})^{\top}$. Then $\mathbf{x}$ is contained in the unique hypercube $H(\mathbf{x})=\prod_{j=1}^{p}\left(\frac{t_{j}-1}{s},\frac{t_{j}}{s}\right]$. Let $\mathbf{1}_{p}$ denote a vector of $p$ ones and $\mathbf{e}_{a}$ denote the $a$th column vector of the $p\times p$ identity matrix. Denote the following $p+1$ vertices of $H(\mathbf{x})$ by

[TABLE]

Notice that $\mathbf{v}_{0},\mathbf{v}_{1},\ldots,\mathbf{v}_{p}$ are arranged in a helix.
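The cell indexing $t_{j}=\lceil x_{j}s\rceil$ and the hypercube $H(\mathbf{x})$ introduced above can be sketched in Python as follows. The function names are hypothetical, and the vertices $\mathbf{v}_{0},\ldots,\mathbf{v}_{p}$ are deliberately not computed because their defining display is not reproduced here.

```python
import math


def cell_index(x, s):
    """Return t = (t_1, ..., t_p) with t_j = ceil(x_j * s) for x in (0, 1]^p."""
    return [math.ceil(xj * s) for xj in x]


def cell_bounds(x, s):
    """Return the half-open intervals ((t_j - 1)/s, t_j/s] whose product is H(x)."""
    return [((t - 1) / s, t / s) for t in cell_index(x, s)]
```

Points that lie exactly on a cell boundary are sensitive to floating-point rounding in `x_j * s`, so a production implementation might prefer exact rational arithmetic for that case.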

Second, approximate $M$ in $H(\mathbf{x}_{n})$ by the segments of $p$ hyperplanes intersected by the hypercube respectively. The $j$ th hyperplane passes through the point $(\mathbf{x}^{(j)},\alpha)$ with

[TABLE]

and is expressed as

[TABLE]

where $\boldsymbol{\beta}=(\beta_{1},\ldots,\beta_{p})^{\top}$ with $\beta_{1},\ldots,\beta_{p}$ all positive.

For $a=0,1,\ldots,p$, let $\rho_{a}=F_{j}(\mathbf{v}_{a})$ and $\boldsymbol{\rho}=(\rho_{0},\ldots,\rho_{p})^{\top}$. Then, by (22) and (23), we have $\rho_{0}<\rho_{1}<\cdots<\rho_{p}$ and the solution of $(\theta_{j},\boldsymbol{\beta})$ in terms of $\boldsymbol{\rho}$ is given by

[TABLE]

Let $\widetilde{\beta}_{a}=\beta_{a}(v_{aa}-v_{a-1,a})=\rho_{a}-\rho_{a-1}$ for $a=1,\ldots,p$ and $\widetilde{\boldsymbol{\beta}}=(\widetilde{\beta}_{1},\ldots,\widetilde{\beta}_{p})^{\top}$. The Jacobian of the transformation from $\boldsymbol{\rho}$ to $(\theta_{j},\widetilde{\boldsymbol{\beta}})$ is $s\widetilde{\beta}_{j}$.

Assume the joint prior of $\boldsymbol{\rho}$ is uniform with density

[TABLE]

Further denote $\widetilde{\boldsymbol{\beta}}_{-j}=(\widetilde{\beta}_{1},\ldots,\widetilde{\beta}_{j-1},\widetilde{\beta}_{j+1},\ldots,\widetilde{\beta}_{p})^{\top}$. By (24), (25) and (26), the joint prior of $(\theta_{j},\widetilde{\boldsymbol{\beta}})$ is

[TABLE]

where

[TABLE]

It is seen that the joint prior distribution of $\widetilde{\boldsymbol{\beta}}_{-j}$ is uniform on the simplex $\Delta_{j}$ defined by (28). Then, given $\widetilde{\boldsymbol{\beta}}_{-j}$, the conditional distribution of $\theta_{j}$ after integrating out $\widetilde{\beta}_{j}$ and imposing the restriction $0<\theta_{j}<1$ is

[TABLE]

where $V_{j}$ is the volume of $\Delta_{j}$ (depending on $\mathbf{x}_{n}$) and $c_{0j}$ is the conditional normalization constant (depending on $\widetilde{\boldsymbol{\beta}}_{-j}$).

Alternatively, express

[TABLE]

where

[TABLE]

The further restriction $0<\theta_{j}<1$, which amounts to $0\leq\ell_{j}(\widetilde{\boldsymbol{\beta}})<u_{j}(\widetilde{\boldsymbol{\beta}})\leq 1$, requires $\widetilde{\beta}_{j0}\leq\widetilde{\beta}_{j}\leq\widetilde{\beta}_{j1}$ with

[TABLE]

Denote the subsequence of $\mathbf{x}_{n}$ contained in $H(\mathbf{x}_{n})$ by $\mathbf{x}_{i_{1}},\ldots,\mathbf{x}_{i_{m}}$ with $1\leq m\leq n$ . Express the likelihood of $\mathbf{x}_{i}\in H(\mathbf{x}_{n})$ as

[TABLE]

where

[TABLE]

or as

[TABLE]

where

[TABLE]

It should be noted that, unlike in the univariate case, here $a_{i}$ depends not only on $(\mathbf{x}_{i},y_{i})$ but also on the current point $\mathbf{x}_{n}$ through $\alpha_{i}$.

Combining the joint prior of $(\theta_{j},\widetilde{\boldsymbol{\beta}})$ in (27) or (29) and the likelihood of the subsequence, we get the posterior distribution of $(\theta_{j},\widetilde{\boldsymbol{\beta}})$, which is proportional to $h(\theta_{j},\widetilde{\boldsymbol{\beta}})\prod_{k=1}^{m}L_{i_{k}}(\theta_{j},\widetilde{\boldsymbol{\beta}})$. Given $\widetilde{\boldsymbol{\beta}}_{-j}$, the conditional posterior distributions of $\theta_{j}$ and $\widetilde{\beta}_{j}$ are obtained in the same way as in the univariate case in Sections 2 and 3.2 respectively. We summarize the results in the following proposition.

Proposition 5.

Assume that the joint prior of $\boldsymbol{\rho}$ associated with the vertices of the hypercube $H(\mathbf{x}_{n})$ is uniform with density (26). Then, the conditional posterior distribution of $\theta_{j}$ restricted to $(0,1)$ is

[TABLE]

where $d_{m,r}$ is defined in (15) with $a_{i}$ and $b_{i}$ given in (31) and $c_{mj}$ is the conditional normalization constant. And the conditional posterior distribution of $\widetilde{\beta}_{j}$ is

[TABLE]

where $d_{m,r}$ is defined in the same form as (15) with $a_{i}$ and $b_{i}$ replaced by $\widetilde{a}_{i}(\widetilde{\beta}_{j}\mid\widetilde{\boldsymbol{\beta}}_{-j})$ and $\widetilde{b}_{i}(\widetilde{\beta}_{j})$ in (32) respectively, and $\widetilde{c}_{mj}$ is the conditional normalization constant over the range $(\widetilde{\beta}_{j0},\widetilde{\beta}_{j1})$ given in (30).

Proposition 5 reduces to the results in Propositions 1 and 4 when $p=1$ .

The next design point along the $j$ th coordinate is then taken to be

[TABLE]

where

[TABLE]

Since $h_{m}(\theta_{j}\mid\widetilde{\boldsymbol{\beta}}_{-j})$ is completely determined, the conditional expectation of $\theta_{j}$ can be calculated numerically. The expectation with respect to $\widetilde{\boldsymbol{\beta}}_{-j}$ can be approximated by averaging the conditional expectations over a finite number of values of $\widetilde{\boldsymbol{\beta}}_{-j}$ taken uniformly from the simplex. For instance, when $p=2$, the simplex $\Delta_{j}$ for $\widetilde{\beta}_{-j}$ reduces to $(0,u_{-j})$ where

[TABLE]

and $x_{n,-j}$, $v_{0,-j}$ and $v_{2,-j}$ are the elements of $\mathbf{x}_{n}$, $\mathbf{v}_{0}$ and $\mathbf{v}_{2}$ that remain after removing the $j$th element, respectively.

Finally, we use $U$ to determine the next design point out of the $p$ candidates, i.e.,

[TABLE]

Note that in the multi-dimensional case, the derivation of the marginal posterior distribution of $\rho_{0},\ldots,\rho_{p}$ is much more complicated than in the univariate case in Section 3.1. Moreover, there are in fact $p$ different ways to update $\rho_{L}$ (or $\rho_{U}$) in (26), depending on the coincidence of $\mathbf{v}_{0}$ (or $\mathbf{v}_{p}$) with one vertex of some neighboring hypercube. So in the following numerical study, we simply fix $\rho_{L}=0$ and $\rho_{U}=1$ for all hypercubes and let the data inside the hypercube inform the posterior distribution of the parameter of interest.

6.2 Numerical illustration

We illustrate the proposed method by a few examples. Consider the following three models:

[TABLE]

where $x_{1},x_{2}\in(0,1)$ and $\Phi(z_{1},z_{2},\rho)$ is the distribution function of bivariate normal variables with zero means, unit marginal variances and correlation coefficient $\rho$ .

We use the same two-stage procedure with respect to the choice of $s$ as for the univariate case in Section 4. For illustration, we set the starting point $\mathbf{x}_{1}=(0.6,0.6)^{\top}$ for all cases and recommend using MAP estimator for $\alpha=0.25$ to be conservative. The uniform distribution for $\widetilde{\beta}_{-j}$ over $(0,u_{-j})$ in (33) is approximated by a discrete uniform distribution over $\{iu_{-j}/8:i=1,\ldots,7\}$ .
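The discrete approximation of the uniform distribution on $(0,u_{-j})$ described above amounts to averaging over seven equally spaced grid points. A minimal Python sketch follows. Here `cond_mean` is a hypothetical placeholder for the conditional posterior expectation of $\theta_{j}$ given $\widetilde{\beta}_{-j}$, which is not implemented.

```python
def beta_grid(u, k=8):
    """Grid {i * u / k : i = 1, ..., k - 1} approximating the uniform law on (0, u)."""
    return [i * u / k for i in range(1, k)]


def average_over_grid(cond_mean, u, k=8):
    """Average a conditional expectation over the discrete uniform grid.

    cond_mean is a user-supplied callable (hypothetical here) returning
    E(theta_j | beta_tilde_{-j}) for a given value of beta_tilde_{-j}.
    """
    grid = beta_grid(u, k)
    return sum(cond_mean(b) for b in grid) / len(grid)
```

With the default `k=8` the grid has seven points, matching the set $\{iu_{-j}/8:i=1,\ldots,7\}$ used in the numerical study.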

In these examples, by symmetry we have $\theta_{1}=\theta_{2}$ and the determination for the next point in (34) can be modified as $j^{*}=\mbox{\rm argmin}_{j=1,2}|x_{n,-j}-\widetilde{\theta}_{j}|$ , i.e. to choose a point that is closer to the diagonal line $x_{1}=x_{2}$ .
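For $p=2$, the modified selection rule $j^{*}=\mathrm{argmin}_{j=1,2}|x_{n,-j}-\widetilde{\theta}_{j}|$ can be sketched in Python as below. The function name is hypothetical, indices are zero-based, and breaking ties in favor of the first coordinate is an assumption not stated in the text.

```python
def choose_coordinate(x_n, theta_tilde):
    """Return the zero-based index j* minimizing |x_{n,-j} - theta_tilde_j| for p = 2.

    x_n is the current point (x_{n,1}, x_{n,2}); theta_tilde holds the two
    coordinate-wise candidate estimates. Ties go to the first coordinate
    (an assumption; the source does not state a rule).
    """
    gap_first = abs(x_n[1] - theta_tilde[0])   # j = 1: compare with the other element x_{n,2}
    gap_second = abs(x_n[0] - theta_tilde[1])  # j = 2: compare with the other element x_{n,1}
    return 0 if gap_first <= gap_second else 1
```

The rule favors the candidate whose updated coordinate would sit closest to the unchanged one, that is, the candidate point nearest the diagonal $x_{1}=x_{2}$.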

Panel (a) of Figure 5 presents a single search path under $M_{8}$ with $\alpha=0.05$, where the dotted curve is the solution set of $M_{8}^{-1}(0.05)$ and $\boldsymbol{\theta}=(0.3733,0.3733)^{\top}$ is indicated by ‘$\diamondsuit$’. Panels (b), (c) and (d) of Figure 5 show the empirical RMSE (over 1,000 replications) of $\mathbf{x}_{n}$ up to 60 steps obtained by the proposed method using different estimators (in parentheses). The convergence of the procedure is clear. For the case with $\alpha=0.5$, the small value of RMSE at the first few steps is due to the starting point.

The results for models 9 and 10 are similar and hence omitted.

7 Conclusion and discussion

The proposed Bayesian stochastic approximation method uses an adaptive local model and yields a recursive updating scheme in terms of the posterior distribution instead of the estimate itself. It has the advantage of successively utilizing the information of the neighboring points to improve the estimation efficiency, and thus reduces the variation or uncertainty carried by a single point. However, several questions remain unsettled. First, the asymptotic behavior of the procedure in both the univariate and multivariate cases is not fully understood. Second, a refined prior in both the univariate and multi-dimensional cases is worth further investigation. Third, a more efficient algorithm is desired, especially for the multi-dimensional situation, where information about the posterior distribution of $\widetilde{\beta}$ can be used.

Because of the rich and broad applications of stochastic approximation, we anticipate new explorations of the proposed method, in interaction with different techniques, in the many fields mentioned at the beginning of the article.

An R package is provided in the supplementary material.

Acknowledgements

The research is supported by the National Natural Science Foundation of China (grant 11271134) and the 111 Project (B14019) of Chinese Ministry of Education.

Appendix

Figure 6 shows the empirical RMSE of $x_{21}$ obtained by the six competing methods under models 3 to 7. The results are similar to those obtained under model 2.

Bibliography

Anbar, D. (1978). A stochastic Newton–Raphson method. Journal of Statistical Planning and Inference 2, 153–163.
Anderson, T. W. and J. Taylor (1976). Some experimental results on the statistical properties of least squares estimates in control problems. Econometrica 44, 1289–1302.
Anderson, T. W. and J. Taylor (1979). Strong consistency of least squares estimates in dynamic models. Annals of Statistics 7, 484–489.
Burkholder, D. L. (1956). On a class of stochastic approximation procedures. Annals of Mathematical Statistics 27, 1044–1059.
Chaloner, K. and K. Larntz (1989). Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference 21, 191–208.
Chaudhuri, P. and P. A. Mykland (1993). Nonlinear experiments: optimal design and inference based on likelihood. Journal of the American Statistical Association 88, 538–546.
Cheung, Y. K. (2010). Stochastic approximation and modern model-based designs for dose-finding clinical trials. Statistical Science 25, 191–201.
Chung, K. L. (1954). On a stochastic approximation method. Annals of Mathematical Statistics 25, 463–483.
