A Bayesian Stochastic Approximation Method
Jin Xu, Cui Xiong, Rongji Mu

TL;DR
This paper introduces a Bayesian stochastic approximation method that enhances small sample estimation of regression roots through adaptive modeling, demonstrating superior finite-sample performance over traditional procedures.
Contribution
It presents a novel Bayesian approach with adaptive local modeling and nonrecursive iteration, improving efficiency and consistency in root estimation tasks.
Findings
Superior finite-sample performance compared to Robbins--Monro procedures
Strong consistency of the Bayesian estimator established
Extensions to extremum searching and multivariate quantiles included
Abstract
Motivated by the goal of improving the efficiency of small sample design, we propose a novel Bayesian stochastic approximation method to estimate the root of a regression function. The method features adaptive local modelling and nonrecursive iteration. Strong consistency of the Bayes estimator is obtained. Simulation studies show that our method is superior in finite-sample performance to Robbins--Monro type procedures. Extensions to searching for extrema and a version of generalized multivariate quantile are presented.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStochastic Gradient Optimization Techniques · Statistical Methods and Inference · Neural Networks and Applications
A Bayesian Stochastic Approximation Method
Jin Xu111Corresponding author: School of Statistics, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China, e-mail: [email protected] and Cui Xiong and Rongji Mu
School of Statistics
East China Normal University
Shanghai 200241, China
Abstract
Motivated by the goal of improving the efficiency of small sample design, we propose a novel Bayesian stochastic approximation method to estimate the root of a regression function. The method features adaptive local modelling and nonrecursive iteration. Strong consistency of the Bayes estimator is obtained. Simulation studies show that our method is superior in finite-sample performance to Robbins–Monro type procedures. Extensions to searching for extrema and a version of generalized multivariate quantile are presented.
Key words: adaptive local modelling; Kiefer–Wolfowitz process; nonrecursive iteration; Robbins–Monro process; stochastic approximation.
1 Introduction
We consider the problem of finding the unique root of a unknown function in the regression model
[TABLE]
where is unobservable random error. The approach by stochastic approximation uses a sequential design strategy to successively choose on which the response is observed with mean so that converges to in some sense. The feature of response-adaptiveness is attractive and can often be more efficient than fixed sample design. Over years, stochastic approximation and its variants have broad applications in design of experiments, clinical trials, dynamic programming, sequential learning, to name just a few (Finney, 1978; Kushner and Yin, 1997; Spall, 2003).
Here we give a brief review which is by no means to be complete but just covers some major progresses. In the fundamental paper of Robbins and Monro (1951), they proposed a recursive design of the form
[TABLE]
where are positive constants, and showed that converges to in probability when and assuming satisfies some regularity conditions. It is a stochastic analogy to the deterministic Newton’s method where (The prime denotes the first derivative.) and is referred as Robbins–Monro procedure. The almost sure convergence was later proved through different approaches (Dvoretzky, 1956; Gladyshev, 1965; Robbins and Siegmund, 1971). Inspired by the Liapounov functions in the stability theory of ordinary differential equations, Sacks (1958) established the asymptotic normality of and showed that under certain regularity conditions the asymptotically optimal choice of in (2) is where . (See also Chung (1954), Burkholder (1956), Hodges and Lehmann (1956).)
Ever since, much effects have been made to estimate . Lai and Robbins (1979, 1981) proposed an adaptive Robbins–Monro procedure in the form of
[TABLE]
where is a truncated version of the least square estimate of the regression slope given by and . Strong consistency of was established (Lai and Robbins, 1981, 1982). We refer the readers to Venter (1967), Anderson and Taylor (1976), Anbar (1978) and Anderson and Taylor (1979) for some related versions. An excellent review about these variants is given by Lai (2003).
In a different route, Ruppert (1988) and Polyak and Juditsky (1992) proposed using averaged trajectories of (2), , to estimate the root and demonstrated the almost sure convergence when satisfies the condition of being sufficiently slowly decreasing in the sense of and .
An important case of (1) is when is a distribution function and is binary response. Then, the Robbins–Monro procedure for finding the -quantile of , assuming it is unique, is given by
[TABLE]
The corresponding adaptive version is .
The rationale of these procedures is clear. When observing a ‘success’ at the th step (such as explosion in the sensitivity experiment or occurrence of adverse events in dose-finding clinical trial), reduce the current level for the next design point; when observing a ‘failure’, increase the current level for the next design point. As the number of iteration increases, the magnitude of change converges to zero. This type of scheme is in a similar spirit to the ‘up-and-down’ method (Dixon and Mood, 1948; Dixon, 1965) for estimating the median in sensitivity experiments. To estimate in the binary data case, Wu (1985) proposed fitting a two-parameter logit model for the available data to obtain an initial maximum likelihood estimate (MLE) of . Some initial runs are required to have the condition for the existence and uniqueness of this MLE met. (See also Sitter and Wu (1993).) An important contribution by Joseph (2004) is the proposal of an efficient Robbins–Monro procedure which entails the recursion
[TABLE]
where
[TABLE]
, , and are the distribution function and density of the standard normal variable respectively. The introduction of constant sequence helps reduce the oscillation of at early steps. It is shown to have a faster convergence than the usual Robbins–Monro procedure when takes extreme values. Wu and Tian (2014) proposed a three-phase design that combines some initial design and Joseph’s efficient modification to obtain a more steady method. Recently, Toulis and Airoldi (2015) proposed an implicit stochastic approximation method which improves the classic Robbins–Monro procedure by a stochastic fixed-point equation. It requires to run many additional experiments at every step of (2). Thus, it may not be feasible for a small sample design.
Other model-based designs for quantal response focus on estimation of the coefficients of a parametric model (Wu, 1986; Chaloner and Larntz, 1989; Chaudhuri and Mykland, 1993; Neyer, 1994; Dror and Steinberg, 2006, 2008; Hung and Joseph, 2014). The advantage of this approach is that one can use a single design to estimate the global response curve that includes all quantiles. While the disadvantages are that i) it needs to make assumptions (about the model and/or hyperparameters); and ii) the designs usually require initial data to start with which can be as many as ten or more. Hung and Joseph (2014) proposed a simple Bayesian version of Wu (1985)’s logit-MLE method, which makes the design fully sequential from . It postulates independent informative priors on the parameters of a logistic model for given by with and . And the sequential design estimates the -quantile by , where and are the maximum-a-posteriori (MAP) estimate of after samples.
In this paper, we limit our study to the root finding problem. We point out several limitations associated with the Robbins–Monro type procedures. First, for these algorithm-based procedures such as (2), the averaged trajectory of (2) and (5), the adaptation through the last experiment data via recursion is subject to inadequacy. Experiments at points in a neighborhood would carry useful information for as well especially in the early stage. Second, large oscillation caused by these up-or-down recursions in early iteration can be harmful and inefficient. Third, for the procedures such as (3) that reply heavily on the estimation of , as clusters around to , little information is gained to estimate directly. So even for consistent estimator, the finite-sample performance can still be far from satisfaction from a practical point of view.
On the other hand, the Bayesian paradigm is known to be suitable for such adaptive learning problem. Some applications in a closely related problem of dose-finding in clinical trials have been reported (Cheung, 2010; Thall, 2010). Like Hung and Joseph (2014)’s method, Bayesian models are used to update the underlying distribution globally. Little has been seen for solving the local root for -quantile directly. Using martingale theory, Hu (1998) established the strong consistency of the Bayes estimator under a general setting of a nonlinear regression model. We will make use of this result for later development.
Motivated by the aforementioned drawbacks of the Robbins–Monro type methods and the advantage of Bayesian approach, we propose a novel model-based stochastic approximation procedure that circumvents direct estimation of through integration. Specifically, the new method builds a local linear model for around and obtains the Bayes estimator as a nonrecursive solution for . Strong consistency is obtained. These constitute the main contents of Section 2. In Section 3, we give a few important remarks and insights of the proposed method that lead to more efficient algorithm. More importantly, in Section 4 we demonstrate by simulation that the proposed method yields a smooth search path and results in a superior finite-sample performance to the competing methods. In Section 5, we present applications of the new method to the general root-finding problem in (1) and Kiefer–Wolfowitz procedure (Kiefer and Wolfowitz, 1952) to find the minimum of an unknown function. In Section 6, we extend the proposed method to estimate a version of generalized multivariate quantile. Section 7 concludes the paper with some discussions.
2 Method
We begin with the problem of quantile estimation with binary responses under the setting in (4).
First, we introduce two preliminary processes before sequential experiment. (i) Scale the search domain of to the interval . It can be done easily once we have some general idea of the range of . (ii) Divide the interval equally into subintervals. We will provide guideline for the selection of in Section 3.4.
Denote the (scaled) data up to the th step by . Next, we construct a local Bayesian model based on the current point . Observe that is contained in the subinterval , where , , and is the ceiling function. Approximate in by the segment of a line through the point with positive slope given by
[TABLE]
Note that itself is not necessarily in .
For the convenience of later calculation, denote . Let
[TABLE]
Then, and are 1-1 connected with and through
[TABLE]
Assume that the joint prior of is uniform with density
[TABLE]
where are two given constants, and is the indicator function. For example, the constants and are considered to be noninformative. We have more discussion about the determination of and in Section 3.1. It should be noted that under this prior, can take value in through (8) as linear extrapolation. For later development, we will restrict the calculation of the posterior distribution of in the domain by truncation. And we will introduce other prior which meets the restriction for in Section 3.5.
The subsequent development for finding the posterior distribution of is standard. After accounting for the Jacobian from (7), the joint prior density of is
[TABLE]
which can be expressed as
[TABLE]
with
[TABLE]
Note that . Integrating out in (10) and imposing the restriction that , we obtain the prior density of as
[TABLE]
where is the normalization constant.
Next, we will only use the design points contained in to update the Bayesian model. This idea of using most recent design points is also seen in Anbar (1978) to estimate .
Denote the subsequence of in by . Clearly, since at least is in . Denote the likelihood function of at point by , which is expressed as
[TABLE]
where
[TABLE]
By (10) and (12), the posterior distribution of is proportional to
[TABLE]
For , express the coefficient of in as
[TABLE]
where is the collection of -choose- distinct subsets of indices out of and . We emphasize that only depends on data observed in the subinterval.
Integrating out in (14), we get the posterior distribution of as
[TABLE]
where is the normalization constant. A few points are worthy of being noted. First, is a two-piecewise homogeneous polynomial of order and is differentiable everywhere except at . Second, is invariant to the permutation of the points in the subsequence. Third, the modification of to the prior takes place in a multiplicative fashion. The weighted summand can be viewed as the th order interaction of the points in the subinterval. Moreover, we can write recursively as
[TABLE]
where , . It provides a simple way to obtain successively. Based on (17), we express in a recursive form as
[TABLE]
where .
We summarize the above results in the following proposition.
Proposition 1**.**
Assume that the joint prior of associated with the subinterval is uniform with density (9). Then, the posterior distribution of restricted in is given in (16) satisfying a recursion in (18).
Finally, we set the next point to be the Bayes estimator with respect to , i.e.
[TABLE]
Since or is completely determined in (16), can be easily calculated up to a desired precision. We can also easily obtain an equal tail credible interval for based on .
When is linear as in (6), it is clear that the random error for the binary response satisfies the conditions and . Then, by Theorem 1 of Hu (1998), we have the following result about the consistency of the procedure.
Proposition 2**.**
For binary response with mean value given by the model (6), the Bayesian stochastic approximation procedure given by (19) is strongly consistent.
When is nonlinear, by Taylor expansion differs from by a quantity bounded by , where denotes the second derivative assuming it exists. As increases, we can increase so that the local linear approximation is well maintained. Thus, we expect the consistency of the procedure to hold. We demonstrate its superb finite-sample performance in Section 4.
Beside the Bayes estimator, we can also use the posterior mode, i.e. maximum a posterior (MAP) estimator, for the next point. We illustrate the procedure by an example.
Example 1**.**
Let for . Consider estimating the median of . Set and . And set and for all subintervals. Figures 1 and 2 demonstrate one search path up to 30 steps and the evolution of the corresponding posterior distributions (which equals for some in the associated subinterval) obtained by the proposed method using the Bayes estimate and the MAP estimate, respectively. Notice that . For the purpose of illustrating the shape of , we multiply by to make the amplified s in a comparable scale. It is seen that both sequences move across three subintervals and gradually converge to the median 0.5. The Bayes estimate appears to converge faster than the MAP estimate as it is more aggressive to move across a subinterval. While, the MAP estimate tends to yield a conservative movement and a more smooth path. These patterns are consistent to the properties of mean and median with respect to the skewness of a distribution.
3 Remarks
In this subsection, we give a few important remarks and insights of the proposed method that can lead to more efficient algorithm.
3.1 Posterior distributions of and
By (6) and (7), we have linear interpolation for as with . Express the individual likelihood in (12) in terms of as
[TABLE]
Then, following the same routine as in Section 2 for , we obtain the marginal posterior distributions of and as follows.
Proposition 3**.**
Assume that the joint prior of associated with the subinterval is uniform with density (9). Then, the posterior distribution of is
[TABLE]
where is defined in the same form as (15) with
[TABLE]
and is the normalization constant. And the posterior distribution of is
[TABLE]
where is defined in the same form as (15) with
[TABLE]
and is the normalization constant.
The recursion in (17) also holds for and . Like in Section 2, and are completely determined given the data.
When enters a subinterval for either the first time or re-visit, we can use the posterior distributions obtained from the previous subinterval to update or for the uniform prior of the current subinterval. More specifically, suppose that moves forward from the th subinterval to the th subinterval. Then we can set the fifth percentile of the posterior distribution of of the th subinterval as for the th subinterval. And suppose that moves downward from the th subinterval to the th subinterval. Then we can set the 95th percentile of the posterior distribution of of the th subinterval as for the th subinterval. In this way, the information from the neighboring subinterval is used for the new local model. We will use this strategy in the subsequent numerical study. As seen in simulation, these lower or upper fifth percentile can actually narrow the range of the uniform prior significantly as data cumulates.
3.2 Posterior distribution of
The joint prior in (2) can also be written as
[TABLE]
with
[TABLE]
which indicates that given is uniform. Note that without further restriction of , the interval can be as wide as as pointed out before. To impose the conditions and requires where . Then, the marginal prior of is
[TABLE]
where is the normalization constant (over ).
Secondly, express in (12) as
[TABLE]
where
[TABLE]
Following the same steps in (15) and (16), we get
Proposition 4**.**
Assume that the joint prior of associated with the subinterval is uniform with density (9). Then, the posterior distribution of is
[TABLE]
where is defined in the same form as (15) with and replaced by and in (20) respectively, and is the normalization constant.
3.3 Investigation of
We present a detailed investigation of to reveal some features of the proposed procedure.
By (16), we have
[TABLE]
For simplicity, fix and in (11) for .
To examine the connection between and , we first consider the MAP estimate for . By solving and checking the sign of for cases of and where is defined in (11), we obtain that
[TABLE]
where and which divide into subintervals , and with fractions of , and , respectively. And falls in these subintervals depending on value in , , respectively.
A few interesting properties of the MAP estimate can be seen from (21). First, when , no matter or 0. This outcome enables the search path to possibly remain unchanged (with ) when the evidence of moving is not convincing. Second, when , the values of under the first two situations of (21) are rather counterintuitive. For instance, when with , we have . It would have been by Robbins–Monro type procedure. However, the procedure does yield . Similarly, when with , we get , which would have been by Robbins–Monro type procedure. This seemingly irrational move can actually avoid unnecessary oscillation of the search points in the absence of enough evidence and lead to a smooth path as seen in Figures 1 and 2 in contrast to a zig-zag path in Robbins–Monro type procedure. Third, can take value outside . For example, when , and , we get ; and when , and , we get . It results in the search point moving into the neighboring subinterval and consequently starting a new local Bayesian model.
The explicit expression of the MAP for can also be derived based on . It depends on and and is very complicated.
Next, by straightforward calculation, the Bayes estimate for is obtained as
[TABLE]
We can hardly interpret the connection of with from this analytic expression except that is a linear function of . However numerical analysis shows that also processes similar features as those described for the MAP estimate.
At last, inspired by the above investigation of , we find the proposed procedure is conservative in the sense of moving in large steps. So instead of choosing arbitrarily, we set , the middle of the search domain, as the starting point to begin cumulating information.
3.4 Choice of
The number of subintervals determines the size of the neighborhood up on which a local model is built. When is around the middle range, say , an integer in the range of can usually yield a quick convergence in a moderate number of iterations. When is close to extreme values, implying rare event of ‘success’ or ‘failure’ in experiment, we wish the search sequence to be conservative in moving in small steps especially in the early iterations. Therefore, a moderately large value of is recommended, say 20. And for the same reason, we recommend using MAP estimator instead of the Bayes estimator.
Second, to get a more efficient approximation and faster convergence, we recommend a two-stage procedure. That is to set to be a small number to quickly reach the vicinity of the target and then increase to a larger number for refined approximation. We provide a guideline for the choice of in Table 1. The odd numbers are chosen to avoid possible invalid denominators in in (11) during numerical calculation.
Third, if during the search updated information about the range of becomes available, one can re-define the search domain and use the available data after rescaling.
3.5 Alternative choice of prior
As seen in Section 2, the uniform prior of (9) leads to simple calculation for the derivation, but induces distribution of outside . Alternatively, one can use other prior, such as
[TABLE]
or even more informative prior to warrant . Then simple or explicit form of the posterior distribution may not be available. In this case, we can resort to Markov chain Monte Carlo method, e.g. Gibbs sampling, to obtain the empirical posterior distribution of after (8) and (14). However, because of scarcity of the data and simulation error, preliminary numerical study shows that resulting estimates are not as precise as those based on the exact distribution.
4 Numerical comparisons
We compare the proposed Bayesian stochastic approximation method using Bayes estimator (denoted by BSA-Bayes) and MAP estimator (denoted by BSA-MAP) with the classic Robbins–Monro procedure in (4) (denoted by RM), the efficient Robbins–Monro procedure in (5) (denoted by RMJ), the averaged trajectory method by Ruppert (1988) and Polyak and Juditsky (1992) (denoted by RPJ), and the Bayesian version of Wu’s logit-MLE method by Hung and Joseph (2014) (denoted by Wu-MAP).
Consider the following six functions adopted from Joseph (2004),
[TABLE]
which represent a shifted version of normal, uniform, logistic, extreme value, skewed logistic, and Cauchy distributions respectively with a common root at zero for all -quantiles.
Since all RM, RMJ, RPJ and Wu-MAP procedures are not intended to search within , we convert the points in interval by the linear map to interval and invert the resulting points back to for comparison in the same scale. For RM in (4), the optimal is used. For RMJ in (1), the optimal and are used as in Joseph (2004). For RPJ, set as recommended by Polyak and Juditsky (1992). For Wu-MAP, set the hyperparameters and to cover a wide range of priors. For BSA, set to represent a moderate number of sliced subintervals.
Throughout, we set (corresponding to the starting point zero in ) and to estimate . For taking values from , we compute the empirical root of mean square (RMSE) of over 1,000 replications for every procedure.
Figures 3 shows the empirical RMSE of obtained by the six competing methods. The findings are summarized as follows. (i) Under model 2, the RMJ and RPJ methods perform similarly. Both are superior to the RM and Wu-MAP methods, especially for extreme values of . The proposed method with Bayes estimator has uniform superiority to the RM, RMJ and RPJ methods for . For being extreme values as 0.1 or 0.9, the response curve is nearly flat at . The performance of BSA-Bayes deteriorates, as expected. While, the proposed method with MAP estimator in this case is the best due to the starting point advantage and its conservatism of movement as pointed out in Example 1. (ii) Under models 3 to 7, the results are similarly to those under model 2. For the sake of space, we defer them in the supplementary material. (iii) Under model 1, the -quantiles of standard normal locates across the search domain. It is seen that the performances of RM, RMJ, RPJ, Wu-MAP are similar to those under model 2. The proposed method with Bayes estimator outperforms the above four methods for all different values. The RMSE of BSA-MAP has the minimum value for the median estimation and increases in the distance between the root and the starting point which is again because of its conservative movement.
Further simulation shows that the proposed method with other moderate number of subintervals, say , yields similar superior result. The implicit stochastic approximation method by (20) of Toulis and Airoldi (2015) was also conducted and found to be much inferior to the RMJ and RPJ methods in the small sample case. The results are omitted.
At last, we want to add that the proposed method requires the uniqueness of . When is very close to zero such as at , or , the proposed method can perform inferior to the algorithm-based methods RMJ or RPJ. In that case, a hybrid method that uses RMJ or RPJ afer a moderate number of iterations of the proposed method can be used.
5 Applications
We present two applications of the proposed Bayesian stochastic approximation method for binary responses in this subsection.
5.1 Search for the root of a monotonic continuous function
For the original problem in (1), first convert to a response in through a sigmoid function , where is a known scale parameter such that spreads well in . For example, if has a known range in for some , we can set .
Second, approximate by a fraction represented by ones and zeros such that is closest to for some integer . These binaries are then treated as independent responses at the same point . The minimum value of corresponds to the dichotomization of by its sign. Usually a number as small as is adequate for the approximation.
Based on the generated binary responses, the problem is reduced to search for the median of a distribution. We can then use the proposed method with Bayes estimator in Section 2. More specifically, we set for the first ten steps and set for the subsequent steps as used in Section 4.
Example 2**.**
Consider the regression model , where is independent standard normal variable. We applied the proposed method above with , and . Panel (a) of Figure 4 shows the empirical RMSEs (over 1,000 replications) of up to 30 steps in comparison with those obtained by applying the (scaled) RMJ procedure (with the same starting point) to the binaries obtained by signs of . It is seen that the proposed method dominates the RMJ procedure.
5.2 Search for a minimum of a convex function
Suppose that is a convex function. We seek a sequential design for finding the minimum of at . It is equivalent to find such that , where . The Kiefer–Wolfowitz procedure (Kiefer and Wolfowitz, 1952) entails the recursion
[TABLE]
where and are two independent responses at and with mean and respectively, and are two positive constant sequences decreasing to zero and satisfying , , and . For example, and as recommended by Kiefer and Wolfowitz (1952).
Let . We apply the previous procedure in Section 5.1 to to approximate the root of .
Example 3**.**
Consider the regression model , where is independent standard normal variable. We conducted a similar comparison using the competing methods in Example 2 to and defined above. Panel (b) of Figure 4 shows the proposed method outperforms the method based on RMJ procedure in terms of RMSE.
6 Multi-dimensional extension
6.1 Method
We extend the proposed method for quantile estimation to the multi-dimensional case.
Let be the distribution function of a -dimensional continuous random vector with the domain scaled in the unit hypercube . The goal is to find the generalized multivariate quantile defined by
[TABLE]
where is a known function. This is a special case of the notion of generalized multivariate quantiles introduced by Einmahl and Mason (1992). Like the univariate case, assume that is unique.
The idea of the extension is to use a conditional approach to reduce the problem to univariate case along each coordinate so that the proposed method in Section 2 can be applied.
First, we introduce some notations. Divide equally into subintervals along each coordinate. For any , let for and . Then, is uniquely contained in the hypercube . Let denote a vector of ones and denote the th column vector of the identity matrix. Denote the following vertexes of by
[TABLE]
Notice that are arranged in a helix.
Second, approximate in by the segments of hyperplanes intersected by the hypercube respectively. The th hyperplane passes through the point with
[TABLE]
and is expressed as
[TABLE]
where \mbox{\boldmath{\beta}}=(\beta_{1},\ldots,\beta_{p})^{\top} with being all positive.
For , let and \mbox{\boldmath{\rho}}=(\rho_{0},\ldots,\rho_{p})^{\top}. Then, by (22) and (23), we have and the solution of (\theta_{j},\mbox{\boldmath{\beta}}) in given by
[TABLE]
Let for and \widetilde{}\mbox{\boldmath{\beta}}=(\widetilde{\beta}_{1},\ldots,\widetilde{\beta}_{p})^{\top}. The Jacobian of the transformation from to (\theta_{j},\widetilde{}\mbox{\boldmath{\beta}}) is .
Assume the joint prior of is uniform with density
[TABLE]
Further denote \widetilde{}\mbox{\boldmath{\beta}}_{-j}=(\widetilde{\beta}_{1},\ldots,\widetilde{\beta}_{j-1},\widetilde{\beta}_{j+1},\widetilde{\beta}_{p})^{\top}. By (24), (25) and (26), the joint prior of (\theta_{j},\widetilde{}\mbox{\boldmath{\beta}}) is
[TABLE]
where
[TABLE]
It is seen that the joint prior distribution of \widetilde{}\mbox{\boldmath{\beta}}_{-j} is uniform on the simplex defined by (28). Then given \widetilde{}\mbox{\boldmath{\beta}}_{-j}, the conditional distribution of after integrating out and imposing the restriction is
[TABLE]
where is the volume of (depending on ) and is the conditional normalization constant (depending on \widetilde{}\mbox{\boldmath{\beta}}_{-j}).
Alternatively, express
[TABLE]
where
[TABLE]
The further restriction of which amounts to 0\leq\ell_{j}(\widetilde{}\mbox{\boldmath{\beta}})<u_{j}(\widetilde{}\mbox{\boldmath{\beta}})\leq 1 requires with
[TABLE]
Denote the subsequence of contained in by with . Express the likelihood of as
[TABLE]
where
[TABLE]
or as
[TABLE]
where
[TABLE]
It should be noted that unlike the univariate case here depends not only on but also the current point through .
Combining the joint prior of (\theta_{j},\widetilde{}\mbox{\boldmath{\beta}}) in (27) or (29) and the likelihood of the subsequence, we get the posterior distribution of (\theta_{j},\widetilde{}\mbox{\boldmath{\beta}}) proportion to h(\theta_{j},\widetilde{}\mbox{\boldmath{\beta}})\prod_{k=1}^{m}L_{i_{k}}(\theta_{j},\widetilde{}\mbox{\boldmath{\beta}}). Given \widetilde{}\mbox{\boldmath{\beta}}_{-j}, the conditional posterior distributions of and are obtained in the same way as in the univariate case in Sections 2 and 3.2 respectively. We summarize the results in the following proposition.
Proposition 5**.**
Assume that the joint prior of associated with the vertexes of the hypercube is uniform with density (26). Then, the conditional posterior distribution of restricted in is
[TABLE]
where is defined in (15) with and given in (31) and is the conditional normalization constant. And the conditional posterior distribution of is
[TABLE]
where is defined in the same form as (15) with and replaced by \widetilde{a}_{i}(\widetilde{\beta}_{j}\mid\widetilde{}\mbox{\boldmath{\beta}}_{-j}) and in (32) respectively, and is the conditional normalization constant over the range given in (30).
Proposition 5 reduces to the results in Propositions 1 and 4 when .
The next design point along the th coordinate is then taken to be
[TABLE]
where
[TABLE]
Since h_{m}(\theta_{j}\mid\widetilde{}\mbox{\boldmath{\beta}}_{-j}) is completely determined, the conditional expectation of can be numerically calculated. The expectation with respect to \widetilde{}\mbox{\boldmath{\beta}}_{-j} can be approximated by averaging the conditional expectations over finite number of \widetilde{}\mbox{\boldmath{\beta}}_{-j} taken uniformly from the simplex. For instance when , the simplex for reduces to where
[TABLE]
, and are the other element of , and after removing the th element, respectively
At last, we use to determine the next design point out of the candidates, i.e.,
[TABLE]
Note that in the multi-dimensional case, the derivation of the marginal posterior distribution of is much complicated than the univariate case in Section 3.1. Moreover, there are in fact different ways to update (or ) in (26) depending on the coincidence of (or ) with one vertex of some neighboring hypercube. So in the following numerical study, we simply fix and for all hypercubes and let the data inside the hypercube learn the posterior distribution of the interested parameter.
6.2 Numerical illustration
We illustrate the proposed method by a few examples. Consider the following three models:
[TABLE]
where and is the distribution function of bivariate normal variables with zero means, unit marginal variances and correlation coefficient .
We use the same two-stage procedure with respect to the choice of as for the univariate case in Section 4. For illustration, we set the starting point for all cases and recommend using MAP estimator for to be conservative. The uniform distribution for over in (33) is approximated by a discrete uniform distribution over .
In these examples, by symmetry we have and the determination for the next point in (34) can be modified as , i.e. to choose a point that is closer to the diagonal line .
Panel (a) of Figure 5 presents a single search path under with , where the dotted curve is the solution set of and \mbox{\boldmath{\theta}}=(0.3733,0.3733)^{\top} is indicated by ‘’. Panels (b), (c) and (d) of Figure 5 show the empirical RMSE (over 1,000 replications) of up to 60 steps obtained by the proposed method using different estimators (in parenthesis). The convergence of the procedure is clear. For the case with , the small value of RMSE at the first few steps is due to the starting point.
The results for models 9 and 10 are similar and hence omitted.
7 Conclusion and discussion
The proposed Bayesian stochastic approximation method uses an adaptive local model and yields a recursive updating scheme in terms of the posterior distribution in stead of the estimate itself. It has the advantage of successively utilizing the information of the neighboring points to improve the estimation efficiency, thus reduces the variation or uncertainty carried by a single point. However, there remain several questions unsettled. First, the asymptotic behavior of the procedure in both univariate and multivariate cases is not fully understood. Second, the refined prior in both univariate case and multi-dimensional case is worth further investigation. Third, more efficient algorithm is desired, especially for multi-dimensional situation, where information about the posterior distribution of can be used.
Because of the rich and broad applications of stochastic approximation, we anticipate new explorations of the proposed method in interactions with different techniques in many fields that mentioned at the beginning of the article.
R package is provided in the supplementary material.
Acknowledgements
The research is supported by the National Natural Science Foundation of China (grant 11271134) and the 111 Project (B14019) of Chinese Ministry of Education.
Appendix
Figure 6 shows the empirical RMSE of obtained by the six competing methods under models 3 to 7. The results are similar to those obtained under model 2.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Anbar (1978) Anbar, D. (1978). A stochastic Newton–Raphson method. Journal of Statistical Planning and Inference 2 , 153–163.
- 2Anderson and Taylor (1976) Anderson, T. W. and J. Taylor (1976). Some experimental results on the statistical properties of least squares estimates in control problems. Econometrica 44 , 1289–1302.
- 3Anderson and Taylor (1979) Anderson, T. W. and J. Taylor (1979). Strong consistency of least squares estimates in dynamic models. Annals of Statistics 7 , 484–489.
- 4Burkholder (1956) Burkholder, D. L. (1956). On a class of stochastic approximation procedures. Annals of Mathematical Statistics 27 , 1044–1059.
- 5Chaloner and Larntz (1989) Chaloner, K. and K. Larntz (1989). Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference 21 , 191–208.
- 6Chaudhuri and Mykland (1993) Chaudhuri, P. and P. A. Mykland (1993). Nonlinear experiments: optimal design and inference based on ikelihood. Journal of the American Statistical Association 88 , 538–546.
- 7Cheung (2010) Cheung, Y. K. (2010). Stochastic approximation and modern model-based designs for dose-finding clinical trials. Statistical Science 25 , 191–201.
- 8Chung (1954) Chung, K. L. (1954). On a stochastic approximation method. Annals of Mathematical Statistics 25 , 463–483.
