The adaptive Wynn-algorithm in generalized linear models with univariate response
Fritjof Freise, Norbert Gaffke, Rainer Schwabe

TL;DR
This paper introduces an adaptive Wynn-algorithm for D-optimal design in generalized linear models with univariate responses, demonstrating its strong consistency and asymptotic properties under certain conditions.
Contribution
It extends the Wynn-algorithm to generalized linear models, providing theoretical guarantees for adaptive design and estimator asymptotics.
Findings
Adaptive ML-estimators are strongly consistent.
Design sequence is asymptotically locally D-optimal.
Asymptotic normality holds under smoothness assumptions.
Fritjof Freise¹, Norbert Gaffke², and Rainer Schwabe²
¹TU Dortmund University and ²University of Magdeburg
Abstract
For a nonlinear regression model the information matrices of designs depend on the parameter of the model. The adaptive Wynn-algorithm for D-optimal design estimates the parameter at each step on the basis of the employed design points and observed responses so far, and selects the next design point as in the classical Wynn-algorithm for D-optimal design. The name ‘Wynn-algorithm’ is in honor of Henry P. Wynn who established the latter ‘classical’ algorithm in his 1970 paper [16]. The asymptotics of the sequences of designs and maximum likelihood estimates generated by the adaptive algorithm is studied for an important class of nonlinear regression models: generalized linear models whose (univariate) response variables follow a distribution from a one-parameter exponential family. Under the assumptions of compactness of the experimental region and of the parameter space together with some natural continuity assumptions it is shown that the adaptive ML-estimators are strongly consistent and the design sequence is asymptotically locally D-optimal at the true parameter point. If the true parameter point is an interior point of the parameter space then under some smoothness assumptions the asymptotic normality of the adaptive ML-estimators is obtained.
1 Introduction
In a nonlinear regression model the information matrix of a design depends on the model parameter whose true value is unknown. Modifying the classical algorithm of Wynn [16] for sequential generation of a D-optimal design in linear regression to an adaptive sequential procedure in a nonlinear model, the ‘adaptive Wynn-algorithm’ emerges, which was called ‘one-step ahead adaptive D-optimal design algorithm’ in Pronzato [11].
By $\mathbb{N}$, $\mathbb{N}_{0}$, $\mathbb{R}$, and $\mathbb{R}^{p}$ we denote the set of all positive integers, the set of all nonnegative integers, the real line, and the $p$-dimensional Euclidean space, respectively. Vectors $a\in\mathbb{R}^{p}$ are written as column vectors and $a^{\sf T}$ denotes the transpose of $a$, which is a $p$-dimensional row vector. The usual Euclidean norm on $\mathbb{R}^{p}$ is denoted by $\|a\|$. If $a_{i}$, $i\in I$, is a family of vectors in $\mathbb{R}^{p}$ then ${\rm span}\bigl\{a_{i}\,:\,i\in I\bigr\}$ denotes the linear subspace of $\mathbb{R}^{p}$ generated by the vectors $a_{i}$ ($i\in I$). For a linear subspace $V$ of $\mathbb{R}^{p}$ the dimension of $V$ is denoted by $\dim(V)$. If $A$ is a symmetric $p\times p$ matrix then ${\rm tr}(A)$ denotes the trace of $A$ and $\|A\|$ denotes the Frobenius norm of $A$, i.e., $\|A\|=\bigl({\rm tr}(A^{2})\bigr)^{1/2}$. For any two symmetric $p\times p$ matrices $A$ and $B$ we write $A\leq B$ or, equivalently, $B\geq A$ iff $B-A$ is nonnegative definite. Thereby a semi-ordering is defined on the set of all symmetric $p\times p$ matrices, which is called the Loewner semi-ordering.
We give an outline of the adaptive Wynn-algorithm. Let ${\cal X}$ be the experimental region and $\Theta$ be the parameter space. For each $\theta\in\Theta$ a function $f_{\theta}:{\cal X}\to\mathbb{R}^{p}$ is given such that the range of $f_{\theta}$ spans $\mathbb{R}^{p}$, i.e., ${\rm span}\bigl\{f_{\theta}(x)\,:\,x\in{\cal X}\bigr\}=\mathbb{R}^{p}$ for each $\theta\in\Theta$. Throughout it is assumed that ${\cal X}$ and $\Theta$ are compact metric spaces with distance functions ${\rm d}_{\cal X}$ and ${\rm d}_{\Theta}$, resp., and the function $(x,\theta)\mapsto f_{\theta}(x)$ is continuous on ${\cal X}\times\Theta$. Of course, the assumption of compactness of the parameter space is somewhat disturbing but, presently, indispensable for our results. However, in the literature on adaptive procedures in stochastic approximation it is quite common to assume compactness of the parameter space and, moreover, to assume the true parameter to be an interior point, see e.g. Venter [14], Section 4.
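An (approximate) design $\xi$ is a probability measure with finite support on ${\cal X}$, and it can formally be represented as
$$\xi\,=\,\sum_{x\in{\rm supp}(\xi)}\xi(x)\,\delta_{x},$$
where ${\rm supp}(\xi)$ denotes the support of $\xi$, which is a nonempty finite subset of ${\cal X}$, and to each $x\in{\rm supp}(\xi)$ the design assigns a positive weight $\xi(x)$ such that $\sum_{x\in{\rm supp}(\xi)}\xi(x)=1$. The symbol $\delta_{x}$ (for any $x\in{\cal X}$) stands for the one-point probability measure on ${\cal X}$ concentrated at the point $x$. For a design $\xi$ and for a parameter point $\theta\in\Theta$ the information matrix (per observation) of $\xi$ at $\theta$ is given by
$$M(\xi,\theta)\,=\,\sum_{x\in{\rm supp}(\xi)}\xi(x)\,f_{\theta}(x)\,f_{\theta}^{\sf T}(x),\qquad(1.1)$$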
which is a nonnegative definite $p\times p$ matrix. The information matrices defined by (1.1) arise as Fisher information in some nonlinear regression model and, in particular, the functions $f_{\theta}$ are related to a local linearization at $\theta$ of the (univariate) nonlinear mean response $\mu(x,\theta)$, say. E.g., in case of a homoscedastic regression model the vector $f_{\theta}(x)$ is given by the gradient of $\mu(x,\theta)$ with respect to the parameter at $\theta$. In case of heteroscedasticity, also the variance function and possibly its gradient enter into $f_{\theta}(x)$, see Atkinson et al. [1]. For the case of a generalized linear model the functions $f_{\theta}$ have the pleasant property that the parameter $\theta$ only enters into a positive scalar factor, i.e., a real-valued positive function $\psi(x,\theta)$, while the ‘body’ of the functions is given by one $\mathbb{R}^{p}$-valued function $f(x)$. We will refer to this situation as ‘condition (GLM)’ on the family of functions $f_{\theta}$, $\theta\in\Theta$, namely:
**Condition (GLM)** $f_{\theta}(x)=\psi(x,\theta)\,f(x)$ for all $(x,\theta)\in{\cal X}\times\Theta$, where $\psi:{\cal X}\times\Theta\to(\,0\,,\infty)$ and $f:{\cal X}\to\mathbb{R}^{p}$ are given continuous functions.
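For a generalized linear model one has, even more specially, that $\Theta\subseteq\mathbb{R}^{p}$ and the real-valued function $\psi$ is actually a function of $f^{\sf T}(x)\,\theta$, i.e.,
$$\psi(x,\theta)\,=\,\varphi\bigl(f^{\sf T}(x)\,\theta\bigr),\qquad(1.2)$$
where $\varphi$ is a continuous function of one real variable. As an example, for the logistic model with Bernoulli response variables one has
$$\varphi(u)\,=\,\frac{\exp(u/2)}{1+\exp(u)},\quad u\in\mathbb{R},$$
see Atkinson and Woods [2], Section 2.3.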
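To make the structure of the information matrix under condition (GLM) concrete, the following is a minimal sketch in plain Python for the logistic case. It relies on the standard fact that the Fisher information of a Bernoulli observation with success probability $\pi$ and linear predictor $u$ is $\pi(1-\pi)$, so the scalar factor is its square root. The regression function $f(x)=(1,x)^{\sf T}$ with $p=2$, the two-point design, and the names `psi_logistic` and `info_matrix` are illustrative assumptions, not taken from the paper.

```python
import math

def psi_logistic(u):
    """Scalar factor sqrt(pi * (1 - pi)) with pi = 1 / (1 + exp(-u))."""
    return math.exp(u / 2.0) / (1.0 + math.exp(u))

def info_matrix(design, theta):
    """M(xi, theta) = sum_x xi(x) f_theta(x) f_theta(x)^T.

    Assumes f(x) = (1, x)^T (hypothetical, p = 2) and f_theta = psi * f.
    design: list of (x, weight) pairs with positive weights summing to 1.
    """
    m = [[0.0, 0.0], [0.0, 0.0]]
    for x, w in design:
        f = (1.0, x)
        s = psi_logistic(theta[0] * f[0] + theta[1] * f[1])
        g = (s * f[0], s * f[1])          # f_theta(x) = psi(x, theta) * f(x)
        for i in range(2):
            for j in range(2):
                m[i][j] += w * g[i] * g[j]
    return m

# Equal weights on x = -1 and x = +1, evaluated at theta = (0, 0).
M = info_matrix([(-1.0, 0.5), (1.0, 0.5)], theta=(0.0, 0.0))
```

At $\theta=0$ the scalar factor equals $1/2$ at every point, so `M` is $0.25$ times the unit matrix for this symmetric design.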
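The adaptive Wynn-algorithm generates a sequence of designs $\xi_{n}$, $n\geq n_{\rm st}$, (the index ‘st’ standing for ‘starting’) which is obtained from a sequence of points $x_{i}\in{\cal X}$, $i\in\mathbb{N}$, and a sequence of parameter points $\theta_{n}\in\Theta$, $n\geq n_{\rm st}$, as follows,
$$\xi_{n_{\rm st}}\,=\,\xi_{\rm st}\,=\,\frac{1}{n_{\rm st}}\sum_{i=1}^{n_{\rm st}}\delta_{x_{i}},\qquad\xi_{n+1}\,=\,\frac{n}{n+1}\,\xi_{n}+\frac{1}{n+1}\,\delta_{x_{n+1}},\quad n\geq n_{\rm st},\qquad(1.3)$$
$$x_{n+1}\,\in\,\arg\max_{x\in{\cal X}}f_{\theta_{n}}^{\sf T}(x)\,M^{-1}(\xi_{n},\theta_{n})\,f_{\theta_{n}}(x),\quad n\geq n_{\rm st},\qquad(1.4)$$
where it is assumed that the starting design $\xi_{\rm st}$ is such that its information matrix $M(\xi_{\rm st},\theta)$ is positive definite for all $\theta\in\Theta$. This implies positive definiteness of the information matrices $M(\xi_{n},\theta)$ of all designs $\xi_{n}$, $n\geq n_{\rm st}$, since
$$M(\xi_{n},\theta)\,=\,\frac{n_{\rm st}}{n}\,M(\xi_{\rm st},\theta)+\frac{1}{n}\sum_{i=n_{\rm st}+1}^{n}f_{\theta}(x_{i})\,f_{\theta}^{\sf T}(x_{i}),$$
which entails $M(\xi_{n},\theta)\geq\frac{n_{\rm st}}{n}\,M(\xi_{\rm st},\theta)$, for all $n\geq n_{\rm st}$ and all $\theta\in\Theta$. Note that the design $\xi_{n}$ for each $n\geq n_{\rm st}$ is an exact design of size $n$ since the weights assigned to its support points are integer multiples of $1/n$, and hence $\xi_{n}$ can be exactly realized for the sample size $n$. The sequence of parameter points $\theta_{n}$, $n\geq n_{\rm st}$, employed will actually be generated by adaptive parameter estimation, i.e., $\theta_{n}=\widehat{\theta}_{n}(x_{1},y_{1},\ldots,x_{n},y_{n})$ for all $n\geq n_{\rm st}$, where $y_{1},\ldots,y_{n}$ are the sequentially observed univariate responses at the design points $x_{1},\ldots,x_{n}$, resp., due to an underlying regression model with a mean response function $\mu(x,\theta)$, $x\in{\cal X}$, $\theta\in\Theta$, as mentioned above.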
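The design iteration just outlined can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the experimental region is replaced by a finite grid of candidate points, the family is the logistic-type $f_{\theta}(x)=\psi(x,\theta)\,(1,x)^{\sf T}$ with $p=2$, and the parameter point is held fixed as a stand-in for the adaptive estimate that the algorithm would recompute at every step. All function names are hypothetical.

```python
import math

def f_theta(x, theta):
    """Assumed family: logistic scalar factor times f(x) = (1, x)^T."""
    u = theta[0] + theta[1] * x
    s = math.exp(u / 2.0) / (1.0 + math.exp(u))
    return (s, s * x)

def info_matrix(points, theta):
    """Entries (M11, M12, M22) for the exact design with weight 1/n per point."""
    a = b = c = 0.0
    for x in points:
        g0, g1 = f_theta(x, theta)
        a += g0 * g0
        b += g0 * g1
        c += g1 * g1
    n = len(points)
    return a / n, b / n, c / n

def sensitivity(x, m, theta):
    """f_theta(x)^T M^{-1} f_theta(x), using the explicit 2x2 inverse."""
    a, b, c = m
    g0, g1 = f_theta(x, theta)
    return (c * g0 * g0 - 2.0 * b * g0 * g1 + a * g1 * g1) / (a * c - b * b)

def wynn_step(points, theta, candidates):
    """Next design point: a maximizer of the sensitivity over the candidates."""
    m = info_matrix(points, theta)
    return max(candidates, key=lambda x: sensitivity(x, m, theta))

candidates = [-3.0 + 0.1 * k for k in range(61)]   # grid standing in for the region
points = [-1.0, 1.0]        # starting design with nonsingular information matrix
theta_n = (0.0, 1.0)        # fixed here; the adaptive algorithm re-estimates it
for _ in range(200):
    points.append(wynn_step(points, theta_n, candidates))
```

Each pass appends one point, so after the loop `points` describes an exact design of size 202. Replacing the fixed `theta_n` by an estimate computed from the responses observed so far would turn this into the adaptive version.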
In Section 2 we study the asymptotic behavior of the design sequence and their information matrices under any sequence of parameter points , , which may be thought of as a path of a sequence of adaptive estimators , . Also, the design sequence , , may be viewed as a path of a sequence of adaptive random designs. In Section 3 the asymptotic properties (strong consistency, asymptotic normality) of adaptive ML-estimators in the algorithm are derived. For modelling the adaptive procedure inherent in the algorithm we follow the martingale approach of Lai and Wei [10], Lai [9], and Chen, Hu and Ying [4]. Some known results on matrices used in our proofs are collected in the appendix.
The paper of Pronzato [11] deals with the adaptive Wynn-algorithm for the case of a finite design space (and a compact parameter space). In that paper, under some conditions of Chebyshev type on the functions , , and the mean response function, asymptotic results of the design sequence and of adaptive least squares estimators were derived, and also for adaptive ML-estimators in the particular case of binary response variables. The thesis of Freise [6] provides an interesting contribution to the asymptotics of the adaptive Wynn algorithm. Of further interest, though not dealing with adaptive procedures, are the papers of Wu [15] on nonlinear least squares estimators, and of Fahrmeir and Kaufmann [5] on maximum likelihood estimators in generalized linear models.
2 Asymptotic properties of designs
Throughout this section let , , be any given sequence of parameter points and let , , be the sequence of designs given by (1.3) and (1.4), where the starting design is such that its information matrix is positive definite for all , and hence is positive definite for all and all .
An important question is whether positive definiteness of the information matrices of the designs is preserved asymptotically in the sense that
[TABLE]
or, even stronger,
[TABLE]
where $\lambda_{\rm min}(A)$ denotes the smallest eigenvalue of a symmetric matrix $A$. Both questions are answered below: under condition (GLM) the stronger asymptotic nonsingularity (2.2) holds true, while a weaker technical condition (T) ensures the asymptotic nonsingularity (2.1). We start our derivations with four lemmas.
For a real number $t$ we denote by $\lceil t\rceil$ the smallest integer greater than or equal to $t$.
Lemma 2.1
Let , , be a sequence in , where is given, and let such that for each the following two implications hold.
[TABLE]
*Let be given. Denote m_{1}\,=\,m_{1}(\beta,\widetilde{\beta},m_{0})\,:=\,\big{\lceil}1/\beta\big{\rceil}\,\max\bigl{\{}m_{0},\,\big{\lceil}1/(\widetilde{\beta}-\beta)\big{\rceil}\bigr{\}}.
Then: for all .*
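As a small arithmetic illustration of the index $m_{1}$ defined in the lemma, the sketch below evaluates the formula with exact rational arithmetic, which avoids floating-point artifacts inside the ceiling function. The function name `m1` and the example values are assumptions chosen for illustration, and the guard presumes $0<\beta<\widetilde{\beta}$, as the term $1/(\widetilde{\beta}-\beta)$ requires.

```python
import math
from fractions import Fraction

def m1(beta, beta_tilde, m0):
    """ceil(1/beta) * max{m0, ceil(1/(beta_tilde - beta))}."""
    if not 0 < beta < beta_tilde:
        raise ValueError("expected 0 < beta < beta_tilde")
    return math.ceil(1 / beta) * max(m0, math.ceil(1 / (beta_tilde - beta)))

# Hypothetical values: beta = 3/5, beta_tilde = 4/5, m0 = 4.
example = m1(Fraction(3, 5), Fraction(4, 5), m0=4)
```

Here $\lceil 5/3\rceil=2$ and $\lceil 5\rceil=5$, so `example` equals $2\cdot\max\{4,5\}=10$.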
Proof. We show that
[TABLE]
Let with be given. In case that the sequence , , never exceeds the conclusion in (2.5) trivially holds. In the other case, by (2.3), it suffices to show that holds for those for which and . For such , by (2.4), .
Next we show that
[TABLE]
Let with be given. If is a nonnegative integer such that for all , then by (2.3) and hence , i.e., . So there must be some such that .
Consider as defined in the lemma and define . Note that m_{1}=\big{\lceil}1/\beta\big{\rceil}\,k_{1}. We show that for all .
Case 1: . By (2.5) with one gets for all .
Case 2: . By (2.6) with one gets some such that . Application of (2.5) on yields for all and, in particular, for all since \nu\leq\big{\lceil}k_{1}/\beta\big{\rceil}\leq\lceil 1/\beta\rceil\,k_{1}=m_{1}.
We will use the two positive real constants given by
[TABLE]
where in (2.8) the infimum is taken over all from the unit sphere of and over all . In fact, both the supremum in (2.7) and the infimum in (2.8) are attained and are positive. This is obvious for the former supremum by the continuity and compactness assumptions. For the infimum in (2.8), note that the function (v,\theta)\mapsto\max_{x\in{\cal X}}\bigl{(}v^{\sf\scriptsize T}f_{\theta}(x)\bigr{)}^{2} is lower semi-continuous (as a pointwise maximum of a family of continuous functions) and positive, where the latter follows from the basic assumption that the image spans for each . By compactness of the unit sphere of and compactness of the infimum in (2.8) is attained and hence positive.
Lemma 2.2
*Let be a design and such that is positive definite.
Let and . Then for all such that one has*
[TABLE]
Proof. Abbreviate . Define b_{0}\,:=\,M^{-1}_{0}f_{\theta_{0}}(x_{0})\big{/}\bigl{(}f_{\theta_{0}}^{\sf\scriptsize T}(x_{0})M^{-1}_{0}f_{\theta_{0}}(x_{0})\bigr{)}. Then
[TABLE]
Denote by the smallest eigenvalue of . We will show that
[TABLE]
To prove (2.10), the obvious inequality (in the Loewner semi-ordering) , where denotes the unit matrix, yields
[TABLE]
To prove (2.11) let be a normalized eigenvector to of . By definition of from (2.8)
[TABLE]
for some . Hence, together with the obvious inequality (in the Loewner semi-ordering) , one obtains
[TABLE]
From (2.9), (2.10), and (2.11) we get
[TABLE]
Let such that . Recall that, by definition of ,
[TABLE]
Together with (2.12) we get
[TABLE]
hence . Define . Then and hence by (M3) of the appendix,
[TABLE]
from which the result follows.
Lemma 2.3
Let and \eta\in\bigl{(}\,0\,,\,1-\frac{1}{\sqrt{p}}\,\bigr{)}. Let and be given such that
[TABLE]
Then: $x_{n+1}\not\in S$.
Proof. Suppose that $x_{n+1}\in S$. Consider the mean (of $f_{\theta_{n}}$ over $S$ w.r.t. $\xi_{n}$),
[TABLE]
Since for all we get . By Lemma 2.2,
[TABLE]
By (M1) and (M2) of the appendix,
[TABLE]
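Hence it follows that $f_{\theta_{n}}^{\sf T}(x_{n+1})\,M^{-1}(\xi_{n},\theta_{n})\,f_{\theta_{n}}(x_{n+1})<p$. This is a contradiction since we know from the Kiefer-Wolfowitz Equivalence Theorem that $\max_{x\in{\cal X}}f_{\theta_{n}}^{\sf T}(x)\,M^{-1}(\xi_{n},\theta_{n})\,f_{\theta_{n}}(x)\geq p$ and, by (1.4), the point $x_{n+1}$ attains this maximum. So $x_{n+1}\not\in S$ must be true.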
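The lower bound from the Kiefer-Wolfowitz Equivalence Theorem used in this proof rests on a simple identity: the design-weighted average of the quadratic forms $f^{\sf T}M^{-1}f$ over the support equals ${\rm tr}(M^{-1}M)=p$, so their maximum is at least $p$. The following plain-Python check illustrates this for $p=2$. The three vectors and the weights are arbitrary illustrative values, not taken from the paper.

```python
def info_matrix(vectors, weights):
    """Entries (M11, M12, M22) of sum_i w_i g_i g_i^T for 2-dimensional g_i."""
    a = sum(w * g[0] * g[0] for g, w in zip(vectors, weights))
    b = sum(w * g[0] * g[1] for g, w in zip(vectors, weights))
    c = sum(w * g[1] * g[1] for g, w in zip(vectors, weights))
    return a, b, c

def sensitivity(g, m):
    """g^T M^{-1} g via the explicit inverse of a symmetric 2x2 matrix."""
    a, b, c = m
    return (c * g[0] ** 2 - 2.0 * b * g[0] * g[1] + a * g[1] ** 2) / (a * c - b * b)

vectors = [(1.0, -1.0), (1.0, 0.0), (1.0, 2.0)]   # stand-ins for f_theta(x)
weights = [0.2, 0.3, 0.5]                         # design weights summing to 1
m = info_matrix(vectors, weights)
d = [sensitivity(g, m) for g in vectors]
weighted_mean = sum(w * di for w, di in zip(weights, d))   # equals p = 2
```

Since the weighted mean is $p$, at least one support point has sensitivity no smaller than $p$, which is the inequality the contradiction argument relies on.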
For , , and we denote
[TABLE]
The set may be called an -neighborhood of . For any subset and we denote, as usual, f_{\theta}^{-1}(C)\,=\,\bigl{\{}x\in{\cal X}\,:\,f_{\theta}(x)\in C\bigr{\}}.
Lemma 2.4
Let be a linear subspace with , and let with and be given. Then, denoting w_{n}:=\xi_{n}\Bigl{(}f_{\theta_{n}}^{-1}\bigl{(}\overline{V}(\delta)\bigr{)}\Bigr{)}, one has
[TABLE]
Proof. For all decompose
[TABLE]
where denotes the orthogonal complement of in . Choose . Clearly, for all x\in f_{\theta_{n}}^{-1}\bigl{(}\overline{V}(\delta)\bigr{)} one has \|v(x)\|={\rm dist}\bigl{(}f_{\theta_{n}}(x),V\bigr{)}\leq\delta. On the other hand, which can be seen as follows. Since there is some -dimensional linear subspace such that . There is a representation for some with . By for all , and by definition of in (2.8) one gets
[TABLE]
Define . Clearly, hence by (M3) of the appendix
[TABLE]
Now, \displaystyle b^{\sf\scriptsize T}M(\xi_{n},\theta_{n})\,b\,=\,\sum_{x\in{\rm\scriptsize supp}(\xi_{n})}\xi_{n}(x)\,\bigl{(}b^{\sf\scriptsize T}f_{\theta_{n}}(x)\bigr{)}^{2}\,=\,\sum_{x\in{\rm\scriptsize supp}(\xi_{n})}\xi_{n}(x)\,\frac{\bigl{(}v^{\sf\scriptsize T}(x^{*})\,v(x)\bigr{)}^{2}}{\|v(x^{*})\|^{4}},
and \bigl{(}v^{\sf\scriptsize T}(x^{*})\,v(x)\bigr{)}^{2}\big{/}\|v(x^{*})\|^{4}\leq\|v(x^{*})\|^{2}\,\|v(x)\|^{2}\big{/}\|v(x^{*})\|^{4}=\|v(x)\|^{2}/\|v(x^{*})\|^{2}\,\leq 1 for all . If x\in f_{\theta_{n}}^{-1}\bigl{(}\overline{V}(\delta)\bigr{)} then . Hence, partitioning into {\rm supp}(\xi_{n})\cap f_{\theta_{n}}^{-1}\bigl{(}\overline{V}(\delta)\bigr{)} and {\rm supp}(\xi_{n})\setminus f_{\theta_{n}}^{-1}\bigl{(}\overline{V}(\delta)\bigr{)}, one gets
[TABLE]
and together with (2.14) the result follows.
We introduce a technical condition (T) which is weaker than (GLM). It is motivated by the result of Lemma 2.5 below.
**Condition (T)** For each $\delta>0$ there exist an integer $m_{0}=m_{0}(\delta)$ and a $\delta^{\prime}>0$ such that for all $k,\ell\geq m_{0}$ and all linear subspaces $V\subseteq\mathbb{R}^{p}$ one has $f_{\theta_{k}}^{-1}\bigl(\overline{V}(\delta^{\prime})\bigr)\subseteq f_{\theta_{\ell}}^{-1}\bigl(\overline{V}(\delta)\bigr)$.
Lemma 2.5
*(i) Condition (GLM) implies condition (T).
(ii) If for some then condition (T) holds.*
Proof. Ad (i). Assume (GLM). Denote
[TABLE]
By compactness and continuity the infimum and the supremum are attained, and hence . For a given choose and . Let and a linear subspace be given. For any and any one has f_{\theta}^{-1}\bigl{(}\overline{V}(\varepsilon)\bigr{)}=\bigl{\{}x\in{\cal X}\,:\,{\rm dist}(\psi(x,\theta)\,f(x),V)\leq\varepsilon\bigr{\}}, and {\rm dist}\bigl{(}\psi(x,\theta)\,f(x),V\bigr{)}=\psi(x,\theta)\,{\rm dist}(f(x),V), hence
[TABLE]
For and (2.16) yields, observing ,
[TABLE]
For and (2.16) yields, observing ,
[TABLE]
From (2.17) and (2.18) the inclusion f_{\theta_{k}}^{-1}\bigl{(}\overline{V}(\delta^{\prime})\bigr{)}\subseteq f_{\theta_{\ell}}^{-1}\bigl{(}\overline{V}(\delta)\bigr{)} follows.
Ad (ii). Assume that for some . By compactness of and continuity (hence uniform continuity) of the function the sequence of functions , , converges to uniformly on . So, for any given there is an such that
[TABLE]
Choose . Let and a linear subspace be given. Using the well-known inequality
[TABLE]
one gets from (2.19) that
[TABLE]
From (2.20), using , one gets f_{\theta_{k}}^{-1}\bigl{(}\overline{V}(\delta^{\prime})\bigr{)}\subseteq f_{\theta_{\ell}}^{-1}\bigl{(}\overline{V}(\delta)\bigr{)}.
Theorem 2.6
Assume condition (T). Then there exist , , and such that for all and all -dimensional linear subspaces of one has \xi_{n}\Bigl{(}f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{p-1}(\varepsilon)\bigr{)}\Bigr{)}\leq\alpha.
Proof. Firstly, consider the (nearly) trivial case $p=1$. The only $0$-dimensional linear subspace of $\mathbb{R}$ is $\{0\}$, hence $\overline{V}_{0}(\varepsilon)=[\,-\varepsilon\,,\,\varepsilon\,]$ for any $\varepsilon>0$. From (1.4) $\bigl(f_{\theta_{n}}(x_{n+1})\bigr)^{2}=\max_{x\in{\cal X}}\bigl(f_{\theta_{n}}(x)\bigr)^{2}$ and by (2.8) $\bigl(f_{\theta_{n}}(x_{n+1})\bigr)^{2}\geq\kappa$ for all $n\geq n_{\rm st}$. Choose a $\delta\in(\,0\,,\sqrt{\kappa}\,)$ and choose $m_{0}=m_{0}(\delta)$ and $\delta^{\prime}$ according to condition (T). Then $x_{n+1}\not\in f_{\theta_{n}}^{-1}\bigl([\,-\delta\,,\,\delta\,]\bigr)$ for all $n\geq n_{\rm st}$, hence
[TABLE]
By (T), for all the set f_{\theta_{n}}^{-1}\bigl{(}[\,-\delta^{\prime}\,,\,\delta^{\prime}\,]\bigr{)} is a subset of the intersection from (2.21) and hence x_{i}\not\in f_{\theta_{n}}^{-1}\bigl{(}[\,-\delta^{\prime}\,,\,\delta^{\prime}\,]\bigr{)} for all and all . It follows that
[TABLE]
So, choosing , , and , the statement of the theorem holds in case . In what follows we assume . We will prove by induction the following statement for all .
There exist , , and such that \xi_{n}\Bigl{(}f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\varepsilon_{r})\bigr{)}\Bigr{)}\leq\alpha_{r} for all and all -dimensional linear subspaces of .
Then the result will follow by taking , , and .
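$r=0$. The only $0$-dimensional linear subspace of $\mathbb{R}^{p}$ is the nullspace $V_{0}=\{0\}$, and for any $\varepsilon>0$ one has $\overline{V}_{0}(\varepsilon)=\bigl\{a\in\mathbb{R}^{p}\,:\,\|a\|\leq\varepsilon\bigr\}$, the closed ball centered at zero with radius $\varepsilon$. Choose any $\eta\in\bigl(\,0\,,\,1-\frac{1}{\sqrt{p}}\,\bigr)$ and let . Choose $m_{0}=m_{0}(\delta)$ and $\delta^{\prime}$ according to condition (T), and define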
[TABLE]
Clearly, if x,z\in f_{\theta_{\ell}}^{-1}\bigl{(}\overline{V}_{0}(\delta)\bigr{)}, i.e., and , then . So the subset has the property that if and then . By Lemma 2.3, if and \xi_{n}(S)>1\big{/}\bigl{(}(1-\eta)^{2}p\bigr{)} then . Choose an with 1\big{/}\bigl{(}(1-\eta)^{2}p\bigr{)}<\alpha_{0}<1. The sequence , , along with \beta=1\big{/}\bigl{(}(1-\eta)^{2}p\bigr{)} and , satisfy the assumptions of Lemma 2.1, and hence by that lemma for all n\geq m_{1}=m_{1}\bigl{(}\beta,\widetilde{\beta},m_{0}(\delta)\bigr{)}. By (T), f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{0}(\delta^{\prime})\bigr{)}\subseteq S for all and hence \xi_{n}\Bigl{(}f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{0}(\delta^{\prime})\bigr{)}\Bigr{)}\leq\alpha_{0} for all . So statement holds with , , and as already introduced.
Induction step. Suppose that for some statement is true, and let , , and be chosen as in statement . Since every linear subspace of dimension can be enlarged to an -dimensional linear subspace , where and hence , the assumed statement implies the following.
[TABLE]
The rest of the proof of the induction step is lengthy; it is structured into three steps.
Step 1. We introduce some sets and constants.
[TABLE]
Obviously, and are compact sets of matrices. It is not quite obvious that is nonempty which can be seen as follows. (2.22) implies in particular that, choosing any , the set {\cal X}\setminus f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{0}(\varepsilon_{r-1})\bigr{)} is nonempty, i.e., there is a such that . By one has . Choosing pairwise orthogonal vectors with , , one gets a matrix with , hence . Next, denote by the -norm of a vector and define
[TABLE]
Choosing any and gives , and follows. Together with compactness and continuity one has . Again by compactness and continuity one can choose a positive integer and nonempty subsets such that
[TABLE]
[TABLE]
Note that \alpha_{r-1}\,<\,(K_{r}c_{r}^{2}+\alpha_{r-1})\big{/}(K_{r}c_{r}^{2}+1), hence . Finally, choose a which satisfies the following three conditions,
[TABLE]
In fact, such a exists since, firstly, both sides of the inequality (2.29) are continuous functions of a real variable and the (strict) inequality (2.29) holds for by (2.27). Secondly, (2.30) is achieved by the uniform continuity of the function on the compact set from (2.23).
Step 2. With and from Step 1 we show the following:
- *If is an -dimensional linear subspace and such that
\xi_{n}\Bigl{(}f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta)\bigr{)}\Bigr{)}\,>\,\overline{\alpha}_{r}, then x_{n+1}\not\in f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta)\bigr{)}.*
Let an -dimensional linear subspace and an be given such that
[TABLE]
By property (2.28), \delta^{2}/\kappa<\kappa\big{/}\bigl{(}(c_{r}+1)^{2}\gamma^{2}\bigr{)}\leq\kappa/\gamma^{2}\leq 1, where the last inequality is obvious by the definitions of and in (2.7) and (2.8). So and by Lemma 2.4 and (2.31)
[TABLE]
Next, we construct a particular basis of the linear subspace . From (2.22) and (2.31) it follows that for all linear subspaces of dimension one has
[TABLE]
Note that by (2.26), in particular, the sets cover . Thus (2.33) implies that to any linear subspace of dimension at most one can find some index such that \xi_{n}\Bigl{(}f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta)\setminus\overline{V}(\varepsilon_{r-1})\bigr{)}\cap R_{k})\Bigr{)}\,>\,(\overline{\alpha}_{r}-\alpha_{r-1})/K_{r}. Using this, one obtains inductively subsets of such that for all ,
[TABLE]
with particular linear subspaces given by
[TABLE]
where denotes the average of over w.r.t. analogously to (2.13). For each by (2.34), firstly, for all and hence also for the mean since the set is convex. Secondly, for all , i.e., for all . Thirdly, , hence for all which implies for all . Using the inequality \big{|}{\rm dist}(f_{\theta_{n}}(x),W_{j-1})-{\rm dist}\bigl{(}\overline{f}_{\theta_{n}}(S_{j},\xi_{n}),W_{j-1}\bigr{)}\big{|}\leq\|f_{\theta_{n}}(x)-\overline{f}_{\theta_{n}}(S_{j},\xi_{n})\| one gets, choosing any ,
[TABLE]
By (2.36) together with (M4) of the appendix the matrix F:=\bigl{[}\overline{f}_{\theta_{n}}(S_{1},\xi_{n}),\ldots,\overline{f}_{\theta_{n}}(S_{r},\xi_{n})\bigr{]} satisfies
[TABLE]
For each , by ,
[TABLE]
Consider the matrix B:=\bigl{[}b_{1},\ldots,b_{r}\bigr{]}. Since one has . By (2.38) , , and hence, using property (2.30) of and (2.37),
[TABLE]
In particular, and the vectors are linearly independent and form thus a basis of the linear subspace . Now suppose, contrary to the assertion of Step 2, that x_{n+1}\in f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta)\bigr{)}. Then
[TABLE]
Since constitute a basis of and B=\bigl{[}b_{1},\ldots,b_{r}\bigr{]}, one has for some . In fact, is uniquely determined by . Since and one has, according to the definition of in (2.25), that . Together with (2.38),
[TABLE]
[TABLE]
Define . Hence
[TABLE]
Let . Then by property (2.28) of , and by (2.39) . So, by Lemma 2.2,
[TABLE]
Observing that b\mapsto\bigl{(}b^{\sf\scriptsize T}M^{-1}(\xi_{n},\theta_{n})\,b\bigr{)}^{1/2}, , is a norm on and using the definition of the vector ,
[TABLE]
For each , one gets by (M1) and (M2) of the appendix, where the sums below are taken over ,
[TABLE]
where the last inequality is due to (2.35). Hence by (2.41) and by ,
[TABLE]
and together with (2.40) one gets
[TABLE]
Observing that the r.h.s. of (2.42) equals the reciprocal of the l.h.s. of (2.29), it follows from (2.29) that
[TABLE]
and hence by (2.42)
[TABLE]
which is a contradiction to (2.32) derived above. So our supposition that x_{n+1}\in f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta)\bigr{)} was wrong. Hence the result of Step 2 follows.
Step 3. For from the previous Steps 1 and 2, let and be chosen according to (T), where we may assume that . Recall that according to (2.27). Choose an such that . Let be any -dimensional linear subspace of . Consider the set
[TABLE]
By the result of Step 2 and by S\subseteq f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta)\bigr{)} for all , we have:
[TABLE]
So the sequence
[TABLE]
satisfies the assumptions of Lemma 2.1, and hence by that lemma for all n\geq m_{1}=m_{1}\bigl{(}\beta,\widetilde{\beta},m_{0}(\delta)\bigr{)}. Since does not depend on the particular choice of we have thus obtained that for all linear subspaces of dimension one has
[TABLE]
According to (T) we have such that f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta^{\prime})\bigr{)}\subseteq\bigcap_{\ell\geq m_{0}(\delta)}f_{\theta_{\ell}}^{-1}\bigl{(}\overline{V}_{r}(\delta)\bigr{)} for all and all linear subspaces of dimension . Hence \xi_{n}\Bigl{(}f_{\theta_{n}}^{-1}\bigl{(}\overline{V}_{r}(\delta^{\prime})\bigr{)}\Bigr{)}\leq\alpha_{r} for all and all linear subspaces of dimension , which is statement with , , and as obtained. So the induction step has been completed.
Corollary 2.7
*(i) If (T) is satisfied then the asymptotic nonsingularity (2.1) holds.
(ii) If (GLM) is satisfied then the stronger asymptotic nonsingularity (2.2) holds.*
Proof. Using a well-known representation of the smallest eigenvalue of a symmetric matrix we can write
[TABLE]
For any , , we denote by the -dimensional subspace of given by V_{p-1,c}\,=\,\bigl{\{}a\in\mathbb{R}^{p}\,:\,c^{\sf\scriptsize T}a=0\bigr{\}}. Assume (T). Let , , and be chosen according to Theorem 2.6. Then by the theorem, observing that f^{-1}_{\theta_{n}}\bigl{(}\overline{V}_{p-1,c}(\varepsilon)\bigr{)}\,=\,\bigl{\{}x\in{\cal X}\,:\,|c^{\sf\scriptsize T}f_{\theta_{n}}(x)|\leq\varepsilon\bigr{\}}, we have for all and all with ,
[TABLE]
Denote S_{n,c}\,=\,\bigl{\{}x\in{\cal X}\,:\,|c^{\sf\scriptsize T}f_{\theta_{n}}(x)|>\varepsilon\bigr{\}}. Then for all and all with one has and hence
[TABLE]
and together with (2.43), $\lambda_{\rm min}\bigl(M(\xi_{n},\theta_{n})\bigr)\geq\widetilde{\lambda}_{0}$ for all $n\geq n_{0}$. So, in the case $n_{0}>n_{\rm st}$, a positive real constant $\lambda_{0}$ is given by $\lambda_{0}:=\min\bigl\{\widetilde{\lambda}_{0},\lambda_{\rm min}\bigl(M(\xi_{n},\theta_{n})\bigr):\,n_{\rm st}\leq n<n_{0}\bigr\}$. In the case $n_{0}=n_{\rm st}$ choose $\lambda_{0}:=\widetilde{\lambda}_{0}$. In any case, with that constant $\lambda_{0}$ the asymptotic nonsingularity (2.1) holds. Now assume (GLM). By Lemma 2.5 (i), condition (T) is satisfied as well and hence, as already proved, the asymptotic nonsingularity (2.1) holds with some $\lambda_{0}>0$. For all and all one has . Consider the positive real numbers and from (2.15). Then, for all and trivially . Hence for any design one has for all . In particular, one has for all and . It follows that
[TABLE]
So the stronger asymptotic nonsingularity (2.2) holds with instead of .
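The proof starts from the well-known variational representation of the smallest eigenvalue, $\lambda_{\rm min}(A)=\min_{\|c\|=1}c^{\sf T}A\,c$. As a quick numerical illustration for a symmetric $2\times 2$ matrix, the sketch below compares the closed-form smallest eigenvalue with the minimum of the quadratic form over a fine grid of unit vectors $c=(\cos t,\sin t)^{\sf T}$. The matrix entries and the grid size are arbitrary assumptions for illustration.

```python
import math

def lambda_min(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)

def quadratic_form(a, b, c, t):
    """c^T A c for the unit vector c = (cos t, sin t)."""
    u, v = math.cos(t), math.sin(t)
    return a * u * u + 2.0 * b * u * v + c * v * v

a, b, c = 1.0, 0.8, 2.2
lam = lambda_min(a, b, c)
# Unit vectors c and -c give the same value, so angles in [0, pi) suffice.
grid_min = min(quadratic_form(a, b, c, math.pi * k / 10000) for k in range(10000))
```

The grid minimum can never fall below `lam` (up to rounding) and approaches it as the grid is refined, in line with the representation used in the proof.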
As a further consequence from Theorem 2.6 and Corollary 2.7 we can derive a convergence result as in Pronzato [11], Lemma 2 and Theorem 2, and Freise [6], Lemma 18. If the sequence of parameter points $\theta_{n}$ converges to some parameter point $\overline{\theta}\in\Theta$ then the design sequence $\xi_{n}$ is asymptotically locally D-optimal at $\overline{\theta}$, in the sense that the sequence of information matrices $M(\xi_{n},\overline{\theta})$ converges to the information matrix $M(\xi^{*}_{\overline{\theta}},\overline{\theta})$ of a locally D-optimal design $\xi^{*}_{\overline{\theta}}$ at $\overline{\theta}$. For later reference (see Section 3), the next theorem states the convergence of the information matrices to $M(\xi^{*}_{\overline{\theta}},\overline{\theta})$ for any sequence of parameter points converging to $\overline{\theta}$, provided that the sequence $\theta_{n}$ employed by the algorithm converges to $\overline{\theta}$. Of course, in the linear model case, where $f_{\theta}$ is identical for all $\theta$, we retrieve the classical result of Wynn [16], Theorem 1.
Theorem 2.8
*If for some then for every sequence , , such that
one has*
[TABLE]
where denotes a locally D-optimal design at , i.e., maximizes \det\bigl{(}M(\xi,\overline{\theta})\bigr{)} over the set of all designs .
Proof. The matrix-valued function is uniformly continuous on its compact domain . So, for any sequence converging to , observing that for all and ,
[TABLE]
Hence
[TABLE]
and, in particular,
[TABLE]
Consider from (2.7). For any design and any we have
[TABLE]
By Lemma 2.5 (ii) and Corollary 2.7 (i) there is a $\lambda_{0}>0$ satisfying (2.1). Let be the set of all nonnegative definite matrices such that and . Clearly, is compact and for all , and by (2.45) there is an such that for all . Define a real-valued function on by
[TABLE]
which is continuous and hence uniformly continuous on its compact domain . So, together with (2.45),
[TABLE]
In what follows let an be given. By (2.46) and by the definition of the function , there is an such that
[TABLE]
This yields, in particular,
(i) : ,
since for all , denoting ,
[TABLE]
The second inequality in (i) is well-known from the Kiefer-Wolfowitz Equivalence Theorem. The rest of the proof employs the arguments of Pronzato [11] in the proof of Lemma 3 of that paper. For convenience we report here the main steps labelled below by (ii) - (v).
(ii) One can choose such that for all
[TABLE]
To see this we note that and by a well-known formula of determinants,
[TABLE]
By (i) for the expression (2.47) is greater than or equal to
[TABLE]
where we have used that with , as . Choose such that for all . Then for all ,
[TABLE]
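The determinant formula invoked in step (ii) is presumably the rank-one update identity $\det(A+gg^{\sf T})=\det(A)\,(1+g^{\sf T}A^{-1}g)$. Applied to $M_{n+1}=\frac{n}{n+1}M_{n}+\frac{1}{n+1}\,g\,g^{\sf T}$ it gives $\log\det(M_{n+1})-\log\det(M_{n})=p\log\frac{n}{n+1}+\log\bigl(1+\frac{1}{n}\,g^{\sf T}M_{n}^{-1}g\bigr)$. The sketch below checks this numerically for $p=2$ in plain Python. The matrix, the vector, and $n$ are arbitrary illustrative values.

```python
import math

def det2(m):
    """Determinant of the symmetric 2x2 matrix with entries (M11, M12, M22)."""
    a, b, c = m
    return a * c - b * b

def rank_one_update(m, g, n):
    """(n * M + g g^T) / (n + 1): the matrix after adding one point to n."""
    a, b, c = m
    return ((n * a + g[0] * g[0]) / (n + 1),
            (n * b + g[0] * g[1]) / (n + 1),
            (n * c + g[1] * g[1]) / (n + 1))

m_n = (1.0, 0.8, 2.2)      # stand-in for M(xi_n, theta), positive definite
g = (0.5, -1.5)            # stand-in for f_theta(x_{n+1})
n, p = 10, 2
a, b, c = m_n
d = (c * g[0] ** 2 - 2.0 * b * g[0] * g[1] + a * g[1] ** 2) / det2(m_n)  # g^T M^{-1} g
lhs = math.log(det2(rank_one_update(m_n, g, n))) - math.log(det2(m_n))
rhs = p * math.log(n / (n + 1)) + math.log(1.0 + d / n)
```

Both sides agree up to rounding, which shows how the increment of the log-determinant is governed by the quadratic form at the newly added point.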
(iii) One can choose such that for all
[TABLE]
This follows from and by choosing such that c_{n}\leq\varepsilon\bigl{(}1-\frac{n+p+\varepsilon}{2n}\bigr{)} for all .
(iv) Denote \Psi^{*}:=\log\det\bigl{(}M(\xi^{*}_{\overline{\theta}},\overline{\theta})\bigr{)}. If and \log\det\bigl{(}M(\xi_{n},\overline{\theta})\bigr{)}\,\leq\ \Psi^{*}-2\varepsilon then
[TABLE]
This can be seen as follows. By the gradient inequality for the concave criterion ,
[TABLE]
where the last inequality comes from (i). Hence it follows that and together with (2.47) one gets
[TABLE]
where the last inequality comes from (iii).
(v) One can choose such that for all
[TABLE]
To see this, note that by (iv) there is some such that \log\det\bigl{(}M(\xi_{n_{3}},\overline{\theta})\bigr{)}>\Psi^{*}-2\varepsilon, since otherwise (iv) would yield that \log\det\bigl{(}M(\xi_{n},\overline{\theta})\bigr{)}\longrightarrow\infty as , which is a contradiction. By (ii) and (iv), the sequence a_{n}:=\log\det\bigl{(}M(\xi_{n},\overline{\theta})\bigr{)}, , has the following properties.
[TABLE]
Thus, obviously, for all , which is (v).
From (v) we get
[TABLE]
Since was arbitrary we get \liminf_{n\to\infty}\log\det\bigl{(}M(\xi_{n},\overline{\theta})\bigr{)}\,\geq\,\Psi^{*} and hence \lim_{n\to\infty}\log\det\bigl{(}M(\xi_{n},\overline{\theta})\bigr{)}\,=\,\Psi^{*}. This implies , since by strict concavity of the criterion the information matrix at of a locally D-optimal design at is unique. That is, denoting by the set of all designs and {\cal M}_{\overline{\theta}}:=\bigl{\{}M(\xi,\overline{\theta})\,:\,\mbox{\xi\in\Xi }\bigr{\}}, the set of all information matrices of designs at , the information matrix is the unique point in such that . So for any one has by compactness and continuity
[TABLE]
So, \lim_{n\to\infty}\log\det\bigl{(}M(\xi_{n},\overline{\theta}))\,=\log\det(M^{*}) implies . If is any sequence converging to then by (2.44) .
3 Adaptive Wynn-algorithm in univariate GLM
Now we focus on the adaptive character of the algorithm. The sequence of parameter points , , employed is given by parameter estimates based on the data available at the current stage , which are the design points and the observed values of a univariate response variable. We assume a (nonlinear) regression model with expected univariate responses , where and . The function is assumed to be continuous and, as in the previous sections, the experimental region and the parameter space are compact metric spaces. Again, for the algorithm we assume a family , , of -valued functions on defining the information matrices of designs by (1.1) and having the properties that for each the image spans , and the function is continuous on . The adaptive Wynn-algorithm sequentially generates data where is the observed (univariate) response at the design point () and the employed sequence , , is given by adaptive parameter estimates, , . In particular, the values of the response variable as well as the generated values of the design variable are random and hence they are modelled by random variables and . The sequential and adaptive character of the data is caught by the ‘adaptive regression model’ formulated and discussed in Subsection 3.1 below. For theoretical investigations on consistency or asymptotic distribution of estimators it will be convenient to distinguish between the true (but unknown) parameter point and any possible parameter point to be considered. So throughout this section, denotes the fixed true parameter point governing the random variables.
3.1 Adaptive regression model.
An appropriate model for the adaptive character of the sequences of random variables X_{i} and Y_{i}, i\in\mathbb{N}, is provided by the following assumptions (A1) and (A2), cf. Lai [9], Sec. 1, or Chen, Hu, and Ying [4], Sec. 3. Note that all the random variables are defined on some probability space (\Omega,{\cal F},\mathbb{P}_{\overline{\theta}}), where \Omega is a nonempty set, {\cal F} is a sigma-field of subsets of \Omega, and \mathbb{P}_{\overline{\theta}} is a probability measure on {\cal F} corresponding to the true parameter point \overline{\theta}.
- (A1)
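There is given a nondecreasing sequence {\cal F}_{0}\subseteq{\cal F}_{1}\subseteq\ldots\subseteq{\cal F}_{i}\subseteq\ldots of sub-sigma-fields of {\cal F}, such that for each i\in\mathbb{N} the random variable X_{i} is {\cal F}_{i-1}-measurable and the random variable Y_{i} is {\cal F}_{i}-measurable.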
- (A2)
Y_{i}\,=\,\mu(X_{i},\overline{\theta})+e_{i} for all i\in\mathbb{N}, with real-valued square integrable random errors e_{i} such that
{\rm E}\bigl(e_{i}\,\big|\,{\cal F}_{i-1}\bigr)\,=\,0\ \mbox{a.s.} for all i\in\mathbb{N}, and \sup_{i\in\mathbb{N}}{\rm E}\bigl(e_{i}^{2}\,\big|\,{\cal F}_{i-1}\bigr)\,<\infty\ \mbox{a.s.}
As an illustration of the sub-sigma-fields , , suppose that the starting design of the algorithm was chosen deterministically, i.e., are constants, and suppose further that for all there is no ambiguity in choosing the maximizer in (1.4) given the values of and thus given the value of . Then for all the random variable is a function of . So one can employ the particular sigma-fields generated by the random variables , for all , and the minimal sigma-field in . We note that no further relation is assumed so far between the mean response function and the family of functions , , of the algorithm, whereas a particular relation will be employed in the next subsection.
The following lemma presents some auxiliary asymptotic results derived from martingale limit theorems. If , , is a sequence of -valued random variables and is an -valued random variable, the notation stands for almost sure convergence of the sequence to (as ). For real-valued we will also use the notation for indicating almost sure convergence (or ‘divergence’) to infinity.
Lemma 3.1
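Under (A1) and (A2) the following (a), (b), and (c) hold.
(a) \limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}|e_{i}|\,<\infty\ \mbox{a.s.}
(b) Let Z_{i}, i\in\mathbb{N}, be a sequence of real-valued square integrable random variables such that Z_{i} is {\cal F}_{i-1}-measurable for all i\in\mathbb{N} and \sup_{i\in\mathbb{N}}|Z_{i}|\,<\infty\ \mbox{a.s.} Then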
[TABLE]
(c) Let h:{\cal X}\times\Theta\to\mathbb{R} be a continuous function. Then
[TABLE]
Proof.
(a) Denote W_{i}\,:=\,|e_{i}|-{\rm E}\bigl(|e_{i}|\,\big|\,{\cal F}_{i-1}\bigr), i\in\mathbb{N}. It is easily seen that the sequence of partial sums \sum_{i=1}^{n}W_{i}, n\in\mathbb{N}, is a martingale w.r.t. {\cal F}_{n}, n\in\mathbb{N}. Since
[TABLE]
one has by (A2) \sup_{i\in\mathbb{N}}{\rm E}\bigl{(}W_{i}^{2}\big{|}{\cal F}_{i-1}\bigr{)}\,<\infty\ \,\mbox{a.s.} and hence \sum_{i=1}^{\infty}i^{-2}{\rm E}\bigl{(}W_{i}^{2}\big{|}{\cal F}_{i-1}\bigr{)}\,<\infty\ \,\mbox{a.s.} By Theorem 2.18 of Hall and Heyde [8], , i.e.,
[TABLE]
By (A2) and Jensen’s inequality \sup_{i\in\mathbb{N}}{\rm E}\bigl{(}|e_{i}|\,\big{|}{\cal F}_{i-1}\bigr{)}\,<\infty\ \,\mbox{a.s.} from which one gets
(b) As it is easily seen, the sequence , , is a martingale w.r.t. , . By assumption there are two real random variables and such that U_{1}\,=\,\sup_{i\in\mathbb{N}}{\rm E}\bigl{(}e_{i}^{2}\,\big{|}\,{\cal F}_{i-1}\bigr{)} a.s. and a.s. Hence
[TABLE]
So \sum_{i=1}^{\infty}i^{-2}\,{\rm E}\bigl((Z_{i}e_{i})^{2}\,\big|\,{\cal F}_{i-1}\bigr)\,<\infty\ \mbox{a.s.} and the result follows from Theorem 2.18 of Hall and Heyde [8].
(c) Fix any . By compactness of and continuity of there exist a finite number and nonempty, pairwise disjoint, and measurable subsets of such that and for all and all , . Choose any points , , and denote , , . Then
[TABLE]
Introduce zero-one-valued random variables Z_{i}^{(j)}:=\mathbb{1}\bigl(X_{i}^{-1}(R_{j})\bigr), , , i.e., Z_{i}^{(j)} yields the value 1 if the value of X_{i} is in R_{j}, and otherwise yields the value 0. Abbreviate . Clearly, , and by (3.1) for all , , and ,
[TABLE]
where if and if , for any real number . Hence by summation in (3.2) over and ,
[TABLE]
Denote , which is finite and, clearly, for all and all . Hence
[TABLE]
and thus
[TABLE]
Applying parts (a) and (b) of the lemma,
[TABLE]
where , which is almost surely finite. Since was arbitrary the result follows.
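The martingale averaging behind Lemma 3.1 can be illustrated numerically. The sketch below is a toy example and not part of the paper: it assumes i.i.d. standard normal errors e_i (so that (A2) holds trivially) and a hypothetical bounded weight Z_i = tanh(S_{i-1}/sqrt(i)), where S_{i-1} is the sum of the past errors, so that Z_i depends only on information available before e_i is observed. Under these assumptions the running averages (1/n) sum Z_i e_i should drift towards zero.

```python
import numpy as np

def predictable_weighted_mean(n, seed=0):
    """Running averages (1/n) * sum_{i<=n} Z_i e_i for predictable, bounded Z_i."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n)                      # assumed errors, mean zero
    # S_{i-1} = e_1 + ... + e_{i-1}, with S_0 = 0 (uses past errors only)
    past = np.concatenate(([0.0], np.cumsum(e)[:-1]))
    idx = np.arange(1, n + 1)
    z = np.tanh(past / np.sqrt(idx))                # hypothetical weights, |Z_i| <= 1
    return np.cumsum(z * e) / idx

means = predictable_weighted_mean(200_000)
print(abs(means[999]), abs(means[-1]))
```

The weights are deliberately correlated with the history of the process; what matters for the argument is only that each Z_i is determined before e_i is drawn.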
3.2 Adaptive GLM and ML-estimators
Now we specialize to an ‘adaptive generalized linear model’ as follows. The parameter space is a compact subset of provided with the usual Euclidean metric, the mean response function is of the form
[TABLE]
where is a given continuous function whose range spans and is a given continuously differentiable function on an open interval with \bigl{\{}f^{\sf\scriptsize T}(x)\,\theta\,:\,(x,\theta)\in{\cal X}\times\Theta\bigr{\}}\subseteq I and whose derivative is positive, for all . The function is the inverse of the link function of the generalized linear model and , , is the linear predictor. Note that an interval may be unbounded from below or from above or both, where in the latter case the interval is the whole real line. Assumption (A2) is strengthened by an assumption (A2’) below, stating that the conditional distribution of given belongs to a one-parameter exponential family of distributions , , where is an open interval. We employ the canonical (or ‘natural’) parametrization of the one-parameter exponential family where is its canonical parameter. So , , are probability distributions on the Borel sigma-field of the real line with densities w.r.t. some Borel-measure ,
[TABLE]
where is a nonnegative measurable function on and is a real-valued function on , which is infinitely often differentiable, see e.g. Fahrmeir and Kaufmann [5], Section 2. In particular, the first and second derivatives of give the expectation and the variance of the distribution , resp., and for all . So the derivative is a smooth and strictly increasing function and hence a bijection, where is the open interval of all expectations , . The inverse assigns to each expectation the parameter value of the exponential family.
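As a quick sanity check of the statement that the first and second derivatives of b give the expectation and the variance, the following sketch compares finite differences of b with the known moments for two standard families. The choices b(tau) = log(1 + e^tau) for the Bernoulli family and b(tau) = e^tau for the Poisson family are the usual canonical forms and are assumptions of this example, as are the helper names.

```python
import numpy as np

def b_bernoulli(tau):
    return np.log1p(np.exp(tau))      # cumulant function of the Bernoulli family

def b_poisson(tau):
    return np.exp(tau)                # cumulant function of the Poisson family

def d1(b, tau, h=1e-5):
    """Central finite difference for b'(tau)."""
    return (b(tau + h) - b(tau - h)) / (2 * h)

def d2(b, tau, h=1e-4):
    """Central finite difference for b''(tau)."""
    return (b(tau + h) - 2 * b(tau) + b(tau - h)) / h**2

tau = 0.7
p = 1.0 / (1.0 + np.exp(-tau))        # Bernoulli mean at canonical parameter tau
lam = np.exp(tau)                     # Poisson mean (and variance) at tau

print(d1(b_bernoulli, tau), p)              # mean
print(d2(b_bernoulli, tau), p * (1 - p))    # variance
print(d1(b_poisson, tau), d2(b_poisson, tau), lam)
```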
Now, we assume the following (A2’) which is stronger than assumption (A2) from Subsection 3.1. Recall that denotes the fixed true parameter point.
- (A2’)
The values of the inverse link function are contained in , i.e., .
For each the conditional distribution of given is equal to where
\overline{\tau}_{i}=(b^{\prime})^{-1}\bigl(G(f^{\sf T}(X_{i})\,\overline{\theta})\bigr).
For the notion of a conditional distribution of a real-valued random variable given a sub-sigma-field we refer to [3], p. 77, Definition 4.29. Note that has finite moments m_{k}(\tau)={\rm E}_{P_{\tau}}\bigl{(}Y^{k}\bigr{)} of any order , and is a continuous function of . Assumption (A2’) together with (A1) imply the following. Firstly, {\rm E}\bigl{(}Y_{i}\big{|}\,{\cal F}_{i-1}\bigr{)}=m_{1}(\overline{\tau}_{i})=G\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}. So, e_{i}=Y_{i}-G\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}, , satisfy and {\rm E}\bigl{(}e_{i}\big{|}\,{\cal F}_{i-1}\bigr{)}=0, with from (3.3). Secondly, {\rm E}\bigl{(}e_{i}^{2}\big{|}\,{\cal F}_{i-1}\bigr{)}=m_{2}(\overline{\tau}_{i})-\bigl{(}m_{1}(\overline{\tau}_{i})\bigr{)}^{2} for all , and since the values of all are contained in some compact subinterval of one has {\rm E}\bigl{(}e_{i}^{2}\big{|}\,{\cal F}_{i-1}\bigr{)}\leq C_{2} a.s. for all for some real constant . A similar conclusion holds for higher conditional moments of , e.g. consider fourth moments:
[TABLE]
where and G^{0}\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}=1. The values of all the , , are in the compact experimental region , and hence all the random variables , , have their values in some compact subinterval of . It follows that
[TABLE]
for some real constant . To summarize: assumption (A2’) together with (A1) imply (A2) and, moreover, (3.5). Obviously, this is due to the compactness of the experimental region (and the continuity of ). Compactness of the parameter space , however, is not needed here since (A2’) as well as (A2) are local conditions at the true parameter point .
Fisher information matrices in a generalized linear model with univariate response whose observations follow a one-parameter exponential family were derived in Atkinson and Woods [2], formula (13.3) on p. 473, and also for the multivariate case in Fahrmeir and Kaufmann [5], p. 347. Accordingly, we employ the following assumption (A3’) on the family of functions , , defining the information matrices of designs via (1.1).
- (A3’)
f_{\theta}(x)\,=\,\varphi\bigl(f^{\sf T}(x)\,\theta\bigr)\,f(x) for all x\in{\cal X}, \theta\in\Theta,
where \varphi(u)\,=\,G^{\prime}(u)\Big/\sqrt{b^{\prime\prime}\Bigl((b^{\prime})^{-1}\bigl(G(u)\bigr)\Bigr)}, u\in I,
and where f is a given continuous \mathbb{R}^{p}-valued function on {\cal X} whose range spans \mathbb{R}^{p}. In particular, by (A3’) the family f_{\theta}, \theta\in\Theta, satisfies condition (GLM) from Section 2.
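To make the weight function in (A3’) concrete, the sketch below evaluates \varphi for two standard cases that are assumptions of this example and are not singled out in the text: the Bernoulli family with logit link, where \varphi(u) = sqrt(p(1-p)) with p = 1/(1+e^{-u}), and the Poisson family with log link, where \varphi(u) = e^{u/2}. The generic helper follows the formula in (A3’) literally; the regression function f(x) = (1, x)^T and all function names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def phi_generic(u, G, G_prime, b_prime_inv, b_second):
    """phi(u) = G'(u) / sqrt(b''((b')^{-1}(G(u)))), as in (A3')."""
    return G_prime(u) / np.sqrt(b_second(b_prime_inv(G(u))))

def phi_logistic(u):
    # Bernoulli family, canonical (logit) link: phi(u) = sqrt(p (1 - p))
    p = sigmoid(u)
    return np.sqrt(p * (1.0 - p))

def phi_poisson_log(u):
    # Poisson family, canonical (log) link: phi(u) = exp(u / 2)
    return np.exp(u / 2.0)

def f_theta(x, theta, phi=phi_logistic):
    """f_theta(x) = phi(f(x)^T theta) f(x) with the assumed f(x) = (1, x)^T."""
    fx = np.array([1.0, x])
    return phi(fx @ theta) * fx

u = np.linspace(-2.0, 2.0, 9)
generic = phi_generic(
    u,
    G=sigmoid,
    G_prime=lambda v: sigmoid(v) * (1.0 - sigmoid(v)),
    b_prime_inv=lambda m: np.log(m / (1.0 - m)),         # inverse of b' = sigmoid
    b_second=lambda t: sigmoid(t) * (1.0 - sigmoid(t)),  # b'' for b = log(1 + e^t)
)
print(np.max(np.abs(generic - phi_logistic(u))))
```

For a canonical link one has G = b', so the composition (b')^{-1}(G(u)) is the identity and \varphi reduces to sqrt(b''(u)); both closed forms above are instances of this.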
In what follows we focus on the asymptotics of adaptive maximum likelihood (ML) estimators. Note, however, that the adaptive estimators , , employed by the algorithm may or may not be given by the adaptive ML-estimators , . The algorithm may employ any reasonable adaptive estimators , , e.g., the adaptive maximum quasi-likelihood estimators studied by Chen, Hu, and Ying [4] in the case that the function is defined on the whole real line, . See also our remark below following Corollary 3.2. The main topics studied are strong consistency of the adaptive ML-estimators, i.e., almost-sure convergence to the true parameter point , and asymptotic normality. Strong consistency of the estimators , , employed by the algorithm implies almost-sure asymptotic local D-optimality at of the design sequence generated by the algorithm, which is an immediate consequence from Theorem 2.8. Note that the corollary does not need any of the assumptions (A1), (A2), (A2’), or (A3’).
Corollary 3.2
If then for any sequence of estimators such that one has , where is a locally D-optimal design at .
Remark. Under assumptions (A1), (A2’), and (A3’), in the case the adaptive maximum quasi-likelihood estimators studied by Chen, Hu, and Ying [4] turn out to be strongly consistent. In fact, by Corollary 2.7, part (ii), and by (3.5) one easily verifies the assumptions of Theorem 2 in [4] for the adaptive design sequence generated by the algorithm, irrespective of the employed sequence of adaptive estimators , in the algorithm. Our next result establishes strong consistency of the adaptive maximum likelihood estimators, again irrespective of the employed sequence of estimators , in the algorithm.
Assuming (A1) and (A2’), an adaptive ML-estimator , for , is a maximizer of the log-likelihood
[TABLE]
Note that with probability equal to one, for all . Thus positivity of , , is assumed for the log-likelihood (3.6). Note also that for the canonical link one gets and hence . The following result gives the strong consistency of the adaptive ML-estimators.
Theorem 3.3
Under assumptions (A1), (A2’), and (A3’), one has .
Proof. For all one gets from (3.6) and (3.7), observing that and Y_{i}=G\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}+e_{i},
[TABLE]
For each , by second order Taylor expansion of at ,
[TABLE]
with some from the interval whose end points are given by and . Since , (3.8) rewrites as
[TABLE]
By compactness of and and continuity of there is a compact subinterval such that for all , . Since and are increasing functions, d_{1}\leq(b^{\prime})^{-1}\bigl{(}G(f^{\sf\scriptsize T}(x)\,\theta)\bigr{)}\leq d_{2} for all , , where d_{j}=(b^{\prime})^{-1}\bigl{(}G(c_{j})\bigr{)}, , and . In particular, for all and . By continuity and positivity of the minimum exists and . Hence, in (3.9), b^{\prime\prime}\bigl{(}\widetilde{\tau}_{i}(\theta)\bigr{)}\geq\beta_{0} for all and . Since the composition (b^{\prime})^{-1}\bigl{(}G(u)\bigr{)}, , is a continuously differentiable function with positive derivative H(u)\,=\,\frac{\rm d}{{\rm d}u}\bigl{[}(b^{\prime})^{-1}\bigl{(}G(u)\bigr{)}\bigr{]}, , it follows that exists and . By the mean value theorem
[TABLE]
From (3.7) and (3.9) it follows that
[TABLE]
As in Wu [15], Lemma 1, strong consistency of will follow if we prove that for every such that the parameter subset C(\overline{\theta},\delta)=\bigl{\{}\theta\in\Theta\,:\,\|\theta-\overline{\theta}\|\geq\delta\bigr{\}} is nonempty, one has
[TABLE]
In fact, the turns out to be equal to infinity almost surely, since we show that
[TABLE]
By (3.10) and the trivial inequality for all real ,
[TABLE]
Since for all and , with h(x,\theta)=(b^{\prime})^{-1}\bigl(G(f^{\sf T}(x)\,\overline{\theta})\bigr)-(b^{\prime})^{-1}\bigl(G(f^{\sf T}(x)\,\theta)\bigr), , it follows by Lemma 3.1, part (c), that
[TABLE]
It remains to show that the of the second term on the r.h.s. of (3.12) is positive almost surely. Consider an arbitrary path of the adaptive process and, in particular, a path , , of the sequence of random variables , . With the generated design sequence , , we can write, for all ,
[TABLE]
For any , , denote and V_{p-1,\theta}=\bigl{\{}a\in\mathbb{R}^{p}\,:\,c^{\sf\scriptsize T}_{\theta}a=0\bigr{\}}. By Theorem 2.6 there exist , , and such that
[TABLE]
Using (A3’) and for all , let
[TABLE]
hence . Define . As in the proof of Lemma 2.5, part (i), one gets
[TABLE]
Note that f^{-1}\bigl{(}\overline{V}_{p-1,\theta}(\varepsilon^{\prime})\bigr{)}\,=\,\bigl{\{}x\in{\cal X}\,:\,|c^{\sf\scriptsize T}_{\theta}f(x)|\leq\varepsilon^{\prime}\bigr{\}}. Taking the complementary sets and observing that is equivalent to \big{|}f^{\sf\scriptsize T}(x)\,(\theta-\overline{\theta})\big{|}>\varepsilon^{\prime}\|\theta-\overline{\theta}\|, which in the case implies \big{|}f^{\sf\scriptsize T}(x)\,(\theta-\overline{\theta})\big{|}>\varepsilon^{\prime}\delta, we have
[TABLE]
Hence
[TABLE]
Together with (3.14), (3.13), and (3.12) the proof of (3.11) is complete and follows.
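The interplay of adaptive estimation and design selection can be tried out in a small simulation. The following is a minimal sketch under assumptions that are not taken from the paper: a logistic model with f(x) = (1, x)^T, a grid on [-3, 3] as experimental region, the box [-3, 3]^2 as compact parameter space, a true parameter (0, 1), and a six-point starting design. The ML step is a plain Newton iteration with clipping to the box and a tiny ridge term for numerical stability, which is a simplification rather than an exact constrained maximizer. The selection step maximizes f_\theta^T(x) M^{-1}(\xi_n, \theta_n) f_\theta(x) over the grid, as in the classical Wynn-algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)          # fixed seed, illustration only
grid = np.linspace(-3.0, 3.0, 61)       # assumed experimental region
theta_true = np.array([0.0, 1.0])       # assumed true parameter point
BOX = 3.0                               # assumed parameter space [-3, 3]^2

def f(x):
    """Regression vectors f(x) = (1, x)^T, stacked as rows."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x), x])

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def info_matrix(xs, theta):
    """M(xi_n, theta) = (1/n) sum phi^2(f^T(x_i) theta) f(x_i) f^T(x_i)."""
    F = f(xs)
    p = sigmoid(F @ theta)
    return (F * (p * (1 - p))[:, None]).T @ F / len(xs)

def ml_estimate(xs, ys, steps=10):
    """Newton steps from zero, clipped to the box (a sketch, not exact ML)."""
    theta, F = np.zeros(2), f(xs)
    for _ in range(steps):
        p = sigmoid(F @ theta)
        score = F.T @ (ys - p)
        hess = (F * (p * (1 - p))[:, None]).T @ F + 1e-8 * np.eye(2)
        theta = np.clip(theta + np.linalg.solve(hess, score), -BOX, BOX)
    return theta

def adaptive_wynn(n_total=600):
    xs = np.array([-2.0, 0.0, 2.0, -2.0, 0.0, 2.0])   # starting design
    ys = rng.binomial(1, sigmoid(f(xs) @ theta_true)).astype(float)
    Fg = f(grid)
    while len(xs) < n_total:
        theta = ml_estimate(xs, ys)                    # adaptive estimate
        Minv = np.linalg.inv(info_matrix(xs, theta))
        pg = sigmoid(Fg @ theta)
        # sensitivity f_theta(x)^T M^{-1} f_theta(x) on the grid
        d = pg * (1 - pg) * np.einsum("ij,jk,ik->i", Fg, Minv, Fg)
        x_new = grid[np.argmax(d)]
        y_new = rng.binomial(1, sigmoid(f(x_new) @ theta_true)[0])
        xs, ys = np.append(xs, x_new), np.append(ys, float(y_new))
    return xs, ys, ml_estimate(xs, ys)

xs, ys, theta_hat = adaptive_wynn()
print(theta_hat, np.linalg.det(info_matrix(xs, theta_true)))
```

A single run of this kind only illustrates the behaviour described by Theorem 3.3; it is not evidence for it. Early estimates may sit on the boundary of the box when the responses are separated, which is one reason the compactness of the parameter space is convenient in practice.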
The next result shows the asymptotic normality of adaptive ML-estimators if the true parameter point is an interior point of the parameter space , i.e., there exists a such that
[TABLE]
The p-dimensional normal distribution with expectation 0 and (positive definite) covariance matrix C is denoted by {\rm N}(0,C). For C=I_{p}, the p\times p identity matrix, {\rm N}(0,I_{p}) is the p-dimensional standard normal distribution. For a sequence W_{n} of \mathbb{R}^{p}-valued random variables, convergence in distribution of W_{n} (as n\to\infty) to a p-dimensional normal distribution {\rm N}(0,C) is abbreviated by W_{n}\stackrel{\rm d}{\longrightarrow}{\rm N}(0,C). In the next theorem, the assumption of strong consistency of the adaptive estimators employed by the algorithm is met, by Theorem 3.3, if the algorithm employs the adaptive ML-estimators.
Theorem 3.4
Assume (A1), (A2’), and (A3’). Assume further that G is twice continuously differentiable, \overline{\theta} is an interior point of \Theta, and the adaptive estimators employed by the algorithm are strongly consistent. Then:
[TABLE]
Also, denoting by M_{*}=M\bigl(\xi^{*}_{\overline{\theta}},\overline{\theta}\bigr) the information matrix of a locally D-optimal design at \overline{\theta}, one has
[TABLE]
Proof. Choose a positive where is according to (3.15). Then the compact ball \overline{B}(\overline{\theta},\rho_{1})=\bigl{\{}a\in\mathbb{R}^{p}\,:\,\|a-\overline{\theta}\|\leq\rho_{1}\bigr{\}} is contained in the interior of . By Theorem 3.3 if is large enough, i.e., if is greater than or equal to the value of some random variable whose values are in . Let be given and denote by the event (subset of ) that the random variable yields a value less than or equal to . Consider the log-likelihood function from (3.6), (3.7) and its gradient w.r.t. , which is often called the score function, for . One obtains
[TABLE]
Abbreviate
[TABLE]
Since S_{n}\bigl{(}\widehat{\theta}_{n}^{\rm\scriptsize(ML)}\bigr{)}=0 a.s. on we get from (3.16),
[TABLE]
The function from (3.17) is continuously differentiable. Denote its derivative by . The gradient (w.r.t. ) of is given by
[TABLE]
By the mean value theorem, for each there is some on the line segment joining and such that
[TABLE]
Together with (3.18) and (3.19), and observing that , we get
[TABLE]
Since M(\xi_{n},\theta)=\frac{1}{n}\sum_{i=1}^{n}\varphi^{2}\bigl(f^{\sf T}(X_{i})\,\theta\bigr)\,f(X_{i})\,f^{\sf T}(X_{i}) we can write
[TABLE]
So, by (3.20) after some slight manipulations,
[TABLE]
Next we show that
[TABLE]
Regarding the Cramér-Wold device let any with be given. Using (3.16) for and inserting Y_{i}=G\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}\,+e_{i} we can write
[TABLE]
Clearly, for each the random variable Z_{i}:=H\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}\,v^{\sf\scriptsize T}M_{*}^{-1/2}f(X_{i}) is -measurable and for all for some finite constant . Since , , one easily verifies that the sequence of partial sums , , is a martingale w.r.t. , . By Corollary 3.1 (p. 58) in [8] the convergence holds if the following two conditions (A) and (B) are satisfied.
- (A)
\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\rm E}\bigl(\widetilde{e}_{i}^{2}\,\big|\,{\cal F}_{i-1}\bigr)\,\stackrel{\rm a.s.}{\longrightarrow}\,1.
- (B)
\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\rm E}\Bigl(\widetilde{e}_{i}^{2}\,\mathbb{1}\bigl(|\widetilde{e}_{i}|>\sqrt{n}\,\varepsilon\bigr)\,\Big|\,{\cal F}_{i-1}\Bigr)\,\stackrel{\rm a.s.}{\longrightarrow}\,0 for all \varepsilon>0.
Ad (A). {\rm E}\bigl{(}\widetilde{e}_{i}^{2}\big{|}\,{\cal F}_{i-1}\bigr{)}={\rm E}\bigl{(}e_{i}^{2}\big{|}\,{\cal F}_{i-1}\bigr{)}\,Z_{i}^{2} and Z_{i}^{2}\,=\,H^{2}\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}\,v^{\sf\scriptsize T}M_{*}^{-1/2}\,f(X_{i})\,f^{\sf\scriptsize T}(X_{i})\,M_{*}^{-1/2}\,v. By assumption (A2’), {\rm E}\bigl{(}e_{i}^{2}\big{|}\,{\cal F}_{i-1}\bigr{)}=b^{\prime\prime}\Bigl{(}(b^{\prime})^{-1}\bigl{(}G(f^{\sf\scriptsize T}(X_{i})\,\overline{\theta})\bigr{)}\Bigr{)}. Together with (3.17) this yields {\rm E}\bigl{(}\widetilde{e}_{i}^{2}\big{|}\,{\cal F}_{i-1}\bigr{)}=\varphi^{2}\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}\,v^{\sf\scriptsize T}M_{*}^{-1/2}\,f(X_{i})\,f^{\sf\scriptsize T}(X_{i})\,M_{*}^{-1/2}v, and hence
[TABLE]
where the convergence follows from Corollary 3.2.
Ad (B). Using the trivial inequality \widetilde{e}_{i}^{2}\,\mathbb{1}\bigl(|\widetilde{e}_{i}|>\sqrt{n}\,\varepsilon\bigr)\,\leq\,\frac{1}{\varepsilon^{2}n}\,\widetilde{e}_{i}^{4} we obtain
[TABLE]
where we have used (3.5).
So by the Cramér-Wold device (3.26) follows and hence, by (3.25),
[TABLE]
Next we show that
[TABLE]
By (3.24),
[TABLE]
Note that \big{\|}f(X_{i})\,f^{\sf\scriptsize T}(X_{i})\big{\|}=\big{\|}f(X_{i})\big{\|}^{2}\leq\gamma_{0}^{2} for all , where . By for all and for some compact interval , by the uniform continuity of the function on the compact interval, and by
[TABLE]
it follows that
[TABLE]
From this the first convergence statement in (3.28) follows. To prove the second convergence statement in (3.28) we write Y_{i}-G\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\widetilde{\theta}_{i,n}\bigr{)}=G\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}-G\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\widetilde{\theta}_{i,n}\bigr{)}+e_{i}, and together with the definition of in (3.22),
[TABLE]
One concludes by similar arguments as used above when showing where, in particular, uniform continuity of on the compact interval and boundedness of on that interval are utilized. Consider the sequence of matrices entrywise. The -th entry (where ) of has the form where Z_{i}=H^{\prime}\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}\,f_{k}(X_{i})\,f_{\ell}(X_{i}). Note that for all for some real constant . From Lemma 3.1, part (b), it follows that . Hence . For we consider again the -th entry for any given , and we obtain
[TABLE]
The uniform continuity of on yields \max_{1\leq i\leq n}\big{|}H^{\prime}\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\widetilde{\theta}_{i,n}\bigr{)}-H^{\prime}\bigl{(}f^{\sf\scriptsize T}(X_{i})\,\overline{\theta}\bigr{)}\big{|}\,\stackrel{{\scriptstyle\rm\scriptsize a.s.}}{{\longrightarrow}}\,0. Together with Lemma 3.1, part (a), it follows that the -th entry of converges to zero almost surely and hence . So we have proved (3.28). By Corollary 3.2, and hence
[TABLE]
Together with (3.27), observing \lim_{n\to\infty}\mathbb{P}_{\overline{\theta}}\bigl(\{N\leq n\}\bigr)=1 and using standard properties of convergence in distribution, it follows that for any sequence Q_{n} of random p\times p matrices such that Q_{n}\stackrel{\rm a.s.}{\longrightarrow}M_{*}^{1/2} one has Q_{n}\bigl[\sqrt{n}\bigl(\widehat{\theta}_{n}^{\rm (ML)}-\overline{\theta}\bigr)\bigr]\,\stackrel{\rm d}{\longrightarrow}\,{\rm N}(0,I_{p}). In particular, the convergence holds for the sequence Q_{n}=M^{1/2}\bigl(\xi_{n},\widehat{\theta}_{n}^{\rm (ML)}\bigr) and the constant sequence Q_{n}=M_{*}^{1/2}. Hence the result follows.
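One practical reading of Theorem 3.4 is that, for large n, the covariance matrix of the adaptive ML-estimator can be approximated by the inverse of n M(\xi_n, \widehat{\theta}_n), which yields Wald-type standard errors and confidence intervals. The sketch below assumes a logistic model with f(x) = (1, x)^T; the function name, the design, and the plugged-in estimate are hypothetical and serve only to show the computation.

```python
import numpy as np

def wald_intervals(xs, theta_hat, z=1.959964):
    """Approximate 95% intervals from cov ~ (n * M(xi_n, theta_hat))^{-1}."""
    xs = np.asarray(xs, dtype=float)
    F = np.column_stack([np.ones_like(xs), xs])          # rows f(x_i)^T
    p = 1.0 / (1.0 + np.exp(-(F @ theta_hat)))
    # M(xi_n, theta) = (1/n) sum phi^2 f f^T with phi^2 = p (1 - p)
    M = (F * (p * (1 - p))[:, None]).T @ F / len(xs)
    cov = np.linalg.inv(len(xs) * M)
    se = np.sqrt(np.diag(cov))
    return theta_hat - z * se, theta_hat + z * se, se

xs = np.tile([-1.5, 1.5], 50)                            # hypothetical design, n = 100
lo, hi, se = wald_intervals(xs, np.array([0.0, 1.0]))
print(se)
```

The intervals are only asymptotically justified and rely on the assumptions of the theorem, in particular on the true parameter being an interior point of the parameter space.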
Appendix A: Known auxiliary results
Four well-known results on nonnegative definite matrices are stated below, which have been used throughout the proofs. If is a (real) matrix then the range of is given by . A generalized inverse of a matrix is denoted by which is by definition a matrix satisfying . As it is easily seen, if is symmetric and then the value is the same for all generalized inverses of .
- (M1)
If and are nonnegative definite matrices and , then and for all . See Stepniak, Wang and Wu [13], Lemma 2.
- (M2)
If and such that , then
[TABLE]
The first inequality is a special case of Theorem 4.2 in Gaffke and Krafft [7]; the second inequality follows from if , and if , for any .
- (M3)
If is a positive definite matrix and , , then
[TABLE]
and the maximum on the right hand side is attained for . Actually, the inequality is a special case of a more general matrix inequality, see Pukelsheim [12], Section 1.21.
- (M4)
Let be a matrix with columns . Then
[TABLE]
and where denotes the squared Euclidean distance of a vector and a linear subspace of . In fact, the formula trivially holds if the vectors are linearly dependent, in which case both sides of the formula are equal to zero. Also, the case is trivial. Let and let be linearly independent. Consider the matrix having columns . Then, the matrix can be written in partitioned form as
[TABLE]
So, by a well-known formula for the determinant of a partitioned positive definite matrix,
[TABLE]
The second factor on the r.h.s. of the latter equation is equal to {\rm dist}^{2}(a_{q},V_{q-1}). We have thus obtained that \det\bigl(A^{\sf T}A\bigr)=\det\bigl(B^{\sf T}B\bigr)\cdot{\rm dist}^{2}(a_{q},V_{q-1}). Now the asserted formula follows by induction on q.
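The recursion in (M4) can be checked numerically. The sketch below assumes the asserted formula reads \det(A^{\sf T}A) = \prod_{j} {\rm dist}^{2}(a_{j},V_{j-1}), with V_{j-1} the span of the first j-1 columns and V_{0}=\{0\}; this reading is inferred from the induction step above. Distances are obtained from least-squares residuals, and the test matrix is arbitrary.

```python
import numpy as np

def gram_det_via_distances(A):
    """Product of squared distances of each column to the span of its predecessors."""
    A = np.asarray(A, dtype=float)
    prod = 1.0
    for j in range(A.shape[1]):
        a = A[:, j]
        if j > 0:
            B = A[:, :j]
            coef, *_ = np.linalg.lstsq(B, a, rcond=None)
            a = a - B @ coef          # residual: component orthogonal to V_{j-1}
        prod *= a @ a                 # squared distance of a_j to V_{j-1}
    return prod

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 1.0]])
print(gram_det_via_distances(A), np.linalg.det(A.T @ A))
```

For linearly dependent columns one of the residuals vanishes, so the product is zero, in line with the remark that both sides of the formula are then equal to zero.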
References
- [1] Atkinson, A.C.; Fedorov, V.V.; Herzberg, A.M.; Zhang, R. (2014). Elemental information matrices and optimal experimental design for generalized regression models. J. Statist. Plann. Inference 144, 81-91.
- [2] Atkinson, A.C.; Woods, D.C. (2015). Designs for generalized linear models. In: Dean, A.; Morris, M.; Stufken, J.; Bingham, D. (eds). Handbook of Design and Analysis of Experiments. CRC Press, pp. 471-514.
- [3] Breiman, L. (1993). Probability. SIAM, Philadelphia. 2nd printing.
- [4] Chen, K.; Hu, I.; Ying, Z. (1999). Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Statist. 27, 1155-1163.
- [5] Fahrmeir, L.; Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Statist. 13, 342-368.
- [6] Freise, F. (2016). On Convergence of the Maximum Likelihood Estimator in Adaptive Designs. Dissertation. University of Magdeburg.
- [7] Gaffke, N.; Krafft, O. (1982). Matrix Inequalities in the Löwner Ordering. In: Korte, B. (ed.). Modern Applied Mathematics: Optimization and Operations Research. North-Holland, Amsterdam, pp. 595-622.
- [8] Hall, P.; Heyde, C.C. (1980). Martingale Limit Theory and Its Application. Academic Press, New York.
