Upper bounds on the minimum coverage probability of model averaged tail area confidence intervals in regression

Paul Kabaila

arXiv:1702.05189·stat.ME·May 18, 2018


TL;DR

This paper derives upper bounds on the minimum coverage probability of model averaged tail area (MATA) confidence intervals in complex linear regression models, highlighting limitations of BIC-based weights.

Contribution

It extends previous work by providing an easily computed upper bound for the coverage probability in models with many parameters, demonstrating issues with BIC-based weights.

Findings

01

Upper bounds indicate potential undercoverage of MATA intervals.

02

BIC-based weights may lead to inadequate coverage.

03

Complex models pose challenges for model averaging confidence intervals.

Abstract

Frequentist model averaging has been proposed as a method for incorporating "model uncertainty" into confidence interval construction. Such proposals have been of particular interest in the environmental and ecological statistics communities. A promising method of this type is the model averaged tail area (MATA) confidence interval put forward by Turek and Fletcher, 2012. The performance of this interval depends greatly on the data-based model weights on which it is based. A computationally convenient formula for the coverage probability of this interval is provided by Kabaila, Welsh and Abeysekera, 2016, in the simple scenario of two nested linear regression models. We consider the more complicated scenario that there are many (32,768 in the example considered) linear regression models obtained as follows. For each of a specified set of components of the regression parameter vector, we…


Equations

$$\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},$$

$$\text{RSS}=(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}})^{\top}(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}}).$$

$$\widehat{\boldsymbol{\beta}}_{K}=\Big(\boldsymbol{I}-(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}\Big)\widehat{\boldsymbol{\beta}}.$$

$$\text{RSS}_{K}=(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}}_{K})^{\top}(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}}_{K})$$

$$\text{GIC}(K)=n\ln(\text{RSS}_{K})+d\,(p-|K|)$$

$$U_{K}=\big(\boldsymbol{H}_{K}\widehat{\boldsymbol{\beta}}\big)^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}\widehat{\boldsymbol{\beta}},\qquad K\in{\mathscr{K}}\setminus\{\varnothing\}.$$

$$\frac{U_{K}/|K|}{\text{RSS}/(n-p)}$$

$$V_{K}=\big(\boldsymbol{H}_{K}(\widehat{\boldsymbol{\beta}}/\sigma)\big)^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}(\widehat{\boldsymbol{\beta}}/\sigma),\qquad K\in{\mathscr{K}}\setminus\{\varnothing\}.$$

$$\lambda=(1/2)\big(\boldsymbol{H}_{K}(\boldsymbol{\beta}/\sigma)\big)^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}(\boldsymbol{\beta}/\sigma),$$

$$w(K;{\mathscr{K}})=\begin{cases}\dfrac{1}{1+\sum_{L\in{\mathscr{K}}\setminus\{\varnothing\}}r\big(U_{L}/\text{RSS},|L|\big)}&\text{for }K=\varnothing\\[2ex]\dfrac{r\big(U_{K}/\text{RSS},|K|\big)}{1+\sum_{L\in{\mathscr{K}}\setminus\{\varnothing\}}r\big(U_{L}/\text{RSS},|L|\big)}&\text{otherwise.}\end{cases}$$

$$h\big(z,\boldsymbol{y};{\mathscr{K}}\big)=\sum_{K\in{\mathscr{K}}}w(K;{\mathscr{K}})\,G_{n-p+|K|}\left(\frac{\boldsymbol{a}^{\top}\widehat{\boldsymbol{\beta}}_{K}-z}{S_{K}\,(v(K))^{1/2}}\right),$$

$$h\big(\widehat{\theta}_{\ell},\boldsymbol{y};{\mathscr{K}}\big)=1-\alpha/2\ \ \text{and}\ \ h\big(\widehat{\theta}_{u},\boldsymbol{y};{\mathscr{K}}\big)=\alpha/2$$

$$G_{n-p+|K|}\left(\frac{\boldsymbol{a}^{\top}\widehat{\boldsymbol{\beta}}_{K}-\boldsymbol{a}^{\top}\boldsymbol{\beta}}{S_{K}\,(v(K))^{1/2}}\right)$$

$$P\big(\theta\in I({\mathscr{K}}^{**})\big),$$

$$w\big(\{p\},{\mathscr{K}}^{*}\big)=1\Bigg/\left(1+\frac{1}{r\left(\widehat{\beta}_{p}^{2}/(m\,\widehat{\sigma}^{2}v_{p}),1\right)}\right).$$

$$w_{1}\left(\frac{\widehat{\beta}_{p}^{2}}{\widehat{\sigma}^{2}v_{p}}\right)=w\big(\{p\},{\mathscr{K}}^{*}\big),$$

$$w_{1}(z)=1\bigg/\left(1+\frac{1}{r\left(z/m,1\right)}\right).$$

$$w_{1}(z)=\frac{1}{1+\left(1+\frac{z}{m}\right)^{n/2}\exp(-d/2)}.$$

$$w_{1}(x^{2}/y^{2})\,G_{m+1}\left(\left(\frac{m+1}{x^{2}+my^{2}}\right)^{1/2}\frac{\delta-\widetilde{\rho}\,x}{(1-\widetilde{\rho}^{\,2})^{1/2}}\right)+\big(1-w_{1}(x^{2}/y^{2})\big)\,G_{m}(\delta/y)=u,$$

$$\int_{0}^{\infty}\int_{-\infty}^{\infty}\left(\Phi\left(\frac{\delta_{1-\alpha/2}(x,y)-\widetilde{\rho}\,(x-\gamma)}{(1-\widetilde{\rho}^{\,2})^{1/2}}\right)-\Phi\left(\frac{\delta_{\alpha/2}(x,y)-\widetilde{\rho}\,(x-\gamma)}{(1-\widetilde{\rho}^{\,2})^{1/2}}\right)\right)\phi(x-\gamma)\,f_{m}(y)\,dx\,dy,$$

$$y=\psi+\beta_{2}z_{2}+\dots+\beta_{16}z_{16}+\varepsilon,$$

$$(37.37,\ 33.98,\ 74.58,\ 8.8,\ 3.26,\ 10.97,\ 80.91,\ 3876.05,\ 11.87,\ 46.08,\ 14.37,\ 100,\ 30,\ 140,\ 57.57).$$

$$y=\beta_{1}+\beta_{2}(z_{2}-z_{2}^{*})+\dots+\beta_{16}(z_{16}-z_{16}^{*})+\varepsilon$$

$$T_{n}=\begin{cases}\overline{X}_{n}&\text{if }|\overline{X}_{n}|>n^{-1/4}\\ b\,\overline{X}_{n}&\text{if }|\overline{X}_{n}|\leq n^{-1/4},\end{cases}$$

$$W_{n}=\begin{cases}1&\text{if }|\overline{X}_{n}|>n^{-1/4}\\ b^{2}&\text{if }|\overline{X}_{n}|\leq n^{-1/4},\end{cases}$$

$$\sup_{\gamma}P\big(w_{1}(\widehat{\gamma}^{2})\geq\epsilon\big)\rightarrow 0\quad\text{as}\quad n\rightarrow\infty.$$

$$w(K;{\mathscr{K}})=\frac{\exp(-\text{GIC}(K)/2)}{\sum_{L\in{\mathscr{K}}}\exp(-\text{GIC}(L)/2)},$$

$$\text{RSS}_{K}=\text{RSS}+U_{K},$$

$$w(\varnothing;{\mathscr{K}})=\frac{1}{1+\sum_{L\in{\mathscr{K}}\setminus\{\varnothing\}}\left(1+\frac{U_{L}}{\text{RSS}}\right)^{-n/2}\exp\left(\frac{d\,|L|}{2}\right)}$$

$$w(K;{\mathscr{K}})=\frac{\left(1+\frac{U_{K}}{\text{RSS}}\right)^{-n/2}\exp\left(\frac{d\,|K|}{2}\right)}{1+\sum_{L\in{\mathscr{K}}\setminus\{\varnothing\}}\left(1+\frac{U_{L}}{\text{RSS}}\right)^{-n/2}\exp\left(\frac{d\,|L|}{2}\right)}.$$


Full text

**Upper bounds on the minimum coverage probability of model averaged tail area confidence intervals in regression**

PAUL KABAILA

Department of Mathematics and Statistics

La Trobe University

Key words and phrases: Model averaged confidence intervals; MATA confidence interval; minimum coverage probability.

MSC 2010: Primary 62F25; secondary 62P12

Abstract: Frequentist model averaging has been proposed as a method for incorporating “model uncertainty” into confidence interval construction. Such proposals have been of particular interest in the environmental and ecological statistics communities. A promising method of this type is the model averaged tail area (MATA) confidence interval put forward by Turek & Fletcher (2012). The performance of this interval depends greatly on the data-based model weights on which it is based. A computationally convenient formula for the coverage probability of this interval is provided by Kabaila, Welsh & Abeysekera (2016), in the simple scenario of two nested linear regression models. We consider the more complicated scenario in which there are many (32,768 in the example considered) linear regression models obtained as follows. For each of a specified set of components of the regression parameter vector, we either set the component to zero or let it vary freely. We provide an easily-computed upper bound on the minimum coverage probability of the MATA confidence interval. This upper bound provides evidence against the use of a model weight based on the Bayesian Information Criterion (BIC).

1. INTRODUCTION

Commonly in applied statistics, there is some uncertainty as to which explanatory variables should be included in the model. Frequentist model averaging has been proposed as a method for properly incorporating this “model uncertainty” into confidence interval construction. Such proposals have been of particular interest in the environmental and ecological statistics communities, see e.g. Fieberg & Johnson (2015, p.712) for a recent review.

The earliest approach to the construction of frequentist model averaged confidence intervals was to first construct a model averaged estimator of the parameter of interest. This estimator is a data-based weighted average of the estimators of this parameter under the various models considered. In this approach, the model averaged confidence interval, with nominal coverage $1-\alpha$, is centered on this estimator and has half-width equal to the $1-\alpha/2$ quantile of the standard normal distribution multiplied by an estimate of the standard deviation of this estimator (Buckland et al., 1997). However, Hjort & Claeskens (2003, Section 4.3) show that the distributional assumption on which this confidence interval is based is completely incorrect in large samples. This problem effectively rules out the use of this confidence interval. Hjort & Claeskens (2003, equation 4.8) then propose a new frequentist model averaged confidence interval that has the desired minimum coverage probability in large samples. However, this interval is essentially the same as the standard confidence interval based on the full model (Kabaila & Leeb, 2006, Remark 5b and Wang & Zou, 2013).

An important conceptual advance was made by Fletcher & Turek (2011) and Turek & Fletcher (2012), who put forward the idea of using data-based weighted averages, across the models considered, of procedures for constructing confidence intervals. In this way the model averaged confidence interval is constructed in a single step, rather than first constructing a model averaged estimator, which is used as the center of this interval, and then seeking an appropriate formula for the width of this interval. However, some problems have been identified by Kabaila, Welsh & Abeysekera (2016) with the method of Fletcher & Turek (2011). This leaves the model averaged tail area (MATA) confidence interval of Turek & Fletcher (2012) as a promising method, particularly in the normal linear regression context since exactly pivotal quantities for the parameter of interest can be specified for each model under consideration. As Turek & Fletcher (2012) note, their method can also be applied when one has only approximately pivotal quantities for the parameter of interest for each model under consideration. However, the use of such approximately pivotal quantities (which may be obtained via the parametric bootstrap) is outside the scope of the present paper.

Turek & Fletcher (2012) considered data-based weights on a model that are proportional to $\exp(-\text{AIC}/2)$, $\exp(-\text{AIC}_{c}/2)$ or $\exp(-\text{BIC}/2)$, where AIC, $\text{AIC}_{c}$ and BIC are the Akaike Information Criterion, the Akaike Information Criterion corrected for small samples and the Bayesian Information Criterion, respectively, for the model. The performance of the MATA confidence interval depends greatly on the model weights on which it is based. It is helpful to applied statisticians who wish to use MATA intervals if we can narrow down the choice of data-based model weight by eliminating the worst performing model weights from further consideration.

A computationally convenient formula for the exact coverage probability of the MATA interval is provided by Kabaila, Welsh & Abeysekera (2016) in the simple scenario of two nested normal linear regression models: the full model and a submodel specified by a linear constraint on the regression parameter vector. They consider a parameter of interest that is a specified linear combination of the components of the regression parameter vector for the full model. Kabaila, Welsh & Mainzer (2016) consider the same simple scenario in their evaluation of a MATA interval constructed using data-based weights based on Mallows’ $C_{P}$. Of course, it is of interest to also evaluate the MATA interval in the more complicated situation in which we average over more than two ($2^{15}$ for the real life data considered in Section 5) normal linear regression models.

In the present paper, the family of models that we average over is obtained as follows. For each of a specified set of components of the regression parameter vector, we either set the component to zero or let it vary freely. For the MATA interval, we consider quite general data-based weights on these models. These general weights include, as special cases, the weights considered by Turek & Fletcher (2012) and the weights based on Mallows’ $C_{P}$ that are considered by Kabaila, Welsh & Mainzer (2016). Using the two new theorems presented in Section 3 of the present paper, we show how the results of Kabaila, Welsh & Abeysekera (2016) can be used to provide a new easily-computed upper bound on the minimum coverage probability of the MATA interval in this situation. This upper bound is analogous to the upper bounds of Kabaila & Leeb (2006) and Kabaila & Giri (2009) on the minimum coverage probability of the post-model-selection confidence interval in the context of the same family of models and is proved using the approach of Kabaila & Giri (2009).

The most important measure (in the form of a single number) of the performance of a confidence interval is its confidence coefficient, defined to be the infimum of the coverage probability of a confidence interval (see e.g. Casella & Berger, 2002, pp.418–419). If the confidence coefficient of a confidence interval is far below its nominal coverage then this confidence interval should not be used. The main application of our new upper bound on the minimum coverage probability of the MATA interval is that it can be used to help eliminate poorly performing model weights from further consideration.

Consider the linear regression model

$$\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},$$

where $\boldsymbol{y}$ is a random $n$ -vector of responses, $\boldsymbol{X}$ is a known $n\times p$ matrix with linearly independent columns, $\boldsymbol{\beta}$ is an unknown parameter $p$ -vector and $\boldsymbol{\varepsilon}\sim\text{N}(0,\sigma^{2}\boldsymbol{I})$ where $\sigma^{2}$ is an unknown positive parameter and $n>p$ . Suppose that the quantity of interest is $\theta=\boldsymbol{a}^{\top}\boldsymbol{\beta}$ where $\boldsymbol{a}$ is a specified non-zero $p$ -vector. Our aim is to find a confidence interval for $\theta$ with minimum coverage probability a pre-specified value $1-\alpha$ , based on an observation of $\boldsymbol{y}$ .

Henceforth, let $\mathscr{K}$ denote the family of all subsets of $\{q+1,\dots,p\}$ including the empty set, where $q$ is a specified integer satisfying $1\leq q<p$. For each $K\in\mathscr{K}$, let ${\cal M}_{K}$ denote the model for which $\beta_{i}=0$ for all $i\in K$. Thus the number of models under consideration is $2^{p-q}$. Suppose that the last $p-q$ components of $\boldsymbol{a}$ are zeros. In other words, suppose that these models differ from each other only with respect to nuisance parameters, so that the quantity of interest $\theta$ has the same meaning for all of these models. This condition will commonly be satisfied, possibly after some minor reparametrization (see Section 5 for an example). We consider quite general data-based weights on the models ${\cal M}_{K}$, where $K$ belongs to the family ${\mathscr{K}}$. We then consider the MATA interval, with nominal coverage $1-\alpha$, obtained by averaging over these models using these data-based weights. We denote this confidence interval by $I({\mathscr{K}})$.
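The family $\mathscr{K}$ can be enumerated directly, which makes the count $2^{p-q}$ concrete. The following is a minimal sketch, not code from the paper: the function name `model_family` is hypothetical, and the values $p=16$, $q=1$ are assumptions chosen only because they reproduce the $2^{15}=32{,}768$ models mentioned in the abstract.

```python
from itertools import combinations


def model_family(p: int, q: int) -> list[frozenset[int]]:
    """Return every subset K of {q+1, ..., p}, including the empty set.

    Each K lists the (1-based) indices i for which beta_i is set to zero
    in the model M_K; the empty set corresponds to the full model.
    """
    if not 1 <= q < p:
        raise ValueError("need 1 <= q < p")
    indices = range(q + 1, p + 1)
    family = []
    for size in range(p - q + 1):
        for subset in combinations(indices, size):
            family.append(frozenset(subset))
    return family


# Assumed illustrative values: p = 16 regression parameters, q = 1 always kept free.
family = model_family(p=16, q=1)
print(len(family))
```

Enumerating the family explicitly is only practical for moderate $p-q$; the upper bound developed in the paper avoids this enumeration by reducing to a two-model comparison.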

Our easily-computed (calculated by repeated numerical evaluation of a double integral) upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ is obtained as follows. We first prove the intuitively plausible result, Theorem 2 (stated in Section 3), that the wider the class of models over which one averages using specified data-based model weights, the smaller is the minimum coverage probability of the MATA interval, with nominal coverage $1-\alpha$. Let $\widehat{\theta}$, $\widehat{\beta}_{q+1},\ldots,\widehat{\beta}_{p}$ denote the least squares estimators of $\theta$, $\beta_{q+1},\ldots,\beta_{p}$ respectively. Also let $\text{corr}\big(\widehat{\theta},\widehat{\beta}_{j}\big)$ denote the correlation between $\widehat{\theta}$ and $\widehat{\beta}_{j}$, which is a known quantity that is determined by the design matrix $\boldsymbol{X}$ and the vector $\boldsymbol{a}$ which specifies the parameter of interest $\theta$. It follows from the results of Kabaila, Welsh & Abeysekera (2016) that the MATA interval, with nominal coverage $1-\alpha$, obtained by data-based averaging over only the full model and the submodel for which $\beta_{j}=0$ has minimum coverage probability that is the same decreasing function of $|\text{corr}\big(\widehat{\theta},\widehat{\beta}_{j}\big)|$, for each $j\in\{q+1,\dots,p\}$. It follows from Theorem 2 that this minimum coverage probability is an upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$, for each $j\in\{q+1,\dots,p\}$. Our upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ is simply the minimum of these upper bounds, which is attained for the value of $j\in\{q+1,\dots,p\}$ maximizing $|\text{corr}\big(\widehat{\theta},\widehat{\beta}_{j}\big)|$. This upper bound depends on the design matrix $\boldsymbol{X}$ and the vector $\boldsymbol{a}$ only through the known parameter $|\rho|_{\max}$, which we define to be the maximum over $j\in\{q+1,\dots,p\}$ of $\big|\text{corr}\big(\widehat{\theta},\widehat{\beta}_{j}\big)\big|$. Since $|\rho|_{\max}$ is obtained by this maximization, it may be quite close to 1 in many applications. We have written an R computer program to evaluate this upper bound.

We use this computer program to provide evidence against the use of a data-based weight on the model ${\cal M}_{K}$ that is proportional to $\exp(-\text{BIC}(K)/2)$, where $\text{BIC}(K)$ denotes the BIC criterion for this model. Since AIC and BIC are similar criteria for $\ln(n)$ approximately equal to 2, we consider $n\geq 15$. Figure 1 presents graphs of the upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$, with nominal coverage 0.95, as a function of $|\rho|_{\max}$ for $p=10$ and $n\in\{15,30,70,200\}$. For each value of $n$ considered, this upper bound is found to be a continuous decreasing function of $|\rho|_{\max}$ that falls well below $1-\alpha$ when $|\rho|_{\max}$ is close to 1. Also, for each value of $|\rho|_{\max}>0$ considered, this upper bound is found to be a decreasing function of $n$. Figures similar to Figure 1 are presented in the Supplementary Material for a wide range of values of $n$ and $p$. Figure 1 suggests the following large sample result: under the very weak condition that $|\rho|_{\max}$ converges to a positive number as $n\rightarrow\infty$, the minimum coverage probability (i.e. the confidence coefficient) of the MATA interval $I({\mathscr{K}})$, with weight on model ${\cal M}_{K}$ proportional to $\exp(-\text{BIC}(K)/2)$, converges to 0 as $n\rightarrow\infty$. This suggested result turns out to be correct and is stated in Section 6.

Large sample results can have subtleties in their interpretation. These subtleties are briefly explored at the start of Section 6, before we state the main results of this section. Our conclusion from these results and the Supplementary Material is that the MATA interval with weight on the model ${\cal M}_{K}$ proportional to $\exp(-\text{BIC}(K)/2)$ should not be used if $|\rho|_{\max}$ is not too far from 1 and $p/n$ is reasonably small, as judged from a figure, such as Figure 1, which is easily computed for any given $p$.

2. THE MATA INTERVAL FOR GENERAL DATA-BASED WEIGHTS

Let $\widehat{\boldsymbol{\beta}}$ denote the least-squares estimator of $\boldsymbol{\beta}$ . Let RSS denote the following residual sum of squares,

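$$\text{RSS}=(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}})^{\top}(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}}).$$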

For each $K\in\mathscr{K}$ , let $|K|$ denote the number of elements in $K$ . Also, for $K\neq\varnothing$ , let $\boldsymbol{H}_{K}$ denote the $|K|\times p$ matrix whose $i$ ’th row consists of zeros except for the $j$ ’th element which is 1, where $j$ is the $i$ ’th ordered element of $K$ . Thus $\boldsymbol{H}_{K}\boldsymbol{\beta}=\boldsymbol{0}$ , for the model ${\cal M}_{K}$ ( $K\neq\varnothing$ ). Let $\widehat{\boldsymbol{\beta}}_{K}$ denote the least-squares estimator of $\boldsymbol{\beta}$ subject to this restriction. Note that

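$$\widehat{\boldsymbol{\beta}}_{K}=\Big(\boldsymbol{I}-(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}\Big)\widehat{\boldsymbol{\beta}}.$$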

Let $\text{RSS}_{K}$ denote the residual sum of squares

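$$\text{RSS}_{K}=(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}}_{K})^{\top}(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}}_{K})$$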

and $S_{K}^{2}=\text{RSS}_{K}/(n-p+|K|)$ . Also let $v(K)=\text{var}\big{(}\boldsymbol{a}^{\top}\widehat{\boldsymbol{\beta}}_{K}\big{)}/\sigma^{2}$ , where this variance is computed under the model ${\cal M}_{K}$ .

We can choose a model from $\big{\{}{\cal M}_{K}:K\in{\mathscr{K}}\big{\}}$ by minimizing the following generalized information criterion

$$\text{GIC}(K)=n\ln(\text{RSS}_{K})+d\,(p-|K|)$$

with respect to $K\in{\mathscr{K}}$, where $d$ is a nonnegative number ($d=2$ for AIC and $d=\ln(n)$ for BIC) and $\text{RSS}_{K}=\text{RSS}$ for $K=\varnothing$. A weight for model ${\cal M}_{K}$ ($K\in{\mathscr{K}}$) that is proportional to $\exp(-\text{GIC}(K)/2)$, for either $d=2$ or $d=\ln(n)$, was considered by Turek & Fletcher (2012).

We introduce quite general forms of model weights based on the statistics $U_{K}/\text{RSS}$ , where

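$$U_{K}=\big(\boldsymbol{H}_{K}\widehat{\boldsymbol{\beta}}\big)^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}\widehat{\boldsymbol{\beta}},\qquad K\in{\mathscr{K}}\setminus\{\varnothing\}.$$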
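The quantities defined so far can be checked numerically on simulated data. The sketch below is an illustration under assumed values (the sample size, coefficients, random seed and the choice $K=\{3,4\}$ are all hypothetical), not the author's R program. It builds $\boldsymbol{H}_{K}$, forms the restricted estimator $\widehat{\boldsymbol{\beta}}_{K}$ from the closed-form expression, and computes $U_{K}$, so that the identity $\text{RSS}_{K}=\text{RSS}+U_{K}$ used in the appendix can be verified.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
K = [3, 4]  # hypothetical choice: 1-based indices of the coefficients set to zero

# Simulated design matrix and response (illustrative values only).
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                       # full-model least squares
rss = float((y - X @ beta_hat) @ (y - X @ beta_hat))

# H_K: row i has a single 1 in the column given by the i-th ordered element of K.
H = np.zeros((len(K), p))
for i, j in enumerate(sorted(K)):
    H[i, j - 1] = 1.0

middle = np.linalg.inv(H @ XtX_inv @ H.T)
beta_hat_K = (np.eye(p) - XtX_inv @ H.T @ middle @ H) @ beta_hat
rss_K = float((y - X @ beta_hat_K) @ (y - X @ beta_hat_K))
U_K = float((H @ beta_hat) @ middle @ (H @ beta_hat))

# Cross-check: ordinary least squares using only the retained columns.
keep = [j - 1 for j in range(1, p + 1) if j not in K]
beta_sub, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)

print(np.allclose(beta_hat_K[keep], beta_sub), np.isclose(rss_K, rss + U_K))
```

Forming the explicit inverse of $\boldsymbol{X}^{\top}\boldsymbol{X}$ mirrors the formulas in the text; for ill-conditioned designs a QR-based solve would be the more careful choice.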

Some motivation for the use of such weights is provided by the fact that

$$\frac{U_{K}/|K|}{\text{RSS}/(n-p)}$$

is the usual test statistic for testing the null hypothesis that $\boldsymbol{H}_{K}\boldsymbol{\beta}=\boldsymbol{0}$ against the alternative hypothesis that $\boldsymbol{H}_{K}\boldsymbol{\beta}\neq\boldsymbol{0}$ . This test statistic has an $F_{|K|,n-p}$ distribution under this null hypothesis. Obviously, $U_{K}/\text{RSS}=V_{K}\big{/}(\text{RSS}/\sigma^{2})$ , where

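$$V_{K}=\big(\boldsymbol{H}_{K}(\widehat{\boldsymbol{\beta}}/\sigma)\big)^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}(\widehat{\boldsymbol{\beta}}/\sigma),\qquad K\in{\mathscr{K}}\setminus\{\varnothing\}.$$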

Now, for any given $K\in{\mathscr{K}}\setminus\{\varnothing\}$ , $V_{K}$ and $\text{RSS}/\sigma^{2}$ are independent random variables, where $\text{RSS}/\sigma^{2}\sim\chi^{2}_{n-p}$ and $V_{K}$ has a noncentral chi-squared distribution with degrees of freedom $|K|$ and noncentrality parameter

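$$\lambda=(1/2)\big(\boldsymbol{H}_{K}(\boldsymbol{\beta}/\sigma)\big)^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}\boldsymbol{H}_{K}(\boldsymbol{\beta}/\sigma),$$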

see e.g. Graybill (1976, p.127). Thus $U_{K}/\text{RSS}$ may be viewed as a data-based measure of the deviation of the model ${\cal M}_{K}$ from the true model. This suggests a data-based weight $w(K;{\mathscr{K}})$ on the model ${\cal M}_{K}$ ( $K\in{\mathscr{K}}$ ) given by

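$$w(K;{\mathscr{K}})=\begin{cases}\dfrac{1}{1+\sum_{L\in{\mathscr{K}}\setminus\{\varnothing\}}r\big(U_{L}/\text{RSS},|L|\big)}&\text{for }K=\varnothing\\[2ex]\dfrac{r\big(U_{K}/\text{RSS},|K|\big)}{1+\sum_{L\in{\mathscr{K}}\setminus\{\varnothing\}}r\big(U_{L}/\text{RSS},|L|\big)}&\text{otherwise.}\end{cases}\tag{4}$$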

Here, the function $r:(0,\infty)\times\{1,\dots,p-q\}\rightarrow(0,\infty)$ satisfies the following conditions:

C1. For each $y\in\{1,\dots,p-q\}$, $r(x,y)$ is a continuous decreasing function of $x$ that approaches 0 as $x\rightarrow\infty$.

C2. For each $x\in(0,\infty)$, $r(x,y)$ is an increasing function of $y\in\{1,\dots,p-q\}$.

The motivation for the second of these conditions is as follows. According to (4), the weight on model ${\cal M}_{K}$ is proportional to $r(U_{K}/\text{RSS},|K|)$ , where $U_{K}/\text{RSS}$ is a data-based measure of the deviation of the model ${\cal M}_{K}$ from the true model and $|K|$ is the number of regression parameters that are set to 0. We want $r(U_{K}/\text{RSS},|K|)$ to be an increasing function of $|K|$ since this leads to $r(U_{K}/\text{RSS},|K|)$ being a decreasing function of $p-|K|$ , which is the number of regression parameters in the model ${\cal M}_{K}$ . As shown in the appendix, a weight for model ${\cal M}_{K}$ ( $K\in{\mathscr{K}}$ ) that is proportional to $\exp(-\text{GIC}(K)/2)$ has the form described by (4) above.
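To make the GIC special case of (4) concrete, the sketch below uses $r(x,y)=(1+x)^{-n/2}\exp(d\,y/2)$, which is the form obtained in the appendix for weights proportional to $\exp(-\text{GIC}(K)/2)$. The function names and the numerical values of $U_{K}/\text{RSS}$ are hypothetical and serve only as an illustration.

```python
import math


def gic_r(x: float, size: int, n: int, d: float) -> float:
    """GIC case of r: r(x, |K|) = (1 + x)^(-n/2) * exp(d * |K| / 2)."""
    return (1.0 + x) ** (-n / 2.0) * math.exp(d * size / 2.0)


def model_weights(u_over_rss: dict[frozenset, float], n: int, d: float) -> dict[frozenset, float]:
    """Weights of the form (4).

    Keys of u_over_rss are the nonempty sets K; values are U_K / RSS.
    The empty set (full model) receives weight 1 / (1 + sum of r values).
    """
    r_values = {K: gic_r(x, len(K), n, d) for K, x in u_over_rss.items()}
    denom = 1.0 + sum(r_values.values())
    weights = {frozenset(): 1.0 / denom}
    weights.update({K: r / denom for K, r in r_values.items()})
    return weights


# Hypothetical values of U_K / RSS for the three nonempty subsets of {3, 4}.
u_over_rss = {frozenset({3}): 0.02, frozenset({4}): 0.40, frozenset({3, 4}): 0.45}
n = 30
w_aic = model_weights(u_over_rss, n, d=2.0)
w_bic = model_weights(u_over_rss, n, d=math.log(n))

for K in w_aic:
    print(sorted(K), round(w_aic[K], 4), round(w_bic[K], 4))
```

Because a larger $d$ inflates $r$ for every nonempty $K$, the BIC choice $d=\ln(n)$ moves weight away from the full model whenever $\ln(n)>2$.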

The MATA interval $I({\mathscr{K}})$ for $\theta$ , with nominal coverage $1-\alpha$ and obtained by averaging (using the data-based weights (4)) over the models $\big{\{}{\cal M}_{K}:K\in{\mathscr{K}}\big{\}}$ is obtained as follows. Let

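$$h\big(z,\boldsymbol{y};{\mathscr{K}}\big)=\sum_{K\in{\mathscr{K}}}w(K;{\mathscr{K}})\,G_{n-p+|K|}\left(\frac{\boldsymbol{a}^{\top}\widehat{\boldsymbol{\beta}}_{K}-z}{S_{K}\,(v(K))^{1/2}}\right),$$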

where $G_{\nu}$ is the $t_{\nu}$ cdf. The MATA interval $I({\mathscr{K}})=\big[\widehat{\theta}_{\ell},\widehat{\theta}_{u}\big]$ is obtained by solving

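$$h\big(\widehat{\theta}_{\ell},\boldsymbol{y};{\mathscr{K}}\big)=1-\alpha/2\ \ \text{and}\ \ h\big(\widehat{\theta}_{u},\boldsymbol{y};{\mathscr{K}}\big)=\alpha/2$$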

for $\widehat{\theta}_{\ell}$ and $\widehat{\theta}_{u}$ .
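A short remark on why these two equations determine the endpoints uniquely, stated as a sketch under the assumptions that $S_{K}>0$ for every $K\in{\mathscr{K}}$ (which holds with probability one) and $0<\alpha<1$. Here $g_{\nu}$ denotes the $t_{\nu}$ density, a symbol introduced only for this remark. Since the weights are positive and sum to one,

$$\frac{\partial}{\partial z}\,h\big(z,\boldsymbol{y};{\mathscr{K}}\big)=-\sum_{K\in{\mathscr{K}}}\frac{w(K;{\mathscr{K}})}{S_{K}\,(v(K))^{1/2}}\;g_{n-p+|K|}\left(\frac{\boldsymbol{a}^{\top}\widehat{\boldsymbol{\beta}}_{K}-z}{S_{K}\,(v(K))^{1/2}}\right)<0,\qquad\lim_{z\rightarrow-\infty}h\big(z,\boldsymbol{y};{\mathscr{K}}\big)=1,\qquad\lim_{z\rightarrow\infty}h\big(z,\boldsymbol{y};{\mathscr{K}}\big)=0.$$

Thus $h$ is continuous and strictly decreasing in $z$ from 1 to 0, so each equation has exactly one solution, and $1-\alpha/2>\alpha/2$ gives $\widehat{\theta}_{\ell}<\widehat{\theta}_{u}$. In practice each endpoint can therefore be found by any bracketing root-finder.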

3. TWO IMPORTANT PRELIMINARY RESULTS

Remember the following definitions given in the introduction. Let $\mathscr{K}$ denote the family of all subsets of $\{q+1,\dots,p\}$ ( $1\leq q<p$ ), including the empty set. For each $K\in\mathscr{K}$ , let ${\cal M}_{K}$ denote the model for which $\beta_{i}=0$ for all $i\in K$ . Let $I({\mathscr{K}})$ denote the MATA interval, with nominal coverage $1-\alpha$ , obtained by averaging (using the data-based weights (4)) over the models $\big{\{}{\cal M}_{K}:K\in{\mathscr{K}}\big{\}}$ . Throughout this section we assume that $\boldsymbol{a}$ , $\boldsymbol{X}$ and $q$ are given. Remember, we assume that the last $p-q$ components of $\boldsymbol{a}$ are zeros. The following lemma, proved in the appendix, paves the way for Theorems 1 and 2, which are the main results of this section.

Lemma 1. For each given $K\in{\mathscr{K}}\setminus\{\varnothing\}$ ,

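$$G_{n-p+|K|}\left(\frac{\boldsymbol{a}^{\top}\widehat{\boldsymbol{\beta}}_{K}-\boldsymbol{a}^{\top}\boldsymbol{\beta}}{S_{K}\,(v(K))^{1/2}}\right)\tag{7}$$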

can be expressed as a function of $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ , ${\rm RSS}/\sigma^{2}$ and the variables in the set $\{\beta_{i}/\sigma:i\in K\}$ . Also, for $K=\varnothing$ , (7) can be expressed as a function of $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ and ${\rm RSS}/\sigma^{2}$ .

It is intuitively plausible that the wider the class of models over which one averages using specified data-based model weights, the smaller is the minimum coverage probability of the MATA interval, with nominal coverage $1-\alpha$ . Theorem 2 below formalizes this plausible result. Suppose that the integer $\ell$ satisfies $q+1<\ell<p$ . Let ${\mathscr{K}}^{**}$ denote the family of all subsets of $\{\ell+1,\dots,p\}$ , including the empty set. Obviously, ${\mathscr{K}}^{**}\subset{\mathscr{K}}$ . Let $I({\mathscr{K}}^{**})$ denote the MATA interval, with nominal coverage $1-\alpha$ , obtained by averaging (using the data-based weights (4), but with ${\mathscr{K}}$ replaced by ${\mathscr{K}}^{**}$ ) over the models $\big{\{}{\cal M}_{K}:K\in{\mathscr{K}}^{**}\big{\}}$ . The following theorem is a necessary preliminary to Theorem 2.

Theorem 1.

(a) The coverage probability of the MATA interval $I({\mathscr{K}})$, $P_{\boldsymbol{\beta},\sigma}(\theta\in I({\mathscr{K}}))$, is a function of $(1/\sigma)(\beta_{q+1},\ldots,\beta_{p})$.

(b) The coverage probability of the MATA interval $I({\mathscr{K}}^{**})$, $P_{\boldsymbol{\beta},\sigma}(\theta\in I({\mathscr{K}}^{**}))$, is a function of $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})$.

The proofs of parts (a) and (b) of this theorem are virtually identical and so only part (a) is proved in the appendix.

We will use the following theorem (proved in the appendix) in Section 4 to describe an easily-computed upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ .

Theorem 2. The minimum coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ , obtained by averaging (using the data-based weights (4)) over the models $\big{\{}{\cal M}_{K}:K\in{\mathscr{K}}\big{\}}$ is bounded above by the minimum over $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})\in{\mathbb{R}}^{p-\ell}$ of

$$P\big(\theta\in I({\mathscr{K}}^{**})\big),$$

where $I({\mathscr{K}}^{**})$ denotes the MATA interval, with nominal coverage $1-\alpha$ , obtained by averaging (using the data-based weights (4), but with ${\mathscr{K}}$ replaced by ${\mathscr{K}}^{**}$ ) over the models $\big{\{}{\cal M}_{K}:K\in{\mathscr{K}}^{**}\big{\}}$ .

4. AN EASILY-COMPUTED UPPER BOUND ON THE MINIMUM COVERAGE PROBABILITY OF THE MATA INTERVAL

In this section we present an easily-computed upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ , obtained by averaging (using the data-based weights (4)) over the models $\big{\{}{\cal M}_{K}:K\in{\mathscr{K}}\big{\}}$ . Assume, for notational convenience, that $\big{|}\text{corr}\big{(}\widehat{\theta},\widehat{\beta}_{j}\big{)}\big{|}$ is maximized with respect to $j\in\{q+1,\dots,p\}$ at $j=p$ . This assumption can always be satisfied using, if necessary, an initial rearrangement of the order of the last $p-q$ columns of the matrix $\boldsymbol{X}$ . Theorem 2 implies that this minimum coverage probability is bounded above by the coverage probability of the MATA interval $I({\mathscr{K}}^{*})$ , with nominal coverage $1-\alpha$ , for ${\mathscr{K}}^{*}=\big{\{}\varnothing,\{p\}\big{\}}$ and any given $\beta_{p}/\sigma$ . Theorem 1 of Kabaila, Welsh & Abeysekera (2016) provides a computationally-convenient expression for the latter coverage probability. This expression is easily minimized numerically with respect to $\beta_{p}/\sigma$ to obtain the value of an upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ .

To apply Theorem 1 of Kabaila, Welsh & Abeysekera (2016), we introduce the following notation. Let $\boldsymbol{c}$ be the $p$ -vector $(0,\dots,0,1)$ , whose first $p-1$ components are zeros. Also let $\widehat{\sigma}^{2}=\text{RSS}/(n-p)$ , $v_{\theta}=\text{var}(\widehat{\theta})/\sigma^{2}=\boldsymbol{a}^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{a}$ , $v_{p}=\text{var}(\widehat{\beta}_{p})/\sigma^{2}=\boldsymbol{c}^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{c}$ and $\gamma=\beta_{p}/(\sigma\,v_{p}^{1/2})$ . Observe that $\gamma$ is a scaled version of $\beta_{p}$ . This scaling is very helpful for the computation of the minimum coverage probability of the MATA interval, as this minimum coverage is achieved at roughly the same value of $\gamma$ , for small and moderate sample sizes $n$ . Define $\widetilde{\rho}=\text{corr}(\widehat{\theta},\widehat{\beta}_{p})$ , which is equal to $\boldsymbol{a}^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{c}\big{/}(v_{\theta}\,v_{p})^{1/2}$ . Note that $v_{\theta}$ , $v_{p}$ and $\widetilde{\rho}$ are known, whereas $\gamma$ is an unknown parameter. Also note that $|\widetilde{\rho}|=|\rho|_{\mbox{\footnotesize$ \rm max $}}$ . Finally, let $m=n-p$ .
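The quantities just defined are simple functions of the design matrix. The following is a minimal sketch, assuming NumPy is available; the function name `mata_scalars` and the toy design are illustrative and are not taken from the paper.

```python
import numpy as np

def mata_scalars(X, a, beta_p, sigma):
    """Return (v_theta, v_p, rho_tilde, gamma) as defined in the text."""
    p = X.shape[1]
    c = np.zeros(p)
    c[-1] = 1.0                                  # c = (0, ..., 0, 1)
    XtX_inv = np.linalg.inv(X.T @ X)
    v_theta = a @ XtX_inv @ a                    # var(theta_hat) / sigma^2
    v_p = c @ XtX_inv @ c                        # var(beta_hat_p) / sigma^2
    rho_tilde = (a @ XtX_inv @ c) / np.sqrt(v_theta * v_p)
    gamma = beta_p / (sigma * np.sqrt(v_p))      # scaled version of beta_p
    return v_theta, v_p, rho_tilde, gamma

# Toy design (made-up data): intercept plus two regressors, theta = beta_1.
rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
a = np.array([1.0, 0.0, 0.0])
v_theta, v_p, rho_tilde, gamma = mata_scalars(X, a, beta_p=0.5, sigma=1.0)
print(v_theta, v_p, rho_tilde, gamma)
```

In practice $\beta_{p}$ and $\sigma$ are unknown, so $\gamma$ is treated as a free parameter over which the coverage probability is minimized; the values passed here serve only to show the scaling.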

It follows from (4) that the weight $w\big{(}{\{p\}},{\mathscr{K}}^{*}\big{)}$ on the model ${\cal M}_{\{p\}}$ is given by

[TABLE]

Therefore, the function $w_{1}$ defined by Kabaila et al. (2016) must satisfy

[TABLE]

so that

[TABLE]

Condition C1 on the function $r$ implies that $w_{1}:[0,\infty)\rightarrow[0,1]$ is a decreasing continuous function, such that $w_{1}(z)$ approaches 0 as $z\rightarrow\infty$ . For the particular case that the weight on the model ${\cal M}_{K}$ is proportional to $\exp(-\text{GIC}(K)/2)$ , as shown in the appendix, $r(x,1)=\exp(d/2)\big{/}(1+x)^{n/2}$ and consequently

[TABLE]
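As a numerical companion to this weight function, the sketch below evaluates $w_{1}$ for the GIC-based weights. It assumes the normalization $w_{1}(z)=r(z/m,1)\big/\big(1+r(z/m,1)\big)$ with $r(x,1)=\exp(d/2)/(1+x)^{n/2}$, which gives $w_{1}(z)=1\big/\big(1+(1+z/m)^{n/2}\exp(-d/2)\big)$; this form is inferred from the stated proportionality to $\exp(-\text{GIC}(K)/2)$ and should be checked against (8). The computation is carried out on the log scale to avoid overflow.

```python
import math

def w1(z, n, m, d):
    """Weight on the model M_{p} as a function of z = gamma_hat**2.

    Assumed form: w1(z) = 1 / (1 + (1 + z/m)**(n/2) * exp(-d/2)).
    """
    log_inv_r = 0.5 * n * math.log1p(z / m) - 0.5 * d   # log(1 / r(z/m, 1))
    if log_inv_r > 0:
        e = math.exp(-log_inv_r)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_inv_r))

# Sample size and dimension of the numerical illustration in Section 5.
n, p = 60, 16
m = n - p
for label, d in [("AIC", 2.0), ("BIC", math.log(n))]:
    print(label, [round(w1(z, n, m, d), 4) for z in (0.0, 1.0, 4.0, 9.0)])
```

Under this assumed form, the weight is decreasing in $z$ and, for any fixed $z$, the BIC choice $d=\ln(n)$ places more weight on the smaller model than the AIC choice $d=2$ whenever $\ln(n)>2$.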

We now apply the results of Kabaila, Welsh & Abeysekera (2016). The function $\delta_{u}(x,y)$ is defined on page 4 of this paper. As shown on page 6 of this paper, for the scenario considered in the present paper, this function takes the following particular form. For $0<u<1$ , define $\delta_{u}(x,y)$ to be the solution for $\delta$ in the equation

[TABLE]

where $G_{\nu}$ denotes the $t_{\nu}$ cdf. An immediate consequence of Theorem 1 of Kabaila, Welsh & Abeysekera (2016) is that the coverage probability of the MATA interval $I({\mathscr{K}}^{*})$ , with nominal coverage $1-\alpha$ , and any given $\gamma$ is given by

[TABLE]

where $\Phi$ and $\phi$ denote the $N(0,1)$ cdf and pdf, respectively, and $f_{\nu}$ denotes the pdf of $(Q/\nu)^{1/2}$ , where $Q\sim\chi^{2}_{\nu}$ . As noted on page 6 of Kabaila, Welsh & Abeysekera (2016), the conditions required for Theorem 3 of Kabaila, Welsh & Abeysekera (2016) to hold are satisfied. This theorem implies that this coverage probability is an even function of $\gamma$ for fixed $\widetilde{\rho}$ and an even function of $\widetilde{\rho}$ for fixed $\gamma$ . The upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ , is obtained by setting $\widetilde{\rho}=|\rho|_{\mbox{\footnotesize$ \rm max $}}$ and then minimizing (9) over $\gamma\geq 0$ . The double integral (9) is very easily computed using the methods described in Appendix B of Kabaila, Welsh & Mainzer (2016). An R computer program for the computation of this double integral is available upon request.

5. NUMERICAL ILLUSTRATIONS

In this section, we present some computed values of the upper bound, described in the previous section, on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage 0.95, obtained using a weight for model ${\cal M}_{K}$ ( $K\in{\mathscr{K}}$ ) that is proportional to $\exp(-\text{GIC}(K)/2)$ for both $d=2$ (AIC) and $d=\ln(n)$ (BIC). Consider the real life Air Pollution data described in Section 11.14 of Chatterjee & Hadi (2012). The purpose of collecting this data was to study the dependence of total mortality on climate, socioeconomic and pollution explanatory variables. Let $z_{i+1}$ denote the explanatory variable $X_{i}$ described in Table 11.11 of Chatterjee & Hadi (2012), for $i=1,\dots,15$ . Consider the following linear regression model for this data:

[TABLE]

where the response variable $y$ is the total age-adjusted mortality from all causes, $\psi,\beta_{2},\dots,\beta_{16}$ are unknown parameters and $\varepsilon\sim N(0,\sigma^{2})$ , for $\sigma^{2}$ an unknown parameter. In this case, $n=60$ and $p=16$ . Suppose that $\mathscr{K}$ is the family of all subsets of $\{2,\dots,16\}$ including the empty set. For each $K\in\mathscr{K}$ , let ${\cal M}_{K}$ denote the model for which $\beta_{i}=0$ for all $i\in K$ . In other words, the number of models under consideration is $2^{15}=32,768$ . Suppose that the parameter of interest $\theta$ is $E(y)$ for $(z_{2},\dots,z_{16})=(z_{2}^{*},\dots,z_{16}^{*})$ , where $(z_{2}^{*},\dots,z_{16}^{*})$ is equal to

[TABLE]

Note that $(z_{2}^{*},\dots,z_{16}^{*})$ is well within the range of the values of $(z_{2},\dots,z_{16})$ in the data. Obviously, $\theta=\psi+\beta_{2}z_{2}^{*}+\dots+\beta_{16}z_{16}^{*}$ and so

[TABLE]

and $\theta=\beta_{1}$ . In this parametrization of the linear regression model, $\theta$ has the same meaning for all the models ${\cal M}_{K}$ , where $K\in\mathscr{K}$ . In this case, $|\rho|_{\mbox{\footnotesize$ \rm max $}}=0.9599$ and the upper bound on the minimum coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage 0.95, is (a) 0.8900 for $d=2$ (AIC) and (b) 0.7940 for $d=\ln(n)$ (BIC).
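The reparametrization used here can be checked numerically. The sketch below uses synthetic data (not the Air Pollution data) and assumes that the reparametrized model is obtained by replacing each explanatory variable $z_{i}$ by $z_{i}-z_{i}^{*}$, so that the intercept becomes $\theta$. It also computes $|\rho|_{\max}$ as the largest absolute correlation between $\widehat{\theta}$ and the other coefficient estimators.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 3                      # k regressors here; the example in the text has 15
Z = rng.normal(size=(n, k))       # synthetic stand-ins for the explanatory variables
y = 1.0 + Z @ np.array([0.5, -0.3, 0.0]) + rng.normal(size=n)
z_star = Z.mean(axis=0) + 0.1     # a point well inside the range of the data

# Original parametrization: y = psi + sum_i beta_i z_i + eps.
X0 = np.column_stack([np.ones(n), Z])
coef0, *_ = np.linalg.lstsq(X0, y, rcond=None)
theta_hat_0 = coef0[0] + z_star @ coef0[1:]

# Shifted parametrization: regressors z_i - z_i^*, so the intercept is theta.
X = np.column_stack([np.ones(n), Z - z_star])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
theta_hat = coef[0]

# |rho|_max: largest |corr(theta_hat, beta_hat_j)| over the remaining coefficients.
V = np.linalg.inv(X.T @ X)
corr = V[0, 1:] / np.sqrt(V[0, 0] * np.diag(V)[1:])
rho_max = np.max(np.abs(corr))
print(theta_hat_0, theta_hat, rho_max)
```

The two estimates of $\theta$ agree, which is the sense in which $\theta$ keeps the same meaning across the submodels once the regressors have been shifted.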

6. LARGE SAMPLE RESULTS FOR THE MATA INTERVAL

The main result of this section provides conditions under which the MATA interval $I({\mathscr{K}})$ , with weight on model ${\cal M}_{K}$ proportional to $\exp(-\text{GIC}(K)/2)$ , has minimum coverage probability (i.e. confidence coefficient) that converges to 0 as $n\rightarrow\infty$ . An important advantage of the results presented in Sections 1–5 is that they are exact finite sample results and consequently their interpretation is very straightforward. By contrast, large sample results can have subtleties in their interpretation. It is these subtleties that we briefly explore before stating the main results of this section. We begin by reminding the reader of Hodges’s superefficient estimator and the well-known subtleties in the interpretation of large sample results for this point estimator. We then note that similar subtleties in the interpretation of large sample results also occur in the context of confidence intervals. Finally, we present the main result of this section which concerns the MATA interval.

Hodges’s superefficient estimator is described as follows. Suppose that $X_{1},X_{2},\dots$ are independent and identically $N(\theta,1)$ distributed, where $\theta\in\Theta=\mathbb{R}$ . The usual estimator of $\theta$ is $\overline{X}_{n}=(\sum_{i=1}^{n}X_{i})/n$ . Of course, $nE\big{(}(\overline{X}_{n}-\theta)^{2}\big{)}=1$ for all $\theta\in\Theta$ . Hodges’s superefficient estimator is

[TABLE]

where $0<b<1$ . As shown on p.442 of Lehmann and Casella (1998), $\lim_{n\rightarrow\infty}nE\big{(}(T_{n}-\theta)^{2}\big{)}=1$ if $\theta\neq 0$ and $\lim_{n\rightarrow\infty}nE\big{(}(T_{n}-\theta)^{2}\big{)}=b^{2}$ if $\theta=0$ . Thus, at first sight, it may appear that $T_{n}$ performs better (in terms of mean squared estimation error) than $\overline{X}_{n}$ when the sample size $n$ is large. However, as Figure 2.1 on p.443 of Lehmann & Casella (1998) shows, this apparent improvement in performance is misinformative: the supremum over $\theta$ of $nE\big{(}(T_{n}-\theta)^{2}\big{)}$ approaches infinity as $n\rightarrow\infty$ . The problem with the analysis of $nE\big{(}(T_{n}-\theta)^{2}\big{)}$ for each fixed $\theta$ as $n\rightarrow\infty$ is that this is a limit result that is pointwise in the parameter space $\Theta$ . We should, instead, consider $nE\big{(}(T_{n}-\theta)^{2}\big{)}$ across the entire parameter space $\Theta$ for each fixed $n$ and then let $n\rightarrow\infty$ . As pointed out on p.153 of Hajek (1971):

Especially misinformative are those limit results that are not uniform. Then the limit can exhibit some features that are not even approximately true for any finite $n$ .

and

Super efficient estimates produced by L.J. Hodges (see LeCam 1953, p.280) have their shocking properties only in the limit. For any finite $n$ they behave quite poorly for some parameter values. These values, however, depend on $n$ and disappear in the limit.
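The non-uniformity described in these quotations can be seen numerically. The following sketch assumes the standard form of Hodges's estimator, namely $T_{n}=\overline{X}_{n}$ if $|\overline{X}_{n}|>n^{-1/4}$ and $T_{n}=b\,\overline{X}_{n}$ otherwise; the threshold $n^{-1/4}$, the grid sizes and the function names are assumptions made for illustration. The scaled risk $nE\big((T_{n}-\theta)^{2}\big)$ is approximated by a Riemann sum over $Z=n^{1/2}(\overline{X}_{n}-\theta)\sim N(0,1)$.

```python
import math

def hodges_scaled_risk(theta, n, b, grid=2001, zmax=10.0):
    """Approximate n * E[(T_n - theta)^2] by a Riemann sum over Z ~ N(0, 1)."""
    dz = 2.0 * zmax / (grid - 1)
    thresh = n ** (-0.25)                      # assumed threshold n^(-1/4)
    total = 0.0
    for i in range(grid):
        z = -zmax + i * dz
        xbar = theta + z / math.sqrt(n)
        t = xbar if abs(xbar) > thresh else b * xbar
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += n * (t - theta) ** 2 * phi * dz
    return total

def sup_risk(n, b):
    """Largest scaled risk over the grid theta = 0, 0.005, ..., 1."""
    return max(hodges_scaled_risk(k / 200.0, n, b) for k in range(201))

b = 0.5
for n in (10**2, 10**4):
    print(n, hodges_scaled_risk(0.0, n, b), hodges_scaled_risk(1.0, n, b), sup_risk(n, b))
```

At $\theta=0$ the scaled risk is close to $b^{2}$ and at a fixed $\theta\neq 0$ it is close to 1, while the maximum over the grid grows with $n$. The parameter values at which the estimator behaves poorly move towards 0 as $n$ increases, which is why they are invisible in the pointwise limit.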

Kabaila (1995) presents the following confidence interval analogue of Hodges’s superefficient estimator. Suppose that $X_{1},X_{2},\dots$ have the same probability distribution as before. Also define $\overline{X}_{n}$ and $T_{n}$ as before. The usual $1-\alpha$ confidence interval for $\theta$ is $I_{n}=\big{[}\overline{X}_{n}-n^{-1/2}z_{1-\alpha/2},\overline{X}_{n}+n^{-1/2}z_{1-\alpha/2}\big{]}$ , where the quantile $z_{a}$ is defined by the requirement that $P(Z\leq z_{a})=a$ for $Z\sim N(0,1)$ . Of course, $P_{\theta}(\theta\in I_{n})=1-\alpha$ for all $\theta$ and $n^{1/2}(\text{length of }I_{n})=2z_{1-\alpha/2}$ . Let

[TABLE]

where, as before, $0<b<1$ . Now define the confidence interval $J_{n}=\big{[}T_{n}-n^{-1/2}z_{1-\alpha/2}W_{n}^{1/2},T_{n}+n^{-1/2}z_{1-\alpha/2}W_{n}^{1/2}\big{]}$ . It may be shown that for each $\theta$ , $\lim_{n\rightarrow\infty}P_{\theta}(\theta\in J_{n})=1-\alpha$ . In addition, it may be shown that $\lim_{n\rightarrow\infty}P_{\theta}(n^{1/2}(\text{length of }J_{n})=2z_{1-\alpha/2}b)=1$ for $\theta=0$ and $\lim_{n\rightarrow\infty}P_{\theta}(n^{1/2}(\text{length of }J_{n})=2z_{1-\alpha/2})=1$ for all $\theta\neq 0$ . Thus, at first sight it may appear that the confidence interval $J_{n}$ performs better than the confidence interval $I_{n}$ when $n$ is large. Kabaila (1995) shows that this apparent improvement in performance is misinformative: the infimum over $\theta$ of $P_{\theta}(\theta\in J_{n})$ approaches 0 as $n\rightarrow\infty$ . In other words, the confidence coefficient of $J_{n}$ approaches 0 as $n\rightarrow\infty$ . The problem with the analysis of $P_{\theta}(\theta\in J_{n})$ for each fixed $\theta$ as $n\rightarrow\infty$ is that this is a limit result that is pointwise in the parameter space $\Theta$ . We should, instead, consider $P_{\theta}(\theta\in J_{n})$ across the entire parameter space $\Theta$ for each fixed $n$ and then let $n\rightarrow\infty$ . This point is also made by Leeb & Pötscher (2005, pp.31–32).
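The collapse of the confidence coefficient of $J_{n}$ can be illustrated by computing its coverage probability directly. The sketch below rests on two assumptions that should be checked against the displayed definitions: $T_{n}$ shrinks $\overline{X}_{n}$ by the factor $b$ when $|\overline{X}_{n}|\leq n^{-1/4}$, and $W_{n}$ equals $b^{2}$ on that event and 1 otherwise. The quantile is taken to be the two-sided value $z_{1-\alpha/2}$, so that $I_{n}$ has coverage $1-\alpha$. Writing $h=n^{1/2}\theta$ and $Z=n^{1/2}(\overline{X}_{n}-\theta)$, the coverage reduces to probabilities of intervals for $Z$.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def interval_prob(lo, hi):
    """P(lo <= Z <= hi) for Z ~ N(0, 1); zero if the interval is empty."""
    return max(0.0, STD_NORMAL.cdf(hi) - STD_NORMAL.cdf(lo))

def coverage_J(theta, n, b, alpha):
    """Coverage probability of J_n under the assumed forms of T_n and W_n."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    h = math.sqrt(n) * theta
    t = n ** 0.25
    s_lo, s_hi = -t - h, t - h            # shrinkage event: s_lo <= Z <= s_hi
    # No shrinkage: theta is covered iff |Z| <= z.
    p_no_shrink = interval_prob(-z, z) - interval_prob(max(-z, s_lo), min(z, s_hi))
    # Shrinkage: theta is covered iff |Z - h (1 - b) / b| <= z.
    c = h * (1.0 - b) / b
    p_shrink = interval_prob(max(c - z, s_lo), min(c + z, s_hi))
    return p_no_shrink + p_shrink

def min_coverage(n, b=0.5, alpha=0.05):
    """Smallest coverage over the grid theta = 0, 0.001, ..., 1."""
    return min(coverage_J(k / 1000.0, n, b, alpha) for k in range(1001))

for n in (10**2, 10**4, 10**6):
    print(n, coverage_J(0.0, n, 0.5, 0.05), coverage_J(1.0, n, 0.5, 0.05), min_coverage(n))
```

For each fixed $\theta$ the coverage is close to the nominal level, yet the minimum over the grid falls towards 0 as $n$ grows, in line with the result of Kabaila (1995) quoted above.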

We now present the main results of this section. Consider the linear regression model and parameter of interest $\theta=\boldsymbol{a}^{\top}\boldsymbol{\beta}$ described in the introduction. Remember, we assume that the last $p-q$ components of $\boldsymbol{a}$ are zeros. Also consider the MATA interval $I({\mathscr{K}}^{*})$ , with nominal coverage $1-\alpha$ and weight on model ${\cal M}_{K}$ proportional to $\exp(-\text{GIC}(K)/2)$ , described in Section 4. The large sample framework that we consider is that $p$ and $q$ are fixed and $n\rightarrow\infty$ . Of course, many of the quantities which were defined in Section 4 now depend on $n$ . We make this dependence explicit in the notation by using $\beta_{p,n}$ , $v_{\theta,n}$ , $v_{p,n}$ , $\rho_{n}$ , $d_{n}$ and $\gamma_{n}$ to denote $\beta_{p}$ , $v_{\theta}$ , $v_{p}$ , $\widetilde{\rho}$ , $d$ and $\gamma$ , respectively. Note that $v_{\theta,n}$ , $v_{p,n}$ and $\rho_{n}$ are known, whereas $\gamma_{n}$ is an unknown parameter. The main result of this section requires that the following assumption concerning $d_{n}$ holds.

Assumption A Suppose that $\{d_{n}\}$ is an increasing sequence of nonnegative numbers that diverges to $\infty$ as $n\rightarrow\infty$ . Also suppose that $d_{n}/n\rightarrow 0$ as $n\rightarrow\infty$ .

This assumption holds, for example, when $d_{n}=\ln(n)$ , in which case the weight on model ${\cal M}_{K}$ is proportional to $\exp(-\text{BIC}(K)/2)$ .

Theorem 3. Consider the linear regression model and parameter of interest $\theta$ described in the introduction. Also consider the MATA interval $I({\mathscr{K}}^{*})$ , with nominal coverage $1-\alpha$ and weight on model ${\cal M}_{K}$ proportional to $\exp(-\text{GIC}(K)/2)$ , described in Section 4. Here, ${\mathscr{K}}^{*}=\big{\{}\varnothing,\{p\}\big{\}}$ . Suppose that $p$ and $q$ are fixed and that $\boldsymbol{D}=\lim_{n\rightarrow\infty}\boldsymbol{X}^{\top}\boldsymbol{X}/n$ exists and is nonsingular. Also suppose that $\boldsymbol{a}^{\top}\boldsymbol{D}^{-1}\boldsymbol{c}\big{/}\big{(}\boldsymbol{a}^{\top}\boldsymbol{D}^{-1}\boldsymbol{a}\,\boldsymbol{c}^{\top}\boldsymbol{D}^{-1}\boldsymbol{c}\big{)}^{1/2}\neq 0$ . Finally, suppose that Assumption A holds. Then

(a) The infimum over $\gamma_{n}\in\mathbb{R}$ of $P(\theta\in I({\mathscr{K}}^{*}))$ converges to 0, as $n\rightarrow\infty$ .

(b) If $\beta_{p}$ and $\sigma^{2}$ ( $\sigma^{2}>0$ ) are fixed and $\beta_{p}\neq 0$ then $w(\varnothing;{\mathscr{K}}^{*})$ converges in probability to 1 and $P(\theta\in I({\mathscr{K}}^{*}))$ converges to $1-\alpha$ , as $n\rightarrow\infty$ .

(c) If $\beta_{p}$ and $\sigma^{2}$ ( $\sigma^{2}>0$ ) are fixed and $\beta_{p}=0$ then $w(\{p\};{\mathscr{K}}^{*})$ converges in probability to 1 and $P(\theta\in I({\mathscr{K}}^{*}))$ converges to $1-\alpha$ , as $n\rightarrow\infty$ .

This result is proved in the appendix. The most important part of this theorem is (a) which implies that the MATA interval, with weight on model ${\cal M}_{K}$ proportional to $\exp(-\text{BIC}(K)/2)$ , has confidence coefficient that approaches 0 as $n\rightarrow\infty$ . In other words, this MATA interval should not be used when $n$ is large. Parts (b) and (c) of this theorem do not provide useful information as they are limits as $n\rightarrow\infty$ pointwise in the parameter space.

Another way of looking at Theorem 3 is the following. Consider the asymptotic framework that $\beta_{p}$ and $\sigma^{2}$ ( $\sigma^{2}>0$ ) are both fixed. If $\beta_{p}=0$ then $\gamma_{n}=0$ and if $\beta_{p}>0$ then $\gamma_{n}$ diverges to $\infty$ at rate $O(n^{1/2})$ . Sequences $\gamma_{n}$ that diverge to $\infty$ at a slower rate are not included in this analysis. The proof of part (a) of Theorem 3 presents one such sequence for which the coverage probability of the MATA interval $I({\mathscr{K}}^{*})$ converges to 0. This sequence is “missed” in the asymptotic framework that $\beta_{p}$ and $\sigma^{2}$ are both fixed. In other words, this asymptotic framework does not lead to an accurate appreciation of the confidence coefficient of this MATA interval when $n$ is large.
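The contrast between the two kinds of sequence can be made concrete by evaluating the weight function along each of them. The sketch below assumes the form $w_{1}(z)=1\big/\big(1+(1+z/m)^{n/2}\exp(-d/2)\big)$ for the weight on ${\cal M}_{\{p\}}$ and, as a deterministic proxy that ignores sampling variation, evaluates it at $\gamma_{n}^{2}$ rather than at $\widehat{\gamma}_{n}^{2}$. The constant 0.01 in the sequence proportional to $n$ and the choice $p=16$ are arbitrary illustrative values.

```python
import math

def w1(z, n, m, d):
    """Assumed weight on M_{p}: 1 / (1 + (1 + z/m)**(n/2) * exp(-d/2))."""
    log_inv_r = 0.5 * n * math.log1p(z / m) - 0.5 * d
    if log_inv_r > 0:
        e = math.exp(-log_inv_r)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_inv_r))

p = 16
sample_sizes = (10**2, 10**4, 10**6, 10**8)
slow, fast = [], []
for n in sample_sizes:
    m, d = n - p, math.log(n)               # BIC: d_n = ln(n)
    slow.append(w1(d / 2.0, n, m, d))       # gamma_n^2 = d_n / 2, as in the proof of part (a)
    fast.append(w1(0.01 * n, n, m, d))      # gamma_n^2 proportional to n (fixed nonzero beta_p)
for n, ws, wf in zip(sample_sizes, slow, fast):
    print(n, round(ws, 4), wf)
```

Along the slowly diverging sequence the weight on the model that wrongly sets $\beta_{p}=0$ increases towards 1, whereas along the sequence that corresponds to a fixed nonzero $\beta_{p}$ it vanishes. Only the second behaviour is visible in the fixed-parameter asymptotic framework.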

We now turn our attention to the asymptotic framework that $m=n-p$ is fixed and $n\rightarrow\infty$ . The following result is proved in the appendix.

Theorem 4. Consider the linear regression model and parameter of interest $\theta$ described in the introduction. Also consider the MATA interval $I({\mathscr{K}}^{*})$ , with nominal coverage $1-\alpha$ and weight on model ${\cal M}_{K}$ proportional to $\exp(-\text{GIC}(K)/2)$ , described in Section 4. Suppose that $m=n-p$ is fixed. Also suppose that Assumption A holds. Then, for any given $\epsilon>0$ ,

[TABLE]

In other words, $w_{1}(\widehat{\gamma}^{2})$ converges in probability to 0 as $n\rightarrow\infty$ , uniformly in the parameter $\gamma$ .

This theorem and its proof suggest that the MATA interval described in this result will be close to the usual $1-\alpha$ confidence interval for $\theta$ based on the full model when $m=n-p$ is small compared to $n$ . An interpretation of this suggested result is that this MATA interval is rather uninteresting when $m=n-p$ is small compared to $n$ . A numerical exploration of the case that $m=n-p$ is small compared to $n$ is presented in the Supplementary Material.
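A quick numerical look at this regime is given below. It is a sketch only: it assumes the form $w_{1}(z)=1\big/\big(1+(1+z/m)^{n/2}\exp(-d/2)\big)$ with $d_{n}=\ln(n)$, and the values $m=5$ and $z\in\{0.1,1,4\}$ are arbitrary.

```python
import math

def w1(z, n, m, d):
    """Assumed weight on M_{p}: 1 / (1 + (1 + z/m)**(n/2) * exp(-d/2))."""
    log_inv_r = 0.5 * n * math.log1p(z / m) - 0.5 * d
    if log_inv_r > 0:
        e = math.exp(-log_inv_r)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_inv_r))

m = 5                                        # m = n - p held fixed
table = {}
for n in (20, 100, 1000, 10000):
    d = math.log(n)
    table[n] = [w1(z, n, m, d) for z in (0.1, 1.0, 4.0)]
    print(n, table[n])
```

With $m$ fixed, the factor $(1+z/m)^{n/2}$ grows geometrically in $n$ for any $z>0$, while $\exp(d_{n}/2)$ grows much more slowly under Assumption A. The weight on the smaller model therefore vanishes, and the MATA interval is driven by the full model.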

7. CONCLUSION

We have derived an easily-computed new upper bound on the minimum coverage probability (i.e. the confidence coefficient) of the MATA confidence interval in the context of all possible subsets of a given set of explanatory variables in a linear regression model. The main application of this upper bound is that it can be used to help eliminate poorly performing model weights from further consideration. In the Supplementary Material we present graphs similar to those displayed in Figure 1 for a wide range of values of $n$ and $p$ . These graphs, combined with the large sample results presented in Section 6, show that the MATA confidence interval with weight on a model that is proportional to $\exp(-\text{BIC}/2)$ , where BIC is the Bayesian Information Criterion for this model, should not be used if $|\rho|_{\mbox{\footnotesize$ \rm max $}}$ is not too far from 1 and $p/n$ is not too close to 1.

BIBLIOGRAPHY

Abramowitz, M. & Stegun, I.A. (1965). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York.

Buckland, S.T., Burnham, K.P. & Augustin, N.H. (1997). Model selection: an integral part of inference. Biometrics, 53, 603–618.

Casella, G. & Berger, R.L. (2002). Statistical Inference, 2nd edition. Duxbury, Pacific Grove, CA.

Chatterjee, S. & Hadi, A.S. (2012). Regression Analysis by Example, 5th edition. Wiley, Hoboken, NJ.

Fieberg, J. & Johnson, D.H. (2015). MMI: Multimodel inference or models with management implications. Journal of Wildlife Management, 79, 708–718.

Fletcher, D. & Turek, D. (2011). Model-averaged profile likelihood confidence intervals. Journal of Agricultural, Biological and Environmental Statistics, 17, 38–51.

Graybill, F. A. (1976). Theory and Application of the Linear Model. Duxbury, Pacific Grove CA.

Hajek, J. (1971). Limiting properties of likelihoods and inference. In Foundations of Statistical Inference: Proceedings of the Symposium on the Foundations of Statistical Inference prepared under the auspices of the Rene Descartes Foundation and held at the Department of Statistics, University of Waterloo, Ontario, Canada, from March 31 to April 9, 1970, V.P. Godambe & D.A. Sprott eds, pp. 142–159. Holt, Rinehart and Winston, Toronto.

Hjort, N.L. & Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association, 98, 879–899.

Johnson, N.L., Kotz, S. & Balakrishnan, N. (1995). Continuous Univariate Distributions, Volume 2, 2nd edition. Wiley, New York.

Kabaila, P. (1995). The effect of model selection on confidence regions and prediction regions. Econometric Theory, 11, 537–549.

Kabaila, P. & Leeb, H. (2006). On the large-sample minimal coverage probability of confidence intervals after model selection. Journal of the American Statistical Association, 101, 619–629.

Kabaila, P. & Giri, K. (2009). Upper bounds on the minimum coverage probability of confidence intervals in regression after model selection. Australian & New Zealand Journal of Statistics, 51, 271–287.

Kabaila, P., Welsh, A.H. & Abeysekera, W. (2016). Model-averaged confidence intervals. Scandinavian Journal of Statistics, 43, 35–48.

Kabaila, P., Welsh, A.H. & Mainzer, R. (2016). The performance of model averaged tail area confidence intervals. Communications in Statistics - Theory and Methods, DOI: 10.1080/03610926.2016.1242741

LeCam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. University of California Press, p.277–328.

Leeb, H. & Pötscher, B.M. (2005). Model selection and inference: facts and fiction. Econometric Theory, 21, 21–59.

Lehmann, E.L. & Casella, G. (1998). Theory of Point Estimation, 2nd edition. Springer, New York.

Turek, D. & Fletcher, D. (2012). Model-averaged Wald confidence intervals. Computational Statistics and Data Analysis, 56, 2809–2815.

Wang, H. & Zou, S.Z.F. (2013). Interval estimation by frequentist model averaging. Communications in Statistics - Theory and Methods, 42, 4342–4356.

APPENDIX

The function $r$ for weight on model ${\cal M}_{K}$ proportional to $\exp(-\text{GIC}(K)/2)$

Suppose that

[TABLE]

where $\text{GIC}(K)$ is given by (2) for each $K\in{\mathscr{K}}$ . As noted in Appendix B of Kabaila & Giri (2009), for each $K\in{\mathscr{K}}$ ,

[TABLE]

with the convention that $U_{K}=0$ for $K=\varnothing$ . It follows from this that

[TABLE]

and, for $K\in{\mathscr{K}}\setminus\{\varnothing\}$ ,

[TABLE]

It follows that $w(K;{\mathscr{K}})$ is of the form (4) for $r(x,y)=\exp(d\,y/2)\big{/}(1+x)^{n/2}$ , where $r:(0,\infty)\times\{1,\dots,p-q\}\rightarrow(0,\infty)$ satisfies conditions C1 and C2.

Proof of Lemma 1

Suppose that $K$ is given ( $K\in{\mathscr{K}}$ ). Let $\boldsymbol{\beta}_{K}$ denote the $p$ -vector obtained from $\boldsymbol{\beta}$ by setting to zero all of the components of $\boldsymbol{\beta}$ with indices belonging to $K$ . Since $K$ is a subset of $\{q+1,\dots,p\}$ , the first $q$ components of $\boldsymbol{\beta}_{K}$ are $(\beta_{1},\dots,\beta_{q})$ . Since we assume that the last $p-q$ components of $\boldsymbol{a}$ are zeros, $\boldsymbol{a}^{\top}\boldsymbol{\beta}=\boldsymbol{a}^{\top}\boldsymbol{\beta}_{K}$ . Thus

[TABLE]

Since $\boldsymbol{H}_{K}\boldsymbol{\beta}_{K}=\boldsymbol{0}$ ,

[TABLE]

It follows from this and (1) that

[TABLE]

Obviously, $\big{(}\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}_{K}\big{)}\big{/}\sigma=\big{(}\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}\big{)}\big{/}\sigma+\big{(}\boldsymbol{\beta}-\boldsymbol{\beta}_{K}\big{)}\big{/}\sigma$ . Hence $\boldsymbol{a}^{\top}\big{(}\widehat{\boldsymbol{\beta}}_{K}-\boldsymbol{\beta}_{K}\big{)}\big{/}\sigma$ can be expressed as a function of $\big{(}\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}\big{)}\big{/}\sigma$ and the variables in the set $\{\beta_{i}/\sigma:i\in K\}$ .

Now we turn our attention to the denominator of the right-hand side of (11). It follows from (10) that, for each $K\in{\mathscr{K}}$ ,

[TABLE]

with the convention that $V_{K}=0$ for $K=\varnothing$ . Hence, for each $K\in{\mathscr{K}}$ ,

[TABLE]

Suppose that $K\neq\varnothing$ . Note that $V_{K}$ can be expressed as a function of the random variables in the set $\{\widehat{\beta}_{i}/\sigma:i\in K\}$ . Therefore, $S_{K}/\sigma$ can be expressed as a function of ${\rm RSS}/\sigma^{2}$ and the random variables in the set $\{\widehat{\beta}_{i}/\sigma:i\in K\}$ . Hence (11) can be expressed as a function of $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ , ${\rm RSS}/\sigma^{2}$ and the random variables in the set $\{\widehat{\beta}_{i}/\sigma:i\in K\}$ . Since $\widehat{\beta}_{i}/\sigma=(\widehat{\beta}_{i}-\beta_{i})/\sigma+\beta_{i}/\sigma$ for all $i\in K$ , (11) can be expressed as a function of $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ , ${\rm RSS}/\sigma^{2}$ and the variables in the set $\{\beta_{i}/\sigma:i\in K\}$ . Also, for $K=\varnothing$ , (11) can be expressed as a function of $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ and ${\rm RSS}/\sigma^{2}$ .

Proof of Theorem 1(a)

It may be shown that, for given $\boldsymbol{y}$ , $h\big{(}z,\boldsymbol{y};{\mathscr{K}}\big{)}$ is a continuous decreasing function of $z$ . It follows from this that, for any given $z$ ,

[TABLE]

Thus the coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ , is

[TABLE]

We see from (4) that, for each $K\in{\mathscr{K}}$ , $w(K;{\mathscr{K}})$ is a function of $\text{RSS}/\sigma^{2}$ and $(1/\sigma)(\widehat{\beta}_{q+1},\dots,\widehat{\beta}_{p})$ . It follows from Lemma 1 that the vector of random variables in the set

[TABLE]

can be expressed as a function of $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ , ${\rm RSS}/\sigma^{2}$ and $(1/\sigma)(\beta_{q+1},\dots,\beta_{p})$ . Therefore

[TABLE]

can be expressed as a function of $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ , ${\rm RSS}/\sigma^{2}$ and $(1/\sigma)(\beta_{q+1},\dots,\beta_{p})$ .

Now $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma$ and ${\rm RSS}/\sigma^{2}$ are independent random variables with $(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})/\sigma\sim N\big{(}\boldsymbol{0},(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\big{)}$ and ${\rm RSS}/\sigma^{2}\sim\chi^{2}_{n-p}$ . Hence the coverage probability of the MATA interval $I({\mathscr{K}})$ is a function of $(1/\sigma)(\beta_{q+1},\dots,\beta_{p})$ .

Proof of Theorem 2

Suppose that $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})$ is given. Choose $\beta_{q+1}/\sigma=\cdots=\beta_{\ell}/\sigma=t$ . We will consider $t\rightarrow\infty$ . Define ${\mathscr{J}}$ to be the family of sets that belong to ${\mathscr{K}}$ and include at least one element of the set $\{q+1,\ldots,\ell\}$ . Remember, ${\mathscr{K}}^{**}$ denotes the family of all subsets of $\{\ell+1,\dots,p\}$ , including the empty set. Thus ${\mathscr{K}}={\mathscr{J}}\cup{\mathscr{K}}^{**}$ , where ${\mathscr{J}}$ and ${\mathscr{K}}^{**}$ are disjoint sets. Hence

[TABLE]

Now consider $K$ to be a given element of ${\mathscr{J}}$ . It can be proved that $\big{(}\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big{)}^{-1}$ is a symmetric positive definite matrix. The noncentrality parameter $\lambda$ , given by (3), is bounded below by

[TABLE]

where $\|\cdot\|$ denotes the Euclidean norm. Since $K\in{\mathscr{J}}$ and $\beta_{q+1}/\sigma=\cdots=\beta_{\ell}/\sigma=t$ , $\|\boldsymbol{H}_{K}(\boldsymbol{\beta}/\sigma)\|^{2}\geq t^{2}$ and so $\lambda\rightarrow\infty$ as $t\rightarrow\infty$ . Thus

[TABLE]
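The lower bound on the noncentrality parameter is an instance of the Rayleigh quotient inequality $\boldsymbol{v}^{\top}\boldsymbol{A}\boldsymbol{v}\geq\lambda_{\min}(\boldsymbol{A})\,\|\boldsymbol{v}\|^{2}$ for a symmetric positive definite matrix $\boldsymbol{A}$. The sketch below checks it numerically, assuming that the noncentrality parameter has the quadratic form $(\boldsymbol{H}_{K}\boldsymbol{\beta}/\sigma)^{\top}\big(\boldsymbol{H}_{K}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{H}_{K}^{\top}\big)^{-1}(\boldsymbol{H}_{K}\boldsymbol{\beta}/\sigma)$ and that $\boldsymbol{H}_{K}$ selects the components of $\boldsymbol{\beta}$ indexed by $K$. The design matrix, the set $K$ and the coefficient values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 5
X = rng.normal(size=(n, p))                  # synthetic design matrix
K = [3, 4]                                   # 0-based indices of the coefficients set to zero
H = np.eye(p)[K]                             # H_K picks out the components indexed by K
A = np.linalg.inv(H @ np.linalg.inv(X.T @ X) @ H.T)   # symmetric positive definite
lam_min = np.linalg.eigvalsh(A)[0]           # smallest eigenvalue of A

def check(t):
    """Return (noncentrality, lower bound, squared norm) when one component in K equals t."""
    beta_over_sigma = np.array([0.3, -0.2, 0.1, t, 0.5])
    v = H @ beta_over_sigma
    return v @ A @ v, lam_min * (v @ v), v @ v

results = [(t, *check(t)) for t in (1.0, 10.0, 100.0)]
for t, lam, bound, norm_sq in results:
    print(t, lam, bound, norm_sq)
```

Because the squared norm is at least $t^{2}$ and the smallest eigenvalue is positive, the bound, and with it the noncentrality parameter, grows without limit as $t\rightarrow\infty$.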

It follows from condition C1 on the function $r$ that

[TABLE]

For each $K\in{\mathscr{J}}$ ,

[TABLE]

Therefore, for each $K\in{\mathscr{J}}$ , $w(K;{\mathscr{K}})\buildrel p\over{\longrightarrow}0$ , as $\beta_{q+1}/\sigma=\cdots=\beta_{\ell}/\sigma=t\rightarrow\infty$ . Since

[TABLE]

the first term on the right-hand side of (13) converges in probability to zero as $\beta_{q+1}/\sigma=\cdots=\beta_{\ell}/\sigma=t\rightarrow\infty$ .

For $K\in{\mathscr{K}}^{**}$ ,

[TABLE]

It follows from (14) that, for each $K\in{\mathscr{K}}^{**}$ ,

[TABLE]

It follows from (13) that

[TABLE]

By Theorem 1, the coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ , is a function of $(1/\sigma)(\beta_{q+1},\ldots,\beta_{\ell})$ and $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})$ . Since we suppose that $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})$ is given, the infimum of this coverage probability over $(1/\sigma)(\beta_{q+1},\ldots,\beta_{\ell})$ and $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})$ is less than or equal to

[TABLE]

for every $(1/\sigma)(\beta_{q+1},\ldots,\beta_{\ell})\in{\mathbb{R}}^{\ell-q}$ . Also, it follows from (16) that

[TABLE]

approaches

[TABLE]

as $\beta_{q+1}/\sigma=\cdots=\beta_{\ell}/\sigma=t\rightarrow\infty$ . Therefore the infimum of the coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ , is less than or equal to

[TABLE]

Since this is true for every given $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})\in{\mathbb{R}}^{p-\ell}$ , the infimum of the coverage probability of the MATA interval $I({\mathscr{K}})$ , with nominal coverage $1-\alpha$ , is less than or equal to the minimum over $(1/\sigma)(\beta_{\ell+1},\ldots,\beta_{p})\in{\mathbb{R}}^{p-\ell}$ of (17).

Proof of Theorem 3

Consider the MATA interval described in the statement of the theorem and suppose that the assumptions made in this statement hold. It follows that the sequence $\{\rho_{n}\}$ converges to the non-zero number $\rho_{\infty}=\boldsymbol{a}^{\top}\boldsymbol{D}^{-1}\boldsymbol{c}\big{/}\big{(}\boldsymbol{a}^{\top}\boldsymbol{D}^{-1}\boldsymbol{a}\,\boldsymbol{c}^{\top}\boldsymbol{D}^{-1}\boldsymbol{c}\big{)}^{1/2}$ as $n\rightarrow\infty$ . Let $\widehat{\gamma}_{n}=\widehat{\beta}_{p}/\big{(}\widehat{\sigma}\,v_{p,n}^{1/2}\big{)}$ . As in Section 4, define the function $w_{1}$ by (8). It follows from p.40 of Kabaila, Welsh & Abeysekera (2016) that the function defined by (5) is given by

[TABLE]

where $w(\{p\};{\mathscr{K}}^{*})=w_{1}(\widehat{\gamma}_{n}^{2})$ and $w(\varnothing;{\mathscr{K}}^{*})=1-w_{1}(\widehat{\gamma}_{n}^{2})$ . Remember, the MATA interval is obtained by solving the equations (6). Since, for any given $\boldsymbol{y}$ , $h\big{(}z,\boldsymbol{y};{\mathscr{K}}^{*}\big{)}$ is a continuous decreasing function of $z\in\mathbb{R}$ , the coverage probability of the MATA interval $I({\mathscr{K}}^{*})$ is

[TABLE]

We will need the following consequence of the exponential inequality 4.4.26 on p.70 of Abramowitz & Stegun (1965):

[TABLE]

for all $z>0$ .

Proof of part (a)

We show that the coverage probability of the interval $I({\mathscr{K}}^{*})$ converges to 0 when we consider $\sigma^{2}>0$ to be fixed and that $\beta_{p,n}=\sigma\,(v_{p,n}\,d_{n}/2)^{1/2}$ . It follows from this that $\gamma_{n}^{2}=d_{n}/2$ . Now

[TABLE]

where $B_{n}=\widehat{\beta}_{p}/(\sigma\,v_{p,n}^{1/2})$ . Note that $B_{n}\sim N\big{(}(d_{n}/2)^{1/2},1\big{)}$ . It follows from this and Assumption A that $\widehat{\gamma}_{n}^{2}=(d_{n}/2)+O_{p}\big{(}d_{n}^{1/2}\big{)}$ . Hence, by the first inequality in (19), $w_{1}(\widehat{\gamma}_{n}^{2})\buildrel p\over{\longrightarrow}1$ , where $\buildrel p\over{\longrightarrow}$ denotes convergence in probability as $n\rightarrow\infty$ . It follows from the fact that $0<G_{m}(z)<1$ for all $z\in\mathbb{R}$ that

[TABLE]

Now

[TABLE]

where $A_{n}=\big{(}\widehat{\theta}-\theta\big{)}\big{/}\big{(}\sigma\,v_{\theta,n}^{1/2}\big{)}$ . By Assumption A and since $\widehat{\gamma}_{n}^{2}=(d_{n}/2)+O_{p}\big{(}d_{n}^{1/2}\big{)}$ ,

[TABLE]

Obviously, $\widehat{\sigma}/\sigma\buildrel p\over{\longrightarrow}1$ . Since

[TABLE]

As noted earlier, $\rho_{n}$ converges to $\rho_{\infty}\neq 0$ , as $n\rightarrow\infty$ . We have the following two cases to consider. If $\rho_{\infty}>0$ then $G_{m+1}$ , evaluated at the right-hand side of (20), converges in probability to 0, as $n\rightarrow\infty$ . If, on the other hand, $\rho_{\infty}<0$ then $G_{m+1}$ , evaluated at the right-hand side of (20), converges in probability to 1, as $n\rightarrow\infty$ . Consequently, if $\rho_{\infty}>0$ then $h\big{(}\theta,\boldsymbol{y};{\mathscr{K}}^{*}\big{)}\buildrel p\over{\longrightarrow}0$ and if $\rho_{\infty}<0$ then $h\big{(}\theta,\boldsymbol{y};{\mathscr{K}}^{*}\big{)}\buildrel p\over{\longrightarrow}1$ . It follows from (18) that $P(\theta\in I({\mathscr{K}}^{*}))$ converges to 0, as $n\rightarrow\infty$ , in both of these cases.

Proof of part (b)

Suppose that $\beta_{p}$ and $\sigma^{2}$ ( $\sigma^{2}>0$ ) are fixed and $\beta_{p}\neq 0$ . Now

[TABLE]

and $n\,v_{p,n}=n\,\boldsymbol{c}^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{c}=\boldsymbol{c}^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X}/n)^{-1}\boldsymbol{c}\rightarrow\boldsymbol{c}^{\top}\boldsymbol{D}^{-1}\boldsymbol{c}$ , as $n\rightarrow\infty$ . Thus $\gamma_{n}^{2}=O(n)$ . Now

[TABLE]

where $B_{n}=\widehat{\beta}_{p}/(\sigma\,v_{p,n}^{1/2})\sim N\big{(}\gamma_{n},1\big{)}$ . By the second inequality in (19), $w_{1}(\widehat{\gamma}_{n}^{2})\buildrel p\over{\longrightarrow}0$ . Thus $w(\varnothing;{\mathscr{K}}^{*})=1-w_{1}(\widehat{\gamma}_{n}^{2})\buildrel p\over{\longrightarrow}1$ .

It follows from the fact that $0<G_{m+1}(z)<1$ for all $z\in\mathbb{R}$ that

[TABLE]

Since

[TABLE]

for each $m$ , where $U(0,1)$ denotes the uniform distribution on the interval $(0,1)$ . By Slutsky’s theorem, $h\big{(}\theta,\boldsymbol{y};{\mathscr{K}}^{*}\big{)}\buildrel d\over{\longrightarrow}U(0,1)$ , where $\buildrel d\over{\longrightarrow}$ denotes convergence in distribution, as $n\rightarrow\infty$ . It follows from (18) that $P(\theta\in I({\mathscr{K}}^{*}))\rightarrow 1-\alpha$ , as $n\rightarrow\infty$ .

Proof of part (c)

Suppose that $\beta_{p}$ and $\sigma^{2}$ ( $\sigma^{2}>0$ ) are fixed and $\beta_{p}=0$ . In this case $\widehat{\gamma}_{n}^{2}=O_{p}(1)$ and, by the first inequality in (19), $w_{1}(\widehat{\gamma}_{n}^{2})\buildrel p\over{\longrightarrow}1$ . Thus

[TABLE]

Since

[TABLE]

for each $m$ . By Slutsky’s theorem, $h\big{(}\theta,\boldsymbol{y};{\mathscr{K}}^{*}\big{)}\buildrel d\over{\longrightarrow}U(0,1)$ . It follows from (18) that $P(\theta\in I({\mathscr{K}}^{*}))\rightarrow 1-\alpha$ , as $n\rightarrow\infty$ .

Proof of Theorem 4

Obviously, $w_{1}(\widehat{\gamma}_{n}^{2})$ is a decreasing function of $\widehat{\gamma}_{n}^{2}$ . Now $\widehat{\gamma}_{n}^{2}$ has the same distribution as

[TABLE]

where $U$ and $Q$ are independent, $U$ has a noncentral $\chi^{2}$ distribution with 1 degree of freedom and noncentrality parameter $\gamma^{2}$ and $Q$ has a $\chi_{m}^{2}$ distribution. For every $c>0$ ,

[TABLE]

is a decreasing function of $\gamma^{2}$ , see e.g. Johnson, Kotz & Balakrishnan (1995, p.487). Suppose that $\epsilon>0$ is given. This result implies that

[TABLE]

where $P_{\gamma}$ denotes the probability for true parameter value $\gamma$ . Obviously,

[TABLE]

Suppose that $\gamma=0$ , so that $\widehat{\gamma}_{n}^{2}$ has an $F_{1,m}$ distribution, which does not depend on $n$ because $m$ is fixed. By Assumption A, $d_{n}/n\rightarrow 0$ as $n\rightarrow\infty$ . Therefore $P_{\gamma=0}\big{(}w_{1}(\widehat{\gamma}_{n}^{2})\geq\epsilon\big{)}\rightarrow 0$ as $n\rightarrow\infty$ .
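The monotonicity property used in this proof can be checked by simulation. The sketch below estimates $P\big(U/(Q/m)\leq c\big)$ by Monte Carlo for several values of $\gamma$, taking $U=(Z+\gamma)^{2}$ with $Z\sim N(0,1)$ and $Q\sim\chi^{2}_{m}$ independent of $Z$. The values $m=10$, $c=1$, the seed and the number of draws are arbitrary, and the same draws are reused for every $\gamma$ to reduce noise in the comparison.

```python
import random

def prob_ratio_below(c, gamma, m, draws):
    """Monte Carlo estimate of P(U / (Q / m) <= c) with U = (Z + gamma)^2."""
    hits = sum(1 for z, q in draws if (z + gamma) ** 2 <= c * q / m)
    return hits / len(draws)

random.seed(0)
m, c = 10, 1.0
# Each draw is a pair (Z, Q): Z ~ N(0, 1) and Q a sum of m squared N(0, 1) variables.
draws = [
    (random.gauss(0.0, 1.0), sum(random.gauss(0.0, 1.0) ** 2 for _ in range(m)))
    for _ in range(20000)
]
gammas = (0.0, 1.0, 2.0, 4.0)
estimates = [prob_ratio_below(c, g, m, draws) for g in gammas]
print(dict(zip(gammas, estimates)))
```

The estimated probabilities decrease as $\gamma$ increases, so the largest value of $P_{\gamma}\big(w_{1}(\widehat{\gamma}_{n}^{2})\geq\epsilon\big)$ occurs at $\gamma=0$, which is the case treated last in the proof.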