An Optimal Test for the Additive Model with Discrete or Categorical Predictors

Abhijit Mandal

arXiv:1906.06828·stat.ME·June 18, 2019

TL;DR

This paper introduces an asymptotically optimal test for additive models with discrete or categorical predictors, applicable to models with continuous covariates, and demonstrates its effectiveness through simulations and real data application.

Contribution

The paper develops a new goodness-of-fit test for additive models with discrete or categorical predictors, extending existing methods to handle mixed predictor types.

Findings

01

Test is asymptotically optimal with a $n^{-1/2}$ detection rate.

02

Simulation studies confirm theoretical properties.

03

Applied successfully to real data on diamond pricing.

Equations

$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p}(X_{pi})+\epsilon_{i},$$

$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p}(X_{pi})+\sum_{q=1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i},$$

$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p,{\boldsymbol{\theta}}_{p}}(X_{pi})+\sum_{q=1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i},$$

$$m_{p,{\boldsymbol{\theta}}_{p}}(X_{p})=\alpha_{p}+\sum_{s=1}^{r_{p}}\theta_{ps}X_{p}^{s},$$

$${}_{a}^{b}{\boldsymbol{X}}_{(p)}=\begin{pmatrix}X_{p1}^{a}&X_{p1}^{a+1}&\cdots&X_{p1}^{b}\\X_{p2}^{a}&X_{p2}^{a+1}&\cdots&X_{p2}^{b}\\\vdots&\vdots&\ddots&\vdots\\X_{pn}^{a}&X_{pn}^{a+1}&\cdots&X_{pn}^{b}\end{pmatrix},$$

[two equations involving $\boldsymbol{\theta}$, $\boldsymbol{m}_{P+q}$ and $\boldsymbol{m}_{[Z]}$ were lost in extraction]

$$Y_{i}=\alpha+\sum_{p=1}^{P_{1}}m_{p,{\boldsymbol{\theta}}_{p}}(X_{pi})+\sum_{p=P_{1}+1}^{P}m_{p}(X_{pi})+\sum_{q=1}^{Q_{1}}m_{P+q,{\boldsymbol{\theta}}_{P+q}}(Z_{qi})+\sum_{q=Q_{1}+1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i},$$

$$H_{0}:m_{p}(\cdot)\in\mathcal{M}_{p,\Theta_{p}}\mbox{ for }p=1,2,\cdots,P_{1},\mbox{ and }m_{p}(\cdot)=0\mbox{ for }p=P_{1}+1,P_{1}+2,\cdots,P.$$

$$RSS_{0}=\sum_{i=1}^{n}\Big(Y_{i}-\hat{\alpha}-\sum_{p=1}^{P_{1}}\widetilde{m}_{p,\widetilde{\boldsymbol{\theta}}_{p}}(X_{pi})-\sum_{q=1}^{Q_{1}}\widetilde{m}_{P+q,\widetilde{\boldsymbol{\theta}}_{P+q}}(Z_{qi})-\sum_{q=Q_{1}+1}^{Q}\widetilde{m}_{P+q}(Z_{qi})\Big)^{2}.$$

$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p}(X_{pi})+\sum_{q=1}^{Q_{1}}m_{P+q,{\boldsymbol{\theta}}_{P+q}}(Z_{qi})+\sum_{q=Q_{1}+1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i}.$$

$$RSS_{1}=\sum_{i=1}^{n}\Big(Y_{i}-\hat{\alpha}-\sum_{p=1}^{P}\widehat{m}_{p}(X_{pi})-\sum_{q=1}^{Q_{1}}\widehat{m}_{P+q,\widehat{\boldsymbol{\theta}}_{P+q}}(Z_{qi})-\sum_{q=Q_{1}+1}^{Q}\widehat{m}_{P+q}(Z_{qi})\Big)^{2},$$

$$\lambda_{n}(H_{0})=\frac{n(RSS_{0}-RSS_{1})}{RSS_{1}}.$$

$$\boldsymbol{\Sigma}_{pp^{\prime}}=\lim_{n\to\infty}(\boldsymbol{R}_{p}^{T}\boldsymbol{R}_{p})^{-1/2}\boldsymbol{R}_{p}^{T}\boldsymbol{R}_{p^{\prime}}(\boldsymbol{R}_{p^{\prime}}^{T}\boldsymbol{R}_{p^{\prime}})^{-1/2},$$

$$H_{1}:m_{p}(\cdot)=n^{-1/2}m_{p}^{*}(\cdot)\mbox{ for }p=1,2,\cdots,P,$$

$$H_{0}^{*}:m_{p}(\cdot)\in\mathcal{M}_{p,\Theta_{p}},\mbox{ for }p=1,2,\cdots,P.$$

$$RSS_{0}^{*}=\sum_{i=1}^{n}\Big(Y_{i}-\hat{\alpha}-\sum_{p=1}^{P}\widetilde{m}_{p,\widetilde{\boldsymbol{\theta}}_{p}}(X_{pi})-\sum_{q=1}^{Q}\widetilde{m}_{P+q}(Z_{qi})\Big)^{2},$$

$$RSS_{1}^{*}=\sum_{i=1}^{n}\Big(Y_{i}-\hat{\alpha}-\sum_{p=1}^{P}\widehat{m}_{p}(X_{pi})-\sum_{q=1}^{Q}\widehat{m}_{P+q}(Z_{qi})\Big)^{2},$$

$$H_{0}^{**}:m_{p}(\cdot)=0\mbox{ for all }p=1,2,\cdots,P.$$

$$RSS_{0}^{**}=\sum_{i=1}^{n}\Big(Y_{i}-\hat{\alpha}-\sum_{q=1}^{Q}\widetilde{m}_{P+q}(Z_{qi})\Big)^{2}.$$

$$\sigma_{pp^{\prime},ij}=\frac{1}{c_{pi}c_{p^{\prime}j}}P(X_{p}=x_{pi},X_{p^{\prime}}=x_{p^{\prime}j}),$$

$$m_{6}(Z_{1})=Z_{1},\ m_{7}(Z_{2})=Z_{2}^{2},\ m_{8}(Z_{3})=Z_{3}^{2},\mbox{ and }m_{9}(Z_{4})=\sin(\pi Z_{4}).$$

$$m_{1}(x_{1})=\beta(x_{1}-0.75)^{2},$$

$$\beta_{LS}=\frac{Cov(X_{1},Y)}{Var(X_{1})}=-p\,m_{1}(0)+(1-2q)\,m_{1}(1)+q\,m_{1}(2),$$

$m_{1}(x_{1})$, $m_{3}(x_{3})$

$$\boldsymbol{S}_{p}=\begin{pmatrix}n_{p1}^{-1}\boldsymbol{J}_{n_{p1}}&\boldsymbol{O}_{n_{p1},n_{p2}}&\cdots&\boldsymbol{O}_{n_{p1},n_{pk_{p}}}\\\boldsymbol{O}_{n_{p2},n_{p1}}&n_{p2}^{-1}\boldsymbol{J}_{n_{p2}}&\cdots&\boldsymbol{O}_{n_{p2},n_{pk_{p}}}\\\vdots&\vdots&\ddots&\vdots\\\boldsymbol{O}_{n_{pk_{p}},n_{p1}}&\boldsymbol{O}_{n_{pk_{p}},n_{p2}}&\cdots&n_{pk_{p}}^{-1}\boldsymbol{J}_{n_{pk_{p}}}\end{pmatrix},\mbox{ for }p=1,2,\cdots,P,$$

$$\boldsymbol{S}_{P+q}=(\boldsymbol{Z}_{z_{q}}^{T}\boldsymbol{K}_{z_{q}}\boldsymbol{Z}_{z_{q}})^{-1}\boldsymbol{Z}_{z_{q}}^{T}\boldsymbol{K}_{z_{q}},\mbox{ for }q=1,2,\cdots,Q,$$

$$\boldsymbol{Z}_{z_{q}}=\begin{pmatrix}1&(Z_{q1}-z_{q})&\cdots&(Z_{q1}-z_{q})^{d_{q}}\\1&(Z_{q2}-z_{q})&\cdots&(Z_{q2}-z_{q})^{d_{q}}\\\vdots&\vdots&\ddots&\vdots\\1&(Z_{qn}-z_{q})&\cdots&(Z_{qn}-z_{q})^{d_{q}}\end{pmatrix}.$$

$$\begin{pmatrix}\boldsymbol{I}_{n}&\boldsymbol{S}_{1}^{*}&\cdots&\boldsymbol{S}_{1}^{*}\\\boldsymbol{S}_{2}^{*}&\boldsymbol{I}_{n}&\cdots&\boldsymbol{S}_{2}^{*}\\\vdots&\vdots&\ddots&\vdots\\\boldsymbol{S}_{P+Q}^{*}&\boldsymbol{S}_{P+Q}^{*}&\cdots&\boldsymbol{I}_{n}\end{pmatrix}\begin{pmatrix}\boldsymbol{m}_{1}\\\boldsymbol{m}_{2}\\\vdots\\\boldsymbol{m}_{P+Q}\end{pmatrix}=\begin{pmatrix}\boldsymbol{S}_{1}^{*}\\\boldsymbol{S}_{2}^{*}\\\vdots\\\boldsymbol{S}_{P+Q}^{*}\end{pmatrix}{\boldsymbol{Y}}^{*},$$

Full text

An Optimal Test for the Additive Model with Discrete or Categorical Predictors

Abhijit Mandal

Department of Mathematics, Wayne State University

Detroit, MI 48202, U.S.A

Abstract

In multivariate nonparametric regression, additive models are very useful when a suitable parametric model is difficult to find. The backfitting algorithm is a powerful tool for estimating the additive components. However, due to the complexity of the estimators, the asymptotic $p$-value of the associated test is difficult to calculate without a Monte Carlo simulation. Moreover, the conventional tests assume that the predictor variables are strictly continuous. In this paper, a new test is introduced for the additive components with discrete or categorical predictors, where the model may contain continuous covariates. This method is also applied to semiparametric regression to test the goodness-of-fit of the model. These tests are asymptotically optimal in terms of the rate of convergence, as they can detect a specific class of contiguous alternatives at a rate of $n^{-1/2}$. An extensive simulation study is presented to support the theoretical results derived in this paper. Finally, the method is applied to a real data set to model the diamond price based on its quality attributes and physical measurements.

AMS 2010 subject classification: Primary 62G10; Secondary 62J12, 62G20.

Keywords: Additive model; Categorical data analysis; Backfitting algorithm; Generalized likelihood ratio test; Semiparametric model; Local polynomial regression.

1 Introduction

The additive model is a widely used multivariate smoothing technique. It was originally suggested by Friedman and Stuetzle (1981) and popularized by the extensive discussion in Hastie and Tibshirani (1990). It models a random sample $\{(Y_{i},{\boldsymbol{X}}_{i}):i=1,2,\cdots,n\}$ by

$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p}(X_{pi})+\epsilon_{i},\tag{1.1}$$

where the random error $\epsilon_{i}$ has mean zero and constant variance $\sigma^{2}$, and the additive component $m_{p}$ is an unknown smooth function for $p=1,2,\cdots,P$. Stone (1985, 1986) has shown that the additive model effectively reduces a full $P$-dimensional nonparametric regression to a one-dimensional problem by fitting the model with the same asymptotic efficiency, i.e., an optimal convergence rate of $n^{-2/5}$ for twice continuously differentiable functions. So, it has the very desirable property of reducing the “curse of dimensionality” in a satisfactory manner. In this paper, the additive function is estimated using the backfitting algorithm proposed by Buja et al (1989). Opsomer and Ruppert (1997), Wand (1999) and Opsomer (2000) studied asymptotic properties of the backfitting estimators. In the literature, there are several other algorithms, such as the marginal integration estimation method (Tjøstheim and Auestad, 1994), the estimating equation method (Mammen et al, 1999) and the Bayesian backfitting algorithm (Hastie and Tibshirani, 2000), among others.

To our knowledge, there are relatively few theoretical results on the testing problem for additive models in which discrete or categorical (possibly mixed with continuous) explanatory variables are considered. Sperlich et al (2002) and Yang et al (2003) considered marginal integration estimators to construct tests for the additive components with continuous variables. The asymptotic critical values of these tests are difficult to obtain due to the complicated expressions for the bias and the variance of the test statistic. Moreover, the authors observed that the asymptotic accuracy of their result is limited for small and moderate sample sizes. In the same setup, Fan et al (2001) and Fan and Jiang (2005) proposed the generalized likelihood ratio (GLR) test for testing the significance of additive components using backfitting estimators. The idea is based on a comparison of pseudo-likelihood functions under the null and the alternative hypotheses, which leads to the log-ratio of the variance estimators under the null and the alternative. Similar to the maximum likelihood ratio tests in parametric models, the GLR test has the important fundamental property that the asymptotic null distribution of the test is independent of nuisance parameters and functions. This property is referred to as the Wilks phenomenon. The GLR test is asymptotically distribution-free, and it is asymptotically optimal in terms of the rate of convergence for the nonparametric hypothesis testing problem (see Ingster, 1993 and Spokoiny, 1996). However, the authors mentioned that the GLR test may not be accurate as the test statistic contains an unknown bias term. So, a Monte Carlo simulation or a bootstrap technique is performed to calculate the $p$-value of the test. This somewhat restricts the method from being widely applicable among general practitioners.

In this paper, we propose a GLR test for the additive components having discrete or categorical valued predictors, while the model may contain continuous valued covariates. For categorical predictors this test may be regarded as a generalized analysis of covariance (ANCOVA), where covariates are modeled by nonparametric functions and the normality assumption on the error term is not required. In this case, the predictors may be referred to as treatments or blocks of the experimental design.

The rest of the paper is organized as follows. We give an overview of the nonparametric additive model and the semiparametric additive model in Sections 2 and 3, respectively. In Section 4, we introduce the GLR test and present its theoretical properties. An extensive simulation study is performed in Section 5 to explore the behavior of the proposed test. In Section 6, we apply our method to a real data set on diamond prices and propose an appropriate model based on quality attributes and physical measurements. Section 7 has some concluding remarks. The assumptions of the theorems are given in Appendix A. Appendix B presents a brief description of the backfitting estimator. The proofs of the theorems are given in Appendix C.

2 The Nonparametric Additive Model

Let us consider a one dimensional response variable $Y$, a $P$-dimensional predictor ${\boldsymbol{X}}=(X_{1},X_{2},\cdots,X_{P})^{T}$ and an additional $Q$-dimensional covariate ${\boldsymbol{Z}}=(Z_{1},Z_{2},\cdots,Z_{Q})^{T}$. We assume that ${\boldsymbol{X}}$ contains only discrete or categorical variables, but ${\boldsymbol{Z}}$ may contain all types of variables: categorical, discrete or continuous. If $X_{p}$ is a discrete valued random variable, then $k_{p}$ denotes the number of distinct values, $x_{p1},x_{p2},\cdots,x_{pk_{p}}$, where $k_{p}<\infty$ for all $p=1,2,\cdots,P$. If $X_{p}$ is a categorical variable, then $x_{p1},x_{p2},\cdots,x_{pk_{p}}$ are the different levels of the variable. Let $(Y_{1},{\boldsymbol{X}}_{1},{\boldsymbol{Z}}_{1}),\cdots,(Y_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ be a random sample of size $n$ from $(Y,{\boldsymbol{X}},{\boldsymbol{Z}})$. The nonparametric additive model is given by

$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p}(X_{pi})+\sum_{q=1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i},\tag{2.1}$$

where the random error $\epsilon_{i}$ has mean zero and a constant variance $\sigma^{2}$. To ensure identifiability of the components of the additive model, we set $E(m_{p}(X_{pi}))=0$ for all $p=1,2,\cdots,P$, and $E(m_{P+q}(Z_{qi}))=0$ for all $q=1,2,\cdots,Q$. The intercept parameter $\alpha=E(Y_{i})$ is generally estimated by $\hat{\alpha}=\bar{Y}=\sum_{i}Y_{i}/n$. The backfitting estimator is used to estimate the nonparametric functions $m_{d}(\cdot)$ for $d=1,2,\cdots,P+Q$. For ease of readability, a discussion of the backfitting estimators is given in Appendix B. We have divided the regressors into two groups, predictors and covariates, as we construct a test of significance for the predictors only, while their effect is adjusted for the “nuisance” covariates. In other words, if we are interested in testing the effect of a subset of regressors in the additive model, we call those regressors predictors and the remaining regressors covariates. We require all predictors to be discrete or categorical variables, but there is no restriction on covariates.
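The additive model of this section can be illustrated with a small backfitting loop. The following is only a minimal sketch under stated assumptions, not the paper's implementation: it uses one categorical predictor smoothed by its level means, one continuous covariate smoothed by a Nadaraya-Watson kernel smoother (swapped in for a local polynomial smoother for brevity), a hand-picked bandwidth `h`, and simulated data. All function names and numerical values are hypothetical.

```python
import numpy as np

def level_mean_smoother(x):
    """Smoother matrix that replaces each value by the mean over its level of x."""
    same_level = (x[:, None] == x[None, :]).astype(float)
    return same_level / same_level.sum(axis=1, keepdims=True)

def kernel_smoother(z, h):
    """Nadaraya-Watson smoother matrix with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def backfit(y, smoothers, n_iter=50):
    """Cycle over components, smoothing partial residuals and re-centering."""
    alpha = y.mean()                      # intercept estimated by the sample mean
    fits = [np.zeros_like(y) for _ in smoothers]
    for _ in range(n_iter):
        for d, S in enumerate(smoothers):
            others = sum(f for j, f in enumerate(fits) if j != d)
            fits[d] = S @ (y - alpha - others)
            fits[d] -= fits[d].mean()     # identifiability: component has mean zero
    return alpha, fits

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(0)
n = 400
x = rng.integers(0, 3, size=n)            # categorical predictor with 3 levels
z = rng.uniform(-1, 1, size=n)            # continuous covariate
y = 2.0 + np.array([-1.0, 0.0, 1.0])[x] + np.sin(np.pi * z) + rng.normal(scale=0.3, size=n)

alpha_hat, (m1_hat, m2_hat) = backfit(y, [level_mean_smoother(x), kernel_smoother(z, h=0.15)])
level_effects = [m1_hat[x == j].mean() for j in range(3)]  # fitted effect per level
```

The fitted component for the categorical predictor is constant within each level, so it is fully described by the three values in `level_effects`.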

3 The Semiparametric Additive Model

The semiparametric additive model (SAM) is the combination of a parametric model and a nonparametric additive model. Here, some of the additive components are modeled parametrically while the remaining ones are unspecified and are estimated nonparametrically. First, we model predictors parametrically and covariates nonparametrically, then a generalized model is considered. In general, the predictors are assumed to be discrete valued random variables. However, if the predictors are ordinal categorical variables, their order or rank may also be modeled parametrically. Let us consider the following SAM model:

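$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p,{\boldsymbol{\theta}}_{p}}(X_{pi})+\sum_{q=1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i},\tag{3.1}$$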

where $\mathcal{M}_{p,\Theta_{p}}=\{m_{p,{\boldsymbol{\theta}}_{p}}(X_{pi}),{\boldsymbol{\theta}}_{p}\in\Theta_{p}\}$ is a family of parametric functions for $p=1,2,\cdots,P$. We assume that $m_{p,{\boldsymbol{\theta}}_{p}}(\cdot)$ is completely known except for the value of the parameter ${\boldsymbol{\theta}}_{p}$, $p=1,2,\cdots,P$. Opsomer and Ruppert (1999) and Jiang et al (2007) have studied this model when the parametric models are linear functions. One might be interested in building a SAM when the main interest of the study is to precisely quantify the effect of the predictors $X_{1},X_{2},\cdots,X_{P}$ on the dependent variable $Y$, but the relationship is observed in the presence of “nuisance” covariates $Z_{1},Z_{2},\cdots,Z_{Q}$. The use of parametric forms for the predictors, if properly specified, allows us to make easily interpretable inference about their effect on $Y$. On the other hand, by modeling the covariates nonparametrically, one may avoid introducing bias in the estimated relationship between the predictors and $Y$. Another situation in which a SAM would be useful is when one is fairly confident about the shape of the relationship between the predictors and $Y$, but not about that of the other covariates. It can be shown that by modeling some predictors using appropriate parametric functions, the risk of over-fitting the model is reduced by decreasing the overall degrees of freedom of the test.

To ensure the identifiability of the model, we assume that the expectation of the parametric term is zero, i.e. $E[m_{p,{\boldsymbol{\theta}}_{p}}(X_{p})]=0$ for all $p=1,2,\cdots,P$, and $E(m_{P+q}(Z_{qi}))=0$ for all $q=1,2,\cdots,Q$. We consider the case where $m_{p,{\boldsymbol{\theta}}_{p}}(\cdot)$ is a polynomial for all $p=1,2,\cdots,P$. As $X_{p}$ takes $k_{p}$ values, one needs at most $(k_{p}-1)$ parameters to completely specify $m_{p,{\boldsymbol{\theta}}_{p}}(\cdot)$. So, we assume that $m_{p,{\boldsymbol{\theta}}_{p}}(\cdot)$ is a polynomial of degree $r_{p}$, where $0<r_{p}<k_{p}-1$, $p=1,2,\cdots,P$. Therefore, with a slight abuse of notation, we write

$$m_{p,{\boldsymbol{\theta}}_{p}}(X_{p})=\alpha_{p}+\sum_{s=1}^{r_{p}}\theta_{ps}X_{p}^{s},\tag{3.2}$$

where ${\boldsymbol{\theta}}_{p}=(\theta_{p1},\theta_{p2},\cdots,\theta_{pr_{p}})^{T}$, $p=1,2,\cdots,P$. Note that $\alpha_{p}$ is not an independent parameter; it only centers $m_{p,{\boldsymbol{\theta}}_{p}}$ at zero. Let us define ${\boldsymbol{\theta}}=(\alpha^{*},{\boldsymbol{\theta}}_{1}^{T},{\boldsymbol{\theta}}_{2}^{T},\cdots,{\boldsymbol{\theta}}_{P}^{T})^{T}$, where $\alpha^{*}=\sum_{p}\alpha_{p}$. For $a<b$, we define

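$${}_{a}^{b}{\boldsymbol{X}}_{(p)}=\begin{pmatrix}X_{p1}^{a}&X_{p1}^{a+1}&\cdots&X_{p1}^{b}\\X_{p2}^{a}&X_{p2}^{a+1}&\cdots&X_{p2}^{b}\\\vdots&\vdots&\ddots&\vdots\\X_{pn}^{a}&X_{pn}^{a+1}&\cdots&X_{pn}^{b}\end{pmatrix},\tag{3.3}$$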

and if $a=b$ then ${}_{a}^{b}{\boldsymbol{X}}_{(p)}$ is a vector containing the first column of the matrix in Equation (3.3). Let ${\boldsymbol{X}}^{*}=(\boldsymbol{1}_{n},{}_{1}^{r_{1}}{\boldsymbol{X}}_{(1)},{}_{1}^{r_{2}}{\boldsymbol{X}}_{(2)},\cdots,{}_{1}^{r_{P}}{\boldsymbol{X}}_{(P)})$. Then, following Speckman (1988), the estimates of the additive components are derived from the backfitting algorithm as the solution of the following equations:

[TABLE]

provided $({\boldsymbol{X}}^{*T}{\boldsymbol{X}}^{*})^{-1}$ exists. Here, $\boldsymbol{S}_{P+q}^{*}$ is the centered smoothing matrix $\boldsymbol{S}_{d}^{*}$ for $d=P+q$ as defined after Equation (A.4) in Appendix B. Suppose $\boldsymbol{W}_{P+q}$ is the additive smoother matrix of $\boldsymbol{m}_{P+q}$, so that the backfitting estimate of $\boldsymbol{m}_{P+q}$ is $\widetilde{\boldsymbol{m}}_{P+q}=\boldsymbol{W}_{P+q}({\boldsymbol{Y}}^{*}-{\boldsymbol{X}}^{*}\widetilde{\boldsymbol{\theta}})$ for $q=1,2,\cdots,Q$. Let us define $\boldsymbol{W}_{[Z]}=\sum_{q=1}^{Q}\boldsymbol{W}_{P+q}$ and $\widetilde{\boldsymbol{m}}_{[Z]}=\sum_{q=1}^{Q}\widetilde{\boldsymbol{m}}_{P+q}$. Then, the above normal equations are solved non-iteratively as

[TABLE]
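The power block ${}_{a}^{b}{\boldsymbol{X}}_{(p)}$ and the design matrix ${\boldsymbol{X}}^{*}$ used in this section can be assembled in a few lines. The sketch below assumes, in line with the polynomial form of $m_{p,{\boldsymbol{\theta}}_{p}}$, that the columns of the block are the element-wise powers $a,a+1,\cdots,b$ of the observed predictor; the helper names `power_block` and `design_matrix` and the toy data are hypothetical.

```python
import numpy as np

def power_block(x, a, b):
    """Return the n x (b - a + 1) matrix whose columns are x**a, x**(a+1), ..., x**b."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** s for s in range(a, b + 1)])

def design_matrix(predictors, degrees):
    """Stack a column of ones and, for each predictor, the block of powers 1..r_p."""
    n = len(predictors[0])
    blocks = [np.ones((n, 1))]
    blocks += [power_block(x, 1, r) for x, r in zip(predictors, degrees)]
    return np.hstack(blocks)

# Toy example: two discrete predictors observed on n = 4 units (hypothetical values)
x1 = [0, 1, 2, 1]
x2 = [1, 3, 2, 2]
X_star = design_matrix([x1, x2], degrees=[2, 1])   # columns: 1, x1, x1^2, x2
```

With $a=b$ the block collapses to a single column, which matches the convention stated for that case.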

Sometimes the experimenter may have prior knowledge about some of the variables and would like to model them parametrically, while keeping the other variables in the nonparametric model. In that case, one may consider the following generalized SAM model:

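$$Y_{i}=\alpha+\sum_{p=1}^{P_{1}}m_{p,{\boldsymbol{\theta}}_{p}}(X_{pi})+\sum_{p=P_{1}+1}^{P}m_{p}(X_{pi})+\sum_{q=1}^{Q_{1}}m_{P+q,{\boldsymbol{\theta}}_{P+q}}(Z_{qi})+\sum_{q=Q_{1}+1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i},\tag{3.6}$$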

where $P_{1}\leq P,\ Q_{1}\leq Q$ and $\mathcal{M}_{p,\Theta_{p}}=\{m_{p,{\boldsymbol{\theta}}_{p}}(X_{pi}),{\boldsymbol{\theta}}_{p}\in\Theta_{p}\}$ for $p=1,2,\cdots,P_{1}$, $\mathcal{M}_{P+q,\Theta_{P+q}}=\{m_{P+q,{\boldsymbol{\theta}}_{P+q}}(Z_{qi}),{\boldsymbol{\theta}}_{P+q}\in\Theta_{P+q}\}$ for $q=1,2,\cdots,Q_{1}$ are families of parametric functions. For simplicity, we assume that $m_{d}$ is a polynomial of degree $r_{d}$ for all $d=1,2,\cdots,P_{1}$ and $d=P+1,P+2,\cdots,P+Q_{1}$.

4 The Generalized Likelihood Ratio Test

Let us consider the generalized SAM model defined in Equation (3.6). For this model, one may be interested mainly in two types of tests based on the predictor variables: a goodness-of-fit test for the parametric function and a model utility test for the nonparametric function. First, we present the generalized test that includes both types of tests; then we discuss the individual tests. So, we are now interested in the following null hypothesis:

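$$H_{0}:m_{p}(\cdot)\in\mathcal{M}_{p,\Theta_{p}}\mbox{ for }p=1,2,\cdots,P_{1},\mbox{ and }m_{p}(\cdot)=0\mbox{ for }p=P_{1}+1,P_{1}+2,\cdots,P.\tag{4.1}$$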

Under $H_{0}$, we define $\widetilde{m}_{d}(\cdot)$ and $\widetilde{m}_{d,\widetilde{\boldsymbol{\theta}}_{d}}$ as the backfitting estimators of $m_{d}(\cdot)$ and $m_{d,{\boldsymbol{\theta}}_{d}}$, respectively. Then, the residual sum of squares under $H_{0}$ is given by

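$$RSS_{0}=\sum_{i=1}^{n}\Big(Y_{i}-\hat{\alpha}-\sum_{p=1}^{P_{1}}\widetilde{m}_{p,\widetilde{\boldsymbol{\theta}}_{p}}(X_{pi})-\sum_{q=1}^{Q_{1}}\widetilde{m}_{P+q,\widetilde{\boldsymbol{\theta}}_{P+q}}(Z_{qi})-\sum_{q=Q_{1}+1}^{Q}\widetilde{m}_{P+q}(Z_{qi})\Big)^{2}.\tag{4.2}$$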

As we are testing only for predictors by keeping covariates unchanged, the unconstrained model is

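$$Y_{i}=\alpha+\sum_{p=1}^{P}m_{p}(X_{pi})+\sum_{q=1}^{Q_{1}}m_{P+q,{\boldsymbol{\theta}}_{P+q}}(Z_{qi})+\sum_{q=Q_{1}+1}^{Q}m_{P+q}(Z_{qi})+\epsilon_{i}.\tag{4.3}$$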

Under this model, the residual sum of squares is given by

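$$RSS_{1}=\sum_{i=1}^{n}\Big(Y_{i}-\hat{\alpha}-\sum_{p=1}^{P}\widehat{m}_{p}(X_{pi})-\sum_{q=1}^{Q_{1}}\widehat{m}_{P+q,\widehat{\boldsymbol{\theta}}_{P+q}}(Z_{qi})-\sum_{q=Q_{1}+1}^{Q}\widehat{m}_{P+q}(Z_{qi})\Big)^{2},\tag{4.4}$$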

where $\widehat{m}_{d}(\cdot)$ and $\widehat{m}_{d,\widehat{\boldsymbol{\theta}}_{d}}$ are the backfitting estimators of $m_{d}(\cdot)$ and $m_{d,{\boldsymbol{\theta}}_{d}}$, respectively, under the unconstrained model. The generalized likelihood ratio (GLR) test statistic for testing the null hypothesis $H_{0}$ is defined as

$$\lambda_{n}(H_{0})=\frac{n(RSS_{0}-RSS_{1})}{RSS_{1}}.\tag{4.5}$$

If the difference between $RSS_{0}$ and $RSS_{1}$ is small, then the GLR test statistic may be approximated by $n\log(\frac{RSS_{0}}{RSS_{1}})$. In the parametric model, this is equivalent to the log-likelihood ratio test statistic, where estimators are replaced by the corresponding maximum likelihood estimators. Generally, the nonparametric maximum likelihood estimate does not exist, and even when it does exist, the resulting maximum likelihood ratio test is not optimal (see Fan and Jiang, 2005; Hall and Marron, 1988). So, the GLR test statistic may be regarded as a log-ratio of the quasi-likelihoods.
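The relation between the GLR statistic and its logarithmic form follows from the first-order expansion $\log(1+u)\approx u$ with $u=(RSS_{0}-RSS_{1})/RSS_{1}$. A minimal numerical sketch is given below; the function names and the residual sums of squares are hypothetical values chosen for illustration, not results from the paper.

```python
import math

def glr_statistic(rss0, rss1, n):
    """GLR statistic: n * (RSS_0 - RSS_1) / RSS_1."""
    return n * (rss0 - rss1) / rss1

def glr_log_form(rss0, rss1, n):
    """Logarithmic form n * log(RSS_0 / RSS_1), close to the GLR statistic when RSS_0 is near RSS_1."""
    return n * math.log(rss0 / rss1)

# Hypothetical values for illustration
n = 500
rss0, rss1 = 130.0, 125.0
lam = glr_statistic(rss0, rss1, n)       # 20.0
lam_log = glr_log_form(rss0, rss1, n)    # slightly smaller, since log(1 + u) <= u
```

Because $\log(1+u)\leq u$ for $u>-1$, the logarithmic form never exceeds the ratio form, and the gap shrinks as $RSS_{0}/RSS_{1}$ approaches one.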

We assumed homoscedasticity of the error term, i.e. error $\epsilon_{i}$ in model (3.6) has a constant variance. If this assumption is not valid, one may consider a GLR statistic by taking weighted residual sum of squares. The subsequent analysis and the backfitting algorithm will be modified similar to the weighted likelihood approach for the parametric models.

4.1 Asymptotic Distribution

To derive the null distribution of the GLR test statistic, let us define $c_{p}=(\sqrt{c_{p1}},\sqrt{c_{p2}},\cdots,\sqrt{c_{pk_{p}}})^{T}$, where $c_{pj}=P(X_{p}=x_{pj})$ for $j=1,2,\cdots,k_{p}$ and $p=1,2,\cdots,P$. We also define ${}_{a}^{b}{\boldsymbol{Z}}_{(P+q)}$ in the same way as ${}_{a}^{b}{\boldsymbol{X}}_{(p)}$ in Equation (3.3), replacing ${\boldsymbol{X}}$ with ${\boldsymbol{Z}}$. Let us denote $\boldsymbol{T}^{*}=(\boldsymbol{1}_{n},{}_{1}^{r_{1}}{\boldsymbol{X}}_{(1)},{}_{1}^{r_{2}}{\boldsymbol{X}}_{(2)},\cdots,{}_{1}^{r_{P_{1}}}{\boldsymbol{X}}_{(P_{1})},{}_{1}^{r_{P+1}}\boldsymbol{Z}_{(P+1)},{}_{1}^{r_{P+2}}\boldsymbol{Z}_{(P+2)},\cdots,{}_{1}^{r_{P+Q_{1}}}\boldsymbol{Z}_{(P+Q_{1})})$. Suppose $I=\sum_{p=1}^{P_{1}}(k_{p}-r_{p}-1)+\sum_{p=P_{1}+1}^{P}k_{p}$, and $\boldsymbol{\Sigma}_{1}$ is an $I\times I$ dimensional block diagonal matrix whose $p$-th diagonal block is an identity matrix of order $(k_{p}-r_{p}-1)$ if $p\leq P_{1}$, and $\left(\boldsymbol{I}_{k_{p}}-c_{p}c_{p}^{T}\right)$ if $p>P_{1}$. Define another $I\times I$ dimensional block matrix $\boldsymbol{\Sigma}_{2}$, whose $p$-th diagonal block is the identity matrix of order $(k_{p}-r_{p}-1)$ if $p\leq P_{1}$, and of order $k_{p}$ if $p>P_{1}$. For $p\neq p^{\prime}\in\{1,2,\cdots,P\}$, the $pp^{\prime}$-th off-diagonal block of $\boldsymbol{\Sigma}_{2}$ is given by

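$$\boldsymbol{\Sigma}_{pp^{\prime}}=\lim_{n\to\infty}(\boldsymbol{R}_{p}^{T}\boldsymbol{R}_{p})^{-1/2}\boldsymbol{R}_{p}^{T}\boldsymbol{R}_{p^{\prime}}(\boldsymbol{R}_{p^{\prime}}^{T}\boldsymbol{R}_{p^{\prime}})^{-1/2},\tag{4.6}$$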

where $\boldsymbol{R}_{p}={}_{r_{p}+1}^{k_{p}-1}{\boldsymbol{X}}_{(p)}$ if $p\leq P_{1}$ and $\boldsymbol{R}_{p}={}_{0}^{k_{p}-1}{\boldsymbol{X}}_{(p)}$ if $p>P_{1}$. Then, the following theorem gives the asymptotic null distribution of the GLR statistic.

Theorem 1.

Suppose that regularity conditions (C1)–(C8) in Appendix A hold. Further assume that the limit of $n^{-1}\boldsymbol{T}^{*T}\boldsymbol{T}^{*}$ exists and it is invertible. Let us consider the unconstrained model (4.3) and the null hypothesis $H_{0}$ in (4.1), where $m_{p,{\boldsymbol{\theta}}_{p}}$ is a polynomial of degree $r_{p}$ and $0<r_{p}<(k_{p}-1)$, for $p=1,2,\cdots,P_{1}$. Then, under $H_{0}$, the asymptotic distribution of the GLR test statistic coincides with $\sum_{i=1}^{s}\lambda_{i}V_{i}^{2}$, where $V_{1},V_{2},\cdots,V_{s}$ are independent and identically distributed (i.i.d.) standard normal variables, $\lambda_{1},\lambda_{2},\cdots,\lambda_{s}$ are non-zero eigenvalues of $\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}$ and $s$ is the rank of $\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}$.

The proof of the theorem is given in Appendix C. Theorem 1 shows that the asymptotic null distribution of the GLR test statistic is a linear combination of chi-square variables. The critical region of the test may be calculated using the algorithm proposed by Davies (1980). Note that the null distribution does not depend on the modeling of covariates, as we keep the covariates unchanged under both the null and the alternative hypotheses. In practice, however, modeling some covariates parametrically reduces possible over-fitting, and thus improves the finite sample performance of the test. One must nevertheless be careful while modeling covariates, and it is important to verify whether such a parametric model is valid. A wrong model may cause severe power loss, as demonstrated in the simulation studies. As a special case of Theorem 1, when the predictor variables are pairwise independent, the null distribution reduces to a single chi-square distribution, as stated in the following corollary.
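Once the eigenvalues $\lambda_{1},\cdots,\lambda_{s}$ are available, a tail probability of $\sum_{i}\lambda_{i}V_{i}^{2}$ is all that is needed for a $p$-value. The sketch below does not implement the algorithm of Davies (1980); as a plainly named substitute, it approximates the tail probability by direct Monte Carlo simulation of the limiting variable. The function name, the number of draws and the example eigenvalues are assumptions made for illustration.

```python
import numpy as np

def weighted_chisq_pvalue(stat, eigenvalues, n_draws=200_000, seed=0):
    """Approximate P(sum_i lambda_i * V_i^2 >= stat) for i.i.d. standard normal V_i."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigenvalues, dtype=float)
    v = rng.standard_normal((n_draws, lam.size))
    draws = (v ** 2) @ lam                 # one simulated value of the weighted sum per row
    return float(np.mean(draws >= stat))

# Hypothetical observed statistic and eigenvalues
p_value = weighted_chisq_pvalue(7.3, [1.0, 0.8, 0.4])
```

This simulates only the known limiting distribution, so it is cheap and requires no refitting of the model; an exact numerical routine would remove the remaining Monte Carlo error.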

Corollary 1.

Suppose that the predictor variables are pairwise independent, and the assumptions of Theorem 1 hold. Then, under $H_{0}$, the asymptotic distribution of the GLR test statistic is a chi-square distribution with degrees of freedom $\sum_{p=1}^{P_{1}}(k_{p}-r_{p}-1)+\sum_{p=P_{1}+1}^{P}(k_{p}-1)$.
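The degrees of freedom in Corollary 1 can be checked numerically in the simplest setting. The sketch below assumes a single nonparametric predictor ($P_{1}=0$, $P=1$), so that $\boldsymbol{\Sigma}_{2}$ has no off-diagonal blocks, and uses hypothetical level probabilities. Since the probabilities sum to one, $c_{p}$ has unit length and $\boldsymbol{I}_{k_{p}}-c_{p}c_{p}^{T}$ is a projection of rank $k_{p}-1$.

```python
import numpy as np

probs = np.array([0.2, 0.3, 0.5])        # hypothetical level probabilities c_{pj}
c = np.sqrt(probs)                       # the vector c_p has entries sqrt(c_{pj})
k = probs.size

sigma1 = np.eye(k) - np.outer(c, c)      # diagonal block of Sigma_1 for p > P_1
sigma2 = np.eye(k)                       # single diagonal block of Sigma_2, order k_p
eigvals = np.linalg.eigvalsh(sigma1 @ sigma2 @ sigma1)
nonzero = eigvals[eigvals > 1e-10]       # keep the non-zero eigenvalues only
```

Here `nonzero` contains $k_{p}-1$ eigenvalues, all equal to one up to rounding, so the weighted sum in Theorem 1 is a chi-square variable with $k_{p}-1$ degrees of freedom in this case.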

It is shown in the simulation section that, even if the predictors are not pairwise independent, one may approximate the null distribution using Corollary 1 unless the predictors are strongly correlated. Simulation studies show that the approximation makes the test slightly anti-conservative in small or moderate sample sizes. However, as the sample size increases, it gives a good approximation.

It is interesting to note that the asymptotic null distribution of the GLR test statistic does not depend on the nuisance parameters: the design densities of ${\boldsymbol{Z}}$, the $m_{d}$ functions for the covariates and the error distribution. But, in general, it depends on the design densities of ${\boldsymbol{X}}$ as shown in Theorem 1. So, the Wilks phenomenon does not hold in the true sense; however, it does hold if the predictors are pairwise independent.

The main advantage of our method is that it is easy to calculate the $p$-values of the test. In fact, the discrete valued predictors make the test simple. On the other hand, as shown by Fan and Jiang (2005) and Jiang et al (2007), if the predictors are continuous, the GLR test becomes complicated, and the asymptotic null distribution depends on kernel density functions and bandwidth parameters. Moreover, the authors mentioned that the null distribution may not be accurate as the test statistic contains an unknown bias term. So, the null distribution is calculated by Monte Carlo simulation or using a conditional bootstrap method. In our test, we do not need any additional conditions to choose the bandwidth parameter for continuous covariates as long as assumption (C5) holds. Therefore, for simplicity, we may use the same bandwidth parameter which is optimal for estimation (see Section 5).

The main contribution of this paper is that the GLR test is constructed for the additive components with discrete or categorical predictors. The asymptotic distribution of the GLR test statistic is derived in Fan and Jiang (2005) with the strict assumption that all the predictors and covariates are continuous, more specifically, their marginal distributions must be Lipschitz continuous on some bounded support. Keeping in mind the real applications, this restriction is modified for the discrete or categorical predictors. For this reason, the proof of the theorem does not directly follow from the previous method, and we used a novel approach to derive the asymptotic distribution.

4.2 Power Function

We now consider the power function of the GLR test. Let us take the following contiguous alternative hypothesis:

$$H_{1}:m_{p}(\cdot)=n^{-1/2}m_{p}^{*}(\cdot)\mbox{ for }p=1,2,\cdots,P,\tag{4.7}$$

where $m_{p}^{*}$ is an additive function under the alternative hypothesis such that $E(m_{p}^{*})=0$. We define $m_{p}^{\prime}$ as the best fitted polynomial of degree $r_{p}$ to $m_{p}^{*}$ for $p=1,2,\cdots,P_{1}$, and $m_{p}^{\prime}=m_{p}^{*}$ for $p=P_{1}+1,P_{1}+2,\cdots,P$. Using the following theorem, we get the power function of the GLR test.

Theorem 2.

Let us consider the notations and assumptions of Theorem 1. Then, under $H_{1}$, the asymptotic distribution of the GLR test statistic coincides with $\delta^{2}+\sum_{i=1}^{s}\lambda_{i}V_{i}^{2}$, where $\delta^{2}=\sum_{r,s=1}^{P}E(m^{\prime}_{r}m^{\prime}_{s})$.

It is interesting to note that the GLR test detects a specific class of contiguous alternatives at a rate of $n^{-1/2}$. So, the power of the test is asymptotically optimal in terms of the rate of convergence. For a fixed alternative the power of the GLR test converges to one, i.e. the test is consistent. In the case of pairwise independent predictors, the theorem simplifies as below.

Corollary 2.

Suppose that the predictor variables are pairwise independent, and the assumptions of Theorem 1 hold. Then, under $H_{1}$, the asymptotic distribution of the GLR test statistic is non-central chi-square with the non-centrality parameter $\sum_{p=1}^{P}E({m_{p}^{\prime}}^{2})$ and the degrees of freedom $\sum_{p=1}^{P_{1}}(k_{p}-r_{p}-1)+\sum_{p=P_{1}+1}^{P}(k_{p}-1)$.
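Corollary 2 translates directly into an asymptotic power calculation: the power at level $\alpha$ is the probability that a non-central chi-square variable exceeds the central chi-square critical value. The sketch below approximates both by simulation, building the non-central variable from its definition as a sum of squared shifted normals. The function name, the degrees of freedom and the non-centrality values are hypothetical inputs, not quantities taken from the paper.

```python
import numpy as np

def asymptotic_power(df, noncentrality, level=0.05, n_draws=200_000, seed=0):
    """Monte Carlo approximation of P(noncentral chi-square > central critical value)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, df))
    null_draws = (z ** 2).sum(axis=1)                 # central chi-square with df degrees of freedom
    critical = np.quantile(null_draws, 1.0 - level)   # simulated critical value
    shift = np.zeros(df)
    shift[0] = np.sqrt(noncentrality)                 # squared norm of the shift is the non-centrality
    alt_draws = ((z + shift) ** 2).sum(axis=1)        # non-central chi-square draws
    return float(np.mean(alt_draws > critical))

# Hypothetical settings: df = 2 with increasing non-centrality
power_null = asymptotic_power(2, 0.0)
power_small = asymptotic_power(2, 1.0)
power_large = asymptotic_power(2, 5.0)
```

With zero non-centrality the function returns approximately the nominal level, and the power grows with the non-centrality parameter.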

It is not surprising that the GLR statistic is $\sqrt{n}$ -consistent, whereas most nonparametric tests have a relatively slower rate. Estimating the $m_{p}$ function corresponding to a discrete or categorical predictor $X_{p}$ is a finite dimensional problem as we assume that $k_{p}$ , the number of distinct values taken by $X_{p}$ , is finite. Thus, all $m_{p}$ functions for $p=1,2,\cdots,P$ of the additive models in (3.6) and (4.3) are equivalent to the parametric part of the semiparametric model. So, the corresponding convergence rate is consistent with Corollary 1 of Opsomer and Ruppert (1999). The same result is also obtained by Speckman (1988). However, this rate depends on the bandwidth parameter selected for smoothing the continuous covariates (if any). The bandwidth parameter must be selected based on assumption (C5) given in Appendix A. Even when the model contains continuous covariates, the GLR test has a convergence rate similar to that of a parametric model, because the null hypothesis concerns the significance of the predictors only. Intuitively, in large samples, the effect of the smoothers for continuous covariates cancels out when the two residual sums of squares are subtracted in the numerator of the GLR statistic in Equation (4.5). Meanwhile, the denominator (divided by $n$ ) of the GLR statistic converges to $\sigma^{2}$ , the error variance of the additive model. So, those continuous smooth functions do not have any major role in the rate of convergence of the GLR test as long as their bandwidth parameters are optimally selected.

We have started with a very general test in $H_{0}$ that includes both the goodness-of-fit test and the model utility test. For this reason, the construction of $\boldsymbol{\Sigma_{1}}$ and $\boldsymbol{\Sigma_{2}}$ used in Theorem 1 looks slightly complicated. However, if we are interested in only one type of test, those matrices become very simple. In these cases, the residual sums of squares $RSS_{0}$ and $RSS_{1}$ also have simpler expressions. We now discuss these special cases. To simplify the procedure further, we assume that all covariates are modeled nonparametrically.

4.3 The Goodness-of-Fit Test for the Semiparametric Model

Let us consider the full nonparametric additive model given in (2.1). With respect to that base model, we now construct a goodness-of-fit test for the semiparametric model (3.1), where all predictors are modeled parametrically. So, the null hypothesis for this problem is

[TABLE]

The residual sum of squares, under $H_{0}^{*}$ , is given by

[TABLE]

where $\widetilde{m}_{p,{\boldsymbol{\widetilde{\boldsymbol{\theta}}}}_{p}}$ is the backfitting estimator of $m_{p,{\boldsymbol{\theta}}_{p}}$ under $H_{0}^{*}$ for $p=1,2,\cdots,P$ , and $\widetilde{m}_{P+q}$ is the backfitting estimator of $m_{P+q}$ under $H_{0}^{*}$ for $q=1,2,\cdots,Q$ . Under the unconstrained nonparametric additive model, the residual sum of squares is given by

[TABLE]

where $\widehat{m}_{1}(\cdot),\widehat{m}_{2}(\cdot),\cdots,\widehat{m}_{P+Q}(\cdot)$ are the backfitting estimators under the full model given in (2.1). Suppose $L=\sum_{p=1}^{P}(k_{p}-r_{p}-1)$ , and $\boldsymbol{\Sigma}_{2}$ is an $L\times L$ dimensional block matrix, whose $p$ -th diagonal block is an identity matrix of order $(k_{p}-r_{p}-1)$ , and for $p\neq p^{\prime}\in\{1,2,\cdots,P\}$ the $pp^{\prime}$ -th off-diagonal block of $\boldsymbol{\Sigma}_{2}$ is given in Equation (4.6) with $\boldsymbol{R}_{p}={}_{r_{p}+1}^{k_{p}-1}{\boldsymbol{X}}_{(p)}$ . Notice that $\boldsymbol{\Sigma}_{1}$ , defined in Section 4.1, becomes an identity matrix in this setup as $P_{1}=P$ . Moreover, we need the existence of $n^{-1}{\boldsymbol{X}}^{*T}{\boldsymbol{X}}^{*}$ instead of $n^{-1}\boldsymbol{T}^{*T}\boldsymbol{T}^{*}$ , where ${\boldsymbol{X}}^{*}$ is defined in Section 3 after Equation (3.3). Then, the following result gives the asymptotic null distribution of the GLR goodness-of-fit test statistic $\lambda_{n}(H_{0}^{*})=\frac{n(RSS_{0}^{*}-RSS_{1}^{*})}{RSS_{1}^{*}}$ .

Corollary 3.

Suppose that the limit of $n^{-1}{\boldsymbol{X}}^{*T}{\boldsymbol{X}}^{*}$ exists and it is invertible, and regularity conditions (C1)–(C8) in Appendix A hold. Let us consider the unconstrained model (2.1) and the null hypothesis $H_{0}^{*}$ in (4.8), where $m_{p,{\boldsymbol{\theta}}_{p}}$ is a polynomial of degree $r_{p}$ and $0<r_{p}<(k_{p}-1)$ for all $p=1,2,\cdots,P$ . Then, the asymptotic distribution of the GLR goodness-of-fit test statistic, under $H_{0}^{*}$ , coincides with $\sum_{i=1}^{s}\lambda_{i}V_{i}^{2}$ , where $V_{1},V_{2},\cdots,V_{s}$ are i.i.d. standard normal variables, $\lambda_{1},\lambda_{2},\cdots,\lambda_{s}$ are non-zero eigenvalues of $\boldsymbol{\Sigma}_{2}$ and $s$ is the rank of $\boldsymbol{\Sigma}_{2}$ .

In general, the asymptotic null distribution of the GLR goodness-of-fit test statistic is a linear combination of chi-square variables. However, it reduces to a single chi-square distribution if the predictor variables are pairwise independent. In this case, the degrees of freedom of the test statistic are $\sum_{p=1}^{P}(k_{p}-r_{p}-1)$ .
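When the limiting distribution is a linear combination $\sum_{i=1}^{s}\lambda_{i}V_{i}^{2}$ rather than a single chi-square, a critical value can be approximated by simulation once the eigenvalues are available. The sketch below assumes the eigenvalues have already been computed and are passed in as a list; the function name and the example eigenvalues are hypothetical.

```python
import random


def weighted_chi2_quantile(eigenvalues, prob=0.95, n_draws=100_000, seed=1):
    """Monte Carlo quantile of sum_i lambda_i * V_i^2 with V_i i.i.d. standard normal."""
    rng = random.Random(seed)
    draws = sorted(
        sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigenvalues)
        for _ in range(n_draws)
    )
    return draws[min(n_draws - 1, int(prob * n_draws))]


# Hypothetical eigenvalues; equal eigenvalues of one recover a chi-square quantile.
print(weighted_chi2_quantile([1.3, 1.0, 0.7]))
print(weighted_chi2_quantile([1.0, 1.0]))
```

With all eigenvalues equal to one the result approximates the corresponding chi-square quantile, which gives a quick check of the simulation size.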

4.4 Model Utility Test for the Nonparametric Additive Model

Let us consider the null hypothesis that there is no association between $Y$ and $X_{1},\cdots,X_{P}$ , where $Z_{1},\cdots,Z_{Q}$ are covariates of the nonparametric additive model (2.1). The null hypothesis can be written as

$$H_{0}^{**}:m_{1}=m_{2}=\cdots=m_{P}=0.\qquad(4.11)$$

Let $\widetilde{m}_{P+1}(\cdot),\widetilde{m}_{P+2}(\cdot),\cdots,\widetilde{m}_{P+Q}(\cdot)$ be the backfitting estimators under $H_{0}^{**}$ . Then, the residual sum of squares, under $H_{0}^{**}$ , is given by

[TABLE]

Under the unconstrained nonparametric additive model (2.1), the residual sum of squares is given in Equation (4.10). So, the GLR nonparametric test statistic becomes $\lambda_{n}(H_{0}^{**})=\frac{n(RSS_{0}^{**}-RSS_{1}^{*})}{RSS_{1}^{*}}$ .
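To make the form of the statistic concrete, the following sketch computes it in the simplest special case, which is an assumption made here for illustration: a single categorical predictor ($P=1$) and no covariates ($Q=0$). In that case the null fit is the overall mean and the unconstrained fit is the mean of $Y$ within each level of the predictor. The function names are illustrative.

```python
from collections import defaultdict


def glr_statistic(rss0, rss1, n):
    """lambda_n = n * (RSS_0 - RSS_1) / RSS_1."""
    return n * (rss0 - rss1) / rss1


def one_way_glr(y, x):
    """GLR statistic for one categorical predictor x and no covariates."""
    n = len(y)
    grand_mean = sum(y) / n
    rss0 = sum((yi - grand_mean) ** 2 for yi in y)  # fit under the null

    groups = defaultdict(list)
    for yi, xi in zip(y, x):
        groups[xi].append(yi)
    level_mean = {level: sum(v) / len(v) for level, v in groups.items()}
    rss1 = sum((yi - level_mean[xi]) ** 2 for yi, xi in zip(y, x))  # full fit

    return glr_statistic(rss0, rss1, n)


# Toy data: RSS_0 = 17.5 and RSS_1 = 4, so the statistic is 6 * 13.5 / 4 = 20.25.
print(one_way_glr([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"]))
```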

Suppose $\boldsymbol{\Sigma}_{1}$ is a $K\times K$ dimensional block diagonal matrix whose $p$ -th diagonal block is $\left(\boldsymbol{I}_{k_{p}}-c_{p}c_{p}^{T}\right)$ for $p=1,2,\cdots,P$ . Define another $K\times K$ dimensional block matrix $\boldsymbol{\Sigma}_{2}$ , whose $p$ -th diagonal block is an identity matrix of order $k_{p}$ , and for $p\neq p^{\prime}\in\{1,2,\cdots,P\}$ the $ij$ -th element of the $pp^{\prime}$ -th off-diagonal block of $\boldsymbol{\Sigma}_{2}$ is given by

[TABLE]

where $i=1,2,\cdots,k_{p}$ and $j=1,2,\cdots,k_{p^{\prime}}$ . The following result gives the asymptotic distribution of the GLR nonparametric test statistic for testing the null hypothesis $H_{0}^{**}$ .

Corollary 4.

Let us assume that regularity conditions (C1)–(C8) in Appendix A hold. Let us consider the unconstrained model (2.1) and the null hypothesis $H_{0}^{**}$ in (4.11). Then, under $H_{0}^{**}$ , the asymptotic distribution of the GLR nonparametric test statistic coincides with $\sum_{i=1}^{s}\lambda_{i}V_{i}^{2}$ , where $V_{1},V_{2},\cdots,V_{s}$ are i.i.d. standard normal variables, $\lambda_{1},\lambda_{2},\cdots,\lambda_{s}$ are non-zero eigenvalues of $\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}$ and $s$ is the rank of $\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}$ .

If the predictor variables are pairwise independent, the GLR nonparametric test statistic asymptotically follows the chi-square distribution with degrees of freedom $\sum_{p=1}^{P}(k_{p}-1)$ under the null hypothesis. The setup of the GLR nonparametric test is similar to the classical ANCOVA when the predictors are categorical variables. In ANCOVA, predictors are called treatments or blocks, and covariates are modeled parametrically. The goal is to test the treatment or block effect in the design of experiments. So, the GLR test is a generalization of the classical ANCOVA in which covariates are modeled nonparametrically. Moreover, we do not need to assume that the error distribution is normal.

5 Simulation

In the first part of the simulation, we check the null distribution of the GLR nonparametric test statistic for the hypothesis given in (4.11). Then, we demonstrate the power of the GLR test and compare it with the F-test associated with the nested linear models in the regression analysis. And finally, the performance of the general GLR test for testing $H_{0}$ in (4.1) under the semiparametric model is presented. All numerical examples in this paper are performed using R software. The R code for the GLR test will be provided on request.

5.1 Null Distribution

Let us consider the nonparametric additive model defined in (2.1), where $P=5$ and $Q=4$ . All five random variables in ${\boldsymbol{X}}$ and the first two random variables in ${\boldsymbol{Z}}$ are discrete. The numbers of discrete values taken by $X_{1},X_{2},\cdots,X_{5}$ and $Z_{1},Z_{2}$ are 3, 4, 5, 4, 3 and 5, 4, respectively, starting from zero in increments of one. The probabilities for each of these variables are generated independently from a uniform (0,1) distribution, and then they are standardized so that the total probability becomes one. We have discarded very low values of probability ( $<0.05$ ) to avoid very small or zero frequencies. To make the situation general, we have taken a few independent and a few dependent variables. $X_{1}$ and $X_{2}$ are independent random variables; $X_{3},X_{4},X_{5}$ form a group of dependent variables, but they are independent of $X_{1}$ and $X_{2}$ . Similarly, $(Z_{1},Z_{2})$ and $(Z_{3},Z_{4})$ are two independent groups. The covariance matrices for the dependent groups are also generated randomly. Finally, these parameters are kept fixed throughout the entire simulation. To generate a set of dependent discrete variables, first, a random sample is drawn from a multivariate normal distribution with a fixed covariance matrix, then the observations are discretized based on their probabilities. Here, we are interested in testing the null hypothesis $H_{0}^{**}:m_{1}=m_{2}=\cdots=m_{5}=0$ . To show the null distribution we have taken $m_{p}=0$ for all $p=1,2,\cdots,5$ . For covariates, the $m_{d}$ functions are taken as

[TABLE]

Notice that these functions are not centered at mean zero. However, this does not violate the assumptions of model (2.1), as the constants needed to center those functions contribute to the intercept term $\alpha$ . The Nadaraya-Watson estimator (Watson, 1964) is used to smooth $Z_{3}$ and $Z_{4}$ . We used the default bandwidth parameter for the kernel density, $h=1.06sn^{-1/5}$ , where $s$ is the standard deviation of the corresponding variable, and $n$ is the sample size. We have taken two different types of error distributions – the standard normal distribution and the chi-square distribution with 5 degrees of freedom. A sample of size 500 from $(Y,{\boldsymbol{X}},{\boldsymbol{Z}})$ is generated, and this exercise is replicated 1,000 times. The histograms of the observed GLR nonparametric test statistic are presented in Figure 1, and the corresponding kernel density estimates are also plotted. The plots show that the empirical distributions match the theoretical null distribution obtained from Corollary 4. The plots give an indication of a slightly inflated level, but further simulation studies show that the convergence improves as the sample size increases. In the same figure, we have plotted the density of the null distribution of the GLR test under the independence assumption on the predictors. It is a single chi-square distribution with degrees of freedom $\sum_{p=1}^{P}(k_{p}-1)=14$ . It gives a good approximation of the null distribution. In fact, further simulation studies show that, unless some predictors are strongly correlated, this approximation works reasonably well. The error distributions differ between Figures 1(a) and 1(b), so the two plots demonstrate that the null distribution of the GLR test does not depend on the choice of the error distribution.
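The generation of dependent discrete variables described in this subsection (draw from a multivariate normal distribution, then discretize according to the target probabilities) can be sketched for a pair of variables as follows. This is a minimal illustration under stated assumptions: the bivariate case with a single correlation `rho`, thresholds taken from standard normal quantiles, and hypothetical probabilities. It is not the R code used for the simulations.

```python
import math
import random
from statistics import NormalDist


def cut_points(probs):
    """Standard normal thresholds whose cells have the given probabilities."""
    std_normal = NormalDist()
    cumulative, cuts = 0.0, []
    for p in probs[:-1]:
        cumulative += p
        cuts.append(std_normal.inv_cdf(cumulative))
    return cuts


def discretize(z, cuts):
    """Index 0, 1, ... of the cell into which the normal draw z falls."""
    return sum(z > c for c in cuts)


def dependent_discrete_pair(n, probs_a, probs_b, rho, seed=0):
    """Draw n pairs of dependent discrete values with the given marginals."""
    rng = random.Random(seed)
    cuts_a, cuts_b = cut_points(probs_a), cut_points(probs_b)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((discretize(z1, cuts_a), discretize(z2, cuts_b)))
    return pairs


# Hypothetical marginal probabilities and correlation.
sample = dependent_discrete_pair(500, [0.2, 0.5, 0.3], [0.25, 0.25, 0.25, 0.25], 0.6)
print(sample[:5])
```

The marginal probabilities are preserved by construction, while the dependence between the two discrete variables is inherited from the correlation of the underlying normal draws.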

5.2 Power Function

In the power calculation for testing $H_{0}^{**}:m_{1}=m_{2}=\cdots=m_{5}=0$ , we have taken the same setup of the previous example, except for the distribution of $X_{1}$ and the corresponding $m_{1}$ function. Here, we have taken

[TABLE]

and the distribution of $X_{1}$ is given by $P(X_{1}=0)=p^{2},\ P(X_{1}=1)=2pq\mbox{ and }P(X_{1}=2)=q^{2},$ where $p=1-q=0.75$ . So, it violates the null hypothesis $H_{0}^{**}$ if $\beta\neq 0$ . As we compare the power of the GLR test and the classical F-test associated with the nested linear models, the error term is generated from the standard normal distribution to make all results comparable. For each value of $\beta$ , we simulated a sample of size 500 and repeated this 1,000 times to calculate the observed power. The observed power is the proportion of test statistics greater than the corresponding critical value at the 5% level of significance obtained from Corollary 4. In Figure 2(a), the GLR test shows good power, but the F-test completely fails in this situation. The GLR test is slightly anti-conservative under the null, which was also reflected in Figure 1. In the same plot, we present the observed powers of a few other tests including the GLR test under the independence assumption, abbreviated as GLR (ind.). Its performance is very similar to that of the original GLR test although the predictors were not pairwise independent. We also plotted the theoretical power function of the GLR test calculated from Corollary 2 under the independence assumption. The observed and the theoretical power functions are very close to each other.
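The observed-power calculation (the proportion of simulated statistics exceeding the critical value) can be sketched in a reduced setting. The assumptions here are substantial and are made only to keep the example short: a single three-level predictor with the distribution of $X_{1}$ given above, no covariates, standard normal errors, and a hypothetical effect vector `effect` that stands in for the $m_{1}$ function of the paper. With one predictor the limiting null distribution is chi-square with $k_{1}-1=2$ degrees of freedom, whose upper quantile has the closed form $-2\log(\text{level})$.

```python
import math
import random


def glr_one_predictor(y, x, n_levels):
    """n * (RSS_0 - RSS_1) / RSS_1 for one discrete predictor coded 0..n_levels-1."""
    n = len(y)
    grand_mean = sum(y) / n
    rss0 = sum((yi - grand_mean) ** 2 for yi in y)
    sums, counts = [0.0] * n_levels, [0] * n_levels
    for yi, xi in zip(y, x):
        sums[xi] += yi
        counts[xi] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    rss1 = sum((yi - means[xi]) ** 2 for yi, xi in zip(y, x))
    return n * (rss0 - rss1) / rss1


def observed_power(beta, n=500, reps=1000, level=0.05, seed=0):
    """Share of replications in which the statistic exceeds the critical value."""
    rng = random.Random(seed)
    p = 0.75
    probs = [p * p, 2 * p * (1 - p), (1 - p) ** 2]
    effect = [beta, -beta, 0.0]  # hypothetical m_1, not the function used in the paper
    critical = -2.0 * math.log(level)  # exact chi-square quantile for 2 df
    rejections = 0
    for _ in range(reps):
        x = rng.choices([0, 1, 2], weights=probs, k=n)
        y = [effect[xi] + rng.gauss(0.0, 1.0) for xi in x]
        if glr_one_predictor(y, x, 3) > critical:
            rejections += 1
    return rejections / reps


print(observed_power(0.0), observed_power(0.2))
```

At `beta = 0` the returned proportion estimates the level of the test, and for nonzero `beta` it estimates the power at that alternative.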

If we fit a linear regression model in this setup, the breakdown of the F-test is apparent as the least squares estimates of the regression coefficients vanish in large samples. Conditioning on the other variables, the expected value of the least squares estimate of the regression coefficient corresponding to $X_{1}$ turns out to be

[TABLE]

which simplifies to zero for all values of $\beta$ in Equation (5.2) when $p=0.75$ . Thus, $H_{0}^{**}$ seems to be true for all values of $\beta$ with respect to a linear model, and the power function for the F-test centers around the nominal level of the test. On the other hand, the full additive model successfully captures the nonlinear relationship; and the power of the GLR test tends to one as $\beta$ increases.

In the same plot, we presented the observed power function of the GLR semiparametric test (abbreviated as GLR Semi.) assuming that all covariates are linearly related with $Y$ (although this is not true), where the predictors are modeled nonparametrically. As the covariates are highly nonlinear, including a $\sin$ function, the semiparametric test breaks down. Notice that the distribution of the GLR test statistic does not change if the covariates are modeled parametrically, so the theoretical critical value of the GLR semiparametric test is the same as that of the GLR nonparametric test, and it is obtained from Corollary 4. The GLR Semi. (ind.) test in Figure 2(a) is the approximation of the GLR semiparametric test under the independence assumption on the predictor variables. The performance of these two tests is similar to that of the F-test. It shows that a wrong model for the covariates may cause severe power loss.

We have investigated a few more cases by generating different relationships between $Y,X$ and $Y,Z$ . Figure 2(b) presents power functions where $Y,X$ is nonlinear as given in Equation (5.2), but $Y,Z$ is linear. The functions corresponding to the linear relationships between $Y$ and $Z$ are taken simply as $m_{d}(z)=z$ for $d=6,7,8,9$ instead of the nonlinear functions in Equation (5.1). Here, we get similar results for the GLR test and the F-test. The GLR semiparametric test gives almost the same power as its theoretical power function derived from Corollary 2. In fact, in this case, the theoretical power of the GLR semiparametric test is the same as that of the GLR nonparametric test. But the advantage of semiparametric modeling is that its finite sample performance is better than that of the full nonparametric test, if the modeling of the parametric part is correct. For this reason, the nonparametric GLR test shows a slightly inflated level, whereas the GLR semiparametric test properly maintains the level of the test.

In Figure 2(c), we have plotted the power functions when $Y,X$ is linear by taking $m_{1}(x)=\beta x$ instead of Equation (5.2), but $Y,Z$ is nonlinear as given in Equation (5.1). Even if the F-test successfully models the relationship between $Y$ and $X$ , its power is almost unchanged as it fails to model the relationship between $Y$ and $Z$ . Similarly, the GLR semiparametric test is also showing poor power.

Finally, the power functions of these tests, when both the relationships between $Y,X$ and $Y,Z$ are linear, are plotted in Figure 2(d). Here, all assumptions of the F-test are satisfied, so it is the most powerful among all unbiased tests. It is interesting to notice that the power of all GLR tests is very competitive with that of the F-test. Therefore, even if both relationships are linear, we do not expect to lose a significant amount of power by conducting the GLR test. These simulation results show that it is better to use the GLR test unless we are confident of a suitable parametric model. If some assumptions of the parametric model are violated, the F-test may break down. On the other hand, the GLR test produces very high power in almost all situations.

5.3 The Goodness-of-Fit Test

In a similar setup, we have studied the general GLR test including the goodness-of-fit testing problem for the semiparametric model. We test the null hypothesis (4.1) that there exists a linear relationship between $Y$ and $X_{2}$ , but there is no effect of the other components, i.e., $m_{1}=m_{3}=m_{4}=m_{5}=0$ . We did not assume any parametric model for the covariates. In this simulation, we have taken $m_{2}(x_{2})=\frac{1}{2}x_{2}$ and the same $m_{1}$ as given in Equation (5.2). So, the null hypothesis is true when $\beta=0$ , and the power of the GLR test should increase as $\beta$ deviates from zero. The rest of the simulation setup is the same as in the previous simulations, including the functions for the covariates as given in Equation (5.1). Figure 3(a) shows that the observed and theoretical null distributions are close to each other. The power functions of different tests are given in Figure 3(b). All tests in this plot are semiparametric tests; however, for consistency with the previous simulation, we write ‘GLR Semi.’ when all covariates are assumed to be linear. ‘GLR Test’ refers to the main semiparametric test whose null distribution is derived from Theorem 1 without assuming that covariates are linearly modeled. Similarly, ‘GLR (ind.)’ is the approximation of this test obtained by assuming that all predictors are pairwise independent. The black dotted line in the plot is the approximate power function derived from Corollary 2, which assumes that all predictors are pairwise independent. The main GLR test maintains the nominal level of the test and gives good power when the null hypothesis is not true. GLR (ind.) shows slightly inflated power; however, it gives a good approximation of the original test. GLR Semi. and GLR Semi. (ind.) fail to produce any significant power as the linearity assumptions on the covariates are not satisfied.

As a whole, the simulation results in Section 5 support the theoretical results derived in the paper. These GLR tests are simple to calculate and produce good power. By virtue of the nonparametric method, the tests do not depend on the error distribution of the model. For the model utility test, the GLR test gives better power than the classical F-test when the parametric modeling is not appropriate. And even if the parametric model holds, the GLR test produces comparable power. The GLR test can be further simplified if we assume that the predictors are pairwise independent. In this case, the test is slightly anti-conservative, but overall the approximation is good unless some predictors are strongly correlated.

6 Real Data Example

In this section, we apply the GLR test to analyze the diamonds data-set used in Wickham (2009). This data-set contains the price (in 2008 US dollars) and other attributes of 53,940 diamonds. The attributes include the four C’s of diamond quality – cut, color, clarity and carat. There are three main physical measurements $x$ , $y$ and $z$ – the largest length, width, and height of a diamond, respectively. The data-set has two other physical measurements, depth and table, but we have not included them in this analysis as they are functions of $x$ , $y$ and $z$ . Carat is a unit of mass used for measuring gemstones and pearls. Cut is an objective measure of a diamond’s light performance, which we generally think of as sparkle. Cut, color and clarity are categorical variables, and the other variables are continuous. There are five categories of cut – Fair, Good, Very Good, Premium and Ideal; and the percentages of diamonds in these categories are 2.98, 9.10, 22.30, 25.57 and 39.95, respectively. Color has seven categories, D (best) to J (worst), with 12.56%, 18.16%, 17.69%, 20.93%, 15.40%, 10.05% and 5.21%, respectively. Clarity contains eight categories, I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1 and IF (best), with 1.37%, 17.04%, 24.22%, 22.73%, 15.15%, 9.39%, 6.77% and 3.32%, respectively.

As cut, color and clarity are categorical variables, we apply the GLR test for testing their effect on diamond price, whereas carat, $x$ , $y$ and $z$ serve as covariates. The sample size is huge, so a slight deviation from a null hypothesis may cause its rejection. For this reason, we have taken a random sub-sample of 1,000 observations from the data-set. At first, we test the null hypothesis (4.11) that there is no effect of cut, color and clarity. So, $H_{0}^{**}:m_{1}=m_{2}=m_{3}=0$ . The $p$ -value of the GLR test comes out to be zero. Then, we have conducted three different tests to check the main effect of cut, color and clarity, separately, by taking the others as covariates (for example, $H_{0}^{**}:m_{1}=0$ ). Even in these cases, the $p$ -values of the GLR tests are $6.45\times 10^{-5}$ for cut and zero for color and clarity. This is not surprising as there is a strong effect of cut, color and clarity on the price of diamonds.

All categorical variables, cut, color and clarity, maintain an order in their quality. So, in the next step, we have conducted some goodness-of-fit tests to study the effect of different levels of those variables. We are particularly interested in whether there is any specific pattern in the order of different levels of a category. The levels are ranked in increasing order of quality. In our first goodness-of-fit test, we have taken the order of cut as a single predictor ( $P=1$ ) and six other variables including the ranks of color and clarity as covariates ( $Q=6$ ). We tested the null hypothesis that the diamond price is linear in the rank of the levels of cut. The $p$ -value of the GLR test is 0.5476. This indicates that the diamond price may be modeled as a linear function of the rank of its cut quality. The box plots in Figure 4(a) show the partial effect of cut after eliminating the effects of other variables. The median line in each box is replaced by the corresponding mean, so it gives the mean effect of cut for that category. The red line in the plot is the least squares regression line of the price difference on the rank of the cut effect. The plot also shows that the ranks of cut maintain a proper linear relationship in modeling the diamond price.

In the second semiparametric test, we checked the linear effect of the rank of color on diamond price. The GLR test gave a $p$ -value of $2.02\times 10^{-8}$ , so we considered a quadratic model for the rank of color. The $p$ -value of the test came out to be 0.0857, so the quadratic effect of color is not rejected at the 5% level of significance (see Figure 4(b)). Similarly, the $p$ -values corresponding to the linear, quadratic and cubic models of the rank of clarity are $10^{-15}$ , 0.0412 and 0.4233, respectively. So, the effect of clarity on diamond price may be modeled as a cubic function of its rank (see Figure 4(c)).

Finally, we have constructed a semiparametric model where the ranks of cut, color and clarity were taken as linear, quadratic and cubic functions, respectively. Carat, $x$ , $y$ and $z$ serve as covariates for this test, and they are modeled nonparametrically. So, according to our notation, $P=3$ and $Q=4$ . The GLR test (4.8) for the goodness-of-fit of the model gave a $p$ -value of 0.1104 for this semiparametric model. Figure 4(d) displays the scatter plot of the original price of diamonds and their fitted price from the semiparametric model; the correlation between the two prices is 0.9632. The points in the plot cluster around the red line, which represents the perfect fit. These results indicate that if we build a semiparametric model for this data, the appropriate choices of the additive components for the ranks of cut, color and clarity are linear, quadratic and cubic functions, respectively.

The estimate of the overall mean price for the data-set is $\hat{\alpha}=3931.3760$ , and the standard deviation is $\hat{\sigma}=4149.5376$ . Let us write the additive model as $y=\hat{\alpha}+\hat{\sigma}\sum_{d=1}^{7}m_{d},$ where $m_{1},m_{2},\cdots,m_{7}$ are the additive effects of the rank of cut, color, clarity, carat, $x$ , $y$ and $z$ , respectively. Then, the parametric parts of the fitted semiparametric model are given by

[TABLE]

7 Discussion

In an additive model, a novel method is derived for testing the main effect of predictor variables that take discrete or categorical values. The main effect is adjusted for covariates, possibly containing continuous valued random variables. The predictors and covariates are modeled nonparametrically using an additive model, so the test avoids the loss of power due to model misspecification that often arises in classical parametric tests. This method is further extended to the semiparametric model, and a goodness-of-fit test is derived. The simulation results show that the GLR test may outperform the parametric test when the model for just one of the components fails; at the same time, it produces power comparable to that of the conventional tests if the assumed parametric model holds. The power of the GLR test is asymptotically optimal in terms of the rate of convergence, and it can detect a specific class of contiguous alternatives at a rate of $n^{-1/2}$ . In the case of categorical predictors, the GLR test generalizes the classical ANCOVA by modeling covariates nonparametrically and without assuming normality of the error term. So, it is a development of the basic statistical theory, and the methodology can be widely useful in practice.

Appendix A: Regularity Conditions

To derive the asymptotic distribution of the GLR test statistic, we need the following assumptions:

(C1) Suppose $c_{pj}=P(X_{p}=x_{pj})$ , then $c_{pj}\in(0,1)$ for all $j=1,2,\cdots,k_{p}$ and $p=1,2,\cdots,P$ , where $\sum_{j=1}^{k_{p}}c_{pj}=1$ .

(C2) The kernel function $K(z)$ is bounded and Lipschitz-continuous with a bounded support.

(C3) If $Z_{q}$ is continuous, then the density $f_{q}$ of $Z_{q}$ is Lipschitz-continuous and bounded away from 0 and has bounded support $\Omega_{q}$ for $q\in\{1,2,\cdots,Q\}$ .

(C4) If both $Z_{q}$ and $Z_{q^{\prime}}$ are continuous, then the joint density $f_{qq^{\prime}}$ of $Z_{q}$ and $Z_{q^{\prime}}$ is Lipschitz-continuous on its support $\Omega_{q}\times\Omega_{q^{\prime}}$ for $q\neq q^{\prime}\in\{1,2,\cdots,Q\}$ .

(C5) $nh_{q}/\log(n)\rightarrow\infty$ as $n\rightarrow\infty$ and $h_{q}\rightarrow 0$ for $q=1,2,\cdots,Q$ .

(C6) If $Z_{q}$ is continuous and $d_{q}$ is the degree of the polynomial used for smoothing $Z_{q}$ , then the $(d_{q}+1)$ -th derivative of $m_{P+q}$ , for $q\in\{1,2,\cdots,Q\}$ , exists and is bounded and continuous.

(C7) $\sigma^{2}=\mbox{Var}[\epsilon]=E[\epsilon^{2}]<\infty$ .

(C8) $E[m_{q}(Z_{q})|X_{p}=x_{pj}]=0$ for all $j=1,2,\cdots,k_{p}$ , $p=1,2,\cdots,P$ and $q=1,2,\cdots,Q$ .

Acknowledgements: The author very much appreciates Kuchibhotla Arun Kumar for carefully reading the paper including all proofs and providing helpful comments and suggestions.

Appendix B: Backfitting Estimators

Let us define $\boldsymbol{m}_{p}=(m_{p}(X_{p1}),m_{p}(X_{p2}),\cdots,m_{p}(X_{pn}))^{T}$ for $p\in\{1,2,\cdots,P\}$ , and $\boldsymbol{m}_{P+q}=(m_{P+q}(Z_{q1}),m_{P+q}(Z_{q2}),\cdots,m_{P+q}(Z_{qn}))^{T}$ for $q\in\{1,2,\cdots,Q\}$ . The additive components, $\boldsymbol{m}_{1},\boldsymbol{m}_{2},\cdots,\boldsymbol{m}_{P+Q}$ , are estimated using the backfitting estimators. The first step is to select a suitable smoothing matrix $\boldsymbol{S}_{d}$ for $d\in\{1,2,\cdots,P+Q\}$ , where $\widehat{\boldsymbol{m}}_{d}=\boldsymbol{S}_{d}{\boldsymbol{Y}}_{res}$ is the estimator of $\boldsymbol{m}_{d}$ , and ${\boldsymbol{Y}}_{res}$ is the residual of $\boldsymbol{Y}=(Y_{1},\cdots,Y_{n})^{T}$ given other additive components. This step is then repeated until convergence of all additive components (Hastie and Tibshirani, 1990). As ${\boldsymbol{X}}$ contains categorical or discrete valued random variables, the bin smoother at a point mass is appropriate. Suppose, for $j=1,2,\cdots,k_{p}$ , there are $n_{pj}$ observations at $X_{p}=x_{pj}$ , where $\sum_{j=1}^{k_{p}}n_{pj}=n$ , $p=1,2,\cdots,P$ . If the observations are sorted according to the values of $X_{p}$ , then the smoothing matrix for $m_{p}$ is given by

[TABLE]

where $\boldsymbol{J}_{n}$ is an $n\times n$ matrix with all elements 1, and $\boldsymbol{O}_{m,n}$ is an $m\times n$ matrix with all elements 0. It essentially means that $\boldsymbol{S}_{p}$ is constructed in such a way that $\widehat{\boldsymbol{m}}_{p}(x_{pj})$ is the partial mean of ${\boldsymbol{Y}}_{res}$ over the observations with $X_{p}=x_{pj}$ , for $p=1,2,\cdots,P$ and $j=1,2,\cdots,k_{p}$ .
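A small sketch may help to see what the bin smoother does. The code below builds the smoothing matrix for a discrete predictor without sorting the observations, so its rows and columns are a permutation of the block-diagonal form described above; this unsorted construction and the helper names are assumptions made for illustration. Applying the matrix to a response vector returns the level means, and the matrix is idempotent.

```python
def bin_smoother(x):
    """n x n matrix with entry 1/n_level when x[i] == x[j], and 0 otherwise."""
    counts = {}
    for xi in x:
        counts[xi] = counts.get(xi, 0) + 1
    return [[1.0 / counts[xi] if xi == xj else 0.0 for xj in x] for xi in x]


def mat_vec(a, v):
    """Matrix-vector product for nested lists."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def mat_mul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


x = [0, 0, 1, 1, 1]
y_res = [1.0, 3.0, 2.0, 4.0, 6.0]
S = bin_smoother(x)
print(mat_vec(S, y_res))  # level means: 2 for level 0 and 4 for level 1
SS = mat_mul(S, S)
print(max(abs(SS[i][j] - S[i][j]) for i in range(5) for j in range(5)))  # close to 0
```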

The covariate ${\boldsymbol{Z}}$ may contain any type of random variable – categorical, discrete or continuous. If some components of ${\boldsymbol{Z}}$ are categorical or discrete, then we use the bin smoother again. Otherwise, for continuous valued covariates, one may choose a smoother that uses local polynomials. For simplicity of notation, we assume that all covariates are continuous. In fact, the situation is even simpler for categorical or discrete covariates. Let $d_{q}$ be the degree of the polynomial used for smoothing of $Z_{q}$ for $q=1,2,\cdots,Q$ . Note that the Nadaraya-Watson estimate (Watson, 1964) is a trivial case of the polynomial smoothing where the degree of the polynomial is zero. Suppose $K(\cdot)$ is the kernel function, and denote $K_{h_{q}}(z)=h_{q}^{-1}K(\frac{z}{h_{q}})$ , where $h_{q}$ is the bandwidth parameter. Then, the smoothing matrix of $Z_{q}$ is given by

[TABLE]

where $\boldsymbol{K}_{z_{q}}={\rm{diag}}\{K_{h_{q}}(Z_{q1}-z_{q}),\cdots,K_{h_{q}}(Z_{qn}-z_{q})\}$ is a diagonal matrix containing the kernel weight, and

[TABLE]
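For the degree-zero case (the Nadaraya-Watson estimate), the smoother reduces to a kernel-weighted mean, which the following sketch illustrates together with the default bandwidth $h=1.06sn^{-1/5}$ quoted in Section 5.1. The Epanechnikov kernel is an assumption chosen here because it has bounded support as required by (C2); the paper does not state which kernel was used, and the function names are illustrative.

```python
import statistics


def rule_of_thumb_bandwidth(values):
    """h = 1.06 * s * n^(-1/5), with s the sample standard deviation."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-0.2)


def epanechnikov(u):
    """A bounded-support kernel; this choice of kernel is an assumption."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def nadaraya_watson(z0, z, y, h, kernel=epanechnikov):
    """Kernel-weighted mean of y at z0, i.e. a local polynomial fit of degree zero."""
    weights = [kernel((zi - z0) / h) / h for zi in z]  # K_h(Z_i - z0)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within one bandwidth of z0")
    return sum(w * yi for w, yi in zip(weights, y)) / total


z = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1]
h = rule_of_thumb_bandwidth(z)
print(h, nadaraya_watson(2.0, z, y, h))
```

Evaluating `nadaraya_watson` at each observed $Z_{qi}$ gives the rows of the corresponding smoothing matrix applied to the response.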

Then, the normal equations for the backfitting estimators (Buja et al, 1989, Opsomer and Ruppert, 1998) are given by

[TABLE]

where $\boldsymbol{S}_{d}^{*}=(\boldsymbol{I}_{n}-\boldsymbol{1}_{n}\boldsymbol{1}_{n}^{T}/n)\boldsymbol{S}_{d}$ is the centered smoothing matrix for $d=1,2,\cdots,P+Q$ , $\boldsymbol{Y}^{*}=\boldsymbol{Y}-\bar{Y}\boldsymbol{1}_{n}$ and $\boldsymbol{1}_{n}$ is the $n$ -dimensional vector of elements 1. The solution to the normal equations has the form

[TABLE]

provided the inverse exists. Here, $\boldsymbol{M}$ and $\boldsymbol{C}$ are the associated matrices. So, the backfitting estimator of $\boldsymbol{m}_{d}$ is given by

[TABLE]

where $\boldsymbol{W}_{d}=\boldsymbol{E}_{d}\boldsymbol{M}^{-1}\boldsymbol{C}$ , and $\boldsymbol{E}_{d}$ is a block matrix of dimension $n\times n(P+Q)$ with $n\times n$ identity matrix in the $d$ -th block and zero elsewhere.
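The closed-form solution above is usually not computed directly; the iterative scheme described at the start of this appendix is used instead. The sketch below is a minimal version of that iteration under a simplifying assumption: every component is discrete, so each smoothing step is a centered bin smoother (level means of the partial residuals). The function names and the toy data are illustrative.

```python
def level_means(values, labels):
    """Bin smoother: replace each value by the mean over its level."""
    sums, counts = {}, {}
    for v, lab in zip(values, labels):
        sums[lab] = sums.get(lab, 0.0) + v
        counts[lab] = counts.get(lab, 0) + 1
    return [sums[lab] / counts[lab] for lab in labels]


def center(values):
    """Subtract the mean so that the fitted component averages to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def backfit(y, predictors, n_iter=100, tol=1e-10):
    """Cycle over the components, smoothing the partial residuals each time."""
    n = len(y)
    alpha = sum(y) / n
    fits = [[0.0] * n for _ in predictors]
    for _ in range(n_iter):
        biggest_change = 0.0
        for d, labels in enumerate(predictors):
            others = [sum(f[i] for k, f in enumerate(fits) if k != d) for i in range(n)]
            partial_res = [y[i] - alpha - others[i] for i in range(n)]
            new_fit = center(level_means(partial_res, labels))
            change = max(abs(a - b) for a, b in zip(new_fit, fits[d]))
            biggest_change = max(biggest_change, change)
            fits[d] = new_fit
        if biggest_change < tol:
            break
    return alpha, fits


# Toy balanced design with an exactly additive response.
alpha, fits = backfit([-2.0, 2.0, 0.0, 4.0], [[0, 0, 1, 1], [0, 1, 0, 1]])
print(alpha, fits)
```

For continuous covariates the call to `level_means` would be replaced by a kernel or local polynomial smoother, with the rest of the loop unchanged.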

Let us denote $\boldsymbol{W}=\sum_{d=1}^{P+Q}\boldsymbol{W}_{d}$ . Suppose $\boldsymbol{W}^{[-d]}$ is the smoother matrix for the additive model after dropping the term containing $\boldsymbol{m}_{d},\ d=1,2,\cdots,P+Q$ . Then, the following lemma from Opsomer (2000) ensures the existence and uniqueness of the backfitting estimators of the additive model.

Lemma 1.

If $||\boldsymbol{S}_{d}^{*}\boldsymbol{W}^{[-d]}||<1$ for some $d\in\{1,2,\cdots,P+Q\}$ , where $||\cdot||$ denotes any matrix norm, then the backfitting estimators exist and are unique, and

[TABLE]

Appendix C: Proofs

Lemma 2.

Assume that conditions (C2)–(C5) hold. Then

[TABLE]

for all $d=1,2,\cdots,P+Q$ . The term $o\left(\frac{\boldsymbol{1}_{n}\boldsymbol{1}_{n}^{T}}{n}\right)$ means that each element is of order $o\left(\frac{1}{n}\right)$ .

Proof.

This property is proved in Opsomer and Ruppert (1997) using a local polynomial fitting where the smoothing matrix is as defined in (A.2). This is also true for the point mass bin smoother as

[TABLE]

Note that for $d=1,2,\cdots,P$ the relationship is exact, and we do not need any assumption for this. ∎

Lemma 3.

If the predictors and covariates are pairwise independent then, under conditions (C1)–(C6), we have

[TABLE]

for all $d\neq d^{\prime}\in\{1,2,\cdots,P+Q\}$ .

Proof.

Using equation (A.9) for $p$ and $p^{\prime}\in\{1,2,\cdots,P\}$ , we get

[TABLE]

Note that $\boldsymbol{S}_{p}\boldsymbol{S}_{p}=\boldsymbol{S}_{p}$ for $p=1,2,\cdots,P$ . For $p\neq p^{\prime}\in\{1,2,\cdots,P\}$ , we define $\boldsymbol{U}=\boldsymbol{S}_{p}\boldsymbol{S}_{p^{\prime}}$ . Here $\boldsymbol{U}$ is a block matrix in which every element of the $rs$ -th block is equal to

[TABLE]

where $r=1,2,\cdots,k_{p}$ and $s=1,2,\cdots,k_{p^{\prime}}$ . Using the strong law of large numbers (SLLN) and assumption (C1), we get

[TABLE]

Combining equations (A.11) and (A.13) the $ij$ -th element of $\boldsymbol{S}_{p}^{*}\boldsymbol{S}_{p^{\prime}}^{*}$ becomes

[TABLE]

So, the lemma is proved for $p\neq p^{\prime}\in\{1,2,\cdots,P\}$ . For $d\neq d^{\prime}\in\{P+1,P+2,\cdots,P+Q\}$ Opsomer and Ruppert (1997) have shown that under conditions (C2)–(C6)

[TABLE]

where $f_{dd^{\prime}}(\cdot)$ is the joint distribution of $Z_{d}$ and $Z_{d^{\prime}}$ , whereas $f_{d}(\cdot)$ and $f_{d^{\prime}}(\cdot)$ are their marginal distributions. So, the lemma is true for $d\neq d^{\prime}\in\{P+1,P+2,\cdots,P+Q\}$ . Now, using condition (C8) and applying the same technique, we can prove this result when $d=1,2,\cdots,P$ and $d^{\prime}=P+1,P+2,\cdots,P+Q$ , or vice versa. ∎
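For two bin smoothers the statement of Lemma 3 can be made concrete. Since $\boldsymbol{S}_{p}\boldsymbol{1}_{n}=\boldsymbol{1}_{n}$ and $\boldsymbol{S}_{p}$ is symmetric, $\boldsymbol{S}_{p}^{*}\boldsymbol{S}_{p^{\prime}}^{*}=\boldsymbol{S}_{p}\boldsymbol{S}_{p^{\prime}}-\boldsymbol{1}_{n}\boldsymbol{1}_{n}^{T}/n$ , so $n$ times an element of the product equals $n\,n_{rs}/(n_{r}n_{s})-1$ , where $n_{rs}$ is the number of observations in cell $(r,s)$ and $n_{r}$ , $n_{s}$ are the marginal counts. The sketch below checks this identity on a small design and evaluates the deviation for a large sample of independent predictors; the function names and the simulated data are assumptions of the example.

```python
import numpy as np

def bin_smoother(x):
    x = np.asarray(x)
    same_level = (x[:, None] == x[None, :]).astype(float)
    return same_level / same_level.sum(axis=1, keepdims=True)

def centered(S):
    n = S.shape[0]
    return (np.eye(n) - np.ones((n, n)) / n) @ S

def scaled_cell_deviation(x1, x2):
    """max over cells of |n * n_rs / (n_r * n_s) - 1| (hypothetical helper)."""
    n = x1.size
    table = np.zeros((x1.max() + 1, x2.max() + 1))
    np.add.at(table, (x1, x2), 1.0)                     # contingency table n_rs
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(np.max(np.abs(table / expected - 1.0)))

# Small deterministic design: compare the matrix product with the cell formula.
idx = np.arange(30)
a1, a2 = idx % 3, (idx // 7) % 2
product = centered(bin_smoother(a1)) @ centered(bin_smoother(a2))
dev_matrix = float(np.max(np.abs(30 * product)))
dev_cells = scaled_cell_deviation(a1, a2)

# Large sample of independent predictors: the scaled elements are close to zero.
rng = np.random.default_rng(0)
b1 = rng.integers(0, 3, size=100_000)
b2 = rng.integers(0, 2, size=100_000)
dev_large = scaled_cell_deviation(b1, b2)
```

The cell formula avoids forming $n\times n$ matrices, which is why the large-sample check uses only the contingency table.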

Lemma 4.

Let us denote $\boldsymbol{W}=\sum_{d=1}^{P+Q}\boldsymbol{W}_{d}$ , where $\boldsymbol{W}_{d}$ is given in equation (A.6). Then, under conditions (C1)–(C6)

[TABLE]

where $\boldsymbol{S}^{*}=\sum_{d=1}^{P+Q}\boldsymbol{S}_{d}^{*}$ .

Proof.

For $P+Q=2$ , we get

[TABLE]

From equation (A.7), we have

[TABLE]

Using Lemma 3, we have the following approximation

[TABLE]

This approximation is exact when the corresponding predictors or covariates are pairwise independent. Combining equations (A.18) and (A.19), we get

[TABLE]

Similarly $\boldsymbol{W}_{2}\approx\boldsymbol{S}_{2}^{*}+o\left(\frac{\boldsymbol{1}_{n}\boldsymbol{1}_{n}^{T}}{n}\right)\mbox{ a.s}.$ Now, for all values of $P$ and $Q$ , we prove by recursion that

[TABLE]

Therefore, by taking summation over $d=1,2,\cdots,P+Q$ the lemma is proved. ∎

Lemma 5.

Suppose conditions (C1)–(C4) and (C8) are satisfied. Then, under $H_{0}^{**}$ ,

[TABLE]

for all $p=1,2,\cdots,P$ , where $\boldsymbol{m}=\sum_{p=1}^{P+Q}\boldsymbol{m}_{p}$ .

Proof.

Note that

[TABLE]

For $p=1,2,\cdots,P$ and $q=1,2,\cdots,Q$ we define $\boldsymbol{u}=\boldsymbol{S}_{p}\boldsymbol{m}_{P+q}=(u_{1}\boldsymbol{1}_{n_{p1}}^{T},u_{2}\boldsymbol{1}_{n_{p2}}^{T},\cdots,u_{k_{p}}\boldsymbol{1}_{n_{pk_{p}}}^{T})^{T}$ . Then

[TABLE]

Using the strong law of large numbers (SLLN) and assumption (C8), we get

[TABLE]

So

[TABLE]

Hence, under $H_{0}^{**}$ , for all $p=1,2,\cdots,P$

[TABLE]

Again using SLLN we get

[TABLE]

for all $q=1,2,\cdots,Q$ . Similarly, $\boldsymbol{1}_{n}^{T}\boldsymbol{m}_{p}/n=0$ a.s. for all $p=1,2,\cdots,P$ . Hence using Lemma 2 the lemma is proved from equation (A.27). ∎

Lemma 6.

Denote $\boldsymbol{A}_{2n}=(\boldsymbol{W}-\boldsymbol{I}_{n})^{T}(\boldsymbol{W}-\boldsymbol{I}_{n})$ , then under conditions (C1)–(C6)

[TABLE]

where $\boldsymbol{S}^{*}=\sum_{d=1}^{P+Q}\boldsymbol{S}^{*}_{d}$ .

Proof.

Using Lemma 4, we get

[TABLE]

∎

Corollary 4.

Denote $\boldsymbol{A}_{1n}=(\boldsymbol{W}_{[Z]}-\boldsymbol{I}_{n})^{T}(\boldsymbol{W}_{[Z]}-\boldsymbol{I}_{n})$ , where $\boldsymbol{W}_{[Z]}$ is the smoother matrix for the additive model after dropping all $P$ predictors. Using an argument similar to that in the proof of Lemma 6, we find

[TABLE]

where $\boldsymbol{S}_{[Z]}^{*}=\sum_{d=1}^{Q}\boldsymbol{S}_{P+d}^{*}$ . So

[TABLE]

where $\boldsymbol{S}_{[X]}^{*}=\sum_{d=1}^{P}\boldsymbol{S}_{d}^{*}$ . Now $\boldsymbol{S}_{d}^{*}\boldsymbol{S}_{d}^{*}=\boldsymbol{S}_{d}^{*}$ and $\boldsymbol{S}_{d}^{*T}=\boldsymbol{S}_{d}^{*}$ for $d=1,2,\cdots,P$ . From the technique used in Lemma 3, it can be shown that $\boldsymbol{S}_{[X]}^{*}\boldsymbol{S}_{[Z]}^{*T}$ is a symmetric matrix. So, using Lemma 3, equation (A.32) reduces to

[TABLE]

Now

[TABLE]

Let us define $\boldsymbol{\epsilon}=(\epsilon_{1},\cdots,\epsilon_{n})^{T}$ . Then

[TABLE]

Using Lemma 5, under $H_{0}^{**}$ , we have

[TABLE]

Moreover, using (A.26) it is easy to show that, under $H_{0}^{**}$ , $\boldsymbol{\epsilon}^{T}\boldsymbol{S}_{d}^{*}\boldsymbol{m}=o_{p}(1)$ for all $d=1,2,\cdots,P$ . Hence, from (A.35), we get

[TABLE]

Now, for $d=1,2,\cdots,P$ , using the definition in (A.1), we have

[TABLE]

where $\boldsymbol{e}_{dj}$ is a vector with $n_{dj}$ elements equal to one and the rest equal to zero: the $i$ -th element of $\boldsymbol{e}_{dj}$ is one if $X_{di}=x_{dj}$ . Note that in this definition, we did not sort $X_{d}$ according to their observed values. As $E[\epsilon_{i}^{2}]<\infty$ under condition (C7), using the central limit theorem (CLT), we get

[TABLE]

As $\boldsymbol{e}_{dj}^{T}\boldsymbol{\epsilon}$ and $\boldsymbol{e}_{dj^{\prime}}^{T}\boldsymbol{\epsilon}$ are independent for all $j\neq j^{\prime}\in\{1,2,\cdots,k_{d}\}$ , the components of $\boldsymbol{U}_{d}=(U_{d1},U_{d2},\cdots,U_{dk_{d}})^{T}$ are i.i.d. standard normal variables. Therefore, from (A.38), we have $\frac{1}{\sigma^{2}}\boldsymbol{\epsilon}^{T}\boldsymbol{S}_{d}\boldsymbol{\epsilon}\overset{a}{\sim}\chi^{2}(k_{d}).$ Let us define

[TABLE]

Then $\bar{U}=c_{d}^{T}\boldsymbol{U}_{d}$ a.s., where $c_{d}=(\sqrt{c_{d1}},\sqrt{c_{d2}},\cdots,\sqrt{c_{dk_{d}}})^{T}$ . Note that $\bar{U}^{2}=\frac{1}{n}\boldsymbol{\epsilon}^{T}\boldsymbol{\epsilon}$ . Hence

[TABLE]

because $(\boldsymbol{I}_{k_{d}}-c_{d}c_{d}^{T})$ is an idempotent matrix of rank $(k_{d}-1)$ . If all predictors are pairwise independent, then from Lemma 3, we get

[TABLE]

for $d\neq d^{\prime}\in\{1,2,\cdots,P\}$ . So, $\boldsymbol{\epsilon}^{T}\boldsymbol{S}_{d}^{*}\boldsymbol{\epsilon}$ and $\boldsymbol{\epsilon}^{T}\boldsymbol{S}_{d^{\prime}}^{*}\boldsymbol{\epsilon}$ are asymptotically independent for all $d\neq d^{\prime}$ (see p. 84 of Bapat, 2012). Therefore, if the predictors are pairwise independent, then under $H_{0}^{**}$ , equation (A.37) gives

[TABLE]
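The degrees of freedom in this step can be illustrated numerically. The sketch below assumes $c_{dj}=n_{dj}/n$ and normal errors, so that $\boldsymbol{\epsilon}^{T}\boldsymbol{S}_{d}^{*}\boldsymbol{\epsilon}/\sigma^{2}$ is exactly $\chi^{2}(k_{d}-1)$ ; the design, the value of $\sigma$ and the number of replications are toy choices.

```python
import numpy as np

def bin_smoother(x):
    x = np.asarray(x)
    same_level = (x[:, None] == x[None, :]).astype(float)
    return same_level / same_level.sum(axis=1, keepdims=True)

n, k, sigma = 60, 4, 2.0
x = np.arange(n) % k                                   # k levels, each observed n / k times
S_star = (np.eye(n) - np.ones((n, n)) / n) @ bin_smoother(x)

# Assumed weights c_j = n_j / n; I - c c^T is idempotent with rank k - 1.
counts = np.bincount(x, minlength=k)
c = np.sqrt(counts / n)
P_c = np.eye(k) - np.outer(c, c)

# Monte Carlo draws of the quadratic form eps^T S* eps / sigma^2.
rng = np.random.default_rng(1)
eps = sigma * rng.normal(size=(20_000, n))
quad = ((eps @ S_star) * eps).sum(axis=1) / sigma**2
mean_quad = float(quad.mean())
```

Here `mean_quad` should be close to $k_{d}-1=3$ , the mean of a $\chi^{2}(3)$ variable, and the trace of `S_star` equals $k_{d}-1$ exactly.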

However, in general, the above distribution comes out to be a sum of $P$ dependent chi-square variables as

[TABLE]

Suppose $d\neq d^{\prime}\in\{1,2,\cdots,P\}$ , $j=1,2,\cdots,k_{d}$ and $j^{\prime}=1,2,\cdots,k_{d^{\prime}}$ , then the correlation between $U_{dj}$ and $U_{d^{\prime}j^{\prime}}$ is given by

[TABLE]

The last expression is derived using the same techniques as used in equation (A.13). Note that combining equations (A.44) and (A.45), we obtain result (A.43) if the predictor variables are independent.

Now, it is easy to show that

[TABLE]

Therefore, using Slutsky’s theorem, $\sigma^{2}$ in (A.44) may be replaced by $\frac{1}{n}RSS_{1}^{*}$ , and therefore

[TABLE]

Define $\boldsymbol{U}=(\boldsymbol{U}_{1}^{T},\boldsymbol{U}_{2}^{T},\cdots,\boldsymbol{U}_{P}^{T})^{T}$ , and $\boldsymbol{U}^{*}=\boldsymbol{\Sigma}_{1}\boldsymbol{U}$ , where $\boldsymbol{\Sigma}_{1}$ is defined in Section 4.4. As $\boldsymbol{\Sigma}_{1}$ is an idempotent matrix, $\lambda_{n}(H_{0}^{**})$ in equation (A.47) is written as

[TABLE]

Now, the covariance matrix of $\boldsymbol{U}$ is $\boldsymbol{\Sigma}_{2}$ (defined in Section 4.4), which is a block matrix whose $p$ -th diagonal block is an identity matrix of order $k_{p}$ , and whose $pp^{\prime}$ -th off-diagonal block has its $ij$ -th element given in (A.45). So, the covariance matrix of $\boldsymbol{U}^{*}$ is $\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}$ . Let $\lambda_{1},\lambda_{2},\cdots,\lambda_{s}$ be the non-zero eigenvalues of $\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}$ , where $s$ is the rank of $\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}$ . Suppose $V=(V_{1},V_{2},\cdots,V_{s})^{T}$ is a vector of i.i.d. standard normal variables, and $\Lambda={\rm diag}(\lambda_{1},\lambda_{2},\cdots,\lambda_{s})$ . Then, from (A.48), the theorem is proved as

[TABLE]

∎
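A limiting distribution of the form $\sum_{i=1}^{s}\lambda_{i}V_{i}^{2}$ has no simple closed form, but its tail probability is easy to approximate once the eigenvalues are available. The sketch below uses plain Monte Carlo as a stand-in; it is not Davies' (1980) algorithm, and the function names and the toy matrices standing in for $\boldsymbol{\Sigma}_{1}$ and $\boldsymbol{\Sigma}_{2}$ are assumptions.

```python
import numpy as np

def nonzero_eigenvalues(sym_matrix, tol=1e-10):
    """Non-zero eigenvalues of a symmetric matrix such as Sigma1 Sigma2 Sigma1."""
    eig = np.linalg.eigvalsh(sym_matrix)
    return eig[np.abs(eig) > tol]

def weighted_chisq_pvalue(stat, lambdas, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(sum_i lambda_i V_i^2 > stat), V_i i.i.d. N(0, 1)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_draws, len(lambdas)))
    draws = (v ** 2) @ np.asarray(lambdas, dtype=float)
    return float(np.mean(draws > stat))

# Toy input: a rank-2 projection with identity covariance gives two unit weights,
# so the limit is an ordinary chi-square with 2 degrees of freedom.
proj = np.diag([1.0, 1.0, 0.0])
lambdas = nonzero_eigenvalues(proj @ np.eye(3) @ proj)
p_value = weighted_chisq_pvalue(3.0, lambdas)
```

In the toy case the exact tail probability is $\exp(-3/2)\approx 0.223$ , which gives a quick check on the simulation error.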

Theorem 3.

Let us consider the notations and assumptions of Corollary 4. Then, under $H_{1}$ , the asymptotic distribution of the GLR test statistic coincides with $\delta^{2}+\sum_{i=1}^{s}\lambda_{i}V_{i}^{2}$ , where $\delta^{2}=\sum_{r,s=1}^{P}E(m_{r}^{*}m_{s}^{*})$ .

Proof.

If $H_{0}^{**}$ is not true, then from (A.35), we get

[TABLE]

Now, the result in equation (A.44) reduces to

[TABLE]

From the proof of Lemma 5, we have

[TABLE]

Using $\boldsymbol{S}_{p}\boldsymbol{m}_{p}=\boldsymbol{m}_{p}$ , we get $\boldsymbol{m}_{r}^{T}\boldsymbol{S}_{p}\boldsymbol{m}_{p}=\boldsymbol{m}_{r}^{T}\boldsymbol{m}_{p}$ for all $p,r=1,2,\cdots,P$ . Hence

[TABLE]

Suppose $u_{ij}$ is the $(i,j)$ -th element of $\boldsymbol{m}_{r}^{T}\boldsymbol{S}_{p}\boldsymbol{m}_{s}$ for some $p,r,s=1,2,\cdots,P$ . Then

[TABLE]

Note that

[TABLE]

So, using condition (C1), we get from equation (A.54)

[TABLE]

Hence, from equation (A.52), we get

[TABLE]

Therefore

[TABLE]

As $E(m_{p})=0$ for all $p=1,2,\cdots,P$ , we get

[TABLE]

Combining (A.46), (A.50), (A.51) and (A.59) the theorem is proved. ∎

Corollary 3.

The residual sum of squares under $H_{0}^{*}$ can be written as

[TABLE]

where

[TABLE]

Using Lemma 4 it can be shown that

[TABLE]

Hence

[TABLE]

Therefore, equation (A.61) reduces to

[TABLE]

As ${\boldsymbol{X}}^{*}\left({\boldsymbol{X}}^{*T}{\boldsymbol{X}}^{*}\right)^{-1}{\boldsymbol{X}}^{*T}$ is an idempotent matrix, using (A.62) and (A.64) we get from equation (A.60)

[TABLE]

Suppose ${\boldsymbol{\theta}}_{0}$ is the true value of ${\boldsymbol{\theta}}$ under $H_{0}^{*}$ . So, under $H_{0}^{*}$ , the model can be written as

[TABLE]

where $\boldsymbol{m}_{[Z]}=\sum_{q=1}^{Q}\boldsymbol{m}_{P+q}(\cdot)$ . Opsomer and Ruppert (1999) have shown that ${\boldsymbol{\widetilde{\boldsymbol{\theta}}}}$ is a consistent estimator of ${\boldsymbol{\theta}}_{0}$ . Hence, under $H_{0}^{*}$ , from equation (A.62) we get

[TABLE]

Using a technique similar to that of equation (A.26), it can be shown that

[TABLE]

So, combining (A.67) and (A.68) we get

[TABLE]

Hence, equation (A.65) simplifies to

[TABLE]

Now, proceeding in the same way as in the proof of Corollary 4, we get

[TABLE]

Using equations (A.23) and (A.69) we get

[TABLE]

Combining equations (A.34) and (A.35) we get

[TABLE]

Using the CLT it is easy to establish that $\boldsymbol{m}_{[X]}^{T}\boldsymbol{S}_{[X]}^{*}\boldsymbol{\epsilon}=o_{p}(1)$ . From equation (A.58), we have $\boldsymbol{m}_{[X]}^{T}\boldsymbol{S}_{[X]}^{*}\boldsymbol{m}_{[X]}=\boldsymbol{m}_{[X]}^{T}\boldsymbol{m}_{[X]}+o_{p}(1)$ . Then, equation (A.73) turns out to be

[TABLE]

Note that

[TABLE]

Using condition (C8) and equation (A.69) it can be shown that the second and the fourth terms in equation (A.75) tend to zero in probability, and by the CLT the third and the fifth terms are asymptotically zero. Therefore

[TABLE]

As $\frac{1}{n}\boldsymbol{m}_{[X]}^{T}\boldsymbol{1}_{n}\overset{\mbox{a.s.}}{=}E(\boldsymbol{m}_{[X]})=0$ , we get from the above equation

[TABLE]

Combining (A.74) and the preceding equation, we get from (A.72)

[TABLE]

where $\boldsymbol{R}_{p,1}={}_{0}^{r_{p}}{\boldsymbol{X}}_{(p)}$ and ${}_{a}^{b}{\boldsymbol{X}}_{(p)}$ is defined in (3.3). It can be shown that

[TABLE]

where $\boldsymbol{R}_{p,2}={}_{0}^{k_{p}-1}{\boldsymbol{X}}_{(p)}$ . So $\boldsymbol{S}_{p}$ may be regarded as the hat matrix of a classical regression that fits a polynomial of degree $(k_{p}-1)$ . Equation (A.79) shows that the columns of the matrix $\boldsymbol{S}_{p}$ form an orthogonal basis for the column space of $\boldsymbol{R}_{p,2}$ . Similarly, the columns of $\boldsymbol{R}_{p,1}\left(\boldsymbol{R}_{p,1}^{T}\boldsymbol{R}_{p,1}\right)^{-1}\boldsymbol{R}_{p,1}^{T}$ form an orthogonal basis for the column space of $\boldsymbol{R}_{p,1}$ . Using some matrix calculations it can be shown that

[TABLE]

where $\boldsymbol{R}_{p}={}_{r_{p}+1}^{k_{p}-1}{\boldsymbol{X}}_{(p)}$ . Now $\boldsymbol{R}_{p}\left(\boldsymbol{R}_{p}^{T}\boldsymbol{R}_{p}\right)^{-1}\boldsymbol{R}_{p}^{T}$ is an idempotent matrix with rank $(k_{p}-r_{p}-1)$ . Hence

[TABLE]

where

[TABLE]

So $(k_{p}-r_{p}-1)$ components of $\boldsymbol{U}_{p}$ are i.i.d. standard normal variables. From equation (A.78), we get

[TABLE]

where

[TABLE]

The rest of the proof follows by the same technique as in the proof of Corollary 4. ∎
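The hat-matrix interpretation used in this proof can be checked on a small example. The sketch assumes that ${}_{a}^{b}{\boldsymbol{X}}_{(p)}$ denotes the matrix whose columns are the powers $a,a+1,\cdots,b$ of the observed values of $X_{p}$ ; this reading of (3.3), the helper names and the toy data are assumptions.

```python
import numpy as np

def bin_smoother(x):
    x = np.asarray(x)
    same_level = (x[:, None] == x[None, :]).astype(float)
    return same_level / same_level.sum(axis=1, keepdims=True)

def poly_hat(x, low, high):
    """Hat matrix for regression on the powers x^low, ..., x^high (hypothetical helper)."""
    R = np.column_stack([x.astype(float) ** a for a in range(low, high + 1)])
    return R @ np.linalg.solve(R.T @ R, R.T)

x = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 3, 3])   # k = 4 distinct values
k, r = 4, 1                                           # null model: polynomial of degree r

S = bin_smoother(x)
H_full = poly_hat(x, 0, k - 1)    # powers 0, ..., k - 1
H_null = poly_hat(x, 0, r)        # powers 0, ..., r
diff = H_full - H_null
diff_rank = np.linalg.matrix_rank(diff, tol=1e-8)
```

In this example `S` coincides with `H_full`, and `diff` is idempotent with rank $k_{p}-r_{p}-1=2$ , matching the count of standard normal components above.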

Theorem 1.

In this case, we can show that

[TABLE]

where $\boldsymbol{R}_{p,1}={}_{0}^{r_{p}}{\boldsymbol{X}}_{(p)}$ . Hence, the proof of the theorem follows from Corollaries 3 and 4. ∎

Theorem 2.

Combining steps of Theorems 1 and 3, we get the proof of the current theorem. ∎

Bibliography

1. Bapat RB (2012) Linear algebra and linear models. Springer Science & Business Media
2. Buja A, Hastie T, Tibshirani R (1989) Linear smoothers and additive models. Ann Statist 17(2):453–555
3. Davies RB (1980) The distribution of a linear combination of $\chi^{2}$ random variables. Algorithm AS 155. Appl Statist 29:323–333
4. Fan J, Jiang J (2005) Nonparametric inferences for additive models. J Amer Statist Assoc 100(471):890–907
5. Fan J, Zhang C, Zhang J (2001) Generalized likelihood ratio statistics and Wilks phenomenon. Ann Statist 29(1):153–193
6. Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Amer Statist Assoc 76(376):817–823
7. Hall P, Marron JS (1988) Variable window width kernel estimates of probability densities. Probab Theory Related Fields 80(1):37–49
8. Hastie T, Tibshirani R (2000) Bayesian backfitting (with discussion). Statist Sci 15(3):196–223, with comments and a rejoinder by the authors
