Rediscovering a little known fact about the t-test and the F-test:   Algebraic, Geometric, Distributional and Graphical Considerations

Jennifer A. Sinnott; Steven N. MacEachern; Mario Peruggia

arXiv:1907.08703·stat.OT·July 14, 2022


TL;DR

This paper explores the foundational principles of the t-test and F-test, revealing their underlying algebraic and geometric structures, and discusses implications for hypothesis testing and residual diagnostics.

Contribution

It demonstrates how simple algebraic manipulations align the t-test with the recommended null hypothesis approach and extends these insights to Gaussian linear models.

Findings

1. Algebraic manipulations reveal equivalence of t-test procedures

2. Geometric intuition clarifies test statistic interpretation

3. Application impacts residual diagnostics in practice

Abstract

We discuss the role that the null hypothesis should play in the construction of a test statistic used to make a decision about that hypothesis. To construct the test statistic for a point null hypothesis about a binomial proportion, a common recommendation is to act as if the null hypothesis is true. We argue that, on the surface, the one-sample t-test of a point null hypothesis about a Gaussian population mean does not appear to follow the recommendation. We show how simple algebraic manipulations of the usual t-statistic lead to an equivalent test procedure consistent with the recommendation. We provide geometric intuition regarding this equivalence and we consider extensions to testing nested hypotheses in Gaussian linear models. We discuss an application to graphical residual diagnostics where the form of the test statistic makes a practical difference. By examining the formulation…

Equations

$$\widehat{p}\pm z_{\alpha/2}\sqrt{\widehat{p}(1-\widehat{p})/n}.$$

$$\frac{\widehat{p}-p_{0}}{\sqrt{p_{0}(1-p_{0})/n}}.$$

$$\bar{Y}\pm t_{n-1,\alpha/2}\,S/\sqrt{n}$$

$$S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}$$

$$S_{0}^{2}:=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\mu_{0})^{2}.$$

$$T_{0}:=\frac{\bar{Y}-\mu_{0}}{S_{0}/\sqrt{n}}.$$

$$T=\frac{\bar{Y}-\mu_{0}}{S/\sqrt{n}}.$$

$$T=\frac{\sqrt{n-1}\,T_{0}}{\sqrt{n-T_{0}^{2}}},$$

$$R_{T_{0}}=\{{\bf y}=(y_{1},\ldots,y_{n})^{\sf T}:|T_{0}({\bf y})|\geq c_{\alpha}\}.$$

$$R_{T}=\left\{{\bf y}=(y_{1},\ldots,y_{n})^{\sf T}:|T({\bf y})|\geq\frac{\sqrt{n-1}\,c_{\alpha}}{\sqrt{n-c_{\alpha}^{2}}}\right\}$$

$$\lambda({\bf Y})=\frac{\sup_{\sigma^{2}}L(\mu_{0},\sigma^{2}\mid{\bf Y})}{\sup_{\mu,\sigma^{2}}L(\mu,\sigma^{2}\mid{\bf Y})}$$

$$R=\frac{\sum_{j=1}^{n}(Y_{j}-\mu_{0})^{2}}{\sum_{j=1}^{n}(Y_{j}-\bar{Y})^{2}},$$

$$R=\frac{\sum_{j=1}^{n}(Y_{j}-\bar{Y})^{2}+n(\bar{Y}-\mu_{0})^{2}}{\sum_{j=1}^{n}(Y_{j}-\bar{Y})^{2}}=1+\frac{T^{2}}{n-1}$$

$$R=\frac{\sum_{j=1}^{n}(Y_{j}-\mu_{0})^{2}}{\sum_{j=1}^{n}(Y_{j}-\mu_{0})^{2}-n(\bar{Y}-\mu_{0})^{2}}=\frac{1}{1-T_{0}^{2}/n}.$$

$$\|{\bf v}\|^{2}=\|{\bf u}\|^{2}+\|{\bf v}-{\bf u}\|^{2},\quad\text{i.e.,}\quad\sum_{i=1}^{n}(Y_{i}-\mu_{0})^{2}=n(\bar{Y}-\mu_{0})^{2}+\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2},\quad\text{i.e.,}\quad\text{SSTO}=\text{SST}+\text{SSE},$$

$$T_{0}^{2}=n\,\frac{\text{SST}}{\text{SSTO}}=n\cos^{2}\theta\quad\text{and}\quad T^{2}=(n-1)\,\frac{\text{SST}}{\text{SSE}}=(n-1)\cot^{2}\theta.$$

$$T^{2}=(n-1)\cot^{2}\theta=(n-1)\,\frac{\cos^{2}\theta}{\sin^{2}\theta}=(n-1)\,\frac{\cos^{2}\theta}{1-\cos^{2}\theta}.$$

$${\bf Y}={\bf X}\boldsymbol{\beta}+\boldsymbol{\epsilon},$$

$${\bf Y}={\bf X}_{1}\boldsymbol{\beta}_{1}+{\bf X}_{2}\boldsymbol{\beta}_{2}+\boldsymbol{\epsilon},$$

$$H_{0}:\boldsymbol{\beta}_{2}={\bf 0}\quad\text{vs.}\quad H_{A}:\boldsymbol{\beta}_{2}\neq{\bf 0}.$$

$${\bf P}_{1}={\bf X}_{1}({\bf X}_{1}^{\sf T}{\bf X}_{1})^{-1}{\bf X}_{1}^{\sf T},\quad{\bf Q}_{1}={\bf I}-{\bf P}_{1},$$

$${\bf P}_{12}={\bf X}({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T},\quad{\bf Q}_{12}={\bf I}-{\bf P}_{12}.$$

$$\widehat{\bf Y}_{1}={\bf P}_{1}{\bf Y},$$

$${\bf r}_{1}={\bf Y}-\widehat{\bf Y}_{1}={\bf Q}_{1}{\bf Y},$$

$$\text{SSE}_{1}={\bf Y}^{\sf T}{\bf Q}_{1}^{\sf T}{\bf Q}_{1}{\bf Y}={\bf Y}^{\sf T}{\bf Q}_{1}{\bf Y}.$$

$$\widehat{\bf Y}_{12}={\bf P}_{12}{\bf Y},$$

$${\bf r}={\bf Y}-\widehat{\bf Y}_{12}={\bf Q}_{12}{\bf Y},$$

$$\text{SSE}_{12}={\bf Y}^{\sf T}{\bf Q}_{12}{\bf Y}.$$

$$\text{SS}_{2\mid 1}=\text{SSE}_{1}-\text{SSE}_{12}={\bf Y}^{\sf T}({\bf Q}_{1}-{\bf Q}_{12}){\bf Y}={\bf Y}^{\sf T}({\bf P}_{12}-{\bf P}_{1}){\bf Y}.$$



Full text

Rediscovering a little known fact about the $t$-test and the $F$-test: Algebraic, Geometric, Distributional and Graphical Considerations

Jennifer A. Sinnott, Steven N. MacEachern, and Mario Peruggia

Department of Statistics, The Ohio State University, Columbus, Ohio, USA

Abstract

We discuss the role that the null hypothesis should play in the construction of a test statistic used to make a decision about that hypothesis. To construct the test statistic for a point null hypothesis about a binomial proportion, a common recommendation is to act as if the null hypothesis is true. We argue that, on the surface, the one-sample $t$ -test of a point null hypothesis about a Gaussian population mean does not appear to follow the recommendation. We show how simple algebraic manipulations of the usual t-statistic lead to an equivalent test procedure consistent with the recommendation. We provide geometric intuition regarding this equivalence and we consider extensions to testing nested hypotheses in Gaussian linear models. We discuss an application to graphical residual diagnostics where the form of the test statistic makes a practical difference. By examining the formulation of the test statistic from multiple perspectives in this familiar example, we provide simple, concrete illustrations of some important issues that can guide the formulation of effective solutions to more complex statistical problems.

Keywords: Binomial proportion; $F$ -test; Nested models; Null hypothesis; Orthogonal sum of squares decomposition; Test statistic

1 Introduction

Among the first procedures taught in an introductory statistics class are hypothesis testing and confidence interval estimation for a proportion (see, e.g., Moore et al. (2012)). For example, students may be given data on the sexes of a sample of $n$ babies born during a certain time period. They may be asked either to estimate the true proportion $p$ of babies born male and provide a confidence interval, or to test whether the proportion is equal to, for example, 0.5. (There is evidence that this proportion is larger than 0.5 in most of the world; see, e.g., Chao et al. (2019).) Typically, for large $n$, the distribution of the sample proportion is approximated by $\widehat{p}\stackrel{\cdot}{\sim}N\left(p,p(1-p)/n\right)$, and two slightly different procedures are introduced. For estimation and confidence interval construction, $\widehat{p}$ is commonly plugged into the variance formula, and a $100(1-\alpha)\%$ confidence interval is calculated as

$$\widehat{p}\pm z_{\alpha/2}\sqrt{\widehat{p}(1-\widehat{p})/n}.\tag{1}$$

For testing $H_{0}:p=p_{0}$ for a pre-specified $p_{0}$ , students are advised to act as though the null were true, and use the null to construct the test statistic. As a result, $p_{0}$ is plugged into the variance formula, producing the test statistic

$$\frac{\widehat{p}-p_{0}}{\sqrt{p_{0}(1-p_{0})/n}}.\tag{2}$$

Although many different approaches to both testing and interval estimation have been proposed — and many commonly used statistical software packages allow the user to apply continuity corrections to these formulas to improve the asymptotic approximation (e.g., by setting the argument correct = TRUE in the R function prop.test) — in the authors’ experience, the above methods are still frequently taught for hand calculation in introductory statistics classes of various levels. For instance, Example 10.3.5 in Casella and Berger (2002) discusses precisely two test procedures based on test statistics that use ${\widehat{p}}$ or $p_{0}$ to estimate the variance, commenting on their relative merits in terms of a comparison of their power functions. For further discussions of procedures used in the one-sample proportion setting, see, e.g., Agresti and Coull (1998) and Yang and Black (2019).
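The difference between the two plug-in choices can be made concrete with a short sketch. The function names and the data (60 boys among 100 births) are hypothetical and chosen only for illustration; no continuity correction is applied.

```python
import math

def wald_z(y, n, p0):
    """z statistic with the variance evaluated at p_hat (confidence-interval style)."""
    p_hat = y / n
    return (p_hat - p0) / math.sqrt(p_hat * (1 - p_hat) / n)

def score_z(y, n, p0):
    """z statistic with the variance evaluated at p0, i.e. acting as if H0 were true."""
    p_hat = y / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical data: 60 boys among 100 births, testing H0: p = 0.5.
z_wald = wald_z(60, 100, 0.5)
z_score = score_z(60, 100, 0.5)
print(round(z_wald, 4), round(z_score, 4))
```

Because $p(1-p)$ is largest at $p=0.5$, the null-based denominator is the larger of the two here, so the null-based statistic is slightly smaller in absolute value for the same data.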

Also among the first procedures taught are estimation and hypothesis testing for the mean $\mu$ of a normal $N(\mu,\sigma^{2})$ population with unknown variance $\sigma^{2}.$ For example, students may be given data on the heights of a random sample of U.S. women and be asked to estimate the true mean height, or test whether it is equal to some specified value. If our data consist of a random sample $Y_{1},\ldots,Y_{n}$ from the $N(\mu,\sigma^{2})$ population, $\bar{Y}\sim N\left(\mu,\sigma^{2}/n\right),$ and a confidence interval is constructed analogously to (1), as

$$\bar{Y}\pm t_{n-1,\alpha/2}\,S/\sqrt{n},$$

where

$$S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}$$

is the sample variance. (This follows from observing that $T:=(\bar{Y}-\mu)/(S/\sqrt{n})$ has a $t$ distribution with $n-1$ degrees of freedom, accounting for the replacement of $\sigma$ with $S$.) To test $H_{0}:\mu=\mu_{0}$ for a pre-specified $\mu_{0}$, we can, analogously to (2), invoke the null. When $H_{0}$ holds, we know $\mu=\mu_{0}$ but still need to estimate $\sigma^{2}$. Since $\mu$ is known, the most efficient estimator of $\sigma^{2}$ is:

$$S_{0}^{2}:=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\mu_{0})^{2}.$$

Our test statistic would thus be:

$$T_{0}:=\frac{\bar{Y}-\mu_{0}}{S_{0}/\sqrt{n}}.$$

But, of course, people do not use this test statistic! Instead, they construct a statistic that ignores the information that $\mu=\mu_{0}$ provided by $H_{0},$ and perform the standard one-sample $t$ -test using the test statistic

$$T=\frac{\bar{Y}-\mu_{0}}{S/\sqrt{n}}.$$

At first glance, one might suspect that using this test statistic would be less efficient than using $T_{0},$ since its denominator has $n-1$ degrees of freedom rather than $n.$

We are thus led to wonder why information provided by the null is discarded in constructing the one-sample $t$ -test. In the remainder of the paper we clarify this question and present a more general perspective that we think will be of interest to colleagues who teach this material as well as those interested in the development and implications of some of our most fundamental statistical tools.

2 Establishing the connection

The connection between the two methods proposed at the end of the previous section can be established from an algebraic and from a geometric point of view. We look at these two approaches separately.

To begin, we note that any intuition that a test based on $T_{0}$ rather than $T$ could be more efficient is wrong: a tail-area test based on $T_{0}$ and one based on $T$ produce identical answers. This is because $T$ is a one-to-one, increasing function of $T_{0},$

$$T=\frac{\sqrt{n-1}\,T_{0}}{\sqrt{n-T_{0}^{2}}},\tag{4}$$

over the interval $(-\sqrt{n},\sqrt{n})$, which is the set of possible values for $T_{0}$. Specifically, for any fixed $\alpha$, with $0\leq\alpha\leq 1$, let $c_{\alpha}\geq 0$ be the critical value of the size $\alpha$ test based on $T_{0}$. The rejection region of this test is

$$R_{T_{0}}=\{{\bf y}=(y_{1},\ldots,y_{n})^{\sf T}:|T_{0}({\bf y})|\geq c_{\alpha}\}.$$

Because the transformation in Equation (4) is monotonic increasing on $[0,\sqrt{n})$ , the set

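$$R_{T}=\left\{{\bf y}=(y_{1},\ldots,y_{n})^{\sf T}:|T({\bf y})|\geq\frac{\sqrt{n-1}\,c_{\alpha}}{\sqrt{n-c_{\alpha}^{2}}}\right\}$$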

satisfies $R_{T}=R_{T_{0}}$. It follows that the test that rejects if and only if $|T({\bf y})|\geq\sqrt{n-1}\,c_{\alpha}/\sqrt{n-c_{\alpha}^{2}}$ has exactly the same rejection region (in sample space) as the test that rejects when $|T_{0}({\bf y})|\geq c_{\alpha}$. The two tests therefore have the same size and power function and are equivalent.
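A small numerical check can make the equivalence tangible. The sketch below computes $T$ and $T_{0}$ on simulated samples, confirms that $T=\sqrt{n-1}\,T_{0}/\sqrt{n-T_{0}^{2}}$ sample by sample, and confirms that a cut-off on $|T_{0}|$ and its transformed counterpart on $|T|$ reject the same samples. The function names, the simulated data, and the cut-off `c = 1.9` are illustrative assumptions; the cut-off is not a calibrated critical value.

```python
import math
import random

def t_stats(y, mu0):
    """Return (T, T0) for a sample y and null mean mu0."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # usual sample variance S^2
    s02 = sum((yi - mu0) ** 2 for yi in y) / n          # null-based variance S_0^2
    t = (ybar - mu0) / math.sqrt(s2 / n)
    t0 = (ybar - mu0) / math.sqrt(s02 / n)
    return t, t0

def t_from_t0(t0, n):
    """T as a function of T0, valid for |T0| < sqrt(n)."""
    return math.sqrt(n - 1) * t0 / math.sqrt(n - t0 ** 2)

random.seed(1)
n, mu0 = 12, 0.0
samples = [[random.gauss(0.3, 1.0) for _ in range(n)] for _ in range(200)]
pairs = [t_stats(y, mu0) for y in samples]

# The identity holds sample by sample (up to floating-point error).
max_gap = max(abs(t - t_from_t0(t0, n)) for t, t0 in pairs)

# A cut-off c on |T0| rejects exactly the samples that the transformed
# cut-off t_from_t0(c, n) rejects on |T|.
c = 1.9  # illustrative cut-off, not a calibrated critical value
same_region = all((abs(t0) >= c) == (abs(t) >= t_from_t0(c, n)) for t, t0 in pairs)
print(max_gap, same_region)
```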

As noted by a colleague, a simple way to establish Equation (4) is to recognize that the one sample $t$ -test can be derived as a likelihood ratio test that rejects $H_{0}:\mu=\mu_{0}$ when the ratio

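$$\lambda({\bf Y})=\frac{\sup_{\sigma^{2}}L(\mu_{0},\sigma^{2}\mid{\bf Y})}{\sup_{\mu,\sigma^{2}}L(\mu,\sigma^{2}\mid{\bf Y})}$$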

is small or, equivalently, when the ratio of sums of squares under the null and full model,

$$R=\frac{\sum_{j=1}^{n}(Y_{j}-\mu_{0})^{2}}{\sum_{j=1}^{n}(Y_{j}-\bar{Y})^{2}},$$

is large. This ratio can be expressed as

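$$R=\frac{\sum_{j=1}^{n}(Y_{j}-\bar{Y})^{2}+n(\bar{Y}-\mu_{0})^{2}}{\sum_{j=1}^{n}(Y_{j}-\bar{Y})^{2}}=1+\frac{T^{2}}{n-1}$$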

or as

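$$R=\frac{\sum_{j=1}^{n}(Y_{j}-\mu_{0})^{2}}{\sum_{j=1}^{n}(Y_{j}-\mu_{0})^{2}-n(\bar{Y}-\mu_{0})^{2}}=\frac{1}{1-T_{0}^{2}/n}.$$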

The former expression leads to the standard $t$ -test based on $T$ , while the latter leads to the test based on $T_{0}$ . Equating these two expressions yields the identity of Equation (4).
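As a sanity check on the two expressions for the ratio of sums of squares, the following sketch evaluates the ratio directly and through $1+T^{2}/(n-1)$ and $1/(1-T_{0}^{2}/n)$. The function name and the six data values are hypothetical.

```python
def ss_ratio_forms(y, mu0):
    """Return the ratio of null to full-model sums of squares in three forms."""
    n = len(y)
    ybar = sum(y) / n
    sse = sum((yi - ybar) ** 2 for yi in y)    # sum of squares about the sample mean
    ssto = sum((yi - mu0) ** 2 for yi in y)    # sum of squares about the null mean
    t2 = n * (ybar - mu0) ** 2 / (sse / (n - 1))   # T^2
    t02 = n * (ybar - mu0) ** 2 / (ssto / n)       # T0^2
    return ssto / sse, 1 + t2 / (n - 1), 1 / (1 - t02 / n)

y = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7]   # hypothetical observations
ratio, via_t, via_t0 = ss_ratio_forms(y, 4.0)
print(ratio, via_t, via_t0)
```

All three numbers agree up to rounding, and setting the second equal to the third and solving for $T^{2}$ gives $T^{2}=(n-1)T_{0}^{2}/(n-T_{0}^{2})$.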

This relationship between $T$ and $T_{0}$ is, of course, not new: for example, it arises substantively in Lehmann’s approach for demonstrating that the one-sample $t$-test is a uniformly most powerful (UMP) unbiased test of $H_{0}:\mu=\mu_{0}$ vs. $H_{A}:\mu\neq\mu_{0}$ (Lehmann, 1986). The full details of the argument are best left to Lehmann, but, very briefly, for parameters in exponential family distributions, Lehmann’s Theorem 1 in Chapter 5 gives a set of conditions about the form of a test statistic in relation to the family’s sufficient statistics. When these conditions are satisfied, a test based on the test statistic is UMP unbiased. The set of conditions Lehmann provides is satisfied by $T_{0}$ rather than $T$, and the UMP unbiasedness of the $t$-test is then established by showing that $T$ is a one-to-one function of $T_{0}$.

Interestingly, this equivalence does not seem to be widely known (at least based on our informal surveying of several colleagues). This is somewhat surprising. In fact, in addition to appearing in Lehmann’s book, the algebraic equivalence of the test statistics is periodically mentioned in the literature (see, e.g., Lefante Jr and Shah (1986); Good (1986); Shah and Lefante Jr (1987); Shah and Krishnamoorthy (1993); LaMotte (1994)). However, we feel that the equivalence is worth revisiting, both in the context of the $t$ -test and in the more general setting of nested linear models, where an analogous equivalence holds. The geometric interpretation of the equivalence, not described in these earlier references, provides an interesting addition to the geometric interpretation of linear models. Moreover, despite the test statistics leading to identical conclusions in the linear models setting, one choice naturally leads a practitioner to consider so-called studentized residuals while the other leads to so-called standardized residuals—and these sets of residuals do have different properties and, when plotted, may lead to different visual interpretations. We expand on these remarks in subsequent sections.

3 The geometric point of view

Interestingly, the equivalence of $T_{0}$ and $T$ can be understood geometrically because they can both be viewed as trigonometric functions of the same angle, and it is possible to express any trigonometric function in terms of any other trigonometric function, up to sign. To see the geometric relationship, define the vectors ${\bf v}=(Y_{1}-\mu_{0},Y_{2}-\mu_{0},\ldots,Y_{n}-\mu_{0})^{{\sf\scriptscriptstyle{T}}}$ and ${\bf 1}=(1,1,\ldots,1)^{{\sf\scriptscriptstyle{T}}}.$ Then, the orthogonal projection of ${\bf v}$ onto ${\bf 1}$ is ${\bf u}=(\bar{Y}-\mu_{0}){\bf 1},$ and the Pythagorean Theorem implies:

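$$\|{\bf v}\|^{2}=\|{\bf u}\|^{2}+\|{\bf v}-{\bf u}\|^{2},\quad\text{i.e.,}\quad\sum_{i=1}^{n}(Y_{i}-\mu_{0})^{2}=n(\bar{Y}-\mu_{0})^{2}+\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2},\quad\text{i.e.,}\quad\text{SSTO}=\text{SST}+\text{SSE},$$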

where we introduce analysis of variance terminology, with SSTO, SST, and SSE indicating the Sums of Squares for Total, Treatment, and Error, respectively. Thus, if we define $\theta$ to be the angle between ${\bf 1}$ and ${\bf v},$ then:

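$$T_{0}^{2}=n\,\frac{\text{SST}}{\text{SSTO}}=n\cos^{2}\theta\quad\text{and}\quad T^{2}=(n-1)\,\frac{\text{SST}}{\text{SSE}}=(n-1)\cot^{2}\theta.$$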

A stylized, two-dimensional representation of the essence of these geometric relationships is presented in Figure 1. Using basic trigonometric expressions it is easy to derive the stated algebraic relationship between $T$ and $T_{0}$ . In fact,

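$$T^{2}=(n-1)\cot^{2}\theta=(n-1)\,\frac{\cos^{2}\theta}{\sin^{2}\theta}=(n-1)\,\frac{\cos^{2}\theta}{1-\cos^{2}\theta}.$$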

Substituting $\cos^{2}\theta=T_{0}^{2}/n$ into this expression and taking square roots on both sides (making sure the signs match, as they should) yields Equation (4).
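The trigonometric reading can also be checked numerically. This sketch computes the angle between ${\bf v}$ and ${\bf 1}$ and recovers the signed statistics as $T_{0}=\sqrt{n}\cos\theta$ and $T=\sqrt{n-1}\cot\theta$, then compares them with the direct formulas. The function names and the data are hypothetical.

```python
import math

def t_stats_from_angle(y, mu0):
    """Return (T0, T) computed from the angle between v = y - mu0 and the ones vector."""
    n = len(y)
    v = [yi - mu0 for yi in y]
    dot_v1 = sum(v)                                  # inner product of v with (1, ..., 1)
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    cos_theta = dot_v1 / (norm_v * math.sqrt(n))
    sin_theta = math.sqrt(1.0 - cos_theta ** 2)      # theta lies in (0, pi), so sin(theta) > 0
    return math.sqrt(n) * cos_theta, math.sqrt(n - 1) * cos_theta / sin_theta

def t_stats_direct(y, mu0):
    """Return (T0, T) from the usual variance-based formulas."""
    n = len(y)
    ybar = sum(y) / n
    s = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    s0 = math.sqrt(sum((yi - mu0) ** 2 for yi in y) / n)
    return (ybar - mu0) / (s0 / math.sqrt(n)), (ybar - mu0) / (s / math.sqrt(n))

y = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7]   # hypothetical observations
t0_angle, t_angle = t_stats_from_angle(y, 4.0)
t0_direct, t_direct = t_stats_direct(y, 4.0)
print(t0_angle, t0_direct, t_angle, t_direct)
```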

4 Extension to linear models

The results presented in the previous sections are not specific to the $t$ -test setting. In fact, constructing a test statistic by invoking the null hypothesis and constructing it in the “traditional” way produces equivalent test procedures across a range of linear models. This connection can be established by rewriting the two statistics as functions of different terms in the orthogonal decomposition of the sum of squares.

4.1 Nested models

For instance, consider the standard linear model

$${\bf Y}={\bf X}\boldsymbol{\beta}+\boldsymbol{\epsilon},$$

where ${\bf Y}=(Y_{1},\ldots,Y_{n})^{{\sf\scriptscriptstyle{T}}}$ is a vector of observations, ${\bf X}_{n\times p}$ is a design matrix of rank $p<n$ , $\boldsymbol{\beta}=(\beta_{1},\ldots,\beta_{p})^{{\sf\scriptscriptstyle{T}}}$ is a vector of regression parameters, and $\boldsymbol{\epsilon}=(\epsilon_{1},\ldots,\epsilon_{n})^{{\sf\scriptscriptstyle{T}}}$ is an error vector with elements $\epsilon_{i}\stackrel{{\scriptstyle\text{iid}}}{{\sim}}N(0,\sigma^{2})$ . Suppose we wish to determine if a specific collection of $p_{2}$ covariates in ${\bf X}$ does not significantly contribute to the prediction of ${\bf Y}$ in the linear model. We can formulate this question as a testing problem in which the null hypothesis states that the $p_{2}$ regression coefficients for these covariates are all zero. Without loss of generality we can assume that the parameters of interest are the last $p_{2}<p$ and rewrite the model as

$${\bf Y}={\bf X}_{1}\boldsymbol{\beta}_{1}+{\bf X}_{2}\boldsymbol{\beta}_{2}+\boldsymbol{\epsilon},$$

where ${\bf X}=[{\bf X}_{1}|{\bf X}_{2}]$ and $\boldsymbol{\beta}=(\boldsymbol{\beta}_{1}^{{\sf\scriptscriptstyle{T}}},\boldsymbol{\beta}_{2}^{{\sf\scriptscriptstyle{T}}})^{{\sf\scriptscriptstyle{T}}}$ , with $\boldsymbol{\beta}_{i}$ of dimension $p_{i}$ for $i=1,2,$ and $p_{1}+p_{2}=p.$ The testing problem concerning the nested model can then be stated as

$$H_{0}:\boldsymbol{\beta}_{2}={\bf 0}\quad\text{vs.}\quad H_{A}:\boldsymbol{\beta}_{2}\neq{\bf 0}.$$

Both the “traditional” and the “null hypothesis” testing procedures try to quantify the importance of the reduction in error sums of squares that ensues from entertaining the full model rather than the reduced model, but they differ in the comparison yardstick they use. The “traditional” procedure uses a yardstick based on the full model. The “null hypothesis” procedure uses a yardstick based on the reduced model with $\boldsymbol{\beta}_{2}={\bf 0}$ .

Geometrically, the statistics arise from a sequence of projections. Specifically, define:

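$${\bf P}_{1}={\bf X}_{1}({\bf X}_{1}^{\sf T}{\bf X}_{1})^{-1}{\bf X}_{1}^{\sf T},\quad{\bf Q}_{1}={\bf I}-{\bf P}_{1},$$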

and

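$${\bf P}_{12}={\bf X}({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T},\quad{\bf Q}_{12}={\bf I}-{\bf P}_{12}.$$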

The matrix ${\bf P}_{1}$ operates an orthogonal projection onto the space spanned by the columns of the reduced design matrix ${\bf X}_{1}$ and the matrix ${\bf P}_{12}$ operates an orthogonal projection onto the space spanned by the columns of the full design matrix ${\bf X}$ . Under the reduced model, the vector of predicted values is

$$\widehat{\bf Y}_{1}={\bf P}_{1}{\bf Y},$$

the vector of residuals is

$${\bf r}_{1}={\bf Y}-\widehat{\bf Y}_{1}={\bf Q}_{1}{\bf Y},$$

and the residual sum of squares is

$$\text{SSE}_{1}={\bf Y}^{\sf T}{\bf Q}_{1}^{\sf T}{\bf Q}_{1}{\bf Y}={\bf Y}^{\sf T}{\bf Q}_{1}{\bf Y}.$$

Similarly, under the full model, the vector of predicted values is

$$\widehat{\bf Y}_{12}={\bf P}_{12}{\bf Y},$$

the vector of residuals is

$${\bf r}={\bf Y}-\widehat{\bf Y}_{12}={\bf Q}_{12}{\bf Y},$$

and the residual sum of squares is

$$\text{SSE}_{12}={\bf Y}^{\sf T}{\bf Q}_{12}{\bf Y}.$$

The reduction in sums of squares ensuing from fitting the larger model is given by

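$$\text{SS}_{2\mid 1}=\text{SSE}_{1}-\text{SSE}_{12}={\bf Y}^{\sf T}({\bf Q}_{1}-{\bf Q}_{12}){\bf Y}={\bf Y}^{\sf T}({\bf P}_{12}-{\bf P}_{1}){\bf Y}.$$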

The “traditional” procedure compares $\text{SS}_{2\mid 1}$ to $\text{SSE}_{12}$ , the error sum of squares for the full model, while the “null hypothesis” procedure compares $\text{SS}_{2\mid 1}$ to $\text{SSE}_{1}=\text{SS}_{2\mid 1}+\text{SSE}_{12}$ , the error sum of squares for the reduced model envisioned to hold under the null. After adjusting for the degrees of freedom of the various sums of squares, the resulting test statistics are

$$F_{\text{trad}}=\frac{\text{SS}_{2\mid 1}/p_{2}}{\text{SSE}_{12}/(n-p)}$$

and

$$F_{\text{null}}=\frac{\text{SS}_{2\mid 1}/p_{2}}{\text{SSE}_{1}/(n-p_{1})},$$

respectively.

4.2 Algebra, geometry, and distributional results

The orthogonal decomposition at play in this setting is analogous to the one presented for the $t$-test in Sections 2 and 3 and is described in a stylized, two-dimensional display in Figure 2, along with the relationships between its various elements. Algebraic and trigonometric manipulations similar to those outlined in Sections 2 and 3 show that $F_{\text{trad}}$ is a one-to-one, increasing function of $F_{\text{null}}$ over $(0,(n-p_{1})/p_{2})$, the set of possible values for $F_{\text{null}}$:

$$F_{\text{trad}}=\frac{(n-p)\,F_{\text{null}}}{n-p_{1}-p_{2}\,F_{\text{null}}}.\tag{6}$$

Thus, as in the case of the $t$ -test, tail-area tests using $F_{\text{trad}}$ and $F_{\text{null}}$ are identical. Note that, when $p=1,$ $p_{1}=0,$ and $p_{2}=1,$ the relationship between $F_{\text{trad}}$ and $F_{\text{null}}$ given in Equation (6) reduces to the relationship between $T^{2}$ and $T_{0}^{2}$ implied by Equation (4).
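The one-to-one relationship can be illustrated on the simplest nested pair: an intercept-only reduced model ($p_{1}=1$) against a full model that adds one slope ($p_{2}=1$). The sketch assumes the statistics take the forms $F_{\text{trad}}=(\text{SS}_{2\mid 1}/p_{2})/(\text{SSE}_{12}/(n-p))$ and $F_{\text{null}}=(\text{SS}_{2\mid 1}/p_{2})/(\text{SSE}_{1}/(n-p_{1}))$, which is how the description in Section 4.1 is read here; the function names and data are hypothetical.

```python
def nested_f_stats(x, y):
    """F_trad and F_null for intercept-only (p1 = 1) vs. intercept + slope (p2 = 1)."""
    n, p1, p2 = len(y), 1, 1
    p = p1 + p2
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sse1 = sum((yi - ybar) ** 2 for yi in y)   # error sum of squares, reduced model
    ss21 = sxy ** 2 / sxx                      # reduction due to adding the slope
    sse12 = sse1 - ss21                        # error sum of squares, full model
    f_trad = (ss21 / p2) / (sse12 / (n - p))
    f_null = (ss21 / p2) / (sse1 / (n - p1))
    return f_trad, f_null, n, p1, p2

def f_trad_from_f_null(f_null, n, p1, p2):
    """Map F_null to F_trad; requires f_null < (n - p1) / p2."""
    return (n - p1 - p2) * f_null / (n - p1 - p2 * f_null)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # hypothetical covariate
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.4, 7.7]   # hypothetical response
f_trad, f_null, n, p1, p2 = nested_f_stats(x, y)
print(f_trad, f_null, f_trad_from_f_null(f_null, n, p1, p2))
```

Note that $F_{\text{null}}$ stays below $(n-p_{1})/p_{2}$ no matter how strong the fit is, whereas $F_{\text{trad}}$ is unbounded.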

The implementation of either test procedure requires knowledge of the distribution of the corresponding test statistic under the null hypothesis. Using the notation introduced in Figure 2, standard distributional results imply that, under the null hypothesis,

[TABLE]

with $b^{2}$ independent of $c^{2}$ .

Then,

$$F_{\text{trad}}\sim F_{p_{2},\,n-p},$$

as it is the ratio of two independent chi-square random variables divided by their degrees of freedom. Also,

$$\frac{p_{2}}{n-p_{1}}\,F_{\text{null}}\sim\text{Beta}\left(\frac{p_{2}}{2},\frac{n-p}{2}\right),$$

as it is the ratio between a chi-square random variable and the sum of that chi-square random variable and an independent chi-square random variable.
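A quick Monte Carlo sketch can illustrate these null distributions in the simple-regression case ($p_{1}=1$, $p_{2}=1$, $n=10$). Under the null, the ratio $\text{SS}_{2\mid 1}/\text{SSE}_{1}$ is a Beta$(p_{2}/2,(n-p)/2)$ variable with mean $p_{2}/(n-p_{1})$, so $F_{\text{null}}$ (taken here as $(\text{SS}_{2\mid 1}/p_{2})/(\text{SSE}_{1}/(n-p_{1}))$, an assumed form) should average about 1, while an $F_{1,8}$ variable has mean $8/6$. The seed, sample size, and number of replications are arbitrary choices.

```python
import random

def f_stats(x, y):
    """Return (F_trad, F_null) for intercept-only vs. intercept-plus-slope models."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sse1 = sum((yi - ybar) ** 2 for yi in y)   # reduced-model error sum of squares
    ss21 = sxy ** 2 / sxx                      # reduction from adding the slope
    sse12 = sse1 - ss21                        # full-model error sum of squares
    return ss21 / (sse12 / (n - 2)), ss21 / (sse1 / (n - 1))

rng = random.Random(2024)
n, n_sims = 10, 20000
x = [float(i) for i in range(n)]
# Responses are pure noise, so the reduced (null) model holds.
draws = [f_stats(x, [rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(n_sims)]

mean_trad = sum(ft for ft, _ in draws) / n_sims
mean_null = sum(fn for _, fn in draws) / n_sims
max_null = max(fn for _, fn in draws)
print(round(mean_trad, 3), round(mean_null, 3), round(max_null, 3))
```

The largest simulated $F_{\text{null}}$ never reaches $n-p_{1}=9$, reflecting the bounded support of the Beta distribution.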

4.3 Does the difference ever matter?

While the test procedures based on $F_{\text{trad}}$ and $F_{\text{null}}$ produce identical inferences, the realized values of the test statistics are different. In this section we consider a situation in which, arguably, it is preferable to work with one of the two statistics rather than the other.

Residual plots are effective graphical devices for assessing the quality of the fit of a linear regression model and for detecting potential outliers. As noted in Section 9.4.1 of Weisberg (2014), a simple test for determining if observation $i$ is an outlier in a regression model that includes $p_{1}$ predictors is to include an additional predictor which is an indicator of the observation in question (i.e., a 0-1 vector whose only element equal to 1 is the $i$ -th one) and to test if the regression coefficient of the indicator is equal to zero.

Assuming normal errors for the regression model and letting $p_{2}=1$ , it is natural to cast this problem into the framework of Section 4.1 and compare the full model with $p=p_{1}+p_{2}$ predictors (the original predictors and the indicator of observation $i$ ) and the nested model that omits the indicator variable. Observation $i$ is declared an outlier if the null hypothesis that the coefficient of its indicator variable is zero is rejected.

The traditional statistic for this problem is $F_{\text{trad}}$ , which has an $F_{1,n-p}$ distribution under the null. The square root of $F_{\text{trad}}$ (with sign matching the sign of the regression residual for observation $i$ ) is the usual $t$ statistic for outlier detection described by Weisberg (2014). It is also a quantity known as the studentized residual for observation $i$ , a normalized version of the raw residual, $\hat{e}_{i}$ , computed using an estimate of the error variance, $\hat{\sigma}^{2}_{(i)}$ , that omits observation $i$ from the calculation. Conceptually, this point of view is appealing because, if the null hypothesis were violated and observation $i$ were indeed an outlier, its inclusion in the calculation would inflate the estimate of the error variance. As stated in Weisberg (2014), the studentized residual can be expressed as

$$t_{i}=\frac{\hat{e}_{i}}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}},$$

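where $h_{ii}$ denotes the leverage of observation $i$, given by the $i$-th diagonal element of the projection (or hat) matrix ${\bf P}_{1}$ for the reduced model, that is, the original regression without the indicator. (In the full model the indicator column forces the corresponding diagonal element of ${\bf P}_{12}$ to equal 1, so that matrix cannot supply the leverage.)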

On the other hand, as seen in Section 4.1, the same test could also be performed using the statistic $F_{\text{null}}$ . The signed square root of $F_{\text{null}}$ turns out to be what is called the standardized residual for observation $i$ , a normalized version of the raw residual, $\hat{e}_{i}$ , computed using an estimate of the error variance, $\hat{\sigma}^{2}$ , that uses all observations, including observation $i$ . This would be the natural calculation to perform if one were to assume that the null hypothesis were true. As stated in Weisberg (2014), the standardized residual can be expressed as

$$r_{i}=\frac{\hat{e}_{i}}{\hat{\sigma}\sqrt{1-h_{ii}}},$$

and the deterministic relationship between studentized and standardized residuals is given by

$$t_{i}=r_{i}\sqrt{\frac{n-p}{n-p_{1}-r_{i}^{2}}},\qquad p=p_{1}+1.$$

This deterministic relationship mirrors, on the square root scale, the deterministic relationship between $F_{\text{trad}}$ and $F_{\text{null}}$ . Ultimately, because of the deterministic relationships relating $F_{\text{trad}}$ , $F_{\text{null}}$ , and the two residual test statistics, an outlier test based on any of these four statistics leads to the same decision.
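To see how the two kinds of residuals diverge, the sketch below applies the standard identity $t_{i}=r_{i}\sqrt{(n-p_{1}-1)/(n-p_{1}-r_{i}^{2})}$ (see, e.g., Weisberg (2014)) over a grid of standardized residuals. The function name is hypothetical, and the values $n=21$ and $p_{1}=2$ (intercept and slope) are assumptions meant to resemble a small simple linear regression.

```python
import math

def studentized_from_standardized(r, n, p1):
    """Map a standardized residual r to the studentized residual; needs r^2 < n - p1."""
    return r * math.sqrt((n - p1 - 1) / (n - p1 - r * r))

n, p1 = 21, 2   # assumed: 21 observations, intercept and slope in the original model
rs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
gaps = [studentized_from_standardized(r, n, p1) - r for r in rs]
for r, gap in zip(rs, gaps):
    print(r, round(r + gap, 4), round(gap, 4))
```

The two residuals coincide at $|r_{i}|=1$, the studentized residual is smaller in absolute value below that point, and the gap widens increasingly fast as $|r_{i}|$ approaches its upper bound $\sqrt{n-p_{1}}$.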

Residual plots are often used to conduct an exploratory assessment of the fit of the regression model. In this type of analysis, the plots are scanned visually for the existence of identifiable patterns and idiosyncratic features that might reveal violations of the modeling assumptions. With regard to outlier detection specifically, plots of residuals vs. fitted values are inspected to reveal the presence of unusually large residuals. We argue that, owing to the nonlinearity of the transformation that relates standardized residuals to studentized residuals, a studentized residual plot is better suited than a standardized residual plot to achieve this goal.

We illustrate this point with an example based on a subset of the data on brain and body weights for 100 species of placental mammals reported in Sacher and Staffeldt (1974). Here, for the measurements on the 21 species of primates included in the data set, we consider the simple linear regression of the natural logarithm of brain weight on the natural logarithm of body weight. Standardized and studentized residual plots are presented in the top row of Figure 3. Two species stand out: Homo sapiens (with a large positive residual) and Gorilla gorilla (with a large negative residual). Both are flagged as outliers at the 0.05 level, with respective p-values of 0.0034 and 0.0301 (unadjusted for multiplicity of comparisons).

The extent to which these two species stand apart from the other 19 species is clearly different. As evidenced visually in both plots, the residual for Homo sapiens is further removed from the bulk of the residuals than the residual for Gorilla gorilla, and this impression is accentuated in the studentized residual plot. This is due to the nonlinear relationship between standardized and studentized residuals, which causes the difference in absolute size between the two to increase monotonically as the absolute size of the standardized residual grows from 1 toward its upper bound of $\sqrt{n-p_{1}}$ (at which point the studentized residual diverges). In particular, as shown in Figure 4, the size of this difference becomes very noticeable when the absolute value of the standardized residual exceeds a value of about 2.5.

In our example, the absolute difference between studentized and standardized residuals is 0.6563 (very noticeable) for Homo sapiens, 0.2394 (noticeable) for Gorilla gorilla, and between 0.0011 and 0.0273 (hardly noticeable) for all other species. The displays in the bottom row of Figure 3, being based on $F_{\text{null}}$ and $F_{\text{trad}}$, which are the squared versions of the standardized and studentized residuals, emphasize the features just described even more. In summary, the displays based on the studentized residuals and on $F_{\text{trad}}$ can focus the analyst’s attention on the most extreme cases more effectively than those based on the standardized residuals and on $F_{\text{null}}$.

5 The Role of the Null Hypothesis in the Construction of a Test Statistic

The fundamental question raised by the examples presented in this article concerns the role that the null hypothesis should play in the testing paradigm. By convention, the null hypothesis is assumed true in order to assess statistical significance, but to what extent should one rely on it to construct the test statistic? When confronted with a new statistical model and a new parameter of interest, it can be something of an art to determine a good choice of test statistic. Three common “automatic” approaches for constructing test statistics from likelihoods privilege the null differently: score tests are typically built under the null; Wald tests are typically built under the alternative; and likelihood ratio tests compare the null and the alternative somewhat equally.

We consider first the case of an i.i.d. sample of size $n$ from $f(x\,|\,\theta)$ , a distribution indexed by a single parameter, $\theta$ , and rely on the results and examples presented in Casella and Berger (2002). We denote by $L(\theta\,|\,{\bf X})=f({\bf X}\,|\,\theta)$ the likelihood function.

The score is defined as $S({\bf X}\,|\,\theta)=d/d\,\theta\,\log f({\bf X}\,|\,\theta)$ . It can be shown that, for all $\theta$ , $ES({\bf X}\,|\,\theta)=0$ and $\text{Var}S({\bf X}\,|\,\theta)=I_{n}(\theta)$ , the expected Fisher information in the sample. The point null hypothesis $H_{0}:\theta=\theta_{0}$ is tested using the score test statistic $S({\bf X}\,|\,\theta_{0})/\sqrt{I_{n}(\theta_{0})}$ , which has mean 0 and variance 1 for all $n$ , and, under appropriate regularity conditions, converges in distribution under the null to a standard normal as $n$ goes to infinity, enabling the derivation of approximate cut-off values. Equivalently, the test can be based on the square of the score test statistic which has an asymptotic $\chi^{2}_{1}$ distribution. For $n$ independent Bernoulli $(p)$ observations yielding $y$ successes, ${\widehat{p}}=y/n$ and the resulting score test statistic for testing $H_{0}:p=p_{0}$ is the one given in formula (2). Its squared version is therefore

$$U_{S}=\frac{(\widehat{p}-p_{0})^{2}}{p_{0}(1-p_{0})/n}.$$

Suppose that, for all $\theta$ , $W_{n}({\bf X})$ is a consistent sequence of estimators of $\theta$ , having standard error $S_{n}({\bf X})$ . The Wald statistic for testing $H_{0}:\theta=\theta_{0}$ is constructed as $(W_{n}({\bf X})-\theta_{0})/S_{n}({\bf X})$ and, if asymptotic normality holds, approximate cut-off values can again be derived under the null based on the quantiles of a standard normal. If the square of the Wald statistic is used for testing, approximate cut-offs should be based on the quantiles of a $\chi^{2}_{1}$ distribution. Often $W_{n}({\bf X})$ is taken to be the maximum likelihood estimator of $\theta$ , with $S_{n}({\bf X})=1/\sqrt{I_{n}(W_{n}({\bf X}))}$ . Upon observing $y$ successes out of $n$ independent Bernoulli $(p)$ trials, this recipe yields the statistic of formula (2), but with $p_{0}$ replaced by $\hat{p}=y/n$ in the denominator of that expression. The squared version of the statistic is therefore

$$U_{W}=\frac{(\widehat{p}-p_{0})^{2}}{\widehat{p}(1-\widehat{p})/n}.$$

The likelihood ratio test statistic for testing $H_{0}:\theta=\theta_{0}$ is defined as

$$\lambda({\bf X})=\frac{L(\theta_{0}\,|\,{\bf X})}{\sup_{\theta}L(\theta\,|\,{\bf X})}.$$

Assuming appropriate regularity conditions, $-2\log\lambda({\bf X})$ has an asymptotic $\chi^{2}_{1}$ distribution under the null that can be used to obtain approximate cut-offs for the test. For the case of $n$ independent Bernoulli $(p)$ observations, denoting by $y$ the total number of successes, the resulting likelihood ratio test will reject for large values of

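$$U_{L}=-2\log\lambda({\bf X})=2\left[y\log\frac{\widehat{p}}{p_{0}}+(n-y)\log\frac{1-\widehat{p}}{1-p_{0}}\right].$$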

Engle (1984) defines these three types of tests for the more general situation in which the parameter vector is multidimensional, including the case in which only a subset of the parameters are of inferential interest while the remaining ones are regarded as nuisance parameters. A detailed recount of the insightful results presented there is beyond the scope of this article, but an important message is that, quite generally, the three types of tests will behave asymptotically similarly under the null and under local alternatives, although the asymptotic behavior for alternative values away from $\theta_{0}$ will typically differ.

For finite samples the three statistics may yield different tests. The reason for this is illustrated in Figure 5, which presents scatter plots of the squared score, $U_{S}$ , and squared Wald, $U_{W}$ , statistics against the log-likelihood ratio statistic, $U_{L}$ , and of the squared score statistic, $U_{S}$ , against the squared Wald statistic, $U_{W}$ , for $n=30$ and $p_{0}=1/3$ . While these statistics are, separately, related monotonically for $\hat{p}\leq 1/3$ and $\hat{p}>1/3$ , the overall relationships are not monotonic. An examination of the rejection regions for these tests shows that the order in which the values of the total number of successes enter the rejection region (as the size of the test increases) differs among them. This is a situation in which the choice of which statistic to use matters.
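The differing entry orders can be checked numerically. The following Python sketch is an illustration added here, not code from the paper: it evaluates the squared score statistic (null variance $p_{0}(1-p_{0})/n$), the squared Wald statistic (estimated variance $\hat{p}(1-\hat{p})/n$), and $-2\log\lambda$ for every possible count when $n=30$ and $p_{0}=1/3$. The function names `bernoulli_test_statistics` and `entry_order` are hypothetical, and the conventions $0\log 0=0$ and "Wald statistic is infinite when the estimated variance is zero" are assumptions of the sketch.

```python
import math


def bernoulli_test_statistics(y, n, p0):
    """Return (U_S, U_W, U_L) for y successes in n trials under H0: p = p0."""
    p_hat = y / n
    diff_sq = (p_hat - p0) ** 2

    # Squared score statistic: variance evaluated at the null value p0.
    u_s = diff_sq / (p0 * (1 - p0) / n)

    # Squared Wald statistic: variance evaluated at the MLE p_hat.
    # Assumption: treat the statistic as infinite when y = 0 or y = n.
    var_hat = p_hat * (1 - p_hat) / n
    u_w = diff_sq / var_hat if var_hat > 0 else math.inf

    # -2 log likelihood ratio, using the convention 0 * log(0) = 0.
    def term(count, p_est, p_null):
        return count * math.log(p_est / p_null) if count > 0 else 0.0

    u_l = 2 * (term(y, p_hat, p0) + term(n - y, 1 - p_hat, 1 - p0))
    return u_s, u_w, u_l


def entry_order(n, p0, index):
    """Counts y sorted from most to least extreme under one statistic.

    index 0 = score, 1 = Wald, 2 = likelihood ratio; ties are broken by y.
    """
    stats = {y: bernoulli_test_statistics(y, n, p0)[index] for y in range(n + 1)}
    return sorted(stats, key=lambda y: (-stats[y], y))


n, p0 = 30, 1 / 3
order_score = entry_order(n, p0, 0)
order_wald = entry_order(n, p0, 1)
order_lr = entry_order(n, p0, 2)

for name, order in [("score", order_score), ("Wald", order_wald), ("LR", order_lr)]:
    print(f"{name:>5}: first counts to enter the rejection region: {order[:12]}")
```

In this sketch the count $y=0$ is treated as less extreme than $y=22$ by the score statistic but as more extreme by the likelihood ratio statistic, while the Wald statistic places $y=0$ and $y=30$ first because the estimated variance vanishes there.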

As an example of a multidimensional situation including parameters of inferential interest and nuisance parameters, consider again the problem of testing a nested reduced model against the full model in the Gaussian linear model setting. There, the likelihood ratio test rejects the null hypothesis that the reduced model holds when the ratio

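$$\frac{\sup_{\text{reduced model}}L}{\sup_{\text{full model}}L}=\left(\frac{\text{SSE}_{12}}{\text{SSE}_{1}}\right)^{n/2}$$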

is small, or, equivalently, when the ratio $\text{SSE}_{1}/\text{SSE}_{12}$ of the error sum of squares under the reduced (null) model and the full model is large, ultimately leading to the equivalent tests based on $F_{\text{null}}$ (a multiple of the score statistic as defined in Engle (1984)) and $F_{\text{trad}}$ (a multiple of the Wald statistic as defined in Engle (1984)). This structure of the likelihood ratio test for nested models had already been noticed for the special case presented in Section 2, when discussing the derivation of the $t$ -test in its two equivalent forms based on the ratio of Equation (5). Using the multiparameter definitions of the three types of test statistics, their deterministic functional relationships, and considering their asymptotic and finite sample distributions, Engle (1984) shows that the resulting tests are, in this case, equivalent both asymptotically and in finite samples.
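A small numerical check may help make these deterministic relationships concrete. The Python sketch below is an added illustration under stated assumptions: the reduced model is intercept-only, the full model is a simple linear regression, and the data are made-up toy values. The quantity `ratio_null` is the unscaled ratio $(\text{SSE}_{1}-\text{SSE}_{12})/\text{SSE}_{1}$, a stand-in that may differ from the paper's $F_{\text{null}}$ by a constant factor. The function name and dictionary keys are hypothetical.

```python
import math


def nested_model_statistics(x, y):
    """Compare an intercept-only (reduced) model with simple linear regression (full)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    sse_reduced = sum((yi - y_bar) ** 2 for yi in y)  # SSE_1
    ss_extra = s_xy ** 2 / s_xx                       # SSE_1 - SSE_12
    sse_full = sse_reduced - ss_extra                 # SSE_12

    d, nu = 1, n - 2  # numerator and denominator degrees of freedom
    return {
        "n": n,
        "d": d,
        "nu": nu,
        # Traditional F: extra sum of squares scaled by the full-model error.
        "f_trad": (ss_extra / d) / (sse_full / nu),
        # Null-type ratio: extra sum of squares scaled by the reduced-model error.
        "ratio_null": ss_extra / sse_reduced,
        # Gaussian likelihood ratio, with sigma^2 estimated by SSE / n in each model.
        "likelihood_ratio": (sse_full / sse_reduced) ** (n / 2),
    }


# Toy data, invented purely for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]
result = nested_model_statistics(x, y)

# The likelihood ratio is a decreasing function of the traditional F statistic ...
lr_from_f = (1 + result["d"] * result["f_trad"] / result["nu"]) ** (-result["n"] / 2)
# ... and the null-type ratio is a decreasing function of the likelihood ratio.
ratio_from_lr = 1 - result["likelihood_ratio"] ** (2 / result["n"])

print(result)
print(lr_from_f, ratio_from_lr)
```

Because each quantity is a strictly monotone function of the others, rejecting for a small likelihood ratio, a large `f_trad`, or a large `ratio_null` defines the same test once the cut-offs are matched.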

6 Discussion

The idea of constructing a test statistic by pretending that the null hypothesis is true is routinely presented as a general guideline when using binomial data for testing the hypothesis that a population proportion is equal to a given value. Yet, this guideline is not followed, at least on the surface, when normal data are used to build the $t$ -test for testing the hypothesis that the population mean is equal to a given value. As we noted in the paper, the $t$ -test is actually equivalent to a procedure based on a test statistic derived by following the guideline, but making the connection requires a little algebra, and is, to our knowledge, not typically made in introductory statistics classes, even at the graduate level. We have also noted that the same considerations presented for the $t$ -test extend to the use of the $F$ -test for testing hypotheses concerning nested linear models with Gaussian errors.

So, we are left to speculate why, in the case of the $t$ -test and of the $F$ -test, the “traditional” procedure is preferred to the “null hypothesis” procedure. If a formal comparison is required, there is no clear distributional advantage of one approach over the other. For the comparison of nested linear models, under the null, the “traditional” procedure requires calculation of the tail area of an $F$ distribution and the “null hypothesis” procedure requires calculation of the tail area of a Beta distribution. If a power calculation has to be performed under some alternative, it can be based on the non-central $F$ -distribution for the traditional procedure and on the Type I non-central Beta distribution for the “null hypothesis” procedure, again with no clear advantage of one approach over the other. Similar considerations apply to the case of the $t$ -test.
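The correspondence between the two tail areas can be written out explicitly. The following is a sketch in notation introduced here for illustration rather than taken from the paper: $d$ and $\nu$ denote the numerator and denominator degrees of freedom of $F_{\text{trad}}$, and $B$ denotes the unscaled ratio with the reduced-model error sum of squares in the denominator, which may differ from $F_{\text{null}}$ by a constant factor. Then

$$B=\frac{\text{SSE}_{1}-\text{SSE}_{12}}{\text{SSE}_{1}}=\frac{d\,F_{\text{trad}}}{d\,F_{\text{trad}}+\nu},$$

which is strictly increasing in $F_{\text{trad}}$. Under the null, $F_{\text{trad}}\sim F_{d,\nu}$ and therefore $B\sim\text{Beta}(d/2,\nu/2)$, so that for any observed value $f$

$$P\left(F_{d,\nu}\geq f\right)=P\left(\text{Beta}(d/2,\nu/2)\geq\frac{d\,f}{d\,f+\nu}\right).$$

For $d=2$ both tail areas have closed forms and the agreement can be verified directly: with $b=2f/(2f+\nu)$,

$$P\left(F_{2,\nu}\geq f\right)=\left(1+\frac{2f}{\nu}\right)^{-\nu/2}=(1-b)^{\nu/2}=P\left(\text{Beta}(1,\nu/2)\geq b\right).$$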

An appealing aspect of the “traditional” procedures is that the $t$ -statistic $T$ and the $F$ -statistic $F_{\text{trad}}$ are both constructed as ratios of independent quantities. Because, in both cases, the decision rule is based on an assessment of the relative size of the numerator and denominator, it is conceivable that independence may have been a key factor in establishing the tradition, as an informal comparison of independent quantities is easier. Under the null, the denominators of the “null hypothesis” test statistics are more efficient estimators of variability (have more degrees of freedom) than their “traditional” counterparts. However, this gain in efficiency is offset by the dependence between numerator and denominator (see LaMotte (1994) for a related discussion).
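The independence claim can be illustrated by simulation for the one-sample case. The Python sketch below is an added illustration, not part of the paper: under the null it draws Gaussian samples and estimates the correlation between the numerator $n(\bar{x}-\mu_{0})^{2}$ and two candidate denominators, the sum of squares about $\bar{x}$ (traditional) and the sum of squares about $\mu_{0}$ (null-based). The sample size, number of replications, seed, and function names are arbitrary assumptions. For Gaussian data the first correlation is zero in theory, and the second equals $1/\sqrt{n}$ because the null-based sum of squares contains the numerator as one of its two independent components.

```python
import math
import random


def simulate_components(n, reps, mu0=0.0, sigma=1.0, seed=1):
    """Simulate numerator and both denominators under H0: mu = mu0."""
    rng = random.Random(seed)
    numerators, denom_trad, denom_null = [], [], []
    for _ in range(reps):
        x = [rng.gauss(mu0, sigma) for _ in range(n)]
        x_bar = sum(x) / n
        numerators.append(n * (x_bar - mu0) ** 2)
        # Traditional denominator: sum of squares about the sample mean.
        denom_trad.append(sum((xi - x_bar) ** 2 for xi in x))
        # Null-based denominator: sum of squares about the hypothesized mean.
        denom_null.append(sum((xi - mu0) ** 2 for xi in x))
    return numerators, denom_trad, denom_null


def correlation(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    m = len(a)
    a_bar, b_bar = sum(a) / m, sum(b) / m
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


n, reps = 10, 20000
num, den_trad, den_null = simulate_components(n, reps)
corr_trad = correlation(num, den_trad)
corr_null = correlation(num, den_null)

print(f"traditional denominator: correlation = {corr_trad:.3f}")
print(f"null-based denominator:  correlation = {corr_null:.3f}")
print(f"theoretical value 1/sqrt(n)          = {1 / math.sqrt(n):.3f}")
```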

In addition to the basic guiding principles, other considerations may be at play when a certain tradition is established of preferring one form of a test procedure over another for a given problem. For the nested model comparison, we already noted one desirable feature exhibited by $F_{\text{trad}}$ , namely that its numerator and denominator are independent. Another feature worth noting is that the denominator of $F_{\text{trad}}$ does not depend on the particular reduced model under consideration while the denominator of $F_{\text{null}}$ does. Although this is not much of a computational burden, it is intuitively appealing to be able to use the same yardstick in the denominator when testing different nested models against the same full model. Further, the graphical example of Section 4.3 illustrates that when the value of the statistic itself is of interest, rather than the formal testing decision, there may be practical reasons for preferring the use of one statistic over the other.

In Section 5 we reviewed three popular methods for building test statistics (the score, Wald, and likelihood ratio methods), discussing the different emphasis that they place on the null and alternative hypotheses. For all cases examined in this paper, the three methods yield asymptotically equivalent procedures while emphasizing different features of the testing problem. As noted in Engle (1984), this is related to the different metrics used to evaluate discrepancy between the null and the alternative. The Wald test accounts directly for differences in the parameter values, the likelihood ratio test measures differences in the log-likelihoods, and the score test assesses how steep the slope of the log-likelihood is at the null value. While under very general conditions the three methods yield procedures that are asymptotically equivalent, we have noticed that the resulting finite sample tests may differ for independent Bernoulli data. Engle (1984) presents additional examples where finite-sample conclusions might differ, comments on the different insight that the various formulations might bring to bear for specific models, and suggests that potential computational considerations might induce the analyst to opt for one of the tests over the other two.

In sum, while we do not have a conclusive explanation as to why certain traditions have established themselves as the standard of practice for specific problems, we believe that these issues, often overlooked, are worth ruminating on, as they help us better see what considerations lead to the preference of one statistical procedure over another. Choosing the right test statistic for a particular problem can be somewhat of an art, and understanding the similarities, differences, advantages, and disadvantages of the choice in the simple settings we considered may be helpful when turning to more complicated settings.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants No. SES-1424481, No. DMS-1613110, and No. SES-1921523.

Bibliography

1. A. Agresti, B. A. Coull (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52, no. 2, pp. 119–126.
2. G. Casella, R. Berger (2002). Statistical Inference. Duxbury-Thomson Learning, Second ed.
3. F. Chao, P. Gerland, A. R. Cook, L. Alkema (2019). Systematic assessment of the sex ratio at birth for all countries and estimation of national imbalances and regional reference levels. Proceedings of the National Academy of Sciences, 116, no. 19, pp. 9303–9311.
4. R. F. Engle (1984). Chapter 13 Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Elsevier, vol. 2 of Handbook of Econometrics, pp. 775–826.
5. I. Good (1986). Comments, conjectures, and conclusions: C 258 editorial note on C 257 regarding the t-test. Journal of Statistical Computation and Simulation, 25, no. 3-4, pp. 296–297.
6. L. R. LaMotte (1994). A note on the role of independence in t statistics constructed from linear statistics in regression models. The American Statistician, 48, no. 3, pp. 238–240.
7. J. J. Lefante Jr, A. K. Shah (1986). C 257. A note on the one-sample t-test. Journal of Statistical Computation and Simulation, 25, no. 3-4, pp. 295–296.
8. E. L. Lehmann (1986). Testing Statistical Hypotheses. John Wiley & Sons.
