High dimensional statistical inference: theoretical development to data analytics

Deepak Nag Ayyala

arXiv:1908.06600·math.ST·August 20, 2019


TL;DR

This paper provides a comprehensive overview of high dimensional statistical inference, covering theoretical developments and computational tools essential for big data analytics in various applied fields.

Contribution

It offers an in-depth synthesis of recent advances in high dimensional inference methods and their practical applications in data analytics.

Findings

01

Summarizes key theoretical developments in high dimensional inference.

02

Discusses computational tools for big data analysis.

03

Highlights challenges and solutions in high dimensional data applications.


Equations

$$\widehat{\eta}-\eta=\sum_{k=1}^{p}\left(\widehat{\theta}_{nk}-\theta_{k}\right)=p\,o_{p}(n^{-1/2})=o_{p}(n^{-1/2}).$$

$$\widehat{\eta}-\eta=p\,o_{p}(n^{-1/2})=o_{p}(n^{1/2}).$$

$$H_{0}:\boldsymbol{\mu}_{1}=\boldsymbol{\mu}_{2}\quad\mbox{vs.}\quad H_{A}:\boldsymbol{\mu}_{1}\neq\boldsymbol{\mu}_{2}.$$

$$T^{2}_{Hot}=\frac{n+m-p-1}{(n+m-2)p}\,\frac{nm}{n+m}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\mathcal{S}^{-1}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)$$

$$T_{CF}=\sum_{k=1}^{p}\frac{\overline{X}_{k}-\overline{Y}_{k}}{\mathcal{S}_{kk}},$$

$$T_{Demp}=\frac{\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)}{\sum_{k=1}^{n+m-2}\mathbf{W}_{k}^{\top}\mathbf{W}_{k}},$$

$$\mathbb{E}\left\{\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)\right\}=\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)+\left(\frac{1}{n}+\frac{1}{m}\right){\rm tr}(\Sigma).$$

$$\mathcal{M}_{n}=\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)-\frac{n+m}{nm}{\rm tr}(\mathcal{S}).$$

$$T_{BS}=\frac{\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)-\frac{n+m}{nm}{\rm tr}(\mathcal{S})}{\frac{n+m}{nm}\sqrt{\frac{2(n+m-1)(n+m-2)}{(n+m)(n+m-3)}\left\{{\rm tr}(\mathcal{S}^{2})-(n+m-2)^{-1}{\rm tr}^{2}\mathcal{S}\right\}}}$$

$$\lim_{p\rightarrow\infty}\frac{\lambda_{\max}}{\sqrt{{\rm tr}\Sigma^{2}}}=\lim_{p\rightarrow\infty}\frac{1+(p-1)\rho}{\sqrt{p+(p^{2}-p)\rho^{2}}}=1,$$

$$\mathbf{X}=\boldsymbol{\mu}+\Gamma\mathbf{Z},$$

$$\mathbb{E}\left(\sum_{i\neq j}^{n}\mathbf{X}_{i}^{\top}\mathbf{X}_{j}\right)=n(n-1)\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\mu}_{1},\quad\mathbb{E}\left(\sum_{i\neq j}^{m}\mathbf{Y}_{i}^{\top}\mathbf{Y}_{j}\right)=m(m-1)\boldsymbol{\mu}_{2}^{\top}\boldsymbol{\mu}_{2},\quad\mathbb{E}\left(\sum_{i,j}\mathbf{X}_{i}^{\top}\mathbf{Y}_{j}\right)=nm\,\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\mu}_{2}.$$

$$\mathcal{T}_{n}=\frac{1}{n(n-1)}\sum_{i\neq j}^{n}\mathbf{X}_{i}^{\top}\mathbf{X}_{j}+\frac{1}{m(m-1)}\sum_{i\neq j}^{m}\mathbf{Y}_{i}^{\top}\mathbf{Y}_{j}-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbf{X}_{i}^{\top}\mathbf{Y}_{j}$$

$${\rm var}\left(\mathcal{T}_{n}\right)=\left[\frac{2}{n(n-1)}{\rm tr}(\Sigma_{1}^{2})+\frac{2}{m(m-1)}{\rm tr}(\Sigma_{2}^{2})+\frac{4}{nm}{\rm tr}(\Sigma_{1}\Sigma_{2})\right]\left\{1+o(1)\right\}.$$

${\rm tr}(\Sigma_{1}^{2})$, ${\rm tr}(\Sigma_{2}^{2})$, ${\rm tr}(\Sigma_{1}\Sigma_{2})$

$$T_{CQ}=\frac{\mathcal{T}_{n}}{\sqrt{\frac{2}{n(n-1)}\widehat{{\rm tr}(\Sigma_{1}^{2})}+\frac{2}{m(m-1)}\widehat{{\rm tr}(\Sigma_{2}^{2})}+\frac{4}{nm}\widehat{{\rm tr}(\Sigma_{1}\Sigma_{2})}}}$$

$$\mathcal{W}_{n}=\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\mathcal{D}_{\mathcal{S}}^{-1}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)=\sum_{k=1}^{p}\frac{\left(\overline{X}_{k}-\overline{Y}_{k}\right)^{2}}{\mathcal{S}_{kk}},$$

$$\mathbb{E}\left(\mathcal{W}_{n}\right)=\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)+\left(\frac{1}{n}+\frac{1}{m}\right)\frac{n+m}{n+m-2}\,p.$$

$$T_{SD}=\frac{\frac{nm}{n+m}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\mathcal{D}_{\mathcal{S}}^{-1}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)-\frac{(n+m)p}{n+m-2}}{\sqrt{2\left({\rm tr}R^{2}-p^{2}/n\right)\left(1+{\rm tr}R^{2}/p^{3/2}\right)}}$$

$$\frac{\lambda_{\max}}{\sqrt{{\rm tr}\Sigma^{2}}}=\frac{p^{\omega}}{\sqrt{p+p^{2\omega}-1}}\rightarrow 0,\quad\frac{{\rm tr}\Sigma^{4}}{{\rm tr}^{2}\Sigma^{2}}=\frac{p+p^{4\omega}-1}{\left(p+p^{2\omega}-1\right)^{2}}\rightarrow 0.$$

$$U_{n}^{*}=\frac{1}{n(n-1)}\sum_{i\neq j}^{n}\mathbf{X}_{i}^{\top}\mathcal{D}_{\Sigma}^{-1}\mathbf{X}_{j}+\frac{1}{m(m-1)}\sum_{i\neq j}^{m}\mathbf{Y}_{i}^{\top}\mathcal{D}_{\Sigma}^{-1}\mathbf{Y}_{j}-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbf{X}_{i}^{\top}\mathcal{D}_{\Sigma}^{-1}\mathbf{Y}_{j}$$

$$\mathcal{S}_{(i,j)}^{(1)}=\frac{(n-3)\mathcal{S}_{1(i,j)}+(m-1)\mathcal{S}_{2}}{n+m-4},\quad\mathcal{S}_{(i,j)}^{(2)}=\frac{(n-1)\mathcal{S}_{1}+(m-3)\mathcal{S}_{2(i,j)}}{n+m-4},\quad\mathcal{S}_{(i,j)}^{(12)}=\frac{(n-2)\mathcal{S}_{1(i)}+(m-2)\mathcal{S}_{2(j)}}{n+m-4},$$

$$U_{n}=\frac{n+m-6}{n+m-4}\left(\frac{1}{n(n-1)}\sum_{i\neq j}^{n}\mathbf{X}_{i}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(1)}}^{-1}\mathbf{X}_{j}+\frac{1}{m(m-1)}\sum_{i\neq j}^{m}\mathbf{Y}_{i}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(2)}}^{-1}\mathbf{Y}_{j}-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbf{X}_{i}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(12)}}^{-1}\mathbf{Y}_{j}\right),$$

$${\rm var}\left(U_{n}\right)=\left(\frac{n+m-6}{n+m-4}\right)^{2}\left\{\frac{2}{n(n-1)}{\rm tr}(\mathcal{R}_{1}^{2})+\frac{2}{m(m-1)}{\rm tr}(\mathcal{R}_{2}^{2})+\frac{4}{nm}{\rm tr}(\mathcal{R}_{1}\mathcal{R}_{2})\right\},$$

$${\rm tr}(\mathcal{R}_{1}^{2})=\frac{1}{n(n-1)}{\rm tr}\left\{\sum_{i=1}^{n}\sum_{j\neq i}\mathbf{X}_{i}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(1)}}^{-1}\left(\mathbf{X}_{j}-\overline{\mathbf{X}}_{(i,j)}\right)\mathbf{X}_{j}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(1)}}^{-1}\left(\mathbf{X}_{i}-\overline{\mathbf{X}}_{(i,j)}\right)\right\},$$

$${\rm tr}(\mathcal{R}_{2}^{2})=\frac{1}{m(m-1)}{\rm tr}\left\{\sum_{i=1}^{m}\sum_{j\neq i}\mathbf{Y}_{i}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(2)}}^{-1}\left(\mathbf{Y}_{j}-\overline{\mathbf{Y}}_{(i,j)}\right)\mathbf{Y}_{j}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(2)}}^{-1}\left(\mathbf{Y}_{i}-\overline{\mathbf{Y}}_{(i,j)}\right)\right\},$$

$${\rm tr}(\mathcal{R}_{1}\mathcal{R}_{2})=\frac{1}{nm}{\rm tr}\left\{\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbf{X}_{i}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(12)}}^{-1}\left(\mathbf{Y}_{j}-\overline{\mathbf{Y}}_{(j)}\right)\mathbf{Y}_{j}^{\top}\mathcal{D}_{\mathcal{S}_{(i,j)}^{(12)}}^{-1}\left(\mathbf{X}_{i}-\overline{\mathbf{X}}_{(i)}\right)\right\},$$

$$T_{PA}=\frac{U_{n}}{\left(\frac{n+m-6}{n+m-4}\right)^{2}\left\{\frac{2}{n(n-1)}{\rm tr}(\mathcal{R}_{1}^{2})+\frac{2}{m(m-1)}{\rm tr}(\mathcal{R}_{2}^{2})+\frac{4}{nm}{\rm tr}(\mathcal{R}_{1}\mathcal{R}_{2})\right\}}$$

$$H_{0:R}:R\boldsymbol{\mu}_{1}=R\boldsymbol{\mu}_{2}\quad\mbox{vs.}\quad H_{A:R}:R\boldsymbol{\mu}_{1}\neq R\boldsymbol{\mu}_{2}.$$

$$\mathcal{S}=U\Lambda U^{\top},\quad U=(u_{1},\dots,u_{p}),\quad\Lambda={\rm diag}(\lambda_{1},\dots,\lambda_{p}),$$


Department of Population Health Sciences, Medical College of Georgia, Augusta University

Augusta, Georgia 30912

Abstract

This article is due to appear in the Handbook of Statistics, Vol. 43, Elsevier/North-Holland, Amsterdam, edited by Arni S. R. Srinivasa Rao and C. R. Rao.

In modern day analytics, there is ever growing need to develop statistical models to study high dimensional data. Between dimension reduction, asymptotics-driven methods and random projection based methods, there are several approaches developed so far. For high dimensional parametric models, estimation and hypothesis testing for mean and covariance matrices have been extensively studied. However, practical implementation of these methods are fairly limited and are primarily restricted to researchers involved in high dimensional inference. With several applied fields such as genomics, metagenomics and social networking, high dimensional inference is a key component of big data analytics. In this chapter, a comprehensive overview of high dimensional inference and its applications in data analytics is provided. Key theoretical developments and computational tools are presented, giving readers an in-depth understanding of challenges in big data analysis.

keywords:

High-dimension, asymptotics, hypothesis testing, dependent data, multivariate analysis

1 Introduction
2 Mean vector testing
2.1 Independent observations
2.2 Projection based tests
2.3 Random projections
2.4 Other approaches
2.5 Dependent observations
3 Covariance matrix
3.1 Estimation
3.2 Hypothesis testing
4 Discrete multivariate models
4.1 Multinomial distribution
4.2 Compound Multinomial models
4.3 Other distributions
4.3.1 Bernoulli distribution
4.3.2 Binomial distribution
4.3.3 Poisson distribution
5 Conclusion

1 Introduction

High dimensional inference and big data analytics are gaining significant prominence in several applied fields such as genomics, imaging neuroscience, econometrics, astronomy and cyber-security [31]. With the accelerated development of technology to study various biological processes and natural phenomena, there is an exponential growth in the amount of data being generated. Publicly available data sets for genomics such as the Cancer Genome Atlas (https://portal.gdc.cancer.gov/) hold massive amounts of data, on the scale of petabytes (1 petabyte = 1024 terabytes). In terms of the number of variables collected (usually represented by $p$) and the number of samples or replicates (represented by $n$), big data can be broadly classified into two categories: (i) large $n$ data sets and (ii) large $p$, small $n$ data sets. In data sets with a large number of samples, typically arising in astronomy, challenges are mainly computational rather than statistical. Statistical problems in these data sets involve identifying an extremely small number of signals from a large number of observations, also known as the needle in a haystack problem. The large $p$, small $n$ paradigm is commonly encountered in biomedical research areas such as genomics, metagenomics and neuroimaging. The goal in these data sets is to draw inference on a large number of variables simultaneously using a small number of observations.

Traditional statistical tools are built on the assumption that there is more known than unknown, i.e. $n>p$. When $p>n$, asymptotic properties of estimates for parameters such as mean and variance will no longer be valid. For instance, consider $p$ parameters $\theta_{1},\ldots,\theta_{p}$ and let $\eta=\sum_{k=1}^{p}\theta_{k}$ be our parameter of interest. Let $\widehat{\theta}_{nk}$ be a first-order consistent estimator of $\theta_{k}$ for $k=1,\ldots,p$, i.e. $\widehat{\theta}_{nk}-\theta_{k}=o_{p}(n^{-1/2})$. When $p$ is fixed and finite, $\widehat{\eta}=\sum_{k=1}^{p}\widehat{\theta}_{nk}$ will be first-order consistent for $\eta$ because

$$\widehat{\eta}-\eta=\sum_{k=1}^{p}\left(\widehat{\theta}_{nk}-\theta_{k}\right)=p\,o_{p}(n^{-1/2})=o_{p}(n^{-1/2}).$$

But if the dimension is a linear function of the sample size, i.e. $p=O(n)$, then this consistency fails because the sum of a growing number of error terms is no longer guaranteed to vanish,

$$\widehat{\eta}-\eta=p\,o_{p}(n^{-1/2})=o_{p}(n^{1/2}).$$

This problem can be solved by considering second-order consistent estimators, whose derivation requires additional attention to the asymptotic properties of $\widehat{\theta}_{nk}$. In multivariate models, there are two main parameters of interest: the mean vector and the covariance matrix. Estimation of these parameters and construction of hypothesis tests for high dimensional data require additional calculations to have good asymptotic behaviour.

Large data sets with discrete data are very commonly observed in various fields. In text mining, the distribution of words in a document is recorded by counting the number of occurrences of each word in the document. In genomics and metagenomics, recently developed high-throughput experimental procedures are making it possible to record counts of gene expression and bacterial abundance in samples. However, statistical literature on multivariate models for discrete data is very sparse. Most of the multivariate probability models that are encountered in high dimensional literature are continuous. Unlike the continuous distributions, multivariate analogues of standard discrete models such as Bernoulli, binomial, Poisson, etc. are not extensively developed. In the univariate case, mixture models such as beta-binomial and Poisson-gamma have been developed to address over-dispersion in count data. Multivariate models do not exist for all mixture distributions.

In this chapter, we will look at three topics of interest in high dimensional inference. In Section 2, we will look at hypothesis tests for the mean vector. Estimation and hypothesis tests for the covariance matrix will be addressed in Section 3. Formulation of standard discrete multivariate models and parameter estimation of hierarchical multivariate count models are presented in Section 4. Finally, we will conclude with some challenges that still lie ahead of us in high dimensional inference in Section 5.

2 Mean vector testing

The first moment, the mean, is the most commonly studied parameter when exploring the properties of distributions. The mean or expected value of a random variable is a measure of location of the center of the data. When comparing two $p$-dimensional distributions, equality of means indicates that the distributions are centered around the same point in the sample space. Given two variables characterized by their means, $\mathbf{X}\sim\mathcal{F}_{p}(\cdot,\boldsymbol{\mu}_{1})$ and $\mathbf{Y}\sim\mathcal{G}_{p}(\cdot,\boldsymbol{\mu}_{2})$, the hypothesis of comparing means can be stated as

$$H_{0}:\boldsymbol{\mu}_{1}=\boldsymbol{\mu}_{2}\quad\mbox{vs.}\quad H_{A}:\boldsymbol{\mu}_{1}\neq\boldsymbol{\mu}_{2}.\tag{1}$$

Given samples $\mathbf{X}_{1},\ldots,\mathbf{X}_{n}$ and $\mathbf{Y}_{1},\ldots,\mathbf{Y}_{m}$ from the two distributions, the sample means $\overline{\mathbf{X}}=n^{-1}\mathop{\sum}_{i=1}^{n}\mathbf{X}_{i}$ and $\overline{\mathbf{Y}}=m^{-1}\mathop{\sum}_{j=1}^{m}\mathbf{Y}_{j}$ are natural unbiased estimators of $\boldsymbol{\mu}_{1}$ and $\boldsymbol{\mu}_{2}$ respectively. Hence the difference of sample averages $\overline{\mathbf{X}}-\overline{\mathbf{Y}}$ will be unbiased for $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}$. To calculate a test statistic, a functional needs to be defined to map the multivariate difference $\overline{\mathbf{X}}-\overline{\mathbf{Y}}$ onto the real line.

Hotelling’s $T^{2}$ [41] was the first such test constructed which uses the Mahalanobis distance as the functional,

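$$T^{2}_{Hot}=\frac{n+m-p-1}{(n+m-2)p}\,\frac{nm}{n+m}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\mathcal{S}^{-1}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)$$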

where $\mathcal{S}=(n+m-2)^{-1}\left\{\mathop{\sum}_{i=1}^{n}\left(\mathbf{X}_{i}-\overline{\mathbf{X}}\right)\left(\mathbf{X}_{i}-\overline{\mathbf{X}}\right)^{\top}+\mathop{\sum}_{j=1}^{m}\left(\mathbf{Y}_{j}-\overline{\mathbf{Y}}\right)\left(\mathbf{Y}_{j}-\overline{\mathbf{Y}}\right)^{\top}\right\}$ is the pooled sample covariance matrix with ${\rm rank}(\mathcal{S})=\min(p,n+m-2)$. Under $H_{0}$, the test statistic follows an $F_{p,n+m-p-1}$ distribution provided $\mathcal{F}_{p}$ and $\mathcal{G}_{p}$ are both multivariate Gaussian distributions with a common covariance matrix $\Sigma$ and $p<n+m-1$.
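The computation of $T^{2}_{Hot}$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `hotelling_t2` and the convention that rows are observations are assumptions made here, and the p-value (an upper tail probability of the $F_{p,n+m-p-1}$ distribution) is left out.

```python
import numpy as np

def hotelling_t2(X, Y):
    """Scaled two-sample Hotelling statistic; rows of X and Y are observations.

    Hypothetical helper written for illustration only.
    """
    n, p = X.shape
    m = Y.shape[0]
    if p > n + m - 2:
        # The pooled covariance has rank min(p, n + m - 2), so it is singular here.
        raise ValueError("pooled covariance is singular when p > n + m - 2")
    diff = X.mean(axis=0) - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S = (Xc.T @ Xc + Yc.T @ Yc) / (n + m - 2)    # pooled sample covariance
    quad = diff @ np.linalg.solve(S, diff)        # diff^T S^{-1} diff
    scale = (n + m - p - 1) / ((n + m - 2) * p) * (n * m) / (n + m)
    return scale * quad

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=(40, 5))
t2 = hotelling_t2(X, Y)
print(t2)
```

Solving the linear system instead of forming $\mathcal{S}^{-1}$ explicitly is a numerical convenience; the statistic is unchanged by any invertible linear transformation of the variables.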

The Hotelling’s $T^{2}$ test is developed for the two-sided alternative in (1). The functional in $T^{2}_{Hot}$ has a quadratic form and is always non-negative as $\mathcal{S}$ is positive definite. Hence $T^{2}_{Hot}$ does not differentiate between $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}=\boldsymbol{\delta}$ and $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}=-\boldsymbol{\delta}$. To define a one-sided alternative, an order in $\mathbb{R}^{p}$ should first be determined. For example, the lexicographic order or the partial element-wise order can be considered. For the one sample case, Kudo [49] developed a likelihood ratio test (LRT) for the alternative $H_{A}:\boldsymbol{\mu}_{1}>\mathbf{0}$, where the inequality indicates $\mu_{1i}\geq 0$ for $i=1,\ldots,p$ with at least one $\mu_{1i}>0$. Maximization over the positive cone is done using quadratic programming. Calculating the p-value is computationally intensive due to the $2^{p}$ possible patterns of zero and positive elements under the alternative. A two-sample extension of this test was not provided.

The Hotelling’s $T^{2}$ test has three major deficiencies when doing inference in high dimensions.

Issue I: The test is defined only when $p<n+m-2$. Typically in data sets arising in genomics and other high throughput experiments, the dimension is in the thousands and the number of samples is in the tens or hundreds.

Issue II: The test holds only for comparing Gaussian distributions, an assumption that is not straightforward to verify in higher dimensions. In distributions with restricted sample space such as the Dirichlet, the mean vector is a location parameter.

Issue III: The test requires observations to be independently and identically distributed, i.e. *i.i.d.* In imaging studies such as fMRI experiments, the observations are not *i.i.d.* The inherent dependence structure in such data sets is ignored by $T^{2}_{Hot}$, potentially leading to biased estimates.

To address these shortcomings, one has to use test statistics that take into account the bias due to high dimension and the dependence structure in the data. For instance, to address Issue I, there are two approaches that can be used. The first method is to study the asymptotic properties of a functional of $\overline{\mathbf{X}}-\overline{\mathbf{Y}}$ and construct a large-sample test. This method can also relax Issue II by accommodating non-Gaussian distributions through conditions on the moments of the distribution. The second method is to reduce dimension by projecting the $p$-variate samples into a lower dimensional space such that traditional tests such as $T^{2}_{Hot}$ can be applied. The dependence structure in Issue III is complicated since the entire autocovariance function of the model needs to be considered. Parametrizing the autocovariance function and restricting the dependence structure can help reduce the complexity of the problem.

2.1 Independent observations

First let us address testing the hypothesis in (1) for *i.i.d.* samples in high dimension. When $p>n+m-2$, the pooled sample covariance matrix $\mathcal{S}$ is rank-deficient and does not have a well-defined inverse, so the Mahalanobis distance based on $\mathcal{S}^{-1}$ is not a valid measure. To construct a test statistic, we need a functional of $\overline{\mathbf{X}}-\overline{\mathbf{Y}}$ which is zero in expectation when $H_{0}$ is true and non-zero when $H_{A}$ is true. A natural choice of such a functional which does not involve $\mathcal{S}^{-1}$ is the $\ell_{d}$-norm for $d>0$. When $d=1$, Chung and Fraser [24] proposed a permutation test using the sum of element-wise $t$-test statistics,

$$T_{CF}=\sum_{k=1}^{p}\frac{\overline{X}_{k}-\overline{Y}_{k}}{\mathcal{S}_{kk}},$$

as the test statistic. The Euclidean norm (by abuse of notation, the squared Euclidean norm is referred to as the Euclidean norm unless otherwise stated) is preferred over the $\ell_{1}$-norm due to ease of calculation of moments. Dempster [30] developed the first test statistic using the Euclidean norm of the difference of means, $\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)$. The test statistic is given by

$$T_{Demp}=\frac{\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)}{\sum_{k=1}^{n+m-2}\mathbf{W}_{k}^{\top}\mathbf{W}_{k}},$$

where $\{\mathbf{W}_{k},k=1,\ldots,n+m-2\}$ are orthogonal vectors such that the set of vectors $\{(n+m)^{-1}(n\overline{\mathbf{X}}+m\overline{\mathbf{Y}}),\overline{\mathbf{X}}-\overline{\mathbf{Y}},\mathbf{W}_{1},\ldots,\mathbf{W}_{n+m-2}\}$ forms an orthogonal basis for the space spanned by $\{\mathbf{X}_{1},\ldots,\mathbf{X}_{n},\mathbf{Y}_{1},\ldots,\mathbf{Y}_{m}\}$. The Dempster test is not exact: under the null hypothesis the statistic is approximately distributed as $F_{r,(n+m-2)r}$, where the parameter $r$ is unknown and is estimated from the data. However, both these tests ignore the covariance structure and have been shown to perform poorly even when $p$ is close to $n+m$.

To construct a large sample test, the asymptotic properties of the Euclidean norm of $\overline{\mathbf{X}}-\overline{\mathbf{Y}}$ need to be studied. When the two distributions are homogeneous with covariance matrix $\Sigma$ , we have

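$$\mathbb{E}\left\{\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)\right\}=\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)+\left(\frac{1}{n}+\frac{1}{m}\right){\rm tr}(\Sigma).\tag{5}$$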

Without loss of generality, assume $n<m$. Under $H_{0}$, $\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)$ has expected value equal to $\mathcal{B}_{n}=(1/n+1/m){\rm tr}\left(\Sigma\right)\leq 2n^{-1}p\,\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $\Sigma$. If $p$ is fixed, then $\lim\limits_{n\rightarrow\infty}\mathcal{B}_{n}=0$, implying $\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)$ is asymptotically unbiased. But if $p$ increases with $n$, then the Euclidean norm needs to be adjusted for this bias. For instance, if we assume $p=Cn^{\alpha}$ for some $\alpha>0$, then the bound becomes $2Cn^{\alpha-1}\lambda_{\max}$, which diverges when $\alpha>1$. Further note that $\mathcal{B}_{n}$ depends on the two groups only through $\Sigma$ and not on the form of their distributions.

To adjust the bias, consider the pooled sample covariance matrix $\mathcal{S}$ , which is unbiased for $\Sigma$ . Since trace is a linear functional, ${\rm tr}(\mathcal{S})$ will be unbiased for ${\rm tr}(\Sigma)$ . This gives

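$$\mathcal{M}_{n}=\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)-\frac{n+m}{nm}{\rm tr}(\mathcal{S})$$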

as an unbiased estimator of $\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)$. Using its quadratic form, the variance of $\mathcal{M}_{n}$ can be calculated as ${\rm var}\left(\mathcal{M}_{n}\right)=2(1/n+1/m)^{2}\{1+1/(n+m-2)\}{\rm tr}\Sigma^{2}\{1+o(1)\}$. The $o(1)$ error term vanishes under the Gaussian assumption. To construct a test statistic using $\mathcal{M}_{n}$, a ratio consistent estimator of ${\rm var}\left(\mathcal{M}_{n}\right)$ is needed.
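The size of the bias and the effect of the trace correction can be checked with a small Monte Carlo experiment. The sketch below assumes Gaussian data with $\Sigma=\mathcal{I}$ under $H_{0}$; the helper name `m_n`, the sample sizes and the number of replicates are arbitrary choices made for illustration.

```python
import numpy as np

def m_n(X, Y):
    """Bias-adjusted squared Euclidean norm of the mean difference (hypothetical helper)."""
    n, m = X.shape[0], Y.shape[0]
    diff = X.mean(axis=0) - Y.mean(axis=0)
    # tr(S) for the pooled covariance: total centred sum of squares over n + m - 2
    ss = ((X - X.mean(axis=0)) ** 2).sum() + ((Y - Y.mean(axis=0)) ** 2).sum()
    tr_S = ss / (n + m - 2)
    return diff @ diff - (n + m) / (n * m) * tr_S

rng = np.random.default_rng(1)
n, m, p, reps = 10, 15, 50, 2000
raw = np.empty(reps)   # unadjusted squared norm
adj = np.empty(reps)   # adjusted statistic M_n
for r in range(reps):
    X = rng.normal(size=(n, p))   # both groups have mean zero, Sigma = I
    Y = rng.normal(size=(m, p))
    d = X.mean(axis=0) - Y.mean(axis=0)
    raw[r] = d @ d
    adj[r] = m_n(X, Y)

bias = (1 / n + 1 / m) * p        # (1/n + 1/m) tr(Sigma) with Sigma = I
print(raw.mean(), bias, adj.mean())
```

The average of the unadjusted norm should sit near $(1/n+1/m)\,p$ even though the two means are equal, while the average of the adjusted statistic should sit near zero.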

In their seminal work, Bai and Saranadasa [7] used $\mathcal{M}_{n}$ to construct the test statistic

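$$T_{BS}=\frac{\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)-\frac{n+m}{nm}{\rm tr}(\mathcal{S})}{\frac{n+m}{nm}\sqrt{\frac{2(n+m-1)(n+m-2)}{(n+m)(n+m-3)}\left\{{\rm tr}(\mathcal{S}^{2})-(n+m-2)^{-1}{\rm tr}^{2}\mathcal{S}\right\}}}.$$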

The test statistic follows a standard normal distribution asymptotically under the following conditions:

(BS I) $p/n\rightarrow\delta>0$, indicating that $p$ grows proportionally to $n$.

(BS II) $n/(n+m)\rightarrow\kappa\in(0,1)$, meaning sample sizes from both groups have proportionate rates of increase.

(BS III) $\lambda_{\max}=o\left(\sqrt{{\rm tr}\Sigma^{2}}\right)$, which relates to the strength of the covariance structure.

(BS IV) $(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{\top}\Sigma(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})=o\left((1/n+1/m)\,{\rm tr}\Sigma^{2}\right)$ is a local alternative condition to calculate the asymptotic power, under which the variance estimate remains ratio consistent.
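A possible implementation of the Bai-Saranadasa statistic is sketched below. It assumes the standardisation uses the square root of the estimated variance, so that the statistic is on the standard normal scale, and it computes ${\rm tr}(\mathcal{S})$ and ${\rm tr}(\mathcal{S}^{2})$ from the $(n+m)\times(n+m)$ Gram matrix of the centred data instead of the $p\times p$ matrix $\mathcal{S}$. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def bai_saranadasa(X, Y):
    """Two-sample statistic based on the bias-adjusted Euclidean norm (illustrative sketch)."""
    n, p = X.shape
    m = Y.shape[0]
    N = n + m - 2
    diff = X.mean(axis=0) - Y.mean(axis=0)
    Z = np.vstack([X - X.mean(axis=0), Y - Y.mean(axis=0)])  # centred data, (n+m) x p
    G = Z @ Z.T / N            # Gram matrix; same non-zero eigenvalues as S = Z^T Z / N
    tr_S = np.trace(G)
    tr_S2 = np.sum(G * G)      # tr(S^2) = tr(G^2) = sum of squared entries of symmetric G
    c = (n + m) / (n * m)
    num = diff @ diff - c * tr_S
    var_hat = 2 * (N + 1) * N / ((N + 2) * (N - 1)) * (tr_S2 - tr_S ** 2 / N)
    return num / (c * np.sqrt(var_hat))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
Y = rng.normal(size=(25, 200))
t_null = bai_saranadasa(X, Y)         # equal means
t_shift = bai_saranadasa(X + 1.0, Y)  # every coordinate of the first mean shifted by 1
print(t_null, t_shift)
```

Working with the Gram matrix keeps the cost at order $(n+m)^{2}p$, which matters when $p$ is in the thousands and the sample sizes are small.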

Let us elaborate on condition (BS III) to understand how strong the covariance structure can be. Consider the independent elements case, $\Sigma=\mathcal{I}$ with $\lambda_{\max}=1$ and ${\rm tr}\Sigma^{2}=p$. Thus we have $\lambda_{\max}/\sqrt{{\rm tr}\Sigma^{2}}=1/\sqrt{p}\rightarrow 0$, so the condition holds. If we consider a covariance structure with geometrically decaying entries, $\Sigma_{ij}=\rho^{|i-j|}$ for $0<\rho<1$, then we have $\lambda_{\max}\leq(1+\rho)/(1-\rho)$ and ${\rm tr}\Sigma^{2}\approx p(1-\rho^{p})(1-\rho)^{-1}$, which also satisfies the condition for all values of $\rho$. The condition, however, does not allow covariance structures from the other end of the spectrum: an exchangeable covariance structure with $\Sigma_{ii}=1$ and $\Sigma_{ij}=\rho$ for all $i\neq j$ and some $0<\rho<1$ has $\lambda_{\max}=1+(p-1)\rho$ and ${\rm tr}\Sigma^{2}=p+(p^{2}-p)\rho^{2}$. This gives

$$\lim_{p\rightarrow\infty}\frac{\lambda_{\max}}{\sqrt{{\rm tr}\Sigma^{2}}}=\lim_{p\rightarrow\infty}\frac{1+(p-1)\rho}{\sqrt{p+(p^{2}-p)\rho^{2}}}=1,$$

which does not satisfy the condition.
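The three covariance structures discussed above can be compared numerically. The sketch below evaluates $\lambda_{\max}/\sqrt{{\rm tr}\Sigma^{2}}$ for increasing $p$; the helper names and the choice $\rho=0.5$ are assumptions made for illustration.

```python
import numpy as np

def bs3_ratio(Sigma):
    """lambda_max / sqrt(tr(Sigma^2)), the quantity that (BS III) requires to vanish."""
    lam_max = np.linalg.eigvalsh(Sigma)[-1]   # eigvalsh returns eigenvalues in ascending order
    return lam_max / np.sqrt(np.trace(Sigma @ Sigma))

def geometric_decay(p, rho):
    """Sigma_ij = rho^{|i-j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def exchangeable(p, rho):
    """Unit diagonal, rho everywhere else."""
    return (1 - rho) * np.eye(p) + rho * np.ones((p, p))

for p in (50, 200, 800):
    print(p,
          bs3_ratio(np.eye(p)),
          bs3_ratio(geometric_decay(p, 0.5)),
          bs3_ratio(exchangeable(p, 0.5)))
```

The first two columns should shrink as $p$ grows, while the exchangeable column should approach 1.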

The Bai-Saranadasa test statistic is highly regarded in high dimensional mean-vector testing literature. In addition to extending the test to higher dimensions, it also relaxed the normality assumption on the samples. Instead, the observations are assumed to be coming from a factor model of the form

$$\mathbf{X}=\boldsymbol{\mu}+\Gamma\mathbf{Z},\tag{7}$$

where $\mathbf{Z}=(Z_{1},\ldots,Z_{p})$ and the $Z_{i}$’s are continuous *i.i.d.* random variables with $E(Z_{i})=0$ and $E(Z_{i}^{4})=3+\Delta<\infty$. The covariance structure is determined by $\Gamma$ through the relationship $\Sigma=\Gamma\Gamma^{\top}$. When $\Delta=0$, the elements of $\mathbf{Z}$ are normally distributed. When $0<\Delta<\infty$, the $Z_{i}$’s have heavier tails than normal, yet have finite moments. Examples of distributions satisfying the moment conditions are the Laplace (double exponential) distribution and the centered gamma distribution.
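Data from such a factor model are straightforward to simulate. The sketch below uses Laplace innovations scaled to unit variance, for which $E(Z_{i}^{4})=6$, i.e. $\Delta=3$. The unit-variance scaling (implied by $\Sigma=\Gamma\Gamma^{\top}$), the Cholesky factor as the choice of $\Gamma$, and all names and parameter values are assumptions made for illustration.

```python
import numpy as np

def sample_factor_model(n, mu, Sigma, rng):
    """Draw n rows X_i = mu + Gamma Z_i with i.i.d. unit-variance Laplace Z (illustrative)."""
    p = len(mu)
    Gamma = np.linalg.cholesky(Sigma)    # one matrix satisfying Gamma @ Gamma.T == Sigma
    # Laplace with scale b has variance 2 b^2 and fourth moment 24 b^4,
    # so b = 1/sqrt(2) gives variance 1 and E(Z^4) = 6.
    Z = rng.laplace(scale=1 / np.sqrt(2), size=(n, p))
    return mu + Z @ Gamma.T, Z

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0, 0.5, 0.0])
idx = np.arange(4)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X, Z = sample_factor_model(20000, mu, Sigma, rng)

print(np.cov(X, rowvar=False).round(2))   # should be close to Sigma
print((Z ** 4).mean())                    # should be close to 6
```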

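In equation (5), the trace term comes only from the inner products of the $\mathbf{X}_{i}$’s and $\mathbf{Y}_{j}$’s with themselves. For any $i$, we have $\mathbb{E}(\mathbf{X}_{i}^{\top}\mathbf{X}_{i})=\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\mu}_{1}+{\rm tr}(\Sigma)$ and $\mathbb{E}(\mathbf{X}_{i}^{\top}\mathbf{X}_{j})=\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\mu}_{1}$ when $i\neq j$. Hence, removing the terms with $i=j$ from $n^{2}\mathbb{E}(\overline{\mathbf{X}}^{\top}\overline{\mathbf{X}})$ and $m^{2}\mathbb{E}(\overline{\mathbf{Y}}^{\top}\overline{\mathbf{Y}})$, we have

$$\mathbb{E}\left(\sum_{i\neq j}^{n}\mathbf{X}_{i}^{\top}\mathbf{X}_{j}\right)=n(n-1)\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\mu}_{1},\quad\mathbb{E}\left(\sum_{i\neq j}^{m}\mathbf{Y}_{i}^{\top}\mathbf{Y}_{j}\right)=m(m-1)\boldsymbol{\mu}_{2}^{\top}\boldsymbol{\mu}_{2},\quad\mathbb{E}\left(\sum_{i,j}\mathbf{X}_{i}^{\top}\mathbf{Y}_{j}\right)=nm\,\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\mu}_{2}.$$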

Combining the terms in the above equation, the statistic

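$$\mathcal{T}_{n}=\frac{1}{n(n-1)}\sum_{i\neq j}^{n}\mathbf{X}_{i}^{\top}\mathbf{X}_{j}+\frac{1}{m(m-1)}\sum_{i\neq j}^{m}\mathbf{Y}_{i}^{\top}\mathbf{Y}_{j}-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbf{X}_{i}^{\top}\mathbf{Y}_{j}$$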

has expected value equal to $(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{\top}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})$ .
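The double sums in $\mathcal{T}_{n}$ do not need to be evaluated pair by pair. Since $\sum_{i\neq j}\mathbf{X}_{i}^{\top}\mathbf{X}_{j}=\lVert\sum_{i}\mathbf{X}_{i}\rVert^{2}-\sum_{i}\lVert\mathbf{X}_{i}\rVert^{2}$, the statistic can be computed in time linear in the number of data entries. The sketch below, with hypothetical function names, implements this shortcut next to a brute-force version for comparison.

```python
import numpy as np

def t_n(X, Y):
    """Sum of cross-product terms with the i == j terms removed (fast version)."""
    n, m = X.shape[0], Y.shape[0]
    sx, sy = X.sum(axis=0), Y.sum(axis=0)
    xx = sx @ sx - np.sum(X * X)   # sum over i != j of X_i^T X_j
    yy = sy @ sy - np.sum(Y * Y)   # sum over i != j of Y_i^T Y_j
    xy = sx @ sy                   # sum over all (i, j) of X_i^T Y_j
    return xx / (n * (n - 1)) + yy / (m * (m - 1)) - 2 * xy / (n * m)

def t_n_bruteforce(X, Y):
    """Direct evaluation of the three double sums, for checking only."""
    n, m = X.shape[0], Y.shape[0]
    xx = sum(X[i] @ X[j] for i in range(n) for j in range(n) if i != j)
    yy = sum(Y[i] @ Y[j] for i in range(m) for j in range(m) if i != j)
    xy = sum(X[i] @ Y[j] for i in range(n) for j in range(m))
    return xx / (n * (n - 1)) + yy / (m * (m - 1)) - 2 * xy / (n * m)

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 30))
Y = rng.normal(size=(9, 30)) + 0.5
print(t_n(X, Y), t_n_bruteforce(X, Y))
```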

Chen and Qin [22] constructed a test statistic using $\mathcal{T}_{n}$ as the functional, which has zero expected value under $H_{0}$. They assumed that the data follow the factor model in equation (7). Sample sizes are restricted as in (BS II). A major criticism of $T_{BS}$ has been the restriction of homogeneity of the two populations, i.e. equal covariance structure. Relaxing this condition is a major achievement of the Chen and Qin test: the two populations are allowed to have unequal covariance structures, $\Sigma_{1}$ and $\Sigma_{2}$ respectively. This extension requires the local alternative condition in (BS IV) to be modified, with the rate holding with respect to both $\Sigma_{1}$ and $\Sigma_{2}$. The restriction on the strength of the covariance matrix in (BS III) is also modified to accommodate the heterogeneity. Another major accomplishment of the Chen and Qin test is removing the direct constraint between $p$ and $n$ imposed by (BS I).

The modified constraints on the model are summarized as follows:

(CQ III) ${\rm tr}\left(\Sigma_{a}\Sigma_{b}\Sigma_{c}\Sigma_{d}\right)=o\left[{\rm tr}^{2}\left\{\left(\Sigma_{1}+\Sigma_{2}\right)^{2}\right\}\right]$ for $a,b,c,d\in\{1,2\}$.

(CQ IV) $\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\Sigma_{a}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)=o\left[(n+m-2)^{-1}{\rm tr}\left\{\left(\Sigma_{1}+\Sigma_{2}\right)^{2}\right\}\right]$ for $a=1,2$.

Under the local alternative, variance of $\mathcal{T}_{n}$ is equal to

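$${\rm var}\left(\mathcal{T}_{n}\right)=\left[\frac{2}{n(n-1)}{\rm tr}(\Sigma_{1}^{2})+\frac{2}{m(m-1)}{\rm tr}(\Sigma_{2}^{2})+\frac{4}{nm}{\rm tr}(\Sigma_{1}\Sigma_{2})\right]\left\{1+o(1)\right\}.$$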

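As used in $T_{BS}$, $\left\{{\rm tr}(\mathcal{S}_{1}^{2})-n^{-1}{\rm tr}^{2}\mathcal{S}_{1}\right\}$ can be used as a ratio consistent estimator of ${\rm tr}(\Sigma_{1}^{2})$. Inspired by the removal of inner-product terms in $\mathcal{T}_{n}$, Chen and Qin argue that a similar rationale removes the need for a direct relationship between $p$ and $n$ as in (BS I). They proposed ratio consistent estimators of the form

$$\widehat{{\rm tr}(\Sigma_{1}^{2})}=\frac{1}{n(n-1)}\sum_{i\neq j}^{n}\mathbf{X}_{i}^{\top}\left(\mathbf{X}_{j}-\overline{\mathbf{X}}_{(i,j)}\right)\mathbf{X}_{j}^{\top}\left(\mathbf{X}_{i}-\overline{\mathbf{X}}_{(i,j)}\right),$$

$$\widehat{{\rm tr}(\Sigma_{2}^{2})}=\frac{1}{m(m-1)}\sum_{i\neq j}^{m}\mathbf{Y}_{i}^{\top}\left(\mathbf{Y}_{j}-\overline{\mathbf{Y}}_{(i,j)}\right)\mathbf{Y}_{j}^{\top}\left(\mathbf{Y}_{i}-\overline{\mathbf{Y}}_{(i,j)}\right),$$

$$\widehat{{\rm tr}(\Sigma_{1}\Sigma_{2})}=\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbf{X}_{i}^{\top}\left(\mathbf{Y}_{j}-\overline{\mathbf{Y}}_{(j)}\right)\mathbf{Y}_{j}^{\top}\left(\mathbf{X}_{i}-\overline{\mathbf{X}}_{(i)}\right),$$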

where $\overline{\mathbf{X}}_{(i)}=(n-1)^{-1}\mathop{\sum}_{k\neq i}^{n}\mathbf{X}_{k}$, $\overline{\mathbf{X}}_{(i,j)}=(n-2)^{-1}\mathop{\sum}_{k\neq i,j}^{n}\mathbf{X}_{k}$, $\overline{\mathbf{Y}}_{(i)}=(m-1)^{-1}\mathop{\sum}_{k\neq i}^{m}\mathbf{Y}_{k}$ and $\overline{\mathbf{Y}}_{(i,j)}=(m-2)^{-1}\mathop{\sum}_{k\neq i,j}^{m}\mathbf{Y}_{k}$. Finally, the Chen-Qin test statistic is given by

$$T_{CQ}=\frac{\mathcal{T}_{n}}{\sqrt{\frac{2}{n(n-1)}\widehat{{\rm tr}(\Sigma_{1}^{2})}+\frac{2}{m(m-1)}\widehat{{\rm tr}(\Sigma_{2}^{2})}+\frac{4}{nm}\widehat{{\rm tr}(\Sigma_{1}\Sigma_{2})}}}$$

which follows a standard normal distribution asymptotically under $H_{0}$.

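In $T_{BS}$ and $T_{CQ}$, the Euclidean norm is used as the functional to avoid inverting the sample covariance matrix, which is singular when $p>n$. While $\mathcal{S}$ is not invertible, its diagonal elements are all non-zero and hence invertible (a zero diagonal element implies the corresponding variable is a constant, which can be removed from the analysis). Using the diagonal elements, a modified Mahalanobis distance can be calculated as a weighted Euclidean norm,

$$\mathcal{W}_{n}=\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)^{\top}\mathcal{D}_{\mathcal{S}}^{-1}\left(\overline{\mathbf{X}}-\overline{\mathbf{Y}}\right)=\sum_{k=1}^{p}\frac{\left(\overline{X}_{k}-\overline{Y}_{k}\right)^{2}}{\mathcal{S}_{kk}},$$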

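where $\mathcal{D}_{\mathcal{S}}$ is the $p\times p$ diagonal matrix formed from the diagonal elements of $\mathcal{S}$. When the two groups are homogeneous, we have $\mathbb{E}\left(\overline{X}_{k}-\overline{Y}_{k}\right)^{2}=\left(\mu_{1k}-\mu_{2k}\right)^{2}+\left(1/n+1/m\right)\sigma_{kk}$ and $\mathbb{E}\left(\mathcal{S}_{kk}\right)=(n+m-2)/(n+m)\,\sigma_{kk}$. As the ratio of these two expected values is independent of the index $k$, we have

$$\mathbb{E}\left(\mathcal{W}_{n}\right)=\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)+\left(\frac{1}{n}+\frac{1}{m}\right)\frac{n+m}{n+m-2}\,p.$$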

Similar to the calculations in the Euclidean norm, it is straightforward to show using the quadratic form that ${\rm var}\left(\mathcal{W}_{n}\right)=2{\rm tr}\left(\mathcal{R}^{2}\right)\left\{1+o(1)\right\}$ .

Srivastava and Du [69] developed a test statistic based on $\mathcal{W}_{n}$ as the functional, adjusting for its expected value. The test statistic is valid under the following assumptions:

(SD I)

The dimension increases at a polynomial rate with respect to $n$ , $n=O\left(p^{\delta}\right),1/2<\delta\leq 1$ .

(SD II)

Sample sizes of the two groups, $n$ and $m$ , are constrained as in (BS II).

(SD III)

If $\mathcal{R}$ is the population correlation matrix and $\lambda_{1}\geq\ldots\geq\lambda_{p}$ are its eigenvalues, then $\lim_{p\rightarrow\infty}{\rm tr}\left(\mathcal{R}^{k}\right)/p\in\left(0,\infty\right)$ for $k=1,2,3,4$ and $\lambda_{1}=o\left(\sqrt{p}\right)$ .

(SD IV)

Means of the two groups satisfy the local alternative condition: $\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\mathcal{D}_{\Sigma}^{-1}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)\leq Mp/(n+m-2)(1/n+1/m)$ for some finite constant $M$ .

The Srivastava-Du test statistic is given by

[TABLE]

where $R=\mathcal{D}_{\mathcal{S}}^{-1/2}\mathcal{S}\mathcal{D}_{\mathcal{S}}^{-1/2}$ is the sample correlation matrix. The test statistic is asymptotically normal under the null hypothesis.

The condition imposed on the correlation structure in (SD III) is more restrictive than (BS III) and (CQ III). For example, consider $\Sigma=\mathcal{R}={\rm diag}(p^{\omega},1,\ldots,1)$ for some $1/4<\omega<1/2$ . Then ${\rm tr}\Sigma^{2}=p+p^{2\omega}-1$ , ${\rm tr}\Sigma^{4}=p+p^{4\omega}-1$ and $\lambda_{\max}=p^{\omega}$ . (BS III) and (CQ III) are satisfied as

[TABLE]

For (SD III), we have $\lambda_{\max}/\sqrt{p}=p^{\omega-1/2}\rightarrow 0$ but ${\rm tr}\mathcal{R}^{4}/p=\left(p+p^{4\omega}-1\right)/p=1+p^{4\omega-1}-p^{-1}$ , which is not bounded for $\omega>1/4$ .
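The divergence can also be checked numerically. The following Python sketch evaluates the closed-form traces of this diagonal example for increasing $p$ . The function name `trace_ratios` and the value $\omega=0.4$ are illustrative assumptions, not part of the original development.

```python
def trace_ratios(p, omega):
    """Closed-form trace ratios for Sigma = diag(p**omega, 1, ..., 1)."""
    lam_max = float(p) ** omega
    tr2 = (p - 1) + lam_max ** 2   # tr(Sigma^2) = p + p^(2*omega) - 1
    tr4 = (p - 1) + lam_max ** 4   # tr(Sigma^4) = p + p^(4*omega) - 1
    return {
        # ratio that should vanish for a condition of the (CQ III) type
        "tr4_over_tr2_squared": tr4 / tr2 ** 2,
        # largest eigenvalue relative to sqrt(p); vanishes when omega < 1/2
        "lam_max_over_sqrt_p": lam_max / p ** 0.5,
        # quantity that (SD III) requires to stay bounded
        "tr4_over_p": tr4 / p,
    }


# Illustrative choice omega = 0.4 (an assumption for this sketch).
for p in (10**2, 10**4, 10**6):
    print(p, trace_ratios(p, omega=0.4))
```

With this choice of $\omega$ , the first two ratios shrink as $p$ grows while the last one keeps increasing, which is the behaviour described above.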

Another major constraint of the Srivastava-Du test is that the observations are assumed to be normally distributed. Unlike $T_{BS}$ and $T_{CQ}$ , asymptotically equivalent expressions for ${\rm var}(\mathcal{W}_{n})$ are not established. Instead, the exact variance is derived using the properties of the normal distribution. In a sequence of papers, the authors have provided extensions to $T_{SD}$ that relax some of the assumptions. In Srivastava [73], the term ${\rm tr}R^{2}/p^{3/2}$ in the denominator of $T_{SD}$ was shown to converge to zero and hence dropped. In Srivastava-Kano [74], an extension to the heterogeneous case was developed. However, this test is inexact in the sense that the functional $\mathcal{W}_{n}^{*}$ has expected value equal to $\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)$ only in the limit.

Inspired by the idea of Chen and Qin [22], Park and Ayyala [64] modified the functional $\mathcal{W}_{n}$ by removing the inner product terms. Using the true covariance diagonal, the functional

[TABLE]

has expected value $\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)$ . Replacing the true covariances with consistent estimators, a leave-out approach has been implemented to maintain independence amongst the terms. For instance in $\mathbf{X}_{i}^{\top}\mathcal{D}_{\Sigma}^{-1}\mathbf{X}_{j}$ , the quantities $\mathbf{X}_{i},\mathbf{X}_{j}$ and $\widehat{\mathcal{D}_{\Sigma}}$ will be independent if $\widehat{\mathcal{D}_{\Sigma}}$ is constructed by leaving out $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$ . The pooled sample covariance matrix $\mathcal{S}=\left((n-1)\mathcal{S}_{1}+(m-1)\mathcal{S}_{2}\right)/(n+m-2)$ , where $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ are the sample covariance matrices of the two groups respectively, is not useful because $\mathcal{S}_{1}$ contains $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$ . If these two samples are removed from $\mathcal{S}_{1}$ , then $\mathcal{S}_{1(i,j)}=(n-3)^{-1}\mathop{\sum}_{k\neq i,j}^{n}\left(\mathbf{X}_{k}-\overline{\mathbf{X}}_{(i,j)}\right)\left(\mathbf{X}_{k}-\overline{\mathbf{X}}_{(i,j)}\right)^{\top}$ , where $\overline{\mathbf{X}}_{(i,j)}=(n-2)^{-1}\mathop{\sum}_{k\neq i,j}^{n}\mathbf{X}_{k}$ will be independent of $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$ . Similarly, for the second and third terms, we can define $\mathcal{S}_{2(i,j)}$ , $\mathcal{S}_{1(i)}$ and $\mathcal{S}_{2(j)}$ respectively to maintain independence of terms. Then diagonals of the pooled sample estimators

[TABLE]

are used to construct the functional

[TABLE]

which has expected value $\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)$ .

From the quadratic form and independence of the terms, variance of $\mathcal{U}_{n}$ will be

[TABLE]

where the subscripts in $\mathcal{R}_{1}$ and $\mathcal{R}_{2}$ are used for notational convenience to identify that the terms are related to $\mathbf{X}$ and $\mathbf{Y}$ respectively. A similar leave-out approach is applied to modify the standard correlation matrix estimate $\widehat{\mathcal{R}_{1}}=\mathcal{D}_{\mathcal{S}_{1}}^{-1/2}\mathcal{S}_{1}\mathcal{D}_{\mathcal{S}_{1}}^{-1/2}$ . Centering the observations only once as in $T_{CQ}$ and rearranging the terms, the estimators

[TABLE]

are shown to be ratio consistent for the corresponding terms in ${\rm var}\left(\mathcal{U}_{n}\right)$ . Standardizing $\mathcal{U}_{n}$ by the variance estimator, the Park-Ayyala test statistic is given by

[TABLE]

Asymptotic normality of the test statistic was derived under the following assumptions:

(PA II)

Sample sizes of the two groups, $n$ and $m$ , are constrained as in (BS II).

(PA III)

If $\mathcal{R}$ is the correlation matrix, then ${\rm tr}\left(\mathcal{R}^{4}\right)=o\left\{{\rm tr}^{2}\left(\mathcal{R}^{2}\right)\right\}$ . This condition is similar to (CQ III).

(PA IV)

The means $\boldsymbol{\mu}_{1}$ and $\boldsymbol{\mu}_{2}$ satisfy the local alternative condition $n\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{\top}\mathcal{D}_{\mathcal{S}}^{-1/2}\mathcal{R}\mathcal{D}_{\mathcal{S}}^{-1/2}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)=o\left\{{\rm tr}^{2}\left(\mathcal{R}^{2}\right)\right\}$ .

The assumptions in (PA II) - (PA IV) are milder than (SD I)-(SD IV) and hold for a much larger family of covariance structures. Another major advantage of $T_{PA}$ is that it does not require the normality assumption. Instead, the test is constructed assuming the factor model defined in equation (7).

The four test statistics have several key differences regarding their properties and performance. The Bai-Saranadasa test and Chen-Qin test are orthogonal-transform invariant, i.e. the operation $\mathbf{X}_{i}\mapsto\mathbf{U}\mathbf{X}_{i},i=1,\ldots,n$ and $\mathbf{Y}_{j}\mapsto\mathbf{U}\mathbf{Y}_{j},j=1,\ldots,m$ for some $p\times p$ orthogonal matrix $\mathbf{U}$ does not affect the test. The Srivastava-Du test and Park-Ayyala test are scale-transform invariant, wherein the operation described above does not affect the test when $\mathbf{U}={\rm diag}\left(u_{1},\ldots,u_{p}\right)$ is a diagonal matrix. In practice, scale transformation invariance is more useful than its orthogonal counterpart because it allows variables to be brought on to a uniform scale. To better understand this difference, consider the contribution of each element towards the expected difference under the alternative when $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}=\boldsymbol{\delta}$ . In $T_{BS}$ and $T_{CQ}$ , the $k^{\rm th}$ element has a contribution of $\delta_{k}^{2}$ , whereas in $T_{SD}$ and $T_{PA}$ the contribution is $\delta_{k}^{2}/\sigma_{kk}$ . While the former depends on the scale of the variable, the latter is standardized by the variance and is hence scale-free. In a scenario where the non-zero $\delta_{k}$ ’s correspond only to the variables whose means are small, $T_{PA}$ and $T_{SD}$ have higher power of detecting the difference.
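A small numerical sketch makes the contrast between the two types of contribution concrete. It assumes a diagonal rescaling $X_{k}\mapsto u_{k}X_{k}$ , under which the mean difference scales by $u_{k}$ and the variance by $u_{k}^{2}$ . The helper names and all numerical values below are hypothetical and chosen only for illustration.

```python
def contributions(delta, sigma_diag):
    """Per-variable contributions to the two families of functionals."""
    euclid = [d ** 2 for d in delta]                           # as in T_BS and T_CQ
    scaled = [d ** 2 / s for d, s in zip(delta, sigma_diag)]   # as in T_SD and T_PA
    return euclid, scaled


def rescale(delta, sigma_diag, u):
    """Apply X_k -> u_k X_k: mean differences scale by u_k, variances by u_k**2."""
    new_delta = [uk * d for uk, d in zip(u, delta)]
    new_sigma = [uk ** 2 * s for uk, s in zip(u, sigma_diag)]
    return new_delta, new_sigma


# Hypothetical mean differences, variances and scale factors.
delta = [0.5, 0.0, 2.0]
sigma_diag = [0.25, 1.0, 16.0]
u = [10.0, 1.0, 0.1]

euclid_before, scaled_before = contributions(delta, sigma_diag)
euclid_after, scaled_after = contributions(*rescale(delta, sigma_diag, u))
print("delta_k^2 before/after:         ", euclid_before, euclid_after)
print("delta_k^2/sigma_kk before/after:", scaled_before, scaled_after)
```

The Euclidean contributions change with the units of measurement, whereas the variance-standardized contributions are identical before and after rescaling (up to floating-point rounding).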

Due to their similarities in construction and assumptions, $T_{CQ}$ and $T_{PA}$ are observed to be applicable to a broader range of models. This is mainly because of relaxed assumptions on the covariance structure and lack of direct relationship between $p$ and $n$ . However it is worth noting that the assumptions (BS I) and (SD I) are asymptotic and cannot be validated from a finite sample data set. For example, a data set with $p=10,000$ and $n=10$ can either imply the rate is polynomial ( $p=n^{4}$ ) or linear ( $p=1000n$ ). There is no practical means of determining the true rate with a single data set. Another aspect of this asymptotic rate that is worth considering is that the number of variables is generally deterministic. In genomics data sets such as DNA methylation or gene expression, the dimension is the number of genes, which is fixed. The sample size is the number of biological replicates, which can be increased by collecting more specimens. Hence rate of increase cannot be used as a means to prefer one test to the other. A better approach to determine which method best suits a data set is through a simulation study. A controlled simulation study should be designed using the properties of the data set such as dependence structure and sparsity. The empirical type I error obtained by specifying equal means can be used to compare the performance of the methods. This approach was used in Ayyala et al. [5] to determine that $T_{CQ}$ outperforms the other tests at controlling type I error rate and achieves reasonable power for immuno-precipitation based DNA methylation data.

2.2 Projection based tests

The driving motivation behind the tests in Section 2.1 is the fact that when $p>n$ , the Hotelling’s $T^{2}$ test statistic cannot be calculated. Another approach that has been considered to overcome this problem is to project the data into a lower dimensional space such that the assumptions of Hotelling’s $T^{2}$ are satisfied. For $k<p$ , consider a matrix $\mathcal{R}\in\mathbb{R}^{k\times p}$ with full row rank. If the difference of means, $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}$ , is zero, then so is its image under $\mathcal{R}$ . Conversely, since a single $\mathcal{R}$ with $k<p$ has a non-trivial null space, the difference is guaranteed to be zero only when its image vanishes for every such matrix, $\mathcal{R}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)=\mathbf{0}\,\,\text{for all}\,\,\mathcal{R}\,\,\,\Leftrightarrow\,\,\,\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}=\mathbf{0}.$ In this sense, the hypothesis in (1) is equivalent to

[TABLE]

The equivalence holds for all rank-sufficient matrices and for all dimensions with $k\leq p$ , which is an extremely large collection of matrices. In practice, we can only evaluate it for a very small set of matrices, based on which the conclusion can be drawn. Hence the two key factors that will affect the result of the test are $k$ and construction of $\mathcal{R}$ .

A natural choice of $\mathcal{R}$ for dimension reduction is principal component analysis. Let $\Sigma=\mathbf{V}\Omega\mathbf{V}^{\top}$ be the eigenvalue decomposition of the common covariance matrix $\Sigma$ . The matrix $\mathbf{V}=\left(\mathbf{v}_{1},\ldots,\mathbf{v}_{p}\right)$ is orthogonal with the eigenvectors as its columns, and $\Omega={\rm diag}(\omega_{1},\ldots,\omega_{p})$ is the diagonal matrix of eigenvalues. Eigenvalue decomposition of the pooled sample covariance matrix $\mathcal{S}$ yields

[TABLE]

where $\lambda_{1}\geq\cdots\geq\lambda_{p}$ are the eigenvalues and $\mathbf{u}_{1},\ldots,\mathbf{u}_{p}$ are the orthogonal eigenvectors. Properties of the eigenvalues and eigenvectors will be discussed in detail in later sections. The eigenvalues give a measure of the amount of variability in the data in the direction of the corresponding eigenvector. The cumulative relative variance of any set of eigenvectors $\{\mathbf{u}_{a_{1}},\ldots,\mathbf{u}_{a_{m}}\}$ is given by $\left(\lambda_{a_{1}}+\ldots+\lambda_{a_{m}}\right)/\left(\lambda_{1}+\ldots+\lambda_{p}\right)$ . Any set of $k$ eigenvectors can be used to construct a $k$ -dimensional subspace to project the data. However using the first $k$ eigenvectors is most informative as it contains the maximum cumulative relative variance in the data, equal to $(\lambda_{1}+\ldots+\lambda_{k})/(\lambda_{1}+\ldots+\lambda_{p})$ . Define the matrix $\mathbf{U}_{(k)}=\left(\mathbf{u}_{1},\ldots,\mathbf{u}_{k}\right)$ of dimension $p\times k$ using the first $k$ columns of $\mathbf{U}$ and the projections as

[TABLE]

The sample means of the projected observations will be $\overline{\mathbf{X}}^{*}=\mathbf{U}_{(k)}^{\top}\overline{\mathbf{X}}$ and $\overline{\mathbf{Y}}^{*}=\mathbf{U}_{(k)}^{\top}\overline{\mathbf{Y}}$ respectively. The pooled sample covariance matrix of $\mathbf{X}^{*}$ and $\mathbf{Y}^{*}$ is $\mathbf{U}_{(k)}^{\top}\mathcal{S}\mathbf{U}_{(k)}$ , which, by orthogonality of the columns of $\mathbf{U}_{(k)}$ , is a diagonal matrix given by $\Lambda_{(k)}={\rm diag}(\lambda_{1},\ldots,\lambda_{k})$ . For any $k$ , we can calculate the Hotelling’s $T^{2}$ statistic using the projected data as

[TABLE]

When $k=p$ and $p<n+m-2$ , we have the original Hotelling’s $T^{2}$ statistic as defined in (2), i.e. $T^{2}_{Hot(p)}=T^{2}_{Hot}$ . For any $k\leq p$ , the null distribution of $T^{2}_{Hot(k)}$ will be $F_{k,n+m-k-1}$ .

While the motivation of projecting the samples into the principal component subspace is to reduce dimension and be able to use the Hotelling’s $T^{2}$ test statistic, one needs to be careful when choosing $k$ . If the alternative hypothesis is true, then choosing a small $k$ can potentially lead to a type II error. This is because in $T^{2}_{Hot(k)}$ , the summation will include only the first $k$ terms corresponding to the largest $\lambda$ . But if the difference between the means is uniform over all the components, then the ratio $\left(\mu_{1j}-\mu_{2j}\right)/\lambda_{j}$ will be highlighted only for small $\lambda_{j}$ , which correspond to large $j$ . To illustrate this behaviour, consider the following study. Random samples are generated using $n=m=100,p=50$ and $\Sigma={\rm diag}(\sigma_{1},\ldots,\sigma_{p})$ where $\sigma_{i}\sim{\rm Unif}(2,3)$ . For the means, specify $\boldsymbol{\mu}_{1}=(0,\ldots,0)$ and $\boldsymbol{\mu}_{2}=(\delta,\ldots,\delta)$ . Figure 1 shows the $p$ -value for different values of $\delta$ and for all $k\leq p$ .

The first thing to note is that $T^{2}_{Hot}$ detects the difference for the complete data ( $k=p$ ), whereas projecting into a single dimension always fails to reject $H_{0}$ . The smallest values of $k$ for which the $p$ -value supports rejecting $H_{0}$ for $\delta=0.2,0.4$ and $1$ are $26,15$ and $5$ respectively. The variance is kept constant for the three models, which implies that the difference in results is due to $\delta$ . As $\delta$ increases, there is greater separation between the two means and hence a smaller $k$ is sufficient. The converse, rejecting the hypothesis for small $k<p$ when $\delta=0$ and $H_{0}$ is rejected for $k=p$ , is very unlikely to happen. Thus the type I error will be preserved for all $k$ but the projection test will have lower power than $T^{2}_{Hot}$ .

In the high dimensional setting, when $p>n+m-2$ , the eigenvalue matrix $\Lambda$ is singular with only the first $n+m-2$ eigenvalues non-zero. Defining a generalized inverse of $\Lambda$ as ${\rm diag}(\lambda_{1},\ldots,\lambda_{n+m-2},0,\ldots,0)^{-1}={\rm diag}(1/\lambda_{1},\ldots,1/\lambda_{n+m-2},0,\ldots,0)$ , the projected Hotelling’s $T^{2}$ test statistic defined in (15) can be calculated when $k\leq n+m-2$ . The largest possible model, $T^{2}_{Hot(n+m-2)}$ , is not the complete Hotelling’s $T^{2}$ test as it still contains only a partial summation of terms in (15). However, the $p$ -value of $T^{2}_{Hot(k)}$ behaves differently for different values of $k$ in high dimensions. In Figure 1, we observed that the type II error of $T^{2}_{Hot(k)}$ decreases as $k$ increases. This is because the deviations corresponding to the smallest eigenvalues will be included in the summation for large enough $k$ , resulting in an increase in the test statistic value. But in high dimensions the smallest eigenvalues are set to zero. Therefore the projected $T^{2}$ test statistic can never achieve the value of $T^{2}_{Hot}$ , resulting in an extremely high type II error rate even for $k=n+m-2$ .

To illustrate the effect of $k$ on $T^{2}_{Hot(k)}$ , consider the following simulation study. We set $p=500$ and varied the sample sizes as $n\in\{10,100,200\}$ and $m=2n$ . The mean vectors are set as $\boldsymbol{\mu}_{1}=(0,\ldots,0)$ and $\boldsymbol{\mu}_{2}=(1,\ldots,1)$ respectively. The $p$ -values of $T^{2}_{Hot(k)}$ for the three models and different values of $k$ within each model are presented in Figure 2. Note that when $p>n+m$ , as in the first two sub-figures, the $p$ -value increases with $k$ . In the third sub-figure, by contrast, the properties of the $p$ -value curve are similar to those seen in Figure 1. Similar results have been observed in a wide range of simulation models. Theoretical justification for this behaviour of the projection-based Hotelling’s $T^{2}$ test is still lacking.

2.3 Random projections

As seen in the previous section, projecting on to the eigenspace of the pooled sample covariance matrix has its limitations in high dimension. The results also indicate that the conclusion can be contrary to the truth when sample sizes are small. While the concept of dimension reduction is effective, PCA is not the best approach for this task. Alternatively, projections based on random matrices have been shown to have good performance. A random projection embeds the $p$ -dimensional variables into a lower $k$ -dimensional space ( $k\ll p$ ) while approximately preserving the distances between points. The seminal result that allows us to construct such an embedding is the Johnson-Lindenstrauss lemma [46].

Theorem 2.1 (Johnson-Lindenstrauss lemma).

For any collection of points $\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\in\mathbb{R}^{p}$ and $0<\varepsilon<1$ , there exists a $k\geq k_{0}=O\left(\varepsilon^{-2}\log n\right)$ and a linear map $\mathcal{R}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{k}$ such that

[TABLE]

The theorem provides a method to determine the smallest dimension into which the original data can be embedded without altering the local properties of the data sets. Significance of this result is greatly enhanced by the fact that the dimension of the embedded space, $k$ , depends on the sample size $n$ and not on the dimension $p$ . While the result holds for any linear map, the most commonly used mapping is $\mathbf{X}\mapsto\mathcal{R}\mathbf{X}$ for some $k\times p$ matrix $\mathcal{R}$ . To avoid the pitfalls of principal component based projections, the alternative is to randomly generate the matrix independent of the data.
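To get a feel for the size of $k_{0}$ , the sketch below evaluates one commonly quoted explicit sufficient bound, $k\geq 4\log n/\left(\varepsilon^{2}/2-\varepsilon^{3}/3\right)$ , attributed to Dasgupta and Gupta. This constant is an assumption brought in for illustration, since the lemma as stated above only gives the order $O\left(\varepsilon^{-2}\log n\right)$ ; the function name `jl_min_dim` is hypothetical.

```python
import math


def jl_min_dim(n_points, eps):
    """Sufficient embedding dimension under the assumed explicit bound
    k >= 4 * log(n) / (eps^2 / 2 - eps^3 / 3)."""
    if n_points < 2:
        raise ValueError("need at least two points")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil(4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))


# The bound depends on the number of points and the distortion, not on p.
for n_points in (100, 1000, 10**6):
    for eps in (0.1, 0.5):
        print(n_points, eps, jl_min_dim(n_points, eps))
```

Note that for small $\varepsilon$ the bound can exceed $p$ itself, in which case the guarantee offers no dimension reduction; this is a statement about the guarantee, not about how projections behave in practice.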

For a given $k\in\mathbb{Z}_{+}$ , a random projection matrix $\mathcal{R}\in\mathbb{R}^{k\times p}$ is a matrix whose elements are random variables. Two distinctions need to be made when calling them random and projection matrices. Firstly, unlike matrices generated from distributions on the matrix space, such as the Wishart distribution, these random matrices are not structured. Secondly, these matrices need not necessarily have the properties of a projection matrix, viz. orthogonality, idempotency, etc. For simplicity of generation, the variables are chosen to be independent and identically distributed. Additional conditions can be imposed to provide structure to the projected data. While any distribution can be used to generate the elements, one property that is desired is that it is symmetric with zero mean and unit variance. This property ensures that the Euclidean distance between a pair of observations is preserved in expectation. That is, if $\mathbf{u}=(u_{1},\ldots,u_{p})$ is a random vector with $\mathbb{E}(u_{i})=0$ and $\mathbb{E}(u_{i}^{2})=1$ , then

[TABLE]

The most trivial distribution that is symmetric around zero with unit variance is the standard normal distribution, $\mathcal{N}(0,1)$ . To further simplify random number generation, one can also consider a uniform distribution ${\rm Unif}\left(-\sqrt{3},\sqrt{3}\right)$ , where the limits are adjusted to satisfy the moment conditions.
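The moment conditions can be checked by simulation. The following Python sketch draws independent entries $u_{i}$ from either $\mathcal{N}(0,1)$ or ${\rm Unif}\left(-\sqrt{3},\sqrt{3}\right)$ and compares the Monte Carlo average of $(\mathbf{u}^{\top}\mathbf{x})^{2}$ with $\mathbf{x}^{\top}\mathbf{x}$ for a fixed vector $\mathbf{x}$ . The vector, the number of draws and the function names are assumptions made for illustration.

```python
import math
import random


def draw_normal(rng):
    """One N(0, 1) entry."""
    return rng.gauss(0.0, 1.0)


def draw_uniform(rng):
    """One Unif(-sqrt(3), sqrt(3)) entry, which has mean 0 and variance 1."""
    a = math.sqrt(3.0)
    return rng.uniform(-a, a)


def mean_squared_projection(x, draw, n_draws, seed=0):
    """Monte Carlo estimate of E[(u^T x)^2] for i.i.d. entries u_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        proj = sum(draw(rng) * xi for xi in x)
        total += proj * proj
    return total / n_draws


# Hypothetical fixed vector; its squared Euclidean norm is the target value.
x = [1.0, -2.0, 0.5, 3.0, 0.0, -1.5]
sq_norm = sum(xi * xi for xi in x)
ratios = {}
for name, draw in (("normal", draw_normal), ("uniform", draw_uniform)):
    ratios[name] = mean_squared_projection(x, draw, n_draws=20000) / sq_norm
    print(name, ratios[name])
```

Both ratios should be close to one, up to Monte Carlo error. Stacking $k$ such rows and scaling by $k^{-1/2}$ gives a $k\times p$ matrix that preserves squared norms in expectation.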

Another class of projection matrices that has gained prominence recently is based on binary coins. Developed by Achlioptas [1], this method is found to be very useful for dimension reduction in machine learning [32], image processing [13] and language processing [63]. The idea is to generate the elements of $\mathcal{R}$ from the set $\Omega=\{-1,0,+1\}$ . Two distributions can be defined on $\Omega$ with zero mean and unit variance,

[TABLE]

Among these two distributions, $r^{(3)}_{ij}$ is preferred to $r_{ij}^{(1)}$ because it produces a sparse embedding. By construction, the contribution of two out of three variables (on average) will be set to zero. Furthermore, the computation time is significantly improved when using $r_{ij}^{(3)}$ . Extending from this work, Li et al. [54] generalized the procedure to define the distribution for any $\theta>0$ ,

[TABLE]

This generalization improves on $r_{ij}^{(3)}$ as defined in (17) by increasing the sparsity of $\mathcal{R}$ with $\theta$ , thereby reducing the computation cost of the projection. Li et al. [54] have shown that using $\theta$ as large as $p/\log(p)$ significantly reduces the computation cost with minimal loss of information (accuracy). However keeping in mind this trade-off between speed and information loss, the authors recommend $\theta=\sqrt{p}$ .
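A sparse projection matrix of this kind can be generated with a few lines of Python. The sketch assumes the form in which these distributions are commonly written: each entry equals $+\sqrt{\theta}$ or $-\sqrt{\theta}$ with probability $1/(2\theta)$ each, and $0$ with probability $1-1/\theta$ , so that $\theta=1$ and $\theta=3$ correspond to the two coin-based choices. This parametrization requires $\theta\geq 1$ and, like the function names, is an assumption of the sketch rather than something taken from the displayed equations.

```python
import math
import random


def sparse_entry(theta, rng):
    """One entry: +sqrt(theta) or -sqrt(theta) w.p. 1/(2*theta) each, else 0."""
    u = rng.random()
    half = 1.0 / (2.0 * theta)
    if u < half:
        return math.sqrt(theta)
    if u < 2.0 * half:
        return -math.sqrt(theta)
    return 0.0


def sparse_projection_matrix(k, p, theta, seed=0):
    """k x p matrix (list of rows) with i.i.d. sparse entries; needs theta >= 1."""
    if theta < 1.0:
        raise ValueError("this parametrization needs theta >= 1")
    rng = random.Random(seed)
    return [[sparse_entry(theta, rng) for _ in range(p)] for _ in range(k)]


# Example with the recommended theta = sqrt(p); sizes are illustrative.
p, k = 400, 20
R = sparse_projection_matrix(k, p, theta=math.sqrt(p))
entries = [r for row in R for r in row]
zero_fraction = sum(1 for r in entries if r == 0.0) / len(entries)
print("fraction of zero entries:", zero_fraction)
```

With $\theta=\sqrt{p}$ only about a fraction $1/\sqrt{p}$ of the entries are non-zero, so the product $\mathcal{R}\mathbf{X}$ can be formed from the non-zero entries alone.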

Given a random projection matrix $\mathcal{R}$ , the projected variables $\mathbf{X}^{*}=\mathcal{R}\mathbf{X}$ and $\mathbf{Y}^{*}=\mathcal{R}\mathbf{Y}$ have means $\mathcal{R}\boldsymbol{\mu}_{1}$ and $\mathcal{R}\boldsymbol{\mu}_{2}$ respectively. If the two populations are homogeneous, the common covariance matrix will be $\mathcal{R}\Sigma\mathcal{R}^{\top}$ . Additionally, if the variables are normally distributed, then the distribution is also preserved, i.e. $\mathbf{X}^{*}\sim\mathcal{N}\left(\mathcal{R}\boldsymbol{\mu}_{1},\mathcal{R}\Sigma\mathcal{R}^{\top}\right)$ and $\mathbf{Y}^{*}\sim\mathcal{N}\left(\mathcal{R}\boldsymbol{\mu}_{2},\mathcal{R}\Sigma\mathcal{R}^{\top}\right)$ . The sample means of the two populations will be $\overline{\mathbf{X}}^{*}=\mathcal{R}\overline{\mathbf{X}}$ and $\overline{\mathbf{Y}}^{*}=\mathcal{R}\overline{\mathbf{Y}}$ respectively and the pooled sample covariance matrix is $\mathcal{S}^{*}=\mathcal{R}\mathcal{S}\mathcal{R}^{\top}$ . If $k<n+m-2$ , the Hotelling’s $T^{2}$ test statistic for the projected data can be defined as

[TABLE]

Under the null hypothesis $H_{0:\mathcal{R}}$ defined in (14), $T^{2}_{\mathcal{R}}$ follows an $F_{k,n+m-k-1}$ distribution conditional on $\mathcal{R}$ . The $p$ -value of the test statistic will be

[TABLE]

At significance level $\alpha$ , the null hypothesis is rejected if $p_{\mathcal{R}}<\alpha$ .

In an unpublished work, Lopes et al. [55] first proposed (20) and suggested using $k=\lfloor(n+m)/2\rfloor$ (assuming $p>\lfloor(n+m)/2\rfloor$ ) for the dimension of the reduced space. They provide theoretical justification of conditions in which $T^{2}_{\mathcal{R}}$ has greater power than $T_{CQ}$ and $T_{SD}$ . The main criticism of their procedure concerns the choice of $\mathcal{R}$ . As the test statistic and $p$ -value are calculated conditional on $\mathcal{R}$ , the result of the test will be determined by the choice of the projection matrix $\mathcal{R}$ . The results based on two different realizations of the projection matrix, $\mathcal{R}_{1}$ and $\mathcal{R}_{2}$ , need not necessarily be consistent. To get rid of this sampling artefact, one should generate multiple instances of the projection matrix and combine the $p$ -values of all the instances to draw inference. An exact method for combining the $p$ -values from different projection matrices was developed by Srivastava et al. [78]. Their method, RAPTT (RAndom Projection T-Test), uses the average $p$ -value from multiple independent projection matrices to accept or reject the null hypothesis. The method works as follows.

Consider $N$ random projection matrices $\mathcal{R}_{1},\ldots,\mathcal{R}_{N}$ generated independently and the corresponding $p$ -values calculated using equation (20),

[TABLE]

Then the average $p$ -value, $\overline{p}=N^{-1}\sum\limits_{j=1}^{N}p_{j}$ , is used to test the null hypothesis at level $\alpha$ . If $\psi_{\alpha}$ is a cut-off based on the null distribution of $\overline{p}$ such that $P\left(\overline{p}>\psi_{\alpha}|H_{0}\right)=1-\alpha$ , then we reject $H_{0}$ if $\overline{p}<\psi_{\alpha}$ , since small $p$ -values are evidence against $H_{0}$ . The cut-off $\psi_{\alpha}$ is obtained from the null distribution of $\overline{p}$ , which is not straightforward to derive. Instead, Srivastava et al. [78] have established that the null distribution is independent of the parameters $\boldsymbol{\mu}_{1},\boldsymbol{\mu}_{2}$ and $\Sigma$ . Hence, without loss of generality, the values $\boldsymbol{\mu}_{1}=\boldsymbol{\mu}_{2}=\mathbf{0}$ and $\Sigma=\mathcal{I}_{p}$ can be used to derive the null distribution. Using this property, they proposed computing the cut-off empirically using the following algorithm.

(RAPTT I)

Randomly generate $\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\sim\mathcal{N}\left(\mathbf{0},\mathcal{I}\right)$ and $\mathbf{Y}_{1},\ldots,\mathbf{Y}_{m}\sim\mathcal{N}\left(\mathbf{0},\mathcal{I}\right)$ .

(RAPTT II)

Randomly generate $N$ projection matrices and calculate the $p$ -values $p_{1},\ldots,p_{N}$ using equation (20). Calculate $\overline{p}=N^{-1}\sum\limits_{j=1}^{N}p_{j}$ .

(RAPTT III)

Repeat (RAPTT I) and (RAPTT II) $M$ times to calculate $\overline{p}_{1},\ldots,\overline{p}_{M}$ . Sort them in increasing order such that $\overline{p}_{[1]}\leq\ldots\leq\overline{p}_{[M]}$ . Then the level $\alpha$ cut-off is estimated as

[TABLE]
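The final step of the algorithm can be sketched in Python as follows. The sketch assumes that the cut-off is taken as the order statistic of rank $\lfloor\alpha M\rfloor$ among the simulated null averages and that small average $p$ -values count as evidence against $H_{0}$ . To stay self-contained, the demonstration replaces steps (RAPTT I) and (RAPTT II) by independent uniform draws; this is only a placeholder, because in the actual procedure the $N$ $p$ -values come from equation (20) and share the same simulated data. All function names are hypothetical.

```python
import math
import random


def average_p_value(p_values):
    """Average of the p-values obtained from N projection matrices."""
    return sum(p_values) / len(p_values)


def empirical_cutoff(null_averages, alpha):
    """Order statistic of rank floor(alpha * M) among M simulated null averages
    (assumed form of the empirical cut-off)."""
    ordered = sorted(null_averages)
    rank = max(int(math.floor(alpha * len(ordered))), 1)  # 1-based rank
    return ordered[rank - 1]


def reject_null(observed_average, null_averages, alpha=0.05):
    """Reject H0 when the observed average p-value falls below the cut-off."""
    return observed_average < empirical_cutoff(null_averages, alpha)


# Placeholder for (RAPTT I)-(RAPTT II): independent Unif(0, 1) p-values.
# A real run would compute each p-value from equation (20) on simulated data.
rng = random.Random(1)
N, M = 50, 2000
null_averages = [
    average_p_value([rng.random() for _ in range(N)]) for _ in range(M)
]
psi = empirical_cutoff(null_averages, alpha=0.05)
print("estimated cut-off:", psi)
```

Because the cut-off depends only on $n$ , $m$ , $p$ , $k$ , $N$ and the projection scheme, the simulated null averages can be computed once and reused for any data set with the same configuration.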

In their work, Srivastava et al. propose two types of projection matrices:

(i)

orthogonal matrices generated from the Haar distribution such that $\mathcal{R}\mathcal{R}^{\top}=\mathcal{I}_{k}$ ;

(ii)

a block-weighted approach where for each of the $k$ dimensions in the projected space, non-zero weights are assigned for a unique set of $[p/k]$ elements of the original variables.

A comprehensive simulation study reported in their work shows differences in the empirical power between the two projection matrices under certain situations. The type I error rates reported are relatively consistent. This discrepancy in the performance and its dependence on projection could be attributed to the limited scope of the simulation study (calculated based on 1000 runs). The optimal choice of projection matrix and a comprehensive investigation of performance of the projection-based tests under all models of projection matrix still needs to be addressed.

A major bottleneck of the projection-based tests is the computation time. Tests such as RAPTT are exact and are known to have better performance than asymptotic tests when the sample sizes are small. But the lack of a closed-form null distribution and the variability of the test across different projection matrices impose a heavy computational cost on the procedure. For instance, constructing the empirical null distribution using $N$ projection matrices and $M$ bootstrap samples for the data has a computational cost of $O(NM\tau)$ , where $\tau$ is the cost of calculating the Hotelling’s $T^{2}$ test statistic and the corresponding $p$ -value. Considering $N=M=10^{3}$ leads to a cost of $O(10^{6}\tau)$ , which requires massively parallel computing to achieve reasonable computation time. Variability of $T^{2}_{R}$ and its $p$ -value over the distribution of the projection matrices can be studied to determine the number of bootstrap samples required to achieve a specified level of accuracy in empirical calculations.

2.4 Other approaches

In sections 2.1-2.3, the test statistics were based on the norm of the difference of mean vectors, either $(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{\top}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})$ or $(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{\top}\Sigma^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})$ . Asymptotics and projection based methods are two major approaches commonly considered, but the hypothesis in (1) can also be viewed in a different light. Several tests have been proposed by either (i) aggregating the evidence across the individual elements or (ii) observing the maximum difference across the elements of $\mathbf{X}$ and $\mathbf{Y}$ . In this section, some tests based on univariate methods for individual elements are presented.

Pooled component test (PCT): Wu et al. [84] proposed a test statistic which is applicable when there is missing data, i.e. one or more variables are not observed for all the observations. PCT requires the two groups to be homogeneous and normally distributed. The test statistic is obtained by averaging the squares of $t$ -test statistics for the individual variables,

[TABLE]

where $n_{k}$ and $m_{k}$ are the number of observations for which the $k^{\rm th}$ variable is observed from the first and second samples respectively. The quantities $\overline{X}_{k},\overline{Y}_{k}$ and $S_{k}$ are similarly the estimates of $\mu_{1k},\mu_{2k}$ and $\Sigma_{kk}$ computed from the observed values. Using the first two moments, the null distribution was established to be a scaled chi-squared, $T_{PCT}\stackrel{{\scriptstyle H_{0}}}{{\sim}}c\chi^{2}_{d}$ . The parameters $c$ and $d$ can be estimated from the individual $t$ -test statistics to obtain the approximating null distribution.

Generalized component test (GCT): Gregory et al. [35] proposed a test statistic for heterogeneous populations, replacing the pooled $t$ -test statistic in $T_{PCT}$ with the unpooled test statistic,

[TABLE]

where $\overline{X}_{k},s_{Xk},\overline{Y}_{k},s_{Yk}$ are the means and standard deviations of the $k^{\rm th}$ component for the two samples respectively and $n_{k}$ and $m_{k}$ are as defined for the PCT above. The quantities $\widehat{a}_{n}$ and $\widehat{b}_{n}$ are obtained by combining the higher order moments of the elements of $\mathbf{X}$ and $\mathbf{Y}$ . The denominator $\widehat{\zeta}_{n}$ is estimated using a window-based aggregate of the autocovariance function across the elements. The test statistic is shown to be asymptotically normal. A key assumption of GCT is that the elements of $\mathbf{X}$ and $\mathbf{Y}$ are ordered so that the autocovariance function across the elements diminishes with increasing lag (e.g. a moving average model). This assumption is very restrictive compared to the other tests.

Cai et al. [18] developed a test based on the maximum scaled difference across the elements of the variables. Under the assumption that the populations are homogeneous and normally distributed, the test statistic is given by

[TABLE]

where $\mathcal{S}_{\mathbf{X}}(\widehat{\Omega})$ and $\mathcal{S}_{\mathbf{Y}}(\widehat{\Omega})$ are the biased sample variance estimates of $\widehat{\Omega}\mathbf{X}_{1},\ldots,\widehat{\Omega}\mathbf{X}_{n}$ and $\widehat{\Omega}\mathbf{Y}_{1},\ldots,\widehat{\Omega}\mathbf{Y}_{m}$ respectively. The precision matrix $\Omega=\Sigma^{-1}$ is estimated directly using constrained $\ell_{1}$ -minimization for inverse matrix estimation (CLIME [17]) to avoid inverting the singular pooled sample covariance matrix $\mathcal{S}$ . The asymptotic null distribution of $T_{CLX}$ is shown to be an extreme value distribution of type I and a level $\alpha$ test rejects $H_{0}$ when $T_{CLX}\geq 2\log p-\log\{\log p\}-\log\pi-2\log\{-\log(1-\alpha)\}$ .

Zoh et al. [87] have developed a Bayesian test for the hypothesis in (1) using the Bayes factor. Under the assumption of homogeneous normal distributions for $\mathbf{X}$ and $\mathbf{Y}$ , they considered a Jeffreys prior for $\left(\boldsymbol{\mu},\Sigma\right)$ and a conjugate normal prior for $\boldsymbol{\delta}=\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}$ ,

[TABLE]

Then the Bayes factor was shown to admit a closed form given by

[TABLE]

where $\eta=\frac{nm}{(n+m)\tau_{0}}$ . They also proposed calculating the Bayes factor $BF_{10}\left(\mathcal{R}\mathbf{X},\mathcal{R}\mathbf{Y}\right)$ for any random projection matrix $\mathcal{R}\in\mathbb{R}^{k\times p}$ by replacing $p$ with $k$ and $T^{2}_{Hot}$ with $T^{2}_{Hot:\mathcal{R}}$ in (25). The rejection region is constructed by translating the Hotelling’s $T^{2}$ rejection region, $T^{2}_{Hot}>F_{\alpha,p,n+m-p-1}$ , to $BF_{10}\left(\mathbf{X},\mathbf{Y}\right)$ . At significance level $\alpha$ , the null hypothesis is rejected if

[TABLE]

where $C_{n}=(pF_{\alpha,p,n+m-p-1})\left\{pF_{\alpha,p,n+m-p-1}+n+m-p-1\right\}^{-1}$ , $\tau_{\alpha}^{*}=nm\left\{(n+m)\tau_{\alpha}\right\}^{-1}$ and $\tau_{\alpha}=nm\left\{(n+m)F_{\alpha,p,n+m-p-1}-1\right\}^{-1}$ .
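The level $\alpha$ cut-off for $T_{CLX}$ quoted above is simple to evaluate numerically. The following is a minimal sketch, assuming the inner logarithm is taken of $(1-\alpha)^{-1}$ so that its argument is positive; the function name `clx_cutoff` is hypothetical and not from the cited work.

```python
import math

def clx_cutoff(p, alpha):
    """Cut-off 2 log p - log log p - log(pi) - 2 log log{(1 - alpha)^(-1)}.

    Assumption: the innermost term is log{(1 - alpha)^(-1)}, which is positive
    for alpha in (0, 1), so the outer logarithm is well defined.
    """
    if p <= 1 or not (0.0 < alpha < 1.0):
        raise ValueError("need p > 1 and 0 < alpha < 1")
    q_alpha = -math.log(math.pi) - 2.0 * math.log(math.log(1.0 / (1.0 - alpha)))
    return 2.0 * math.log(p) - math.log(math.log(p)) + q_alpha

cutoff_05 = clx_cutoff(100, 0.05)  # reject H0 when T_CLX >= cutoff_05
cutoff_01 = clx_cutoff(100, 0.01)  # a smaller level gives a larger cut-off
```

The cut-off grows like $2\log p$, so the test remains usable for large $p$ without any resampling.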

2.5 Dependent observations


For testing equality of means of two populations as presented in (1), the observations from each population are assumed to be independently and identically distributed. Most of the test statistics presented so far have been developed under several assumptions constraining the dependence structure. The testing problem has also been addressed when the covariance matrices are structured (Zhong [86], Cai [18]). But what happens if the observations are identically distributed but are not independent? Suppose the observations have the following covariance structure parametrized as ${\rm cov}\left(\mathbf{X}_{i},\mathbf{X}_{j}\right)=\Sigma_{1}^{(i,j)}$ and ${\rm cov}\left(\mathbf{Y}_{i},\mathbf{Y}_{j}\right)=\Sigma_{2}^{(i,j)}$ . Then for any $i$ and $j$ , the expected values of the inner products of the variables will be $\mathbb{E}\left(\mathbf{X}_{i}^{\top}\mathbf{X}_{j}\right)=\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\mu}_{1}+{\rm tr}\left(\Sigma_{1}^{(i,j)}\right)$ and $\mathbb{E}\left(\mathbf{Y}_{i}^{\top}\mathbf{Y}_{j}\right)=\boldsymbol{\mu}_{2}^{\top}\boldsymbol{\mu}_{2}+{\rm tr}\left(\Sigma_{2}^{(i,j)}\right)$ respectively. Considering the functional based on the Euclidean norm of $\overline{\mathbf{X}}-\overline{\mathbf{Y}}$ , its expected value will be

[TABLE]

Since the samples are assumed to be identically distributed, we have $\Sigma_{1}^{(i,i)}=\Sigma_{1}$ , $\Sigma_{2}^{(i,i)}=\Sigma_{2}$ . In the independent case, additionally we had $\Sigma_{1}^{(i,j)}=\Sigma_{2}^{(i,j)}=\mathbf{0}_{p\times p}$ when $i\neq j$ . Under the dependence structure, we have additional $n(n-1)+m(m-1)$ covariance matrices in the model. An unstructured dependence structure will therefore be infeasible because for any $i$ and $j$ , we have only one pair of observations $(\mathbf{X}_{i},\mathbf{X}_{j})$ to estimate $\Sigma_{1}^{(i,j)}$ . To make estimation feasible, we assume second-order stationarity on the dependence structures,

[TABLE]

By symmetry, we have $\Sigma_{1}(-a)=\Sigma_{1}^{\top}(a)$ and $\Sigma_{2}(-a)=\Sigma_{2}^{\top}(a)$ for all $a\in\mathbb{Z}_{+}$ . In time series, $\{\Sigma_{1}(a),a\in\mathbb{Z}\}$ and $\{\Sigma_{2}(a),a\in\mathbb{Z}\}$ represent the autocovariance functions of the two populations respectively. The matrices $\Sigma_{1}(a)$ and $\Sigma_{2}(a)$ represent the autocovariance at lag $a$ .

Using the autocovariance function, the expected value in (27) simplifies to

[TABLE]

A functional that is unbiased for the Euclidean norm of $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}$ can be constructed using (28) as

[TABLE]

where $\widehat{\Sigma_{1}(a)}$ and $\widehat{\Sigma_{2}(a)}$ are the biased estimators of $\Sigma_{1}(a)$ and $\Sigma_{2}(a)$ defined as

[TABLE]

These are the biased estimators of the autocovariance (Brockwell and Davis [16]), which should be of no concern to us since we are only interested in their trace. When $p$ is finite, these estimators are known to be asymptotically unbiased. However, in high dimensions, when $p$ increases with $n$ , this property is no longer valid. For instance, the expected value of ${\rm tr}\left\{\widehat{\Sigma}_{1}(a)\right\}$ will be $\mathbb{E}\left[{\rm tr}\left\{\widehat{\Sigma}_{1}(a)\right\}\right]=\mathop{\sum}_{b=0}^{n-1}\theta_{n}(a,b){\rm tr}\left\{\Sigma_{1}(b)\right\}$ , where

[TABLE]

Asymptotic unbiasedness for finite $p$ follows from the leading term converging to 1 and the second and third terms, which are $O(n^{-1})$ , converging to zero as $n$ goes to infinity because ${\rm tr}\left\{\Sigma_{1}(a)\right\}=O(1)$ . In high dimensions, if the autocovariance structure is proper with all eigenvalues being non-zero, then ${\rm tr}\left\{\Sigma_{k}(a)\right\}=O(p)$ for $k=1,2$ and all lags $a$ . Hence all three terms in the expression for $\theta_{n}(a,b)$ should be considered. The expected value of ${\rm tr}\left\{\widehat{\Sigma}_{1}(a)\right\}$ depends on the autocovariance matrices at all lags through the trace function, which is a univariate measure of the matrix. Expressing in vector form, we have $\mathbb{E}\left\{\widehat{\boldsymbol{\gamma}}_{n}\right\}=\Theta_{n}\boldsymbol{\gamma}$ where $\Theta_{n}=(\theta_{n}(a,b))_{a,b\in\{0,\ldots,n-1\}},\boldsymbol{\gamma}=\left({\rm tr}\left\{\Sigma_{1}(0)\right\},\ldots,{\rm tr}\left\{\Sigma_{1}(n-1)\right\}\right)$ and $\widehat{\boldsymbol{\gamma}}_{n}=\left({\rm tr}\left\{\widehat{\Sigma}_{1}(0)\right\},\ldots,{\rm tr}\left\{\widehat{\Sigma}_{1}(n-1)\right\}\right)$ respectively. This property can be used to construct unbiased estimators for ${\rm tr}\left\{\Sigma_{1}(a)\right\}$ as elements of the vector $\widehat{\boldsymbol{\gamma}}^{*}=\Theta_{n}^{-1}\widehat{\boldsymbol{\gamma}}_{n}$ . Denoting the elements of $\widehat{\boldsymbol{\gamma}}^{*}$ as $\widehat{{\rm tr}\left\{\Gamma(a)\right\}}$ , the functional can finally be constructed as

[TABLE]
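The bias correction described above can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: the coefficients $\theta_{n}(a,b)$ are obtained by writing ${\rm tr}\{\widehat{\Sigma}_{1}(a)\}$ as a quadratic form in the observations rather than from the closed-form display, and the system is truncated to lags $0,\ldots,M$ (that is, $M$-dependence is assumed). The function names are hypothetical.

```python
import numpy as np

def theta_matrix(n, M):
    """Coefficients theta_n(a, b), a, b in {0, ..., M}, for the biased estimator
    tr{Sigma_hat(a)} = (1/n) * sum_t (X_t - Xbar)'(X_{t+a} - Xbar)."""
    C = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    lag = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    theta = np.zeros((M + 1, M + 1))
    for a in range(M + 1):
        A = C @ np.eye(n, k=a) @ C / n                  # quadratic-form weights
        for b in range(M + 1):
            theta[a, b] = A[lag == b].sum()             # weight on tr{Sigma(b)}
    return theta

def biased_traces(X, M):
    """tr{Sigma_hat(a)} for a = 0, ..., M; rows of X are observations in time order."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return np.array([np.sum(Xc[: n - a] * Xc[a:]) / n for a in range(M + 1)])

def corrected_traces(X, M):
    """Solve Theta * gamma_star = gamma_hat for the bias-corrected traces."""
    return np.linalg.solve(theta_matrix(X.shape[0], M), biased_traces(X, M))

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
gamma_star = corrected_traces(X, M=2)
```

With $M=0$ the correction reduces to rescaling by $n/(n-1)$, which recovers the trace of the usual unbiased sample covariance matrix.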

Ayyala et al. [6] proposed a test statistic based on $\mathcal{M}_{n}$ defined in (32). In addition to the second-order stationary autocovariance structure, observations from the two populations are assumed to be realizations of two independent $M$ -dependent strictly stationary Gaussian processes with means $\boldsymbol{\mu}_{1}$ and $\boldsymbol{\mu}_{2}$ and autocovariance structures $\{\Sigma_{1}(a)\}$ and $\{\Sigma_{2}(a)\}$ respectively. The $M$ -dependence structure forces the autocovariance matrices to be equal to zero for lags greater than $M$ . Properties of the test statistic are established based on the following assumptions:

(APR I)

The observations are realizations of $M$ -dependent strictly stationary Gaussian processes.

(APR II)

The rates of increase of dimension $p$ and order $M$ with respect to $n$ are linear and polynomial respectively,

[TABLE]

(APR III)

For any $k_{1},k_{2},k_{3},k_{4}\in\{1,2\}$ ,

[TABLE]

where $\Omega_{1}=\mathop{\sum}_{a=-M}^{M}(1-|a|/n)\Sigma_{1}(a)$ and $\Omega_{2}=\mathop{\sum}_{a=-M}^{M}(1-|a|/n)\Sigma_{2}(a)$ .

(APR IV)

The means $\boldsymbol{\mu}_{1}$ and $\boldsymbol{\mu}_{2}$ satisfy the local alternative condition

[TABLE]

Setting $M=0$ and $\Sigma_{1}(a)=\Sigma_{2}(a)=0$ for all $a\neq 0$ , it is straightforward to see that the conditions (APR III) and (APR IV) are similar to (CQ III) and (CQ IV). The test statistic is given by

[TABLE]

where the variance estimate is constructed similarly to $T_{CQ}$ and $T_{PA}$ using a leave-out method for better asymptotic properties. For the exact form of the estimator, please refer to Ayyala et al. [6]. Under the conditions (APR I)-(APR IV), $T_{APR}$ is shown to be asymptotically normal. While the test statistic and the empirical studies of Ayyala et al. are valid, Cho et al. [23] identified some theoretical errors in the proofs and provided corrections to some of the results and assumptions in Ayyala et al.

One issue that still needs to be addressed is the choice of $M$ . Simulation studies reported in Ayyala et al. indicate that over-estimating $M$ is better than under-estimating it. When the value of $M$ specified in the analysis is greater than the true order of dependency, the only error is in estimating zero matrices for lags greater than the true $M$ . Under-specifying the value results in bias, as autocovariances for several lags will not be estimated. Accurate estimation of $M$ using the data is not addressed and remains an open area of research. A large class of models can be approximated using $M$ -dependent strictly stationary processes. Tests for other classes of models, such as second-order stationary processes or non-Gaussian processes, are another area of active research.

3 Covariance matrix

The covariance matrix of a multivariate random variable is a measure of dependence between the components of the variable. It is the second order central moment of the variable, defined as $\Sigma={\rm var}(\mathbf{X})=\mathbb{E}\left\{\left(\mathbf{X}-\boldsymbol{\mu}\right)\left(\mathbf{X}-\boldsymbol{\mu}\right)^{\top}\right\}$ , where $\boldsymbol{\mu}=\mathbb{E}(\mathbf{X})$ . The covariance matrix is often re-parameterized using its inverse, called the precision matrix, $\Omega=\Sigma^{-1}=(\omega_{ij})$ . Elements of the precision matrix are useful in determining conditional independence under normality. If $\mathbf{X}\sim\mathcal{N}(\boldsymbol{\mu},\Sigma)$ , then $\omega_{ij}=0$ implies $X_{i}$ is independent of $X_{j}$ conditional on $\{X_{k}:k\neq i,j\}$ . The precision matrix is important because it can be used to construct an undirected graphical network model. Representing the components as nodes of the network, edges are defined by the elements of $\Omega$ , where $\omega_{ij}\neq 0$ indicates the presence of an edge and $\omega_{ij}=0$ indicates the absence of an edge between nodes $i$ and $j$ . In view of these properties of the covariance matrix and other distributional properties, normality of the variables is commonly used in covariance matrix estimation. Unless otherwise stated, we shall assume the variables are normally distributed for the remainder of this section.
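As a small illustration of how the precision matrix encodes the graph, the sketch below builds a hypothetical chain-structured precision matrix and reads off its edges. The matrix values and the helper `precision_edges` are illustrative assumptions only. The point is that the covariance matrix is dense even though the precision matrix, and hence the graph, is sparse.

```python
import numpy as np

def precision_edges(omega, tol=1e-10):
    """Edges (i, j), i < j, of the undirected graph implied by non-zero omega_ij."""
    p = omega.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p) if abs(omega[i, j]) > tol]

# Hypothetical chain graph 0 - 1 - 2 - 3: a tridiagonal, positive definite precision matrix.
omega = np.array([[ 2.0, -1.0,  0.0,  0.0],
                  [-1.0,  2.0, -1.0,  0.0],
                  [ 0.0, -1.0,  2.0, -1.0],
                  [ 0.0,  0.0, -1.0,  2.0]])
sigma = np.linalg.inv(omega)                 # implied covariance matrix

edges = precision_edges(omega)               # only neighbouring nodes are connected
dense_cov = bool(np.all(np.abs(sigma) > 1e-10))  # every pair is marginally correlated
```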

Given an i.i.d. sample $\mathbf{X}_{i}\sim\mathcal{N}(\boldsymbol{\mu},\Sigma),i=1,\ldots,n$ , the biased sample covariance matrix is defined as

[TABLE]

with $\mathbb{E}(\mathcal{S})=\{(n-1)/n\}\Sigma$ and ${\rm rank}(\mathcal{S})=\min(n-1,p)$ . In the traditional multivariate setting with $p<n$ , $\mathcal{S}$ is non-singular and consistent for $\Sigma$ . The sampling distribution of $n\mathcal{S}$ is a Wishart distribution with $n-1$ degrees of freedom (Anderson [4], Muirhead [60]). Additionally, the eigenvalues of $\mathcal{S}$ are also consistent for the eigenvalues of $\Sigma$ . Asymptotically, the eigenvalues are normally distributed - a result that can be used to construct hypothesis tests. Estimation of the eigenvalues of $\Sigma$ is of importance because they give the variances of the principal components, which are useful in constructing lower-dimensional embeddings of the data (dimension reduction). Hypothesis tests concerning the structure of the covariance matrix such as sphericity ( $H_{0}:\Sigma=\sigma^{2}\mathcal{I}$ ) and uniform correlation ( $H_{0}:\Sigma=\sigma^{2}\left[(1-\rho)\mathcal{I}+\rho\mathbf{1}\mathbf{1}^{\top}\right]$ ) are constructed using this property ([4, 60]). Testing equality of covariance matrices for two or more groups is also well-defined when using the sample covariance matrix and its Wishart properties.

Results from traditional multivariate analysis are valid only when $n>p$ and $p$ is assumed to be fixed. In high dimensional analysis, as seen in Section 2, $p$ is assumed to be increasing with $n$ . How can we construct consistent estimators for $\Sigma$ and test statistics to compare the covariance structures of two or more populations in high dimensions? In high dimensional models with $p\geq n$ , the sample covariance matrix $\mathcal{S}$ is rank-deficient. Estimation of $\Sigma$ and $\Omega$ also suffers from the curse of dimensionality even when $p<n$ with $p/n\rightarrow c\in(0,1)$ . When $p\rightarrow\infty$ , $\mathcal{S}$ is no longer consistent for $\Sigma$ . Estimation of $\Sigma$ was not an issue in tests for the mean vector since we were only interested in consistent estimators for a function of $\Sigma$ , e.g. ${\rm tr}\left(\Sigma\right)$ or ${\rm tr}\left(\Sigma^{2}\right)$ .
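The rank deficiency of $\mathcal{S}$ when $p\geq n$ is easy to see in a small simulation. The following sketch uses simulated standard normal data; the sample size, dimension and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 30                                  # more variables than observations
X = rng.standard_normal((n, p))                # rows are i.i.d. N(0, I_p) draws

Xc = X - X.mean(axis=0)                        # centre each component
S = Xc.T @ Xc / n                              # biased sample covariance (p x p)

rank_S = int(np.linalg.matrix_rank(S))         # at most n - 1 because of centering
smallest_eig = float(np.linalg.eigvalsh(S)[0]) # numerically zero, so S is singular
```

Because $\mathcal{S}$ has at most $n-1$ non-zero eigenvalues here, it cannot be inverted to estimate $\Omega$ directly.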

3.1 Estimation

To obtain consistent estimators for $\Sigma$ , two methods for reducing the dimension of the parameter space are used: structural constraints and regularization through sparsity. A banding approach, proposed by Bickel and Levina [11], sets elements outside a band around the diagonal to zero. For any $1\leq k\leq p$ , the banded estimator $\widehat{\Sigma}^{(k)}$ is defined as

[TABLE]

Here $k$ denotes the width of the band, with the smallest band width reducing $\widehat{\Sigma}^{(k)}$ to the diagonal estimator. The estimator is consistent for $\Sigma$ under the $\ell_{2}$ matrix norm when $\log p/n\rightarrow 0$ . The optimal value of $k$ is chosen using $K$ -fold cross validation of the estimated risk. Banding is particularly effective when the components of $\mathbf{X}$ are ordered so that $\sigma_{ij}$ decreases as $|i-j|$ increases. Consistency of the estimator is also shown to hold for non-Gaussian variables whose elements have sub-exponential tails.
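A banded estimator can be sketched in a few lines. The convention used below, keeping entries with $|i-j|<k$ so that $k=1$ gives the diagonal and $k=p$ returns $\mathcal{S}$ unchanged, is an assumption chosen to match the range $1\leq k\leq p$; the cited paper may index the band differently. The function name `band` is hypothetical.

```python
import numpy as np

def band(S, k):
    """Keep entries of S with |i - j| < k and set the rest to zero.

    Assumed convention: k = 1 keeps only the diagonal, k = p keeps all of S.
    """
    p = S.shape[0]
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) < k, S, 0.0)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 6))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / X.shape[0]

S_diag = band(S, 1)        # diagonal estimator
S_tri = band(S, 2)         # main diagonal plus first off-diagonals
S_full = band(S, 6)        # no banding
```

In practice the band width would be chosen by cross validation, which is not shown here.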

Regularization is a more commonly used approach for covariance matrix estimation as it is easier to formulate mathematically. Under normality, likelihood of $\Sigma$ given a sample $\mathbf{X}_{1},\ldots,\mathbf{X}_{n}$ can be expressed as

[TABLE]

Expression of the second term follows by applying the matrix result that for any $p$ dimensional vector $\mathbf{x}$ and $p\times p$ matrix $B$ , we have $\mathbf{x}^{\top}B\mathbf{x}={\rm tr}\left(\mathbf{x}^{\top}B\mathbf{x}\right)={\rm tr}\left(B\mathbf{x}\mathbf{x}^{\top}\right)$ . Alternatively, the likelihood can be expressed in terms of the precision matrix $\Omega$ as

[TABLE]

Maximizing the likelihood in (36) with respect to $\Sigma$ yields $\widehat{\Sigma}=\mathcal{S}$ .

Regularization of the covariance matrix estimator is achieved by adding a penalty term to the likelihood in (36),

[TABLE]

for some penalty function $\mathcal{P}$ which can be defined to achieve a desired effect on $\widehat{\Sigma}$ . The penalty parameter $\lambda$ dictates the trade-off between maximizing the likelihood term and minimizing the penalty. Inspired by the lasso (Tibshirani [82]), Bien and Tibshirani [12] proposed using an $\ell_{1}$ -penalty to induce sparsity in the estimator. The penalty function is given by $\mathcal{P}(\Sigma)=\|\mathcal{W}\circ\Sigma\|_{1}=\mathop{\sum}_{i,j}w_{ij}|\sigma_{ij}|$ , where $\circ$ denotes the Hadamard element-wise product. The matrix $\mathcal{W}=\mathbf{1}\mathbf{1}^{\top}$ penalizes all the elements of $\Sigma$ whereas $\mathcal{W}=\mathbf{1}\mathbf{1}^{\top}-\mathcal{I}$ penalizes only the off-diagonal terms. Another approach to regularization was developed by Daniels and Kass [29], who shrink the eigenvalues to make the estimator more stable.
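The effect of the weight matrix $\mathcal{W}$ can be checked on a toy example. The sketch below evaluates the weighted $\ell_{1}$ penalty as the sum of $w_{ij}|\sigma_{ij}|$; the $2\times 2$ matrix and the function name are illustrative assumptions.

```python
import numpy as np

def weighted_l1_penalty(sigma, W):
    """P(Sigma) = || W o Sigma ||_1 = sum_ij w_ij * |sigma_ij| (Hadamard product)."""
    return float(np.sum(W * np.abs(sigma)))

sigma = np.array([[ 2.0, -0.5],
                  [-0.5,  1.0]])
ones = np.ones((2, 2))

penalty_all = weighted_l1_penalty(sigma, ones)               # every element penalized
penalty_off = weighted_l1_penalty(sigma, ones - np.eye(2))   # off-diagonal elements only
```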

While developing penalized estimates for the covariance matrix is theoretically important, in practice it is often more useful to obtain sparse estimates of the precision matrix. Sparsity of the precision matrix translates to absence of edges between nodes in the network model. Hence a sparse precision matrix can be used to isolate clusters of nodes which are strongly dependent within themselves and independent of the other clusters. The $\ell_{1}$ penalized precision matrix estimation is done by maximizing the function

[TABLE]

Termed glasso (short for graphical lasso) by Friedman et al. [33], the problem has garnered considerable interest. Several extensions and improvements of the original glasso method have been proposed. Danaher et al. [27] and Guo et al. [36] studied joint estimation of $K>1$ precision matrices by imposing two levels of penalties. For sparse estimation of precision matrices $\Omega^{(1)},\ldots,\Omega^{(K)}$ , using (39) individually will not preserve the cluster structure across the groups. By introducing a penalty to merge the $K$ groups, the following penalty functions have been proposed:

[TABLE]

where Guo et al. parameterize the $K$ precision matrices as $\Omega^{(k)}=\Theta\circ\Gamma^{(k)}$ with $\Theta$ representing the overall network structure and $\Gamma^{(k)}$ ’s representing the group-specific differences in the structure.
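To make the single-group $\ell_{1}$ penalized objective concrete, the sketch below evaluates a commonly used form, $\log|\Omega|-{\rm tr}(\mathcal{S}\Omega)-\lambda\|\Omega\|_{1}$ . This form, including the choice to penalize all entries and the omission of constant factors, is an assumption about the displayed objective. The function only evaluates the objective and does not implement the glasso algorithm.

```python
import numpy as np

def penalized_loglik(omega, S, lam):
    """log det(Omega) - tr(S Omega) - lam * sum_ij |omega_ij| (assumed form)."""
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:                       # Omega must be positive definite
        return -np.inf
    return float(logdet - np.trace(S @ omega) - lam * np.sum(np.abs(omega)))

S = np.array([[1.0, 0.3],
              [0.3, 1.0]])
identity = np.eye(2)

# Without a penalty the maximizer is the inverse of S (the unpenalized MLE).
obj_mle = penalized_loglik(np.linalg.inv(S), S, lam=0.0)
obj_identity = penalized_loglik(identity, S, lam=0.0)

# With a penalty, dense candidates such as inv(S) are charged more than sparse ones.
obj_mle_pen = penalized_loglik(np.linalg.inv(S), S, lam=0.5)
obj_identity_pen = penalized_loglik(identity, S, lam=0.5)
```

Larger values of $\lambda$ shift the maximizer towards sparser precision matrices, which is the mechanism behind edge selection in the network model.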

3.2 Hypothesis testing

When studying the covariance matrix of a multivariate Gaussian population, there are two common hypotheses of interest:

[TABLE]

These hypotheses can be alternatively stated using eigenvalues. If $\lambda_{1},\ldots,\lambda_{p}$ are the eigenvalues of $\Sigma$ , the hypotheses in (40) are equivalent to

[TABLE]

Functionals of $\Lambda=(\lambda_{1},\ldots,\lambda_{p})$ which are equal to zero under the null hypothesis can be constructed by observing that under sphericity, the variance of $\Lambda$ is equal to zero. For the identity hypothesis, deviation of $\Lambda$ from one is zero. The functionals (John [John1972], Nagao [61]) can be defined as

[TABLE]

In the traditional setting when $p<n$ , the sample covariance matrix $\mathcal{S}$ (and its eigenvalues) is consistent for $\Sigma$ (and $\Lambda$ ). Hence the test statistics based on the functionals in (42) are

[TABLE]

which are shown to follow chi-squared distributions asymptotically with $p(p+1)/2-1$ and $p(p+1)/2$ degrees of freedom respectively. Ledoit and Wolf [50] studied the properties of $U_{n}$ and $V_{n}$ when $p/n\rightarrow c>0$ and observed that $U_{n}$ performs well even in the high-dimensional case. For the identity hypothesis, they constructed a new test statistic,

[TABLE]

which is also asymptotically chi-squared with $p(p+1)/2$ degrees of freedom but has better properties than $V_{n}$ . Relaxing the assumption of normal distribution and a direct relationship between $n$ and $p$ , Chen et al. proposed test statistics $U_{n}^{*}$ and $V_{n}^{*}$ which are asymptotically normally distributed. These test statistics are in the same spirit as $T_{CQ}$ (9) and use leave-out cross-validation type products to improve the asymptotic properties.
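The sphericity and identity statistics can be computed directly from $\mathcal{S}$ . The sketch below uses the forms in which they commonly appear in the literature, $U=p^{-1}{\rm tr}[\{\mathcal{S}/(p^{-1}{\rm tr}\,\mathcal{S})-\mathcal{I}\}^{2}]$ , $V=p^{-1}{\rm tr}\{(\mathcal{S}-\mathcal{I})^{2}\}$ and the Ledoit and Wolf modification $W=V-(p/n)(p^{-1}{\rm tr}\,\mathcal{S})^{2}+p/n$ . These should be read as assumptions about the displayed formulas rather than a transcription of them.

```python
import numpy as np

def sphericity_identity_stats(S, n):
    """Return (U, V, W) for a p x p sample covariance matrix S from n observations."""
    p = S.shape[0]
    eye = np.eye(p)
    mean_eig = np.trace(S) / p                      # average eigenvalue of S
    D = S / mean_eig - eye
    U = np.trace(D @ D) / p                         # zero when S is proportional to I
    V = np.trace((S - eye) @ (S - eye)) / p         # zero when S equals I
    W = V - (p / n) * mean_eig**2 + p / n           # corrected identity statistic
    return float(U), float(V), float(W)

stats_identity = sphericity_identity_stats(np.eye(3), n=20)
stats_scaled = sphericity_identity_stats(2.0 * np.eye(3), n=20)
```

A scaled identity matrix leaves $U$ at zero while $V$ and $W$ move away from zero, reflecting the difference between the sphericity and identity hypotheses.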

Next, consider testing equality of covariance matrices from two normal populations $\mathbf{X}_{i}\sim\mathcal{N}(0,\Sigma_{1}),i=1,\ldots,n$ and $\mathbf{Y}_{j}\sim\mathcal{N}(0,\Sigma_{2}),j=1,\ldots,m$ . The sample covariance matrices and pooled covariance matrix,

[TABLE]

are used to construct the likelihood ratio test statistic as

[TABLE]

Under $H_{0}:\Sigma_{1}=\Sigma_{2}$ , $\mathcal{L}$ asymptotically follows a chi-squared distribution with $p(p+1)/2$ degrees of freedom. Extending to $K$ groups, the test statistic is

[TABLE]

where $n_{g}$ is the sample size of the $g^{\rm th}$ group and $\mathcal{S}_{pl}=\left(\sum_{g=1}^{K}n_{g}\right)^{-1}\left(\sum_{g=1}^{K}n_{g}\mathcal{S}_{g}\right)$ . Under $H_{0}:\Sigma_{1}=\ldots=\Sigma_{K}$ , the LRT statistic $\mathcal{L}_{K}$ asymptotically follows a chi-squared distribution with $(K-1)p(p+1)/2$ degrees of freedom. However for the two sample case, LRT fails when $p>\min(n,m)$ because at least one of $\mathcal{S}_{1}$ or $\mathcal{S}_{2}$ will become singular. Bai et al. [8] and Jiang et al. [45] provided asymptotic corrections to the LRT when $n,p\rightarrow\infty$ with $c_{n}=p/n\rightarrow c\in(0,\infty)$ and proposed

[TABLE]

which is asymptotically normally distributed under the null hypothesis.

Another approach for testing equality of covariance matrices is to construct a functional $\mathcal{F}(\Sigma_{1},\Sigma_{2})$ which will be equal to zero when $\Sigma_{1}=\Sigma_{2}$ . Schott [70] used the squared Frobenius norm of the difference $\Sigma_{1}-\Sigma_{2}$ as the functional to base the test statistic. This method is readily extended to comparing $K$ covariance matrices, with the test statistic $\mathcal{T}=\mathcal{F}_{n}/\sqrt{\widehat{{\rm var}(\mathcal{F}_{n})}}$ , where

[TABLE]

and $\eta_{i}=(n_{i}+2)(n_{i}-1)$ . When $p/n_{i}\rightarrow c_{i}\in[0,\infty)$ , $\mathcal{T}$ is asymptotically normal under the null hypothesis. Srivastava et al. [75, 76, 77] developed test statistics using a similar rationale but replacing the normality assumption with constraints on the moments of the first four orders. Relaxing the direct relationship between $p$ and $n$ , Li and Chen [53] proposed a test statistic using ${\rm tr}\{(\Sigma_{1}-\Sigma_{2})^{2}\}$ as the functional. The test statistic was constructed using U-statistics of the form $\{n(n-1)\}^{-1}\sum_{i\neq j}(\mathbf{X}_{i}^{\top}\mathbf{X}_{j})^{2}$ to estimate ${\rm tr}\{\Sigma_{1}^{2}\}$ and so on. The leave-out cross-products in the proposed test statistic are similar in spirit to the variance estimate in $T_{CQ}$ (9). Assumptions for the test statistic are similar to (CQ III) and (CQ IV).
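The leading U-statistic for ${\rm tr}\{\Sigma_{1}^{2}\}$ can be computed from the Gram matrix of the observations. The sketch below assumes mean-zero data, as in the two-sample setup above, and averages $(\mathbf{X}_{i}^{\top}\mathbf{X}_{j})^{2}$ over ordered pairs $i\neq j$ , which is the same as dividing twice the sum over $i<j$ by $n(n-1)$ . The full statistic of Li and Chen includes further terms that are not shown; the function name is hypothetical.

```python
import numpy as np

def tr_sigma_squared_ustat(X):
    """Average of (X_i' X_j)^2 over ordered pairs i != j; rows of X are observations.

    For independent mean-zero rows, E{(X_i' X_j)^2} = tr(Sigma^2) when i != j.
    """
    n = X.shape[0]
    G = X @ X.T                                   # Gram matrix of inner products
    off_diag = np.sum(G**2) - np.sum(np.diag(G)**2)
    return float(off_diag / (n * (n - 1)))

est_same = tr_sigma_squared_ustat(np.array([[1.0, 0.0], [1.0, 0.0]]))
est_orth = tr_sigma_squared_ustat(np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Excluding the diagonal terms $i=j$ is what removes the bias that a plug-in estimate ${\rm tr}(\mathcal{S}^{2})$ would carry when $p$ grows with $n$ .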

Covariance matrix estimation is an exciting field with direct applications in graphical network models. Most of the theory of regularization based sparse precision matrices is based on Gaussian distributions. Extending such estimation to distributions such as the Dirichlet-Multinomial or multivariate Poisson, where the covariance matrix is parameterized through the mean, is very challenging. Hypothesis tests for covariance matrices have primarily been developed by studying the asymptotic properties of traditional test statistics. As seen in Section 2, random projection methods show good promise in mean vector testing. Using random projections for covariance matrices is an interesting question. If $R\in\mathbb{R}^{k\times p}$ is an orthogonal random matrix, then projecting the data using $R$ preserves the hypotheses of sphericity and identity in equation (40). The hypotheses conditional on the random projections will be

[TABLE]

Theoretical properties of such tests are an active area of research.
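The claim that an orthogonal projection preserves sphericity can be verified numerically. The sketch below builds a random $k\times p$ matrix with orthonormal rows from a QR decomposition of a Gaussian matrix, which is one possible construction and an assumption here, and checks that $R(\sigma^{2}\mathcal{I}_{p})R^{\top}=\sigma^{2}\mathcal{I}_{k}$ .

```python
import numpy as np

def random_orthonormal_rows(k, p, rng):
    """Random k x p matrix R with R R' = I_k, via reduced QR of a p x k Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))   # Q is p x k, orthonormal columns
    return Q.T

rng = np.random.default_rng(2)
p, k, sigma2 = 50, 5, 3.0
R = random_orthonormal_rows(k, p, rng)

projected_cov = R @ (sigma2 * np.eye(p)) @ R.T          # covariance of R X under sphericity
```

The projected problem has dimension $k$ , so classical tests that require $k<n$ become applicable once $k$ is chosen small enough.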

4 Discrete multivariate models

Multivariate count data occur frequently in genomics and text mining. In high-throughput genomic experiments such as RNA-Seq (Wang et al. [83]), data is reported as the number of reads aligned to the genes in a reference genome. In text mining (Blei et al. [15]), the number of occurrences of a dictionary of words in a library of books is counted to study patterns of keywords and topics. In metagenomics (Holmes et al. [40]), abundances of bacterial species in samples are studied by recording the counts of reads assigned to different bacterial species. In all these data sets, the data matrix consists of non-negative integer counts. Analysis of multivariate discrete data can be approached in two ways. The absolute counts can be modeled using discrete probability models, or the data can be transformed (e.g. using relative abundances instead of absolute counts) and modeled using continuous probability models such as the Gaussian. The research community is still divided in opinion on the loss of information due to this transformation (McMurdie and Holmes [56]) or the lack thereof. Transforming the variables will enable us to use the hypothesis testing tools presented in Section 2. In this section, we will look at some discrete multivariate models.

4.1 Multinomial distribution

The Multinomial distribution is the most commonly used multivariate discrete model, extending the univariate binomial distribution to multiple dimensions. For $p\geq 2$ , the multinomial distribution is parameterized by a probability vector $\boldsymbol{\pi}=(\pi_{1},\ldots,\pi_{p})$ with $\pi_{1}+\ldots+\pi_{p}=1$ and the total count $N\in\mathbb{Z}_{+}$ . The probability mass function of $\mathbf{X}\sim{\rm Mult}\left(N,\boldsymbol{\pi}\right)$ is given by

[TABLE]

for all $\mathbf{x}\in\mathbb{Z}_{+}^{p}$ such that $x_{1}+\ldots+x_{p}=N$ . An alternative representation of the multinomial distribution can be obtained using independent Poisson random variables. Consider $p$ independent Poisson random variables, $X_{k}\sim{\rm Pois}(\lambda_{k}),k=1,\ldots,p$ . Then the vector $(X_{1},\ldots,X_{p})$ , conditional on $\mathop{\sum}_{k=1}^{p}X_{k}=N$ , follows a multinomial distribution with probability parameter $\boldsymbol{\pi}=(\lambda_{1},\ldots,\lambda_{p})/(\lambda_{1}+\ldots+\lambda_{p})$ . The re-parameterization using Poisson variables is scale invariant, i.e. the same multinomial distribution is obtained when $X_{k}\sim{\rm Pois}\left(s\lambda_{k}\right)$ for any $s>0$ . Levin [52] provides a very simple expression for the cumulative distribution function using this property,

[TABLE]

where $s>0$ is any positive number, $X_{k}\sim{\rm Pois}\left(s\pi_{k}\right)$ and $S=Y_{1}^{*}+\ldots+Y_{p}^{*}$ where $Y_{k}^{*}$ is a truncated Poisson variable, $Y_{k}^{*}\sim{\rm Pois}(s\pi_{k};\{0,\ldots,a_{k}\})$ . This alternative formulation and equation (49) reduce the computational cost of calculating the CDF significantly. Using the mass function, the calculation would require an exhaustive search over the sample space $\{\mathbf{X}:X_{1}+\ldots+X_{p}=N\}$ , which has a computational cost of exponential order with respect to $p$ .
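The Poisson representation of the multinomial CDF can be sketched as follows. The code evaluates $P(X_{1}\leq a_{1},\ldots,X_{p}\leq a_{p})=\frac{N!}{s^{N}e^{-s}}\left\{\prod_{k}P(X_{k}\leq a_{k})\right\}P(S=N)$ , computing $P(S=N)$ exactly by convolving the truncated Poisson mass functions, and compares it with brute-force enumeration on a small example. The exact convolution, the default $s=N$ and the function names are choices made for this illustration and are not taken from the cited paper.

```python
import itertools
import math
import numpy as np

def poisson_masses(lam, upper):
    """Poisson(lam) probabilities on {0, ..., upper} (not renormalized)."""
    return np.array([math.exp(-lam) * lam**x / math.factorial(x) for x in range(upper + 1)])

def levin_cdf(N, pi, a, s=None):
    """P(X_1 <= a_1, ..., X_p <= a_p) for X ~ Mult(N, pi) via independent Poissons."""
    s = float(N) if s is None else float(s)
    prob_all_le = 1.0
    pmf_S = np.array([1.0])                        # mass function of the running sum
    for pk, ak in zip(pi, a):
        masses = poisson_masses(s * pk, ak)
        mass = masses.sum()                        # P(X_k <= a_k)
        prob_all_le *= mass
        pmf_S = np.convolve(pmf_S, masses / mass)  # add a truncated Poisson Y_k*
    p_S_equals_N = pmf_S[N] if N < len(pmf_S) else 0.0
    return math.factorial(N) / (s**N * math.exp(-s)) * prob_all_le * p_S_equals_N

def brute_force_cdf(N, pi, a):
    """Same probability by summing the multinomial mass function directly."""
    total = 0.0
    for x in itertools.product(*(range(ak + 1) for ak in a)):
        if sum(x) == N:
            coef = math.factorial(N) / math.prod(math.factorial(xk) for xk in x)
            total += coef * math.prod(pk**xk for pk, xk in zip(pi, x))
    return total

pi, a, N = [0.2, 0.3, 0.5], [2, 3, 4], 5
cdf_levin = levin_cdf(N, pi, a)
cdf_brute = brute_force_cdf(N, pi, a)
```

The convolution costs a number of operations polynomial in $p$ and the bounds $a_{k}$ , whereas the enumeration grows exponentially with $p$ .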

The first two moments are functions of $\boldsymbol{\pi}$ , given by $\mathbb{E}(\mathbf{X})=N\boldsymbol{\pi}$ and ${\rm var}(\mathbf{X})=N\left\{{\rm diag}\left(\boldsymbol{\pi}\right)-\boldsymbol{\pi}\boldsymbol{\pi}^{\top}\right\}$ . The constraint on the total sum implies the variables are always negatively correlated, with ${\rm cov}\left(X_{i},X_{j}\right)=-N\pi_{i}\pi_{j}$ for $i\neq j$ . Parameter estimation for multinomial distributions is well studied. Using the added constraint $\pi_{1}+\ldots+\pi_{p}=1$ , the maximum likelihood estimates can be easily derived as

[TABLE]
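The moment structure and the estimator can be illustrated in a few lines. The sketch below uses the covariance $N\{{\rm diag}(\boldsymbol{\pi})-\boldsymbol{\pi}\boldsymbol{\pi}^{\top}\}$ , whose diagonal entries are $N\pi_{k}(1-\pi_{k})$ , and takes the maximum likelihood estimate to be the vector of observed proportions $X_{k}/N$ (the standard result, stated here as an assumption about the display above). Function names are hypothetical.

```python
import numpy as np

def multinomial_moments(N, pi):
    """Mean N*pi and covariance N*{diag(pi) - pi pi'} of Mult(N, pi)."""
    pi = np.asarray(pi, dtype=float)
    return N * pi, N * (np.diag(pi) - np.outer(pi, pi))

def multinomial_mle(x):
    """Observed proportions x_k / N, the MLE under the sum-to-one constraint."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

mean, cov = multinomial_moments(10, [0.2, 0.3, 0.5])
pi_hat = multinomial_mle([2, 3, 5])
```

Each row of the covariance matrix sums to zero because the total count is fixed, which is why the matrix is singular and the off-diagonal entries are negative.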

Starting with the works by Rao [66, 67] wherein consistency and asymptotic properties of the maximum likelihood estimator have been established, several extensions have been developed. When $\boldsymbol{\pi}$ is restricted to a convex region in the parameter space, Barmi and Dykstra [10] developed an iterative estimation method based on a primal-dual formulation of the problem. Jewell and Kalbfleisch [44] developed estimators when the multinomial parameters are ordered, i.e. $\pi_{1}\leq\pi_{2}\leq\ldots\leq\pi_{p}$ . Leonard [51] provided a Bayesian approach to parameter estimation by imposing a Dirichlet prior on the probability vector and derived the Bayesian estimates under a quadratic loss function.

When comparing two multinomial populations, $\mathbf{X}\sim{\rm Mult}(\boldsymbol{\pi}_{X})$ and $\mathbf{Y}\sim{\rm Mult}(\boldsymbol{\pi}_{Y})$ , the hypothesis of interest is

[TABLE]

Unlike the hypothesis tests in Section 2, we do not require replicates of the count vectors to construct the test statistic and study its asymptotic properties. Instead, sample sizes for (51) are $n=\mathop{\sum}_{k=1}^{p}X_{k}$ and $m=\mathop{\sum}_{k=1}^{p}Y_{k}$ . Traditional tests include the Pearson chi-squared test and the likelihood ratio test,

[TABLE]

where $\widehat{\pi}_{k}=(X_{k}+Y_{k})/(n+m),\widehat{\pi}_{Xk}=X_{k}/n,\widehat{\pi}_{Yk}=Y_{k}/m,\widehat{X}_{k}=n\widehat{\pi}_{k}$ and $\widehat{Y}_{k}=m\widehat{\pi}_{k}$ . Asymptotically, the test statistics follow a chi-squared distribution with $p-1$ degrees of freedom under $H_{0}$ . When $p$ is fixed, Hoeffding [38] provided asymptotically optimal tests for (51). Furthermore, he also provided conditions under which $T_{LRT}$ has superior performance compared to $T_{Pearson}$ . Morris [59] provided a general framework for deriving the limiting distributions of any general sums of the form

[TABLE]

when $\{f_{k},k=1,\ldots,p\}$ are polynomials of bounded degree, which generalize $T_{Pearson}$ and $T_{LRT}$ . For a comprehensive review of tests, refer to [9] and the references therein.
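The two classical statistics can be computed from the pooled estimates defined above. The sketch below uses the usual textbook forms, $T_{Pearson}=\sum_{k}\{(X_{k}-\widehat{X}_{k})^{2}/\widehat{X}_{k}+(Y_{k}-\widehat{Y}_{k})^{2}/\widehat{Y}_{k}\}$ and $T_{LRT}=2\sum_{k}\{X_{k}\log(X_{k}/\widehat{X}_{k})+Y_{k}\log(Y_{k}/\widehat{Y}_{k})\}$ , which are assumed to match the displayed equations. Categories that are empty in both samples are dropped, and $0\log 0$ is taken to be zero.

```python
import numpy as np

def two_sample_multinomial_tests(X, Y):
    """Return (T_Pearson, T_LRT) for two count vectors over the same p categories."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, m = X.sum(), Y.sum()
    pooled = (X + Y) / (n + m)                   # pi_hat_k under H0
    keep = pooled > 0                            # drop categories empty in both samples
    X, Y, pooled = X[keep], Y[keep], pooled[keep]
    X_hat, Y_hat = n * pooled, m * pooled        # expected counts under H0

    pearson = np.sum((X - X_hat) ** 2 / X_hat + (Y - Y_hat) ** 2 / Y_hat)

    def obs_log_ratio(obs, exp):
        out = np.zeros_like(obs)
        pos = obs > 0                            # convention: 0 * log(0) = 0
        out[pos] = obs[pos] * np.log(obs[pos] / exp[pos])
        return out

    lrt = 2.0 * np.sum(obs_log_ratio(X, X_hat) + obs_log_ratio(Y, Y_hat))
    return float(pearson), float(lrt)

same = two_sample_multinomial_tests([3, 5, 2], [3, 5, 2])
disjoint = two_sample_multinomial_tests([10, 0], [0, 10])
```

Identical count vectors give zero for both statistics, while samples concentrated on different categories give large values.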

Distributional properties of these tests hold when all the counts are large, i.e. $X_{k}>0$ and $Y_{k}>0$ , and the number of categories $p$ is smaller than $n+m$ . When $p$ is larger than $n+m$ we encounter sparsity. This is because the minimum number of zero elements will be $p-(n+m)$ . Results derived by Morris hold when $p$ and $n+m$ both increase. When the data is large and sparse, i.e. $p>n+m$ , Zelterman [85] derived the mean and standard deviation of $T_{Pearson}$ and normalized the test statistic to construct an asymptotically normal test statistic. Using the $\ell_{1}$ norm of the difference, $\|\boldsymbol{\pi}_{X}-\boldsymbol{\pi}_{Y}\|_{1}=\mathop{\sum}_{k=1}^{p}|\pi_{Xk}-\pi_{Yk}|$ , and the squared Euclidean norm $\|\boldsymbol{\pi}_{X}-\boldsymbol{\pi}_{Y}\|_{2}^{2}=\mathop{\sum}_{k=1}^{p}(\pi_{Xk}-\pi_{Yk})^{2}$ , Chan et al. [20] proposed the following functionals to use as test statistics:

[TABLE]

However, the sampling distributions of $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ were not provided. Instead, permutation-based cut-offs need to be calculated to perform inference.

Studying the asymptotic properties of such functionals, Plunkett and Park [65] constructed a test statistic, given by

[TABLE]

The test statistic was shown to be asymptotically normal under the following conditions:

(PP I)

$\min(n,m)\rightarrow\infty$ and $n/(n+m)\rightarrow c\in(0,1)$ . This condition is the same as (BS II), (SD II) and (PA II).

(PP II)

The probabilities are not concentrated, i.e.

[TABLE]

This condition ensures that the number of components with non-zero probabilities is not bounded. For example, we cannot have $\boldsymbol{\pi}_{X}=(1/m,\ldots,1/m,0,\ldots,0)$ where the number of non-zero elements is equal to $m$ because $\max_{k}\pi_{Xk}^{2}=1/m^{2}$ and $\|\boldsymbol{\pi}_{X}\|_{2}^{2}=1/m$ resulting in the ratio being equal to $1/m$ .

(PP III)

The sample sizes $n$ and $m$ and dimension $p$ are restricted as

[TABLE]

To better understand this condition, consider $\boldsymbol{\pi}_{X}+\boldsymbol{\pi}_{Y}=(1/p,\ldots,1/p)$ . Then $(n+m)\|\boldsymbol{\pi}_{X}+\boldsymbol{\pi}_{Y}\|_{2}^{2}=(n+m)/p$ which implies $p$ can increase at most linearly with respect to $n$ .

(PP IV)

Asymptotic normality is valid in the local alternative

[TABLE]

4.2 Compound Multinomial models

Consider $n$ multivariate count vectors of dimension $p$ , $\mathbf{X}_{1},\ldots,\mathbf{X}_{n}$ . Such data commonly arise when multiple samples are collected, e.g. gene expression counts of $p$ genes collected from $n$ specimens. One common criticism of the standard multinomial distribution is that it does not address over-dispersion in the data. If we consider that the count vectors are i.i.d. from ${\rm Mult}(\boldsymbol{\pi})$ , we are inadvertently assuming that the population is homogeneous. To account for heterogeneity in the population, it is advisable to assume a model with sample-specific parameters,

[TABLE]

The heterogeneity can further be modeled using a distribution on the $p$ -dimensional simplex $\mathcal{S}_{p}=\{\boldsymbol{\pi}\in\mathbb{R}^{p}:\pi_{k}\geq 0,\pi_{1}+\cdots+\pi_{p}=1\}$ . In the univariate case, the beta distribution is the natural choice for the distribution on $\mathcal{S}_{2}$ . Extending to $p$ dimensions, the natural extension is the multivariate beta distribution, also known as the Dirichlet distribution.

The Dirichlet distribution is characterized by a single parameter $\boldsymbol{\theta}=(\theta_{1},\ldots,\theta_{p})$ , with density function

[TABLE]

where $\theta_{0}=\theta_{1}+\cdots+\theta_{p}$ and $\Gamma(\cdot)$ is the gamma function. The compound Dirichlet-Multinomial (DirMult) distribution, constructed as the marginal of $\mathbf{X}_{i}|\boldsymbol{\pi}_{i}\sim{\rm Mult}(\boldsymbol{\pi}_{i})$ and $\boldsymbol{\pi}_{i}\sim{\rm Dir}(\boldsymbol{\theta})$ , has the density function given by

[TABLE]

where $X_{0}=X_{1}+\cdots+X_{p}$ . The DirMult model was first introduced by Mosimann, who derived the properties of the distribution. The mean and variance of the DirMult distribution are $\mathbb{E}(\mathbf{X})=X_{0}\theta_{0}^{-1}\boldsymbol{\theta}$ and ${\rm var}(\mathbf{X})=n\left\{\theta_{0}^{-1}{\rm diag}(\boldsymbol{\theta})-\theta_{0}^{-2}(X_{0}+\theta_{0})/(1+\theta_{0})\boldsymbol{\theta}\boldsymbol{\theta}^{\top}\right\}$ . The variance matrix is the sum of a full-rank matrix (diagonal part) and a rank-one matrix. Using the result from Miller [57], the precision matrix can be calculated in closed form as

[TABLE]
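As a numerical illustration of the over-dispersion induced by the compound model, the following Python sketch draws from the DirMult distribution through its two-stage hierarchy (first $\boldsymbol{\pi}\sim{\rm Dir}(\boldsymbol{\theta})$ , then $\mathbf{X}|\boldsymbol{\pi}\sim{\rm Mult}(\boldsymbol{\pi})$ ) and compares the empirical moments with the mean $X_{0}\boldsymbol{\theta}/\theta_{0}$ and with the standard single-component variance $X_{0}p_{k}(1-p_{k})(X_{0}+\theta_{0})/(1+\theta_{0})$ , where $p_{k}=\theta_{k}/\theta_{0}$ . The parameter values, the seed and the function names are illustrative assumptions and are not taken from the cited references.

```python
import random

def sample_dirichlet(theta, rng):
    """Draw pi ~ Dir(theta) by normalizing independent Gamma(theta_k, 1) draws."""
    gammas = [rng.gammavariate(t, 1.0) for t in theta]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_dirmult(theta, x0, rng):
    """Draw X ~ DirMult(theta) with total count x0 via the two-stage hierarchy."""
    pi = sample_dirichlet(theta, rng)
    counts = [0] * len(theta)
    for k in rng.choices(range(len(theta)), weights=pi, k=x0):
        counts[k] += 1
    return counts

rng = random.Random(2019)      # fixed seed, chosen arbitrarily
theta = [2.0, 3.0, 5.0]        # hypothetical Dirichlet parameter
x0 = 20                        # hypothetical total count per sample
draws = [sample_dirmult(theta, x0, rng) for _ in range(10000)]

theta0 = sum(theta)
p = [t / theta0 for t in theta]
emp_mean = [sum(d[k] for d in draws) / len(draws) for k in range(len(theta))]
theo_mean = [x0 * pk for pk in p]

# Variance of the first component: empirical, multinomial, and DirMult.
emp_var0 = sum((d[0] - emp_mean[0]) ** 2 for d in draws) / (len(draws) - 1)
mult_var0 = x0 * p[0] * (1 - p[0])
dirmult_var0 = mult_var0 * (x0 + theta0) / (1 + theta0)

print("mean     :", emp_mean, "vs", theo_mean)
print("variance :", emp_var0, "multinomial", mult_var0, "DirMult", dirmult_var0)
```

The empirical variance should sit well above the multinomial value and close to the inflated DirMult value, which is the over-dispersion that motivates the compound model.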

For parameter estimation, the likelihood function of (55) does not admit a closed-form maximizer in $\boldsymbol{\theta}$ . An approximate solution can be obtained using iterative methods such as the Newton-Raphson algorithm. One convenient feature for computation is that the matrix of second-order derivatives of the log-likelihood function has an inverse in closed form (Sklar [72]). Thus each Newton-Raphson step has a linear computation cost. When $p$ is larger than $X_{0}$ , Danaher [28] derived parameter estimates using the beta-binomial marginals and established their consistency.
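Any iterative maximizer needs the log-likelihood itself. The sketch below evaluates the standard DirMult log mass function with log-gamma terms, which avoids overflow in the gamma function, and checks that the mass function sums to one over all count vectors with a fixed total. It is only the objective function, not the Newton-Raphson scheme of Sklar [72]; the function names and numerical values are assumptions made for illustration.

```python
from math import lgamma, exp

def dirmult_logpmf(x, theta):
    """Log mass of one count vector x under DirMult(theta)."""
    x0 = sum(x)
    theta0 = sum(theta)
    out = lgamma(x0 + 1) + lgamma(theta0) - lgamma(x0 + theta0)
    for xk, tk in zip(x, theta):
        out += lgamma(xk + tk) - lgamma(tk) - lgamma(xk + 1)
    return out

def dirmult_loglik(data, theta):
    """Log-likelihood of independent count vectors sharing the parameter theta."""
    return sum(dirmult_logpmf(x, theta) for x in data)

theta = [1.5, 0.7, 2.2]   # hypothetical parameter, p = 3
x0 = 6                    # hypothetical total count

# Enumerate every (a, b, x0 - a - b) and add up the probabilities.
total = sum(exp(dirmult_logpmf((a, b, x0 - a - b), theta))
            for a in range(x0 + 1) for b in range(x0 + 1 - a))
print(total)
```

A generic optimizer can then be applied to `dirmult_loglik`, for instance after the reparameterization $\theta_{k}=\exp(\eta_{k})$ to keep the iterates positive.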

While the density function is known to be globally convex, maximization can still lead us to a local maximum. A proper initial value specification is essential for good performance of the estimator. The choice of optimal initial values has been an area of considerable interest, even for the Dirichlet distribution. The challenge lies in the fact that the method of moments (MM) estimator is not unique. This is because of the scaling in $\mathbb{E}(X_{k})=X_{0}\theta_{k}/\theta_{0}$ , which gives both $\widehat{\boldsymbol{\theta}}$ and $c\widehat{\boldsymbol{\theta}}$ as MM estimates for any $c>0$ . Ronning [68] proposed using the same initial value for all elements, $\widehat{\theta}_{k}=\min\limits_{ij}X_{ij}$ . This proposal was based on the observation that starting from the method of moments estimates can lead to Newton-Raphson updates becoming inadmissible, i.e. $\widehat{\theta}_{k}<0$ for some $k$ . Hariharan [37] carried out a comprehensive comparison of the different initial values under several models. However, the study concluded that none of the methods performs uniformly well across all the models.
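The scale ambiguity of the first-moment equations can be seen numerically. In the sketch below, $\boldsymbol{\theta}$ and $c\boldsymbol{\theta}$ give identical means, while the standard DirMult component variance $X_{0}p_{k}(1-p_{k})(X_{0}+\theta_{0})/(1+\theta_{0})$ , with $p_{k}=\theta_{k}/\theta_{0}$ , changes with $c$ . The values and function names are hypothetical and serve only to illustrate the point.

```python
def dirmult_mean(theta, x0):
    """Mean vector X0 * theta / theta0 of the DirMult distribution."""
    theta0 = sum(theta)
    return [x0 * t / theta0 for t in theta]

def dirmult_var(theta, x0, k):
    """Variance of the k-th component of the DirMult distribution."""
    theta0 = sum(theta)
    pk = theta[k] / theta0
    return x0 * pk * (1 - pk) * (x0 + theta0) / (1 + theta0)

x0 = 20                              # hypothetical total count
theta = [2.0, 3.0, 5.0]              # hypothetical parameter
scaled = [10.0 * t for t in theta]   # c * theta with c = 10

print(dirmult_mean(theta, x0), dirmult_mean(scaled, x0))      # same means
print(dirmult_var(theta, x0, 0), dirmult_var(scaled, x0, 0))  # different variances
```

The first moments therefore identify only the proportions $\theta_{k}/\theta_{0}$ ; information on the overall scale $\theta_{0}$ has to come from the dispersion of the counts.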

The Dirichlet-Multinomial model has been applied to study multivariate count data in several applications in biomedical research. In metagenomics, the study of bacterial composition of environmental (biological or ecological) samples, we are interested in modeling the abundance of different species of bacteria in samples. The Dirichlet-Multinomial model is apt for such data because (i) abundances of bacteria are constrained by the total number of bacteria sampled in the specimen and (ii) over-dispersion due to environmental variability is accounted for. Holmes et al. [40] used a Dirichlet multinomial mixture model to cluster samples by abundance profile, i.e. the DirMult parameter. Chen and Li [21] developed an $\ell_{1}$ -penalized parameter estimation for variable selection in the DirMult model. Sun et al. [80] used the DirMult model to construct a clustering algorithm for single-cell RNA-seq data.

The most celebrated application of the DirMult distribution is latent Dirichlet allocation (LDA), introduced by Blei et al. [15]. Developed for text mining to classify documents by keywords, the model is a hierarchical Bayesian model with three levels. First, the $p$ elements of $\mathbf{X}$ represent the words in the vocabulary. A word is represented as $\mathbf{X}=(x_{1},\ldots,x_{p})$ where $x_{k}\in\{0,1\}$ for all $k=1,\ldots,p$ and $\mathop{\sum}_{k=1}^{p}x_{k}=1$ . A collection of words represents a topic, which can be used to classify documents. The topic is also a multinomial variable $\mathbf{T}=(t_{1},\ldots,t_{q})$ with $t_{k}\in\{0,1\}$ for all $k=1,\ldots,q$ . The number of topics, $q$ , is assumed to be fixed. It should be noted that while the words in the vocabulary are defined and observed, the topic corresponding to a word is a latent variable. Second, each document is defined as a sequence of $N$ words, $\mathcal{X}=\{\mathbf{X}_{1},\ldots,\mathbf{X}_{N}\}$ . The number of words in a document is assumed to have a Poisson distribution ( $N\sim{\rm Pois}(\lambda)$ ) and the topics follow a multinomial distribution with a document-specific parameter. Finally, a corpus is defined as a collection of $M$ documents, $\mathcal{D}_{N}=\{\mathcal{X}_{1},\ldots,\mathcal{X}_{M}\}$ .

The LDA model is parameterized as follows. Each document is characterized by the probability of its keywords $\boldsymbol{\theta}_{m}$ , $\mathbf{T}\sim{\rm Mult}(\boldsymbol{\theta}_{m})$ . The probability parameters are assumed to follow a Dirichlet distribution, $\boldsymbol{\theta}_{m}\sim{\rm Dir}(\boldsymbol{\alpha}),m=1,\ldots,M$ . Conditional on the latent topics $\mathbf{T}$ , $\pi_{kt}=P\left(X_{k}=1|T_{t}=1\right)$ denotes the probability that the $k^{\rm th}$ word in the vocabulary is observed, provided the word describes the $t^{\rm th}$ topic. The collection of all such probabilities is parameterized as a $p\times q$ matrix $\boldsymbol{\Pi}=(\pi_{kt}:k=1,\ldots,p;t=1,\ldots,q)$ . Using these components, the complete likelihood can be written as

[TABLE]

In this model, $f(\cdot)$ is the Dirichlet density function, $g(\cdot)$ is the multinomial mass function and $h(\cdot)$ is obtained from $\boldsymbol{\Pi}$ . Parameter estimation is done by maximizing the likelihood using the expectation-maximization (EM) algorithm, conditioning on the latent keywords.
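The three-level hierarchy may be easier to follow as a generative simulation. The sketch below generates one document: topic proportions are drawn from the Dirichlet, a latent topic is drawn for every word position, and each word is then drawn from the distribution of its topic. As simplifying assumptions, the number of words is fixed rather than Poisson distributed, the word probabilities are stored with one row per topic (the transpose of $\boldsymbol{\Pi}$ ), and all numerical values and function names are hypothetical. The code only simulates from the model and does not perform estimation.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions theta_m ~ Dir(alpha) via normalized gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def generate_document(alpha, word_probs, n_words, rng):
    """Generate one document; word_probs[t][k] = P(word k | topic t)."""
    theta_m = sample_dirichlet(alpha, rng)
    # Latent topic for each word position, T ~ Mult(theta_m).
    topics = rng.choices(range(len(alpha)), weights=theta_m, k=n_words)
    # Observed word given its latent topic.
    vocab = range(len(word_probs[0]))
    words = [rng.choices(vocab, weights=word_probs[t])[0] for t in topics]
    return theta_m, topics, words

rng = random.Random(7)
alpha = [0.5, 0.5]                       # q = 2 topics
word_probs = [[0.7, 0.2, 0.1, 0.0],      # topic 0 over a vocabulary of p = 4 words
              [0.0, 0.1, 0.2, 0.7]]      # topic 1
theta_m, topics, words = generate_document(alpha, word_probs, 12, rng)
print(theta_m)
print(topics)
print(words)
```

Only `words` would be observed in practice; `theta_m` and `topics` are the latent quantities that the EM algorithm has to integrate over.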

The major focus of LDA research has been on developing faster algorithms (Hoffman et al. [39]) that can analyze corpora with a large number of documents. Mimno et al. [58] considered sparsity in the model from the Gibbs sampling perspective to improve the efficiency of the algorithm. However, most of the research has been from a machine learning and estimation perspective. Statistical properties of the estimators, which could be of potential interest for developing hypothesis tests, have not been established. One potential problem of interest could be comparing the Dirichlet parameters of two corpora,

[TABLE]

In the computer science literature, the focus has been on developing methods for efficient analysis of corpora with a large number of documents. Sample size is known to affect the accuracy of the allocation (Crossley et al. [25]). A large $p$ small $n$ problem in this context would be efficient classification of a small number of documents (small $M$ ) with a large vocabulary (large $p$ ). Understanding the efficiency of LDA in such large $p$ small $n$ scenarios is an open area of research.

4.3 Other distributions

The Dirichlet-Multinomial is a natural extension of the univariate beta-binomial distribution, which is the marginal distribution of each component of the DirMult distribution. This observation raises the following question: can we develop multivariate count distributions with known marginals? The theoretical answer to this question is to use Sklar’s theorem (Nelson [62]) and construct a copula to model the joint distribution. However, parametric inference such as hypothesis testing is very tedious and sometimes intractable when using copula models. In this section, we shall look at some multivariate extensions of known univariate distributions which have useful parameterizations and are amenable to inference.

4.3.1 Bernoulli distribution

One of the earliest generalizations of the Bernoulli distribution using a parametric approach was developed by Teugels [81], who constructed the moment generating function of the multivariate Bernoulli distribution using the moments of all orders $k=1,\ldots,p$ . Teugels also provided an extension to the multivariate binomial distribution using the sum of independent Bernoulli variables. Using the joint probabilities, Dai et al. [26] proposed a multivariate Bernoulli distribution which has an analytical form of the mass function. Before generalizing the multivariate Bernoulli distribution, consider the case where the elements of the variable $\mathbf{X}=(X_{1},\ldots,X_{p})$ are independent with $X_{k}\sim{\rm Ber}(\pi_{k}),k=1,\ldots,p$ . Then the joint probability of $\mathbf{X}=\mathbf{x}$ is given by

$$P(\mathbf{X}=\mathbf{x})=\prod_{k=1}^{p}\pi_{k}^{x_{k}}(1-\pi_{k})^{1-x_{k}}.$$

When the variables are dependent, the joint probability cannot be factored into the product of marginals. Using the joint probabilities, the mass function can be defined as

[TABLE]

where $\pi_{00\ldots 0}=P(X_{1}=0,\ldots,X_{p}=0)$ and so on. The marginals of $\mathbf{X}$ are Bernoulli with cumulative probability,

[TABLE]

Using this formulation, Dai et al. [26] computed the moments and calculated maximum likelihood estimates using the Newton-Raphson algorithm. However, the main drawback is the dimension of the parameter space. To define the multivariate Bernoulli mass function, we require a total of $2^{p}-1$ parameters, which can be computationally infeasible for higher dimensions.
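To make the bookkeeping concrete, the sketch below stores a multivariate Bernoulli distribution as a table of joint probabilities indexed by binary vectors, recovers each marginal success probability by summing the joint probabilities of all outcomes with $x_{k}=1$ , and counts the free parameters. The joint table is a made-up example for $p=3$ and the function names are assumptions.

```python
from itertools import product

def marginal_probabilities(joint):
    """P(X_k = 1) for each k, from a dict {binary tuple: probability}."""
    p = len(next(iter(joint)))
    return [sum(prob for x, prob in joint.items() if x[k] == 1) for k in range(p)]

def n_free_parameters(p):
    """Number of free joint probabilities: 2^p cells minus one sum constraint."""
    return 2 ** p - 1

# Hypothetical dependent joint distribution for p = 3:
# the weight of an outcome grows with the number of ones.
weights = {x: 1 + sum(x) for x in product((0, 1), repeat=3)}
total = sum(weights.values())
joint = {x: w / total for x, w in weights.items()}

print(marginal_probabilities(joint))
for p in (3, 10, 20, 30):
    print(p, n_free_parameters(p))
```

Already at $p=30$ the table has more than a billion cells, which is why restricted parameterizations are needed in high dimensions.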

4.3.2 Binomial distribution

The bivariate binomial distribution (BBD) was first introduced by Aitken and Gonin [2] in the context of analyzing $2\times 2$ contingency tables when the two outcomes are not independent. Several extensions have been provided since, including work by Krishnamoorthy [48], who derived the properties of the BBD by extending the moment-generating function from the independent case to dependent variables. Hudson and Tucker [42] established limit theorems for the BBD by expressing it as a sum of independent multivariate Bernoulli variables. Several other researchers have discussed the properties of the BBD. For a recent list of publications, please refer to Biswas and Hwang [14] and the references therein. The multivariate binomial distribution (MBD) also suffers from the same curse of dimensionality as the Bernoulli distribution. The total number of parameters required to define the $p$ -dimensional distribution is equal to $2^{p}-1$ .

The multivariate binomial distribution poses several questions that still need to be answered. For instance, it would be of interest to simplify the distribution for a restricted parameter set. If we assume only $k$ -fold interactions are feasible, then the model can be reduced to have $2^{k}-1$ parameters. The generalized additive and multiplicative binomial distribution models proposed by Altham [3] can serve as motivation for building such reduced models. The MBD can also be used to model several data sets in genomics. For example, when studying epigenomic modifications such as DNA methylation, co-methylation (mutual methylation of pairs of genes) is actively studied for understanding its association with different phenotypes (outcomes). The MBD can be used to model the joint probability of methylation of pairs of genes. However, the major bottleneck that needs to be solved first is the computational complexity. Currently, there are no existing tools to compute and model the MBD. With improved computational capabilities, this task should become feasible.

4.3.3 Poisson distribution

Constructing a multivariate Poisson distribution whose marginals are univariate Poisson variables is fairly easy. Consider the bivariate case. If $Z_{k}\sim{\rm Pois}(\lambda_{k}),k=1,2,3$ are independent Poisson variables, then $\mathbf{X}=(X_{1},X_{2})$ defined as

$$X_{1}=Z_{1}+Z_{3},\qquad X_{2}=Z_{2}+Z_{3},\tag{59}$$

gives a bivariate distribution with Poisson marginals, $X_{1}\sim{\rm Pois}(\lambda_{1}+\lambda_{3})$ and $X_{2}\sim{\rm Pois}(\lambda_{2}+\lambda_{3})$ . The joint mass function can be expressed as

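$$P(X_{1}=x_{1},X_{2}=x_{2})=e^{-(\lambda_{1}+\lambda_{2}+\lambda_{3})}\sum_{k=0}^{\min(x_{1},x_{2})}\frac{\lambda_{1}^{x_{1}-k}\,\lambda_{2}^{x_{2}-k}\,\lambda_{3}^{k}}{(x_{1}-k)!\,(x_{2}-k)!\,k!}.\tag{60}$$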
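A short numerical check can make the bivariate construction concrete. The sketch below assumes the common-component form $X_{1}=Z_{1}+Z_{3}$ , $X_{2}=Z_{2}+Z_{3}$ , which is consistent with the stated marginals, and evaluates the joint mass function by summing over the value $k$ of the shared component $Z_{3}$ . It then verifies on a truncated grid that the probabilities sum to one, that the means are $\lambda_{1}+\lambda_{3}$ and $\lambda_{2}+\lambda_{3}$ , and that the covariance equals $\lambda_{3}$ . The rates, the truncation point and the function name are illustrative assumptions.

```python
from math import exp, factorial

def bivariate_poisson_pmf(x1, x2, lam1, lam2, lam3):
    """P(X1 = x1, X2 = x2) when X1 = Z1 + Z3 and X2 = Z2 + Z3, Z_k ~ Pois(lam_k)."""
    s = 0.0
    for k in range(min(x1, x2) + 1):   # k is the value of the shared component Z3
        s += (lam1 ** (x1 - k) / factorial(x1 - k)
              * lam2 ** (x2 - k) / factorial(x2 - k)
              * lam3 ** k / factorial(k))
    return exp(-(lam1 + lam2 + lam3)) * s

lam1, lam2, lam3 = 1.0, 2.0, 0.5   # hypothetical rates
grid = range(40)                   # truncation; the tail mass beyond it is negligible here
pmf = {(a, b): bivariate_poisson_pmf(a, b, lam1, lam2, lam3)
       for a in grid for b in grid}

total = sum(pmf.values())
mean1 = sum(a * q for (a, b), q in pmf.items())
mean2 = sum(b * q for (a, b), q in pmf.items())
cov = sum(a * b * q for (a, b), q in pmf.items()) - mean1 * mean2
print(total, mean1, mean2, cov)
```

The inner sum has $\min(x_{1},x_{2})+1$ terms, so a single evaluation is cheap in two dimensions; the cost grows quickly once more shared components are introduced.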

Extending to $p$ dimensions, the multivariate Poisson is defined through the latent $Z_{k}$ ’s as

[TABLE]

where $Z_{jk}\sim{\rm Pois}(\lambda_{jk})$ . Expressing the latent variables in matrix form $(Z_{jk})_{j,k=1,\ldots,p}$ , defining $\mathbf{X}$ requires $p(p+1)/2$ independent latent components. The mass function can be expressed as $p(p-1)/2$ summations and is computationally intensive for even moderate values of $p$ . A more general form of the multivariate Poisson requires $2^{p}-1$ latent components and is infeasible to express as in equation (60). The following trivariate Poisson should serve as a basic overview of the idea:

[TABLE]

The main drawback with this formulation of multivariate Poisson distribution is its restrictive dependence structure. In the bivariate case, the correlation between $X_{1}$ and $X_{2}$ is given by

$${\rm corr}(X_{1},X_{2})=\frac{\lambda_{3}}{\sqrt{(\lambda_{1}+\lambda_{3})(\lambda_{2}+\lambda_{3})}},$$

which is always positive. To extend the distribution to a larger class of correlation structures, Shin and Pasupathy [71] proposed using the normal to anything (NORTA) algorithm [19] for random number generation from a multivariate Poisson distribution with negative correlations. They define the iterative procedure for generating bivariate Poisson variables with correlation $\rho$ as follows. Let $U_{1},U_{2},U_{3}\sim U(0,1)$ be *i.i.d.* variables. A bivariate Poisson distribution with marginals $X_{1}\sim{\rm Pois}(\lambda_{1})$ and $X_{2}\sim{\rm Pois}(\lambda_{2})$ can be obtained using

[TABLE]

where $F^{-1}_{\lambda}(x)=\inf\{y:F_{\lambda}(y)\geq x\}$ is the inverse Poisson cumulative distribution function with parameter $\lambda$ . The parameter $\lambda^{*}$ lies in $[0,\lambda_{1}]$ , assuming $\lambda_{1}\leq\lambda_{2}$ . If $\lambda_{1}>\lambda_{2}$ , $X_{1}$ and $X_{2}$ can be interchanged. While this formulation gives a method for generating random samples from bivariate Poisson variables with negative correlations, it is unusable for inference as the likelihood function is not available. Obtaining the likelihood function for the bivariate case using (62) and estimating parameters with the derived likelihood are open problems in using this construction of multivariate Poisson variables.
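The inverse cumulative distribution function is the basic building block of such generators. The sketch below implements $F^{-1}_{\lambda}$ by accumulating Poisson probabilities until the target level is reached, and then uses the simplest possible coupling, feeding $U$ and $1-U$ to the two marginals, to show that negatively correlated Poisson pairs are attainable. This antithetic coupling is a stand-in chosen for brevity; it is not the procedure of Shin and Pasupathy [71], and the rates, seed and function name are assumptions.

```python
from math import exp
import random

def poisson_quantile(u, lam):
    """Smallest integer y with F_lam(y) >= u, for a level u in [0, 1]."""
    y = 0
    pmf = exp(-lam)
    cdf = pmf
    # Stop once the level is reached, or once the terms underflow to zero.
    while cdf < u and pmf > 0.0:
        y += 1
        pmf *= lam / y
        cdf += pmf
    return y

rng = random.Random(11)
lam1, lam2 = 3.0, 5.0          # hypothetical marginal rates
pairs = []
for _ in range(5000):
    u = rng.random()
    pairs.append((poisson_quantile(u, lam1), poisson_quantile(1.0 - u, lam2)))

m1 = sum(a for a, _ in pairs) / len(pairs)
m2 = sum(b for _, b in pairs) / len(pairs)
cov = sum((a - m1) * (b - m2) for a, b in pairs) / (len(pairs) - 1)
print(m1, m2, cov)
```

Each coordinate keeps its Poisson marginal because it is an inverse-transform draw, while the sample covariance comes out negative, something the common-component construction cannot produce.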

Karlis [47] developed another approach to characterize multivariate Poisson random variables by compounding independent Poisson components through a multivariate distribution on their parameters. If $\boldsymbol{\lambda}=(\lambda_{1},\ldots,\lambda_{p})\sim\mathcal{G}(\Theta)$ follows a multivariate distribution, then a dependence structure on $\mathbf{X}$ can be imposed by taking a mixture of independent Poisson distributions with $\mathcal{G}$ ,

[TABLE]

A popular choice for $\mathcal{G}$ is the log-normal distribution, since the distribution should be defined on $\mathbb{R}^{p}_{+}$ . This formulation has two advantages. First, the covariance structure on $\boldsymbol{\lambda}$ will impart a dependence structure on $\mathbf{X}$ . Second, the mixture model ensures that the variance of $X_{k}$ is greater than its mean $\mathbb{E}(\lambda_{k})$ for all components, thereby addressing the issue of over-dispersion. For more details, readers may refer to Inouye et al. [43] and the references therein.
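Both advantages can be checked by simulation. The sketch below draws bivariate counts from a Poisson mixture whose rates are log-normal with negatively correlated logarithms, and then reports the sample mean, variance and covariance. The Poisson sampler is the elementary multiplication-of-uniforms method, adequate only for moderate rates; the log-normal parameters, the seed and the function names are all illustrative assumptions rather than settings from Karlis [47].

```python
from math import exp, sqrt
import random

def sample_poisson(lam, rng):
    """Draw from Pois(lam) by multiplying uniforms until the product drops below exp(-lam)."""
    threshold = exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_poisson_lognormal(mu, sigma, rho, rng):
    """One bivariate draw: log-rates are bivariate normal with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    lam1 = exp(mu + sigma * z1)
    lam2 = exp(mu + sigma * z2)
    return sample_poisson(lam1, rng), sample_poisson(lam2, rng)

rng = random.Random(5)
draws = [sample_poisson_lognormal(1.0, 0.5, -0.8, rng) for _ in range(20000)]

n = len(draws)
mean1 = sum(a for a, _ in draws) / n
mean2 = sum(b for _, b in draws) / n
var1 = sum((a - mean1) ** 2 for a, _ in draws) / (n - 1)
cov = sum((a - mean1) * (b - mean2) for a, b in draws) / (n - 1)
print(mean1, var1, cov)
```

The sample variance exceeds the sample mean, reflecting ${\rm var}(X_{k})=\mathbb{E}(\lambda_{k})+{\rm var}(\lambda_{k})$ , and the sample covariance is negative because ${\rm cov}(X_{1},X_{2})={\rm cov}(\lambda_{1},\lambda_{2})$ under the mixture.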

Multivariate Poisson distributions are fairly new and have many open problems. The framework for hypothesis testing is not extensively developed, and there is very limited literature in this regard. For example, Stern [79] developed a test for the bivariate Poisson model in (59), testing $H_{0}:\lambda_{3}=0$ versus $H_{A}:\lambda_{3}\neq 0$ using a Bayesian significance test. Testing hypotheses comparing two or more multivariate Poisson families has not been addressed. High dimensional tools for the multivariate Poisson are extremely hard to develop due to the exponential computation cost: $2^{p}-1$ latent variables are required to define the distribution. Restricted models, such as using only pairwise correlations in (61), have a quadratic computation cost and are easier to study. These could potentially be a good starting point for studying the complete model.

5 Conclusion

High dimensional inference is a very exciting field of statistics with many theoretical challenges and practical uses. The availability of large-scale and high-dimensional data is increasing by leaps and bounds. Conducting large-scale analysis has become practical with the availability of high performance computing facilities. There is an urgent need to develop statistical tools that can tackle these large dimensional data sets efficiently and accurately. Statistical methodology and computational tools need to progress in conjunction with each other, leaving the onus on statisticians to develop more accurate methods for estimation and inference.

In this chapter, we have addressed three areas of high dimensional inference that are being actively developed. Hypothesis testing for the population mean is one of the more standard inference problems and has been well studied in high dimensions. We presented the two main approaches: asymptotics-based tests and random projection based tests. The asymptotics-based tests have been fairly well studied in comparison to the random projection based tests. Projection into lower-dimensional spaces using random matrices is an active area of research in mean vector testing. We should consider other methods for dimension reduction to study their use in high dimensional inference. Convolutional neural networks (CNN) [34], which are commonly used in deep learning, are another exciting dimension reduction technique that is currently not used for high dimensional inference.

Sparse covariance matrix estimation has found practical use in understanding the graphical network structure of variables in high dimensions. We looked at different approaches to construct the regularization and the computational tools developed for optimization. While Gaussianity of variables is commonly assumed in sparse precision matrix estimation due to its properties, extension to non-Gaussian distributions remains to be studied. We have also looked at hypothesis testing for comparing two or more covariance matrices in the high dimensional setting. One approach that is lacking is the use of random projections in covariance matrix testing. This poses an interesting challenge for assessing the versatility of random projections in high dimensional inference.

Finally, we looked at the development of discrete multivariate models and the challenges therein. Only two distributions have been extensively studied: the multinomial and the Dirichlet-multinomial. We looked at high-dimensional hypothesis tests for the multinomial parameters. The hierarchical models and sparse regression models for the Dirichlet-multinomial distribution are also well studied. However, a lot of work needs to be done for other distributions. The theoretical developments in multivariate Bernoulli models need to be supplemented with computational tools for estimation and inference. A generalized multivariate Poisson distribution needs to be developed, which can lead to potential extensions such as multivariate Poisson-gamma mixtures.

Bibliography

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003. ISSN 00220000. doi: 10.1016/S0022-0000(03)00025-4.
[2] A. C. Aitken and H. T. Gonin. XI.—On Fourfold Sampling with and without Replacement. Proceedings of the Royal Society of Edinburgh, 55:114–125, 1936. doi: 10.1017/S0370164600014413.
[3] P. M. E. Altham. Two Generalizations of the Binomial Distribution. Journal of the Royal Statistical Society. Series C (Applied Statistics), 27(2):162–167, 1978. ISSN 00359254. doi: 10.2307/2346943. URL http://www.jstor.org/stable/2346943.
[4] T. W. Anderson. An Introduction to Multivariate Statistical Analysis, 3rd edition. John Wiley and Sons, 2003.
[5] D. N. Ayyala, D. E. Frankhouser, G. Marcucci, J.-O. Ganbat, P. Yan, R. Bundschuh, and S. Lin. Statistical methods for detecting differentially methylated regions based on MethylCap-seq data. Briefings in Bioinformatics, 17(6):926–937, 2015. ISSN 1467-5463. doi: 10.1093/bib/bbv089. URL https://doi.org/10.1093/bib/bbv089.
[6] D. N. Ayyala, J. Park, and A. Roy. Mean vector testing for high-dimensional dependent observations. Journal of Multivariate Analysis, 153:136–155, 2017. ISSN 0047-259X. doi: 10.1016/j.jmva.2016.09.012. URL http://www.sciencedirect.com/science/article/pii/S0047259X16300999.
[7] Z. Bai and H. Saranadasa. Effect of High Dimension: By an Example of a Two Sample Problem. Statistica Sinica, 6:311–329, 1996. ISSN 10170405.
[8] Z. Bai, D. Jiang, J. F. Yao, and S. Zheng. Corrections to LRT on large-dimensional covariance matrix by RMT. Annals of Statistics, 37(6B):3822–3840, 2009. ISSN 00905364. doi: 10.1214/09-AOS694.
