A Bayesian Semiparametric Gaussian Copula Approach to a Multivariate Normality Test
Luai Al-Labadi, Forough Fazeli Asl, Zahra Saberi

TL;DR
This paper introduces a Bayesian semiparametric method combining Dirichlet processes and Gaussian copulas to test multivariate normality, demonstrating strong performance on simulated and real data.
Contribution
It develops a novel Bayesian multivariate normality test using a copula approach with theoretical insights and practical validation.
Findings
Excellent performance on simulated data
Effective detection of non-normality in real data
Theoretical properties established for the method
Abstract
In this paper, a Bayesian semiparametric copula approach is used to model the underlying multivariate distribution. First, a Dirichlet process prior is placed on the unknown marginal distributions. Then a Gaussian copula model is used to capture the dependence structure. As a result, a Bayesian multivariate normality test is developed by combining the relative belief ratio and the Energy distance. Several theoretical results of the approach are derived. Finally, the proposed approach shows excellent performance on several simulated examples and a real data set.
| True distribution | $a$ | Mean Energy distance | True distribution | $a$ | Mean Energy distance |
|---|---|---|---|---|---|
| | 1 | 0.00524 | | 1 | 0.00528 |
| | 5 | 0.00559 | | 5 | 0.00551 |
| | 10 | 0.00581 | | 10 | 0.00573 |
| | 1 | 0.00517 | | 1 | 0.00533 |
| | 5 | 0.00538 | | 5 | 0.00575 |
| | 10 | 0.00563 | | 10 | 0.00581 |
| | 1 | 0.00564 | | 1 | 0.00520 |
| | 5 | 0.00568 | | 5 | 0.00557 |
| | 10 | 0.00597 | | 10 | 0.00563 |

| True distribution | Gaussian rank | Kendall’s $\tau$ | Spearman’s $\rho$ |
|---|---|---|---|
| | 0.00557 | 0.00528 | 0.00567 |
| | 0.00542 | 0.00524 | 0.00556 |

| Alternative distribution | $a$ | RB | Strength | E-test’s p-value | Alternative distribution | $a$ | RB | Strength | E-test’s p-value |
|---|---|---|---|---|---|---|---|---|---|
| | 1 | 3.54 | 0.823 | | | 1 | 0.04 | 0.005 | |
| | 5 | 3.26 | 0.832 | 0.8794 | | 5 | 0.00 | 0.000 | |
| | 10 | 2.48 | 0.997 | | | 10 | 0.00 | 0.000 | |
| | 1 | 3.76 | 0.999 | | | 1 | 0.80 | 0.110 | |
| | 5 | 2.82 | 0.859 | 0.8442 | | 5 | 0.62 | 0.151 | |
| | 10 | 2.22 | 0.884 | | | 10 | 0.60 | 0.240 | |
| | 1 | 0.18 | 0.017 | | | 1 | 2.92 | 0.854 | |
| | 5 | 0.18 | 0.033 | | | 5 | 2.74 | 0.999 | 0.7035 |
| | 10 | 0.06 | 0.007 | | | 10 | 2.26 | 0.878 | |
| | 1 | 0.10 | 0.000 | | | 1 | 0.34 | 0.017 | |
| | 5 | 0.02 | 0.002 | 0.0050 | | 5 | 0.22 | 0.026 | 0.03518 |
| | 10 | 0.00 | 0.000 | | | 10 | 0.04 | 0.002 | |
| | 1 | 3.64 | 1.000 | | | 1 | 0.58 | 0.099 | |
| | 5 | 3.32 | 0.834 | 0.8744 | | 5 | 0.40 | 0.070 | 0.4020 |
| | 10 | 2.34 | 0.874 | | | 10 | 0.30 | 0.095 | |
| | 1 | 0.80 | 0.210 | | | 1 | 0.63 | 0.160 | |
| | 5 | 0.60 | 0.181 | 0.0452 | | 5 | 0.44 | 0.092 | 0.0502 |
| | 10 | 0.20 | 0.010 | | | 10 | 0.40 | 0.020 | |

| Distribution | POR | Distribution | POR |
|---|---|---|---|
| | 0.057 | | 0.794 |
| | 0.070† | | 0.077‡ |
| | 0.801‡ | | 0.106‡ |
| | 0.798‡ | | 0.782‡ |
| | 0.499‡ | | 0.657‡ |
| | 0.999‡ | | 0.551‡ |

| | | | |
|---|---|---|---|
| | 0.080(0.004) | 0.040(0.006) | 0.00(0.000) |
| | 18.46(1.000) | 9.180(0.989) | 9.08(0.969) |
| | 19.00(1.000) | 10.71(1.000) | 8.32(0.582) |

| | | | |
|---|---|---|---|
| | 0.00672 | | 0.00633 |
| | 0.00661 | | 0.00658 |

| Notation: Description |
|---|
| 1. , , and . |
| 2. : An exponential distribution with rate . |
| 3. : A $t$-Student distribution with degrees of freedom. |
| 4. : A Beta distribution with shape 1 parameter and shape 2 parameter . |
| 5. : A chi-square distribution with degrees of freedom. |
| 6. : A Pearson type (aka $t$-Student) distribution with location parameter 1, scale parameter 1 and degrees of freedom. |
| 7. : A bivariate distribution with two independent marginal distributions and . |
| 8. : A bivariate $t$-Student distribution with location parameter , scale parameter and degrees of freedom. |
| 9. : A bivariate lognormal distribution with mean vector and covariance matrix . |
| 10. : A bivariate spherical distribution with lognormal distribution for radii. |
| 11. : |
| 12. : . |
Luai Al-Labadi (corresponding author), Department of Mathematical and Computational Sciences, University of Toronto Mississauga, Mississauga, Ontario L5L 1C6, Canada.
Forough Fazeli Asl, Department of Mathematical Sciences, Isfahan University of Technology, Isfahan 84156-83111, Iran.
Zahra Saberi, Department of Mathematical Sciences, Isfahan University of Technology, Isfahan 84156-83111, Iran.
Keywords: Dirichlet process, Energy distance, Multivariate normality test, Relative belief inferences, Semiparametric Gaussian copula model.
MSC 2010: 62F15, 62G10, 62H15
1 Introduction
Semiparametric copulas are useful tools in multivariate data analysis. They are used for modelling a multivariate distribution whose dependence structure is induced by a known copula and whose marginal distributions are estimated; see, for example, Sancetta and Satchell (2004), Segers et al. (2014) and the references therein. We point to the work of Rosen and Thompson (2015), who proposed a semiparametric methodology for modeling a multivariate distribution whose dependence structure is induced by a Gaussian copula and whose marginal distributions are estimated nonparametrically via mixtures of B-spline densities. The authors take a Bayesian approach, using Markov chain Monte Carlo methods for inference.
In the present paper, a Bayesian semiparametric copula approach based on the Dirichlet process and the Gaussian copula is proposed to model the underlying multivariate distribution. In addition, recognizing that many recent applications are developed under the assumption of multivariate normality (Fernandez, 2010; Zhu et al., 2014), a test to assess this assumption is developed. Recent procedures tackling this problem can be found in Kim and Park (2018), Madukaife and Okafor (2018), Henze and Visagie (2019) and Al-Labadi et al. (2019a). We highlight that, while most available work on hypothesis testing with copula approaches concerns assessing independence (Genest and Rémillard, 2004; Kojadinovic and Holmes, 2009; Medovikov, 2016; Belalia et al., 2017), the proposed test is Bayesian and models the dependence structure and the marginal behavior of the data separately to assess the multivariate normality assumption. Briefly, each univariate marginal distribution of the underlying distribution is assigned a Dirichlet process prior, which defines posterior-based and prior-based models of that distribution. A Gaussian copula model is then used to induce the dependence structure. The test compares the concentration of the posterior-based model with that of the prior-based model about the family of multivariate normal distributions (the hypothesized model) via the so-called relative belief ratio. For this comparison, a Bayesian counterpart of the Energy distance is developed. The proposed test is easy to implement, performs well, and allows one to state evidence either for or against the null hypothesis. Also, unlike the test presented in Al-Labadi et al. (2019a), which is restricted to the family of multivariate normal distributions as the hypothesized model, the proposed test can be extended to assess any family of multivariate distributions (model checking problems).
The rest of the paper is structured as follows. Relevant background, including some definitions and generic properties, is reviewed in Section 2. In Section 3, a Bayesian semiparametric Gaussian copula approach based on the Dirichlet process is proposed for modeling multivariate distributions. The choice of the hyperparameters of the Dirichlet process and the estimation method for the parameter of the Gaussian copula are discussed in Section 4. In Section 5, a Bayesian multivariate normality (MVN) test based on the proposed approach and the Energy distance is developed. The main steps of a computational algorithm to implement the MVN test are outlined in Section 6. The performance of the approach and its application to the MVN test are illustrated through simulation studies and a real data example in Section 7. The results show that the proposed test works well in all covered cases and is very powerful. Finally, Section 8 concludes the paper with a summary of the results. Some notation related to Section 7 is given in the Appendix.
2 Relevant background
2.1 Copula-based Model
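In multivariate analysis, copula models were introduced by Sklar (1959) as a common tool to model multivariate distributions. Following Nelsen (2006), an $m$-dimensional copula ($m$-copula) is a nondecreasing and right-continuous function $C$ from $[0,1]^{m}$ into $[0,1]$ such that, for every $\mathbf{u}=(u_{1},\ldots,u_{m})\in[0,1]^{m}$:
- (i) $C(\mathbf{u})=0$ if $u_{i}=0$ for some $i\in\{1,\ldots,m\}$ ($C$ is grounded).
- (ii) $C(1,\ldots,1,u_{i},1,\ldots,1)=u_{i}$, for $i=1,\ldots,m$ ($C$ has margins).
- (iii) For every $\mathbf{u},\mathbf{v}\in[0,1]^{m}$ such that $u_{i}\leq v_{i}$ for all $i$, the $C$-volume $V_{C}([\mathbf{u},\mathbf{v}])\geq 0$ ($C$ is $m$-increasing), where the $m$-cube $[0,1]^{m}$ is the product of $m$ copies of $[0,1]$ and $[\mathbf{u},\mathbf{v}]$ is the $m$-box $[u_{1},v_{1}]\times\cdots\times[u_{m},v_{m}]$.
From (i)-(iii), every $m$-copula $C$ is nondecreasing in each variable and satisfies the Lipschitz condition. That is, for all points $\mathbf{u},\mathbf{v}\in[0,1]^{m}$,
$$\big|C(\mathbf{u})-C(\mathbf{v})\big|\leq\sum_{i=1}^{m}|u_{i}-v_{i}|. \tag{1}$$
Hence, any $m$-copula is uniformly continuous on $[0,1]^{m}$.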
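The following key theorem of Sklar (1959) illustrates the role of $m$-copulas in modeling multivariate distribution functions through their univariate margins.
Theorem 1
(Sklar’s theorem) Let $F$ be an $m$-variate distribution function with marginal distribution functions $F_{1},\ldots,F_{m}$. Then there exists an $m$-copula $C$ such that for all $(t_{1},\ldots,t_{m})\in\bar{\mathbb{R}}^{m}$,
$$F(t_{1},\ldots,t_{m})=C(F_{1}(t_{1}),\ldots,F_{m}(t_{m})), \tag{2}$$
where $\bar{\mathbb{R}}^{m}$ is the product of $m$ copies of the extended real line $\bar{\mathbb{R}}=[-\infty,\infty]$. If $F_{1},\ldots,F_{m}$ are all continuous, then $C$ is unique and can be written as
$$C(u_{1},\ldots,u_{m})=F(F_{1}^{-1}(u_{1}),\ldots,F_{m}^{-1}(u_{m})),$$
for any $(u_{1},\ldots,u_{m})\in[0,1]^{m}$, where $F_{i}^{-1}(u)=\inf\{t:F_{i}(t)\geq u\}$; otherwise, $C$ is uniquely determined on $\mathrm{Ran}(F_{1})\times\cdots\times\mathrm{Ran}(F_{m})$, where $\mathrm{Ran}(F_{i})$ denotes the range of the distribution function $F_{i}$ for $i=1,\ldots,m$. Conversely, if $C$ is an $m$-copula and $F_{1},\ldots,F_{m}$ are distribution functions, then $F$, defined by (2), is an $m$-variate distribution function with marginal distribution functions $F_{1},\ldots,F_{m}$.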
In general, the dependence structure of multivariate distributions is modeled by the copula. For this purpose, several families of copulas have been developed. See, for example, the work of Joe (1997), Trivedi and Zimmer (2005), and Nelsen (2006). One instance of special interest is the family of Gaussian copulas. Besides satisfying both the Fréchet-Hoeffding lower and upper bounds (Nelsen 2006, Theorem 2.10.2), it has only one dependence parameter, restricted to a symmetric interval, which makes it simple to apply. Formally, let $\Phi^{-1}$ be the inverse of the cumulative distribution function (cdf) of the univariate standard normal distribution and $\Phi_{R}$ be the cdf of the multivariate normal distribution with zero mean vector and correlation matrix $R$; then the family of Gaussian copulas is defined by
$$C_{R}(u_{1},\ldots,u_{m})=\Phi_{R}\big(\Phi^{-1}(u_{1}),\ldots,\Phi^{-1}(u_{m})\big),$$
where $(u_{1},\ldots,u_{m})\in[0,1]^{m}$. Following Chen et al. (2006), assume that, for any $(t_{1},\ldots,t_{m})$, $F(t_{1},\ldots,t_{m})=C_{R}(F_{1}(t_{1}),\ldots,F_{m}(t_{m}))$; then the multivariate distribution $F$ follows a semiparametric Gaussian copula model, with unknown parameter $R$ and unknown marginal cdf $F_{i}$, for $i=1,\ldots,m$. A detailed discussion of the semiparametric Gaussian copula model based on the Dirichlet process is presented in Section 3.
The following algorithm shows the steps of generating a sample of random vectors from an $m$-variate distribution with marginal cdf’s $F_{1},\ldots,F_{m}$ using a Gaussian copula model with correlation matrix $R$.
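As an illustration of Gaussian copula sampling, the following Python sketch draws correlated standard normal vectors, maps them to uniforms with the standard normal cdf, and then applies the inverse marginal cdf’s. It is a minimal sketch and not the paper’s own algorithm listing: the function name `sample_gaussian_copula`, its arguments, and the choice of exponential and Student $t$ marginals in the usage lines are assumptions made only for this example.

```python
import numpy as np
from scipy import stats

def sample_gaussian_copula(n, corr, marginal_ppfs, seed=None):
    """Draw n vectors whose dependence is a Gaussian copula with matrix corr.

    marginal_ppfs is a list of inverse cdf's (quantile functions), one per
    coordinate. All names here are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    corr = np.asarray(corr, dtype=float)
    m = corr.shape[0]
    # Step 1: correlated standard normal vectors Z ~ N_m(0, corr).
    z = rng.multivariate_normal(np.zeros(m), corr, size=n)
    # Step 2: map each coordinate to (0, 1) with the standard normal cdf.
    u = stats.norm.cdf(z)
    # Step 3: apply the inverse marginal cdf to each column.
    return np.column_stack([ppf(u[:, j]) for j, ppf in enumerate(marginal_ppfs)])

# Usage: exponential and Student t (3 df) marginals, copula correlation 0.7.
corr = [[1.0, 0.7], [0.7, 1.0]]
x = sample_gaussian_copula(500, corr, [stats.expon.ppf, stats.t(df=3).ppf], seed=0)
```

The marginals are passed as quantile functions so that empirical or simulated inverse cdf’s can be substituted for the parametric ones used here.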
2.2 Dirichlet Process
The Dirichlet process prior, introduced by Ferguson (1973), is the most commonly used prior in Bayesian nonparametric inference. A large body of nonparametric work has been devoted to this prior. Here we only present the most relevant definitions and properties. Consider a space $\mathfrak{X}$ with a $\sigma$-algebra $\mathcal{A}$ of subsets of $\mathfrak{X}$, let $H$ be a fixed probability measure on $(\mathfrak{X},\mathcal{A})$, called the base measure, and let $a$ be a positive number, called the concentration parameter. A random probability measure $P=\{P(A):A\in\mathcal{A}\}$ is called a Dirichlet process on $(\mathfrak{X},\mathcal{A})$ with parameters $a$ and $H$, denoted by $P\sim DP(a,H)$, if for every measurable partition $A_{1},\ldots,A_{k}$ of $\mathfrak{X}$ with $k\geq 2$ the joint distribution of the vector $(P(A_{1}),\ldots,P(A_{k}))$ has the Dirichlet distribution with parameter $(aH(A_{1}),\ldots,aH(A_{k}))$. Also, it is assumed that $H(A_{j})=0$ implies $P(A_{j})=0$ with probability one. Consequently, for any $A\in\mathcal{A}$, $P(A)\sim\text{beta}(aH(A),a(1-H(A)))$, $E(P(A))=H(A)$ and $Var(P(A))=H(A)(1-H(A))/(1+a)$. Accordingly, the base measure $H$ plays the role of the center of $P$ while the concentration parameter $a$ controls the variation of $P$ around $H$. One of the most well-known properties of the Dirichlet process is conjugacy. That is, when the sample $x=(x_{1},\ldots,x_{n})$ is drawn from $P\sim DP(a,H)$, the posterior distribution of $P$ given $x$, denoted by $P^{\ast}$, is also a Dirichlet process with concentration parameter $a+n$ and base measure
$$H^{\ast}=\frac{a}{a+n}H+\frac{n}{a+n}F_{n}, \tag{3}$$
where $F_{n}$ denotes the empirical cumulative distribution function (cdf) of the sample $x$. Note that $H^{\ast}$ is a convex combination of the base measure $H$ and the empirical cdf $F_{n}$. Therefore, $H^{\ast}\to H$ as $a\to\infty$, while $H^{\ast}\to F_{n}$ as $a\to 0$. A guideline for choosing the hyperparameters $a$ and $H$ is given in Section 4. Following Ferguson (1973), $P\sim DP(a,H)$ can be represented as
$$P=\sum_{i=1}^{\infty}L^{-1}(\Gamma_{i})\,\delta_{Y_{i}}\Big/\sum_{i=1}^{\infty}L^{-1}(\Gamma_{i}), \tag{4}$$
where $\Gamma_{i}=E_{1}+\cdots+E_{i}$ with $E_{i}\overset{i.i.d.}{\sim}\text{exponential}(1)$ independent of the $Y_{i}\overset{i.i.d.}{\sim}H$, $L(x)=a\int_{x}^{\infty}t^{-1}e^{-t}\,dt$ for $x>0$, and $\delta_{Y_{i}}$ the Dirac delta measure at $Y_{i}$. The series representation (4) implies that the Dirichlet process is a discrete probability measure even when the base measure $H$ is absolutely continuous. Note that, under the weak topology, the support of the Dirichlet process can be quite large. Recognizing the complexity of working with (4), Zarepour and Al-Labadi (2012) proposed the following finite representation as an efficient method to simulate the Dirichlet process. They showed that the Dirichlet process can be approximated by
$$P_{N}=\sum_{i=1}^{N}J_{i,N}\,\delta_{Y_{i}}, \tag{5}$$
with the monotonically decreasing weights $J_{i,N}=G_{a/N}^{-1}\big(\Gamma_{i}/\Gamma_{N+1}\big)\big/\sum_{j=1}^{N}G_{a/N}^{-1}\big(\Gamma_{j}/\Gamma_{N+1}\big)$, where $\Gamma_{i}$ and $Y_{i}$ are defined as before, $N$ is a large positive integer and $G_{a/N}$ denotes the complement-cdf of the $\text{gamma}(a/N,1)$ distribution. Note that $G_{a/N}^{-1}(p)$ is the $(1-p)$-th quantile of the $\text{gamma}(a/N,1)$ distribution. The following algorithm describes how the approximation (5) can be used to generate a sample from $DP(a,H)$.
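A minimal Python sketch of sampling from the finite approximation (5) is given below. It assumes the weights are obtained by evaluating the inverse complement-cdf of a gamma distribution with shape $a/N$ at the ratios $\Gamma_{i}/\Gamma_{N+1}$ and then normalizing. The function name `sample_dp_approx`, the `base_sampler` callback and the standard normal base measure in the usage line are hypothetical choices for illustration, not part of the paper.

```python
import numpy as np
from scipy import stats

def sample_dp_approx(a, base_sampler, N=1000, seed=None):
    """Return atoms Y_1..Y_N and weights J_1..J_N of an approximate DP(a, H) draw.

    base_sampler(size, rng) must return `size` i.i.d. draws from H
    (a hypothetical callback used only in this sketch).
    """
    rng = np.random.default_rng(seed)
    # Gamma_i = E_1 + ... + E_i with E_i i.i.d. exponential(1), i = 1..N+1.
    gam = np.cumsum(rng.exponential(size=N + 1))
    p = gam[:N] / gam[N]
    # Inverse complement-cdf (inverse survival function) of gamma(a/N, 1).
    w = stats.gamma.isf(p, a / N)
    w = w / w.sum()                      # normalize so the weights sum to one
    atoms = base_sampler(N, rng)         # Y_i i.i.d. from the base measure H
    return atoms, w

# Usage: one approximate draw from DP(5, N(0, 1)).
atoms, weights = sample_dp_approx(5.0, lambda size, rng: rng.normal(size=size),
                                  N=200, seed=1)
```

The random cdf at a point $t$ is then the sum of the weights whose atoms are at most $t$, for example `weights[atoms <= t].sum()`.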
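The Dirichlet process can also be obtained from the following finite mixture model developed by Ishwaran and Zarepour (2002). Let $P_{N}$ have the form given in (5) with $(J_{1,N},\ldots,J_{N,N})\sim\text{Dirichlet}(a/N,\ldots,a/N)$. Then $E_{P_{N}}(g)\to E_{P}(g)$ in distribution as $N\to\infty$, for any measurable function $g$ that is integrable with respect to $H$ and $P\sim DP(a,H)$. In particular, $P_{N}$ converges in distribution to $P$, where $P_{N}$ and $P$ are random values in the space of probability measures endowed with the topology of weak convergence. To generate $(J_{i,N})_{1\leq i\leq N}$, put $J_{i,N}=G_{i,N}/\sum_{j=1}^{N}G_{j,N}$, where $(G_{i,N})_{1\leq i\leq N}$ is a sequence of i.i.d. $\text{gamma}(a/N,1)$ random variables independent of $(Y_{i})_{1\leq i\leq N}$. This form of approximation leads to some results in Section 5.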
2.3 Relative Belief Inferences
The relative belief ratio, developed by Evans (2015), has become a widely used measure of statistical evidence. See, for example, the work of Al-Labadi and Evans (2018), Al-Labadi et al. (2017, 2018), and Al-Labadi et al. (2019a, 2019b) for applications of the relative belief ratio to different univariate hypothesis testing problems. In detail, let $\{f_{\theta}:\theta\in\Theta\}$ be a collection of densities on a sample space $\mathfrak{X}$ and let $\pi$ be a prior on the parameter space $\Theta$. Note that the densities may represent discrete or continuous probability measures but they are all with respect to the same support measure $d\theta$. After observing the data $x$, the posterior distribution of $\theta$, denoted by $\pi(\theta\,|\,x)$, is a revised prior and is given by the density $\pi(\theta\,|\,x)=\pi(\theta)f_{\theta}(x)/m(x)$, where $m(x)=\int_{\Theta}\pi(\theta)f_{\theta}(x)\,d\theta$ is the prior predictive density of $x$. For a parameter of interest $\psi=\Psi(\theta)$, let $\Pi_{\Psi}$ be the marginal prior probability measure and $\Pi_{\Psi}(\cdot\,|\,x)$ be the marginal posterior probability measure. It is assumed that $\Psi$ satisfies regularity conditions so that the prior density $\pi_{\Psi}$ and the posterior density $\pi_{\Psi}(\cdot\,|\,x)$ of $\psi$ exist with respect to some support measure on the range space for $\Psi$. The relative belief ratio for a value $\psi$ is then defined by $RB_{\Psi}(\psi\,|\,x)=\lim_{\delta\to 0}\Pi_{\Psi}(N_{\delta}(\psi)\,|\,x)/\Pi_{\Psi}(N_{\delta}(\psi))$, where $N_{\delta}(\psi)$ is a sequence of neighborhoods of $\psi$ converging nicely to $\psi$ as $\delta\to 0$ (Evans, 2015). When $\pi_{\Psi}$ and $\pi_{\Psi}(\cdot\,|\,x)$ are continuous at $\psi$, the relative belief ratio is defined by
$$RB_{\Psi}(\psi\,|\,x)=\pi_{\Psi}(\psi\,|\,x)/\pi_{\Psi}(\psi),$$
the ratio of the posterior density to the prior density at $\psi$. Therefore, $RB_{\Psi}(\psi\,|\,x)$ measures the change in the belief of $\psi$ being the true value from a priori to a posteriori.
Since $RB_{\Psi}(\psi_{0}\,|\,x)$ is a measure of the evidence that $\psi_{0}$ is the true value, if $RB_{\Psi}(\psi_{0}\,|\,x)>1$, then the probability of $\psi_{0}$ being the true value has increased from a priori to a posteriori, and consequently there is evidence based on the data that $\psi_{0}$ is the true value. If $RB_{\Psi}(\psi_{0}\,|\,x)<1$, then this probability has decreased, and accordingly there is evidence based on the data against $\psi_{0}$ being the true value. For the case $RB_{\Psi}(\psi_{0}\,|\,x)=1$ there is no evidence either way.
Thus $RB_{\Psi}(\psi_{0}\,|\,x)$ measures the evidence for the hypothesis $\mathcal{H}_{0}:\Psi(\theta)=\psi_{0}$. A value of $RB_{\Psi}(\psi_{0}\,|\,x)$ greater than 1 provides evidence in favor of $\psi_{0}$. However, there may also exist other values of $\psi$ that had even larger increases. Thus, it is also necessary to calibrate whether this is strong or weak evidence for or against $\mathcal{H}_{0}$. A typical calibration of $RB_{\Psi}(\psi_{0}\,|\,x)$ is given by the strength
$$\Pi_{\Psi}\big(RB_{\Psi}(\psi\,|\,x)\leq RB_{\Psi}(\psi_{0}\,|\,x)\,\big|\,x\big). \tag{6}$$
The value in (6) is the posterior probability that the true value of $\psi$ has a relative belief ratio no greater than that of the hypothesized value $\psi_{0}$. Notably, (6) is not a p-value, as it has a very different interpretation. When $RB_{\Psi}(\psi_{0}\,|\,x)<1$, there is evidence against $\psi_{0}$; then a small value of (6) indicates strong evidence against $\psi_{0}$, while a large value of (6) indicates weak evidence against $\psi_{0}$. Similarly, when $RB_{\Psi}(\psi_{0}\,|\,x)>1$, there is evidence in favor of $\psi_{0}$; then a small value of (6) indicates weak evidence in favor of $\psi_{0}$, while a large value of (6) indicates strong evidence in favor of $\psi_{0}$.
2.4 Energy Distance
The Energy distance, presented by Székely (2003), is an appropriate tool for assessing the equality of distributions. In general, the Energy distance between two $m$-variate distribution functions $F$ and $G$ is defined by
$$D_{\mathcal{E}}(F,G)=2E\|\mathbf{X}-\mathbf{Y}\|-E\|\mathbf{X}-\mathbf{X}^{\prime}\|-E\|\mathbf{Y}-\mathbf{Y}^{\prime}\|, \tag{7}$$
where $\mathbf{X},\mathbf{X}^{\prime}\overset{i.i.d.}{\sim}F$, $\mathbf{Y},\mathbf{Y}^{\prime}\overset{i.i.d.}{\sim}G$, and $\|\cdot\|$ denotes the Euclidean norm of a vector. Székely and Rizzo (2013) showed that $D_{\mathcal{E}}(F,G)\geq 0$, with equality if and only if $F=G$. Note that, from Székely (2003), the Energy distance (7) is rotation invariant. This property makes it appropriate for goodness-of-fit testing in higher dimensions. Specifically, let $G$ be the hypothesized distribution and $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ be the observed sample from $F$. Then, the one-sample Energy distance corresponding to (7) is defined by
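$$D_{\mathcal{E}}(F_{n},G)=\frac{2}{n}\sum_{i=1}^{n}E\|\mathbf{x}_{i}-\mathbf{Y}\|-E\|\mathbf{Y}-\mathbf{Y}^{\prime}\|-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\|\mathbf{x}_{i}-\mathbf{x}_{j}\|, \tag{8}$$
where $F_{n}$ is the empirical cdf of $\mathbf{x}_{i}$, for $i=1,\ldots,n$, and the expectations are taken with respect to the distribution $G$ with $\mathbf{Y},\mathbf{Y}^{\prime}\overset{i.i.d.}{\sim}G$. An important special case occurs when $G$ is a multivariate normal distribution, where the R package energy is usually used for implementing (8). For further discussion of the Energy distance, consult Székely (2003) and Székely and Rizzo (2013).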
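To make the distance concrete, the following Python sketch computes the plug-in Energy distance between two samples, that is, (7) with both distributions replaced by empirical distributions. When the second sample is a large set of draws from the hypothesized distribution, this acts as a Monte Carlo stand-in for the expectations in (8). The function name `energy_distance` and the simulated data are assumptions for illustration; the paper itself refers to the R package energy.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(x, y):
    """Plug-in Energy distance between samples x (n x m) and y (k x m)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Mean pairwise Euclidean distances between and within the two samples.
    between = cdist(x, y).mean()
    within_x = cdist(x, x).mean()
    within_y = cdist(y, y).mean()
    return 2.0 * between - within_x - within_y

# Usage: a bivariate normal sample against draws from the same and a shifted law.
rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 2))
same_law = rng.normal(size=(500, 2))
shifted = rng.normal(loc=3.0, size=(500, 2))
d_same = energy_distance(sample, same_law)
d_shifted = energy_distance(sample, shifted)
```

Because the within-sample means include the zero diagonal, this is the V-statistic form, which is nonnegative and equals zero when the two samples coincide.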
3 A Bayesian Semiparametric Gaussian Copula Approach for Modeling Multivariate Distributions
In this section, we propose a Bayesian semiparametric copula approach based on the Gaussian copula as a flexible model for multivariate distributions. For brevity, we refer to this procedure as the BSPGC (Bayesian semiparametric Gaussian copula) approach. Specifically, let $\mathbf{x}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{n})$ be a sample of size $n$ from an unknown $m$-variate distribution $F_{true}$ with marginal cdf’s $F_{1},\ldots,F_{m}$. Note that the subscript may be omitted whenever it is clear from the context. To model $F_{true}$ based on the BSPGC approach we use the prior $F_{i}\sim DP(a,H_{i})$, where $H_{i}$ is the $i$-th marginal cdf of a given $m$-variate distribution $H$. So, by (3), for a given choice of $a$, $F^{\ast}_{i}=F_{i}\,|\,\mathbf{x}\sim DP(a+n,H^{\ast}_{i})$, for $i=1,\ldots,m$. Consider the joint cdf $H^{\ast}$ corresponding to marginal cdf’s $H^{\ast}_{1},\ldots,H^{\ast}_{m}$ with correlation matrix $R^{\ast}$. Then, the posterior-based model is defined by
$$C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m})),\qquad(t_{1},\ldots,t_{m})\in\mathbb{R}^{m}. \tag{9}$$
The next lemma shows that the posterior-based model (9) approaches the true distribution as the sample size increases.
Lemma 2
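Let $\mathbf{x}$ be a sample from an $m$-variate distribution function $F_{true}$ with unknown marginal cdf’s $F_{1},\ldots,F_{m}$. Assume that $F_{i}\sim DP(a,H_{i})$, for $i=1,\ldots,m$. For any $\mathbf{t}=(t_{1},\ldots,t_{m})\in\mathbb{R}^{m}$, $\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-F_{true}(\mathbf{t})\big|\xrightarrow{a.s.}0$ as $n\to\infty$.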
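Proof. For any $\mathbf{t}\in\mathbb{R}^{m}$, the triangle inequality implies
$$\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-F_{true}(\mathbf{t})\big|\leq\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-H^{\ast}(\mathbf{t})\big|+\big|H^{\ast}(\mathbf{t})-F_{true}(\mathbf{t})\big|.$$
From (3), for any $\mathbf{t}$, $H^{\ast}(\mathbf{t})\xrightarrow{a.s.}F_{true}(\mathbf{t})$ as $n\to\infty$, since the weight on $H$ vanishes and the empirical cdf converges almost surely to $F_{true}$. Hence, the continuous mapping theorem implies $\big|H^{\ast}(\mathbf{t})-F_{true}(\mathbf{t})\big|\xrightarrow{a.s.}0$ as $n\to\infty$. On the other hand, from Sklar’s theorem (Theorem 1) and the Lipschitz condition (1), we have
$$\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-H^{\ast}(\mathbf{t})\big|\leq\sum_{i=1}^{m}\big|F^{\ast}_{i}(t_{i})-H^{\ast}_{i}(t_{i})\big|.$$
Note that, from the properties of the Dirichlet process, for any $\epsilon>0$ and $i\in\{1,\ldots,m\}$, Chebyshev’s inequality implies
$$Pr\left\{\big|F^{\ast}_{i}(t_{i})-H^{\ast}_{i}(t_{i})\big|\geq\epsilon\right\}\leq\frac{H^{\ast}_{i}(t_{i})\big(1-H^{\ast}_{i}(t_{i})\big)}{\epsilon^{2}(a+n+1)}.$$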
Substituting , for and , gives
[TABLE]
Since the series of these bounds converges, $\sum_{k=1}^{\infty}Pr\left\{\big|F^{\ast}_{i}(t_{i})-H^{\ast}_{i}(t_{i})\big|\geq\epsilon\right\}<\infty$. Hence, by the first Borel-Cantelli lemma, $\big|F^{\ast}_{i}(t_{i})-H^{\ast}_{i}(t_{i})\big|\xrightarrow{a.s.}0$ as $n\to\infty$. This completes the proof.
4 Selecting $a$, $H$ and the Method of Estimation of $R^{\ast}$ in the BSPGC Approach
The proposed method for modeling multivariate distributions depends on $a$, $H$ and $R^{\ast}$. Hence, it is necessary to look carefully at the impact of these parameters on the approach. For instance, from (3), a large value of $a$ increases the influence of the $m$-variate distribution $H$, at the expense of the empirical distribution $F_{n}$, in the posterior-based model (9). The following lemma shows the effect of the value of $a$ on the model (9).
Lemma 3
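Let $\mathbf{x}$ be a sample from an $m$-variate distribution function $F_{true}$ with unknown marginal cdf’s $F_{1},\ldots,F_{m}$. Also, let $H$ be a known $m$-variate cdf with marginal cdf’s $H_{1},\ldots,H_{m}$ and $H^{\ast}$ be the $m$-variate cdf, defined in (3), with marginal cdf’s $H^{\ast}_{1},\ldots,H^{\ast}_{m}$. Assume that $F_{i}\sim DP(a,H_{i})$, for $i=1,\ldots,m$. Then, for any $\mathbf{t}\in\mathbb{R}^{m}$, $\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-H(\mathbf{t})\big|\xrightarrow{a.s.}0$ as $a\to\infty$.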
Proof. The proof is similar to the proof of Lemma 2. For this, assume that for and a fixed positive number . For any fixed , replace and , respectively by, and in the proof of Lemma 2. Then the result follows.
It follows from Lemma 3 that increasing the value of $a$ pulls the posterior-based model toward $H$ and away from the data, which can lead to errors. To avoid this issue, we propose to choose $a$ no larger than the bound recommended in Al-Labadi and Zarepour (2017).
The choice of $H$ is also very significant and there are two main issues to consider. The first is the independence of the approach from the choice of $H$ (invariance property). As shown in Table 6, the approach is invariant to the choice of any continuous multivariate distribution. The second issue is the compatibility between $H$ and the data. A lack of compatibility is typically called prior-data conflict (Evans and Moshonov, 2006; Al-Labadi and Evans, 2017; Al-Labadi and Evans, 2018; Al-Labadi and Wang, 2019). As illustrated in Section 7.2, the existence of prior-data conflict leads to a failure of the approach and thus should be avoided. A reasonable choice of $H$ that avoids prior-data conflict is an $m$-variate normal distribution whose mean vector and covariance matrix are matched to the data.
To carry out the approach, it is essential to estimate the correlation matrix $R^{\ast}$. For this, we first generate a sample from the mixture distribution $H^{\ast}$ in (3). Then, based on the generated sample, $R^{\ast}$ is estimated by one of the following three common procedures: the Gaussian rank correlation, Kendall’s $\tau$ or Spearman’s $\rho$. In Section 7, we perform a simulation study to compare the effect of these three methods on the quality of the approach. As a result, we recommend using Kendall’s correlation coefficients in the proposed model.
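The two steps just described, sampling from the mixture in (3) and estimating the copula correlation from rank correlations, can be sketched in Python as follows. This is a sketch under stated assumptions: the mapping $R_{ij}=\sin(\pi\tau_{ij}/2)$ is the standard relation between Kendall’s $\tau$ and the correlation parameter of a Gaussian copula, and using it here is an assumption rather than something the text specifies. The function names and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def sample_posterior_base(data, a, base_sampler, size, seed=None):
    """Draw from the mixture a/(a+n) H + n/(a+n) F_n of equation (3)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # With probability n/(a+n) resample a data row (a draw from F_n) ...
    draws = data[rng.integers(0, n, size=size)]
    # ... and with probability a/(a+n) replace it by a draw from H.
    from_base = rng.random(size) < a / (a + n)
    k = int(from_base.sum())
    if k > 0:
        draws[from_base] = base_sampler(k, rng)
    return draws

def kendall_to_correlation(sample):
    """Estimate a Gaussian copula correlation matrix from pairwise Kendall's tau.

    Assumption: uses R_ij = sin(pi * tau_ij / 2); the result is not guaranteed
    to be positive semidefinite for m > 2 and may need a projection step.
    """
    m = sample.shape[1]
    R = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            tau, _ = stats.kendalltau(sample[:, i], sample[:, j])
            R[i, j] = R[j, i] = np.sin(np.pi * tau / 2.0)
    return R

# Usage with simulated bivariate normal data (correlation 0.6) and a normal H.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
data = rng.multivariate_normal([0.0, 0.0], cov, size=400)
base = lambda k, r: r.multivariate_normal([0.0, 0.0], cov, size=k)
draws = sample_posterior_base(data, a=5.0, base_sampler=base, size=2000, seed=1)
R_hat = kendall_to_correlation(draws)
```

Resampling the data introduces ties, which `scipy.stats.kendalltau` handles through its tau-b variant, so the sine mapping is only approximate in that case.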
5 A MVN Test Based on the BSPGC Approach
Let $\mathbf{x}$ be a sample of size $n$ from an unknown $m$-variate distribution $F_{true}$. The problem addressed in this section is to test the hypothesis
$$\mathcal{H}_{0}:F_{true}\in\mathcal{F}, \tag{11}$$
where $\mathcal{F}$ is the family of $m$-variate normal distributions $N_{m}(\boldsymbol{\mu},\Sigma)$. Note that, whenever $\boldsymbol{\mu}$ and $\Sigma$ are unknown, they are estimated by the sample mean vector $\bar{\mathbf{x}}$ and the sample covariance matrix $S_{\mathbf{x}}$, respectively. Thus, for $\boldsymbol{\theta}_{\mathbf{x}}=(\bar{\mathbf{x}},S_{\mathbf{x}})$, $F_{\boldsymbol{\theta}_{\mathbf{x}}}$ is the best representative of the family $\mathcal{F}$ to compare with the distribution $F_{true}$. Hence, testing (11) is equivalent to testing
$$\mathcal{H}_{0}:F_{true}=F_{\boldsymbol{\theta}_{\mathbf{x}}}. \tag{12}$$
Now, we continue as follows. Let $H=F_{\boldsymbol{\theta}_{\mathbf{x}}}$ with marginal cdf’s $H_{1},\ldots,H_{m}$. Here, for $i=1,\ldots,m$, $H_{i}$ is the cdf of the univariate normal distribution with mean $\bar{x}_{i}$ and variance $s_{ii}$, where $\bar{x}_{i}$ and $s_{ii}$ are the $i$-th element of $\bar{\mathbf{x}}$ and the $i$-th diagonal element of $S_{\mathbf{x}}$, respectively. Assume that $F_{i}\sim DP(a,H_{i})$. For any $(t_{1},\ldots,t_{m})\in\mathbb{R}^{m}$, let
$$C_{R}(F_{1}(t_{1}),\ldots,F_{m}(t_{m})) \tag{13}$$
be the prior-based model, where $R$ is the correlation matrix of the $m$-variate distribution $H$, to be estimated as discussed in Section 4. Note that, as pointed out earlier, setting $H=F_{\boldsymbol{\theta}_{\mathbf{x}}}$ ensures compatibility between the data and the prior, which avoids prior-data conflict. More details about the effect of prior-data conflict on the approach are given in Section 7, where it is shown that prior-data conflict leads to erroneous test results.
Recalling the posterior-based model defined in Section 3, the Energy distance is used to compute the distance between this model and $F_{\boldsymbol{\theta}_{\mathbf{x}}}$ (posterior distance) and the distance between the prior-based model and $F_{\boldsymbol{\theta}_{\mathbf{x}}}$ (prior distance). The next lemma proposes a Bayesian counterpart of the distance (8) as an appropriate tool to measure dissimilarities between the proposed models and the null distribution. This is a very convenient tool for assessing MVN in high-dimensional problems.
Lemma 4
Let be an -variate distribution function with unknown marginal cdf’s and be a known -variate distribution function with marginal cdf’s . Assume that , and ’s are the approximation of Dirichlet process ’s, given by Ishwaran and Zarepour (2002). Consider the Energy distance between and as
[TABLE]
where , , is an observed sample from and is the correlation matrix of . Then, as
[TABLE]
where is defined in (8) with , and
Proof. From properties of Dirichlet distribution, and . Then
[TABLE]
The proof follows immediately by letting $N\to\infty$ in (15).
The next lemma allows us to use the approximation of the Dirichlet process in the prior-based and posterior-based models to approximate the distributions of the posterior and prior distances computed by (14).
Lemma 5
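Let $F$ be an $m$-variate distribution function with unknown marginal cdf’s $F_{1},\ldots,F_{m}$ and $H$ be a known $m$-variate distribution function with marginal cdf’s $H_{1},\ldots,H_{m}$. Assume that $F_{i}\sim DP(a,H_{i})$ and the $F_{i}^{N}$’s are the approximations of the Dirichlet processes $F_{i}$, given in (5). Then, for any $(t_{1},\ldots,t_{m})\in\mathbb{R}^{m}$, $\big|C_{R}(F_{1}^{N}(t_{1}),\ldots,F_{m}^{N}(t_{m}))-C_{R}(F_{1}(t_{1}),\ldots,F_{m}(t_{m}))\big|\xrightarrow{a.s.}0$ as $N\to\infty$, where $R$ is the correlation matrix of $H$.
Proof. From the Lipschitz condition (1), we have
$$\big|C_{R}(F_{1}^{N}(t_{1}),\ldots,F_{m}^{N}(t_{m}))-C_{R}(F_{1}(t_{1}),\ldots,F_{m}(t_{m}))\big|\leq\sum_{i=1}^{m}\big|F_{i}^{N}(t_{i})-F_{i}(t_{i})\big|.$$
Since $F_{i}^{N}(t_{i})\xrightarrow{a.s.}F_{i}(t_{i})$, for $i=1,\ldots,m$ (Zarepour and Al-Labadi, 2012), the result follows.
The procedure continues by computing, via formula (14), the Energy distance between the prior-based model (13) and the null distribution (the prior distance) and, similarly, the Energy distance between the posterior-based model (9) and the null distribution (the posterior distance). Then, the relative belief ratio is used to compare the concentration of the posterior distribution of the distance about zero with that of the prior distribution. As shown in the next lemma, if $\mathcal{H}_{0}$ is true, the distribution of the posterior distance should be more concentrated about 0 than the distribution of the prior distance; otherwise, the distribution of the prior distance should be more concentrated about 0 than the distribution of the posterior distance. The comparison is made by computing the relative belief ratio, with the interpretation discussed in Section 2.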
Lemma 6
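Let $\mathbf{x}$ be a sample from an $m$-variate distribution function $F_{true}$ with unknown marginal cdf’s $F_{1},\ldots,F_{m}$. Assume that $\bar{\mathbf{x}}\xrightarrow{a.s.}\boldsymbol{\mu}_{0}$ and $S_{\mathbf{x}}\xrightarrow{a.s.}\Sigma_{0}$ as $n\to\infty$, so that $\boldsymbol{\theta}_{\mathbf{x}}\xrightarrow{a.s.}\boldsymbol{\theta}_{0}=(\boldsymbol{\mu}_{0},\Sigma_{0})$. Let $F_{i}\sim DP(a,H_{i})$, for $i=1,\ldots,m$. For any $(t_{1},\ldots,t_{m})\in\mathbb{R}^{m}$, as $n\to\infty$,
- (i) $\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-F_{\boldsymbol{\theta}_{\mathbf{x}}}(t_{1},\ldots,t_{m})\big|\xrightarrow{a.s.}0$, when $\mathcal{H}_{0}$ is true.
- (ii) $\liminf\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-F_{\boldsymbol{\theta}_{\mathbf{x}}}(t_{1},\ldots,t_{m})\big|\overset{a.s.}{>}0$, when $\mathcal{H}_{0}$ is not true,
where $R^{\ast}$ is the correlation matrix of $H^{\ast}$, defined in (3).
Proof. To prove (i), substitute $H$ in (3) by $F_{\boldsymbol{\theta}_{\mathbf{x}}}$. From (3), for any $\mathbf{t}=(t_{1},\ldots,t_{m})$, $H^{\ast}(\mathbf{t})\xrightarrow{a.s.}F_{true}(\mathbf{t})$ as $n\to\infty$. If $\mathcal{H}_{0}$ is true, then $F_{true}=F_{\boldsymbol{\theta}_{0}}$. Hence, the proof of (i) follows immediately from the proof of Lemma 2. To prove (ii), consider $H^{\ast}$ as in (3). Applying the triangle inequality gives
$$\big|H^{\ast}(\mathbf{t})-F_{\boldsymbol{\theta}_{\mathbf{x}}}(\mathbf{t})\big|\leq\big|H^{\ast}(\mathbf{t})-C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))\big|+\big|C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))-F_{\boldsymbol{\theta}_{\mathbf{x}}}(\mathbf{t})\big|.$$
Similar to the proof of Lemma 2, $\big|H^{\ast}(\mathbf{t})-C_{R^{\ast}}(F^{\ast}_{1}(t_{1}),\ldots,F^{\ast}_{m}(t_{m}))\big|\xrightarrow{a.s.}0$ and $\big|H^{\ast}(\mathbf{t})-F_{\boldsymbol{\theta}_{\mathbf{x}}}(\mathbf{t})\big|\xrightarrow{a.s.}\big|F_{true}(\mathbf{t})-F_{\boldsymbol{\theta}_{0}}(\mathbf{t})\big|$ as $n\to\infty$. Since $\mathcal{H}_{0}$ is not true, $\big|F_{true}(\mathbf{t})-F_{\boldsymbol{\theta}_{0}}(\mathbf{t})\big|\overset{a.s.}{>}0$, which completes the proof of (ii).
The effect of the value of $a$ on the posterior-based model was considered in Lemma 3. It is also of interest to consider the effect of the value of $a$ on the proposed MVN test.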
Lemma 7
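Let $\mathbf{x}$ be a sample from an $m$-variate distribution function $F_{true}$ with unknown marginal cdf’s $F_{1},\ldots,F_{m}$. Let $H$ be the cdf of $F_{\boldsymbol{\theta}_{\mathbf{x}}}$ with marginal cdf’s $H_{1},\ldots,H_{m}$. If $F_{i}\sim DP(a,H_{i})$, for $i=1,\ldots,m$, then $\big|C_{R}(F_{1}(t_{1}),\ldots,F_{m}(t_{m}))-H(t_{1},\ldots,t_{m})\big|\xrightarrow{a.s.}0$ as $a\to\infty$, where $R$ is the correlation matrix of $H$.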
Proof. The proof is similar to the proof of Lemma 3 and is omitted.
Lemmas 3 and 7 show that, for too large a value of $a$ (relative to the sample size), both the posterior-based and prior-based models approach the null model $F_{\boldsymbol{\theta}_{\mathbf{x}}}$. Hence, comparing the posterior and prior distances to detect normality can lead to errors, in which we may accept $\mathcal{H}_{0}$ when it is not true or reject $\mathcal{H}_{0}$ when it is true. As recommended in Section 7, $a$ should be kept small relative to the sample size.
At the end of this section, it is worth pointing out that the proposed test can be extended to assess any family of multivariate distributions. For this, it is enough to consider a different family of multivariate distributions in the hypothesis (11) and use its best representative distribution as $H$ in the methodology, which may be more challenging for some multivariate models.
6 Main Steps for Testing the MVN
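The following computational algorithm summarizes the main steps to test $\mathcal{H}_{0}$. This algorithm can be viewed as a generalized version of Algorithm B of Al-Labadi and Evans (2018). Observe that, since closed forms of the densities of the prior and posterior distances are typically not available, relative belief ratios need to be approximated via simulation.
Algorithm 3 Relative belief algorithm based on the BSPGC approach for testing MVN
1. Use Algorithm 2 to generate (approximately) marginal cdf’s $F_{i}$ from $DP(a,H_{i})$, for $i=1,\ldots,m$.
2. Generate a sample of values from the $m$-variate distribution $H$ and estimate the correlation matrix $R$, denoted by $\hat{R}$, as discussed in Section 4.
3. Use the generated marginal cdf’s and set $R=\hat{R}$ in Algorithm 1 to get a sample of values from the prior-based model (13).
4. Use (14) for the sample generated in step 3 to compute the prior distance $D_{\mathcal{E}}$.
5. Repeat steps (1)-(4) to obtain a sample of values from the prior of $D_{\mathcal{E}}$.
6. Repeat steps (1)-(5), replacing each prior quantity by its posterior counterpart (for example, $DP(a,H_{i})$ by $DP(a+n,H^{\ast}_{i})$, $H$ by $H^{\ast}$, $R$ by $R^{\ast}$, the prior-based model (13) by the posterior-based model (9), and prior by posterior). This yields a sample of values from the posterior of $D_{\mathcal{E}}$.
7. Let $M$ be a positive number. Let $\hat{F}_{D_{\mathcal{E}}}$ denote the empirical cdf of $D_{\mathcal{E}}$ based on the prior sample in step (5) and, for $i=0,\ldots,M$, let $\hat{d}_{i/M}$ be the estimate of the $(i/M)$-th prior quantile of $D_{\mathcal{E}}$. Here $\hat{d}_{0}=0$, and $\hat{d}_{1}$ is the largest value of $D_{\mathcal{E}}$. Let $\hat{F}_{D_{\mathcal{E}}}(\cdot\,|\,\mathbf{x})$ denote the empirical cdf of $D_{\mathcal{E}}$ based on the posterior sample in step (6). For $d\in[\hat{d}_{i/M},\hat{d}_{(i+1)/M})$, estimate $RB_{D_{\mathcal{E}}}(d\,|\,\mathbf{x})$ by
$$\widehat{RB}_{D_{\mathcal{E}}}(d\,|\,\mathbf{x})=M\big\{\hat{F}_{D_{\mathcal{E}}}(\hat{d}_{(i+1)/M}\,|\,\mathbf{x})-\hat{F}_{D_{\mathcal{E}}}(\hat{d}_{i/M}\,|\,\mathbf{x})\big\}, \tag{16}$$
the ratio of the estimates of the posterior and prior contents of $[\hat{d}_{i/M},\hat{d}_{(i+1)/M})$. Thus, we estimate $RB_{D_{\mathcal{E}}}(0\,|\,\mathbf{x})$ by $\widehat{RB}_{D_{\mathcal{E}}}(0\,|\,\mathbf{x})=M\hat{F}_{D_{\mathcal{E}}}(\hat{d}_{p_{0}}\,|\,\mathbf{x})/i_{0}$, where $p_{0}=i_{0}/M$ and $i_{0}$ is chosen so that $i_{0}/M$ is not too small (typically $i_{0}/M\approx 0.05$).
8. Estimate the strength $DP_{D_{\mathcal{E}}}\big(RB_{D_{\mathcal{E}}}(d\,|\,\mathbf{x})\leq RB_{D_{\mathcal{E}}}(0\,|\,\mathbf{x})\,|\,\mathbf{x}\big)$ by the finite sum
$$\sum_{\{i\geq i_{0}:\,\widehat{RB}_{D_{\mathcal{E}}}(\hat{d}_{i/M}\,|\,\mathbf{x})\leq\widehat{RB}_{D_{\mathcal{E}}}(0\,|\,\mathbf{x})\}}\big(\hat{F}_{D_{\mathcal{E}}}(\hat{d}_{(i+1)/M}\,|\,\mathbf{x})-\hat{F}_{D_{\mathcal{E}}}(\hat{d}_{i/M}\,|\,\mathbf{x})\big). \tag{17}$$
For fixed $M$, as the numbers of prior and posterior draws tend to infinity, $\hat{d}_{i/M}$ converges almost surely to $d_{i/M}$, and (16) and (17) converge almost surely to $RB_{D_{\mathcal{E}}}(d\,|\,\mathbf{x})$ and $DP_{D_{\mathcal{E}}}\big(RB_{D_{\mathcal{E}}}(d\,|\,\mathbf{x})\leq RB_{D_{\mathcal{E}}}(0\,|\,\mathbf{x})\,|\,\mathbf{x}\big)$, respectively. The consistency of the proposed test follows from Proposition 6 of Al-Labadi and Evans (2018). As a recommendation, one should try different values of $M$ to make sure the right conclusion has been obtained. More details about implementing the approach are discussed in the following section.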
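Steps 7 and 8 can be sketched in Python as follows, taking the simulated prior and posterior distances as given. This is a minimal sketch, assuming the estimator at zero is $M\hat{F}(\hat{d}_{i_{0}/M}\,|\,\mathbf{x})/i_{0}$ and that each prior quantile bin has prior content $1/M$; the function name `relative_belief_at_zero`, the defaults `M=20` and `i0=1`, and the gamma-distributed toy distances are assumptions made only for illustration.

```python
import numpy as np

def relative_belief_at_zero(prior_d, post_d, M=20, i0=1):
    """Estimate RB(0 | x) and its strength from prior and posterior distances."""
    prior_d = np.asarray(prior_d, dtype=float)
    post_d = np.asarray(post_d, dtype=float)
    # Prior quantiles d_{i/M}, i = 0..M, with d_0 = 0 and d_1 the largest value.
    q = np.quantile(prior_d, np.arange(M + 1) / M)
    q[0] = 0.0
    # Posterior empirical cdf evaluated at the prior quantiles.
    F_post = np.array([(post_d <= d).mean() for d in q])
    content = np.diff(F_post)        # posterior content of each prior bin
    rb_bins = M * content            # bin-wise RB: prior content of a bin is 1/M
    rb0 = M * F_post[i0] / i0        # RB estimate at d = 0
    # Strength: posterior content of bins i >= i0 whose RB is at most rb0.
    keep = rb_bins[i0:] <= rb0
    strength = content[i0:][keep].sum()
    return rb0, strength

# Usage with toy distances: posterior concentrated near zero vs. far from zero.
rng = np.random.default_rng(0)
prior_d = rng.gamma(2.0, 1.0, size=5000)
rb_near, str_near = relative_belief_at_zero(prior_d, 0.1 * rng.gamma(2.0, 1.0, size=5000))
rb_far, str_far = relative_belief_at_zero(prior_d, 5.0 + rng.gamma(2.0, 1.0, size=5000))
```

With `M=20` and `i0=1` the ratio at zero is based on the lowest 5% of the prior distances; posterior draws beyond the largest prior distance are not counted in the strength.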
7 Simulation Studies
This section is divided into two subsections. In the first subsection, the quality of the approach for modeling multivariate distributions is investigated, where different choices of $a$, $H$ and the estimator of $R^{\ast}$ are considered. The evaluation relies on the mean of the Energy distance over replications. Note that, from Lemma 4, one may use the R package energy to compute the distance. We generated samples from a variety of bivariate distributions. The notation for the distributions used is listed in Table 7 of Appendix A. In this study, steps (1)-(6) of Algorithm 3 are used. Note that, for the methodology to work well, we expect this mean distance to be close to zero. In the second subsection, the proposed test is illustrated through several examples.
7.1 Checking the Quality of the Posterior-based Model
The performance of the posterior-based model (i.e., the quality of estimating the model) is illustrated by considering the bivariate distributions given in Table 1 with some choices of . The results are reported based on the Kendall’s correlation coefficients. From Table 1, values of close to zero indicate that the methodology models bivariate distributions well, particularly when . Note that, as mentioned in Lemma 3, the accuracy of the methodology decreases as the value of increases. As a further illustration, part (a) of Figure 1 gives the boxplots of the energy distance between and its corresponding for . Boxplots of the marginal distributions are given in part (b) of Figure 1, and the marginal densities of and its are given in Figure 2. The bivariate scatter plots are shown below the diagonal, histograms on the diagonal, and the Kendall correlation above the diagonal. Correlation ellipses and loess smooths (red lines) are also shown.
Next, we inspect the effect of choosing different correlation coefficients, such as the Gaussian rank, the Kendall’s and the Spearman’s , on the posterior-based model. Consider and as two true distributions (consult Table 7 in Appendix A for the notation). Table 2 reports the results for . Note that the package rococo is used to estimate based on the Gaussian rank correlation coefficients. It follows from Table 2 that the performance of the methodology is approximately the same for the different correlation coefficients.
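As an illustration of how rank correlations can feed a Gaussian copula, the sketch below computes Kendall's and Spearman's coefficients and maps each to a copula correlation through the standard relations rho = sin(pi * tau / 2) and rho = 2 sin(pi * rho_S / 6). Whether the paper uses exactly these mappings is an assumption, since the estimator is not shown in this section. The function names are hypothetical, the implementations assume samples without ties, and the Gaussian rank coefficient computed by rococo is not covered.

```python
import math


def kendall_tau(x, y):
    """Kendall's tau for paired samples without ties (O(n^2) pair count)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return 2 * s / (n * (n - 1))


def ranks(v):
    """Ranks 1..n of the entries of v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda k: v[k])
    r = [0] * len(v)
    for rank, k in enumerate(order, start=1):
        r[k] = rank
    return r


def spearman_rho(x, y):
    """Spearman's rank correlation for paired samples without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def copula_corr_from_kendall(tau):
    """Gaussian copula correlation implied by Kendall's tau."""
    return math.sin(math.pi * tau / 2)


def copula_corr_from_spearman(rho_s):
    """Gaussian copula correlation implied by Spearman's rho."""
    return 2 * math.sin(math.pi * rho_s / 6)
```

Both mappings send -1, 0 and 1 to themselves and differ only slightly in between, which is consistent with the observation that the choice of coefficient has little effect on the results.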
7.2 Checking MVN Based on the BSPGC Approach
The proposed normality test is illustrated through some interesting examples discussed in Henze and Visagie (2019). Note that is a skewed heavy-tailed distribution and is a symmetric heavy-tailed distribution. Also, , for is a symmetric distribution that behaves very similarly to a bivariate normal distribution. For a given sample of size , generated from the distributions in Table 3, the bivariate normality assumption is checked. For all cases, we set and in Algorithm 3. To study the sensitivity of the approach, various values of are considered. The results of the proposed test are reported in Table 3. The results are also compared to the Energy (E)-test (Székely and Rizzo, 2013). Recall that we want and the strength close to 1 when is true, and and the strength close to 0 when is false. It is seen from Table 4 that the proposed test performs excellently in accepting or rejecting the bivariate normality assumption. The type I error and the power of the test are also reported in Table 4. They show that the proposed test is powerful in both accepting and rejecting .
The next example uses a real data set.
Real data example (Swiss Heads): In this example, we consider the data of six readings on the dimensions of the heads of 200 twenty-year-old Swiss soldiers given by Flury and Riedwyl (1988). The variables are minimal frontal breadth, breadth of angulus mandibulae, true facial height, length from glabella to apex nasi, length from tragion to nasion, and length from tragion to gnathion. The problem is to assess the six-variate normality assumption for this data set. The E-test’s p-value is , which shows strong evidence to reject the six-variate normality assumption. The proposed test gives and strength based on the Kendall’s and , so the methodology also presents strong evidence to reject the six-variate normality assumption.
We end this subsection by investigating the effect of prior-data conflict on the approach. This in fact highlights the effect of the choice of in the prior-based model. For this, consider the results of the MVN test when for different choices of in Table 5. Clearly, when the results are correct; otherwise, they are incorrect. Another concern is to check the effect of the double use of the data by considering as in the prior distance. In particular, Table 6 gives the mean of the prior distance for various choices of . It is evident from this table that the prior distance is invariant with respect to the choice of .
8 Concluding Remarks
A BSPGC approach and its application to the MVN test have been suggested. In this procedure, a Gaussian copula model has been utilized to induce the dependence structure of the underlying multivariate distribution . The Dirichlet process has then been constructed on the unknown margins of to define the prior-based and posterior-based models, respectively. The test has been developed by using the relative belief ratio to compare, at zero, the concentration of the distribution of the distance between the posterior-based model and the null distribution with the concentration of the distribution of the distance between the prior-based model and the null distribution. The Energy distance has been applied to compute distances, as it is an appropriate tool especially in high-dimensional problems. The methodology has been examined by a simulation study that demonstrates its excellent performance. Finally, an application of the test to a real data example has been presented. A main advantage of the procedure is that it takes into account the dependence structure of the data in the MVN test. The extension of the procedure to other areas of multivariate data analysis by considering various families of copulas will be part of future research.
Appendix A Relevant Notations
References
- [1] Al-Labadi, L., Baskurt, Z., and Evans, M. (2017). Goodness of fit for the logistic regression model using relative belief. Journal of Statistical Distributions and Applications, 4(1), 1.
- [2] Al-Labadi, L., Baskurt, Z., and Evans, M. (2018). Statistical reasoning: choosing and checking the ingredients, inferences based on a measure of statistical evidence with some applications. Entropy, 20(4), 289.
- [3] Al-Labadi, L. and Evans, M. (2017). Optimal robustness results for relative belief inferences and the relationship to prior-data conflict. Bayesian Analysis, 12(3), 705–728.
- [4] Al-Labadi, L. and Evans, M. (2018). Prior-based model checking. Canadian Journal of Statistics, 46(3), 380–398.
- [5] Al-Labadi, L., Fazeli Asl, F., and Saberi, Z. (2019a). A Bayesian nonparametric test for assessing multivariate normality. Technical Report arXiv:1904.02415.
- [6] Al-Labadi, L., Patel, V., Vakiloroayaei, K., and Wan, C. (2019b). Kullback-Leibler divergence for Bayesian nonparametric model checking. Technical Report arXiv:1903.00669.
- [7] Al-Labadi, L., and Wang, C. (2019). Measuring Bayesian robustness using Rényi’s divergence and relationship with Prior-Data conflict. Technical Report arXiv:1905.05945.
- [8] Al-Labadi, L. and Zarepour, M. (2017). Two-sample Kolmogorov-Smirnov test using a Bayesian nonparametric approach. Mathematical Methods of Statistics, 26(3), 212–225.
