A Regularized MANOVA Test for Semicontinuous High‐Dimensional Data

Elena Sabbioni; Claudio Agostinelli; Alessio Farcomeni

PMC · DOI:10.1002/bimj.70054·April 29, 2025

A Regularized MANOVA Test for Semicontinuous High‐Dimensional Data

Elena Sabbioni, Claudio Agostinelli, Alessio Farcomeni

PDF

Open Access

TL;DR

This paper introduces a new MANOVA test for semicontinuous data that works even when the number of variables is larger than the sample size.

Contribution

The novelty lies in the use of a regularized likelihood ratio test for high-dimensional semicontinuous data.

Findings

01

The proposed test avoids computational overheads by using closed-form regularized estimators.

02

Simulation studies show the test maintains good power and level performance.

03

The method is applied to microRNA expression and plant invasion data, demonstrating its practical utility.

Abstract

We propose a MANOVA test for semicontinuous data that is applicable also when the dimension exceeds the sample size. The test statistic is obtained as a likelihood ratio, where the numerator and denominator are computed at the maxima of penalized likelihood functions under each hypothesis. Closed form solutions for the regularized estimators allow us to avoid computational overheads. We derive the null distribution using a permutation scheme. The power and level of the resulting test are evaluated in a simulation study. We illustrate the new methodology with two original data analyses, one regarding microRNA expression in human blastocyst cultures, and another regarding alien plant species invasion in the island of Socotra (Yemen).

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species2

Homo sapiens Socotra

Tables4

TABLE 1. Observed test level for the regularized MANOVA test (scMAN) and for Chen et al. (2011) (Chen) in the presence of semicontinuous high‐dimensional data for K=2 and different values of n, p, πj1, ρ.

$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	Test	$n_{k} = 5$				$n_{k} = 10$
$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	Test	$p = 50$	$p = 100$	$p = 150$	$p = 200$	$p = 50$	$p = 100$	$p = 150$	$p = 200$
0	0	0.2	0	scMAN	0.045	0.038	0.058	0.055	0.041	0.058	0.062	0.046
				Chen	0.045	0.046	0.045	0.055	0.035	0.049	0.052	0.058
0	0	0.2	0.4	scMAN	0.048	0.048	0.048	0.053	0.044	0.055	0.043	0.058
				Chen	0.051	0.042	0.046	0.064	0.040	0.045	0.052	0.049
0	0	0.5	0	scMAN	0.046	0.051	0.038	0.043	0.031	0.053	0.047	0.052
				Chen	0.049	0.079	0.033	0.051	0.316	0.080	0.051	0.051
0	0	0.5	0.4	scMAN	0.039	0.043	0.040	0.043	0.042	0.050	0.045	0.053
				Chen	0.050	0.056	0.040	0.056	0.312	0.078	0.049	0.047
0	0	0.8	0	scMAN	0.050	0.051	0.045	0.040	0.048	0.060	0.057	0.050
				Chen	0.756	0.536	0.376	0.245	1.000	1.000	1.000	1.000
0	0	0.8	0.4	scMAN	0.046	0.048	0.047	0.031	0.046	0.055	0.055	0.047
				Chen	0.757	0.533	0.380	0.251	1.000	1.000	1.000	1.000

TABLE 2. Observed power for the regularized MANOVA test (scMAN) and for Chen et al. (2011) (Chen) in the presence of semicontinuous high‐dimensional data for K=2 and different values of n, p, c1, πj1, ρ.

$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	Test	$n_{k} = 5$				$n_{k} = 10$
$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	Test	$p = 50$	$p = 100$	$p = 150$	$p = 200$	$p = 50$	$p = 100$	$p = 150$	$p = 200$
1	0	0.2	0	scMAN	0.985	0.994	0.990	0.984	1.000	1.000	1.000	1.000
				Chen	0.851	0.989	0.998	1.000	0.594	0.824	0.935	0.967
1	0	0.2	0.4	scMAN	0.505	0.550	0.529	0.531	0.736	0.837	0.876	0.891
				Chen	0.402	0.481	0.494	0.509	0.382	0.549	0.644	0.702
1	0	0.5	0	scMAN	0.755	0.820	0.714	0.725	0.988	0.959	0.939	0.916
				Chen	0.205	0.268	0.314	0.326	0.323	0.108	0.104	0.133
1	0	0.5	0.4	scMAN	0.420	0.469	0.451	0.443	0.770	0.767	0.769	0.794
				Chen	0.203	0.248	0.295	0.268	0.326	0.114	0.109	0.131
1	0	0.8	0	scMAN	0.134	0.186	0.237	0.219	0.243	0.406	0.548	0.634
				Chen	0.758	0.534	0.392	0.249	1.000	1.000	1.000	1.000
1	0	0.8	0.4	scMAN	0.122	0.161	0.169	0.172	0.250	0.359	0.45	0.514
				Chen	0.758	0.537	0.392	0.258	1.000	1.000	1.000	1.000
5	0	0.2	0	scMAN	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
				Chen	1.000	1.000	1.000	1.000	0.994	1.000	1.000	1.000
5	0	0.2	0.4	scMAN	0.999	0.999	1.000	1.000	1.000	1.000	1.000	1.000
				Chen	1.000	1.000	1.000	1.000	0.995	1.000	1.000	1.000
5	0	0.5	0	scMAN	0.999	1.000	0.999	1.000	1.000	1.000	1.000	1.000
				Chen	0.767	0.837	0.929	0.963	0.327	0.123	0.138	0.199
5	0	0.5	0.4	scMAN	0.999	1.000	0.999	0.988	1.000	1.000	1.000	1.000
				Chen	0.775	0.857	0.944	0.964	0.327	0.123	0.138	0.199
5	0	0.8	0	scMAN	0.688	0.584	0.550	0.521	0.976	0.976	0.987	0.977
				Chen	0.755	0.540	0.401	0.263	1.000	1.000	1.000	1.000
5	0	0.8	0.4	scMAN	0.639	0.556	0.543	0.482	0.972	0.978	0.991	0.985
				Chen	0.755	0.540	0.398	0.264	1.000	1.000	1.000	1.000

TABLE 3. Observed power for the regularized MANOVA test (scMAN) and for Chen et al. (2011) (Chen) in the presence of semicontinuous high‐dimensional data for K=2 and different values of n, p, c2, πj1, ρ.

$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	Test	$n_{k} = 5$				$n_{k} = 10$
$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	Test	$p = 50$	$p = 100$	$p = 150$	$p = 200$	$p = 50$	$p = 100$	$p = 150$	$p = 200$
0	0.15	0.2	0	scMAN	0.157	0.132	0.089	0.060	0.546	0.576	0.502	0.426
				Chen	0.079	0.110	0.149	0.164	0.139	0.215	0.278	0.334
0	0.15	0.2	0.4	scMAN	0.107	0.075	0.056	0.059	0.455	0.308	0.168	0.117
				Chen	0.073	0.079	0.089	0.089	0.129	0.174	0.197	0.210
0	0.15	0.5	0	scMAN	0.179	0.100	0.051	0.016	0.360	0.270	0.146	0.084
				Chen	0.093	0.096	0.099	0.125	0.914	0.755	0.589	0.445
0	0.15	0.5	0.4	scMAN	0.149	0.107	0.060	0.043	0.284	0.157	0.078	0.064
				Chen	0.089	0.097	0.102	0.121	0.912	0.755	0.594	0.438
0	0.15	0.8	0	scMAN	0.269	0.379	0.426	0.474	0.936	0.948	0.838	0.671
				Chen	0.994	0.985	0.978	0.965	1.000	1.000	1.000	1.000
0	0.15	0.8	0.4	scMAN	0.261	0.403	0.438	0.473	0.926	0.896	0.778	0.648
				Chen	0.994	0.986	0.977	0.967	1.000	1.000	1.000	1.000
0	0.3	0.2	0	scMAN	0.259	0.139	0.077	0.032	0.853	0.655	0.488	0.414
				Chen	0.304	0.539	0.659	0.777	0.490	0.714	0.842	0.917
0	0.3	0.2	0.4	scMAN	0.170	0.096	0.062	0.050	0.717	0.317	0.146	0.101
				Chen	0.235	0.282	0.331	0.373	0.450	0.648	0.746	0.807
0	0.3	0.5	0	scMAN	0.556	0.303	0.075	0.002	0.824	0.543	0.260	0.148
				Chen	0.273	0.341	0.375	0.438	1.000	0.995	0.991	0.988
0	0.3	0.5	0.4	scMAN	0.491	0.330	0.142	0.040	0.694	0.348	0.164	0.113
				Chen	0.258	0.333	0.373	0.406	1.000	0.995	0.990	0.987

TABLE 4. Proportion of rejections for the regularized MANOVA test for semicontinuous high‐dimensional data for K=4 and different values of n, p, c1, c2, πj1, ρ.

$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	$p = 50$	$n_{k} = 5$				$n_{k} = 10$
$c_{1}$	$c_{2}$	$π_{j 1}$	$ρ$	$p = 50$	$p = 100$	$p = 150$	$p = 200$	$p = 50$	$p = 100$	$p = 150$	$p = 200$
0	0	0.2	0	0.062	0.067	0.066	0.066	0.050	0.042	0.046	0.053
0	0	0.2	0.4	0.056	0.055	0.057	0.055	0.048	0.040	0.061	0.053
0	0	0.5	0	0.041	0.050	0.054	0.050	0.060	0.055	0.049	0.058
0	0	0.5	0.4	0.045	0.037	0.048	0.039	0.067	0.055	0.055	0.059
0	0	0.8	0	0.036	0.042	0.048	0.065	0.053	0.053	0.055	0.061
0	0	0.8	0.4	0.037	0.048	0.042	0.058	0.046	0.049	0.049	0.047
1	0	0.2	0	0.845	0.991	1.000	1.000	0.998	1.000	1.000	1.000
1	0	0.2	0.4	0.349	0.408	0.444	0.481	0.546	0.692	0.760	0.770
1	0	0.5	0	0.487	0.781	0.846	0.860	0.941	1.000	1.000	1.000
1	0	0.5	0.4	0.271	0.334	0.325	0.356	0.615	0.764	0.836	0.816
1	0	0.8	0	0.059	0.084	0.102	0.127	0.146	0.247	0.355	0.436
1	0	0.8	0.4	0.056	0.084	0.086	0.092	0.147	0.228	0.281	0.343
5	0	0.2	0	1.000	0.991	0.984	0.996	1.000	1.000	1.000	1.000
5	0	0.2	0.4	1.000	0.992	0.982	0.998	1.000	1.000	1.000	1.000
5	0	0.5	0	0.881	0.977	0.996	1.000	0.993	1.000	1.000	1.000
5	0	0.5	0.4	0.848	0.946	0.976	0.988	0.992	1.000	0.999	1.000
5	0	0.8	0	0.246	0.340	0.361	0.374	0.894	0.952	0.985	0.984
5	0	0.8	0.4	0.225	0.308	0.329	0.333	0.857	0.925	0.965	0.967
0	0.15	0.2	0	0.157	0.224	0.215	0.225	0.400	0.486	0.509	0.526
0	0.15	0.2	0.4	0.129	0.115	0.122	0.088	0.357	0.370	0.339	0.248
0	0.15	0.5	0	0.129	0.153	0.138	0.141	0.320	0.394	0.341	0.317
0	0.15	0.5	0.4	0.115	0.095	0.08	0.067	0.274	0.237	0.205	0.134
0	0.15	0.8	0	0.000	0.006	0.014	0.040	0.712	0.901	0.903	0.822
0	0.15	0.8	0.4	0.000	0.006	0.019	0.036	0.685	0.859	0.843	0.763
0	0.3	0.2	0	0.343	0.438	0.460	0.486	0.852	0.867	0.818	0.796
0	0.3	0.2	0.4	0.266	0.217	0.184	0.156	0.814	0.708	0.498	0.348
0	0.3	0.5	0	0.474	0.543	0.484	0.367	0.917	0.930	0.881	0.796
0	0.3	0.5	0.4	0.439	0.400	0.318	0.219	0.862	0.755	0.515	0.313

Funding4

—Italian Ministry of University and Research
—BaC INF‐ACT S4‐BEHAVE‐MOD PE00000007 PNRR M4C2 Inv. 1.3 ‐ NextGenerationEU
—SmartData@PoliTO
—GNAMPA (INdAM ‐ Istituto Nazionale di Alta Matematica)

Keywords

penalized methodspermutation testszero‐inflation

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenetically Modified Organisms Research · Pharmacological Effects of Medicinal Plants

Full text

Introduction

1

A well‐known generalization of ANOVA tests for the case of multivariate data is the MANOVA test (Wilks 1938). The test involves a joint null hypothesis claiming equal mean vectors of two or more multivariate Gaussian distributions, usually under an assumption of equal covariance matrices. While classical tests are devised for low‐dimensional settings, there are several contributions for continuous data when the dimension exceeds the sample size, with recent reviews Harrar and Kong (2022). More in detail, some works involve some form of regularization (Cai and Xia 2014; Chen et al. 2011; Dong et al. 2017), in order to guarantee positive definiteness of the (possibly common) variance covariance matrix; other works involve estimation of the traces of covariance matrices (Chen and Qin 2010; Hu et al. 2017; Srivastava and Kubokawa 2013; Yamada and Himeno 2015).

In this work, we are interested in semicontinuous data, that involve positive continuous measurements plus a certain fraction of exactly zero observations. Semicontinuous data are measured in several important applications like abundance estimation in ecology Farcomeni (2016), omics data Taylor and Pollard (2009), cost‐effective analysis in medical research (Tu and Zhou 1999). Inference involves separately modeling the mass probability at zero, and the positive continuous observations conditionally on a non‐zero measurement (Chai and Bailey 2008; Lachenbruch 2001, 2002). A common assumption is that the positive measurements are log‐normal. Tests have appeared for different formulations of the null hypothesis (Tu and Zhou 1999; Xiao‐Hua and Tu 1999; Zhou and Tu 1999), which inevitably involve both the occurrence probabilities and the conditional moments of the continuous part. Formal MANOVA tests for semicontinuous data have appeared for the case in which the dimension does not exceed the sample size (Farcomeni 2016). Here, we propose a MANOVA test for semicontinuous data, based on regularization, that is applicable also when the dimension exceeds the sample size. This is a quite common occurrence in modern applications, including the areas discussed above. To motivate our contribution, we present two original data examples. One involves testing homogeneity of microRNA expression in human blastocysts, where many zeros are observed due to the very early stage of development, and the dimension is large. The other application involves comparing overall abundance of several invasive species in the island of Socotra (Yemen), where many alien species might not be observed in a few sampling spots. To the best of our knowledge, our MANOVA test for high‐dimensional semicontinous data is the first test of this kind to appear in the literature.

The rest of the paper is as follows. In the next section, we set up the general framework and notation. A penalized version of the maximum likelihood estimator (MLE) for the general model is described in Section 2.1. The null hypothesis and the resulting likelihood‐ratio type test are described in Section 2.2. The resulting test is evaluated through a simulation study in Section 3, where we report on the observed level and power over different scenarios. We then illustrate the new methodology through two original real data examples. The microRNA data are described and analyzed in Section 4.1; while the alien species data in Section 4.2. Some concluding remarks are given in Section 5.

The methods discussed in this paper have been implemented in an R package called semicontMANOVA, whose source and compiled version are available along with this paper on the publisher's website.

General Framework

2

Assume we have K groups, where group k has nk observations, where k=1,⋯,K. Let n:=∑k=1Knk denote the total number of observations. Let Xk be an nk×p matrix that contains nk p‐dimensional observations of the kth group. In our framework, each element Xijk can be either zero, or positive, and we expect a positive proportion of zeros.

Consider Yijk:=1(Xijk>0), where 1(·) is the indicator function and let a:=(a1,⋯,ap)∈{0,1}p be any of the possible configurations of presences (Yijk=1) and absences (Yijk=0) over p components. We model Yik=Yi1k,⋯,Yipk with a multivariate Bernoulli distribution, that is fully specified by 2p−1 free parameters of the kind πk(a):=PYi1k=a1,⋯,Yipk=ap, assuming ∑a∈{0,1}pπk(a)=1.

For all k=1,…,K, i=1,…,nk, and j=1,…,p, we let X∼ijk:=logXijk when Yijk=1, otherwise X∼ijk is not observed. We assume that X∼ik follows a multivariate p‐dimensional Gaussian distribution, with mean μk and covariance matrix Σ when all the components are observed. We thus make a homoskedasticity assumption.

Estimating all the parameters of the Bernoulli distribution is reasonable when p is small, while their number is definitely not manageable in situations of high dimensionality of the data, especially when p>n. We reduce the number of parameters by assuming that the joint probability πk depends only on the number of positive components that are present in one observation. Formally, we assume

[eqn]

To simplify the notation, we define πk(s):=πk(a) for each a such that s:=∑j=1paj, where s=0,⋯,p. Assuming (1), the number of parameters describing the presence/absence part of the data decreases from K(2p−1) to Kp. Assumption (1) above can be verified up front through a simple Chi‐square test. We let θ:=π,μ,Σ be the set of all unknown parameters in the parametric space Θ=(0,1)Kp×RKp×Sp, where Sp is the set of all symmetric positive definite p×p matrices. The total number of parameters is ϑ=2Kp+p(p+1)/2.

Likelihood Inference

2.1

In this section, we introduce a penalized version of the maximum likelihood estimator (MLE) for all the parameters θ. The penalty is necessary for the case in which p>n, and might be useful otherwise (e.g., an increased bias might correspond to a decrease in variance). We define the set V(Yik):={j∈{1,⋯,p}:Yijk=1}, containing the indices of all positive (observed) components of Xik. In the following discussion, given a p‐dimensional vector b and a set V⊆{1,⋯,p}, bV is the |V|‐dimensional vector with the components of b that are indexed by the elements of V. Similarly, given a p×p matrix A, AV indicates the |V|×|V| matrix built considering only the rows and the columns of A that have indices belonging to V. Hence, let nik:=∑j=1pYijk the number of observed components for the ith observation in the kth group. We have

[eqn]

i.e., conditionally on the presence‐absence vector Yik, the logarithm of the positive components of Xik follows a multivariate Gaussian distribution, with dimension equal to the number of non‐zero components and parameters obtained as the corresponding subset of the p‐dimensional parameters μk and Σ. Combining the information from X∼ik and Yik, the likelihood is given by

[eqn]

and the correspoding log–likelihood is

[eqn]

When p>n the usual MLE of πk, μk and Σ is not well defined, since the sample variance and covariance matrix are not positive definite. We regularize the problem through a Ridge‐like penalty. Specifically, the regularized log‐likelihood ℓλ(θ) can be written as

[eqn]

where the penalization term P(λ,Σ) is defined as

[eqn]

with λ=(λ1,λ2,…,λp)⊤∈R+p and Λ=diag(λ). In the low‐dimensional setting with n>p one can set λ=0. The penalized MLEs of the parameters involved in the regularized likelihood (3) can be obtained in the closed form (see details in Appendix A) and are given by

[eqn]

[eqn]

[eqn]

with j=1,…,p and k=1,…,K. Note that the fraction in the variance and covariance matrix estimator is considered to be entry‐wise and ∘ denotes the Hadamard product. When λ=0 the expression above coincides with the MLE Farcomeni (2016). A general requirement is that ∑k=1K∑i=1nkYij1kYij2k>0 for each j1≠j2,j1,j2=1,⋯,p, that is, each couple of variables is simultaneously present in at least one observation in the sample. In real applications with p>>n there might be couples of variables that do not satisfy the previous condition. Those variables must be removed from the dataset starting from those with the largest number of absent components, until all the others satisfy the condition. This process reduces the dimension, but it has an impact only on the continuous part of the likelihood.

Hypothesis Testing

2.2

In this section, we define a MANOVA test in our possibly high‐dimensional semicontinuous setting. The null hypothesis we wish to test, under the framework previously described, can be summarized as

[eqn]

which corresponds to assuming the probability of having s positive components, with s=0,⋯,p, is equal in all the groups, and, given a presence, all expectations are the same. We let Θ0⊆Θ be the parametric space under the null hypothesis. For the sake of generality we admit the possibility of a penalty parameter λ0=(λ01,λ02,…,λ0p)⊤ under the null hypothesis that can differ from λ, the penalty parameter under the alternative hypothesis. The log‐likelihood under H0 can be written as

[eqn]

and its penalized version is then

[eqn]

It can be shown, similarly to the unconstrained case, that the expression above is maximized by

[eqn]

[eqn]

[eqn]

where Λ0=diag(λ0). In order to test H0 we define a test statistic which has the form of a likelihood ratio test (LRT) statistic, given by

[eqn]

When n>p and λ=λ0=0 the expression above reduces to the classical LRT statistic, which Wilks' theorem Wilks (1938) guaranteed to be asymptotically distributed like a Chi‐squared, with degrees of freedom given by the difference in the number of free parameters. This result is not bound to hold in general, therefore we rely on a permutation test (Mielke and Berry 2007; Pesarin 2001), based on the test statistic (12). Namely, we compute the test statistic on the observed data, and on versions in which group labels are permuted uniformly at random. The final p‐value corresponds to the proportion of resamples which yield a larger test statistic than the one observed. The choice of the values of λ and λ0 used in Dλ,λ0 is discussed in Section 2.3.

Choice of λ and λ0

2.3

The selection of λ and λ0 can be performed by cross‐validation or by minimizing an information criteria of the form

[eqn]

The second method is faster and what we suggest to use in practice. Measures of model complexity have been extensively studied (Janson et al. 2015). We use the trace of the Fisher information matrix (Bozdogan 1987; Takeuchi 1976), as we detail below.

Let Φ and Φ0 denote the sets of λ and λ0 that guarantee the positive definiteness of the variance and covariance matrix estimators (7) and (11), respectively. The choice of λ∈Φ is based on the minimization of

[eqn]

The expressions M(λ,π^,μ^,Σ^) are obtained by reformulating our problem as a weighted regression problem, and using the trace of the Fisher information matrix as a measure of the complexity of the model Bozdogan (1987) (see Appendix B for the complete derivation). In the same spirit, we choose λ0∈Φ0 such that

[eqn]

The LRT test statistic (12) is finally denoted as Dλ^,λ^0.

Simulation Study

3

In this section, we present a simulation study to evaluate the performance of the new test. We consider K=2 and K=4 groups, with balanced sample sizes n1=⋯=nK, with nk={5,10}, and different dimension p={50,100,150,200}. For the variance and covariance matrix we assume the logarithms of the positive observations have unit variance and constant covariance, equal to ρ, with ρ={0,0.4}. We set the mean vector of the first group μ1 equal to the null vector, while the other groups have mean components μjk=μj1+c1(k−1)/(K−1), with c1={0,1,5}, for j=1,⋯,p. The marginal probability of having zeros in the first group is varied as πj1={0.2,0.5,0.8} for j=1,⋯,p, while for the kth group πjk=πj1+c2(k−1)/(K−1), where c2={0,0.15,0.3}, whenever πjk<1.

Considering all the parameters' combinations we finally evaluate 448 scenarios. For each scenario we generate data 1000 times, we fix a nominal test size at α=0.05, and use 1000 permutations to evaluate the significance level of our proposed test statistic. During the estimation procedure, we additionally choose λ:=λ1=⋯=λp and λ0=λ01=⋯=λ0p.

Tables 1, 2, 3 show the proportions of rejections averaged over the 1000 replicates for each scenario with K=2, while Table 4 presents the rejections when K=4. When testing the null hypothesis with two groups, we compare our results to those of Chen et al. (2011). Although Chen's method was not originally intended to address missing data, it is the sole approach in the literature that can be adapted to manage high‐dimensional semicontinuous data. However, this method is limited to two‐group scenarios, meaning the comparison cannot be extended to cases with K=4 groups. Note that for all scenarios in which c1=c2=0 the null hypothesis is true, and the proportion of rejections corresponds to the observed test level. In all other cases, the proportion of rejections corresponds to the observed power. The actual size of the regularized MANOVA test closely aligns with the nominal size of 0.05. In some instances, the proportion of rejections slightly exceeds 0.05, but this excess constantly remains below the Monte Carlo error and could be reduced by increasing the number of repetitions. On the other hand, the test proposed by Chen et al. (2011) significantly exceeds the nominal size when the number of observations is small (nk=5) and the proportion of zeros is high (πj1=0.8). This could be because the method was not primarily developed for hypothesis testing in the presence of semicontinuous data. When a large number of components are missing and insufficient information can be recovered from the remaining observations in the sample, it faces difficulties in reaching the correct conclusion. As could be expected the power of the regularized MANOVA test increases with the sample size and with c1 and c2 growth, which leads to more separation between the two groups. With respect to Chen et al., in the configurations in which the nominal size was guaranteed, our method obtains in general higher or comparable power. The effect of positive correlation among the components leads in the new proposed method to a small decrease in the proportion of rejections. In the cases in which c2=0, we additionally observe that higher values of πj1 lead to lower power. This can be explained by the fact some variables are removed when there is a high probability of zero measurements. Indeed, as described in Section 2.1, some components can be removed during the estimation procedure, resulting on estimates of dimension p∗, with p∗≤p. Tables C1 and C2 in Appendix C report for each scenario the mean over 1000 replicates of p∗ and of the selected values of λ and λ0 for K=2 and K=4, respectively. The mean value of p∗ is notably influenced by the marginal probabilities of missing data in the two groups πj1 and πj2. Consequently, we observe a significant decrease in p∗ when πj1 and/or c2 grow. As expected, when the number of removed variables is too large, the observed power drops significantly. For instance, in the scenario with (K,nk,c1,c2,πj1,ρ)=(4,5,0,0.15,0.8,0) the rejection rate is extremely low. This can be attributed to the very small value of p∗ in this setting. On the other hand, when the number of observations slightly grows (nk=10), our method is able to recover a satisfactory power. Therefore, it can be concluded that, when p∗ is sufficiently large, the regularized MANOVA test demonstrates good power.

We simulate also two additional scenarios that mimic the real example described in Section 4.1. Hence, we test a situation with n1=n2=5, p=339, both under the null (scenario A) and under the alternative hypothesis (scenario B). The mean of the normal distribution used to simulate the data in scenarios A and B is equal to the sample mean of the blastocyst data presented in Section 4.1 under the null and under the alternative hypothesis respectively. In the second case, some of the components of the sample means of the different groups were missing, therefore we imputed the missing values with the corresponding components under H0. In a similar vein, the marginal probability of a missing component is equal to its sample version in the two different scenarios. Also, the sample variance and covariance matrices had some missing entries. Hence, we have imputed the missing elements along the main diagonal with 1 and the ones that were not on the main diagonal with 0. To achieve the positive definiteness, we additionally added 0.1 to each element on the main diagonal. The proportion of rejections in scenario A is equal to 0.051, while under scenario B it is 0.874. The mean dimension of components used for the estimation is respectively 251.74 and 261.01.

Real‐Data Analyses

4

We describe in this section the analysis of two real‐data examples.

MicroRNA Differential Expression in Blastocyst Cultures

4.1

The study of microRNA expression is particularly useful for the assessment of biological pathways, especially in the early‐stage development (Capalbo et al. 2016). In our original data example, we have n=10 samples coming from human blastocysts (day 5 after in vitro fertilization). Simplifying the setting, it can be said that blastocysts are composed by about 120 cells, about one‐third of which differentiated into an inner cell mass (ICM), and the remaining forming an outer TrophoEctoderm (TE). The ICM will proceed to develop into a human being, and the TE will develop into a placenta and other annexes.

In our data, we have n1=5 samples from ICM and n2=5 samples from TE. MicroRNA expression of p=339 sequences was obtained from each sample, with technical details that can be found elsewhere (Capalbo et al. 2016). MicroRNA are short sequences that are known to coordinate the expression of pathways of genes. Since many pathways are not yet activated at the blastocyst stage, many microRNA sequences will not be found in the samples, leading to zero‐inflation. This is confirmed in our data, where only about 39% of the sequences are non‐zero in all samples.

The main idea is to test the biological hypothesis that even if cells are not drastically differentiated at the blastocyst stage, their microRNA expression already is showing different pathway activation. In our framework and notation, this precisely would correspond to rejecting

[eqn]

indicating that there exist at least one microRNA that is expressed with differential probability or level of expression when comparing ICM with trophectoderm. Due to the very large number of zeros in the data it is clear that any test assuming multivariate normality, without taking into account the probability masses at zero, would be biased even if well calibrated for the p>n setting.

We use our proposed test to verify (15), and obtain a test statistic of 15.15 and, after permutation based on 1000 replicates, the p‐value is 0.008. We then reject the null hypothesis and conclude that there is an overall difference in microRNA expression levels when comparing ICM and TE at day 5 after fertilization. Interestingly, 53.4% of microRNA measurements are zero in the ICM, while 35.7% in the TE, which indicates an higher level of the biological activity for the TE. Similarly, 80.8% estimated mean expressions are larger for TE than for ICM.

Alien Species Invasion in Socotra Island

4.2

Socotra is an island located in the Arabian sea and administratively belonging to Yemen. Its rich biodiversity and the presence of many endemic species make it a unique environment (e.g., Attorre et al. 2014; Riccardi et al. 2020). Socotra environment is endangered by many threats, including climate change, anthropic activities, and alien species invasion. In this section, we use an original dataset about the determinants of alien species invasion in the island. See Terzano et al. (2018) for a similar assessment related to South Africa.

Researchers explored the island to record the abundance of p=299 alien species in n=103 plots that were approximately square shaped with an area of about 0.5 ha each. The abundance measure was the widely used importance value (IV), which can be calculated as

[eqn]

where NS(s) denotes the number of stems for species s that were counted in the plot, and BA(s) the basal area occupied by species s as the sum of areas occupied at breast height by each individual stem. IV ranges then from 0 (species is absent) to 200 (the plot is entirely covered only by the species). Despite the species considered being potentially very invasive, they are (still) absent in many plots, with about 93% of entries being equal to zero.

Plots can be classified as being limestone, alluvial, or granite. Our main question involves the assessment of association of overall abundance of invasive alien species and the geological characteristics of the plots. We conducted a comparison across the three types of plots, yielding a p‐value of 0.04 and a test statistic of 139.427. As a result, we reject the null hypothesis, concluding that there is a significant overall difference in the distribution of alien species across the different plot types.

Conclusions

5

In this paper, we have presented, to the best of our knowledge, the first MANOVA test for high‐dimensional semicontinous data. The test depends on a penalty parameter, which can be chosen in different ways. Importantly, the permutation procedure used to compute the significance level allows the user to naturally take into account additional heterogeneity arising from any data‐driven penalty parameter choice. The derivation of the formal asymptotic distribution, especially for unknown λ, would be very cumbersome (Han and Yin 2023). A limitation of the permutation approach is, clearly, its computational complexity. We shall report however that our implementation is rather efficient, both thanks to the closed form solution for the MLE and the use of parallel computing. A future contribution might go beyond the parametric assumptions (Zhang and Feng 2024), maybe based on a test statistic that involves the estimated traces of the covariance matrices (Chen and Qin 2010; Hu et al. 2017; Srivastava and Kubokawa 2013; Yamada and Himeno 2015).

Software

Software in the form of R package semicontMANOVA is available on the publisher's website and on CRAN.

Conflicts of Interest

The authors declare no conflicts of interest.

Open Research Badges

This article has earned an Open Data badge for making publicly available the digitally‐shareable data necessary to reproduce the reported results. The data is available in the Supporting Information section.

This article has earned an open data badge “Reproducible Research” for making publicly available the code necessary to reproduce the reported results. The results reported in this article could fully be reproduced.

Supporting information

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Attorre, F. , A. Issa , L. Malatesta , et al. 2014. “Analysing the Relationship Between Land Units and Plant Communities: The Case of Socotra Island (Yemen).” Plant Biosystems 148: 529–539.
2Boisbunon, A. , S. Canu , D. Fourdrinier , W. Strawderman , and M. T. Wells . 2014. “AIC, Cp and Estimators of Loss for Elliptically Symmetric Distributions.” International Statistical Review 84: 422–439.
3Bozdogan, H. 1987. “Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions.” Psychometrika 52, no. 3: 345–370.
4Cai, T. , and Y. Xia . 2014. “High‐Dimensional Sparse MANOVA.” Journal of Multivariate Analysis 131: 174–196.
5Capalbo, A. , F. Ubaldi , D. Cimadomo , et al. 2016. “mi RN As in Spent Blastocyst Culture Medium Derive From Trophectoderm Cells and Can be Explored for Human Embryo Reproductive Competence Assessment.” Fertility and Sterility 105: 225–235.26453979 10.1016/j.fertnstert.2015.09.014 · doi ↗ · pubmed ↗
6Chai, H. , and K. Bailey . 2008. “Use of Log‐Skew‐Normal Distribution in Analysis of Continuous Data With a Discrete Component at Zero.” Statistics in Medicine 27: 3643–3655.18186536 10.1002/sim.3210 PMC 2758628 · doi ↗ · pubmed ↗
7Chen, L. S. , D. Paul , R. L. Prentice , and P. Wang . 2011. “A Regularized Hotelling's T 2 Test for Pathway Analysis in Proteomic Studies.” Journal of the American Statistical Association 106, no. 496: 1345–1360.23997374 10.1198/jasa.2011.ap 10599 PMC 3755504 · doi ↗ · pubmed ↗
8Chen, S. X. , and Y.‐L. Qin . 2010. “A Two‐Sample Test for High‐Dimensional Data With Applications to Gene‐Set Testing.” Annals of Statistics 38: 808–835.