Objective Bayesian analysis for the multivariate skew-t model
Antonio Parisi, Brunero Liseo

TL;DR
This paper introduces a Bayesian approach for the multivariate skew-t model, including a new parameterization, priors, and a sampler, with extensions to regression and frontier models, and provides an R package for implementation.
Contribution
It presents a novel Bayesian framework for the multivariate skew-t model, extending previous skew-normal models and offering practical tools like an R package.
Findings
Successful Bayesian inference for multivariate skew-t models
Extension to regression and stochastic frontier models demonstrated
Implementation via the mvst R package available
Abstract
We perform a Bayesian analysis of the p-variate skew-t model, providing a new parameterization, a set of non-informative priors and a sampler specifically designed to explore the posterior density of the model parameters. Extensions, such as the multivariate regression model with skewed errors and the stochastic frontiers model, are easily accommodated. A novelty introduced in the paper is given by the extension of the bivariate skew-normal model given in Liseo & Parisi (2013) to a more realistic p-variate skew-t model. We also introduce the R package mvst, which allows to estimate the multivariate skew-t model.
| Model | Student- | SN | ST | |
|---|---|---|---|---|
| 6.60e-14 | 2.22e-01 | 3.87e-11 | 7.78e-01 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Objective Bayesian analysis for the multivariate skew-t model111Preprint submitted to Statistical Methods and Applications.
Antonio Parisi
DEF, University of Rome Tor Vergata
Brunero Liseo
MEMOTEF, Sapienza University of Rome
(April 30, 2017)
Abstract
We perform a Bayesian analysis of the -variate skew-t model, providing a new parameterization, a set of non-informative priors and a sampler specifically designed to explore the posterior density of the model parameters. Extensions, such as the multivariate regression model with skewed errors and the stochastic frontiers model, are easily accommodated. A novelty introduced in the paper is given by the extension of the bivariate skew-normal model given in Liseo & Parisi (2013) to a more realistic -variate skew-t model. We also introduce the R package mvst, which allows to estimate the multivariate skew-t model.
Keywords: Multivariate skew-t model, Multivariate skew-normal model, Objective Bayes inference, Population Monte Carlo sampler, skewness.
1 Introduction
In the last two decades there has been an explosion of interest around the possibility of constructing models which generalize the Gaussian distributions in terms of skewness and extra-kurtosis. Interest can be partially explained with the empirical observations of phenomena, in different disciplines which could not be easily represented via Gaussian distributions. See Genton (2004) and Azzalini & Capitanio (2014) for general accounts. In this perspective, different proposals of skew-Student distributions have been proposed and now they play a prominent role as empirical models for heavy-tailed data, particularly in finance (Rachev et al., 2008).
Among the various proposals we mention the skew-t distribution obtained as a scale mixture of skew-normal densities (Azzalini & Capitanio, 2003); the “two-piece” distributions of Hansen (1994) and Fernandez & Steel (1999); the skew-t distribution arising from a conditioning argument (Branco & Dey, 2001; Azzalini & Capitanio, 2003); the skew-t distribution of Jones & Faddy (2003), obtained by transforming a beta random variable, and the skew-t distribution arising from a sinh-arcsinh transformation (Rosco et al., 2011). In practice, the most used of these are the Azzalini-type skew-t distribution, in the form arising from scale mixing Azzalini’s skew-normal distribution (Azzalini & Capitanio, 2003) and the “two-piece” distribution.
In the paper we will concentrate on the Azzalini-type skew-t distribution. For a Bayesian analysis of the “two-piece” distribution one can refer to Rubio et al. (2015) and Leisen et al. (2016) where a new objective prior is introduced for the degrees of freedom parameter.
Following Azzalini & Capitanio (2014), their version of the multivariate skew-t distribution can be obtained as a scale mixture of multivariate skew-normal distributions. Let , where is the correlation matrix of the multivariate normal density appearing in the density of , and .
Let ; integrating out , one obtains the density of a -variate skew-t random vector as
[TABLE]
where .
The joint estimation of the skewness vector and the degrees of freedom parameter is hard even in the scalar case. For the symmetric Student’s distribution, it is known that the likelihood function tends to infinity when goes to zero (Fernandez & Steel, 1999). Fonseca & al. (2008) gave a condition for the existence of the MLE of in that case. For the skew-t distribution, the deviance approach has been implemented in Azzalini & Genton (2008), where now the replacement of the MLE of is based on the null hypothesis and on a distribution. However, simulation results have shown that this procedure provides only a partial solution to the problem. Alternatively, the modified score function approach has been applied to the skew-t distribution by Sartori (2006), although no proof of the finiteness of the resulting shape estimator has been provided; besides, this method requires the degrees of freedom parameter to be fixed. Branco et al. (2011) provides an objective Bayesian solution to this problem in the scalar case.
In this paper we propose a method which generalizes both the results in Branco et al. (2011) and Liseo & Parisi (2013). In fact we describe a Bayesian analysis of the -variate skew-t (ST) model, providing a parameterization, a set of non-informative priors and a sampler specifically designed to explore the posterior density of the parameters of the model. Extensions of the model, such as the multivariate regression model with skewed errors and the stochastic frontiers model, are straightforward.
The main novelty of the present paper is given by the extension of the bivariate skew-normal (SN) model given in Liseo & Parisi (2013) to a more realistic -variate ST model. Several issues arise in this extension, the most important of which is related to the elicitation of the prior distribution for the shape parameter and the sampling strategy for an additional set of latent variables.
This paper also introduces the R (R Core Team, 2015) package mvst, which is available in the CRAN repository.
Several other packages are available for dealing with skew-symmetric distributions; among others, the R packages sn (Azzalini, 2015), EMMIXuskew (Lee & McLachlan, 2013), mixsmsn (Prates et al., 2013) and the Stata (StataCorp., 2015) suite of commands st0207 by Marchenko & Genton (2010): however, most of them only rely upon the frequentist approach.
The rest of the paper is organized as follows: the second section introduces the model and the notation, along with the complete likelihood function and complete maximum likelihood estimators. It finally provides the prior distributions and the proof that the posterior distribution is proper.
The third section introduces the sampler and describes a set of proposal distributions.
Results from a simulation study are given in section four.
Throughout the paper, we will switch between three different parameterizations, characterized by the sets of parameters , and ; the former allows us to provide the proofs of our main results, the second one is the most sensible to elicit the prior distributions, while the latter is useful for the sampling strategy.
2 The model
The density of the multivariate skew-t random vector has been given in (1). For inferential purposes it is often necessary to introduce location and scale parameters, via the transformation . We then finally say that a random vector is distributed as a -variate skew-t distribution, denoted by , if its pdf is given by
[TABLE]
where and are -dimensional location and shape parameters, is a diagonal matrix with the marginal scale parameters, so that represents the scale matrix and represents the number of degrees of freedom. Moreover,
[TABLE]
There exist a useful stochastic representation of the random vector which is given in the following proposition.
Proposition 2.0.1
Let
[TABLE]
and let be the indicator function of the set ; define
[TABLE]
and
[TABLE]
with independent of . Then, (a) the random vector
[TABLE]
and (b) the joint density of is given by
[TABLE]
Proof: the result is a direct consequence of the definition of the skew-t distribution. Details can be found in Appendix A.
2.1 Augmented likelihood function
The above stochastic representation suggests to express the density of a skew-t random vector as the marginal density of the augmented vector given in (2.0.1).
It is useful to define the parameter vectors and , where
[TABLE]
Using the new parameterization , and in the presence of a sample of i.i.d. observations from a -dimensional , the augmented likelihood function is
[TABLE]
where , , .
2.1.1 Complete maximum likelihood estimators
The complete maximum likelihood (CML hereafter) estimators are obtained as if we had observed the values of the latent variables ’s and ’s. We will make use of the CML estimatates for the initialization of the sampling strategy, described below. They incorporate an additional piece of information, hence they could also be useful as a benchmark to evaluate and compare different estimators in a simulation experiment.
Given and , the likelihood (2.1) gets transformed into
[TABLE]
After straightforward calculations, the CML estimators are obtained as:
[TABLE]
where
[TABLE]
The estimator for have not a closed form expression: it is the solution of the following equation
[TABLE]
where denotes the digamma function.
2.2 Prior distributions
We assume the following prior structure for the parameters
[TABLE]
As pointed out in Liseo & Parisi (2013), when , even following an objective Bayesian approach, and cannot be considered a priori independent of each other. This depends on the expression of : in order to guarantee the positive definiteness of , one should consider, both in the analytical expression and in the computations, the constraint .
We further consider the decomposition
[TABLE]
and we assume a flat prior for and a conjugate Inverse Wishart prior for . This way we adopt the “usual” objective priors for the location and scale parameters as in the multivariate Normal model, which is nested in the multivariate ST model, as and . In practice, we set
[TABLE]
In real applications, we will take and . In §2.3, we prove that the use of an improper prior on produces proper posterior distributions, provided that the prior on the degrees of freedom parameter is proper and discrete over . Then we assume a uniform prior for over a set of 20 values ranging from 1 to 100.
Finally, we need to specify . For each value of , the parameter lies in a -dimensional region whose shape only depends on or . In particular, given the expression of , it is easy to verify that
[TABLE]
must hold, so the conditional parameter space is an ellipsoid, say , given by expression (6), centered at the origin and contained in the hyper-cube . In any simulation based approach care must be taken that the proposed values actually satisfy (6). For computational convenience we prefer to directly include this constraint on the prior. In the bivariate case, Liseo & Parisi (2013) used an approximation of the Jeffreys’ prior, normalized over . This normalization step, for large , may become computationally demanding. For this reason, we propose to adopt a uniform prior over , whose volume can be evaluated in a closed form, so the normalizing constant is analytically tractable. Then we assume: that is
[TABLE]
In the practical application of the ST model, we will use the parameterization for our sampling strategy. Hence, we need to compute the Jacobian of the transformation , which is given by
[TABLE]
2.3 Posterior propriety
Proposition 2.3.1
The posterior distribution of the model is proper.
Proof: Let , using the parameterization in (2),
[TABLE]
Since the c.d.f. is bounded by 1, one obtains
[TABLE]
Notice that the parameter only appears in the prior distribution; then it can be integrated out to obtain
[TABLE]
The above expression is proportional to the posterior density of the parameters of a multivariate Student- model, with priors given as in §2.2. Theorem 1 in Fernandez & Steel (1999) then guarantees that the posterior distribution of our ST model is proper as soon as the prior on is proper and except, possibly, for a set of Lebesgue measure zero in . The finite precision of the data recording process can lead, under some choices for the prior distributions, to improper posterior distributions. However, it is possible to verify this condition for any given dataset, and we refer to the cited article for details.
3 The sampler
In the following, we describe the sampling strategy. We have used a Population Monte Carlo algorithm (PMC hereafter, see Cappé et al., 2004), which improves and generalizes the one used in Liseo & Parisi (2013) for the bivariate SN model.
As a Monte Carlo method, the PMC sampler doesn’t rely on convergence arguments, hence it can overcome the problem of multimodality of the posterior distribution; moreover, it offers a great flexibility in choosing the proposal density functions. For example, we use (approximations of) the full conditional distributions as proposal densities.
The outline of the algorithm for the ST model is as follows:
- •
At iteration 0, a population of particles , containing the values of , and , is initialized. A possible initialization is described in §3.1.
- •
At a generic iteration
- –
new values for the particles are proposed following a proposal distribution , whose parameters possibly depend on the populations of particles in the previous iterations,
- –
the importance weights are computed as
[TABLE]
where and denote the unnormalized posterior density function and importance weights.
- –
A set of quantities are obtained on the basis of the current particles and weights. This set includes the estimates of the parameters , a quantity related to the performance of the sampler in the -th iteration
[TABLE]
and all the other objects of interest.
- –
the particles are multinomially resampled using the weigths .
- •
After iterations, the final estimates are obtained as a weighted mean of the estimates with (unnormalized) weights given by .
A quantity of special interest which can be easily obtained using the PMC is the marginal likelihood of each model. It can be estimated as
[TABLE]
3.1 Initial values for parameters
The initial points are sampled by mimicking the stochastic representation of the model. Then
the values of are sampled from the prior distribution; 2. 2.
given the values of the latent variables and are sampled by the respective sampling distributions described in Proposition 2.0.1; 3. 3.
given , and , the parameters , and are obtained as the CML estimates of the parameters, as described in §2.1.1.
3.2 Proposals
For the common parameters of SN and ST models, the proposal distributions are similar to those reported in Liseo & Parisi (2013); our versions are given in appendix B. The ST model, however, also includes the parameter and the latent variables ’s.
The parameter assumes values on a finite set, hence it is easy to simulate from its full conditional distribution
[TABLE]
Instead, to our knowledge, there is no simple way to draw values from the full conditional distribution of , which is given by
[TABLE]
where
[TABLE]
and is the normalizing constant.
When (for example in the symmetric case, where is a null vector), then the full conditional for has a Gamma distribution. Otherwise, the sign of determines the right tail behaviour: when is positive (negative), the right tail of the full conditional distribution is thicker (lighter) than the right tail of a Gamma distribution.
Hence, we cannot propose values from a Gamma distribution, as it could jeopardize the validity of the method when . On the other hand, proposing from a distribution with a thick tail could represent a huge loss in the efficiency of the sampler. For these reasons, we propose values using a rejection sampler (see, for example, Robert & Casella, 2004, §2.3) having the full conditional distribution as target density. We will
define the distribution of the instrumental variable of the rejection sampler; 2. 2.
choose the parameters of this distribution by minimizing the Kullback-Leibler divergence with respect to the target distribution; 3. 3.
obtain the constant required by the rejection sampling algorithm; 4. 4.
obtain the normalizing constant , required by the PMC algorithm.
Details are as follows:
define , with ; the instrumental density function is
[TABLE]
this density has a right tail which is thicker than the one of the target distribution; 2. 2.
If we set (see Appendix C), the divergence, as a function of , has a minimum (in ) in
[TABLE]
Using the parameters and we will optimise the efficiency of the rejection sampler. 3. 3.
the Rejection Sampling algorithm requires a constant for which
[TABLE]
The value of can be found by defining the ratio ; given the parameters of the instrumental density, this function has a maximum in
[TABLE]
The value of can be finally obtained as . 4. 4.
To obtain the value of
[TABLE]
we use eq. 3.462 (1) in Gradshteyn & Ryzhik (1994, GR hereafter), with , , ,
[TABLE]
where is the parabolic cylinder function (GR, eq. 9.240) with and , hence
[TABLE]
where denotes the confluent hypergeometric function (GR, eq. 9.210).
4 Simulation study
In this section we use simulated data to evaluate the performances of the proposed approach. Since the multivariate ST model may be considered an encompassing model including, as special cases, the multivariate Student- model, the multivariate SN model and the multivariate normal one, it is of primary importance to verify the ability of the proposed approach to discriminate among these nested models.
For each of the four models, we have generated 50 samples; for each sample, we compute the posterior probabilities of each candidate model. These posterior probabilities are estimated using (7) together with a uniform prior over the model space.
In our simulations, each sample consists of observations with and
[TABLE]
Samples from the SN and ST models have been generated using . Data generated from the Student- and ST models have .
For each sample we have run the PMC algorithm using 20000 particles for each of 6 iterations. Results are summarized in the following four plots.
The barplot in Fig. 1 depicts the results for the Normally distributed samples. Each column stacks the posterior probabilities of the 4 candidate models estimated for a single sample. To improve the readability of the plot, bars have been rearranged in order to have a decreasing probability for the true model. Here, the true model is correctly identified in 47 cases; in the remaining cases (3 our of 50), the Student- model is preferred. The posterior probabilities for the remaining models are always very low.
The situation is even more extreme when data come from a Student- distribution (Fig. 2): here the true model is always correctly identified, with small to negligible probabilities for the other models.
The worst performance of our approach happens when the data are generated from a SN distribution. In Fig. 3, it is possible to notice that the procedure detects the correct SN model in about 25% of the cases, and it more often prefers the multivariate normal model: this can be justified by the fact that the multivariate SN model is notoriously the most difficult to deal with, because of the multimodality phenomenon, described in Liseo & Parisi (2013).
Also in the ST case (Fig. 4), the true model has been correctly identified in 44 cases. In almost all the other cases, the SN model has been preferred.
4.1 The mvst package
The simulation results have been obtained in R, using the package mvst. It contains functions to estimate the parameters of the ST (and nested) models, and to simulate data from them. It uses the model and the proposals described above, even if it allows to define customized prior and proposal distributions.
It makes use of the GNU Scientific Library (see Gough, 2009) to speed up the heaviest parts of the code and, in particular, for the computation of (8). Besides, it requires three R packages: mvtnorm (Genz et al., 2015), MCMCpack (Martin et al., 2011) and mnormt (Azzalini & Genz, 2016). It also makes use of three scripts available in the RcppGSL package (Eddelbuettel & Romain, 2015).
5 A real dataset
As a final illustration of the proposed algorithm, we consider the wine data of the Grignolino cultivar, used in §6.2.6 of Azzalini & Capitanio (2014). The dataset contains 71 observations on 3 variables (chloride, glycerol and magnesium). Data are available in the sn package.
We have performed a PMC sampler with 6 iterations, 20000 particles each. The posterior probabilities for the four models are given in Table (1). Models with light tails have negligible probabilities, while the preferred model is skew-t.
Given this model, the posterior mean for is approximately equal to 3.22, while the ML estimate in Azzalini & Capitanio (2014) is equal to 3.4.
Appendix A Proof of Proposition 2.0.1
(a): From one of the possible definitions of a multivariate ST r.v., it is known that ; since is a simple transformation of , its distribution is readily obtained.
(b): Start from . By assumption, is a standard Gaussian density, and
[TABLE]
Then, by using simple results on conditional Gaussian densities, one gets
[TABLE]
Hence the result in (2.0.1).
Appendix B Proposal distributions
We use the full conditional distributions as proposals for the latent variables and : each has the following full conditional distribution
[TABLE]
where
[TABLE]
The variables can be drawn as the product of , a normal r.v. with parameters and truncated in 0 and the sign , uniform on . To generate values a rejection sampler has been employed (see Robert, 1995).
The parameter has the following full conditional density:
[TABLE]
The parameters and have untractable full conditional distributions. To obtain a proposal distribution, they are approximated using only the contribution of the likelihood to the full conditional density.
The parameter has the following full conditional distribution
[TABLE]
where denotes the indicator function. By ignoring the first two factors, we obtain the following proposal distribution
[TABLE]
The proposal distribution has a positive density on , while the full conditional is bounded on . This feature improves the ability of the sampler to explore the parameter space; moreover, particles which don’t respect the constraint (6) will be automatically discarded, as they have null prior (and posterior) probability density, hence a null importance weight.
The parameter has the following full conditional density
[TABLE]
Ignoring the prior term we obtain
[TABLE]
Appendix C Details about the Rejection Sampler
For a generic latent variable , the Kullback Leibler divergence is given by
[TABLE]
which has an analytical solution for :
[TABLE]
This divergence has always one (and only one) minimum in , given by
[TABLE]
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Azzalini & Capitanio (2014) Azzalini, A. (2014). The Skew-Normal and related families , (with the collaboration of A. Capitanio). Cambridge: Cambridge University Press.
- 2Azzalini (2015) Azzalini, A. (2015). The R package sn : The Skew-Normal and Skew- t 𝑡 t distributions (version 1.3-0) . Università di Padova, Italia.
- 3Azzalini & Capitanio (2003) Azzalini, A. & Capitanio, A. (2003). Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t distribution. Journal of the Royal Statistical Society, B , 65, 367–389.
- 4Azzalini & Genton (2008) Azzalini, A. & Genton, M. (2008). Robust likelihood methods based on the skew-t and related distributions. International Statistical Review , 76, 106–119
- 5Azzalini & Genz (2016) Azzalini, A. & Genz, A. (2016). The R package mnormt : The multivariate normal and t 𝑡 t distributions (version 1.5-4). http://azzalini.stat.unipd.it/SW/Pkg-mnormt
- 6Branco & Dey (2001) Branco, M. D. & Dey, D. (2001). A general class of multivariate skew-elliptical distributions. Journal of Multivariate Analysis , 77, 1–15.
- 7Branco et al. (2011) Branco, M.D. , Genton, M.G. & Liseo, B. (2011). Objective Bayesian Analysis of Skew-t Distributions. Scandinavian Journal of Statistics , 40 (1), 63–85
- 8Cappé et al. (2004) Cappé, O. , Guillin, A. , Marin, J. M. & Robert, C. P. (2004). Population Monte Carlo. J. Comput. Graph. Statist. 13 , 907–929.
