Some computational aspects of maximum likelihood estimation of the   skew-$t$ distribution

Adelchi Azzalini; Mahdi Salehi

arXiv:1907.10397·stat.CO·July 25, 2019

Some computational aspects of maximum likelihood estimation of the skew-$t$ distribution

Adelchi Azzalini, Mahdi Salehi

PDF

Open Access

TL;DR

This paper explores computational challenges in maximum likelihood estimation for the skew-$t$ distribution, focusing on issues like local maxima and developing efficient initialization methods for better parameter estimation.

Contribution

It introduces a quick, reliable initialization technique for MLE of the skew-$t$ distribution, addressing issues of local maxima and improving estimation stability.

Findings

01

Identification of multiple local maxima in the likelihood function.

02

Development of a new initialization method for MLE.

03

Enhanced stability and reliability in parameter estimation.

Abstract

Since its introduction, the skew- $t$ distribution has received much attention in the literature both for the study of theoretical properties and as a model for data fitting in empirical work. A major motivation for this interest is the high degree of flexibility of the distribution as the parameters span their admissible range, with ample variation of the associated measures of skewness and kurtosis. While this high flexibility allows to adapt a member of the parametric family to a wide range of data patterns, it also implies that parameter estimation is a more delicate operation with respect to less flexible parametric families, given that a small variation of the parameters can have a substantial effect on the selected distribution. In this context, the aim of the present contribution is to deal with some computational aspects of maximum likelihood estimation. A problem of interest is…

Figures8

Click any figure to enlarge with its caption.

Tables2

Table 1. Table 1: Coefficients used for interpolation of the G 𝐺 G and M 𝑀 M tabulated values.

$ν^{*}$	${M \|}_{δ = 0}$	$η_{1}^{(ν)}$	$η_{2}^{(ν)}$	$η_{3}^{(ν)}$
0.30	9.946	2.213831	$-$ 0.315418	$-$ 0.007641
0.32	8.588	2.022665	$-$ 0.240821	$-$ 0.012001
0.35	7.110	1.790767	$-$ 0.164193	$-$ 0.021492
0.40	5.525	1.506418	$-$ 0.090251	$-$ 0.047034
0.45	4.543	1.305070	$-$ 0.050702	$-$ 0.087117
0.50	3.888	1.156260	$-$ 0.028013	$-$ 0.143526
0.60	3.088	0.952435	$-$ 0.005513	$-$ 0.307509
0.70	2.630	0.819371	0.004209	$-$ 0.536039
0.80	2.339	0.724816	0.008992	$-$ 0.818739
0.90	2.142	0.653206	0.011596	$-$ 1.142667
1.00	2.000	0.596276	0.013136	$-$ 1.495125
1.50	1.652	0.417375	0.015798	$-$ 3.365100
2.00	1.517	0.314104	0.016371	$-$ 5.011929
3.00	1.403	0.192531	0.016274	$-$ 7.304089
4.00	1.354	0.123531	0.015682	$-$ 8.676470
5.00	1.327	0.080123	0.014987	$-$ 9.546498
7.00	1.298	0.030605	0.013674	$-$ 10.561206
10.00	1.277	$-$ 0.003627	0.012113	$-$ 11.335506
15.00	1.262	$-$ 0.024611	0.010334	$-$ 11.977601
20.00	1.254	$-$ 0.030903	0.009149	$-$ 12.343369
30.00	1.247	$-$ 0.031385	0.007650	$-$ 12.789281
40.00	1.244	$-$ 0.027677	0.006721	$-$ 13.074983
50.00	1.241	$-$ 0.023285	0.006079	$-$ 13.284029
100.00	1.237	$-$ 0.005288	0.004478	$-$ 13.874691
$\infty$	1.233

Table 2. Table 3: Frequency table of computing time and their differences in the case of simple random samples. The values of t 2 subscript 𝑡 2 t_{2} and t 3 subscript 𝑡 3 t_{3} include the time t 1 subscript 𝑡 1 t_{1} for their initialization.

	$(- \infty, - 0.25]$	$(- 0.25, - 0.1]$	$(- 0.1, - 0.05]$	$(- 0.05, 0]$	$(0, 0.05]$	$(0.05, 0.1]$	$(0.1, 0.25]$	$(0.25, \infty]$
$t_{0}$	0	0	0	0	39561	21049	10815	575
$t_{1}$	0	0	0	0	71998	0	2	0
$t_{2}$	0	0	0	0	49086	20047	2866	1
$t_{3}$	0	0	0	0	44178	21880	5942	0
$t_{2} - t_{0}$	163	2010	4629	51101	13790	294	13	0
$t_{2} - t_{3}$	0	30	1148	50694	19679	438	11	0
$t_{3} - t_{0}$	121	1487	2783	46038	21451	116	4	0

Equations58

t (z; λ, ν) = 2 t (z; ν) T (λ z \frac{ν + 1}{ν + z ^{2}}; ν + 1), z \in R,

t (z; λ, ν) = 2 t (z; ν) T (λ z \frac{ν + 1}{ν + z ^{2}}; ν + 1), z \in R,

t(z;\nu)=\frac{\Gamma\big{(}\mbox{$\textstyle\frac{1}{2}$}(\nu+1)\big{)}}{\sqrt{\pi\,\nu}\>\Gamma\big{(}\mbox{$\textstyle\frac{1}{2}$}\nu\big{)}}\>\left(1+\frac{z^{2}}{\nu}\right)^{-(\nu+1)/2},\qquad z\in\mathbb{R},

t(z;\nu)=\frac{\Gamma\big{(}\mbox{$\textstyle\frac{1}{2}$}(\nu+1)\big{)}}{\sqrt{\pi\,\nu}\>\Gamma\big{(}\mbox{$\textstyle\frac{1}{2}$}\nu\big{)}}\>\left(1+\frac{z^{2}}{\nu}\right)^{-(\nu+1)/2},\qquad z\in\mathbb{R},

t_{Y} (x; θ) = ω^{- 1} t (z; λ, ν), z = ω^{- 1} (z - ξ),

t_{Y} (x; θ) = ω^{- 1} t (z; λ, ν), z = ω^{- 1} (z - ξ),

μ

μ

σ^{2}

γ_{1}

γ_{2}

\delta=\delta(\lambda)=\frac{\lambda}{(1+\lambda^{2})^{1/2}}\in(-1,1),\hskip 20.00003ptb_{\nu}=\frac{\sqrt{\nu}\>\Gamma\left(\mbox{$\textstyle\frac{1}{2}$}(\nu-1)\right)}{\sqrt{\pi}\>\Gamma\left(\mbox{$\textstyle\frac{1}{2}$}\nu\right)}\quad\hbox{if\leavevmode\nobreak\ }\nu>1.

\delta=\delta(\lambda)=\frac{\lambda}{(1+\lambda^{2})^{1/2}}\in(-1,1),\hskip 20.00003ptb_{\nu}=\frac{\sqrt{\nu}\>\Gamma\left(\mbox{$\textstyle\frac{1}{2}$}(\nu-1)\right)}{\sqrt{\pi}\>\Gamma\left(\mbox{$\textstyle\frac{1}{2}$}\nu\right)}\quad\hbox{if\leavevmode\nobreak\ }\nu>1.

t_{d} (z; \overset{ˉ}{Ω}, ν) = \frac{Γ (( ν + d ) /2 )}{( ν π ) ^{d /2} Γ ( ν /2 ) det ( Ω ˉ ) ^{1/2}} (1 + \frac{Q ( z )}{ν})^{- \frac{ν + d}{2}}, z \in R^{d} .

t_{d} (z; \overset{ˉ}{Ω}, ν) = \frac{Γ (( ν + d ) /2 )}{( ν π ) ^{d /2} Γ ( ν /2 ) det ( Ω ˉ ) ^{1/2}} (1 + \frac{Q ( z )}{ν})^{- \frac{ν + d}{2}}, z \in R^{d} .

t_{d} (x) = 2 t_{d} (z; \overset{ˉ}{Ω}, ν) T (α^{⊤} z \frac{ν + d}{ν + Q ( z )}; ν + d), z \in R^{d}

t_{d} (x) = 2 t_{d} (z; \overset{ˉ}{Ω}, ν) T (α^{⊤} z \frac{ν + d}{ν + Q ( z )}; ν + d), z \in R^{d}

D (α, ν) = 2 {lo g L_{p} (\overset{α}{^}, \overset{ν}{^}) - lo g L_{p} (α, ν)}

D (α, ν) = 2 {lo g L_{p} (\overset{α}{^}, \overset{ν}{^}) - lo g L_{p} (α, ν)}

d_{q}=\mbox{$\textstyle\frac{1}{2}$}(q_{3}-q_{1})

d_{q}=\mbox{$\textstyle\frac{1}{2}$}(q_{3}-q_{1})

G = \frac{( q _{3} - q _{2} ) - ( q _{2} - q _{1} )}{q _{3} - q _{1}} = \frac{q _{3} - 2 q _{2} + q _{1}}{2 d _{q}};

G = \frac{( q _{3} - q _{2} ) - ( q _{2} - q _{1} )}{q _{3} - q _{1}} = \frac{q _{3} - 2 q _{2} + q _{1}}{2 d _{q}};

M = \frac{( e _{7} - e _{5} ) + ( e _{3} - e _{1} )}{e _{6} - e _{2}}

M = \frac{( e _{7} - e _{5} ) + ( e _{3} - e _{1} )}{e _{6} - e _{2}}

δ^{*}

δ^{*}

ν^{*}

lo g λ \approx η_{1}^{(ν)} u + η_{2}^{(ν)} u^{3} + η_{3}^{(ν)} u^{- 3}, u = lo g G .

lo g λ \approx η_{1}^{(ν)} u + η_{2}^{(ν)} u^{3} + η_{3}^{(ν)} u^{- 3}, u = lo g G .

\tilde{ω} = \frac{q _{3} - q _{1}}{q _{3}^{ST} - q _{1}^{ST}}

\tilde{ω} = \frac{q _{3} - q _{1}}{q _{3}^{ST} - q _{1}^{ST}}

\tilde{ξ} = q_{2} - \tilde{ω} q_{2}^{ST}

\tilde{ξ} = q_{2} - \tilde{ω} q_{2}^{ST}

ξ_{i} = x_{i}^{⊤} β, i = 1, \dots, n,

ξ_{i} = x_{i}^{⊤} β, i = 1, \dots, n,

\tilde{β}_{0} = \tilde{β}_{0}^{m} - \tilde{ω} q_{2}^{ST}

\tilde{β}_{0} = \tilde{β}_{0}^{m} - \tilde{ω} q_{2}^{ST}

ξ_{i}^{⊤} = x_{i}^{⊤} β, i = 1, \dots, n,

ξ_{i}^{⊤} = x_{i}^{⊤} β, i = 1, \dots, n,

\tilde{θ}_{j} = (\tilde{β}_{\cdot j}^{⊤}, \tilde{ω}_{j}, \tilde{λ}_{j}, \tilde{ν}_{j})^{⊤}, \tilde{z}_{ij} = \tilde{ω}_{j}^{- 1} (y_{ij} - x_{i}^{⊤} \tilde{β}_{\cdot j}) .

\tilde{θ}_{j} = (\tilde{β}_{\cdot j}^{⊤}, \tilde{ω}_{j}, \tilde{λ}_{j}, \tilde{ν}_{j})^{⊤}, \tilde{z}_{ij} = \tilde{ω}_{j}^{- 1} (y_{ij} - x_{i}^{⊤} \tilde{β}_{\cdot j}) .

F_{W} (m_{\tilde{w}}; ρ, \tilde{ν}) = 1/2

F_{W} (m_{\tilde{w}}; ρ, \tilde{ν}) = 1/2

\Omega^{*}=\pmatrix{\bar{\Omega}&\delta\cr\delta^{\top}&1}>0\,.

\Omega^{*}=\pmatrix{\bar{\Omega}&\delta\cr\delta^{\top}&1}>0\,.

\alpha=\big{(}1-\delta^{\top}\bar{\Omega}^{-1}\delta\big{)}^{-1/2}\bar{\Omega}^{-1}\delta

\alpha=\big{(}1-\delta^{\top}\bar{\Omega}^{-1}\delta\big{)}^{-1/2}\bar{\Omega}^{-1}\delta

\tilde{Ω} = \tilde{ω} \tilde{\overset{ˉ}{Ω}} \tilde{ω}

\tilde{Ω} = \tilde{ω} \tilde{\overset{ˉ}{Ω}} \tilde{ω}

\begin{array}[]{rcl}\lambda&:&0,\quad 2,\quad 8,\\ \nu&:&1,\quad 3,\quad 8,\\ n&:&50,\leavevmode\nobreak\ 100,\leavevmode\nobreak\ 250,\leavevmode\nobreak\ 500\end{array}

\begin{array}[]{rcl}\lambda&:&0,\quad 2,\quad 8,\\ \nu&:&1,\quad 3,\quad 8,\\ n&:&50,\leavevmode\nobreak\ 100,\leavevmode\nobreak\ 250,\leavevmode\nobreak\ 500\end{array}

(- \infty, - 20], \leavevmode \leavevmode (- 20, - 2], \leavevmode \leavevmode (- 2, - 0.2], \leavevmode \leavevmode (- 0.2, 0], \leavevmode \leavevmode (0, 0.2], \leavevmode \leavevmode (0.2, 2], \leavevmode \leavevmode (2, 20], \leavevmode \leavevmode (20, \infty]

(- \infty, - 20], \leavevmode \leavevmode (- 20, - 2], \leavevmode \leavevmode (- 2, - 0.2], \leavevmode \leavevmode (- 0.2, 0], \leavevmode \leavevmode (0, 0.2], \leavevmode \leavevmode (0.2, 2], \leavevmode \leavevmode (2, 20], \leavevmode \leavevmode (20, \infty]

\begin{array}[]{rccccc}&&T_{0}(x)\nobreak\leavevmode&T_{1}(x)&\leavevmode\nobreak\ T_{2}(x)&\leavevmode\nobreak\ T_{3}(x)\\ \hbox{case A\leavevmode\nobreak\ }(p=3)&:&1&x&\leavevmode\nobreak\ \sqrt{1+x}\\ \hbox{case B\leavevmode\nobreak\ }(p=3)&:&1&x&\sin 3x\\ \hbox{case C\leavevmode\nobreak\ }(p=4)&:&1&x&\sin 3x&x/(1+0.8\,x)\end{array}

\begin{array}[]{rccccc}&&T_{0}(x)\nobreak\leavevmode&T_{1}(x)&\leavevmode\nobreak\ T_{2}(x)&\leavevmode\nobreak\ T_{3}(x)\\ \hbox{case A\leavevmode\nobreak\ }(p=3)&:&1&x&\leavevmode\nobreak\ \sqrt{1+x}\\ \hbox{case B\leavevmode\nobreak\ }(p=3)&:&1&x&\sin 3x\\ \hbox{case C\leavevmode\nobreak\ }(p=4)&:&1&x&\sin 3x&x/(1+0.8\,x)\end{array}

\Omega=\pmatrix{1&1/2\cr 1/2&1},\qquad\alpha=\lambda\>\pmatrix{1\cr 2}

\Omega=\pmatrix{1&1/2\cr 1/2&1},\qquad\alpha=\lambda\>\pmatrix{1\cr 2}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsStatistical Distribution Estimation and Applications · Statistical Methods and Bayesian Inference · Advanced Statistical Methods and Models

Full text

Some computational aspects of maximum likelihood estimation

of the skew- $t$ distribution

Adelchi Azzalini

Dipartimento di Scienze Statistiche

Università di Padova, Italia

Mahdi Salehi

Department of Mathematics and Statistics

University of Neyshabur, Iran

(16th June 2019)

Abstract

Since its introduction, the skew- $t$ distribution has received much attention in the literature both for the study of theoretical properties and as a model for data fitting in empirical work. A major motivation for this interest is the high degree of flexibility of the distribution as the parameters span their admissible range, with ample variation of the associated measures of skewness and kurtosis. While this high flexibility allows to adapt a member of the parametric family to a wide range of data patterns, it also implies that parameter estimation is a more delicate operation with respect to less flexible parametric families, given that a small variation of the parameters can have a substantial effect on the selected distribution. In this context, the aim of the present contribution is to deal with some computational aspects of maximum likelihood estimation. A problem of interest is the possible presence of multiple local maxima of the log-likelihood function. Another one, to which most of our attention is dedicated, is the development of a quick and reliable initialization method for the subsequent numerical maximization of the log-likelihood function, both in the univariate and the multivariate context.

1 Background and aims

1.1 Flexible distributions: the skew- $t$ case

In the context of distribution theory, a central theme is the study of flexible parametric families of probability distributions, that is, families allowing substantial variation of their behaviour when the parameters span their admissible range. For brevity, we shall refer to this domain with the phrase ‘flexible distributions’. The archetypal construction of this logic is represented by the Pearson system of curves for univariate continuous variables. In this formulation, the density function is regulated by four parameters, allowing wide variation of the measures of skewness and kurtosis, hence providing much more flexibility than in the basic case represented by the normal distribution, where only location and scale can be adjusted.

Since Pearson times, flexible distributions have remained a persistent theme of interest in the literature, with a particularly intense activity in recent years. A prominent feature of newer developments is the increased consideration for multivariate distributions, reflecting the current availability in applied work of larger datasets, both in sample size and in dimensionality. In the multivariate setting, the various formulations often feature four blocks of parameters to regulate location, scale, skewness and kurtosis.

While providing powerful tools for data fitting, flexible distributions also pose some challenges when we enter the concrete estimation stage. We shall be working with maximum likelihood estimation (MLE) or variants of it, but qualitatively similar issues exist for other criteria. Explicit expressions of the estimates are out of the question; some numerical optimization procedure is always involved and this process is not so trivial because of the larger number of parameters involved, as compared with fitting simpler parametric models, such as a Gamma or a Beta distribution. Furthermore, in some circumstances, the very flexibility of these parametric families can lead to difficulties: if the data pattern does not aim steadily towards a certain point of the parameter space, there could be two or more such points which constitute comparably valid candidates in terms of log-likelihood or some other estimation criterion. Clearly, these problems are more challenging with small sample size, later denoted $n$ , since the log-likelihood function (possibly tuned by a prior distribution) is relatively more flat, but numerical experience has shown that they can persist even for fairly large $n$ , in certain cases.

The focus of interest in this paper will be the skew- $t$ (ST) distribution introduced by Branco & Dey, (2001) and studied in detail by Azzalini & Capitanio, (2003); see also Gupta, (2003). The main formal constituents and properties of the ST family will be summarized in the next subsection. Here, we recall instead some of the many publications that have provided evidence of the practical usefulness of the ST family, in its univariate and multivariate version, thanks to its capability to adapt to a variety of data patterns. The numerical exploration by Azzalini & Genton, (2008), using data of various origins and nature, is an early study in this direction, emphasizing the potential of the distribution as a tool for robust inference. The robustness aspects of ST-based inference has also been discussed by (Azzalini & Capitanio,, 2014, § 4.3.5) and more extensively by Azzalini, (2016). On the more applied domain, numerous publications motivated by application problems have further highlighted the ST usefulness, typically with data distributions featuring substantial tailweight and asymmetry. For space reasons, the following list reports only a few of the many publications of this sort, with a preference for early work: Walls, (2005) and Pitt, (2010) use the ST distribution for modelling log-transformed returns of film and music industry products as a function of explanatory variables; Meucci, (2006) and Adcock, (2010) develop methods for optimal portfolio allocation in a financial context, where long tails and asymmetry of returns distribution are standard features; Ghizzoni et al., (2010) use the multivariate ST distributions to model riverflow, jointly at multiple sites; Pyne et al., (2009) present an early model-based clustering formulation using the multivariate ST distributions as the basic component for flow cytometric data analysis.

Given its value in data analysis, but also the above-mentioned possible critical aspects of the log-likelihood function, it seems appropriate to explore the corresponding issues for MLE computation and to develop a methodology which provides good starting points for the numerical maximization of the log-likelihood. After a brief summary of the main facts about the ST distribution in the next subsection, the rest of the paper is dedicated to these issues. Specifically, one section examines qualitatively and numerically various aspects of the ST log-likelihood, while the rest of the paper develops a technique to initialize the numerical search for MLE.

To avoid potential misunderstanding, we underline that the above-indicated program of work does not intend to imply a general inadequacy of the currently available computational resources, which will be recalled in due course. There are, however, critical cases where these resources run into problems, most typically when the data distribution exhibits very long tails. For these challenging situations, an improved methodology is called for.

1.2 The skew- $t$ distribution: basic facts

Before entering our actual development, we recall some basic facts about the ST parametric family of continuous distributions. In its simplest description, it is obtained as a perturbation of the classical Student’s $t$ distribution. For a more specific description, start from the univariate setting, where the components of the family are identified by four parameters. Of these four parameters, the one denoted $\xi$ in the following regulates the location of the distribution; scale is regulated by the positive parameter $\omega$ ; shape (as departure from symmetry) is regulated by $\lambda$ ; tail-weight is regulated by $\nu$ (with $\nu>0$ ), denoted ‘degrees of freedom’ like for a classical $t$ distribution.

It is convenient to introduce the distribution in the ‘standard case’, that is, with location $\xi=0$ and scale $\omega=1$ . In this case, the density function is

[TABLE]

where

[TABLE]

is the density function of the classical Student’s $t$ on $\nu$ degrees of freedom and $T(\cdot;\nu)$ denotes its distribution function; note however that in (1) this is evaluated with $\nu+1$ degrees of freedom. Also, note that the symbol $t$ is used for both densities in (1) and (2), which are distinguished by the presence of either one or two parameters.

If $Z$ is a random variable with density function (1), the location and scale transform $Y=\xi+\omega\,Z$ has density

[TABLE]

where $\theta=(\xi,\omega,\lambda,\nu)$ . In this case, we write $Y\sim\mathrm{ST}(\xi,\omega^{2},\lambda,\nu)$ , where $\omega$ is squared for similarity with the usual notation for normal distributions.

When $\lambda=0$ , we recover the scale-and-location family generated by the $t$ distribution (2). When $\nu\to\infty$ , we obtain the skew-normal (SN) distribution with parameters $(\xi,\omega,\lambda)$ , which is described for instance by (Azzalini & Capitanio,, 2014, Chap. 2). When $\lambda=0$ and $\nu\to\infty$ , (3) converges to the $\mathrm{N}(\xi,\omega^{2})$ distribution.

Some instances of density (1) are displayed in the left pane of Figure 1. If $\lambda$ was replaced by $-\lambda$ , the densities would be reflected on the opposite side of the vertical axis, since $-Y\sim\mathrm{ST}(-\xi,\omega^{2},-\lambda,\nu)$ .

Similarly to the classical $t$ distribution, moments exist when their order is smaller than $\nu$ . Under this condition, expressions for the mean value, the variance and the coefficients of skewness and excess kurtosis, namely the standardized third and fourth cumulants, are as follows:

[TABLE]

where

[TABLE]

It is visible that, as $\lambda$ spans the real line, so does the coefficient of skewness $\gamma_{1}$ when $\nu\to 3^{+}$ . For $\nu\leq 3$ , $\gamma_{1}$ does not exists; however, at least one of the tails is increasingly heavier as $\nu\to 0$ , given the connection with the Student’s $t$ . A fairly similar pattern holds for the coefficients of kurtosis $\gamma_{2}$ , with the threshold at $\nu=4$ for its existence. If $\nu\to 4^{+}$ , the range of $\gamma_{2}$ is $[0,\infty$ ). The feasible $(\gamma_{1},\gamma_{2})$ space when $\nu>4$ is displayed in Figure 4.5 of Azzalini & Capitanio, (2014). Negative $\gamma_{2}$ values are not achievable, but this does not not seem to be a major drawback in most applications.

The multivariate ST density is represented by a perturbation of the classical multivariate $t$ density in $d$ dimensions, namely

[TABLE]

where $\bar{\Omega}$ is a symmetric positive-definite matrix with unit diagonal elements and $Q(z)=z^{\top}\bar{\Omega}^{-1}z$ . The multivariate version of (1) is then given by

[TABLE]

where $\alpha$ is a $d$ -dimensional vector regulating asymmetry. An instance of density (6) with $d=2$ is displayed in the right pane of Figure 1 via contour level curves.

Similarly to the univariate setting, we consider the location and scale transformation of a variable $Z$ with density (6) to $Y=\xi+\omega\,Z$ where now $\xi\in\mathbb{R}^{d}$ and $\omega$ is diagonal matrix with positive diagonal elements. For the resulting variable, we use the notation $Y\sim\mathrm{ST}_{d}(\xi,\Omega,\alpha,\nu)$ where $\Omega=\omega\bar{\Omega}\omega$ .

One property required for our later development is that each marginal component of $Y$ is a univariate ST variable, having a density of type (1) whose parameters are extracted from the corresponding components of $Y$ , with the exception of $\lambda$ for which the marginalization step is slightly more elaborate.

There are many additional properties of the ST distribution which, for space reasons, we do not report here and refer the reader to the quoted literature. A self-contained account is provided by the monograph Azzalini & Capitanio, (2014); see specifically Chapter 4 for the univariate case and Chapter 6 for the multivariate case.

2 On the likelihood function of ST models

2.1 Basic general aspects

The high flexibility of the ST distribution makes it particularly appealing in a wide range of data fitting problems, more than its companion, the SN distribution. Reliable techniques for implementing connected MLE or other estimation methods are therefore crucial.

From the inference viewpoint, another advantage of the ST over the related SN distribution is the lack of a stationary point at $\lambda=0$ (or $\alpha=0$ in the multivariate case), and the implied singularity of the information matrix. This stationary point of the SN is systematic: it occurs for all samples, no matter what $n$ is. This peculiar aspect has been emphasized more than necessary in the literature, considering that it pertains to a single although important value of the parameter. Anyway, no such problem exists under the ST assumption. The lack of a stationary point at the origin was first observed empirically and welcomed as ‘a pleasant surprise’ by Azzalini & Capitanio, (2003), but no theoretical explanation was given. Additional numerical evidence in this direction has been provided by Azzalini & Genton, (2008). The theoretical explanation of why the SN and the ST likelihood functions behave differently was finally established by Hallin & Ley, (2012).

Another peculiar aspect of the SN likelihood function is the possibility that the maximum of the likelihood function occurs at $\lambda=\pm\infty$ , or at $\|\alpha\|\to\infty$ in the multivariate case. Note that this happen without divergence of the likelihood function, but only with divergence of the parameter achieving the maximum. In this respect the SN and the ST model are similar: both of them can lead to this pattern.

Differently from the stationarity point at the origin, the phenomenon of divergent estimates is transient: it occurs mostly with small $n$ , and the probability of its occurrence decreases very rapidly when $n$ increases. However, when it occurs for the $n$ available data, we must handle it. There are different views among statisticians on whether such divergent values must be retained as valid estimates or they must be rejected as unacceptable. We embrace the latter view, for the reasons put forward by Azzalini & Arellano-Valle, (2013), and adopt the maximum penalized likelihood estimate (MPLE) proposed there to prevent the problem. While the motivation for MPLE is primarily for small to moderate $n$ , we use it throughout for consistency.

There is an additional peculiar feature of the ST log-likelihood function, which however we mention only for completeness, rather than for its real relevance. In cases when $\nu$ is allowed to span the whole positive half-line, poles of the likelihood function must exist near $\nu=0$ , similarly to the case of a Student’s $t$ with unspecified degrees of freedom. This problem has been explored numerically by Azzalini & Capitanio, (2003), and the indication was that these poles must exist at very small values of $\nu$ , such as $\hat{\nu}=0.06$ in one specific instance.

This phenomenon is qualitatively similar to the problem of poles of the likelihood function for a finite mixture of continuous distributions. Even in the simple case of univariate normal components, there always exist $n$ poles on the boundary of the parameter space if the standard deviations of the components are unrestricted; see for instance (Day,, 1969, Section 7). The problem is conceptually interesting, in both settings, but in practice it is easily dealt with in various ways. In the ST setting, the simplest solution is to impose a constraint $\nu>\nu_{0}>0$ where $\nu_{0}$ is some very small value, such as $\nu_{0}=0.1$ or $0.2$ . Even if fitted to data, a $t$ or ST density with $\nu<0.1$ would be an object hard to use in practice.

2.2 Numerical aspects and some illustrations

Since, on the computational side, we shall base our work the R package sn, described by Azzalini, (2019), it is appropriate to describe some key aspects of this package.. There exists a comprehensive function for model fitting, called selm, but the actual numerical work in case of an ST model is performed by functions st.mple and mst.mple, in the univariate and the multivariate case, respectively. To numerical efficiency, we shall be using these functions directly, rather than via selm. As their names suggest, st.mple and mst.mple perform MPLE, but they can be used for classical MLE as well, just by omitting the penalty function. The rest of the description refers to st.mple, but mst.mple follows a similar scheme.

In the univariate case, denote by $\theta=(\xi,\omega,\alpha,\nu)^{\top}$ the parameters to be estimated, or possibly $\theta=(\beta^{\top},\omega,\alpha,\nu)^{\top}$ when a linear regression model is introduced for the location parameter, in which case $\beta$ is a vector of $p$ regression coefficients. Denote by $\log L(\theta)$ the log-likelihood function at point $\theta$ . If no starting values are supplied, the first operation of st.mple is to fit a linear model to the available explanatory variables; this reduces to the constant covariate value 1 if $p=1$ . For the residuals from this linear fit, sample cumulants of order up to four are computed, hence including the sample variance. An inversion from these values to $\theta$ may or may not be possible, depending on whether the third and fourth sample cumulants fall in the feasible region for the ST family. If the inversion is successful, initial values of the parameters are so obtained; if not, the final two components of $\theta$ are set at $(\alpha,\nu)=(0,10)$ , retaining the other components from the linear fit. Starting from this point, MLE or MPLE is searched for using a general numerical optimization procedure. The default procedure for performing this step is the R function nlminb, supplied with the score functions besides the log-likelihood function. We shall refer, comprehensively, to this currently standard procedure as ‘method M0’.

In all our numerical work, method M0 uses st.mple, and the involved function nlminb, with all tuning parameters kept at their default values. The only activated option is the one switching between MPLE and MLE, and even this only for the work of present section. Later on, we shall always use MPLE, with penalty function Qpenalty which implements the method proposed in Azzalini & Arellano-Valle, (2013).

We start our numerical work with some illustrations, essentially in graphical form, of the log-likelihood generated by some simulated datasets. The aim is to provide a direct perception, although inevitably limited, of the possible behaviour of the log-likelihood and the ensuing problems which it poses for MLE search and other inferential procedures. Given this aim, we focus on cases which are unusual, in some way or another, rather than on ‘plain cases’.

The type of graphical display which we adopt is based on the profile log-likelihood function of $(\alpha,\nu)$ , denoted $\log L_{p}(\alpha,\nu)$ . This is obtained, for any given $(\alpha,\nu)$ , by maximizing $\log L(\theta)$ with respect to the remaining parameters. To simplify readability, we transform $\log L_{p}(\alpha,\nu)$ to the likelihood ratio test statistic, also called ‘deviance function’:

[TABLE]

where $\log L_{p}(\hat{\alpha},\hat{\nu})$ is the overall maximum value of the log-likelihood, equivalent to $\log L(\hat{\theta})$ . The concept of deviance applies equally to the penalized log-likelihood.

The plots in Figure 2 displays, in the form of contour level plots, the behaviour of $D(\alpha,\nu)$ for two artificially generated samples, with $\nu$ expressed on the logarithmic scale for more convenient readability. Specifically, the top plots refer to a sample of size $n=50$ drawn from the $\mathrm{ST}(0,1,1,2)$ ; the left plot, refers to the regular log-likelihood, while the right plot refers to the penalized log-likelihood.

The plots include marks for points of special interest, as follows:

$\triangle$

the true parameter point;

$\circ$

the point having maximal (penalized) log-likelihood on a $51\times 51$ grid of points spanning the plotted area;

$+$

the MLE or MPLE point selected by method M0;

$\ast$

the preliminary estimate to be introduced in Section 3.2, later denoted M1;

$\times$

the MLE or MPLE point selected by method M2 presented later in the text.

It will be noticed that the top-left plot does not show a $+$ mark. This is because the MLE point delivered by M0 has $\hat{\alpha}=\hat{\nu}\to\infty$ (actually some huge values representing numerical ‘approximations of infinity’), where $\log L_{p}\approx-81.85$ ; consequently the maximum of $\log L_{p}$ over the plotted area takes place at its margin. Note that the log-likelihood function has a local maximum at about $(1.45,5.6)$ , where $\log L_{p}\approx-84.09$ ; this local maximum is quite close to the true parameter point, especially so in the light of the limited sample size. There are two messages from this example: one is that the log-likelihood may have more than one maximum; the other is that a local maximum can provide a better choice than the global maximum, at least in this case.

Given that $\hat{\alpha}=\infty$ , consider MPLE estimation in the top-right plot. The maximum of $\log L_{p}$ , marked by $\circ$ , is now close to the point $(1.45,5.6)$ , but method M0 fails to find it, and it picks up the point $(15.5,6.4)$ . This must be due to a poor choice of the initial point for numerical search, given that method M2, which differs only for this initial point, lands on the correct point.

Peculiar behaviours, either of the log-likelihood or of the estimation procedures or both of them, are certainly more frequent when $n$ is small or moderate, but problems can persist even for fairly large $n$ , as illustrated by the bottom two plots of Figure 2 which refer to a sample of size $n=300$ from $\mathrm{ST}(0,1,2,3)$ . In this case, M0 yields $\hat{\alpha}\approx 3$ and $\hat{\nu}\to\infty$ , denoted by the vertical ticks below the top side of the plotted area; the associated $\log L_{p}$ value is $-422.5$ for the left plot, $-424.4$ for the right plot. Both these values are lower than the corresponding maximal $\log L_{p}$ values, $-419.8$ and $-420.8$ . Again, better initial search points used by method M2 leads to the correct global maxima, at about $(2.4,5.6)$ and $(2.3,5.0)$ , respectively.

3 On the choice of initial parameters for MLE search

The aim of this section, which represents the main body of the paper, is to develop a methodology for improving the selection of initial parameter values from where to start the MPLE search via some numerical optimization technique, which should hopefully achieve a higher maximum.

3.1 Preliminary remarks and the basic scheme

We have seen in Section 2 the ST log-likelihood function can be problematic; it is then advisable to select carefully the starting point for the MLE search. While contrasting the risk of landing on a local maximum, a connected aspect of interest is to reduce the overall computing time. Here are some preliminary considerations about the stated target.

Since these initial estimates will be refined by a subsequent step of log-likelihood maximization, there is no point in aiming at a very sophisticate method. In addition, we want to keep the involved computing header as light as possible. Therefore, we want a method which is simple and quick to compute; at the same time, it should be reasonably reliable, hopefully avoiding nonsensical outcomes.

Another consideration is that we cannot work with the methods of moments, or some variant of it, as this would impose a condition $\nu>4$ , bearing in mind the constraints recalled in Section 1.2. Since some of the most interesting applications of ST-based models deal with very heavy tails, hence with low degrees of freedom, the condition $\nu>4$ would be unacceptable in many important applications. The implication is that we have to work with quantiles and derived quantities.

To ease exposition, we begin by presenting the logic in the basic case of independent observations from a common univariate distribution $\mathrm{ST}(\xi,\omega^{2},\lambda,\nu)$ . The first step is to select suitable quantile-based measures of location, scale, asymmetry and tail-weight. The following list presents a set of reasonably choices; these measures can be equally referred to a probability distribution or to a sample, depending on the interpretation of the terms quantile, quartile and alike.

Location

The median is the obvious choice here; denote it by $q_{2}$ , since it coincides with the second quartile.

Scale

A commonly used measure of scale is the semi-interquartile difference, also called quartile deviation, that is

[TABLE]

where $q_{j}$ denotes the $j$ th quartile; see for instance (Kotz et al.,, 2006, vol. 10, p. 6743).

Asymmetry

A classical non-parametric measure of asymmetry is the so-called Bowley’s measure

[TABLE]

see (Kotz et al.,, 2006, vol. 12, p. 7771–3). Since the same quantity, up to an inessential difference, had previously been used by Galton, some authors attribute to him its introduction. We shall refer to $G$ as the Galton-Bowley measure.

Kurtosis

A relatively more recent proposal is the Moors measure of kurtosis, presented in Moors, (1988),

[TABLE]

where $e_{j}$ denotes the $j$ th octile, for $j=1,\dots,7$ . Clearly, $e_{2j}=q_{j}$ for $j=1,2,3$ .

A key property is that $d_{q}$ is independent of the location of the distribution, and $G$ and $M$ are independent of location and scale.

For any distribution $\mathrm{ST}(\xi,\omega^{2},\lambda,\nu)$ , the values of $Q=(q_{2},d_{q},G,M)$ are functions of the parameters $\theta=(\xi,\omega,\lambda,\nu)$ . Given a set of observations $y=(y_{1},\dots,y_{n})$ drawn from $\mathrm{ST}(\xi,\omega^{2},\lambda,\nu)$ under mutual independence condition, we compute sample values of $\tilde{Q}=(\tilde{q}_{2},\tilde{d}_{q},\tilde{G},\tilde{M})$ of $Q$ from the sample quantiles and then inversion of the functions connecting $\theta$ and $Q$ will yield estimates $\tilde{\theta}$ of the ST parameters. In essence, the logic is similar to the one underlying the method of moments, but with moments replaced by quantiles.

In the following subsection, we discuss how to numerically carry out the inversion from $Q$ to $\theta$ . Next, we extend the procedure to settings which include explanatory variables and multivariate observations.

3.2 Inversion of quantile-based measures to ST parameters

For the inversion of the parameter set $Q=(q_{2},d_{q},G,M)$ to $\theta=(\xi,\omega,\lambda,\nu)$ , the first stage considers only the components $(G,M)$ which are to be mapped to $(\lambda,\nu)$ , exploiting the invariance of $G$ and $M$ with respect to location and scale. Hence, at this stage, we can work assuming that $\xi=0$ and $\omega=1$ .

Start by computing, for any given pair $(\lambda,\nu)$ , the set of octiles $e_{1},\dots,e_{7}$ of $\mathrm{ST}(0,1,\lambda,\nu)$ , and from here the corresponding $(G,M)$ values. Operationally, we have computed the ST quantiles using routine qst of package sn. Only non-negative values of $\lambda$ need to be considered, because a reversal of the $\lambda$ sign simply reverses the sign of $G$ , while $M$ is unaffected, thanks to the mirroring property of the ST quantiles when $\lambda$ is changed to $-\lambda$ .

Initially, our numerical exploration of the inversion process examined the contour level plots of $G$ and $M$ as functions of $\lambda$ and $\nu$ , as this appeared to be the more natural approach. Unfortunately, these plots turned out not to be useful, because of the lack of a sufficiently regular pattern of the contour curves. Therefore these plots are not even displayed here.

A more useful display is the one adopted in Figure 3, where the coordinate axes are now $G$ and $M$ . The shaded area, which is the same in both panels, represents the set of feasible $(G,M)$ points for the ST family. In the first plot, each of the black lines indicates the locus of points with constant values of $\delta$ , defined by (4), when $\nu$ spans the positive half-line; the selected $\delta$ values are printed at the top of the shaded area, when feasible without clutter of the labels. The use of $\delta$ instead of $\lambda$ simply yields a better spread of the contour lines with different parameter values, but it is conceptually irrelevant. The second plot of Figure 3 displays the same admissible region with superimposed a different type of loci, namely those corresponding to specified values of $\nu$ , when $\delta$ spans the $[0,1]$ interval; the selected $\nu$ values are printed on the left side of the shaded area.

Details of the numerical calculations are as follows. The Galton-Bowley and the Moors measures have been evaluated over a $13\times 25$ grid of points identified by the selected values

[TABLE]

The $G$ and $M$ values so obtained form the basis of Figure 3 and subsequent calculations. Note that, while the vectors $\delta^{*}$ and $\nu^{*}$ identify a regular grid of points on the $(\delta,\nu)$ space, they corresponds to a curved grid on the shaded regions of Figure 3.

Boundary parameter values require a special handling. Specifically, $\delta=1$ and correspondingly $\lambda=\infty$ identify the Student’s $t$ distribution truncated below 0, that is, the square root transform of the Snedecor’s $F(1,\nu)$ distribution; hence, in this case the quantiles are computed as square roots of the $F(1,\nu)$ quantiles. The associated points lie on the right-side boundary of the shaded area. Another special value is $\nu=\infty$ which corresponds to the SN distribution. In this case, function qsn of package sn has been used; the corresponding points lie on the concave curve in the bottom-left corner of the shaded area.

Conceptually, Figure 3 represents the key for the inversion from $(G,M)$ to $(\delta,\nu)$ and equivalently to $(\lambda,\nu)$ since $\lambda=(1-\delta^{2})^{-1/2}\delta$ . However, in practical terms, we must devise a mechanism for inversely interpolating the $(G,M)$ values computed at the grid points. Evaluation of this interpolation scheme at the sample values $(\tilde{G},\tilde{M})$ will yield the desired estimates.

To this end, the second plot indicates the most favourable front for tackling the problem, since its almost horizontal lines show that the Moors measure is very nearly a function of $\nu$ only. Denote by $M^{\circ}$ the value of $M$ when $\delta=0$ . The 25 available values of $M^{\circ}$ at $\nu^{*}$ are reported in the second column of Table 1; the remaining columns will be explained shortly. From these 25 values, an interpolating spline of $1/\nu$ as a function of $M^{\circ}$ has been introduced. Use of the $1/\nu$ transformed variable substantially reduces the otherwise extreme curvature of the function. Evaluation of this spline function at the sample value $\tilde{M}$ and conversion into its reciprocal yields an initial estimate $\tilde{\nu}$ .

Consider now estimation of $\delta$ or equivalently of $\lambda$ , a task which essentially amounts to approximate the curves in the first plot in Figure 3. After some numerical exploration, it turned out that a closely interpolating function can be established in the following form:

[TABLE]

where the fitted values of the coefficients $\eta_{j}^{(\nu)}$ for the selected $\nu^{*}$ values are reported in the last three columns of Table 1, with the exception of $\nu=\infty$ . Use of (8) combined with the coefficients of Table 1 allows to find an approximate value of $\lambda$ for the selected values of $\nu$ . If an intermediate value of $\nu$ must be considered, such that $\nu_{1}<\nu<\nu_{2}$ where $\nu_{1},\nu_{2}$ are two adjacent values of $\nu^{*}$ , a linear interpolation of the corresponding coefficients is performed. More explicitly, a value of $\eta_{j}^{(\nu)}$ is obtained by linear interpolation of $\eta_{j}^{(\nu_{1})}$ and $\eta_{j}^{(\nu_{2})}$ , for $j=1,2,3$ ; then (8) is applied using these interpolated coefficients. If $\nu$ is outside the range of finite $\nu$ values in the first column of Table 1, the $\eta_{j}^{(\nu)}$ values associated to the closest such value of $\nu$ are used.

Operationally, we use the just-described scheme with $\nu$ set at the value $\tilde{\nu}$ obtained earlier, leading to an estimate $\tilde{\lambda}$ of $\lambda$ .

Numerical testing of this procedure has been performed as follows. For a number of pairs of values $(\alpha,\nu)$ , the corresponding octiles and the $(G,M)$ measures have been computed and the proposed procedure has been applied to these measures. The returned parameter values were satisfactorily close to the original $(\alpha,\nu)$ pair, with some inevitable discrepancies due to the approximations involved, but of limited entity. If necessary, a refinement can be obtained by numerical search targeted to minimize a suitable distance between a given pair $(G,M)$ and the analogous values derived from the associated $(\alpha,\nu)$ pair. However, this refinement was not felt necessary in the numerical work described later, thanks to the good working of the above-described interpolation scheme.

We are now left with estimation of $\xi$ and $\omega$ . Bearing in mind the representation $Y=\xi+\omega\,Z$ introduced just before (3), $\omega$ is naturally estimated by

[TABLE]

where the terms in the numerator are sample quartiles and those in the denominator are quartiles of $Z\sim\mathrm{ST}(0,1,\tilde{\lambda},\tilde{\nu})$ .

Consideration of $Y=\xi+\omega\,Z$ again says that an estimate of $\xi$ can be obtained as an adjustment of the sample median $q_{2}$ via

[TABLE]

where $q_{2}^{\mathrm{ST}}$ is the median of $Z\sim\mathrm{ST}(0,1,\tilde{\lambda},\tilde{\nu})$ .

The estimates so produced are marked by an asterisk in the examples of Figure 2, showing to perform well in those cases, while requiring a negligible computing time compared to MLE.

3.3 Extension to the regression case

We want to extend the methodology of Section 3.2 to the regression setting where the location parameter varies across observations as a linear function of a set of $p$ , say, explanatory variables, which are assumed to include the constant term, as it is commonly the case. If $x_{i}$ is the vector of covariates pertaining to the $i$ th subject, observation $y_{i}$ is now assumed to be drawn from $\mathrm{ST}(\xi_{i},\omega,\lambda,\nu)$ where

[TABLE]

for some $p$ -dimensional vector $\beta$ of unknown parameters; hence now the parameter vector is $\theta=(\beta^{\top},\omega,\lambda,\nu)^{\top}$ . The assumption of independently drawn observations is retained.

The direct extension of the median as an estimate of location, which was used in Section 3.2, is an estimate of $\beta$ obtained by median regression, which corresponds to adoption of the least absolute deviations fitting criterion instead of the more familiar least squares. This can also be viewed as a special case of quantile regression, when the quantile level is set at $1/2$ . A classical treatment of quantile regression is Koenker, (2005) and corresponding numerical work can be carried out using the R package quantreg, see Koenker, (2018), among others tools.

Use of median regression delivers an estimate $\tilde{\beta}^{m}$ of $\beta$ and a vector of residual values, $r_{i}=y_{i}-x_{i}^{\top}\tilde{\beta}^{m}$ for $i=1,\dots,n$ . Ignoring $\beta$ estimation errors, these residuals are values sampled from $\mathrm{ST}(-m_{0},\omega^{2},\lambda,\nu)$ , where $m_{0}$ is a suitable value, examined shortly, which makes the distribution to have 0 median, since this is the target of the median regression criterion. We can then use the same procedure of Section 3.2, with the $y_{i}$ ’s replaced the $r_{i}$ ’s, to estimate $\omega,\lambda,\nu$ , given that the value of $m_{0}$ is irrelevant at this stage.

The final step is a correction to the vector $\tilde{\beta}^{m}$ to adjust for the fact that $y_{i}-x_{i}^{\top}\beta$ should have median $m_{0}$ , that is, the median of $\mathrm{ST}(0,\omega,\lambda,\nu)$ , not median 0. This amounts to increase all residuals by a constant value $m_{0}$ , and this step is accomplished by setting a vector $\tilde{\beta}$ with all components equal to $\tilde{\beta}^{m}$ except that the intercept term, $\beta_{0}$ say, is estimated by

[TABLE]

similarly to (10).

3.4 Extension to the multivariate case

Consider now the case of $n$ independent observations from a multivariate $Y$ variable with density (6), hence $Y\sim\mathrm{ST}_{d}(\xi,\Omega,\alpha,\nu)$ . This case can be combined with the regression setting of Section 3.3, so that the $d$ -dimensional location parameter varies for each observation according to

[TABLE]

where now $\beta=(\beta_{\cdot 1},\dots,\beta_{\cdot d})$ is a $p\times d$ matrix of parameters. Since we have assumed that the explanatory variables include a constant term, the regression case subsumes the one of identical distribution, when $p=1$ . Hence we deal with the regression case directly, where the $i$ th observation is sampled from $Y_{i}\sim\mathrm{ST}_{d}(\xi_{i},\Omega,\alpha,\nu)$ and $\xi_{i}$ is given by (12), for $i=1,\dots,n$ .

Arrange the observed values in a $n\times d$ matrix $y=(y_{ij})$ . Application of the procedure presented in Sections 3.2 and 3.3 separately to each column of $y$ delivers estimates of $d$ univariate models. Specifically, from the $j$ th column of $y$ , we obtain estimates $\tilde{\theta}_{j}$ and corresponding ‘normalized’ residuals $\tilde{z}_{ij}$ :

[TABLE]

where it must be recalled that the ‘normalization’ operation uses location and scale parameters, but these do not coincide with the mean and the standard deviation of the underlying random variable.

Since the meaning of expression (12) is to define a set of univariate regression modes with a common design matrix, the vectors $\tilde{\beta}_{\cdot 1},\dots,\tilde{\beta}_{\cdot d}$ can simply be arranged in a $p\times d$ matrix $\tilde{\beta}$ which represents an estimate of $\beta$ .

The set of univariate estimates in (13) provide $d$ estimates for $\nu$ , while only one such a value enters the specification of the multivariate ST distribution. We have adopted the median of $\tilde{\nu}_{1},\dots,\tilde{\nu}_{d}$ as the single required estimate, denoted $\tilde{\nu}$ .

The scale quantities $\tilde{\omega}_{1},\dots,\tilde{\omega}_{d}$ estimate the square roots of the diagonal elements of $\Omega$ , but off-diagonal elements require a separate estimation step. What is really required to estimate is the scale-free matrix $\bar{\Omega}$ . This is the problem examined next.

If $\omega$ is the diagonal matrix formed by the squares roots of $\Omega_{11},\dots,\Omega_{dd}$ , all variables $\omega^{-1}(Y_{i}-\xi_{i})$ have distribution $\mathrm{ST}_{d}(0,\bar{\Omega},\alpha,\nu)$ , for $i=1,\dots,n$ . Denote by $Z=(Z_{1},\dots,Z_{d})^{\top}$ the generic member of this set of variables. We are concerned with the distribution of the products $Z_{j}Z_{k}$ , but for notational simplicity we focus on the specific product $W=Z_{1}\,Z_{2}$ , since all other products are of similar nature.

We must then examine the distribution of $W{=}Z_{1}Z_{2}$ when $(Z_{1},Z_{2})$ is a bivariate ST variable. This looks at first to be a daunting task, but a major simplification is provided by consideration of the perturbation invariance property of symmetry-modulated distributions, of which the ST is an instance. For a precise exposition of this property, see for instance Proposition 1.4 of Azzalini & Capitanio, (2014), but in essence it states that, since $W$ is an even function of $(Z_{1},Z_{2})$ , its distribution does not depend on $\alpha$ , and it coincides with the distribution of the case $\alpha=0$ , that is, the case of a usual bivariate Student’s $t$ distribution, with dependence parameter $\bar{\Omega}_{12}$ .

Denote by $F_{W}(w;\rho,\nu)$ the distribution function of the product $W$ of variables $(Z_{1},Z_{2})$ having bivariate Student’s $t$ density (5) in dimension $d=2$ with $\nu$ degrees of freedom and dependence parameter $\rho$ where $|\rho|<1$ . An expression of $F_{W}(w;\rho,\nu)$ is available in Theorem 1 of Wallgren, (1980). Although this expression involves numerical integration, this is not problematic since univariate integration can be performed efficiently and reliably with the R function integrate.

To estimate $\Omega_{12}$ , we search for the value of $\rho$ such that the median of the distribution of $W$ equates its sample value. In practice, we compute the sample values $\tilde{w}_{1},\dots,\tilde{w}_{n}$ where $\tilde{w}_{i}=\tilde{z}_{i1}\tilde{z}_{i2}$ , using the residuals in (13), and denote their median by $m_{\tilde{w}}$ . Then we must numerically solve the non-linear equation

[TABLE]

with respect to $\rho$ . Solution of the equation is facilitated by the monotonicity of $F_{W}(w;\rho,\nu)$ with respect to $\rho$ , ensured by Theorem 2 of Wallgren, (1980). The solution of (14) is the estimate $\tilde{\Omega}_{12}$ .

Proceeding similarly for other pairs of variables $(Z_{j},Z_{k})$ , all entries of matrix $\bar{\Omega}$ can be estimated; the diagonal elements are all 1. However, it could happen that the matrix so produced is not a positive-definite correlation matrix, because of estimation errors. Furthermore an even more stringent condition has to be satisfied, namely that

[TABLE]

This condition on $\Omega^{*}$ applies to all skew-elliptical distributions, of which the ST is an instance; see (Branco & Dey,, 2001, p. 101) or (Azzalini & Capitanio,, 2014, p. 171).

Denote by $\tilde{\Omega}^{*}$ the estimate of $\Omega^{*}$ with $\tilde{\bar{\Omega}}$ in the $d\times d$ top-left block obtained by solutions of equations of type (14) and $\tilde{\delta}$ computed by applying $\delta(\tilde{\lambda}_{j})$ in (4) to $\tilde{\lambda}_{1},\dots,\tilde{\lambda}_{d}$ . If $\tilde{\Omega}^{*}$ is positive definite, then we move on to the next step; otherwise an adjustment is required.

There exist various techniques to adjust a nearly positive-definite matrix to achieve positive-definiteness. Some numerical experimentation has been carried out using procedure nearPD of R package Matrix; see Bates & Maechler, (2019). Unfortunately, this did not work well when we used the resulting matrix for the next step, namely computation of the vector

[TABLE]

which enters the density function (6); see for instance equation (4) of Azzalini & Capitanio, (2003), which is stated for the SN distribution, but it holds also for the ST. The unsatisfactory outcome from nearPD for our problem is presumably due to modifications in the relative size of the components of $\tilde{\Omega}^{*}$ , leading to grossly inadequate $\alpha$ vectors, typically having a gigantic norm.

A simpler type of adjustment has therefore been adopted, as follows. If condition (15) does not hold for $\tilde{\Omega}^{*}$ , the off-diagonal elements of the matrix are shrunk by a factor 0.95, possibly repeatedly, until (15) is satisfied. This procedure was quick to compute and it did not cause peculiar outcomes from (16).

Hence, either directly from the initial estimates of $\bar{\Omega}$ and $\delta$ or after the adjustment step just described, we obtain valid components satisfying condition (15) and a corresponding vector $\tilde{\alpha}$ from (16). The final step is to introduce scale factors via

[TABLE]

where $\tilde{\omega}=\mathrm{diag}(\tilde{\omega}_{1},\dots,\tilde{\omega}_{d})$ . This completes estimation of $(\beta,\Omega,\alpha,\nu)$ .

3.5 Simulation work to compare initialization procedures

A number of simulation runs has been performed to examine the performance of the proposed methodology. The computing environment is R version 3.6.0. The reference point for these evaluations is the methodology currently in use, as provided by the publicly available version of R package sn at the time of writing, namely version 1.5-4; see Azzalini, (2019). This will be denoted ‘the current method’ in the following. Since the role of the proposed method is to initialize the numerical MLE search, not the initialization procedure per se, we compare the new and the current method with respect to final MLE outcome. However, since the numerical optimization method used after initialization is the same, any variations in the results originate from the different initialization procedures.

We stress again that in a vast number of cases the working of the current method is satisfactory and we are aiming at improvements when dealing with ‘awkward samples’. These commonly arise with ST distributions having low degrees of freedom, about $\nu=1$ or even less, but exceptions exist, such as the second sample in Figure 2.

The primary aspect of interest is improvement in the quality of data fitting. This is typically expressed as an increase of the maximal achieved log-likelihood, in its penalized form. Another desirable effect is improvement in computing time.

The basic set-up for such numerical experiments is represented by simple random samples, obtained as independent and identically distributed values drawn from a named $\mathrm{ST}(\xi,\omega,\lambda,\nu)$ . In all cases we set $\xi=0$ and $\omega=1$ . For the other ingredients, we have selected the following values:

[TABLE]

and, for each combination of of these values, $N=2000$ samples have been drawn.

The smallest examined sample size, $n=50$ , must be regarded as a sort of ‘sensible lower bound’ for realistic fitting of flexible distributions such as the ST. In this respect, recall the cautionary note of (Azzalini & Capitanio,, 2014, p. 63) about the fitting of a SN distribution with small sample sizes. Since the ST involves an additional parameter, notably one having strong effect on tail behaviour, that annotation holds a fortiori here.

For each of the $3\times 3\times 4\times 2000=72000$ samples so generated, estimation of the parameters $(\xi,\omega,\lambda,\nu)$ has been carried out using the following methods.

M0: this is the current method, which maximizes the penalized log-likelihood using function st.mple as described in Section 2.2.

M1: preliminary estimates are computed as described in Section 3.2;

M2: maximization of the penalized log-likelihood, still using function st.mple, but starting from the estimates of M1;

M3: similar to M2, but using a simplified form of M1, where only the location and scale parameters are estimated, setting $\lambda=0$ and $\nu=10$ .

An exhaustive analysis of the simulation outcome would be far too lengthy and space consuming. As already mentioned, our primary interest is on the differences of maximized log-likelihood. Specifically, if denote by $\log{\hat{L}_{h}}$ the maximized value of the penalized log-likelihood using method M $h$ , we focus on the quantities $D_{20}$ , $D_{23}$ and $D_{30}$ , where $D_{hk}=\log{\hat{L}_{h}}-\log{\hat{L}_{k}}$ . Table 2 reports the observed frequencies of $D_{hk}$ values, grouped in intervals

[TABLE]

crosstabulated either with $n$ or with $\nu$ .

We are not concerned with samples having values $D_{hk}$ in the interval $(-0.2,0.2)$ , since these differences are not relevant from an inferential viewpoint; just note that they constitute the majority of cases. As for $D_{20}$ , the fraction of cases falling outside $(-0.2,0.2)$ is small, but it is not negligible, and this justifies our efforts to improve over M0. As expected, larger $D_{20}$ values occur more easily when $n$ or $\nu$ are small, but sometimes also otherwise. In all but a handful of cases, these larger differences are on the positive sides, confirming the effectiveness of the proposed method for initialization. The general indication is that both methods M2 and M3, and implicitly so M1, improve upon the current method M0.

Visual inspections of individual cases where M1 performs poorly indicates that the problem originates in the sample octiles, on which all the rest depends. Especially with very low $n$ and/or $\nu$ , the sample octiles can occasionally happen to behave quite differently from expectations, spoiling everything. Unfortunately, there is no way to get around this problem, which however is sporadic.

Another indication from Table 2 is that M3 is essentially equivalent to M2, in terms of maximized log-likelihood, and in some cases it is even superior, in spite of its simplicity.

Another aspect is considered is Table 3 which reports computing times and their differences as frequencies of time intervals. A value $t_{k}$ represents the computing time for estimation from a given sample using method M $k$ , obtained by the first value reported by the R function system.time. For M2 and M3, $t_{k}$ includes the time spent for initialization with M1, although this is a very minor fraction of the overall time. Clearly the samples considered are of quite different nature, especially so for sample size. However, our purpose is solely comparative and, since exactly the same samples are processed by the various methods, the comparison of average computing times is valid. Table 3 shows a clear advantage of M2 over M0 in terms of computing time and also some advantage, but less prominent, over M3.

Additional simulations have been run having the location parameter expressed via a linear regression. Given a vector $x$ formed by $n$ equally spaced points on the interval $(-1,1)$ , design matrices have been built using $p$ transformations $T_{j}(x)$ , inclusive of the constant function $T_{0}(x)=1$ , as follows:

[TABLE]

Computation of $T_{j}(x)$ over the $n$ values of $x$ yields the columns of the design matrix; the regression parameters $\beta_{1},\dots,\beta_{p}$ have been set at $\beta_{j}=1$ for all $j$ s. For each of the A, B, C design matrices, and for each parameter combinations in (17), $N=2000$ have been generated, similarly to the case of simple random samples.

In Table 4, we summarize results only for case C, as the other cases are quite similar. The distribution of $D_{20}$ in the top two sub-tables still indicate a superiority of M2 over M0, although less pronounced than for simple samples. The lower portion of the table indicates a slight superiority of M3 over M2, reinforcing the similar indication from Table 2.

In the two subtables of Table 4 about $D_{20}$ , note that there are 160 samples where M0 goes completely wrong. All these sample were generated with $\nu=1$ , a fact which is not surprising considering the initial parameter selection of st.mple, in its standard working described at the beginning of Section 2.2. Since that initial selection is based on a least-squares fit of the regression parameters, this step clashes with the non-existence of moments when the underlying ST distribution has $\nu=1$ degrees of freedom. Not only the regression parameters are poorly fitted, but the ensuing residuals are spoiled, affecting also the initial fit of the other parameters.

A set of simulations has also been run in the bivariate case, hence sampling from density (6) with $d=2$ . The scale matrix and the shape vector have been set to

[TABLE]

where $\lambda$ spans the values given in (17). Also $n$ and $\nu$ have been set like in (17), with the exception that $n=50$ has been not included, considering that $50$ data points would constitute a too small sample in the present context. On the whole, $3^{3}\times 2000=54000$ bivariate samples have then been generated. They have been processed by function mst.mple of package sn and the initialization method of Section 3.4, with obvious modifications of the meaning of notation M0 to M3.

The summary output of the simulations is presented in Table 5. There is a clear winner this time, since M3 is constantly superior to the others. Between M0 and M2, the latter is still preferable for $\nu=1$ , but not otherwise.

The almost constant superiority of M3 over M2 is quite surprising, given the qualitatively different indication emerging in the univariate case. This rather surprising effect must be connected to transformation (16), as it has also been indicated by direct examination of a number of individual cases: a moderate estimation error even of a single $\lambda_{j}$ component, and consequently of $\delta_{j}$ , transforms into a poor estimate of $\alpha$ . It so happens that the conservative choice $\alpha=0$ of M3 avoids problems and can be, in its simplicity, more effective.

3.6 Conclusions

The overall indication of the simulation work is that the proposed preliminary estimates work quite effectively, providing an improved initialization of the numerical MPLE search. The primary aspect is that higher log-likelihood values are usually achieved, compared to the currently standard method, M0, sometimes by a remarkable margin. Another positive aspect is the saving in the overall computing time.

Of the two variant forms of the new initialization, leading to methods M2 and M3, the latter has emerged as clearly superior in the multivariate case, but no such clear-cut conclusion can be drawn in the univariate setting, with indications somewhat more favourable for M2. In this case, it is advisable to consider both variant of the preliminary estimates and carry out two numerical searches. Having to choose between them, the quick route is to take the one with higher log-likelihood. However, direct inspection of both outcomes must be recommended, including exploration of the profile log-likelihood surface.

Surely, it would have been ideal to identify a universally superior method, to be adopted for all situations, but this type of simplification still eludes us.

Bibliography22

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Adcock, (2010) Adcock, C. J. (2010). Asset pricing and portfolio selection based on the multivariate extended skew-Student- t 𝑡 t distribution. Ann. Oper. Res. , 176(1), 221–234.
2Azzalini, (2016) Azzalini, A. (2016). Flexible distributions as an approach to robustness: the skew- t 𝑡 t case. In C. Agostinelli, A. Basu, P. Filzmoser, & D. Mukherjee (Eds.), Recent Advances in Robust Statistics: Theory and Applications chapter 1, (pp. 1–16). Springer India.
3Azzalini, (2019) Azzalini, A. (2019). The R package sn : The Skew-Normal and Related Distributions such as the Skew- t 𝑡 t (version 1.5-4). Università di Padova, Italia.
4Azzalini & Arellano-Valle, (2013) Azzalini, A. & Arellano-Valle, R. B. (2013). Maximum penalized likelihood estimation for skew-normal and skew- t 𝑡 t distributions. J. Statist. Plann. Inference , 143(2), 419–433. Available online 30 June 2012.
5Azzalini & Capitanio, (2003) Azzalini, A. & Capitanio, A. (2003). Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t 𝑡 t distribution. J. R. Statist. Soc., ser. B , 65(2), 367–389. Full version of the paper at ar Xiv.org:0911.2342 .
6Azzalini & Capitanio, (2014) Azzalini, A. & Capitanio, A. (2014). The Skew-Normal and Related Families . IMS monographs. Cambridge, UK: Cambridge University Press.
7Azzalini & Genton, (2008) Azzalini, A. & Genton, M. G. (2008). Robust likelihood methods based on the skew- t 𝑡 t and related distributions. Int. Statist. Rev. , 76, 106–129.
8Bates & Maechler, (2019) Bates, D. & Maechler, M. (2019). Matrix: Sparse and Dense Matrix Classes and Methods . R package version 1.2-17.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Some computational aspects of maximum likelihood estimation

Abstract

1 Background and aims

1.1 Flexible distributions: the skew-ttt case

1.2 The skew-ttt distribution: basic facts

2 On the likelihood function of ST models

2.1 Basic general aspects

2.2 Numerical aspects and some illustrations

3 On the choice of initial parameters for MLE search

3.1 Preliminary remarks and the basic scheme

3.2 Inversion of quantile-based measures to ST parameters

3.3 Extension to the regression case

3.4 Extension to the multivariate case

3.5 Simulation work to compare initialization procedures

3.6 Conclusions

1.1 Flexible distributions: the skew- $t$ case

1.2 The skew- $t$ distribution: basic facts