High-dimensional copula variational approximation through transformation

Michael Stanley Smith; Ruben Loaiza-Maya; David J. Nott

arXiv:1904.07495·stat.CO·November 21, 2019

TL;DR

This paper introduces a novel variational approximation method using copula models with transformations like Yeo-Johnson, improving high-dimensional Bayesian inference accuracy with minimal additional computational cost.

Contribution

It proposes a new copula-based variational approximation framework that enhances accuracy in high-dimensional models without significant computational overhead.

Findings

1. Copula models outperform Gaussian and skew-normal approximations in accuracy.
2. The method is computationally efficient and scalable to high dimensions.
3. Successful application to three models with six real datasets.

Abstract

Variational methods are attractive for computing Bayesian inference for highly parametrized models and large datasets where exact inference is impractical. They approximate a target distribution - either the posterior or an augmented posterior - using a simpler distribution that is selected to balance accuracy with computational feasibility. Here we approximate an element-wise parametric transformation of the target distribution as multivariate Gaussian or skew-normal. Approximations of this kind are implicit copula models for the original parameters, with a Gaussian or skew-normal copula function and flexible parametric margins. A key observation is that their adoption can improve the accuracy of variational inference in high dimensions at limited or no additional computational cost. We consider the Yeo-Johnson and G&H transformations, along with sparse factor structures for the scale…

Tables

Table 1: Two transformations, their inverses and derivatives that are required to implement the copula variational Bayes estimator. For the YJ transformation, the term $\bar{\theta}=1-\theta$, and $\gamma$ is a scalar. The inverse G&H is a two parameter transformation with $\gamma=\{g,0<h<1\}$. Note that $t_{\gamma}$ in the first row is never computed in the SGA algorithm, along with a number of derivatives labelled ‘Not Required’. MATLAB routines to evaluate the functions are provided in the Supplementary Materials.

	Yeo-Johnson Transformation		Inverse G&H Transformation
Function	$θ < 0, ψ < 0$	$θ \geq 0, ψ \geq 0$	$g \neq 0$	$g = 0$
$t_{γ} (θ)$	$- \frac{{\bar{θ}}^{2 - γ} - 1}{2 - γ}$	$\frac{{(θ + 1)}^{γ} - 1}{γ}$	Evaluated Numerically	Evaluated Numerically
$t_{γ}^{- 1} (ψ)$	$1 - {(1 - ψ (2 - γ))}^{\frac{1}{2 - γ}}$	${(1 + ψ γ)}^{\frac{1}{γ}} - 1$	$\frac{\exp (g ψ) - 1}{g} \exp (h ψ^{2} / 2)$	$ψ \exp (\frac{h ψ^{2}}{2})$
$\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)$	${(1 - ψ (2 - γ))}^{\frac{γ - 1}{2 - γ}}$	${(1 + ψ γ)}^{\frac{1 - γ}{γ}}$	$\exp (g ψ + \frac{h ψ^{2}}{2}) + h ψ t_{γ}^{- 1} (ψ)$	$\exp (\frac{h ψ^{2}}{2}) + h ψ t_{γ}^{- 1} (ψ)$
$t_{γ}^{'} (θ) = \frac{\partial}{\partial θ} t_{γ} (θ)$	${\bar{θ}}^{1 - γ}$	${(θ + 1)}^{γ - 1}$	${[\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]}^{- 1}$	${[\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]}^{- 1}$
$\frac{\partial^{2}}{\partial ψ^{2}} t_{γ}^{- 1} (ψ)$	Not Required	Not Required	$\exp (g ψ + \frac{h ψ^{2}}{2}) (g + h ψ) + h ψ [\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)] + h t_{γ}^{- 1} (ψ)$	$\exp (\frac{h ψ^{2}}{2}) h ψ + h ψ [\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)] + h t_{γ}^{- 1} (ψ)$
$\frac{\partial^{2}}{\partial θ^{2}} t_{γ} (θ)$	$(γ - 1) {\bar{θ}}^{- γ}$	$(γ - 1) {(θ + 1)}^{γ - 2}$	$- {[\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]}^{- 3} \frac{\partial^{2}}{\partial ψ^{2}} t_{γ}^{- 1} (ψ)$	$- {[\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]}^{- 3} \frac{\partial^{2}}{\partial ψ^{2}} t_{γ}^{- 1} (ψ)$
$\frac{\partial}{\partial γ} t_{γ} (θ)$	$\frac{(2 - γ) {\bar{θ}}^{2 - γ} \ln (\bar{θ}) - {\bar{θ}}^{2 - γ} + 1}{{(2 - γ)}^{2}}$	$\frac{γ {(1 + θ)}^{γ} \ln (θ + 1) - {(1 + θ)}^{γ} + 1}{γ^{2}}$	Not Required	Not Required
$\frac{\partial}{\partial γ} t_{γ}^{- 1} (ψ)$	$- \frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ) \frac{\partial}{\partial γ} t_{γ} (θ)$	$- \frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ) \frac{\partial}{\partial γ} t_{γ} (θ)$	$\frac{\partial}{\partial g} t_{γ}^{- 1} (ψ) = \frac{ψ}{g} \exp (g ψ + \frac{h ψ^{2}}{2}) - \frac{t_{γ}^{- 1} (ψ)}{g}$; $\frac{\partial}{\partial h} t_{γ}^{- 1} (ψ) = \frac{ψ^{2}}{2} t_{γ}^{- 1} (ψ)$	$\frac{\partial}{\partial h} t_{γ}^{- 1} (ψ) = \frac{ψ^{2}}{2} t_{γ}^{- 1} (ψ)$
$\frac{\partial}{\partial γ} t_{γ}^{'} (θ)$	$- {(\bar{θ})}^{1 - γ} \ln (\bar{θ})$	${(θ + 1)}^{γ - 1} \ln (θ + 1)$	$\frac{\partial}{\partial g} t_{γ}^{'} (θ) = - {[\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]}^{- 2} \frac{\partial}{\partial g} [\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]$; $\frac{\partial}{\partial h} t_{γ}^{'} (θ) = - {[\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]}^{- 2} \frac{\partial}{\partial h} [\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]$	$\frac{\partial}{\partial h} t_{γ}^{'} (θ) = - {[\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]}^{- 2} \frac{\partial}{\partial h} [\frac{\partial}{\partial ψ} t_{γ}^{- 1} (ψ)]$
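
As an illustration of how the closed forms in Table 1 fit together, the following is a minimal Python sketch (not the authors' MATLAB routines) of the Yeo-Johnson transformation, its inverse and first derivative, together with the inverse G&H transformation and its derivative in $\psi$. The function names are hypothetical, and the sketch assumes $0<\gamma<2$ for the YJ case so that neither $\gamma$ nor $2-\gamma$ is zero.

```python
import math


def yj(theta, gamma):
    """Yeo-Johnson transform t_gamma(theta); assumes 0 < gamma < 2."""
    if theta < 0:
        return -((1.0 - theta) ** (2.0 - gamma) - 1.0) / (2.0 - gamma)
    return ((theta + 1.0) ** gamma - 1.0) / gamma


def yj_inv(psi, gamma):
    """Inverse Yeo-Johnson transform; psi has the same sign as theta."""
    if psi < 0:
        return 1.0 - (1.0 - psi * (2.0 - gamma)) ** (1.0 / (2.0 - gamma))
    return (1.0 + psi * gamma) ** (1.0 / gamma) - 1.0


def yj_deriv(theta, gamma):
    """First derivative of t_gamma with respect to theta."""
    if theta < 0:
        return (1.0 - theta) ** (1.0 - gamma)
    return (theta + 1.0) ** (gamma - 1.0)


def igh_inv(psi, g, h):
    """Inverse G&H transform, available in closed form (unlike t_gamma itself)."""
    tail = math.exp(h * psi * psi / 2.0)
    if g == 0.0:
        return psi * tail
    return (math.exp(g * psi) - 1.0) / g * tail


def igh_inv_deriv(psi, g, h):
    """Derivative of the inverse G&H transform with respect to psi.

    With g = 0 the first term reduces to exp(h psi^2 / 2), matching Table 1.
    """
    return math.exp(g * psi + h * psi * psi / 2.0) + h * psi * igh_inv(psi, g, h)
```

A quick sanity check is that `yj_inv(yj(theta, gamma), gamma)` returns `theta` up to rounding, and that a finite-difference slope of `igh_inv` agrees with `igh_inv_deriv`.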

Table 2: Comparison of different variational approximations $q_{\lambda}(\text{\boldmath$\theta$})$ to the augmented posterior of the mixed logistic regression for the polypharmacy data. The mean field Gaussian, with and without YJ transformation, are included as benchmarks A1 and A2. All the remaining approximations use factor decompositions for the scale matrices with $k=5$ factors. For each approximation, the number of variational parameters $|\text{\boldmath$\lambda$}|$, average lower bound value over the last 1000 steps, and the time to complete 1,000 steps using MATLAB on a standard laptop are reported.

Variational Approximation	# Parameters $|𝝀|$	Max. Lower Bound	Time (mins)
(A1) Mean Field Gaussian	1018	-923.08	0.85
(A2) Mean Field YJ Transform	1527	-913.17	0.85
(A3) Gaussian	3553	-918.24	1.99
(A4) Skew-normal	4062	-923.16	2.28
(A5) Gaussian Copula (YJ Transform)	4062	-908.33	2.00
(A6) Skew-normal Copula (YJ Transform)	4571	-916.80	2.34
(A7) Gaussian Copula (iGH Transform)	4571	-909.21	1.86
(A8) Skew-normal Copula (iGH Transform)	5080	-924.01	2.15
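
The parameter counts in Table 2 can be reproduced with simple arithmetic. The sketch below is an illustration under two assumptions that are not stated in the table itself: the target dimension is $m=509$ (inferred from $|𝝀|=2m=1018$ for the mean field Gaussian), and the $m\times k$ factor loading matrix has zeros above its diagonal, which removes $k(k-1)/2$ free elements. The function names are hypothetical.

```python
def factor_scale_params(m, k):
    """Free parameters in a factor scale matrix B B^T + D^2.

    Assumes B is m-by-k with zeros above the diagonal and D is diagonal.
    """
    return m * k - k * (k - 1) // 2 + m


def count_variational_params(m, k, skew=False, transform=None, mean_field=False):
    """Count variational parameters for the approximations A1 to A8."""
    per_margin = {None: 0, "YJ": 1, "iGH": 2}[transform]  # gamma, or (g, h)
    n = m                                                  # location vector
    n += m if mean_field else factor_scale_params(m, k)    # scale parameters
    if skew:
        n += m                                             # skewness vector
    n += per_margin * m                                    # transformation parameters
    return n


m, k = 509, 5  # m is inferred from Table 2, not quoted from the text
counts = {
    "A1": count_variational_params(m, k, mean_field=True),
    "A2": count_variational_params(m, k, mean_field=True, transform="YJ"),
    "A3": count_variational_params(m, k),
    "A4": count_variational_params(m, k, skew=True),
    "A5": count_variational_params(m, k, transform="YJ"),
    "A6": count_variational_params(m, k, skew=True, transform="YJ"),
    "A7": count_variational_params(m, k, transform="iGH"),
    "A8": count_variational_params(m, k, skew=True, transform="iGH"),
}
```

Under these assumptions each transformation adds only $m$ (YJ) or $2m$ (inverse G&H) parameters, which is small relative to the roughly $m(k+2)$ parameters of the factor Gaussian.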

Table 3: Average lower bound value over the last 1000 steps for six variational approximations to the posteriors of the four logistic regression examples.

Example	Variational Approximation
	(A3)	(A4)	(A5)	(A6)	(A7)	(A8)
Spam	-828.15	-824.28	-827.96	-824.02	-828.69	-824.60
Krkp	-386.39	-386.68	-386.86	-384.98	-390.33	-386.80
Iono	-103.98	-100.39	-104.39	-100.95	-106.26	-102.46
Mush	-126.31	-124.06	-127.93	-124.21	-132.15	-129.15

Table 4: Closed form expressions for four derivatives in Appendix B. These are used to compute the gradient of the lower bound efficiently when using the reparameterization trick and a skew-normal copula approximation with a factor covariance structure. They are expressed recursively (with the terms evaluated from top to bottom for each derivative) and derived in the Online Appendix. In the table we denote $\bm{\xi}=\left(B\bm{z}+\bm{d}\circ\bm{\epsilon}\right)$, and $P$ is a matrix of zeros and ones such that $\frac{d\bm{\theta}(\bm{\varepsilon},\bm{\lambda})}{d\bm{d}}=\frac{d\bm{\theta}(\bm{\varepsilon},\bm{\lambda})}{dD}P$. MATLAB routines to evaluate the expressions are available in the Supplementary Materials.

Computing $\frac{d 𝜽 (𝜺, 𝝀)}{d B}$
$M_{1} =$	${\tilde{𝜹}}_{ψ} {\tilde{𝜹}}_{ψ}^{⊤}$
$M_{2} =$	$𝝃^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ} I_{m} + 𝝃^{⊤} Σ_{ψ}^{- 1} \otimes {\tilde{𝜹}}_{ψ}$
$M_{3} =$	$0.5 ε_{0} {(1 - {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ})}^{- 1 / 2} {\tilde{𝜹}}_{ψ}$
$M_{4} =$	$ε_{0} \sqrt{1 - {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ}}$
$M_{5} =$	${\tilde{𝜹}}_{ψ}^{⊤} (Σ_{ψ}^{- 1} B \otimes {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1})$
$M_{6} =$	${\tilde{𝜹}}_{ψ}^{⊤} (Σ_{ψ}^{- 1} \otimes {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} B) K_{m, p}$
$M_{7} =$	$[diag (B_{.1}), \dots, diag (B_{. p})]$
$M_{8} =$	$diag (𝜹_{𝝍}) S_{ψ}^{- 1 / 2} M_{7}$
$M_{9} =$	$- 2 M_{3} \otimes ({\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} M_{8})$
$M_{10} =$	$M_{3} \otimes M_{5} + M_{3} \otimes M_{6}$
$M_{11} =$	$M_{2} M_{8}$
$M_{12} =$	$- (𝝃^{⊤} Σ_{ψ}^{- 1} B \otimes Σ_{ψ}^{- 1}) - K_{1, m} (Σ_{ψ}^{- 1} B \otimes 𝝃^{⊤} Σ_{ψ}^{- 1})$
$T_{B 0} =$	$| r | M_{8}$
$T_{B 1} =$	$𝒛^{⊤} \otimes I_{m}$
$M_{13} =$	$M_{12} + Σ_{ψ}^{- 1} T_{B 1}$
$T_{B 2} =$	$M_{11} + M_{1} M_{13}$
$T_{B 3} =$	$M_{9} + M_{10} + M_{4} M_{8}$
$\frac{d 𝜽 (λ, ζ)}{d B} =$	$\frac{d t_{γ}^{- 1} (ψ)}{d ψ} (T_{B 0} + T_{B 1} + T_{B 2} + T_{B 3})$

Computing $\frac{d 𝜽 (𝜺, 𝝀)}{d 𝒅}$
$M_{1} =$	${\tilde{𝜹}}_{ψ} {\tilde{𝜹}}_{ψ}^{⊤}$
$M_{2} =$	$𝝃^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ} I_{m} + 𝝃^{⊤} Σ_{ψ}^{- 1} \otimes {\tilde{𝜹}}_{ψ}$
$M_{3} =$	$0.5 ε_{0} {(1 - {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ})}^{- 1 / 2} {\tilde{𝜹}}_{ψ}$
$M_{4} =$	$ε_{0} \sqrt{1 - {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ}}$
$M_{5} =$	$- (Σ_{ψ}^{- 1} D \otimes {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} + {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} D \otimes Σ_{ψ}^{- 1}) P$
$M_{6} =$	$[diag (D_{.1}), \dots, diag (D_{. p})]$
$M_{7} =$	$diag (𝜹_{𝝍}) S_{ψ}^{- 1 / 2} M_{6} P$
$M_{8} =$	$- 2 M_{3} \otimes ({\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} M_{7})$
$M_{9} =$	$- M_{3} \otimes ({\tilde{𝜹}}_{ψ}^{⊤} M_{5})$
$M_{10} =$	$M_{2} M_{7}$
$M_{11} =$	$- ((𝝃^{⊤} Σ_{ψ}^{- 1} D \otimes Σ_{ψ}^{- 1}) + K_{1, m} (Σ_{ψ}^{- 1} D \otimes 𝝃^{⊤} Σ_{ψ}^{- 1})) P$
$T_{d 0} =$	$| r | M_{7}$
$T_{d 1} =$	$(ϵ^{⊤} \otimes I_{m}) P$
$M_{12} =$	$M_{11} + Σ_{ψ}^{- 1} T_{d 1}$
$T_{d 2} =$	$M_{10} + M_{1} M_{12}$
$T_{d 3} =$	$M_{8} + M_{9} + M_{4} M_{7}$
$\frac{d 𝜽 (λ, ζ)}{d 𝒅} =$	$\frac{d t_{γ}^{- 1} (ψ)}{d ψ} (T_{d 0} + T_{d 1} + T_{d 2} + T_{d 3})$

Computing $\frac{d 𝜽 (𝜺, 𝝀)}{d 𝜶_{𝝍}}$
$M_{1} =$	$\sqrt{1 - {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ}}$
$M_{2} =$	$- M_{1}^{- 1} {\tilde{𝜹}}_{ψ}^{⊤} Σ_{ψ}^{- 1}$
$M_{3} =$	$ε_{0} vec (I_{m}) \otimes M_{2}$
$M_{4} =$	${\tilde{𝜹}}_{ψ}^{⊤} \otimes I_{m}$
$M_{5} =$	$M_{4} M_{3} + ε_{0} M_{1} I_{m}$
$M_{6} =$	$𝝃^{⊤} Σ_{ψ}^{- 1} {\tilde{𝜹}}_{ψ} I_{m} + 𝝃^{⊤} Σ_{ψ}^{- 1} \otimes {\tilde{𝜹}}_{ψ}$
$M_{7} =$	$| r | I_{m} - M_{6} + M_{5}$
$M_{8} =$	$\frac{d t_{γ}^{- 1} (ψ)}{d ψ} M_{7} S_{ψ}^{1 / 2}$
$M_{9} =$	$1 + 𝜶_{𝝍}^{⊤} Ω_{ψ} 𝜶_{𝝍}$
$M_{10} =$	$M_{9}^{- 3 / 2} (M_{9} Ω_{ψ} - Ω_{ψ} 𝜶_{𝝍} 𝜶_{𝝍}^{⊤} Ω_{ψ})$
$\frac{d θ (λ, ζ)}{d 𝜶_{𝝍}} =$	$M_{8} M_{10}$

Computing $\nabla_{θ} \log q_{λ} (𝜽)$
$T_{q 1} =$	${(t_{γ_{1}}^{''} (θ_{1}) / t_{γ_{1}}^{'} (θ_{1}), \dots, t_{γ_{m}}^{''} (θ_{m}) / t_{γ_{m}}^{'} (θ_{m}))}^{⊤}$
$M_{1} =$	$diag (t_{γ_{1}}^{'} (θ_{1}), \dots, t_{γ_{m}}^{'} (θ_{m}))$
$T_{q 2} =$	$- M_{1}^{⊤} Σ_{ψ}^{- 1} (𝝍 - 𝝁_{ψ})$
$M_{4} =$	$𝜶_{ψ}^{⊤} S_{ψ}^{- 1 / 2} (𝝍 - 𝝁_{ψ})$
$T_{q 3} =$	$M_{1}^{⊤} S_{ψ}^{- 1 / 2} 𝜶_{ψ} \frac{ϕ_{1} (M_{4})}{Φ_{1} (M_{4})}$
$\nabla_{θ} \log q_{λ} (𝜽) =$	$T_{q 1} + T_{q 2} + T_{q 3}$

Equations

$$\mbox{KL}(q_{\lambda}(\bm{\theta})\,\|\,p(\bm{\theta}|\bm{y}))=\log p(\bm{y})-\mathcal{L}(\bm{\lambda}),$$

$$\mathcal{L}(\bm{\lambda})=E_{q_{\lambda}}\left[\log g(\bm{\theta})-\log q_{\lambda}(\bm{\theta})\right],$$

$$q_{\lambda}(\bm{\theta})=p(\bm{\psi};\bm{\pi})\prod_{i=1}^{m}t_{\gamma_{i}}^{\prime}(\theta_{i}),$$

$$q_{\lambda_{i}}(\theta_{i})=p_{i}(\psi_{i};\bm{\pi}_{i})\,t_{\gamma_{i}}^{\prime}(\theta_{i}),\mbox{ for }i=1,\dots,m,$$

$$q_{\lambda}(\bm{\theta})=c(\bm{u};\tilde{\bm{\pi}})\prod_{i=1}^{m}q_{\lambda_{i}}(\theta_{i}),$$

$$c(\bm{u};\tilde{\bm{\pi}})=\frac{p(\bm{\psi};\bm{\pi})}{\prod_{i=1}^{m}p_{i}(\psi_{i};\bm{\pi}_{i})}=\frac{p((F_{1}^{-1}(u_{1}),\dots,F_{m}^{-1}(u_{m}))^{\top};\bm{\pi})}{\prod_{i=1}^{m}p_{i}(F_{i}^{-1}(u_{i});\bm{\pi}_{i})},$$

$$C(\bm{u};\tilde{\bm{\pi}})=F(F_{1}^{-1}(u_{1};\bm{\pi}_{1}),\dots,F_{m}^{-1}(u_{m};\bm{\pi}_{m});\bm{\pi}),$$

$$t_{\gamma}(\theta)=\left\{\begin{array}{cl}-\frac{(-\theta+1)^{2-\gamma}-1}{2-\gamma}&\mbox{if }\theta<0\\ \frac{(\theta+1)^{\gamma}-1}{\gamma}&\mbox{if }\theta\geq 0\end{array}\right..$$

$$t_{\gamma}^{-1}(\psi)=\left\{\begin{array}{cl}\frac{\exp(g\psi)-1}{g}\exp(h\psi^{2}/2)&\mbox{ if }g\neq 0\\ \psi\exp(\frac{h\psi^{2}}{2})&\mbox{ if }g=0\end{array}\right.\,,$$

$$c^{DV}(\bm{v};\bm{\eta})=\prod_{t=2}^{T}\prod_{k=1}^{\min(t-1,p)}c_{k}^{\mbox{MIX}}(v_{t-k|t-1},v_{t|t-k+1};\bm{\eta}_{k}),$$

$$p(\bm{y},\bm{v}|\bm{\eta})=c^{DV}(\bm{v};\bm{\eta})\prod_{t=1}^{T}I(a_{t}\leq v_{t}<b_{t}),$$

$$p(\psi;\pi)=2\phi_{m}(\psi;\mu_{\psi},\Sigma_{\psi})\,\Phi_{1}(\alpha_{\psi}^{\top}S_{\psi}^{-1/2}(\psi-\mu_{\psi})),$$

$$\psi=\mu_{\psi}+\tilde{\delta}_{\psi}|r|+(I-\tilde{\delta}_{\psi}\tilde{\delta}_{\psi}^{\top}\Sigma_{\psi}^{-1})(Bz+D\epsilon)+\sqrt{1-\tilde{\delta}_{\psi}^{\top}\Sigma_{\psi}^{-1}\tilde{\delta}_{\psi}}\;\tilde{\delta}_{\psi}\varepsilon_{0},$$

$$q_{\lambda}(\bm{\theta})=q_{\lambda^{a}}(\eta)\,q_{\lambda^{b}}(v)=p_{a}(\bm{\psi}^{a};\pi^{a})\,p_{b}(\bm{\psi}^{b};\pi^{b})\left(\prod_{i=1}^{m}t_{\gamma_{a,i}}^{\prime}(\eta_{i})\right)\left(\prod_{t=1}^{T}t_{\gamma_{b,t}}^{\prime}(\tilde{v}_{t})\frac{d\tilde{v}_{t}}{dv_{t}}\right)$$

$$\log q_{\lambda^{a}}(\eta)=\log p_{a}(\bm{\psi}^{a};\pi_{a})+\sum_{i=1}^{m}\log t_{\gamma_{a,i}}^{\prime}(\eta_{i}).$$

$$\nabla_{\lambda^{a}}\log q_{\lambda^{a}}(\eta)=\left(\nabla_{\mu}\log q_{\lambda^{a}}(\eta)^{\top},\nabla_{b}\log q_{\lambda^{a}}(\eta)^{\top},\nabla_{d}\log q_{\lambda^{a}}(\eta)^{\top},\nabla_{\gamma}\log q_{\lambda^{a}}(\eta)^{\top}\right)^{\top}$$

$$\log q_{\lambda^{b}}(v)=\sum_{t=1}^{T}\left(\frac{1}{2}\tilde{v}_{t}^{2}-c_{t}-\frac{(\psi_{t}^{b}-\zeta_{t})^{2}}{2\exp(2c_{t})}-\log(b_{t}-a_{t})+\log(t_{\gamma_{b,t}}^{\prime}(\tilde{v}_{t}))\right),$$

$$\nabla_{\lambda}\mathcal{L}(\bm{\lambda})=\left(\nabla_{\mu_{\psi}}\mathcal{L}(\bm{\lambda})^{\top},\nabla_{\alpha_{\psi}}\mathcal{L}(\bm{\lambda})^{\top},\nabla_{\mbox{vech}(B)}\mathcal{L}(\bm{\lambda})^{\top},\nabla_{d}\mathcal{L}(\bm{\lambda})^{\top},\nabla_{\gamma}\mathcal{L}(\bm{\lambda})^{\top}\right)^{\top},$$

Taxonomy

Topics: Bayesian Methods and Mixture Models · Gaussian Processes and Bayesian Inference · Machine Learning and Algorithms

Full text

High-dimensional Copula Variational Approximation through Transformation

Michael Stanley Smith, Rubén Loaiza-Maya & David J. Nott

Michael Stanley Smith is Chair of Management (Econometrics) at Melbourne Business School, University of Melbourne; Rubén Loaiza-Maya is a Postdoctoral Fellow at the Department of Econometrics and Business Statistics, Monash University; and, David J. Nott is Associate Professor of Statistics, National University of Singapore. Correspondence should be directed to Michael Smith at [email protected].

Acknowledgments: The authors would like to thank Dr. Linda Tan for providing the MCMC output for the examples in Section 4, and Prof. Richard Gerlach and the review team for comments that helped improve the paper.

Abstract

Variational methods are attractive for computing Bayesian inference when exact inference is impractical. They approximate a target distribution—either the posterior or an augmented posterior—using a simpler distribution that is selected to balance accuracy with computational feasibility. Here we approximate an element-wise parametric transformation of the target distribution as multivariate Gaussian or skew-normal. Approximations of this kind are implicit copula models for the original parameters, with a Gaussian or skew-normal copula function and flexible parametric margins. A key observation is that their adoption can improve the accuracy of variational inference in high dimensions at limited or no additional computational cost. We consider the Yeo-Johnson and inverse G&H transformations, along with sparse factor structures for the scale matrix of the Gaussian or skew-normal. We also show how to implement efficient re-parametrization gradient methods for these copula-based approximations. The efficacy of the approach is illustrated by computing posterior inference for three different models using six real datasets. In each case, we show that our proposed copula model distributions are more accurate variational approximations than Gaussian or skew-normal distributions, but at only a minor or no increase in computational cost.

Key Words: Factor variational approximation, inverse G&H transformation, Implicit copula, Skew-normal copula, Yeo-Johnson transformation.

1 Introduction and Literature Review

Variational methods are an increasingly popular tool for computing posterior inferences for models with large numbers of parameters and/or large datasets; see Ormerod and Wand (2010) and Blei et al. (2017) for overviews. Unlike conventional Monte Carlo methods, which are able in principle to estimate quantities of interest with any desired precision, variational methods are approximate. However, they are often substantially faster, and can be used to estimate models where exact inference is impractical. Key to the success of variational inference is the selection of an approximation that balances accuracy with computational viability. In this paper we suggest a general approach to variational inference for a high-dimensional target distribution using Gaussian or skew-normal copula-based approximations. They are formed by using Gaussian or skew-normal distributions for an element-wise parametric transformation of the target. Parsimonious factor parametrizations of the scale matrix of these distributions are used to make the computations feasible. For the transformations, we consider the Yeo-Johnson (Yeo and Johnson, 2000) and inverse G&H families (Tukey, 1977). They allow for skewness and more complex features in the marginal densities of the copula model, without requiring a large number of additional variational parameters, which is important for maintaining computational efficiency in high dimensions. We also show how efficient re-parameterization gradient methods can be used for the copula models, including for the skew-normal by making use of its latent Gaussian structure. We show in a number of examples that our Gaussian and skew-normal copula models are more accurate approximations than the corresponding Gaussian and skew-normal distributions. Importantly, this increase in accuracy usually comes at only a minor increase in computational time, while in some instances the copula models are actually faster to calibrate.

Variational inference methods for Bayesian computation approximate a target posterior or augmented posterior distribution using another distribution which is more tractable. The form of the approximation is commonly derived either from an assumed factorization of the density, or the adoption of some convenient parametric family. In the current work, we consider parametric families of approximations, for which a Gaussian is the most common choice. Important early work on Gaussian approximations can be found in Opper and Archambeau (2009), where they considered models having a Gaussian prior and factorizing likelihood, and showed that in this class of models the number of variational parameters does not proliferate with increasing dimension. Challis and Barber (2013) discussed Gaussian approximations for models where the posterior could be expressed in a certain form, and show an equivalence between local variational methods and Kullback-Leibler divergence minimization methods in their setup. They also considered various parametrizations of the covariance matrix based on the Cholesky factor for the optimization. More recent work on Gaussian approximations has focused on stochastic gradient methods which largely remove any restriction on the kind of models to which the methodology applies. Key references here are papers by Kingma and Welling (2014) and Rezende et al. (2014) who introduced efficient variance reduction methods for stochastic gradient estimation in the variational optimization. These methods will be discussed further later. Some similar ideas were developed independently about the same time in Titsias and Lázaro-Gredilla (2014) and Salimans et al. (2013). The latter authors also consider methods for Gaussian approximation able to use second derivative information from the log posterior, as well as methods for forming non-Gaussian approximations by making use of hierarchical structures or mixtures of Gaussians. Kucukelbir et al. (2017) consider an automatic differentiation approach to Gaussian variational approximation which considers both diagonal and dense Cholesky parametrizations of the covariance matrix and the use of fixed marginal transformations of parameters. Their approach is implemented in the statistical package Stan (Carpenter et al., 2017).

A key difficulty with Gaussian approximations is the way that the number of covariance parameters increases quadratically with the number of model parameters, making Gaussian variational approximation impractical unless more parsimonious parametrizations of the covariance matrix are adopted. While assuming a diagonal covariance matrix is one possibility, this leads to the inability to represent the posterior dependence. Work on structured approximations for covariance matrices in Gaussian approximation applicable to high-dimensional problems includes the work of Challis and Barber (2013) mentioned above, and Tan and Nott (2018), who parameterize the covariance matrix in terms of a sparse Cholesky factor of the precision matrix. Related methods for time series models are developed in Archer et al. (2016). Miller et al. (2016) and Ong et al. (2018) consider factor parametrizations of covariance matrices, with the former authors also considering mixture approximations, with Gaussian component covariance matrices having the factor structure. Earlier approaches which used a one factor approximation to the covariance or precision matrix were considered by Seeger (2000) and Rezende et al. (2014). Quiroz et al. (2018) consider combining factor parametrizations for state reduction with sparse precision Cholesky factors for capturing dynamic dependence structure in high-dimensional state space models. Guo et al. (2016) consider similar “variational boosting” mixture approximations to Miller et al. (2016), although they use different approaches to the specification of mixture components and to the optimization.

The references above relate to different approaches to variational inference based on Gaussian or mixtures of Gaussians approximations. However, there is also a large literature on other approaches to developing flexible variational families. Most pertinent to the present work are methods based on copulas. Tran et al. (2015) use vine copulas, but these can be too slow to evaluate in high dimensions, and selection of the appropriate vine structure and component pair-copulas is difficult in general. Han et al. (2016) also employ element-wise transformations to construct a Gaussian copula model, and their work is most closely related to ours. They consider dense Cholesky factor parametrizations for the covariance matrix in the copula, and employ approximations to the posterior marginals based on flexible Bernstein polynomial transformations. Our work differs from theirs in the focus on approximations that can be calibrated in high dimensions. In particular, we use parsimonious factor parametrizations for the copula scale matrix which are feasible to implement for a high-dimensional model parameter vector, as well as parametric transformations which are computationally efficient and do not employ too many variational parameters. We also go beyond Gaussian copula approximations by investigating skew-normal copulas as well. Skew-normal variational families are considered in Ormerod (2011), who considers application to models which have a structure where the lower bound can be computed using one-dimensional quadrature. However, Ormerod (2011) does not consider skew-normal copulas.

Apart from copulas, there are many other ways to specify rich variational families. These include normalizing flows (Rezende and Mohamed, 2015), Stein variational gradient descent (Liu and Wang, 2016), real-valued non-volume preserving transformations (Dinh et al., 2016), methods based on transport maps (Spantini et al., 2018), implicit variational approximations where the variational family is specified through a generative process without a closed form density (Huszár, 2017) and hierarchical variational models (Ranganath et al., 2016). Some of these approaches attain their flexibility through using compositions of transformations of an initial density, but they do not fit into the copula framework discussed here.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to variational inference methods, followed by a general description of our proposed implicit copula approach. Sections 3 and 4 consider Gaussian copula and skew-normal copula approximations, respectively. They illustrate our approach in six examples, where the approximations are more accurate than the corresponding Gaussian approximations, at limited or no additional computational cost. Section 5 gives some concluding discussion and directions for future work. MATLAB code to implement our approach is described in the Online Appendix.

2 Variational Inference

In this section we first provide a short overview of variational inference. We then outline the implicit copulas formed through transformation that we employ as variational approximations.

2.1 Approximate Bayesian inference

We consider Bayesian inference with data $y$ having density $p(\text{\boldmath$ y $}|\text{\boldmath$ \theta $})$ , where $\text{\boldmath$ \theta $}=(\theta_{1},\dots,\theta_{m})^{\top}$ is either a parameter vector, or a parameter vector augmented with some additional latent variables. The prior and posterior densities are denoted by $p(\text{\boldmath$ \theta $})$ and $p(\text{\boldmath$ \theta $}|\text{\boldmath$ y $})\propto p(\text{\boldmath$ \theta $})p(\text{\boldmath$ y $}|\text{\boldmath$ \theta $})=g(\text{\boldmath$ \theta $})$ , respectively. We will consider variational inference methods, in which a member $q_{\lambda}(\text{\boldmath$ \theta $})$ of some parametric family of densities is used to approximate $p(\text{\boldmath$ \theta $}|\text{\boldmath$ y $})$ , where $\text{\boldmath$ \lambda $}\in\Lambda$ is a vector of variational parameters. For example, for the Gaussian family $\lambda$ would consist of the distinct elements of the mean vector and covariance matrix. Approximate Bayesian inference is then formulated as an optimization problem, where a measure of divergence between $q_{\lambda}(\text{\boldmath$ \theta $})$ and $p(\text{\boldmath$ \theta $}|\text{\boldmath$ y $})$ is minimized with respect to $\lambda$ . The Kullback-Leibler divergence

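$$\mbox{KL}(q_{\lambda}(\text{\boldmath$\theta$})\,\|\,p(\text{\boldmath$\theta$}|\text{\boldmath$y$}))=\int\log\frac{q_{\lambda}(\text{\boldmath$\theta$})}{p(\text{\boldmath$\theta$}|\text{\boldmath$y$})}\,q_{\lambda}(\text{\boldmath$\theta$})\,d\text{\boldmath$\theta$}$$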

is typically used, and we employ it here. If $p(\text{\boldmath$ y $})=\int p(\text{\boldmath$ \theta $})p(\text{\boldmath$ y $}|\text{\boldmath$ \theta $})d\text{\boldmath$ \theta $}$ denotes the marginal likelihood, then it is easily shown (see, for example, Ormerod and Wand (2010)) that

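$$\mbox{KL}(q_{\lambda}(\text{\boldmath$\theta$})\,\|\,p(\text{\boldmath$\theta$}|\text{\boldmath$y$}))=\log p(\text{\boldmath$y$})-\int q_{\lambda}(\text{\boldmath$\theta$})\log\frac{g(\text{\boldmath$\theta$})}{q_{\lambda}(\text{\boldmath$\theta$})}\,d\text{\boldmath$\theta$}=\log p(\text{\boldmath$y$})-\mathcal{L}(\text{\boldmath$\lambda$}),\qquad(1)$$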

where $\mathcal{L}(\text{\boldmath$ \lambda $})$ is called the variational lower bound. Because $\log p(\text{\boldmath$ y $})$ does not depend on $\lambda$ , minimization of the Kullback-Leibler divergence above with respect to $\lambda$ is equivalent to maximizing the variational lower bound $\mathcal{L}(\text{\boldmath$ \lambda $})$ .
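
The identity relating the Kullback-Leibler divergence to the lower bound follows in a few lines. The worked derivation below is a sketch that uses only the definitions given above, namely $p(\text{\boldmath$\theta$}|\text{\boldmath$y$})=g(\text{\boldmath$\theta$})/p(\text{\boldmath$y$})$ and the fact that $q_{\lambda}$ integrates to one; the shorthand $\bm{\theta}$, $\bm{y}$, $\bm{\lambda}$ denotes the same bold vectors as in the text.

```latex
\begin{align*}
\mathrm{KL}\bigl(q_{\lambda}(\bm{\theta})\,\|\,p(\bm{\theta}|\bm{y})\bigr)
  &= \int q_{\lambda}(\bm{\theta})
     \log\frac{q_{\lambda}(\bm{\theta})}{p(\bm{\theta}|\bm{y})}\,d\bm{\theta} \\
  &= \int q_{\lambda}(\bm{\theta})
     \log\frac{q_{\lambda}(\bm{\theta})\,p(\bm{y})}{g(\bm{\theta})}\,d\bm{\theta} \\
  &= \log p(\bm{y})\int q_{\lambda}(\bm{\theta})\,d\bm{\theta}
     - \int q_{\lambda}(\bm{\theta})
       \log\frac{g(\bm{\theta})}{q_{\lambda}(\bm{\theta})}\,d\bm{\theta} \\
  &= \log p(\bm{y}) - \mathcal{L}(\bm{\lambda}).
\end{align*}
```

Because the divergence is non-negative, this also shows that $\mathcal{L}(\text{\boldmath$\lambda$})\leq\log p(\text{\boldmath$y$})$, which is why $\mathcal{L}$ is called a lower bound.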

The lower bound takes the form of an intractable integral, so it seems challenging to optimize. However, notice that from (1) it can be written as an expectation with respect to $q_{\lambda}$ as

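$$\mathcal{L}(\text{\boldmath$\lambda$})=E_{q_{\lambda}}\left[\log g(\text{\boldmath$\theta$})-\log q_{\lambda}(\text{\boldmath$\theta$})\right],\qquad(2)$$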

which allows easy application of stochastic gradient ascent (SGA) methods (Robbins and Monro, 1951, Bottou, 2010). In SGA we start from an initial value $\text{\boldmath$ \lambda $}^{(0)}$ for $\lambda$ and update it recursively as

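$$\text{\boldmath$\lambda$}^{(i+1)}=\text{\boldmath$\lambda$}^{(i)}+\text{\boldmath$\rho$}_{i}\circ\widehat{\nabla_{\lambda}\mathcal{L}(\text{\boldmath$\lambda$}^{(i)})},$$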

where $\text{\boldmath$ \rho $}_{i}=(\rho_{i1},\dots,\rho_{im})^{\top}$ is a vector of step sizes, ‘ $\circ$ ’ denotes the element-wise product of two vectors, and $\widehat{\nabla_{\lambda}\mathcal{L}(\text{\boldmath$ \lambda $}^{(i)})}$ is an unbiased estimate of the gradient of $\mathcal{L}(\text{\boldmath$ \lambda $})$ at $\text{\boldmath$ \lambda $}=\text{\boldmath$ \lambda $}^{(i)}$ . For appropriate step size choices this will converge to a local mode of $\mathcal{L}(\text{\boldmath$ \lambda $})$ . Adaptive step size choices are often used in practice, and we use the ADADELTA method of Zeiler (2012).
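
The SGA recursion with ADADELTA step sizes can be sketched as follows. This is a minimal pure-Python illustration, not the authors' MATLAB implementation: the function names, the decay rate `rho=0.95`, the constant `eps=1e-6`, and the toy objective $-\frac{1}{2}\sum_{j}(\lambda_{j}-\lambda^{*}_{j})^{2}$ with Gaussian gradient noise are all assumptions made for the example.

```python
import math
import random


def adadelta_ascent(grad_estimate, lam0, n_steps, rho=0.95, eps=1e-6):
    """Stochastic gradient ascent with element-wise ADADELTA step sizes."""
    lam = list(lam0)
    acc_grad = [0.0] * len(lam)  # running average of squared gradients
    acc_step = [0.0] * len(lam)  # running average of squared updates
    for _ in range(n_steps):
        grad = grad_estimate(lam)
        for j, g in enumerate(grad):
            acc_grad[j] = rho * acc_grad[j] + (1.0 - rho) * g * g
            # The ratio of the two root-mean-squares plays the role of rho_ij.
            step = math.sqrt(acc_step[j] + eps) / math.sqrt(acc_grad[j] + eps) * g
            acc_step[j] = rho * acc_step[j] + (1.0 - rho) * step * step
            lam[j] += step  # ascent: move along the estimated gradient
    return lam


TARGET = [1.0, -2.0]  # maximiser of the toy objective (hypothetical)


def make_noisy_gradient(noise_sd, seed=0):
    """Unbiased noisy gradient of -0.5 * sum((lam - TARGET)^2)."""
    rng = random.Random(seed)

    def grad_estimate(lam):
        return [(t - l) + rng.gauss(0.0, noise_sd) for l, t in zip(lam, TARGET)]

    return grad_estimate


lam_hat = adadelta_ascent(make_noisy_gradient(noise_sd=0.1), [0.0, 0.0], n_steps=3000)
```

In an actual variational fit, `grad_estimate` would return a Monte Carlo estimate of the lower bound gradient rather than the gradient of a quadratic.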

To implement SGA, unbiased estimates of the gradient of the lower bound are required. These can be obtained directly by differentiating (2), and evaluating the expectation in a Monte Carlo fashion by simulating from $q_{\lambda}$. However, variance reduction methods for the gradient estimation are often also important for fast convergence and stability. One of the most useful is the ‘reparametrization trick’ (Kingma and Welling, 2014, Rezende et al., 2014). In this approach, it is assumed that an iterate $\theta$ can be generated from $q_{\lambda}$ by first drawing $\varepsilon$ from a density $f_{\varepsilon}$ which does not depend on $\lambda$, and then transforming $\varepsilon$ by a deterministic function $\text{\boldmath$ \theta $}=h(\text{\boldmath$ \varepsilon $},\text{\boldmath$ \lambda $})$ of $\varepsilon$ and $\lambda$. From (2), the lower bound can be written as the following expectation with respect to $f_{\varepsilon}$:

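$$\mathcal{L}(\bm{\lambda})=E_{f_{\varepsilon}}\left[\log g(h(\bm{\varepsilon},\bm{\lambda}))-\log q_{\lambda}(h(\bm{\varepsilon},\bm{\lambda}))\right].\qquad(3)$$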

Differentiating under the integral sign in (3) gives

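$$\nabla_{\lambda}\mathcal{L}(\bm{\lambda})=E_{f_{\varepsilon}}\left[\nabla_{\lambda}\left\{\log g(h(\bm{\varepsilon},\bm{\lambda}))-\log q_{\lambda}(h(\bm{\varepsilon},\bm{\lambda}))\right\}\right],\qquad(4)$$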

and approximating the expression (4) by Monte Carlo using one or more random draws from $f_{\varepsilon}$ gives an unbiased estimate of $\nabla_{\lambda}\mathcal{L}(\text{\boldmath$ \lambda $})$ . An intuitive reason for the success of the re-parameterization trick is that it allows gradient information from the log-posterior to be used, by moving the variational parameters inside $g(\text{\boldmath$ \theta $})$ in (3). Xu et al. (2018) show how the trick reduces the variance of the gradient estimates when $q_{\lambda}$ is a Gaussian with diagonal covariance matrix (the so-called ‘mean field’ Gaussian approximation). We employ the re-parameterization trick, and specify a function $h$ , for a skew-normal copula in Section 4.
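A small worked case may help fix ideas. The sketch below estimates the re-parameterized gradient for a univariate toy problem in which the exact gradient is available in closed form. The target $\log g(\theta)=-\frac{1}{2}(\theta-2)^{2}$, the Gaussian $q_{\lambda}$ with $\bm{\lambda}=(\mu,\sigma)$, and the function name `reparam_gradient` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_gradient(mu, sigma, n_draws=200_000):
    """Monte Carlo estimate of the gradient of L(mu, sigma) for a toy target.

    Assumed toy setting: log g(theta) = -0.5 * (theta - 2)^2 and
    q_lambda = N(mu, sigma^2), generated as theta = h(eps, lambda) = mu + sigma * eps.
    """
    eps = rng.standard_normal(n_draws)          # eps ~ f_eps, free of lambda
    theta = mu + sigma * eps                    # theta = h(eps, lambda)
    dlogg = -(theta - 2.0)                      # d log g / d theta
    # log q_lambda(h(eps, lambda)) = -log(sigma) - 0.5 * eps^2 + const,
    # so its gradient in (mu, sigma) is (0, -1 / sigma).
    grad_mu = np.mean(dlogg)                    # uses d theta / d mu = 1
    grad_sigma = np.mean(dlogg * eps) + 1.0 / sigma   # uses d theta / d sigma = eps
    return grad_mu, grad_sigma

mu, sigma = 0.5, 1.5
estimate = reparam_gradient(mu, sigma)
exact = (-(mu - 2.0), -sigma + 1.0 / sigma)     # closed form for this toy case
print(estimate, exact)
```

The gradient of the log-target enters each draw through the chain rule, which is the sense in which the trick moves the variational parameters inside $g$.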

2.2 Variational approximations through transformations

Let $t_{\gamma}$ be a family of one-to-one transformations onto the real line with parameter vector $\gamma$ . To construct our variational approximation, we transform each parameter as $\psi_{i}=t_{\gamma_{i}}(\theta_{i})$ and adopt a known distribution function $F(\text{\boldmath$ \psi $};\text{\boldmath$ \pi $})$ , with vector of parameters $\pi$ , for $\text{\boldmath$ \psi $}=(\psi_{1},\ldots,\psi_{m})^{\top}$ . For example, if $F$ is a Gaussian distribution function, then $\text{\boldmath$ \pi $}=(\text{\boldmath$ \mu $}_{\psi}^{\top},\mbox{vech}(\Sigma_{\psi}))^{\top}$ , where $\text{\boldmath$ \mu $}_{\psi}$ and $\Sigma_{\psi}$ are the mean and covariance matrix. If $p(\text{\boldmath$ \psi $};\text{\boldmath$ \pi $})=\frac{\partial^{m}}{\partial\psi_{1}\cdots\partial\psi_{m}}F(\text{\boldmath$ \psi $};\text{\boldmath$ \pi $})$ , then the density of the approximation can be recovered by computing the Jacobian of the element-wise transformation from $\theta$ to $\psi$ , so that

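$$q_{\lambda}(\bm{\theta})=p(\bm{\psi};\bm{\pi})\prod_{i=1}^{m}t_{\gamma_{i}}^{\prime}(\theta_{i}),\qquad(5)$$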

where the variational parameters are $\text{\boldmath$ \lambda $}^{\top}=(\text{\boldmath$ \gamma $}_{1}^{\top},\ldots,\text{\boldmath$ \gamma $}_{m}^{\top},\text{\boldmath$ \pi $}^{\top})$ and $t_{\gamma_{i}}^{\prime}(\theta_{i})=\frac{d\psi_{i}}{d\theta_{i}}$ . Moreover, if $F$ has known marginal distribution functions $F_{i}(\psi_{i};\text{\boldmath$ \pi $}_{i})$ and densities $p_{i}(\psi_{i};\text{\boldmath$ \pi $}_{i})$ for $i=1,\ldots,m$ , with $\text{\boldmath$ \pi $}_{i}\subseteq\text{\boldmath$ \pi $}$ , the marginal densities of the approximation are

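$$q_{\lambda_{i}}(\theta_{i})=p_{i}(\psi_{i};\bm{\pi}_{i})\,t_{\gamma_{i}}^{\prime}(\theta_{i}),\qquad(6)$$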

with $\text{\boldmath$ \lambda $}_{i}^{\top}=(\text{\boldmath$ \gamma $}_{i}^{\top},\text{\boldmath$ \pi $}_{i}^{\top})$ a sub-vector of $\text{\boldmath$ \lambda $}^{\top}$ .

The density at (5) can also be represented using its copula decomposition as follows. If $Q_{\lambda_{i}}(\theta_{i})=\int_{-\infty}^{\theta_{i}}q_{\lambda_{i}}(s)\mbox{d}s$ is the distribution function of $\theta_{i}$ , then

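$$q_{\lambda}(\bm{\theta})=c(\bm{u};\tilde{\bm{\pi}})\prod_{i=1}^{m}q_{\lambda_{i}}(\theta_{i}),\qquad(7)$$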

where $\text{\boldmath$ u $}=(u_{1},\ldots,u_{m})^{\top}$ , $u_{i}=Q_{\lambda_{i}}(\theta_{i})$ and $c$ is an $m$ -dimensional copula density with parameter vector $\tilde{\text{\boldmath$ \pi $}}$ . In much of the existing copula modeling literature, a parametric copula is selected for $c$ . When this is combined with pre-specified margins, this results in a flexible distributional form for $q_{\lambda}$ ; for example, in the variational inference literature Tran et al. (2015) use a vine copula. However, in this paper the copula is instead derived directly from (5) and (6) by inverting Sklar’s theorem, with copula density

$$c(\bm{u};\tilde{\bm{\pi}})=\frac{p(\bm{\psi};\bm{\pi})}{\prod_{i=1}^{m}p_{i}(\psi_{i};\bm{\pi}_{i})}$$

and copula function

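$$C(\bm{u};\tilde{\bm{\pi}})=F\left(F_{1}^{-1}(u_{1};\bm{\pi}_{1}),\ldots,F_{m}^{-1}(u_{m};\bm{\pi}_{m});\bm{\pi}\right)$$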

determined by $F$ . Such a copula is called an ‘inversion copula’ (Nelsen, 2006, pp.51–52) or an ‘implicit copula’ (McNeil et al., 2005). In general, the copula parameters $\tilde{\text{\boldmath$ \pi $}}$ are given by $\pi$ , but with additional constraints to ensure they are identifiable in the copula; see Smith and Maneesoonthorn (2018) for examples. However, here the elements of $\pi$ are also parameters of the margins at (6), and this identifies $\pi$ in $q_{\lambda}$ without any additional constraints.
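For completeness, the agreement between the transformation form and the copula form of the density can be verified in one line. The check below assumes the product forms implied by the change of variables described above, namely $q_{\lambda}(\bm{\theta})=p(\bm{\psi};\bm{\pi})\prod_{i}t_{\gamma_{i}}^{\prime}(\theta_{i})$ and $q_{\lambda_{i}}(\theta_{i})=p_{i}(\psi_{i};\bm{\pi}_{i})t_{\gamma_{i}}^{\prime}(\theta_{i})$, together with the implicit copula density $p(\bm{\psi};\bm{\pi})/\prod_{i}p_{i}(\psi_{i};\bm{\pi}_{i})$.

```latex
\begin{align*}
c(\bm{u};\tilde{\bm{\pi}})\prod_{i=1}^{m}q_{\lambda_{i}}(\theta_{i})
&=\frac{p(\bm{\psi};\bm{\pi})}{\prod_{i=1}^{m}p_{i}(\psi_{i};\bm{\pi}_{i})}
  \prod_{i=1}^{m}p_{i}(\psi_{i};\bm{\pi}_{i})\,t_{\gamma_{i}}^{\prime}(\theta_{i})\\
&=p(\bm{\psi};\bm{\pi})\prod_{i=1}^{m}t_{\gamma_{i}}^{\prime}(\theta_{i})
 =q_{\lambda}(\bm{\theta}).
\end{align*}
```

The marginal densities of $F$ cancel, so the joint density never requires the distribution functions $Q_{\lambda_{i}}$ to be evaluated.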

The most popular choice for $F$ is a Gaussian distribution, resulting in the Gaussian copula (Song, 2000). More recently, there has been growing interest in selecting other distributions, such as the skew-t distribution (Demarta and McNeil, 2005, Smith et al., 2012) or those arising from state space models (Smith and Maneesoonthorn, 2018). These can produce distributional families for $q_{\lambda}$ that are more flexible in their dependence structures. Later, we will illustrate our approach with sparse Gaussian and skew-normal distributions for $F$ , but note that other parametric distributions can also be used.

We observe that the expression at (5) is much easier to employ in variational inference than that at (7) for three reasons. First, as mentioned above, the constraints on $\pi$ required to identify $\tilde{\text{\boldmath$ \pi $}}$ do not need to be elucidated as $\pi$ is fully identified in (5). Second, evaluating (7) requires repeated computation of the vector $\text{\boldmath$ u $}=(Q_{\lambda_{1}}(\theta_{1}),\ldots,Q_{\lambda_{m}}(\theta_{m}))^{\top}$ which involves $m$ numerical integrations, whereas evaluating (5) does not. Third, optimizing the lower bound with respect to $\tilde{\text{\boldmath$ \pi $}}$ proves more difficult than the unconstrained $\pi$ ; an observation made previously by Han et al. (2016) for Gaussian copula variational approximation.

2.3 Two transformations

Key to the success of our approach is the choice of an appropriate family of transformations $t_{\gamma}$ . Because $\psi_{i}=t_{\gamma_{i}}(\theta_{i})$ has distribution function $F_{i}$ , which is either Gaussian or skew-normal in our paper, we consider two choices that have proven successful in transforming data to near normality or symmetry. The first is the single parameter transformation of Yeo and Johnson (2000) (YJ hereafter), which extends the Box-Cox transformation to the entire real line. For $0<\gamma<2$ , it is given by

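$$t_{\gamma}(\theta)=\begin{cases}-\dfrac{(1-\theta)^{2-\gamma}-1}{2-\gamma}&\text{if }\theta<0,\\[2ex]\dfrac{(\theta+1)^{\gamma}-1}{\gamma}&\text{if }\theta\geq 0.\end{cases}$$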
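As an informal companion to the definition, the sketch below codes the standard Yeo-Johnson transformation for $0<\gamma<2$, its inverse and its derivative with respect to $\theta$. The function names are hypothetical and this is not the MATLAB code referred to in the Supplementary Materials; the formulas follow Yeo and Johnson (2000).

```python
import numpy as np

def yj(theta, gamma):
    """Yeo-Johnson transformation t_gamma(theta) for 0 < gamma < 2."""
    theta = np.asarray(theta, dtype=float)
    a = np.abs(theta) + 1.0                 # equals theta + 1 or 1 - theta
    pos = (a**gamma - 1.0) / gamma
    neg = -(a**(2.0 - gamma) - 1.0) / (2.0 - gamma)
    return np.where(theta >= 0, pos, neg)

def yj_inverse(psi, gamma):
    """Inverse transformation; psi and theta always share the same sign."""
    psi = np.asarray(psi, dtype=float)
    b = np.abs(psi)
    pos = (1.0 + gamma * b)**(1.0 / gamma) - 1.0
    neg = 1.0 - (1.0 + (2.0 - gamma) * b)**(1.0 / (2.0 - gamma))
    return np.where(psi >= 0, pos, neg)

def yj_derivative(theta, gamma):
    """Derivative d psi / d theta, which is strictly positive."""
    theta = np.asarray(theta, dtype=float)
    a = np.abs(theta) + 1.0
    return np.where(theta >= 0, a**(gamma - 1.0), a**(1.0 - gamma))

theta = np.linspace(-3.0, 3.0, 13)
print(yj(theta, 0.5))
print(yj_inverse(yj(theta, 0.5), 0.5))      # recovers theta
```

Setting $\gamma=1$ gives the identity map, which is why a Gaussian approximation is nested within the copula approximation.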

The second is based on the two parameter (monotonic) G&H transformation of Tukey (1977), an overview of which can be found in Headrick et al. (2008). This is used to transform a standard Gaussian variable to another, which can be asymmetric and heavy-tailed (Peters et al., 2016). Thus, the G&H transformation is one from normality, so that we use it for $t_{\gamma}^{-1}$ . For $\gamma=(g,0<h<1)$ , set

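$$t_{\gamma}^{-1}(\psi)=\begin{cases}\dfrac{\exp(g\psi)-1}{g}\exp\left(\dfrac{h\psi^{2}}{2}\right)&\text{if }g\neq 0,\\[2ex]\psi\exp\left(\dfrac{h\psi^{2}}{2}\right)&\text{if }g=0,\end{cases}$$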

then $t_{\gamma}$ can be obtained by numerical inversion. We bound $h<1$ because it corresponds to a G&H transformation from a standard Gaussian to another random variable with a first moment that exists; see Peters et al. (2016, Sec. 5.1).
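The numerical inversion can be sketched as follows, assuming the usual form of Tukey's G&H map, $(\exp(g\psi)-1)g^{-1}\exp(h\psi^{2}/2)$, with limit $\psi\exp(h\psi^{2}/2)$ as $g\rightarrow 0$. Bisection is used here only because it is simple and relies on nothing but monotonicity; the function names and the choice of bisection are assumptions, and a faster root-finder could be substituted.

```python
import numpy as np

def gh_inverse_transform(psi, g, h):
    """G&H map t_gamma^{-1}(psi); the g -> 0 limit is psi * exp(h * psi^2 / 2)."""
    psi = np.asarray(psi, dtype=float)
    tail = np.exp(0.5 * h * psi**2)
    if abs(g) < 1e-12:
        return psi * tail
    return np.expm1(g * psi) / g * tail

def gh_transform(theta, g, h, n_iter=200):
    """t_gamma(theta) for scalar theta, by bisection on the monotone G&H map."""
    lo, hi = -1.0, 1.0
    while gh_inverse_transform(lo, g, h) > theta:   # expand bracket downwards
        lo *= 2.0
    while gh_inverse_transform(hi, g, h) < theta:   # expand bracket upwards
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if gh_inverse_transform(mid, g, h) < theta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

psi_hat = gh_transform(4.0, g=0.5, h=0.2)
print(psi_hat, gh_inverse_transform(psi_hat, 0.5, 0.2))   # second value is close to 4
```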

For both transformations, $t_{\gamma}:\mathbb{R}\rightarrow\mathbb{R}$, so that if a parameter $\theta_{i}$ is constrained we first transform it to the real line; for example, with a scale or variance parameter we set $\theta_{i}$ to its logarithm. Interestingly, when implementing SGA, $t_{\gamma}$ itself is never evaluated, whereas $t_{\gamma}^{-1}$ is evaluated repeatedly. Table 1 provides both transformations, along with expressions for the derivatives with respect to the model and variational parameters that are required to implement the SGA algorithm. For both transformations these are all fast to compute.

3 Gaussian Copula Variational Approximation

3.1 Gaussian copula factor specification

The simplest implicit copula is the Gaussian copula, where $F(\text{\boldmath$ \psi $};\text{\boldmath$ \pi $})=\Phi_{m}(\text{\boldmath$ \psi $};\text{\boldmath$ \mu $}_{\psi},\Sigma_{\psi})$ is a Gaussian distribution function with mean $\text{\boldmath$ \mu $}_{\psi}$ and covariance matrix $\Sigma_{\psi}$ . In constructing a Gaussian copula, it is usual to also set $\text{\boldmath$ \mu $}_{\psi}=(\mu_{\psi,1},\ldots,\mu_{\psi,m})^{\top}=\bm{0}$ and $\mbox{diag}(\Sigma_{\psi})=(\sigma^{2}_{\psi,1},\ldots,\sigma^{2}_{\psi,m})=(1,1,\ldots,1)$ because these parameters are unidentified in the Gaussian copula function; for example, see the discussion in Song (2000). However, we do not need to do so here because these parameters are fully identified in the density $q_{\lambda}$ at (5) as they are also parameters of its margins, with $\text{\boldmath$ \pi $}_{i}=(\mu_{\psi,i},\sigma^{2}_{\psi,i})^{\top}$ at (6). To illustrate, Figure 1 plots $q_{\lambda_{i}}$ for the YJ transformation, showing that this density can capture both positive or negative skew. Moreover, the direction and level of skew can differ in each margin, depending on $\gamma$ , making $q_{\lambda}$ a substantially more flexible approximation than a Gaussian.
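The effect of $\gamma$ on a single margin can be examined numerically. The sketch below evaluates a marginal density of the form $p_{i}(t_{\gamma}(\theta);\mu,\sigma^{2})\,t_{\gamma}^{\prime}(\theta)$ with Gaussian $p_{i}$ and the YJ transformation on a grid, then reports its total mass and skewness. The grid, the parameter values and the function names are illustrative assumptions, and the moments are simple Riemann sums.

```python
import numpy as np

def yj_marginal_density(theta, gamma, mu, sigma):
    """Density p_i(t_gamma(theta); mu, sigma^2) * t_gamma'(theta), Gaussian p_i."""
    a = np.abs(theta) + 1.0
    psi = np.where(theta >= 0,
                   (a**gamma - 1.0) / gamma,
                   -(a**(2.0 - gamma) - 1.0) / (2.0 - gamma))
    dpsi = np.where(theta >= 0, a**(gamma - 1.0), a**(1.0 - gamma))
    z = (psi - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi)) * dpsi

def skewness(theta, dens):
    """Standardised third central moment, computed by a Riemann sum."""
    dx = theta[1] - theta[0]
    mean = np.sum(theta * dens) * dx
    var = np.sum((theta - mean)**2 * dens) * dx
    return np.sum((theta - mean)**3 * dens) * dx / var**1.5

grid = np.linspace(-60.0, 60.0, 400_001)
dx = grid[1] - grid[0]
for gamma in (0.5, 1.0, 1.5):
    dens = yj_marginal_density(grid, gamma, mu=0.0, sigma=1.0)
    print(gamma, np.sum(dens) * dx, skewness(grid, dens))
```

In this setting values of $\gamma$ below one give positive skew, values above one give negative skew, and $\gamma=1$ returns the symmetric Gaussian.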

When $\theta$ is high-dimensional, we follow Ong et al. (2018) and adopt a factor structure for $\Sigma_{\psi}$ as follows. Let $B$ be an $m\times k$ matrix with $k\ll m$. For identifiability reasons it is assumed that the upper triangle of $B$ is zero. Let $\bm{d}=(d_{1},\dots,d_{m})^{\top}$ be a vector of parameters with $d_{i}>0$, and denote by $D$ the $m\times m$ diagonal matrix with entries $\bm{d}$. We assume that

$$\Sigma_{\psi}=BB^{\top}+D^{2},$$

so that the number of parameters in $\Sigma_{\psi}$ grows only linearly with $m$ if $k\ll m$ is kept fixed. We note that this copula is equivalent to the Gaussian factor copula suggested by Murray et al. (2013) and Oh and Patton (2017) to model data, although they do not use it as a variational approximation. The Gaussian random vector has the generative representation $\text{\boldmath$ \psi $}=\text{\boldmath$ \mu $}+B\text{\boldmath$ z $}+D\text{\boldmath$ \epsilon $}$, where $\text{\boldmath$ z $}=(z_{1},\dots,z_{k})^{\top}\sim N(0,I_{k})$ and $\text{\boldmath$ \epsilon $}\sim N(0,I_{m})$. By setting $\text{\boldmath$ \varepsilon $}^{\top}=(\text{\boldmath$ z $}^{\top},\text{\boldmath$ \epsilon $}^{\top})$, $h(\text{\boldmath$ \varepsilon $},\text{\boldmath$ \lambda $})=(t_{\gamma_{1}}^{-1}(\psi_{1}),\ldots,t_{\gamma_{m}}^{-1}(\psi_{m}))^{\top}$, and $\text{\boldmath$ \pi $}=(\text{\boldmath$ \mu $}_{\psi}^{\top},\mbox{vech}(B)^{\top},\text{\boldmath$ d $}^{\top})$, the closed form re-parameterization gradients in a Gaussian variational approximation with factor covariance structure given in Ong et al. (2018) can be used. (Here the ‘vech’ operator is the half-vectorization of a rectangular matrix, defined for an $(n\times K)$ matrix $A$ with $n>K$ as $\text{vech}(A)=\left(A_{1:n,1}^{\top},\dots,A_{K:n,K}^{\top}\right)^{\top}$ with $A_{k:n,k}=\left(A_{k,k},\dots,A_{n,k}\right)^{\top}$ for $k=1,\dots,K$.)
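The generative representation is easy to simulate. The sketch below draws $\bm{\psi}=\bm{\mu}+B\bm{z}+D\bm{\epsilon}$ for a small toy problem, compares the sample covariance with $BB^{\top}+D^{2}$, and counts the free elements of $(\bm{\mu}_{\psi},\mbox{vech}(B),\bm{d})$. The dimensions and parameter values are arbitrary assumptions, and the transformation step $t_{\gamma_{i}}^{-1}$ is omitted to keep the focus on the factor structure.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 6, 2                               # toy dimensions (hypothetical)

B = np.tril(rng.standard_normal((m, k)))  # zero upper triangle for identifiability
d = np.exp(0.3 * rng.standard_normal(m))  # positive idiosyncratic scales
mu = rng.standard_normal(m)

n = 200_000
z = rng.standard_normal((n, k))           # common factors, z ~ N(0, I_k)
eps = rng.standard_normal((n, m))         # idiosyncratic noise, eps ~ N(0, I_m)
psi = mu + z @ B.T + eps * d              # psi = mu + B z + D eps, one draw per row

sigma = B @ B.T + np.diag(d**2)           # implied covariance matrix
sample_cov = np.cov(psi, rowvar=False)

# Free elements: mu (m), vech(B) (m*k - k*(k-1)/2) and d (m).
n_free = m + (m * k - k * (k - 1) // 2) + m
print(np.max(np.abs(sample_cov - sigma)), n_free)
```

Each draw costs $O(mk)$ operations, and no $m\times m$ matrix needs to be formed or factorized when sampling.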

3.2 Application: ordinal time series copula model

3.2.1 The model and extended likelihood

To illustrate our proposed variational approximation we use it to estimate a complex model with a complex augmented posterior, where its greater flexibility may increase the accuracy of inference compared to simpler approximations. We consider the copula time series model for an ordinal-valued random vector $\bm{Y}=(Y_{1},\ldots,Y_{T})^{\top}$ proposed by Loaiza-Maya and Smith (2019). These authors use a $T$ -dimensional parsimonious copula with density $c^{DV}(\text{\boldmath$ v $})$ , where $\text{\boldmath$ v $}=(v_{1},\ldots,v_{T})^{\top}$ , to capture serial dependence in $\bm{Y}$ (this is not to be confused with the use of another copula for the variational approximation). The time series is assumed to be stationary with marginal distribution function $G$ , which is estimated non-parametrically in an initial step using the empirical distribution function.

The time series copula employed is a parsimonious drawable vine (D-vine) of Markov order $p$, as given in Smith (2015), and defined as follows. Let $\{V_{t}\}_{t=1}^{T}$ be a stochastic process with $V_{t}=G(Y_{t})$, so that $V_{t}$ is marginally uniform. For $s<t$, denote $v_{t|s}=F_{V}(v_{t}|v_{s},\ldots,v_{t-1})$, $v_{s|t}=F_{V}(v_{s}|v_{s+1},\ldots,v_{t})$ and $v_{t|t}=v_{t}$, where $F_{V}(v_{t}|v_{s},\ldots,v_{t-1})$ is the distribution function of $V_{t}|V_{s}=v_{s},\ldots,V_{t-1}=v_{t-1}$ evaluated at $v_{t}$, and $F_{V}(v_{s}|v_{s+1},\ldots,v_{t})$ is the distribution function of $V_{s}|V_{s+1}=v_{s+1},\ldots,V_{t}=v_{t}$ evaluated at $v_{s}$. Then the D-vine copula density is the product

[TABLE]

of bivariate copula densities $c^{\mbox{\tiny MIX}}_{1},\ldots,c^{\mbox{\tiny MIX}}_{p}$ called ‘pair-copulas’ (Aas et al., 2009), each with individual parameter vector $\text{\boldmath$ \eta $}_{k}$ . This D-vine copula therefore has parameter vector $\text{\boldmath$ \eta $}=(\text{\boldmath$ \eta $}_{1}^{\top},\ldots,\text{\boldmath$ \eta $}_{p}^{\top})^{\top}$ , and is parsimonious because $|\text{\boldmath$ \eta $}|$ does not increase with $T$ . To capture the heteroskedasticity that exists in most ordinal-valued time series Loaiza-Maya and Smith (2019) use a five parameter mixture copula for $c^{\mbox{\tiny MIX}}_{k}$ , which we also use here and is outlined in Part A of the Online Appendix, leading to a total of $|\text{\boldmath$ \eta $}|=5p$ model parameters. Given $v$ , the arguments $\{v_{t|s},v_{s|t};t=2,\ldots,T,s<t\}$ of the pair-copulas in (9) are computed using the recursive Algorithm 1 in Smith (2015).

It is widely known (Song, 2000, Genest and Nešlehová, 2007) that the mass function $p(\text{\boldmath$ y $}|\text{\boldmath$ \eta $})$ of this discrete-margined copula model is computationally intractable, so we use the extended likelihood of Smith and Khaled (2012) instead. This employs the vector $\bm{V}=(V_{1},\ldots,V_{T})^{\top}$ , such that the joint mass function of $(\bm{Y}^{\top},\bm{V}^{\top})$ is

[TABLE]

with the indicator function ${\cal I}(X)=1$ if $X$ is true, and zero otherwise. It is straightforward to show that the margin in $y$ of (10) is the required mass function $p(\text{\boldmath$ y $}|\text{\boldmath$ \eta $})$. Evaluating the extended likelihood at (10) avoids the computational burden of evaluating $p(\text{\boldmath$ y $}|\text{\boldmath$ \eta $})$ directly.

3.2.2 The variational approximation

We follow Loaiza-Maya and Smith (2019) and estimate the model by setting $\text{\boldmath$ \theta $}=(\text{\boldmath$ \eta $}^{\top},\text{\boldmath$ v $}^{\top})^{\top}$ and approximating the augmented posterior $p(\text{\boldmath$ \theta $}|\text{\boldmath$ y $})\propto p(\text{\boldmath$ y $},\text{\boldmath$ v $}|\text{\boldmath$ \eta $})p(\text{\boldmath$ \eta $})$ , which uses the extended likelihood and a proper uniform prior $p(\text{\boldmath$ \eta $})$ . The target distribution therefore has dimension $m=|\text{\boldmath$ \theta $}|=5p+T$ . These authors use the variational approximation $q_{\lambda}(\text{\boldmath$ \theta $})=q_{\lambda^{a}}(\text{\boldmath$ \eta $})q_{\lambda^{b}}(\text{\boldmath$ v $})$ , assuming independence between $\eta$ and $v$ , and a Gaussian distribution with a factor covariance structure for $q_{\lambda^{a}}$ . However, because each $v_{t}$ is constrained to $[a_{t},b_{t})$ , it is transformed to the real line as $\tilde{v}_{t}=\Phi_{1}^{-1}((v_{t}-a_{t})/(b_{t}-a_{t}))$ , where $\Phi_{1}$ is the distribution function of a standard Gaussian, and independent Gaussians used as approximations for $\tilde{v}_{1},\ldots,\tilde{v}_{T}$ .

Loaiza-Maya and Smith (2019) label this approximation ‘VA2’, and we extend it as follows. For $q_{\lambda^{a}}$ we use a Gaussian copula formed through the YJ transformation with a $k$ factor structure, so that $\text{\boldmath$ \lambda $}^{a}$ has $5p(k+3)-k(k-1)/2$ elements (the unique elements in the factor decomposition plus the YJ transformation parameters). For each $\tilde{v}_{t}$ we use a normal approximation after a YJ transformation, so that $\text{\boldmath$ \lambda $}^{b}$ has $3T$ elements (the means and variances of the Gaussians, plus the YJ transformation parameters). The full set of variational parameters is $\text{\boldmath$ \lambda $}=(\text{\boldmath$ \lambda $}^{a},\text{\boldmath$ \lambda $}^{b})^{\top}$. They are calibrated using Algorithm 1 of Loaiza-Maya and Smith (2019), which employs SGA with control variates and the analytical gradient $\nabla_{\lambda}q_{\lambda}$; the latter is given in Appendix A for our copula approximation outlined here.
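The bookkeeping in this subsection can be sketched in a few lines: the probit map that takes $v_{t}$ from its interval to the real line, its inverse, and the dimension counts stated above. The function names are hypothetical, the interval end points in the usage line are made-up numbers, and the map is applied only at interior points of the interval.

```python
from statistics import NormalDist

std_normal = NormalDist()                 # standard Gaussian, Phi_1 in the text

def to_real_line(v, a, b):
    """Map v in the open interval (a, b) to the real line via the probit."""
    return std_normal.inv_cdf((v - a) / (b - a))

def from_real_line(v_tilde, a, b):
    """Inverse map from the real line back to (a, b)."""
    return a + (b - a) * std_normal.cdf(v_tilde)

def n_variational_parameters(p, T, k):
    """Return (m, |lambda^a|, |lambda^b|) using the counts given in the text."""
    n_eta = 5 * p                                      # model parameters
    n_a = n_eta * (k + 3) - k * (k - 1) // 2           # factor copula for eta
    n_b = 3 * T                                        # mean, variance, gamma per v_t
    return n_eta + T, n_a, n_b

v_tilde = to_real_line(0.37, a=0.30, b=0.45)           # hypothetical interval
print(v_tilde, from_real_line(v_tilde, 0.30, 0.45))
print(n_variational_parameters(p=3, T=264, k=3))
```

With $p=3$, $T=264$ and $k=3$ the function returns a target dimension of 279, with 87 and 792 variational parameters for the two blocks.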

3.2.3 Empirical illustration: monthly counts of attempted murder

We fit the time series model in Section 3.2.1 to $T=264$ monthly counts of Attempted Murder in New South Wales, Australia. Plots of the time series and the empirical distribution function used for margin $G$ can be found in Loaiza-Maya and Smith (2019, Fig. 1). The parsimonious D-vine in (9) has Markov order $p=3$, and the target density is complex with dimension $m=5p+T=279$. We fit three parsimonious variational approximations: (i) the Gaussian copula outlined above with $k=3$ factors, (ii) a Gaussian distribution with factor covariance and $k=3$ factors, and (iii) a fully mean field Gaussian. Note that (ii) is equivalent to our copula approximation but with all YJ parameters set to $\gamma_{i}=1$ (i.e., an identity transformation), as is (iii) but with the additional constraint that $\Sigma_{\psi}$ is diagonal. Figure 2 plots lower bound values against step number for all three methods using the same SGA algorithm, and the copula approximation clearly dominates.

To assess the accuracy of the three variational approximations, we also estimate the posterior using the (slow, but exact) data augmentation MCMC method of Smith and Khaled (2012). Figure 3 depicts the accuracy of the first three marginal posterior moments of the variational approximations. The panels provide scatterplots of the true moments against their approximations, with a blue scatter for the proposed copula approximation, and a red scatter for the Gaussian approximation. The left-hand panels give results for $\eta$ and the right-hand panels for $v$. More accurate variational approximations result in scatters that lie closer to the 45 degree line, and we make two observations. First, panels (e,f) show that the true posteriors are skewed, and that the copula approximation does a very good job of estimating the skew. Second, panel (c) reveals that by capturing the third moment in the augmented vector $\text{\boldmath$ \theta $}=(\text{\boldmath$ \eta $},\text{\boldmath$ v $})$, the posterior standard deviation of $\eta$ is also estimated more accurately. Figure 4 compares the marginal densities for the four parameters which exhibit the most skew, and the tails are more accurately estimated using the copula approximation.

4 Skew-Normal Copula Approximation

4.1 Copula specification

An alternative implicit copula that we consider is based on the skew-normal distribution of Azzalini and Dalla Valle (1996) and Azzalini and Capitanio (2003). In this case, the transformed parameters $\bm{\psi}$ are assumed to have joint density

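$$p(\bm{\psi};\bm{\pi})=2\,\phi_{m}(\bm{\psi};\bm{\mu}_{\psi},\Sigma_{\psi})\,\Phi_{1}\left(\bm{\alpha}_{\psi}^{\top}S_{\psi}^{-1/2}(\bm{\psi}-\bm{\mu}_{\psi})\right),\qquad(11)$$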

where $\phi_{m}$ denotes an $m$ -dimensional Gaussian density, $S_{\psi}=\text{diag}(\sigma^{2}_{\psi,1},\dots,\sigma^{2}_{\psi,m})$ , and $\sigma^{2}_{\psi,i}$ is the ith diagonal element of $\Sigma_{\psi}$ . The parameters $\text{\boldmath$ \alpha $}_{\psi}$ determine the level of skew in the marginals of $\psi$ , and when $\text{\boldmath$ \alpha $}_{\psi}=\bm{0}$ the distribution reduces to a Gaussian. As noted in Section 2.2, the parameters $\{\bm{\mu}_{\psi},\Sigma_{\psi},\bm{\alpha}_{\psi}\}$ are fully identified in the representation of $q_{\lambda}$ at (5), whereas they are not if (11) is used only for the construction of the copula.

Demarta and McNeil (2005), Smith et al. (2012) and Yoshiba (2018) show that implicit copulas constructed from skew-elliptical distributions are more flexible than elliptical copulas because they allow for asymmetric dependence (this is not to be confused with asymmetry of the marginal distributions $q_{\lambda_{i}}$). Here, we focus on the skew-normal copula because it is typically faster and easier to calibrate than the skew-t copula. When $\text{\boldmath$ \alpha $}_{\psi}\neq\bm{0}$ it captures asymmetric dependence, making it more flexible than the Gaussian copula considered in Section 3, although the same factor structure discussed in Section 3.1 is adopted for the scale matrix $\Sigma_{\psi}$. Therefore, the approximation $q_{\lambda}(\text{\boldmath$ \theta $})$ to the target $p(\text{\boldmath$ \theta $}|\text{\boldmath$ y $})$ has variational parameters $\bm{\lambda}=(\bm{\mu}_{\psi}^{\top},\text{vech}(B)^{\top},\bm{d}^{\top},\bm{\alpha}_{\psi}^{\top},\bm{\gamma}^{\top})^{\top}$, where $B$ and $d$ are as defined in Section 3.1.

In our empirical examples, we employ the re-parametrization trick to reduce the variance of the gradient estimate. This uses a simple generative representation of $\bm{\psi}$ in terms of standardized random components. Using the properties of the skew-normal distribution (Azzalini and Dalla Valle, 1996), the following generative representation for $\bm{\psi}$ can be derived (see Part B of the Online Appendix for details). If $\Omega_{\psi}=S_{\psi}^{-1/2}\Sigma_{\psi}S_{\psi}^{-1/2}$ , $\bm{\delta}_{\psi}=\left(1+\bm{\alpha}_{\psi}^{\top}\Omega_{\psi}\bm{\alpha}_{\psi}\right)^{-1/2}\Omega_{\psi}\bm{\alpha}_{\psi}$ and $\bm{\tilde{\delta}}_{\psi}=S_{\psi}^{1/2}\bm{\delta}_{\psi}$ , then

[TABLE]

where $r\sim N\left(0,1\right)$ , $\varepsilon_{0}\sim N\left(0,1\right)$ , $\bm{z}\sim N\left(\bm{0},I_{k}\right)$ , $\bm{\epsilon}\sim N\left(\bm{0},I_{m}\right)$ , is distributed skew-normal with density at (11). Setting $\text{\boldmath$ \varepsilon $}^{\top}=(r,\varepsilon_{0},\text{\boldmath$ z $}^{\top},\text{\boldmath$ \epsilon $}^{\top})$ and $h(\text{\boldmath$ \varepsilon $},\text{\boldmath$ \lambda $})=(t_{\gamma_{1}}^{-1}(\psi_{1}),\ldots,t_{\gamma_{m}}^{-1}(\psi_{m}))^{\top}$ , the gradient at (4) can be evaluated by first drawing $\varepsilon$ from an $N(\bm{0},I)$ distribution, and computing the derivatives analytically; see Appendix B for details.
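To illustrate how a skew-normal vector can be generated from standardized Gaussian components, the sketch below uses the standard additive representation of Azzalini and Dalla Valle (1996) with a dense scale matrix. This is an assumption made for illustration: it does not exploit the factor structure and may differ in form from the representation used in the paper. The matrix `Sigma`, the vectors `mu` and `alpha`, and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 3
Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 0.5]])          # hypothetical scale matrix
mu = np.array([0.5, -1.0, 0.0])
alpha = np.array([3.0, -2.0, 0.5])           # skew parameters

s = np.sqrt(np.diag(Sigma))                  # diagonal of S^{1/2}
Omega = Sigma / np.outer(s, s)               # S^{-1/2} Sigma S^{-1/2}
delta = Omega @ alpha / np.sqrt(1.0 + alpha @ Omega @ alpha)
delta_tilde = s * delta                      # S^{1/2} delta

# Standardised skew-normal: x = delta * |r| + L w, with L L^T = Omega - delta delta^T.
L = np.linalg.cholesky(Omega - np.outer(delta, delta))
n = 400_000
r = rng.standard_normal(n)
w = rng.standard_normal((n, m))
x = np.abs(r)[:, None] * delta + w @ L.T
psi = mu + s * x                             # location and scale applied per element

expected_mean = mu + np.sqrt(2.0 / np.pi) * delta_tilde
expected_cov = Sigma - (2.0 / np.pi) * np.outer(delta_tilde, delta_tilde)
print(psi.mean(axis=0), expected_mean)
```

The half-normal term $|r|$ is the only source of asymmetry, and all random inputs are free of the variational parameters, which is what the re-parametrization trick requires.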

4.2 Examples

To illustrate the use of a skew-normal copula as a variational approximation, we employ it to approximate the posterior of several logistic regressions examined previously in Ong et al. (2018).

4.2.1 Mixed logistic regression

The first uses the polypharmacy longitudinal data in Hosmer et al. (2013), which features data on 500 subjects over 7 years. The logistic regression is specified fully in Ong et al. (2018), and it includes 8 fixed effects (including an intercept), plus one subject-based $N(0,\exp(2\zeta))$ random effect. The following approximations are fitted to the augmented posterior of $\theta$ , which comprises $\zeta$ , the 8 fixed effect coefficients, and the 500 random effect values:

(A1) Mean Field Gaussian: independent univariate Gaussians

(A2) Mean Field YJ Transform: independent univariate distributions with densities at (6), where $p_{i}(\psi_{i};\text{\boldmath$ \pi $}_{i})=\phi_{1}(\psi_{i};\mu_{\psi_{i}},\sigma^{2}_{\psi_{i}})$ is a Gaussian density and $t_{\gamma_{i}}$ is a YJ transform

(A3) Gaussian: as in Ong et al. (2018)

(A4) Skew-normal

(A5) Gaussian Copula: as outlined in Section 3.1, with $t_{\gamma_{i}}$ a YJ transform

(A6) Skew-normal Copula: as outlined in Section 4.1, where $t_{\gamma_{i}}$ is a YJ transform

(A7) Gaussian Copula: as outlined in Section 3.1, with $t_{\gamma_{i}}$ an inverse G&H transform

(A8) Skew-normal Copula: as outlined in Section 4.1, where $t_{\gamma_{i}}$ is an inverse G&H transform

In approximations A3–A8, a factor structure with $k=5$ factors is used for the variance (A3) or scale matrix (A4) of the distribution, or the copula parameter matrix (A5–A8). Thus, A4 extends the approximation of Ormerod (2011) to include a factor scale matrix, while A5 and A7 extend the approximation of Han et al. (2016) to have a factor copula parameter matrix and parametric margins constructed from the two transformations. For each approximation Table 2 lists the number of variational parameters $|\bm{\lambda}|$ , average lower bound value over the last 1000 steps of the SGA algorithm, and the time to complete 1000 steps using MATLAB on a standard laptop. Comparing the lower bound values for A2 and A1, it can be seen that allowing for asymmetry in the margins improves the approximation markedly; although using the skew-normal A4 is not as effective. The most accurate approximations are the Gaussian copulas A5 and A7. The time to complete 1000 SGA steps for the copula models is almost the same as the non-copula models (e.g. A5 and A7 are only 0.5% and 1.5% slower than A3) making them attractive choices.

To judge the approximation accuracy, the exact augmented posterior is computed using MCMC with data augmentation. Figure 5 plots the first three posterior moments of the approximations (vertical axes) against their true values (horizontal axes). Results are given for the approximations A3 (panels a,e,i), A4 (panels b,f,j), A5 (panels c,g,k) and A6 (panels d,h,l). All four identify the means well, but the striking result is that the two copula approximations capture the (Pearson's) skew coefficients remarkably well in panels (k,l). By doing so, the estimates of the second moment in panels (g,h) are also improved. Figure 6 illustrates further by plotting the exact posterior densities for the nine model parameters (excluding the random effects), along with those of approximations A1, A3, A5, and that obtained using INLA (Rue et al., 2009) with the same priors. Ignoring the dependence between parameters using A1 greatly understates the posterior standard deviation, which is a well-known effect. However, adopting the Gaussian copula A5 improves the density estimates compared to the Gaussian A3, particularly for $\zeta$ in panel (i). The latter is likely due to the skew in the posteriors of many random effect values, which is captured by the copula. Last, INLA approximates the near symmetric marginal posteriors of the fixed effects well, but has an inaccurate estimate for $\zeta$ in panel (i), thereby understating the level of heterogeneity in the data compared to all VB estimators.

4.2.2 Logistic regression

To illustrate the trade-off between speed and approximation accuracy, we consider the Spam, Ionosphere, Krkp and Mushroom test datasets considered in Ong et al. (2018). These have sample sizes $n=4601,351,3196$ and $8124$ , respectively, and are used to fit logistic regressions with 104, 111, 37 and 95 covariates. We use the same $N(0,10I)$ prior on the linear coefficients of the covariates as these authors, and fit the six correlated approximations A3–A8 using $k=3$ factors throughout. Table 3 reports the average lower bounds over the last 1000 steps. By this metric, the skewed approximations A4, A6 and A8 are the most accurate, although the differences between these three are small. However, the copula models can have a substantial speed advantage. Figure 7 compares the calibration speed by plotting the lower bound against time to implement the SGA algorithm (in MATLAB on a standard laptop). This shows that for the Krkp and Mushroom test data the copula models were much faster to calibrate than either the Gaussian or skew-normal. This can also be an important consideration when using variational inference in big data problems.

5 Discussion

In this paper we show how to employ copula model approximations in variational inference using element-wise transformations of the parameter space. This type of copula is called an ‘implicit copula’, and is obtained from the choice of distribution $F$ for the transformed parameters $\psi$ . We suggest using parametric transformations that are known to be effective in transforming data to near normality, and illustrate with the power transformation of Yeo and Johnson (2000) and the inverse G&H transformation of Tukey (1977). The implied margins of such transformations are available in closed form, and depend on both the transformation selected and the marginals of $F$ . While, in principle, any distribution can be selected for $F$ , elliptical and skew-elliptical (Genton, 2004) distributions are good choices for two reasons. First, they give rise to implicit copulas which have been shown previously to be effective; for example, see Fang et al. (2002), Demarta and McNeil (2005) and Smith et al. (2012). Second, by employing a factor decomposition for the scale matrix of $F$ , the number of copula parameters only increases linearly with $m$ .

The approximation provides a balance between computational viability and accuracy. We illustrate here using Gaussian and skew-normal copulas of dimensions up to $m=509$ , although higher dimensions can also be considered. Our empirical work shows that the Yeo-Johnson transformation is particularly effective and is quickly calibrated using SGA; in most cases, faster than calibrating the elliptical or skew-elliptical distributions themselves on the parameter vector. The approach of defining the copula approximation using element-wise transformations simplifies the computations required to implement variational inference by using (5). In contrast, selecting a high-dimensional copula function—such as a vine copula (Tran et al., 2015)—and marginals separately, uses (7) which is slower. Han et al. (2016) make a similar observation for a Gaussian copula, and we show this applies generally to all implicit copulas. Another important observation is that constraints on the parameters of $F$ usually employed to identify the implicit copula (for example, see Smith and Maneesoonthorn (2018)) are not required because they are identified through the margins $q_{\lambda_{i}}$ .

Last, we comment on possible extensions to our work. One interesting possibility is to consider other flexible multivariate models for constructing the implicit copula. Truncated Gaussian graphical models (Su et al., 2016) are one interesting possibility here, since they include the skew-normal distribution as a special case, and similar to the skew-normal they have a latent Gaussian structure which may be amenable to implementation of re-parametrization methods for gradient estimation in the optimization. Another interesting idea is to use the copula Bayesian network of Elidan (2010) as an approximation, where the local copulas are implicit copulas constructed through transformation as recommended in our paper. It would also be interesting to implement our copula approximations in other challenging settings, such as when some of the parameters are discrete, or in likelihood-free inference applications. Here gradient estimation for the optimization becomes more challenging, as straightforward re-parameterization techniques do not immediately apply.

Appendix A

This appendix derives the gradient needed to implement the example in Section 3.2.1. In this example, $\bm{\theta}=(\bm{\eta}^{\top},\bm{v}^{\top})^{\top}$, where $\bm{\eta}$ are the model parameters and $\bm{v}$ is the vector of auxiliary variables. The approximation to the augmented posterior of $\bm{\theta}$ is

[TABLE]

with $\text{\boldmath$ \psi $}^{a}=\left(\psi_{1}^{a},\dots,\psi_{m}^{a}\right)^{\top}$ , $\psi^{a}_{i}=t_{\gamma_{a,i}}(\eta_{i})$ , $\text{\boldmath$ \psi $}^{b}=\left(\psi_{1}^{b},\dots,\psi_{T}^{b}\right)^{\top}$ , $\psi_{t}^{b}=t_{\gamma_{b,t}}(\tilde{v}_{t})$ , $\tilde{v}_{t}=\Phi_{1}^{-1}\left(\frac{v_{t}-a_{t}}{b_{t}-a_{t}}\right)$ , $\text{\boldmath$ \lambda $}^{a}=((\bm{\pi}^{a})^{\top},(\bm{\gamma}^{a})^{\top})^{\top}$ , $\bm{\gamma}_{a}=\left(\gamma_{a,1},\dots,\gamma_{a,m}\right)^{\top}$ , $\bm{\lambda}^{b}=((\bm{\pi}^{b})^{\top},(\bm{\gamma}^{b})^{\top})^{\top}$ , $\bm{\gamma}_{b}=\left(\gamma_{b,1},\dots,\gamma_{b,T}\right)^{\top}$ . It follows then that

[TABLE]
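As an illustration of the map $\tilde{v}_{t}=\Phi_{1}^{-1}\left(\frac{v_{t}-a_{t}}{b_{t}-a_{t}}\right)$ defined above, the following minimal Python sketch moves a bounded auxiliary variable to the real line and back, and evaluates the log-Jacobian $\log|\partial\tilde{v}_{t}/\partial v_{t}|=-\log(b_{t}-a_{t})-\log\phi_{1}(\tilde{v}_{t})$. It assumes that $v_{t}$ lies strictly inside $(a_{t},b_{t})$. The function names are hypothetical and are not taken from the paper's MATLAB routines.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: Phi_1 and phi_1


def to_real_line(v, a, b):
    """Map v in (a, b) to v_tilde = Phi_1^{-1}((v - a) / (b - a))."""
    return _STD_NORMAL.inv_cdf((v - a) / (b - a))


def to_interval(v_tilde, a, b):
    """Inverse map: v = a + (b - a) * Phi_1(v_tilde)."""
    return a + (b - a) * _STD_NORMAL.cdf(v_tilde)


def log_abs_dvtilde_dv(v, a, b):
    """log |d v_tilde / d v| = -log(b - a) - log phi_1(v_tilde)."""
    v_tilde = to_real_line(v, a, b)
    return -math.log(b - a) - math.log(_STD_NORMAL.pdf(v_tilde))


# Example with hypothetical bounds a = -1, b = 2.
v_tilde = to_real_line(1.2, -1.0, 2.0)
v_back = to_interval(v_tilde, -1.0, 2.0)
```

Because this map does not involve the variational parameters as written, its Jacobian contributes to the density of $\bm{v}$ but not to gradients taken with respect to $\bm{\lambda}^{b}$.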

For $\bm{\eta}$ we use a Gaussian copula, so that $p_{a}\left(\bm{\psi}^{a};\bm{\pi}^{a}\right)=\phi_{m}\left(\bm{\psi}^{a};\bm{\mu},BB^{\top}+D^{2}\right)$ and $\bm{\lambda}^{a}=\left(\bm{\mu}^{\top},\bm{b}^{\top},\bm{d}^{\top},(\bm{\gamma}^{a})^{\top}\right)^{\top}$ with $\bm{b}=\text{vech}(B)$ and $\bm{d}=\text{diag}\left(D\right)$. Following Ong et al. (2018) and Loaiza-Maya and Smith (2019), it is straightforward to show that the elements of the gradient

[TABLE]

are

[TABLE]

with $\frac{\partial t_{\gamma}\left(\bm{\eta}\right)}{\partial\bm{\gamma}^{a}}=\text{Diag}\left(\frac{\partial t_{\gamma_{a,1}}(\eta_{1})}{\partial\gamma_{a,1}},\dots,\frac{\partial t_{\gamma_{a,m}}(\eta_{m})}{\partial\gamma_{a,m}}\right)$ .
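To make the structure of the Gaussian copula approximation for $\bm{\eta}$ concrete, the sketch below draws $\bm{\eta}$ by re-parameterization and evaluates $\log q_{\lambda^{a}}(\bm{\eta})=\log\phi_{m}(\bm{\psi}^{a};\bm{\mu},BB^{\top}+D^{2})+\sum_{i}\log t^{\prime}_{\gamma_{a,i}}(\eta_{i})$. It assumes the Yeo-Johnson transformation with $0<\gamma<2$ for every margin, forms the covariance matrix densely for readability (in high dimensions the factor structure would be exploited instead), and uses hypothetical function names and toy parameter values.

```python
import numpy as np


def yj(theta, gamma):
    """Element-wise Yeo-Johnson transform t_gamma(theta), for 0 < gamma < 2."""
    base = np.abs(theta) + 1.0  # theta + 1 if theta >= 0, else 1 - theta
    return np.where(theta >= 0,
                    (base ** gamma - 1.0) / gamma,
                    -(base ** (2.0 - gamma) - 1.0) / (2.0 - gamma))


def yj_inv(psi, gamma):
    """Element-wise inverse of the Yeo-Johnson transform."""
    a = np.abs(psi)
    return np.where(psi >= 0,
                    (1.0 + gamma * a) ** (1.0 / gamma) - 1.0,
                    1.0 - (1.0 + (2.0 - gamma) * a) ** (1.0 / (2.0 - gamma)))


def log_yj_deriv(theta, gamma):
    """Element-wise log of d t_gamma(theta) / d theta."""
    return np.where(theta >= 0, gamma - 1.0, 1.0 - gamma) * np.log1p(np.abs(theta))


def draw_eta(mu, B, d, gamma, rng):
    """Re-parameterized draw: psi = mu + B z + d * eps, then eta = t^{-1}(psi)."""
    m, k = B.shape
    z = rng.standard_normal(k)
    eps = rng.standard_normal(m)
    return yj_inv(mu + B @ z + d * eps, gamma)


def log_q_eta(eta, mu, B, d, gamma):
    """Gaussian log-density of psi plus the log-Jacobian of the transformation."""
    psi = yj(eta, gamma)
    sigma = B @ B.T + np.diag(d ** 2)  # dense covariance, for clarity only
    resid = psi - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = resid @ np.linalg.solve(sigma, resid)
    log_gauss = -0.5 * (len(mu) * np.log(2.0 * np.pi) + logdet + quad)
    return log_gauss + np.sum(log_yj_deriv(eta, gamma))


# Toy example with m = 5 parameters and k = 2 factors (all values hypothetical).
rng = np.random.default_rng(0)
m, k = 5, 2
mu = rng.standard_normal(m)
B = np.tril(rng.standard_normal((m, k)))  # zero upper triangle
d = np.exp(0.1 * rng.standard_normal(m))
gamma = np.full(m, 0.8)
eta = draw_eta(mu, B, d, gamma, rng)
log_density = log_q_eta(eta, mu, B, d, gamma)
```

Setting every $\gamma_{a,i}=1$ makes the Yeo-Johnson transformation the identity, in which case the sketch reduces to a Gaussian approximation with factor covariance structure.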

For $\bm{\psi}^{b}$ we assume an independent Gaussian approximation $p_{b}\left(\bm{\psi}^{b};\bm{\pi}^{b}\right)=\prod_{t=1}^{T}\phi_{1}\left(\psi^{b}_{t};\zeta_{t},\exp{\left(2c_{t}\right)}\right)$, where $\bm{\lambda}^{b}=\left(\bm{\zeta}^{\top},\bm{c}^{\top},(\bm{\gamma}^{b})^{\top}\right)^{\top}$, $\bm{\zeta}=\left(\zeta_{1},\dots,\zeta_{T}\right)^{\top}$ and $\bm{c}=\left(c_{1},\dots,c_{T}\right)^{\top}$. The implied approximation for $\bm{v}$ is

[TABLE]

The gradient is $\nabla_{\lambda^{b}}\log q_{\lambda^{b}}(\bm{v})=\left(\nabla_{\zeta}\log q_{\lambda^{b}}(\bm{v})^{\top},\nabla_{c}\log q_{\lambda^{b}}(\bm{v})^{\top},\nabla_{\gamma}\log q_{\lambda^{b}}(\bm{v})^{\top}\right)^{\top}$ with elements

[TABLE]

where $\omega_{t}=\exp\left(c_{t}\right)$ .
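The element-wise gradients for this independent Gaussian approximation can be checked numerically. The sketch below is derived directly from the density $\phi_{1}(\psi^{b}_{t};\zeta_{t},\omega_{t}^{2})\,t^{\prime}_{\gamma_{b,t}}(\tilde{v}_{t})$ under the assumption of a Yeo-Johnson transformation with $0<\gamma<2$. The Jacobian of the map from $v_{t}$ to $\tilde{v}_{t}$ is omitted because it does not involve $\bm{\lambda}^{b}$ as written. The function names are hypothetical, and the expressions may be arranged differently from those in the paper.

```python
import math


def yj(x, g):
    """Yeo-Johnson transform t_g(x) for 0 < g < 2."""
    if x >= 0:
        return ((x + 1.0) ** g - 1.0) / g
    return -((1.0 - x) ** (2.0 - g) - 1.0) / (2.0 - g)


def log_yj_deriv(x, g):
    """log of d t_g(x) / d x."""
    return (g - 1.0) * math.log1p(x) if x >= 0 else (1.0 - g) * math.log1p(-x)


def dyj_dgamma(x, g):
    """Partial derivative of t_g(x) with respect to g (same form on both branches)."""
    base, u = (x + 1.0, g) if x >= 0 else (1.0 - x, 2.0 - g)
    return (base ** u * (u * math.log(base) - 1.0) + 1.0) / u ** 2


def log_q_t(v_tilde, zeta, c, g):
    """Log-density of one element, up to the lambda-free v -> v_tilde Jacobian."""
    psi = yj(v_tilde, g)
    omega = math.exp(c)
    return (-0.5 * math.log(2.0 * math.pi) - c
            - 0.5 * ((psi - zeta) / omega) ** 2
            + log_yj_deriv(v_tilde, g))


def grad_log_q_t(v_tilde, zeta, c, g):
    """Analytic derivatives of log_q_t with respect to (zeta, c, gamma)."""
    psi = yj(v_tilde, g)
    omega = math.exp(c)
    scaled_resid = (psi - zeta) / omega ** 2
    d_zeta = scaled_resid
    d_c = (psi - zeta) ** 2 / omega ** 2 - 1.0
    # d/dgamma of log t'_gamma: log(1 + x) if x >= 0, else -log(1 - x)
    d_log_deriv = math.log1p(v_tilde) if v_tilde >= 0 else -math.log1p(-v_tilde)
    d_gamma = -scaled_resid * dyj_dgamma(v_tilde, g) + d_log_deriv
    return d_zeta, d_c, d_gamma


# Hypothetical values: v_tilde = -0.7, zeta = 0.3, c = -0.2, gamma = 1.4.
grads = grad_log_q_t(-0.7, 0.3, -0.2, 1.4)
```

Parameterizing the standard deviation as $\omega_{t}=\exp(c_{t})$ keeps it positive without constraining $c_{t}$, which is why the derivative with respect to $c_{t}$ takes the simple form $(\psi^{b}_{t}-\zeta_{t})^{2}/\omega_{t}^{2}-1$.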

Appendix B

This appendix provides details on the implementation of variational inference using the skew-normal approximation proposed in Section 4. Multiplying (11) by the Jacobian of the transformation from $\bm{\psi}$ to $\bm{\theta}$ gives the approximating density

[TABLE]

where $\bm{\psi}=(\psi_{1},\ldots,\psi_{m})^{\top}$ and $\psi_{i}=t_{\gamma_{i}}(\theta_{i})$. The complete vector of variational parameters for this approximation is $\bm{\lambda}^{\top}=(\bm{\mu}_{\psi}^{\top},\bm{\alpha}_{\psi}^{\top},\text{vech}(B)^{\top},\bm{d}^{\top},\bm{\gamma}^{\top})$, where $\text{vech}(B)$ is the vectorization of $B$ that omits the zero upper-triangular elements. As discussed in Section 2, to implement SGA using the re-parameterization trick, the gradient

[TABLE]

must be approximated. This is done by drawing an iterate of $\bm{\varepsilon}=(r,\varepsilon_{0},\bm{z}^{\top},\bm{\epsilon}^{\top})^{\top}$ from a $N(\bm{0},I)$ distribution, and then computing the derivatives inside (12) analytically. Below, we write $\bm{\theta}=h(\bm{\varepsilon},\bm{\lambda})$ as $\bm{\theta}(\bm{\varepsilon},\bm{\lambda})$ for clarity. To obtain the derivatives, note that the gradient can be partitioned into sub-vectors

[TABLE]

where

[TABLE]

The derivative with respect to $\text{vech}(B)$ above is computed by ignoring the elements on the right-hand side of the equation that correspond to the upper triangle of $B$. The term $\nabla_{\theta}\log g(\bm{\theta})$ is model specific and needs to be derived on a case-by-case basis. Expressions for the remaining terms can be computed in closed form. First,

[TABLE]

where the elements are computed using the formulas given in Table 1 for either the YJ or G&H transformations. Expressions for the remaining four derivatives are provided in Table 4 and are derived in Part B of the Online Appendix. MATLAB routines that evaluate these derivatives are in the Supplementary Materials.
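The step of ignoring upper-triangular elements when differentiating with respect to $\text{vech}(B)$ amounts to masking a full $m\times k$ gradient matrix. The following sketch shows one way to do this with NumPy, stacking the retained entries column by column. The function names and the column-major ordering are assumptions made for illustration and need not match the conventions of the paper's MATLAB routines.

```python
import numpy as np


def vech_lower(G):
    """Stack the on-and-below-diagonal entries of an m x k matrix column by column."""
    m, k = G.shape
    keep = np.tril(np.ones((m, k), dtype=bool))
    # Indexing the transpose walks down each column of G in turn.
    return G.T[keep.T]


def unvech_lower(b, m, k):
    """Rebuild the m x k matrix B from vech(B), with zeros above the diagonal."""
    B = np.zeros((m, k))
    keep = np.tril(np.ones((m, k), dtype=bool))
    B.T[keep.T] = b  # B.T is a view, so this writes into B
    return B


# A full gradient with respect to B (hypothetical values), m = 4 and k = 3.
grad_B = np.arange(12.0).reshape(4, 3)
grad_vech_B = vech_lower(grad_B)  # 9 free elements
B_rebuilt = unvech_lower(grad_vech_B, 4, 3)
```

For an $m\times k$ factor loading matrix this leaves $mk-k(k-1)/2$ free elements, so the masked gradient has the same length as $\text{vech}(B)$.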

Supplementary Materials

Supplementary materials contain:

smith_loaiza_maya_nott_webappend.pdf

An online appendix in two parts. Part A specifies the pair-copula used in Section 3.2; Part B derives the four derivatives in Appendix B.

Bibliography

Aas, K., Czado, C., Frigessi, A., and Bakken, H. (2009). Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44(2):182–198.
Archer, E., Park, I. M., Buesing, L., Cunningham, J., and Paninski, L. (2016). Black box variational inference for state space models. arXiv:1511.07367.
Azzalini, A. and Capitanio, A. (2003). Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):367–389.
Azzalini, A. and Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika, 83(4):715–726.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. and Saporta, G., editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177–187. Springer.
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32.
Challis, E. and Barber, D. (2013). Gaussian Kullback-Leibler approximate inference. The Journal of Machine Learning Research, 14(1):2239–2286.