Copula-like Variational Inference

Marcel Hirt; Petros Dellaportas; Alain Durmus

arXiv:1904.07153·stat.ML·December 24, 2019


TL;DR

This paper introduces a new family of variational distributions inspired by copulas, enabling efficient sampling and better approximation of complex posteriors in Bayesian neural networks.

Contribution

It proposes copula-like variational densities with efficient sampling and normalizing flows, improving approximation of non-Gaussian posteriors over traditional methods.

Findings

1. Performs comparably to state-of-the-art variational methods on benchmarks.
2. Can approximate non-Gaussian posteriors effectively.
3. Sampling complexity is linear in the dimension.

Abstract

This paper considers a new family of variational distributions motivated by Sklar's theorem. This family is based on new copula-like densities on the hypercube with non-uniform marginals which can be sampled efficiently, i.e. with a complexity linear in the dimension of state space. The proposed variational densities can then be seen as arising from these copula-like densities used as base distributions on the hypercube with Gaussian quantile functions and sparse rotation matrices as normalizing flows. The latter correspond to a rotation of the marginals with complexity $\mathcal{O}(d\log d)$. We provide some empirical evidence that such a variational family can also approximate non-Gaussian posteriors and can be beneficial compared to Gaussian approximations. Our method performs largely comparably to state-of-the-art variational approximations on standard regression…


Tables

Table 1: Comparison of the ELBO between different variational families for the logistic regression experiment.

Variational family	ELBO
Mean-field Gaussian	-3.42
Full-covariance Gaussian	-2.97
Copula-like without rotations	-2.30
Copula-like with rotations	-2.19

Table 3: Comparison of the ELBO between different variational families for the centred horseshoe model.

Variational family	ELBO
Mean-field Gaussian	-1.24
Full-covariance Gaussian	-0.04
Copula-like	0.04
3-mixture copula-like	0.08

Table 5: Variational approximations with transformations and different base distributions. Test root mean-squared error for UCI regression datasets. Standard errors in parentheses.

	Copula-like with rotation	Independent copula with rotation	Copula-like with IAF	Independent copula with IAF
Boston	3.43 (0.22)	3.51 (0.30)	3.21 (0.27)	3.61 (0.28)
Concrete	5.76 (0.14)	6.00 (0.13)	5.41 (0.10)	5.82 (0.11)
Energy	0.55 (0.01)	2.28 (0.11)	0.53 (0.02)	1.30 (0.10)
Kin8nm	0.08 (0.00)	0.08 (0.00)	0.08 (0.00)	0.08 (0.00)
Naval	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Power	4.02 (0.04)	4.19 (0.04)	4.05 (0.04)	4.15 (0.04)
Wine	0.64 (0.01)	0.64 (0.01)	0.64 (0.01)	0.64 (0.01)
Yacht	1.35 (0.08)	1.38 (0.12)	0.96 (0.06)	1.25 (0.09)
Protein	4.20 (0.01)	4.51 (0.04)	4.31 (0.01)	4.51 (0.03)

Table 6: Copula-like variational approximation without rotations and benchmark results. Test root mean-squared error for UCI regression datasets. Standard errors in parentheses.

	Copula-like without rotation	Bayes-by-Backprop (results from [47])	SLANG (results from [47])	Dropout (results from [47])
Boston	3.22 (0.25)	3.43 (0.20)	3.21 (0.19)	2.97 (0.19)
Concrete	5.64 (0.14)	6.16 (0.13)	5.58 (0.12)	5.23 (0.12)
Energy	0.52 (0.02)	0.97 (0.09)	0.64 (0.04)	1.66 (0.04)
Kin8nm	0.08 (0.00)	0.08 (0.00)	0.08 (0.00)	0.10 (0.01)
Naval	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.01 (0.01)
Power	4.05 (0.04)	4.21 (0.03)	4.16 (0.04)	4.02 (0.04)
Wine	0.65 (0.01)	0.64 (0.01)	0.65 (0.01)	0.62 (0.01)
Yacht	1.23 (0.08)	1.13 (0.06)	1.08 (0.09)	1.11 (0.09)
Protein	4.31 (0.02)	NA	NA	4.27 (0.01)

Table 7: MNIST prediction errors.

Variational approximation with Horseshoe prior and size $200 \times 200$	Error Rate
Copula-like with rotations	1.70 %
Copula-like without rotations	1.78 %
Copula-like with IAF	2.04 %
Independent copula with IAF	2.88 %
Independent copula with rotations	2.90 %
Mean-field Gaussian	3.82 %
Copula-like without rotations and $\delta_{i}=0.99$ for all $i\in\{1,\ldots,d\}$	5.70 %

Table 8: Variational approximations with transformations and different base distributions. Test log-likelihood for UCI regression datasets. Standard errors in parentheses.

	Copula-like with rotation	Independent copula with rotation	Copula-like with IAF	Independent copula with IAF
Boston	-2.85 (0.07)	-2.84 (0.09)	-2.78 (0.1)	-2.88 (0.09)
Concrete	-3.29 (0.03)	-3.30 (0.02)	-3.22 (0.02)	-3.26 (0.02)
Energy	-1.04 (0.02)	-2.34 (0.05)	-0.93 (0.03)	-1.78 (0.07)
Kin8nm	1.08 (0.01)	1.07 (0.01)	1.10 (0.01)	1.03 (0.01)
Naval	5.74 (0.05)	5.23 (0.05)	5.97 (0.05)	5.01 (0.05)
Power	-2.82 (0.01)	-2.85 (0.04)	-2.83 (0.04)	-2.85 (0.01)
Wine	-1.01 (0.01)	-1.02 (0.02)	-1.02 (0.02)	-1.02 (0.02)
Yacht	-2.01 (0.04)	-2.03 (0.06)	-1.69 (0.06)	-1.94 (0.07)
Protein	-2.87 (0.00)	-2.94 (0.00)	-2.90 (0.01)	-2.93 (0.01)

Table 9: Copula-like variational approximation without rotations and benchmark results. Test log-likelihood for UCI regression datasets. Standard errors in parentheses.

	Copula-like without rotation	Bayes-by-Backprop (results from [47])	SLANG (results from [47])	Dropout (results from [47])
Boston	-2.79 (0.08)	-2.66 (0.06)	-2.58 (0.05)	-2.46 (0.06)
Concrete	-3.25 (0.03)	-3.25 (0.02)	-3.13 (0.03)	-3.04 (0.02)
Energy	-1.00 (0.03)	-1.45 (0.02)	-1.12 (0.01)	-1.99 (0.02)
Kin8nm	1.09 (0.01)	1.07 (0.00)	1.06 (0.00)	0.95 (0.01)
Naval	5.45 (0.12)	4.61 (0.01)	4.76 (0.00)	3.80 (0.01)
Power	-2.83 (0.01)	-2.86 (0.01)	-2.84 (0.01)	-2.80 (0.01)
Wine	-1.02 (0.01)	-0.97 (0.01)	-0.97 (0.01)	-0.93 (0.01)
Yacht	-1.92 (0.06)	-1.56 (0.03)	-1.88 (0.01)	-1.55 (0.03)
Protein	-2.89 (0.01)	NA	NA	-2.87 (0.01)

Equations

$$\mathrm{KL}(q_{\xi}|\pi)=-\int_{\mathbb{R}^{d}}q_{\xi}(x)\log\frac{\pi(x)}{q_{\xi}(x)}\,\mathrm{d}x=-\mathbb{E}_{q_{\xi}(x)}\left[-U(x)-\log q_{\xi}(x)\right]+\log\mathrm{Z}. \tag{1}$$

$$\mathcal{L}(\xi)=\mathbb{E}_{q_{\xi}(x)}\left[\log\pi_{0}(x)+\log L(y^{1:n}|x)-\log q_{\xi}(x)\right] \tag{2}$$

$$\log q_{T}(x_{T})=\log q_{0}(x)-\sum_{t=1}^{T}\log\det\frac{\partial\mathscr{T}_{t}(x_{t})}{\partial x_{t}}, \tag{3}$$

$$\mathbb{P}(U_{1}\leqslant u_{1},\ldots,U_{d}\leqslant u_{d})=C(u_{1},\ldots,u_{d}),\qquad\int_{-\infty}^{x_{1}}\cdots\int_{-\infty}^{x_{d}}\pi(t)\,\mathrm{d}t=C(F_{1}(x_{1}),\ldots,F_{d}(x_{d})) \tag{4}$$

$$\mathscr{G}\colon u\mapsto(F_{1}^{-1}(u_{1}),\ldots,F_{d}^{-1}(u_{d})), \tag{5}$$

$$\pi(x)=c(F_{1}(x_{1}),\ldots,F_{d}(x_{d}))\prod_{i=1}^{d}\pi_{i}(x_{i}),$$

$$c_{\theta}(v_{1},\ldots,v_{d})=\frac{\Gamma(\alpha^{*})}{B(a,b)}\left\{\prod_{\ell=1}^{d}\frac{v_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\Big(\max_{i\in\{1,\ldots,d\}}v_{i}\Big)^{a}\Big(1-\max_{i\in\{1,\ldots,d\}}v_{i}\Big)^{b-1}(v^{*})^{-\alpha^{*}} \tag{6}$$

$$\operatorname{Cor}(Y_{i},Y_{j})=c_{ij}\left(\frac{\mathbb{E}[G^{2}]}{\alpha^{\star}+1}-\frac{\mathbb{E}[G]^{2}}{\alpha^{\star}}\right),$$

$$\mathscr{H}\colon v\mapsto(1-\delta)\operatorname{Id}+\{\operatorname{diag}(2\delta)-\operatorname{Id}\}v,$$

$$\mathbb{P}(\delta_{i}=\epsilon)=p\quad\text{and}\quad\mathbb{P}(\delta_{i}=1-\epsilon)=1-p \tag{8}$$

$$\mathcal{O}_{1}\mathcal{O}_{2}=\begin{bmatrix}c_{1}&-s_{1}&0&0\\ s_{1}&c_{1}&0&0\\ 0&0&c_{3}&-s_{3}\\ 0&0&s_{3}&c_{3}\end{bmatrix}\begin{bmatrix}c_{2}&0&-s_{2}&0\\ 0&c_{2}&0&-s_{2}\\ s_{2}&0&c_{2}&0\\ 0&s_{2}&0&c_{2}\end{bmatrix}=\begin{bmatrix}c_{1}c_{2}&-s_{1}c_{2}&-c_{1}s_{2}&s_{1}s_{2}\\ s_{1}c_{2}&c_{1}c_{2}&-s_{1}s_{2}&-c_{1}s_{2}\\ c_{3}s_{2}&-s_{3}s_{2}&c_{3}c_{2}&-s_{3}c_{2}\\ s_{3}s_{2}&c_{3}s_{2}&s_{3}c_{2}&c_{3}c_{2}\end{bmatrix},$$

$$\mathscr{T}_{3}\colon x\mapsto\mathcal{O}_{1}\cdots\mathcal{O}_{\log d}\,x$$

$$W_{i\cdot}^{l}\mid\lambda_{i}^{l},\tau^{l},c\sim\mathcal{N}\big(0,(\tau^{l}\tilde{\lambda}_{i}^{l})^{2}I\big)\propto\mathcal{N}\big(0,(\tau^{l}\lambda_{i}^{l})^{2}I\big)\,\mathcal{N}(0,c^{2}),$$

$$\mathbb{E}[f(V_{1},\ldots,V_{n})]$$

$$\times\,g^{a-1}(1-g)^{b-1}\left\{\prod_{\ell=1}^{d}\frac{u_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\operatorname{Leb}(g,u_{1},\ldots,u_{d-1})$$

$$=\sum_{k=1}^{d}\frac{\Gamma(\alpha^{\star})}{B(a,b)}A_{k},$$

$$A_{k}=\int_{[0,1]^{d}}\mathbbm{1}\Big\{u_{k}=\max_{j\in\{1,\ldots,d\}}u_{j}\Big\}f\{gu_{1}/u_{k},\ldots,gu_{d}/u_{k}\}\times g^{a-1}(1-g)^{b-1}\left\{\prod_{\ell=1}^{d}\frac{u_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\operatorname{Leb}(g,u_{1},\ldots,u_{d-1}).$$

$$\begin{aligned}
A_{1}&=\int_{\Delta_{1}}f\{g,\ldots,gu_{d}/u_{1}\}\,g^{a-1}(1-g)^{b-1}\left\{\prod_{\ell=1}^{d}\frac{u_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\operatorname{Leb}(g,u_{1},u_{2},\ldots,u_{d-1})\\
&=\int_{\tilde{\Delta}_{1}}f\Big\{g,w_{2},\ldots,w_{d-1},g/u_{1}-g-\sum_{i=2}^{d-1}w_{i}\Big\}\,g^{a-1}(1-g)^{b-1}\\
&\quad\times\left\{\prod_{\ell=2}^{d-2}\frac{(u_{1}w_{\ell}/g)^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\frac{u_{1}^{\alpha_{1}-1}}{\Gamma(\alpha_{1})}\frac{(1-u_{1}-\sum_{i=2}^{d-1}u_{1}w_{i}/g)^{\alpha_{d}-1}}{\Gamma(\alpha_{d})}\frac{g^{d-2}}{u_{1}^{d-2}}\operatorname{Leb}(g,u_{1},w_{2},\ldots,w_{d-1})\\
&=\int_{\tilde{\Delta}_{1}}f\Big\{g,w_{2},\ldots,w_{d-1},g/u_{1}-g-\sum_{i=2}^{d-1}w_{i}\Big\}\,g^{a-1}(1-g)^{b-1}\\
&\quad\times\left\{\prod_{\ell=2}^{d-2}\frac{w_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\frac{u_{1}^{\alpha^{\star}-2}}{\Gamma(\alpha_{1})}\frac{(g/u_{1}-g-\sum_{i=2}^{d-1}w_{i})^{\alpha_{d}}}{\Gamma(\alpha_{d})}\,g^{-\alpha^{\star}+\alpha_{1}+1}\operatorname{Leb}(g,u_{1},w_{2},\ldots,w_{d-1})\\
&=\int_{\tilde{\Delta}_{1}}f\Big\{g,w_{2},\ldots,w_{d-1},g/u_{1}-g-\sum_{i=2}^{d-1}w_{i}\Big\}\,g^{a-1}(1-g)^{b-1}\\
&\quad\times\left\{\prod_{\ell=2}^{d-2}\frac{w_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\frac{g^{\alpha_{1}-1}}{\Gamma(\alpha_{1})}\frac{(g/u_{1}-g-\sum_{i=2}^{d-1}w_{i})^{\alpha_{d}-1}}{\Gamma(\alpha_{d})}(u_{1}/g)^{\alpha^{\star}-2}\operatorname{Leb}(g,u_{1},w_{2},\ldots,w_{d-1}).
\end{aligned}$$

$$\bar{\Delta}_{1}=\Big\{(g,w_{d},w_{2},\ldots,w_{d-1}):\max_{j\in\{1,\ldots,d\}}w_{j}\leqslant g\Big\},$$

$$\begin{aligned}
A_{1}&=\int_{\bar{\Delta}_{1}}f(g,w_{2},\ldots,w_{d-1},w_{d})\,g^{a-1}(1-g)^{b-1}\\
&\quad\times\left\{\prod_{\ell=2}^{d}\frac{w_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\frac{g^{\alpha_{1}}}{\Gamma(\alpha_{1})}\Big\{g+\sum_{j=1}^{d-1}w_{j}\Big\}^{-\alpha^{\star}}\operatorname{Leb}(g,w_{1},w_{2},\ldots,w_{d-1}).
\end{aligned}$$

$$\mathcal{R}_{2d}=\begin{bmatrix}\mathcal{R}_{d}c_{d}&-\mathcal{R}_{d}s_{d}\\ \tilde{\mathcal{R}}_{d}s_{d}&\tilde{\mathcal{R}}_{d}c_{d}\end{bmatrix},$$

$$\mathcal{R}_{2}=\begin{bmatrix}c_{1}&-s_{1}\\ s_{1}&c_{1}\end{bmatrix},\qquad\tilde{\mathcal{R}}_{2}=\begin{bmatrix}c_{3}&-s_{3}\\ s_{3}&c_{3}\end{bmatrix}.$$

$$\mathcal{R}_{5}=\begin{bmatrix}c_{1}&-s_{1}&0&0&0\\ s_{1}&c_{1}&0&0&0\\ 0&0&c_{3}&-s_{3}&0\\ 0&0&s_{3}&c_{3}&0\\ 0&0&0&0&1\end{bmatrix}\begin{bmatrix}c_{2}&0&-s_{2}&0&0\\ 0&c_{2}&0&-s_{2}&0\\ s_{2}&0&c_{2}&0&0\\ 0&s_{2}&0&c_{2}&0\\ 0&0&0&0&1\end{bmatrix}\begin{bmatrix}c_{4}&0&0&0&-s_{4}\\ 0&1&0&0&0\\ 0&0&1&0&0\\ 0&0&0&1&0\\ s_{4}&0&0&0&c_{4}\end{bmatrix}.$$

$$l(z,\phi,\delta)=\log L(y^{1:n}|f_{\phi,\delta}(z))+\log\pi_{0}(f_{\phi,\delta}(z))-\log q_{\xi}(f_{\phi,\delta}(z)).$$

$$\begin{aligned}
\nabla_{\theta,\phi}\mathcal{L}(\xi)&=\mathbb{E}\left[\nabla_{z}l(S_{\theta}^{-1}(H),\phi,\delta)\,\nabla_{\theta,\phi}S_{\theta}^{-1}(H)+\nabla_{\theta,\phi}l(S_{\theta}^{-1}(H),\phi,\delta)\right]\\
&=\mathbb{E}\left[\nabla_{z}l(Z,\phi,\delta)\,\nabla_{\theta,\phi}Z+\nabla_{\theta,\phi}l(Z,\phi,\delta)\right],
\end{aligned}$$


Code & Models

Repositories

marcelah/copula-like-vi
tfOfficial


Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Bayesian Modeling and Causal Inference · Statistical Mechanics and Entropy

Full text


Copula-like Variational Inference

Marcel Hirt

Department of Statistical Science

University College of London, UK

[email protected]

Petros Dellaportas

Department of Statistical Science

University College of London, UK

Department of Statistics

Athens University of Economics and Business, Greece

and The Alan Turing Institute, UK

Alain Durmus

CMLA

École normale supérieure Paris-Saclay,

CNRS, Université Paris-Saclay, 94235 Cachan, France.

[email protected]

Abstract

This paper considers a new family of variational distributions motivated by Sklar’s theorem. This family is based on new copula-like densities on the hypercube with non-uniform marginals which can be sampled efficiently, i.e. with a complexity linear in the dimension $d$ of the state space. The proposed variational densities can then be seen as arising from these copula-like densities used as base distributions on the hypercube with Gaussian quantile functions and sparse rotation matrices as normalizing flows. The latter correspond to a rotation of the marginals with complexity $\mathcal{O}(d\log d)$. We provide some empirical evidence that such a variational family can also approximate non-Gaussian posteriors and can be beneficial compared to Gaussian approximations. Our method performs largely comparably to state-of-the-art variational approximations on standard regression and classification benchmarks for Bayesian Neural Networks.

1 Introduction

Variational inference [29, 68, 4] aims at performing Bayesian inference by approximating an intractable posterior density $\pi$ with respect to the Lebesgue measure on $\mathbb{R}^{d}$, based on a family of distributions which can be easily sampled from. More precisely, this kind of inference posits some variational family $\mathrm{Q}$ of densities $(q_{\xi})_{\xi\in\Xi}$ with respect to the Lebesgue measure and intends to find a good approximation $q_{\xi^{\star}}$ belonging to $\mathrm{Q}$ by minimizing the Kullback-Leibler (KL) divergence with respect to $\pi$ over $\mathrm{Q}$, i.e. $\xi^{\star}\approx\operatorname*{arg\,min}_{\xi\in\Xi}\mathrm{KL}(q_{\xi}|\pi)$. Further, suppose that $\pi(x)=\mathrm{e}^{-U(x)}/\mathrm{Z}$, where $U\colon\mathbb{R}^{d}\to\mathbb{R}$ is measurable and $\mathrm{Z}=\int_{\mathbb{R}^{d}}\mathrm{e}^{-U(x)}\mathrm{d}x<\infty$ is an unknown normalising constant. Then, for any $\xi\in\Xi$,

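$$\mathrm{KL}(q_{\xi}|\pi)=-\int_{\mathbb{R}^{d}}q_{\xi}(x)\log\frac{\pi(x)}{q_{\xi}(x)}\,\mathrm{d}x=-\mathbb{E}_{q_{\xi}(x)}\left[-U(x)-\log q_{\xi}(x)\right]+\log\mathrm{Z}. \tag{1}$$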

Since $\mathrm{Z}$ does not depend on $q_{\xi}$ , minimizing $\xi\mapsto\mathrm{KL}(q_{\xi}|\pi)$ is equivalent to maximizing $\xi\mapsto\log\mathrm{Z}-\mathrm{KL}(q_{\xi}|\pi)$ . A standard example is Bayesian inference over latent variables $x$ having a prior density $\pi_{0}$ for a given likelihood function $L(y^{1:n}|x)$ and $n$ observations $y^{1:n}=(y^{1},\ldots,y^{n})$ . The target density is the posterior $p(x|y^{1:n})$ with $U(x)=-\log\pi_{0}(x)-\log L(y^{1:n}|x)$ and the objective that is commonly maximized,

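$$\mathcal{L}(\xi)=\mathbb{E}_{q_{\xi}(x)}\left[\log\pi_{0}(x)+\log L(y^{1:n}|x)-\log q_{\xi}(x)\right] \tag{2}$$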

is called a variational lower bound or ELBO. One of the main features of variational inference methods is their ability to be scaled to large datasets using stochastic approximation methods [24] and applied to non-conjugate models by using Monte Carlo estimators of the gradient [57, 35, 60, 63, 38]. However, the approximation quality hinges on the expressiveness of the distributions in $\mathrm{Q}$, and assumptions on the variational family that allow for efficient computations, such as mean-field families, tend to be too restrictive to recover the target distribution. Constructing an approximation family $\mathrm{Q}$ that is both flexible enough to closely approximate the density of interest and computationally efficient has been an ongoing challenge. Much effort has been dedicated to finding flexible and rich enough variational approximations, for instance by assuming a Gaussian approximation with different types of covariance matrices. For example, full-rank covariance matrices have been considered in [1, 28, 63] and low-rank perturbations of diagonal matrices in [1, 46, 53, 47]. Furthermore, covariance matrices with a Kronecker structure have been proposed in [42, 70]. Besides, more complex variational families have been suggested, such as mixture models [18, 22, 46, 40, 39] and implicit models [45, 26, 67, 69, 64], where the density of the variational distribution is intractable. Finally, variational inference based on normalizing flows has been developed in [59, 34, 65, 43, 3]. As a special case and motivated by Sklar’s theorem [62], variational inference based on families of copula densities and one-dimensional marginal distributions has been considered by [66], where it is assumed that the copula is a vine copula [2], and by [23], where the copula is assumed to be a Gaussian copula together with non-parametric marginals using Bernstein polynomials.
Recall that $c:\left[0,1\right]^{d}\to\mathbb{R}_{+}$ is a copula if and only if its marginals are uniform on $\left[0,1\right]$, i.e. $\int_{\left[0,1\right]^{d-1}}c(u_{1},\ldots,u_{d})\mathrm{d}u_{1}\cdots\mathrm{d}u_{i-1}\mathrm{d}u_{i+1}\cdots\mathrm{d}u_{d}=\mathbbm{1}_{\left[0,1\right]}(u_{i})$ for any $i\in\{1,\ldots,d\}$ and $u_{i}\in\mathbb{R}$. In the present work, we pursue these ideas but, instead of using a family of copula densities, simply propose a family of densities $\{c_{\theta}:\left[0,1\right]^{d}\to\mathbb{R}_{+}\}_{\theta\in\Theta}$ on the hypercube $\left[0,1\right]^{d}$. This idea is motivated by the fact that we are able to provide such a family which is both flexible and allows efficient computations.

The paper is organised as follows. In Section 2, we recall how one can sample more expressive distributions and compute their densities using a sequence of bijective and continuously differentiable transformations. In particular, we illustrate how to apply this idea in order to sample from a target density by first sampling a random variable $U$ from its copula density $c$ and then applying the marginal quantile function to each component of $U$. In Section 3, we construct a new family of copula-like densities on the hypercube that allows for some flexibility in the dependence structure, while enjoying linear complexity in the dimension of the state space for generating samples and evaluating log-densities. A flexible variational distribution on $\mathbb{R}^{d}$ is introduced in Section 4 by sampling from such a copula-like density and then applying a sequence of transformations that include $\frac{1}{2}d\log d$ rotations over pairs of coordinates. We illustrate in Section 6 that for some target densities, arising for instance as the posterior in a logistic regression model, the proposed density allows for a better approximation, as measured by the KL-divergence, than a Gaussian density. We conclude by applying the proposed methodology to Bayesian Neural Network models.

2 Variational Inference and Copulas

In order to obtain expressive variational distributions, the variational densities can be transformed through a sequence of invertible mappings, termed normalizing flows [60]. To be more specific, assume a series $\{\mathscr{T}_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d}\}_{t=1}^{T}$ of $\mathrm{C}^{1}$ -diffeomorphisms and a sample $X_{0}\sim q_{0}$ , where $q_{0}$ is a density function on $\mathbb{R}^{d}$ . Then the random variable $X_{T}=\mathscr{T}_{T}\circ\mathscr{T}_{T-1}\circ\cdots\circ\mathscr{T}_{1}(X_{0})$ has a density $q_{T}$ that satisfies

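$$\log q_{T}(x_{T})=\log q_{0}(x)-\sum_{t=1}^{T}\log\det\frac{\partial\mathscr{T}_{t}(x_{t})}{\partial x_{t}}, \tag{3}$$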

with $x_{t}=\mathscr{T}_{t}\circ\mathscr{T}_{t-1}\circ\cdots\circ\mathscr{T}_{1}(x)$ . To allow for scalable inferences with such densities, the transformations $\mathscr{T}_{t}$ must be chosen so that the determinant of their Jacobians can be computed efficiently. One possibility that satisfies this requirement is to choose volume-preserving flows that have a Jacobian-determinant of one. This can be achieved by considering transformations $\mathscr{T}_{t}\colon x\mapsto H_{t}x$ where $H_{t}$ is an orthogonal matrix as proposed in [65] using a Householder-projection matrix $H_{t}$ .
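The volume-preserving case can be illustrated with a minimal sketch in plain Python. It assumes an independent zero-mean Gaussian base density and Householder reflections as the orthogonal maps; the function names `householder_apply`, `log_q0` and `log_q_flow` are hypothetical and chosen only for this illustration. Since each reflection has a Jacobian determinant of absolute value one, the flow density is the base density evaluated at the inverted sample, with no extra Jacobian term.

```python
import math


def householder_apply(v, x):
    """Apply H = I - 2 v v^T / (v^T v) to x in O(d) time, without forming H."""
    scale = 2.0 * sum(vi * xi for vi, xi in zip(v, x)) / sum(vi * vi for vi in v)
    return [xi - scale * vi for vi, xi in zip(v, x)]


def log_q0(x, sigmas):
    """Log-density of independent zero-mean Gaussians (an assumed base density)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * (xi / s) ** 2
        for xi, s in zip(x, sigmas)
    )


def log_q_flow(x_T, reflections, sigmas):
    """Log-density of X_T = H_T ... H_1 X_0 evaluated at x_T.

    Each reflection is its own inverse and has |det| = 1, so the flow is
    undone in reverse order and the log-determinant terms all vanish.
    """
    x = list(x_T)
    for v in reversed(reflections):
        x = householder_apply(v, x)
    return log_q0(x, sigmas)


# Push a point through two reflections and evaluate its log-density.
x0 = [0.3, -1.2, 2.0]
sigmas = [1.0, 2.0, 0.5]
reflections = [[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]]
x_T = x0
for v in reflections:
    x_T = householder_apply(v, x_T)
print(log_q_flow(x_T, reflections, sigmas), log_q0(x0, sigmas))
```

The two printed values agree up to floating-point error, which is the volume-preservation property in action.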

An alternative construction of the same form can be used to construct a density using Sklar’s theorem [62, 48]. It establishes that given a target density $\pi$ on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ , there exists a continuous function $C\colon\left[0,1\right]^{d}\to\left[0,1\right]$ and a probability space supporting a random variable $U=(U_{1},\dots,U_{d})$ valued in $\left[0,1\right]^{d}$ , such that for any $x\in\mathbb{R}^{d}$ , and $u\in\left[0,1\right]^{d}$ ,

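$$\mathbb{P}(U_{1}\leqslant u_{1},\ldots,U_{d}\leqslant u_{d})=C(u_{1},\ldots,u_{d}),\qquad\int_{-\infty}^{x_{1}}\cdots\int_{-\infty}^{x_{d}}\pi(t)\,\mathrm{d}t=C(F_{1}(x_{1}),\ldots,F_{d}(x_{d})), \tag{4}$$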

where for any $i\in\{1,\ldots,d\}$ , $F_{i}$ is the cumulative distribution function associated with $\pi_{i}$ , so for any $x_{i}\in\mathbb{R}$ , $F_{i}(x_{i})=\int_{-\infty}^{x_{i}}\pi_{i}(t_{i})\mathrm{d}t_{i}$ and $\pi_{i}$ is the $i^{\text{th}}$ marginal of $\pi$ , so for any $x_{i}\in\mathbb{R}$ , $\pi_{i}(x_{i})=\int_{\mathbb{R}^{d-1}}\pi(x)\mathrm{d}x_{1}\cdots\mathrm{d}x_{i-1}\mathrm{d}x_{i+1}\cdots\mathrm{d}x_{d}$ . To illustrate how one can obtain such a continuous function $C$ and random variable $U$ , recall that $\pi_{i}$ is assumed to be absolutely continuous with respect to the Lebesgue measure. Then for $(X_{1},\ldots,X_{d})\sim\pi$ , the random variable $U=\mathscr{G}^{-1}(X)=(F_{1}(X_{1}),\ldots,F_{d}(X_{d}))$ , where $\mathscr{G}\colon\left[0,1\right]^{d}\to\mathbb{R}^{d},$ with

$$\mathscr{G}\colon u\mapsto(F_{1}^{-1}(u_{1}),\ldots,F_{d}^{-1}(u_{d})), \tag{5}$$

follows a law on the hypercube with uniform marginals. It can be readily shown that the cumulative distribution function $C$ of $U$ is continuous and satisfies (4). Note that taking the derivative of (4) yields

$$\pi(x)=c(F_{1}(x_{1}),\ldots,F_{d}(x_{d}))\prod_{i=1}^{d}\pi_{i}(x_{i}),$$

where $c(u_{1},\dots,u_{d})=\frac{\partial}{\partial u_{1}}\cdots\frac{\partial}{\partial u_{d}}C(u_{1},\ldots,u_{d})$ is a copula density function by definition of $C$. One possibility to approximate a target density $\pi$ is then to consider a parametric family of copula density functions $(c_{\theta})_{\theta\in\Theta}$ for $\Theta\subset\mathbb{R}^{p_{c}}$ and one parametric family of a $d$-dimensional vector of density functions $(f_{1},\ldots,f_{d})_{\phi\in\Phi}\colon\mathbb{R}^{d}\to\mathbb{R}^{d}$ for $\Phi\subset\mathbb{R}^{p_{f}}$, and try to estimate $\theta\in\Theta$ and $\phi\in\Phi$ to get a good approximation of $\pi$ via variational Bayesian methods. This idea was proposed by [23] and [66], where Gaussian and vine copulas were used, respectively. The main hurdle for using such a family is its computational cost, which can be prohibitive since the dimension of $\Theta$ is of order $d^{2}$. We remark that for latent Gaussian models with certain likelihood functions, a Gaussian variational approximation can scale linearly in the number of observations by using dual variables, see [54, 31].
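The two-stage construction behind Sklar's theorem (draw $U$ from a copula, then apply the marginal quantile functions) can be sketched in a few lines of plain Python. The example assumes a bivariate Gaussian copula with correlation `rho` and, purely for illustration, an exponential and a logistic marginal; none of these choices or function names come from the paper.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_gaussian_copula_2d(rho, rng):
    """Draw U in [0, 1]^2 with uniform marginals and Gaussian dependence."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    # The probability integral transform makes each coordinate uniform.
    return [STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)]


def exponential_quantile(u, rate=1.0):
    """Quantile function of an Exponential(rate) distribution."""
    return -math.log1p(-u) / rate


def logistic_quantile(u):
    """Quantile function of the standard logistic distribution."""
    return math.log(u) - math.log1p(-u)


def marginal_transform(u, quantiles):
    """Apply the i-th marginal quantile function to the i-th coordinate."""
    return [q(ui) for q, ui in zip(quantiles, u)]


rng = random.Random(0)
u = sample_gaussian_copula_2d(0.8, rng)
x = marginal_transform(u, [exponential_quantile, logistic_quantile])
print(u, x)
```

The dependence structure is carried entirely by `u`, while the marginal laws of `x` are set by the quantile functions, which is the separation the copula approach exploits.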

3 Copula-like Density

In this paper, we consider another approach which relies on a copula-like density function on $\left[0,1\right]^{d}$. Instead of an exact copula density function on $\left[0,1\right]^{d}$ with uniform marginals, we simply consider a density function on $\left[0,1\right]^{d}$, which leaves a certain degree of freedom in the number of parameters we want to use. The family of copula-like densities that we consider is given by

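$$c_{\theta}(v_{1},\ldots,v_{d})=\frac{\Gamma(\alpha^{*})}{B(a,b)}\left\{\prod_{\ell=1}^{d}\frac{v_{\ell}^{\alpha_{\ell}-1}}{\Gamma(\alpha_{\ell})}\right\}\Big(\max_{i\in\{1,\ldots,d\}}v_{i}\Big)^{a}\Big(1-\max_{i\in\{1,\ldots,d\}}v_{i}\Big)^{b-1}(v^{*})^{-\alpha^{*}}, \tag{6}$$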

with the notation $v^{*}=\sum_{i=1}^{d}v_{i}$ and $\alpha^{*}=\sum_{i=1}^{d}\alpha_{i}$ . Therefore $\theta=(a,b,(\alpha_{i})_{i\in\{1,\ldots,d\}})\in(\mathbb{R}_{+}^{*}\times\mathbb{R}_{+}^{*}\times(\mathbb{R}_{+}^{*})^{d})=\Theta$ . The following probabilistic construction is proven in Appendix A to allow for efficient sampling from the proposed copula-like density.

Proposition 3.

Let $\theta\in\Theta$ and suppose that

1. $(W_{1},\ldots,W_{d})\sim\text{Dirichlet}(\alpha_{1},\ldots,\alpha_{d})$;
2. $G\sim\text{Beta}(a,b)$;
3. $(V_{1},\ldots,V_{d})=(GW_{1}/W^{*},\ldots,GW_{d}/W^{*})$, where $W^{*}=\max_{i\in\{1,\ldots,d\}}W_{i}$.

Then the distribution of $(V_{1},\ldots,V_{d})$ has density with respect to the Lebesgue measure given by (6).
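The three steps of the proposition translate directly into a sampler whose cost is linear in $d$. The sketch below uses only the Python standard library; the names `sample_copula_like` and `log_density` are hypothetical. The log-density is an assumption of this sketch: it is the expression obtained from steps 1 to 3 by a change of variables, with independent $W$ and $G$, and the last lines check numerically for $d=2$ that it integrates to approximately one.

```python
import math
import random


def sample_copula_like(a, b, alphas, rng):
    """Draw V = G * W / max(W) with W ~ Dirichlet(alphas) and G ~ Beta(a, b)."""
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(gammas)
    w = [g / total for g in gammas]  # normalised gamma draws are Dirichlet
    g_beta = rng.betavariate(a, b)
    w_max = max(w)
    return [g_beta * wi / w_max for wi in w]


def log_density(v, a, b, alphas):
    """Log-density on the hypercube implied by the construction (assumed form)."""
    v_max, v_sum, alpha_sum = max(v), sum(v), sum(alphas)
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    out = math.lgamma(alpha_sum) - log_beta_fn
    out += sum(
        (al - 1.0) * math.log(vi) - math.lgamma(al) for al, vi in zip(alphas, v)
    )
    out += a * math.log(v_max) + (b - 1.0) * math.log1p(-v_max)
    out -= alpha_sum * math.log(v_sum)
    return out


rng = random.Random(0)
v = sample_copula_like(2.0, 2.0, [2.0, 3.0, 1.5], rng)
print(v, log_density(v, 2.0, 2.0, [2.0, 3.0, 1.5]))

# Midpoint-rule check for d = 2 that the density integrates to about one.
n = 200
h = 1.0 / n
mass = h * h * sum(
    math.exp(log_density([(i + 0.5) * h, (j + 0.5) * h], 2.0, 2.0, [2.0, 3.0]))
    for i in range(n)
    for j in range(n)
)
print(mass)
```

The largest coordinate of each draw equals the Beta variable $G$, so every sample lies in the hypercube by construction.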

The proposed distribution builds on Beta distributions, as they are the marginals of the Dirichlet distributed random variable $W\sim\text{Dir}(\alpha)$, which is then multiplied with an independent random variable $G\sim\text{Beta}(a,b)$. The resulting random variable $Y=WG$ follows a Beta-Liouville distribution, which can account for negative dependence, inherited from the Dirichlet distribution through a Beta stick-breaking construction, as well as positive dependence via a common Beta-factor. More precisely, one obtains

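$$\operatorname{Cor}(Y_{i},Y_{j})=c_{ij}\left(\frac{\mathbb{E}[G^{2}]}{\alpha^{\star}+1}-\frac{\mathbb{E}[G]^{2}}{\alpha^{\star}}\right),$$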

for some $c_{ij}>0$ and $\alpha^{\star}=\sum_{k=1}^{d}\alpha_{k}$ , cf. [13]. Proposition 3 shows that one can transform the Beta-Liouville distribution living within the simplex to one that has support on the full hypercube, while also allowing for efficient sampling and log-density evaluations.

Now note that $V^{-}=(1-V_{1},\ldots,1-V_{d})$ is also a sample on the hypercube if $V\sim c_{\theta}$, as is the convex combination $U=(U_{1},\ldots,U_{d})$, where $U_{i}=\delta_{i}V_{i}+(1-\delta_{i})(1-V_{i})$ for any $\delta\in\left[0,1\right]^{d}$. Put differently, we can write $U=\mathscr{H}(V)$, where

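$$\mathscr{H}\colon v\mapsto(1-\delta)\operatorname{Id}+\{\operatorname{diag}(2\delta)-\operatorname{Id}\}v,$$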

and $\operatorname{Id}$ is the identity operator. It is straightforward to see that $\mathscr{H}$ is a $\mathrm{C}^{1}$ -diffeomorphism for $\delta\in([0,1]\backslash\{0.5\})^{d}$ from the hypercube into $I_{1}\times\cdots\times I_{d}$ , where $I_{i}=\left[\delta_{i},1-\delta_{i}\right]$ if $\delta_{i}\in[0,0.5)$ and $I_{i}=\left[1-\delta_{i},\delta_{i}\right]$ if $\delta_{i}\in(0.5,1]$ . Note that the Jacobian-determinant of $\mathscr{H}$ is efficiently computable and is simply equal to $|\prod_{i=1}^{d}(2\delta_{i}-1)|$ for $\delta\in\left[0,1\right]^{d}$ .
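A small sketch of the map $\mathscr{H}$ and its log-Jacobian, written coordinate-wise as $U_{i}=(1-\delta_{i})+(2\delta_{i}-1)V_{i}$, is given below. The function names are hypothetical, and the value $\delta_{i}\in\{0.01,0.99\}$ is only an example.

```python
import math


def antithetic_mix(v, delta):
    """U_i = delta_i * V_i + (1 - delta_i) * (1 - V_i), written as an affine map."""
    return [(1.0 - d) + (2.0 * d - 1.0) * vi for vi, d in zip(v, delta)]


def log_abs_det_jacobian(delta):
    """log |prod_i (2 * delta_i - 1)|; requires delta_i != 0.5 for invertibility."""
    return sum(math.log(abs(2.0 * d - 1.0)) for d in delta)


v = [0.0, 1.0, 0.25]
delta = [0.01, 0.99, 0.01]
u = antithetic_mix(v, delta)
print(u, log_abs_det_jacobian(delta))
```

With these values every coordinate of `u` stays inside $[0.01, 0.99]$, which is what keeps the subsequent quantile evaluations away from the boundary of the unit interval.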

We suggest initially drawing $\delta\in\left[0,1\right]^{d}$ for the transformation $\mathscr{H}$ at random such that

$$\mathbb{P}(\delta_{i}=\epsilon)=p\quad\text{and}\quad\mathbb{P}(\delta_{i}=1-\epsilon)=1-p \tag{8}$$

with $p,\epsilon\in(0,1)$. In our experiments, we set $\epsilon=0.01$ and $p=1/2$. We found that choosing a different (large enough) value of $\epsilon$ tends to make little difference, as this choice is balanced by a different value of the standard deviation of the Gaussian marginal transformation. The first motivation for considering $U=\mathscr{H}(V)$ with $V\sim c_{\theta}$ is numerical stability, since with this transformation we need to compute quantile functions only on the interval $[\epsilon,1-\epsilon]$. Second, this transformation can increase the flexibility of our proposed family. We found empirically that the dependence between the components of $V\sim c_{\theta}$ tends to be non-negative in higher dimensions. However, by weighting the antithetic component $1-V_{i}$ more heavily for some coordinates through $U=\mathscr{H}(V)$, the transformed density can also describe negative dependencies in high dimensions. A natural way to obtain a more flexible density is then either to optimize over the parameter $\delta$ of the transformation $\mathscr{H}$ or to treat $\delta$ as an auxiliary variable in the variational density, resorting to techniques developed for such hierarchical families, see for instance [58, 69, 64]. However, this proved challenging in an initial attempt: for $\delta_{i}=0.5$ the transformation $\mathscr{H}$ becomes non-invertible, while restricting $\delta$ to, say, $\{\epsilon,1-\epsilon\}^{d}$ with $\epsilon\approx 0$ seemed harder to optimize. Consequently, we keep $\delta$ fixed after sampling it initially according to (8). A sensible choice is $p=1/2$, since it leads to a balanced proportion of components of $\delta$ equal to $\epsilon$ and $1-\epsilon$. However, the sampled value of $\delta$ might not be optimal, and we illustrate in the next section how the variational density can be made more flexible.

4 Rotated Variational Density

We propose to apply rotations to the marginals in order to improve on the initial orientation that results from the sampled values of $\delta$. Rotated copulas have been used before in low dimensions, see for instance [36]; however, the set of orthogonal matrices has $d(d-1)/2$ free parameters. We reduce the number of free parameters by considering only rotation matrices $\mathcal{R}_{d}$ that are given as a product of $\frac{d}{2}\log d$ Givens rotations, following the FFT-style butterfly-architecture proposed in [16], see also [44] and [49] where such an architecture was used for approximating Hessians and kernel functions, respectively. Recall that a Givens rotation matrix [21] is a sparse matrix with one angle as its parameter that rotates two dimensions by this angle. If we assume for the moment that $d=2^{k}$, $k\in\mathbb{N}^{*}$, then we consider $k$ rotation matrices denoted $\mathcal{O}_{1},\ldots,\mathcal{O}_{k}$, where for any $i\in\{1,\ldots,k\}$, $\mathcal{O}_{i}$ contains $d/2$ independent rotations, i.e. is the product of $d/2$ independent Givens rotations. The Givens rotations are arranged in a butterfly architecture that uses a minimal number of rotations such that all coordinates can interact with one another in the rotation defined by $\mathcal{R}_{d}$. For illustration, consider the case $d=4$, where the rotation matrix is fully described using $4-1$ parameters $\nu_{1},\nu_{2},\nu_{3}\in\mathbb{R}$ by $\mathcal{R}_{4}=\mathcal{O}_{1}\mathcal{O}_{2}$ with

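$$\mathcal{O}_{1}\mathcal{O}_{2}=\begin{bmatrix}c_{1}&-s_{1}&0&0\\ s_{1}&c_{1}&0&0\\ 0&0&c_{3}&-s_{3}\\ 0&0&s_{3}&c_{3}\end{bmatrix}\begin{bmatrix}c_{2}&0&-s_{2}&0\\ 0&c_{2}&0&-s_{2}\\ s_{2}&0&c_{2}&0\\ 0&s_{2}&0&c_{2}\end{bmatrix}=\begin{bmatrix}c_{1}c_{2}&-s_{1}c_{2}&-c_{1}s_{2}&s_{1}s_{2}\\ s_{1}c_{2}&c_{1}c_{2}&-s_{1}s_{2}&-c_{1}s_{2}\\ c_{3}s_{2}&-s_{3}s_{2}&c_{3}c_{2}&-s_{3}c_{2}\\ s_{3}s_{2}&c_{3}s_{2}&s_{3}c_{2}&c_{3}c_{2}\end{bmatrix},$$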

where $c_{i}=\cos(\nu_{i})$ and $s_{i}=\sin(\nu_{i})$. We provide a precise recursive definition of $\mathcal{R}_{d}$ in Appendix B, where we also describe the case where $d$ is not a power of two. In general, we have a computational complexity of $\mathcal{O}(d\log d)$, due to the fact that $\mathcal{R}_{d}$ is a product of $\mathcal{O}(\log d)$ matrices each requiring $\mathcal{O}(d)$ operations. Moreover, note that $\mathcal{R}_{d}$ is parametrized by $d-1$ parameters $(\nu_{i})_{i\in\{1,\ldots,d-1\}}$ and each $\mathcal{O}_{i}$ can be implemented as a sparse matrix, which implies a memory complexity of $\mathcal{O}(d)$. Furthermore, since $\mathcal{O}_{i}$ is orthonormal, we have $\mathcal{O}_{i}^{-1}=\mathcal{O}_{i}^{\top}$ and $|\det\mathcal{O}_{i}|=1$.
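A butterfly rotation with $d-1$ angles can be applied to a vector without ever forming a matrix. The recursive sketch below assumes that $d$ is a power of two, that the angles are stored in the in-order layout of a binary tree (so for $d=4$ the list is $(\nu_{1},\nu_{2},\nu_{3})$ with $\nu_{2}$ shared by the stride-two layer), and that each Givens rotation acts as $(x_{i},x_{j})\mapsto(cx_{i}-sx_{j},\,sx_{i}+cx_{j})$. The function name `butterfly_rotate` and the sign convention are assumptions of this sketch; flipping the convention only amounts to replacing each angle by its negative.

```python
import math


def butterfly_rotate(x, angles):
    """Apply a butterfly product of Givens rotations to x.

    len(x) must be a power of two and len(angles) == len(x) - 1. The middle
    angle rotates coordinate i against coordinate i + len(x) / 2 for every i;
    the two halves are then rotated recursively with the remaining angles.
    Each of the log2(d) levels costs O(d), giving O(d log d) overall.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    c, s = math.cos(angles[half - 1]), math.sin(angles[half - 1])
    top = [c * x[i] - s * x[i + half] for i in range(half)]
    bottom = [s * x[i] + c * x[i + half] for i in range(half)]
    return butterfly_rotate(top, angles[: half - 1]) + butterfly_rotate(
        bottom, angles[half:]
    )


x = [1.0, 2.0, 3.0, 4.0]
angles = [0.3, -0.7, 1.1]
y = butterfly_rotate(x, angles)
print(y, sum(v * v for v in y))
```

Because every level is a product of planar rotations, the Euclidean norm of `x` is preserved, so the printed squared norm matches that of the input up to rounding.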

To construct an expressive variational distribution, we consider as a base distribution $q_{0}$ the proposed copula-like density $c_{\theta}$ . We then apply the transformations $\mathscr{T}_{1}=\mathscr{H}$ and $\mathscr{T}_{2}=\mathscr{G}$ . The operator $\mathscr{G}$ in (5) is defined via quantile functions of densities $f_{1},\ldots,f_{d}$ , for which we choose Gaussian densities with parameter $\phi_{f}=(\mu_{1},\ldots,\mu_{d},\sigma_{1}^{2},\ldots,\sigma_{d}^{2})\in\mathbb{R}^{d}\times\mathbb{R}_{+}^{d}$ . As a final transformation, we apply the volume-preserving operator

$$\mathscr{T}_{3}\colon x\mapsto\mathcal{R}_{d}\,x\qquad(9)$$

that has parameter $\phi_{\mathcal{R}}=(\nu_{1},\ldots,\nu_{d-1})\in\mathbb{R}^{d-1}$ . Altogether, the parameter for the marginal-like densities that we optimize over is $\phi=(\phi_{f},\phi_{\mathcal{R}})$ and simulation from the variational density boils down to the following algorithm.

Note that we apply the rotations after we have transformed samples from the hypercube into $\mathbb{R}^{d}$, as the hypercube is not closed under Givens rotations. The variational density can then be evaluated using the normalizing flow formula (3). We optimize the variational lower bound $\mathcal{L}$ in (2) using reparametrization gradients, as proposed in [35, 60, 63], but with an implicit reparametrization, cf. [14], for Dirichlet and Beta distributions. Such reparametrized gradients for Dirichlet and Beta distributions are readily available, for instance in TensorFlow Probability [9]. Using unbiased Monte Carlo estimates of the gradient, one can optimize the variational bound using some version of stochastic gradient descent. A more formal description is given in Appendix C.
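The sampling pipeline described above (a draw from the base density on the hypercube, followed by $\mathscr{H}$, the Gaussian quantile map $\mathscr{G}$ and a final transformation) can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the paper's Algorithm 1: the base draw $V_i=G\,W_i/\max_j W_j$ with $W\sim\text{Dirichlet}(\alpha)$ and $G\sim\text{Beta}(a,b)$ is inferred from the change of variables in Appendix A, the transformation $\mathscr{H}$ is left as a user-supplied placeholder `h`, `sigma` holds standard deviations, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

def sample_copula_like(alpha, a, b, rng):
    """Draw V in (0, 1)^d as G * W / max(W) (assumed construction)."""
    # Gamma draws give W = z / sum(z) ~ Dirichlet(alpha); the sum cancels in W / max(W).
    z = [rng.gammavariate(al, 1.0) for al in alpha]
    z_max = max(z)
    g = rng.betavariate(a, b)
    return [g * zi / z_max for zi in z]

def sample_variational(alpha, a, b, mu, sigma, h=None, rotate=None, rng=None):
    """One draw from the variational family; h and rotate default to the identity."""
    if rng is None:
        rng = random.Random()
    v = sample_copula_like(alpha, a, b, rng)      # base draw on the hypercube
    u = h(v) if h is not None else v              # placeholder for the map H
    eps = 1e-12                                   # keep quantile arguments inside (0, 1)
    std_normal = NormalDist()
    x = [m + s * std_normal.inv_cdf(min(max(ui, eps), 1.0 - eps))
         for m, s, ui in zip(mu, sigma, u)]       # Gaussian quantile map G
    return rotate(x) if rotate is not None else x # final transformation, e.g. a rotation
```

Passing any orthogonal map as `rotate` keeps the final step volume-preserving, so only the first three steps contribute to the log-density correction.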

We would like to remark that such sparse rotations can be similarly applied to proper copulas. While there is no additional flexibility by rotating a full-rank Gaussian copula, applying such rotations to a Gaussian copula with a low-rank correlation yields a Gaussian distribution with a more flexible covariance structure if combined with Gaussian marginals. In our experiments, we therefore also compare variational families constructed by sampling $(V_{1},\ldots,V_{d})$ from an independence copula in step $1$ in Algorithm 1, i.e. $V_{i}$ are independent and uniformly distributed on $[0,1]$ for any $i\in\{1,\ldots,d\}$ , which results approximately in a Gaussian variational distribution if the effect of the transformation $\mathscr{H}$ is neglected. However, a more thorough analysis of such families is left for future work. Similarly, transformations different from the sparse rotations in step $4$ in Algorithm 1 can be used in combination with a copula-like base density. Whilst we include a comparison with a simple Inverse Autoregressive Flow [34] in our experiments, a more exhaustive study of non-linear transformations is beyond the scope of this work.

5 Related Work

Conceptually, our work is closely related to [66, 23]. It differs from [66] in that it can be applied in high dimensions without having to search first for the most correlated variables using, for instance, a sequential tree selection algorithm [11]. The approach in [23] uses a Gaussian dependence structure, but has only been considered in low-dimensional settings. On a more computational side, our approach is related to variational inference with normalizing flows [59, 34, 65, 43, 3]. In contrast to these works, which introduce a parameter-free base distribution, commonly with $\mathbb{R}^{d}$ as the latent state space, we also optimize over the parameters of the base distribution, which is supported on the hypercube instead; distributions supported on other state spaces, for instance the hypersphere, have been considered in [7]. Moreover, such approaches have often been used in the context of generative models using Variational Auto-Encoders (VAEs) [35], and it is in principle possible to apply the proposed variational copula-like inference in an amortized fashion for VAEs.

A somewhat similar copula-like construction in the context of importance sampling has been proposed in [8]. However, sampling from this density requires a rejection step to ensure support on the hypercube, which would make optimization of the variational bound less straightforward. Lastly, [30] proposed a method to approximate copulas using mixture distributions, but these approximations have been analysed neither in high dimensions nor in the context of variational inference.

6 Experiments

6.1 Bayesian Logistic Regression

Consider the target distribution $\pi$ on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ arising as the posterior of a $d$-dimensional logistic regression, assuming a Normal prior $\pi_{0}=\mathcal{N}(0,\tau^{-1}I)$, $\tau=0.01$, and likelihood function $L(y^{i}|x)=f(y^{i}x^{\top}\mathrm{a}^{i})$, $f(z)=1/{(1+\mathrm{e}^{-z})}$ with $n$ observations $y^{i}\in\{-1,1\}$ and fixed covariates $\mathrm{a}^{i}\in\mathbb{R}^{d}$ for $i\in\{1,\ldots,n\}$. We analyse a previously considered synthetic dataset where the posterior distribution is non-Gaussian, yet it can be well approximated with our copula-like construction. Concretely, we consider the synthetic dataset with $d=2$ as in [50], Section 8.4 and [32] by generating $30$ covariates $\mathrm{a}\in\mathbb{R}^{2}$ from a Gaussian $\mathcal{N}((1,5)^{\top},I)$ for instances in the first class, while we generate $30$ covariates from $\mathcal{N}((-5,1)^{\top},1.1^{2}I)$ for instances in the second class. Samples from the target distribution using a Hamiltonian Monte Carlo (HMC) sampler [12, 51] are shown in Figure 1(a) and one observes non-Gaussian marginals that are positively correlated with heavy right tails. A Gaussian variational approximation with either independent marginals or a full covariance matrix, as shown in Figure 1(b), does not adequately approximate the target distribution. Our copula-like construction is able to approximate the target more closely, both without any rotations (Figure 1(c)) and with a rotation of the marginals (Figure 1(d)). This is also supported by the ELBO obtained for the different variational families given in Table 1.
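For concreteness, the unnormalised log-posterior of this model can be sketched in plain Python as below. The data generator follows the description above, but the assignment of the labels $+1$ and $-1$ to the two classes, the random seed handling and all function names are assumptions made for illustration.

```python
import math
import random

def make_dataset(rng, n_per_class=30):
    """Two Gaussian clusters of covariates; the +1 / -1 labelling is an assumption."""
    data = []
    for _ in range(n_per_class):
        data.append(((rng.gauss(1.0, 1.0), rng.gauss(5.0, 1.0)), 1))
    for _ in range(n_per_class):
        data.append(((rng.gauss(-5.0, 1.1), rng.gauss(1.0, 1.1)), -1))
    return data

def log_sigmoid(z):
    """Numerically stable log f(z) with f(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_posterior(x, data, tau=0.01):
    """Log of prior N(0, tau^{-1} I) times logistic likelihood, up to a constant."""
    log_prior = -0.5 * tau * sum(xi * xi for xi in x)
    log_lik = sum(
        log_sigmoid(y * sum(xi * ai for xi, ai in zip(x, a)))
        for a, y in data
    )
    return log_prior + log_lik
```

A function of this form is all that a reparametrised ELBO estimate needs from the model: it is evaluated at samples from the variational density and combined with the log-density of those samples.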

6.2 Centred Horseshoe Priors

We illustrate our approach in a hierarchical Bayesian model that posits a priori a strong coupling of the latent parameters. As an example, we consider a Horseshoe prior [6] that has been considered in the variational Gaussian copula framework in [23]. To be more specific, consider the generative model $y|\lambda\sim\mathcal{N}(0,\lambda)$, with $\lambda\sim\mathcal{C}^{+}(0,1)$, where $\mathcal{C}^{+}$ is a half-Cauchy distribution, i.e. $X\sim\mathcal{C}^{+}(0,b)$ has the density $p(x)\propto 1_{\mathbb{R}_{+}}(x)/(x^{2}+b^{2})$. Note that we can represent a half-Cauchy distribution with Inverse Gamma and Gamma distributions using $X\sim\mathcal{C}^{+}(0,b)\iff X^{2}|Y\sim\mathcal{IG}(1/2,1/Y);Y\sim\mathcal{IG}(1/2,1/b^{2})$, see [52], with a rate parametrisation of the inverse gamma density $p(x)\propto 1_{\mathbb{R}_{+}}(x)x^{-a-1}e^{-b/x}$ for $X\sim\mathcal{IG}(a,b)$. We revisit the toy model in [23] fixing $y=0.01$. The model thus writes in a centred form as $\eta\sim\mathcal{G}(1/2,1)$ and $\lambda|\eta\sim\mathcal{IG}(1/2,{\eta})$. Following [23], we consider the posterior density on $\mathbb{R}^{2}$ of the log-transformed variables $(x_{1},x_{2})=(\log\eta_{1},\log\lambda_{1})$. In Figure 2, we show the approximate posterior distribution using a Gaussian family (2(b)) and a copula-like family (2(c)), together with samples from an HMC sampler (2(a)). A copula-like density yields a higher ELBO, see Table 4. The experiments in [23] have shown that a Gaussian copula with a non-parametric mixture model fits the marginals more closely. To illustrate that it is possible to arrive at a more flexible variational family by using a mixture of copula-like densities, we have used a mixture of $3$ copula-like densities in Figure 2(d). Note that it is possible to accommodate multi-modal marginals using a Gaussian quantile transformation with a copula-like density.
Finally, the flexibility of the variational approximation can be increased by combining it with complementary work. For instance, one could use the new density within a semi-implicit variational framework [69] whose parameters are the output of a neural network conditional on some latent mixing variable.

6.3 Bayesian Neural Networks with Normal Priors

We consider an $L$-hidden layer fully-connected neural network where each layer $l$, $1\leqslant l\leqslant L+1$, has width $d_{l}$ and is parametrised by a weight matrix $W^{l}\in\mathbb{R}^{d_{l-1}\times d_{l}}$ and bias vector $b^{l}\in\mathbb{R}^{d_{l}}$. Let $h^{1}\in\mathbb{R}^{d_{0}}$ denote the input to the network and $f$ be a point-wise non-linearity such as the ReLU function $f(a)=\max\{0,a\}$ and define the activations $a^{l}\in\mathbb{R}^{d_{l}}$ by $a^{l+1}=\sum_{i}h_{i}^{l}W^{l}_{i\cdot}+b^{l}$ for $l\geqslant 1$, and the post-activations as $h^{l}=f(a^{l})$ for $l\geqslant 2$. We consider a regression likelihood function $L(\cdot|a^{L+2},\sigma)=\mathcal{N}(a^{L+2},\exp(0.5\sigma))$, and denote the concatenation of all parameters $W$, $b$ and $\sigma$ as $x$. We assume independent Normal priors for the entries of the weight matrix and bias vector with mean $0$ and variance $\sigma_{0}^{2}$. Furthermore, we assume that $\log\sigma\sim\mathcal{N}(0,16)$. Inference with the proposed variational family is applied on commonly considered UCI regression datasets, repeating the experimental set-up used in [15]. In particular, we use neural networks with ReLU activation functions and one hidden layer of size $50$ for all datasets with the exception of the protein dataset that utilizes a hidden layer of size $100$. We choose the hyper-parameter $\sigma_{0}^{2}\in\{0.01,0.1,1,10,100\}$ that performed best on a validation dataset in terms of its predictive log-likelihood. Optimization was performed using Adam [33] with a learning rate of $0.002$. We compare the predictive performance of a copula-like density $c_{\theta}$ and an independent copula as a base distribution in step 1 of Algorithm 1 and we apply different transformations $\mathscr{T}_{3}$ in step 4 of Algorithm 1:

a) the proposed sparse rotation defined in (9);

b) $\mathscr{T}_{3}=\operatorname{Id}$ ;

c) an affine autoregressive transformation $\mathscr{T}_{3}(x)=\{x-f_{\mu}(x)\}{\exp(-f_{\alpha}(x))}$, see [34], also known as an inverse autoregressive flow (IAF).

Here $f_{\mu}$ and $f_{\alpha}$ are autoregressive neural networks parametrized by $\mu$ and $\alpha$ satisfying $\frac{\partial f_{\mu}(x)_{i}}{\partial x_{j}}=\frac{\partial f_{\alpha}(x)_{i}}{\partial x_{j}}=0$ for $i\leqslant j$ and which can be computed in a single forward pass by properly masking the weights in the neural networks [17]. In our experiments, we use a one-hidden layer fully-connected network with width $d_{1}^{\text{IAF}}=50$ for $f_{\mu}$ and $f_{\alpha}$. Note that for a $d$-dimensional target density, the size of the weight matrices is of order $d\cdot d_{1}^{\text{IAF}}$, implying a higher complexity compared to the sparse rotation. We also compare the predictions against Bayes-by-Backprop [5] using a mean-field model, with the results as reported in [47] for a mean-field Bayes-by-Backprop and for the low-rank Gaussian approximation proposed therein, called SLANG. Furthermore, we also report the results for Dropout inference [15]. The test root mean-squared errors are given in Table 5 and Table 6; the predictive test log-likelihood can be found in Appendix E in Table 8 and Table 9. We can observe from Table 5 and Table 8 that using a copula-like base distribution instead of an independent copula improves the predictive performance, using either rotations or IAF as the final transformation. The same tables also indicate that for a given base distribution, the IAF tends to outperform the sparse rotations slightly. Table 6 and Table 9 suggest that copula-like densities without any transformation in the last step can be a competitive alternative to a benchmark mean-field or Gaussian approximation. Dropout tends to perform slightly better. However, note that Dropout with a Normal prior and a variational mixture distribution that includes a Dirac delta function as one component gives rise to a different objective, since the prior is not absolutely continuous with respect to the approximate posterior, see [25].
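The effect of the autoregressive constraint in option c) can be seen in a small Python sketch. As a simplifying assumption, $f_{\mu}$ and $f_{\alpha}$ are replaced here by affine maps with strictly lower-triangular weight matrices instead of masked neural networks; the function and argument names are hypothetical. Because output $i$ only depends on inputs $j<i$ through $f_{\mu}$ and $f_{\alpha}$, the Jacobian is lower triangular and its log-determinant is $-\sum_{i}f_{\alpha}(x)_{i}$.

```python
import math

def affine_autoregressive(x, M_mu, b_mu, M_alpha, b_alpha):
    """Compute T(x) = (x - f_mu(x)) * exp(-f_alpha(x)) and log|det Jacobian|.

    f_mu and f_alpha are affine stand-ins for masked networks: component i
    only uses x[j] for j < i, so the entries M[i][j] with j >= i are ignored.
    """
    out = []
    log_det = 0.0
    for i in range(len(x)):
        f_mu = b_mu[i] + sum(M_mu[i][j] * x[j] for j in range(i))
        f_alpha = b_alpha[i] + sum(M_alpha[i][j] * x[j] for j in range(i))
        out.append((x[i] - f_mu) * math.exp(-f_alpha))
        # diagonal Jacobian entry is exp(-f_alpha), so its log is -f_alpha
        log_det -= f_alpha
    return out, log_det
```

All components are computed from the input in a single pass, which is what makes this direction of the flow cheap to sample from.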

6.4 Bayesian Neural Networks with Structured Priors

We illustrate our approach on a larger Bayesian neural network. To induce sparsity for the weights in the network, we consider a (regularised) Horseshoe prior [56] that has also been used increasingly as an alternative prior in Bayesian neural networks to allow for sparse variational approximations, see [41, 19] for mean-field models and [20] for a structured Gaussian approximation. We consider again an $L$-hidden layer fully-connected neural network where we assume that the weight matrix $W^{l}\in\mathbb{R}^{d_{l-1}\times d_{l}}$ for any $l\in\{1,\ldots,L+1\}$ and any $i\in\{1,\ldots,d_{l-1}\}$ satisfies a priori

[TABLE]

where $(\tilde{\lambda}_{i}^{l})^{2}=c^{2}(\lambda_{i}^{l})^{2}/(c^{2}+(\tau^{l})^{2}(\lambda_{i}^{l})^{2})$, $\lambda_{i}^{l}\sim\mathcal{C}^{+}(0,1)$, $\tau^{l}\sim\mathcal{C}^{+}(0,b_{\tau})$ and $c^{2}\sim\mathcal{IG}(\frac{\nu}{2},\nu\frac{s^{2}}{2})$ for some hyper-parameters $b_{\tau},\nu,s^{2}>0$. The vector $W_{i\cdot}^{l}$ represents all weights that interact with the $i$-th input neuron. The first Normal factor in (10) is a standard Horseshoe prior with a per layer global parameter $\tau^{l}$ that adapts to the overall sparsity in layer $l$ and shrinks all weights in this layer towards zero, due to the fact that $\mathcal{C}^{+}(0,b_{\tau})$ allows for substantial mass near zero. The local shrinkage parameter $\lambda^{l}_{i}$ allows for signals in the $i$-th input neuron because $\mathcal{C}^{+}(0,1)$ is heavy-tailed. However, this can leave large weights un-shrunk, and the second Normal factor in (10) induces a Student-$t_{\nu}(0,s^{2})$ regularisation for weights far from zero, see [56] for details. We can rewrite the model in a non-centred form [55], where the latent parameters are a priori independent, see also [41, 27, 19, 20] for similar variational approximations. We write the model as $\eta_{i}^{l}\sim\mathcal{G}(1/2,1)$, $\hat{\lambda}_{i}^{l}\sim\mathcal{IG}(1/2,1)$, $\kappa^{l}\sim\mathcal{G}(1/2,1/b_{\tau}^{2})$, $\hat{\tau}^{l}\sim\mathcal{IG}(1/2,1)$, $\beta_{i}^{l}\sim\mathcal{N}(0,I)$, $W_{i\cdot}^{l}=\tau^{l}\tilde{\lambda}_{i}^{l}{\beta}_{i}^{l}$, $\tau^{l}=\sqrt{\hat{\tau}^{l}\kappa^{l}}$, $\lambda_{i}^{l}=\sqrt{\hat{\lambda}_{i}^{l}\eta_{i}^{l}}$ and $(\tilde{\lambda}_{i}^{l})^{2}=c^{2}(\lambda_{i}^{l})^{2}/(c^{2}+(\tau^{l})^{2}(\lambda_{i}^{l})^{2})$. The target density is the posterior of these variables, after applying a log-transformation if their prior is an (inverse) Gamma law.
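The non-centred form can be made concrete with a short Python sketch that draws one layer's weights from the prior. It assumes that the second argument of $\mathcal{G}$ is a rate (so that $\sqrt{\hat{\tau}^{l}\kappa^{l}}$ is half-Cauchy with scale $b_{\tau}$) and that $\mathcal{IG}(a,b)$ can be sampled as $b$ divided by a Gamma$(a,1)$ draw; the function names and default hyper-parameter values are hypothetical.

```python
import math
import random

def inv_gamma(rng, shape, scale):
    """Draw from IG(shape, scale) as scale / Gamma(shape, 1)."""
    return scale / rng.gammavariate(shape, 1.0)

def sample_layer_weights(rng, d_in, d_out, b_tau=1.0, nu=4.0, s2=1.0):
    """Prior draw of a d_in x d_out weight matrix in the non-centred form."""
    c2 = inv_gamma(rng, nu / 2.0, nu * s2 / 2.0)
    # G(1/2, rate 1 / b_tau^2) corresponds to scale b_tau^2 in random.gammavariate
    kappa = rng.gammavariate(0.5, b_tau ** 2)
    tau = math.sqrt(inv_gamma(rng, 0.5, 1.0) * kappa)      # global scale of the layer
    W = []
    for _ in range(d_in):
        eta = rng.gammavariate(0.5, 1.0)
        lam2 = inv_gamma(rng, 0.5, 1.0) * eta               # squared local scale
        lam_tilde = math.sqrt(c2 * lam2 / (c2 + tau ** 2 * lam2))
        beta = [rng.gauss(0.0, 1.0) for _ in range(d_out)]
        W.append([tau * lam_tilde * bj for bj in beta])     # W_i. = tau * lam_tilde * beta_i
    return W, tau
```

All random inputs are drawn independently and the coupling between weights enters only through the deterministic map, which is the property that the non-centred form is meant to provide.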

We performed classification on MNIST using a $2$-hidden layer fully-connected network where the hidden layers are of size $200$ each. Further algorithmic details are given in Appendix D. Prediction errors for the variational families as considered in the preceding experiments are given in Table 7. We again find that a copula-like density outperforms the independent copula. Using a copula-like density without the rotation also performs competitively as long as one uses a balanced amount of its antithetic component via the transformation $\mathscr{H}$ with parameter $\delta$; ignoring the transformation $\mathscr{H}$ or setting $\delta_{i}=0.99$ for all $i\in\{1,\ldots,d\}$ can limit the representational power of the variational family and can result in high predictive errors. The neural network function for the IAF considered here has two hidden layers of size $100\times 100$. It can be seen that applying the rotations can be beneficial compared to the IAF for the copula-like density, whereas the two transformations perform similarly for the independent base distribution. We expect that more ad-hoc tricks can be used to adjust the rotations to some computational budget. For instance, one could include additional rotations for a group of latent variables such as those within one layer. Conversely, one could consider the series of sparse rotations $\mathcal{O}_{1},\ldots,\mathcal{O}_{k}$, but with $2^{k}<d$, thereby allowing for rotations of the more adjacent latent variables only.

Our experiment illustrates that the proposed approach can be used in high-dimensional structured Bayesian models without having to specify more model-specific dependency assumptions in the variational family. The prediction errors are in line with current work for fully connected networks using a Gaussian variational family with Normal priors, cf. [47]. Better predictive performance for a fully connected Bayesian network has been reported in [37], which uses RealNVP [10] as a normalising flow in a large network that is reparametrised using weight normalization [61]. It becomes scalable by opting to consider only variational inference over the Euclidean norm of $W_{i\cdot}^{l}$ and performing point estimation for the direction of the weight vector $W_{i\cdot}^{l}/||W_{i\cdot}^{l}||_{2}$. Such a parametrisation does not allow for a flexible dependence structure of the weights within one layer; such a model architecture should be complementary to the variational family proposed in this work.

7 Conclusion

We have addressed the challenging problem of constructing a family of distributions that allows for some flexibility in its dependence structure, whilst also having a reasonable computational complexity. It has been shown experimentally that it can constitute a useful replacement of a Gaussian approximation without requiring many algorithmic changes.

Acknowledgements

Alain Durmus acknowledges support from Chaire BayeScale "P. Laffitte" and from Polish National Science Center grant: NCN UMO-2018/31/B/ST1/0025. This research has been partly financed by the Alan Turing Institute under the EPSRC grant EP/N510129/1. The authors acknowledge the use of the UCL Myriad High Throughput Computing Facility (Myriad@UCL), and associated support services, in the completion of this work.

Appendix A Proof of Proposition 3

Proof.

Let $f:\mathbb{R}^{d}\to\mathbb{R}_{+}$ be a positive and bounded function. We have by definition, using the expression of the density of the Dirichlet and Beta distributions, see [13], and setting $u_{d}=1-\sum_{i=1}^{d-1}u_{i}$ ,

[TABLE]

where

[TABLE]

Then by symmetry, without loss of generality, we only need to consider $A_{1}$ . Using the change of variable, $(g,u_{1},u_{2},\ldots,u_{d-1})\mapsto(g,u_{1},gu_{2}/u_{1},\ldots,gu_{d-1}/u_{1})$ , which is a $\mathrm{C}^{1}$ -diffeomorphism from $\Delta_{1}=\{(g,u_{1},\ldots,u_{d-1})\in\left[0,1\right]^{d}\,:\,u_{1}=\max_{j\in\{1,\ldots,d\}}u_{j}\}$ to $\tilde{\Delta}_{1}=\{(g,u_{1},w_{2},\ldots,w_{d-1})\in\left[0,1\right]^{d}\,:\,\max_{j\in\{2,\ldots,d-1\}}w_{j}\leqslant g,g/u_{1}-g-\sum_{j=2}^{d-1}w_{j}\leqslant g\}$ , we get that

[TABLE]

Now using the change of variable $(g,u_{1},w_{2},\ldots,w_{d-1})\mapsto(g,g/u_{1}-g-\sum_{i=2}^{d-1}w_{i},w_{2},\ldots,w_{d-1})=(g,w_{d},w_{2},\ldots,w_{d-1})$, which is a $\mathrm{C}^{1}$-diffeomorphism from $\tilde{\Delta}_{1}$ to

[TABLE]

we obtain since $g/u_{1}=g+\sum_{j=2}^{d}w_{j}$ that

[TABLE]

Combining this result, (11) and (12) completes the proof. ∎

Appendix B Butterfly rotation matrices

Suppose $d=2^{k}$ for some $k\in\mathbb{N}$ and let $c_{i}=\cos\nu_{i}$ and $s_{i}=\sin\nu_{i}$ . For $d=1$ , define $\mathcal{R}_{1}=[1]$ . Assume $\mathcal{R}_{d}$ has been defined. Then define

[TABLE]

where $\tilde{\mathcal{R}}_{d}$ has the same form as $\mathcal{R}_{d}$ except that the $c_{i}$ and $s_{i}$ indices are all increased by $d$ . So for instance

[TABLE]

Suppose now that $d$ is not a power of $2$ and let $k=\lceil\log_{2}d\rceil$. We construct $\mathcal{R}_{d}$ as a product of $k$ factors $\mathcal{O}_{1}\cdots\mathcal{O}_{k}$ as used in the construction of $\mathcal{R}_{2^{k}}$. For any $i\in\{1,\ldots,k\}$, we then delete from $\mathcal{O}_{i}$ the last $2^{k}-d$ rows and columns. Then every $c_{i}$ in the remaining $d\times d$ matrix that is in the same column as a deleted $s_{i}$ is replaced by $1$. As an example, for $d=5$, we have

[TABLE]
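The truncation rule for a dimension that is not a power of two can be illustrated with a small Python sketch that builds the $k$ factors as dense matrices and multiplies them. It is meant as a check of the construction rather than an efficient implementation: a pair rotation is kept only if both coordinates survive the truncation, and otherwise the diagonal entry stays equal to $1$. The angle layout, the sign convention and the function names are assumptions.

```python
import math

def butterfly_factors(d, nu):
    """Dense d x d versions of the k = ceil(log2(d)) truncated sparse factors."""
    k = max(d - 1, 0).bit_length()
    full = 2 ** k
    if len(nu) < full - 1:
        raise ValueError("need 2**k - 1 angles (some may remain unused)")
    factors, offset = [], 0
    for level in range(k):
        stride = 2 ** level
        O = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        n_blocks = full // (2 * stride)
        for b in range(n_blocks):
            c, s = math.cos(nu[offset + b]), math.sin(nu[offset + b])
            start = 2 * stride * b
            for j in range(start, start + stride):
                if j + stride < d:  # partner coordinate survives the truncation
                    O[j][j], O[j][j + stride] = c, -s
                    O[j + stride][j], O[j + stride][j + stride] = s, c
                # otherwise the diagonal entry of a surviving row stays 1
        offset += n_blocks
        factors.append(O)
    return factors

def matmul(A, B):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def rotation_matrix(d, nu):
    """Product of all truncated factors, or the identity if there are none."""
    R = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for O in butterfly_factors(d, nu):
        R = matmul(R, O)
    return R
```

Every truncated factor still consists of disjoint $2\times 2$ rotation blocks and ones on the diagonal, so the product remains orthogonal for any $d$.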

Appendix C Optimization of the variational bound

Recall that for independent random variables $Z_{i}\sim\mathcal{G}(\alpha_{i},1)$, for $i\in\{1,\ldots,d\}$, we have $\left(\frac{Z_{1}}{\sum_{j=1}^{d}Z_{j}},\ldots,\frac{Z_{d}}{\sum_{j=1}^{d}Z_{j}}\right)\sim\text{Dirichlet}(\alpha_{1},\ldots,\alpha_{d})$, cf. [13]. Similarly, for independent random variables $Z_{d+1}\sim\mathcal{G}(a,1)$ and $Z_{d+2}\sim\mathcal{G}(b,1)$, it holds that $\frac{Z_{d+1}}{Z_{d+1}+Z_{d+2}}\sim\text{Beta}(a,b)$. Recall that the parameter of the rotated variational family is $\xi=(\theta,\phi,\delta)$, where $\theta$ is the parameter of the copula-like base density, whereas $\phi=(\phi_{f},\phi_{\mathcal{R}})$ denotes the parameters of the quantile transformation and the rotation, respectively. Furthermore, the parameter $\delta$ of the transformation $\mathscr{H}$ is kept fixed. Using Proposition 3 and Algorithm 1 for some fixed $\delta$, we can construct a function $(z,\phi)\mapsto f_{\phi,\delta}(z)$, $z=(z_{1},\ldots,z_{d+2})$, that is almost everywhere continuously differentiable such that $f_{\phi,\delta}(Z_{1},\ldots,Z_{d+2})\sim q_{\xi}$, where $q_{\xi}$ is the density of the proposed variational family with parameter $\xi=(\theta,\phi,\delta)$; that is, the variational density $q_{\xi}$ is the pushforward density of independent Gamma densities with parameter $\theta$ through the transport map $f_{\phi,\delta}$. Differentiability with respect to $\phi_{f}$ can be achieved by a continuous numerical approximation for the quantile function of a standard Gaussian and applying appropriate (re)normalisation.
Furthermore, there exists an invertible standardization function $\mathcal{S}_{\theta}$ with $(z,\theta)\mapsto\mathcal{S}_{\theta}(z)=(\mathbb{P}\left(Z_{1}\leqslant z_{1}\right),\ldots,\mathbb{P}\left(Z_{d+2}\leqslant z_{d+2}\right))$ continuously differentiable such that $\mathcal{S}_{\theta}^{-1}(H)$ is equal to $(Z_{1},\ldots,Z_{d+2})$ in distribution, where $H$ is a $(d+2)$-dimensional vector of iid random variables with uniform marginals on $[0,1]$. In particular, the distribution of $H$ does not depend on $\xi$. The cumulative distribution function of $Z_{1}$, say at the point $z_{1}$, is the regularised incomplete Gamma function $\gamma(z_{1},\alpha_{1})$, which lacks a closed-form expression. However, one can apply automatic differentiation to a numerical method that approximates $\gamma(z_{1},\alpha_{1})$, yielding an approximation of $\frac{\partial\gamma(z_{1},\alpha_{1})}{\partial\alpha_{1}}$. Let us define

[TABLE]

Then $\mathcal{L}(\xi)={\mathbb{E}}\left[l(Z,\phi,\delta)\right]={\mathbb{E}}\left[l(\mathcal{S}^{-1}_{\theta}(H),\phi,\delta)\right]$ , where in the first expectation, the law of the random variable $Z$ depends on $\theta$ . For a differentiable function $g\colon\mathbb{R}^{n}\to\mathbb{R}^{m}$ , we denote by $\nabla_{x}g(x)$ the Jacobian of $g$ , that is $\nabla_{x}g(x)_{ij}=\frac{\partial g_{i}(x)}{\partial x_{j}}$ . Following the arguments in [14], we obtain for the Jacobian of the variational bound

[TABLE]

where $\nabla_{\phi}Z=0$ and $\nabla_{\theta}Z=\nabla_{\theta}\mathcal{S}_{\theta}^{-1}(H)|_{H=\mathcal{S}_{\theta}(Z)}$ can be obtained by implicit differentiation of $\mathcal{S}_{\theta}(Z)=H$, which results in $\nabla_{\theta}Z=-(\nabla_{z}\mathcal{S}_{\theta}(Z))^{-1}\nabla_{\theta}\mathcal{S}_{\theta}(Z)$. So for instance $\frac{\partial Z_{1}}{\partial\alpha_{1}}=-\frac{1}{p_{\alpha_{1}}(Z_{1})}\frac{\partial\gamma(Z_{1},\alpha_{1})}{\partial\alpha_{1}}$, with $p_{\alpha_{1}}$ being the density function of $Z_{1}$ and recalling that $\theta=(a,b,\alpha_{1},\ldots,\alpha_{d})$. We can thus optimize the variational bound using stochastic gradient descent with unbiased estimates of (13). We remark that, for instance in TensorFlow Probability [9], such implicit gradients are used by default as long as one simulates from the copula-like density using Proposition 3, implements the density function $c_{\theta}$ from (6) and applies the bijective transformations according to Algorithm 1. In this case, optimization using the proposed density proceeds in the same way as for any reparametrisable variational family such as Gaussian distributions.
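The scalar case of the implicit gradient can be checked numerically with a short, dependency-free Python sketch. It approximates the Gamma cumulative distribution function by its power series and uses a central finite difference in place of automatic differentiation; both choices, the step sizes and the function names are assumptions made to keep the example self-contained. The bisection-based quantile function is included only to compare against the explicit derivative of $\mathcal{S}_{\theta}^{-1}$.

```python
import math

def gamma_cdf(z, alpha, terms=200):
    """Regularised lower incomplete gamma function via its power series (z > 0)."""
    term = 1.0 / math.gamma(alpha + 1.0)
    total = term
    for n in range(1, terms):
        term *= z / (alpha + n)
        total += term
    return total * math.exp(alpha * math.log(z) - z)

def gamma_pdf(z, alpha):
    """Density of a Gamma(alpha, 1) random variable at z > 0."""
    return math.exp((alpha - 1.0) * math.log(z) - z - math.lgamma(alpha))

def implicit_grad(z, alpha, eps=1e-5):
    """dZ/dalpha = -(dF/dalpha) / p_alpha(Z), with dF/dalpha by finite differences."""
    dF_dalpha = (gamma_cdf(z, alpha + eps) - gamma_cdf(z, alpha - eps)) / (2.0 * eps)
    return -dF_dalpha / gamma_pdf(z, alpha)

def gamma_quantile(h, alpha, lo=1e-12, hi=50.0):
    """Inverse CDF by bisection, used here only as a reference."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, alpha) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a fixed uniform value $h$, differentiating `gamma_quantile(h, alpha)` with respect to `alpha` by finite differences should agree with `implicit_grad` evaluated at the corresponding quantile, while the implicit version never needs to invert the distribution function.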

Appendix D Additional details for Bayesian Neural Networks with Structured Priors

In the MNIST experiments, we train the network on $50000$ training points out of $60000$ and report the prediction error rates for the test set of $10000$ images. We used a batch-size of $200$ and used $4$ Monte Carlo samples to compute the gradients during training and $100$ Monte Carlo samples for the prediction on the test set. We used Adam with a learning rate in $\{0.0005,0.0002\}$ for $20000$ iterations. The hyper-parameters for the Horseshoe prior were $\nu=4$, $s=1$, so $c^{2}\sim\mathcal{IG}(2,8)$, corresponding to a $t_{4}(0,2^{2})$ slab. Furthermore, for the global shrinkage factor, we have used $b_{\tau}\in\{0.1,1\}$. The variational parameters of the copula-like density are restricted to be positive and we have defined them as the $\text{softplus}\colon x\mapsto\log(\exp(x)+1)$ of unconstrained parameters, initialised so that $\text{softplus}^{-1}(\alpha_{i})\sim\mathcal{N}(2,0.01)$, $\text{softplus}^{-1}(a)=15$ and $\text{softplus}^{-1}(b)=2$. We have sampled $\delta$ according to (8) and initialised $\nu_{i}\sim\mathcal{U}(-0.2,0.2)$ and the log-standard deviations of the marginal-like distribution as $\log\sigma_{i}=-3$. We aimed for an initial mean of $0$ for $\beta_{i}^{l}$ and of $-3$ for the $\log$ of the remaining variables. We therefore choose $\mu_{i}$ so that the quantile of an initial Monte Carlo estimate for the mean of $V_{i}$ has the desired initial mean.
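The positivity constraint via the map $x\mapsto\log(\exp(x)+1)$, commonly known as softplus, and its inverse can be written in a numerically stable way as in the following Python sketch. The function names are hypothetical, and the two initial values simply restate the initialisation of $a$ and $b$ given above.

```python
import math

def softplus(x):
    """log(exp(x) + 1), rearranged so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_inv(y):
    """Inverse map log(exp(y) - 1) for y > 0, written as y + log(1 - exp(-y))."""
    return y + math.log(-math.expm1(-y))

# Positive parameters obtained from the unconstrained initial values 15 and 2
a_init = softplus(15.0)
b_init = softplus(2.0)
```

Optimising the unconstrained value and mapping it through `softplus` keeps the parameter positive without any projection step during gradient descent.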

Appendix E Additional results for Bayesian Neural Networks with Gaussian Priors

Bibliography

[1] David Barber and Christopher M Bishop. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems, pages 395–401, 1998.
[2] Tim Bedford and Roger M Cooke. Probability density decomposition for conditionally dependent random variables modeled by vines. Annals of Mathematics and Artificial Intelligence, 32(1-4):245–268, 2001.
[3] Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
[4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of The 32nd International Conference on Machine Learning, pages 1613–1622, 2015.
[6] Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.
[7] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.
[8] Petros Dellaportas and Mike G Tsionas. Importance sampling from posterior distributions using copula-like approximations. Journal of Econometrics, 2018.