Variational Bayes for high-dimensional linear regression with sparse priors

Kolyan Ray; Botond Szabo

arXiv:1904.07150·stat.ME·November 20, 2020

TL;DR

This paper develops a mean-field variational Bayes approach for high-dimensional sparse linear regression, providing theoretical guarantees and practical improvements over existing methods.

Contribution

It derives oracle inequalities for the mean-field VB approximation and introduces a novel prioritized updating scheme for coordinate-ascent variational inference (CAVI) that improves empirical performance.

Findings

1. Under compatibility conditions on the design matrix, the VB approximation converges to the sparse truth at the optimal rate.

2. The proposed prioritized updating scheme outperforms lexicographic and random CAVI update orders in simulations.

3. The method performs comparably to state-of-the-art Bayesian variable selection techniques.

Abstract

We study a mean-field spike and slab variational Bayes (VB) approximation to Bayesian model selection priors in sparse high-dimensional linear regression. Under compatibility conditions on the design matrix, oracle inequalities are derived for the mean-field VB approximation, implying that it converges to the sparse truth at the optimal rate and gives optimal prediction of the response vector. The empirical performance of our algorithm is studied, showing that it works comparably well as other state-of-the-art Bayesian variable selection methods. We also numerically demonstrate that the widely used coordinate-ascent variational inference (CAVI) algorithm can be highly sensitive to the parameter updating order, leading to potentially poor performance. To mitigate this, we propose a novel prioritized updating scheme that uses a data-driven updating order and performs better in…

Tables

Table 1: We compare the prioritized, lexicographic and random updating schemes in the CAVI algorithm. We take $X_{ij}\stackrel{iid}{\sim}N(0,1)$, $n=100$, $p=200$, $s=20$, $\theta_{i}=10$ for the non-zero coefficients, which are located at the (i) beginning, (ii) middle, (iii) end, (iv) (uniformly) random locations of the signal. We report the means and standard deviations over 200 runs.

Metric	Method	(i)	(ii)	(iii)	(iv)
$\ell_{2}$-error	prioritized	1.03 $\pm$ 3.39	1.18 $\pm$ 3.86	1.06 $\pm$ 3.48	0.61 $\pm$ 1.65
	lexicographic	0.71 $\pm$ 2.14	26.61 $\pm$ 15.04	45.72 $\pm$ 5.45	37.91 $\pm$ 5.63
	randomized	27.81 $\pm$ 13.30	27.26 $\pm$ 13.78	25.14 $\pm$ 14.70	35.08 $\pm$ 8.28
FDR	prioritized	0.02 $\pm$ 0.12	0.02 $\pm$ 0.13	0.02 $\pm$ 0.12	0.05 $\pm$ 0.18
	lexicographic	0.01 $\pm$ 0.08	0.63 $\pm$ 0.35	0.87 $\pm$ 0.03	0.54 $\pm$ 0.38
	randomized	0.68 $\pm$ 0.31	0.66 $\pm$ 0.32	0.62 $\pm$ 0.352	0.69 $\pm$ 0.30
TPR	prioritized	1.00 $\pm$ 0.00	1.00 $\pm$ 0.01	1.00 $\pm$ 0.01	1.00 $\pm$ 0.01
	lexicographic	1.00 $\pm$ 0.00	0.93 $\pm$ 0.06	0.75 $\pm$ 0.11	0.95 $\pm$ 0.05
	randomized	0.93 $\pm$ 0.07	0.92 $\pm$ 0.06	0.93 $\pm$ 0.07	0.91 $\pm$ 0.07
runtime (sec)	prioritized	0.28 $\pm$ 0.09	0.24 $\pm$ 0.06	0.26 $\pm$ 0.06	0.24 $\pm$ 0.08
	lexicographic	0.22 $\pm$ 0.06	0.21 $\pm$ 0.05	0.21 $\pm$ 0.04	0.23 $\pm$ 0.06
	randomized	0.24 $\pm$ 0.08	0.22 $\pm$ 0.05	0.23 $\pm$ 0.05	0.25 $\pm$ 0.06
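
The metrics reported in these tables can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: it assumes the standard definitions (FDR as the fraction of selected coordinates that are truly zero, TPR as the fraction of truly non-zero coordinates that are selected), and the function name `selection_metrics` and its arguments are hypothetical.

```python
import numpy as np

def selection_metrics(theta_hat, selected, theta0):
    """Return (l2_error, fdr, tpr) for an estimate and a selected set.

    theta_hat : estimated coefficient vector
    selected  : boolean mask of coordinates declared non-zero (assumption:
                how this mask is obtained is left to the caller)
    theta0    : true coefficient vector
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta0 = np.asarray(theta0, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    true_support = theta0 != 0

    n_selected = selected.sum()
    false_discoveries = np.sum(selected & ~true_support)
    # Convention (assumption): FDR is 0 when nothing is selected.
    fdr = false_discoveries / n_selected if n_selected > 0 else 0.0
    tpr = np.sum(selected & true_support) / max(true_support.sum(), 1)
    l2_error = np.linalg.norm(theta_hat - theta0)
    return float(l2_error), float(fdr), float(tpr)

theta0 = np.array([1.0, 0.0, 0.0, 2.0])
theta_hat = np.array([1.0, 0.5, 0.0, 0.0])
l2_error, fdr, tpr = selection_metrics(theta_hat, theta_hat != 0, theta0)
print(l2_error, fdr, tpr)
```

In practice the selected set would come from thresholding posterior inclusion probabilities; the threshold is a user choice and is not fixed by this sketch.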

Table 2: Linear regression with Gaussian design $X_{ij}\stackrel{iid}{\sim}N(0,1)$, unknown noise variance $\varsigma^{2}$ and non-zero signal coefficients $\theta_{i}=A$, with parameter $(n,p,s,A,\varsigma)$ choices (i) $(100,400,20,\log n,5)$ (non-zero coefficients at the beginning); (ii) $\big(100,1000,3,(1,2,3),1\big)$ (at the end); (iii) $(200,800,5,\stackrel{iid}{\sim}U(-5,5),0.2)$ (in the middle); (iv) $(100,400,20,2\log n,5)$ (at the end).

Metric	Method	(i)	(ii)	(iii)	(iv)
$\ell_{2}$-error	sparsevb	10.48 $\pm$ 6.84	0.21 $\pm$ 0.14	0.03 $\pm$ 0.01	6.55 $\pm$ 7.80
	varbvs	14.23 $\pm$ 6.51	0.18 $\pm$ 0.07	0.03 $\pm$ 0.01	20.43 $\pm$ 17.15
	EMVS	14.02 $\pm$ 2.46	3.57 $\pm$ 0.03	5.04 $\pm$ 0.33	21.52 $\pm$ 11.29
	SSLASSO	20.62 $\pm$ 0.17	0.16 $\pm$ 0.11	0.09 $\pm$ 0.12	37.92 $\pm$ 9.84
	ebreg	9.38 $\pm$ 6.05	0.18 $\pm$ 0.07	0.17 $\pm$ 0.04	7.39 $\pm$ 7.42
FDR	sparsevb	0.12 $\pm$ 0.17	0.06 $\pm$ 0.16	0.00 $\pm$ 0.00	0.02 $\pm$ 0.07
	varbvs	0.06 $\pm$ 0.11	0.01 $\pm$ 0.04	0.00 $\pm$ 0.00	0.07 $\pm$ 0.15
	EMVS	0.24 $\pm$ 0.13	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.43 $\pm$ 0.25
	SSLASSO	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00
	ebreg	0.38 $\pm$ 0.20	0.01 $\pm$ 0.02	0.00 $\pm$ 0.00	0.28 $\pm$ 0.16
TPR	sparsevb	0.70 $\pm$ 0.31	1.00 $\pm$ 0.00	0.96 $\pm$ 0.13	0.94 $\pm$ 0.18
	varbvs	0.340 $\pm$ 0.37	1.00 $\pm$ 0.00	0.57 $\pm$ 0.43	0.53 $\pm$ 0.44
	EMVS	0.59 $\pm$ 0.14	0.00 $\pm$ 0.00	0.86 $\pm$ 0.09	0.88 $\pm$ 0.10
	SSLASSO	0.01 $\pm$ 0.01	0.94 $\pm$ 0.13	0.10 $\pm$ 0.29	0.09 $\pm$ 0.28
	ebreg	0.88 $\pm$ 0.18	1.00 $\pm$ 0.00	0.98 $\pm$ 0.07	1.00 $\pm$ 0.04
runtime (sec)	sparsevb	0.43 $\pm$ 0.27	0.71 $\pm$ 0.21	0.35 $\pm$ 0.25	0.65 $\pm$ 0.53
	varbvs	0.60 $\pm$ 0.28	2.02 $\pm$ 0.50	0.51 $\pm$ 0.38	0.56 $\pm$ 0.23
	EMVS	0.20 $\pm$ 0.07	1.72 $\pm$ 0.44	0.21 $\pm$ 0.05	0.19 $\pm$ 0.09
	SSLASSO	0.06 $\pm$ 0.03	0.37 $\pm$ 0.11	0.06 $\pm$ 0.01	0.07 $\pm$ 0.03
	ebreg	35.05 $\pm$ 7.03	21.20 $\pm$ 6.05	31.42 $\pm$ 4.01	36.33 $\pm$ 9.13

Table 3: Cross-validated $\ell_{2}$-estimation error of Bayesian model selection methods

data \ Method	sparsevb	varbvs	EMVS	SSLASSO
CV error	16.43	59.49	74.45	53.28
model size	9	7	14	5
runtime (sec)	1.49	1.14	0.02	0.10

Table 4: Linear regression with (i) identity design $X=I_{n}$, and (ii)-(iv) Gaussian design $X_{ij}\stackrel{iid}{\sim}N(0,\tau^{2})$. The non-zero coefficients are located in the beginning of the signal. The parameters $(n,p,s,A)$ are set to (i) $(400,400,40,4\sqrt{\log n})$; (ii) $(100,200,20,U(0,2\log(n)))$; (iii) $(200,800,40,2\log n)$; (iv) $(100,400,15,U(-8,8))$. We set (ii) $\tau=1$; (iii) $\tau=0.1$; (iv) $\tau=0.5$. We compare the means and standard deviations over 200 runs for our method and other variations of the VB algorithm.

Metric	Method \ Experiment	(i)	(ii)	(iii)	(iv)
$\ell_{2}$-error	Laplace $\mathcal{P}_{MF}$	8.80 $\pm$ 0.85	1.30 $\pm$ 0.26	9.25 $\pm$ 9.73	1.08 $\pm$ 0.20
	Laplace $\mathcal{Q}_{MF}$	8.80 $\pm$ 0.85	6.73 $\pm$ 1.79	39.98 $\pm$ 6.88	6.56 $\pm$ 1.97
	Gauss	31.06 $\pm$ 0.49	1.93 $\pm$ 0.51	43.58 $\pm$ 2.94	1.40 $\pm$ 0.29
	Gauss (batch-wise)	31.11 $\pm$ 0.48	16.38 $\pm$ 0.79	66.98 $\pm$ 0.00	18.03 $\pm$ 0.00
	Gauss ($\rho=\|\theta_{0}\|_{2}$)	6.26 $\pm$ 0.72	1.42 $\pm$ 0.32	58.12 $\pm$ 19.01	2.05 $\pm$ 3.59
FDR	Laplace $\mathcal{P}_{MF}$	0.00 $\pm$ 0.00	0.00 $\pm$ 0.01	0.03 $\pm$ 0.11	0.00 $\pm$ 0.02
	Laplace $\mathcal{Q}_{MF}$	0.00 $\pm$ 0.00	0.70 $\pm$ 0.07	0.45 $\pm$ 0.08	0.55 $\pm$ 0.14
	Gauss	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.50 $\pm$ 0.03	0.01 $\pm$ 0.03
	Gauss (batch-wise)	0.00 $\pm$ 0.00	0.87 $\pm$ 0.01	0.62 $\pm$ 0.03	0.82 $\pm$ 0.03
	Gauss ($\rho=\|\theta_{0}\|_{2}$)	0.00 $\pm$ 0.00	0.25 $\pm$ 0.36	0.57 $\pm$ 0.21	0.06 $\pm$ 0.21
TPR	Laplace $\mathcal{P}_{MF}$	1.00 $\pm$ 0.00	0.89 $\pm$ 0.02	0.99 $\pm$ 0.06	0.81 $\pm$ 0.03
	Laplace $\mathcal{Q}_{MF}$	1.00 $\pm$ 0.00	0.81 $\pm$ 0.06	0.88 $\pm$ 0.08	0.74 $\pm$ 0.07
	Gauss	1.00 $\pm$ 0.00	0.89 $\pm$ 0.02	0.94 $\pm$ 0.05	0.81 $\pm$ 0.03
	Gauss (batch-wise)	1.00 $\pm$ 0.00	0.88 $\pm$ 0.07	0.82 $\pm$ 0.07	0.68 $\pm$ 0.08
	Gauss ($\rho=\|\theta_{0}\|_{2}$)	1.00 $\pm$ 0.00	0.81 $\pm$ 0.10	0.58 $\pm$ 0.17	0.78 $\pm$ 0.10

Table 5: Performance of sparsevb for different hyper-parameter values $\lambda$. We take Gaussian design $X_{ij}\stackrel{iid}{\sim}N(0,\tau^{2})$, place the non-zero signal coefficients $\theta_{0,i}=A$ at the beginning of the signal, and set the parameters $(n,p,s,\tau,A)$ equal to (i) $(200,300,15,0.5,2\log n)$; (ii) $(500,1000,50,1,2\log n)$; (iii) $(200,500,20,0.2,U(-10,10))$; (iv) $(1000,2000,15,2,U(-8,8))$.

Metric	Method	(i)	(ii)	(iii)	(iv)
$\ell_{2}$-error	$λ = 1 / 20$	0.56 $\pm$ 0.11	35.81 $\pm$ 2.17	2.49 $\pm$ 0.50	0.09 $\pm$ 0.02
	$λ = 1 / 4$	0.57 $\pm$ 0.12	35.28 $\pm$ 2.32	2.34 $\pm$ 0.50	0.08 $\pm$ 0.02
	$λ = 1$	0.57 $\pm$ 0.11	16.38 $\pm$ 14.63	2.38 $\pm$ 0.48	0.08 $\pm$ 0.02
	$λ = 4$	0.67 $\pm$ 0.12	0.34 $\pm$ 0.03	3.56 $\pm$ 0.51	0.07 $\pm$ 0.02
	$λ = 20$	1.85 $\pm$ 0.22	0.47 $\pm$ 0.05	11.94 $\pm$ 1.01	0.07 $\pm$ 0.02
FDR	$λ = 1 / 20$	0.00 $\pm$ 0.00	0.92 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00
	$λ = 1 / 4$	0.00 $\pm$ 0.01	0.92 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00
	$λ = 1$	0.00 $\pm$ 0.01	0.51 $\pm$ 0.45	0.00 $\pm$ 0.01	0.00 $\pm$ 0.00
	$λ = 4$	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.01	0.00 $\pm$ 0.01
	$λ = 20$	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.02	0.00 $\pm$ 0.00
TPR	$λ = 1 / 20$	1.00 $\pm$ 0.00	1.00 $\pm$ 0.00	0.91 $\pm$ 0.04	0.95 $\pm$ 0.03
	$λ = 1 / 4$	1.00 $\pm$ 0.00	1.00 $\pm$ 0.01	0.92 $\pm$ 0.04	0.96 $\pm$ 0.03
	$λ = 1$	1.00 $\pm$ 0.00	1.00 $\pm$ 0.01	0.92 $\pm$ 0.04	0.97 $\pm$ 0.03
	$λ = 4$	1.00 $\pm$ 0.00	1.00 $\pm$ 0.01	0.90 $\pm$ 0.04	0.98 $\pm$ 0.03
	$λ = 20$	1.00 $\pm$ 0.00	1.00 $\pm$ 0.00	0.59 $\pm$ 0.07	0.98 $\pm$ 0.03
runtime (sec)	$λ = 1 / 20$	0.04 $\pm$ 0.01	0.75 $\pm$ 0.11	0.08 $\pm$ 0.02	4.23 $\pm$ 0.46
	$λ = 1 / 4$	0.04 $\pm$ 0.01	0.81 $\pm$ 0.18	0.08 $\pm$ 0.02	4.23 $\pm$ 0.50
	$λ = 1$	0.04 $\pm$ 0.01	1.71 $\pm$ 0.63	0.08 $\pm$ 0.02	4.23 $\pm$ 0.46
	$λ = 4$	0.04 $\pm$ 0.02	1.26 $\pm$ 0.15	0.08 $\pm$ 0.02	4.25 $\pm$ 0.50
	$λ = 20$	0.04 $\pm$ 0.01	0.75 $\pm$ 0.06	0.08 $\pm$ 0.01	4.29 $\pm$ 0.65

Table 6: Noise misspecification: we compare the robustness of Bayesian model selection methods under misspecified noise. We take Gaussian design $X_{ij}\stackrel{iid}{\sim}N(0,2)$, set the model parameters $n=200$, $p=400$, $s=20$, and take non-zero coefficients $\theta_{i}\stackrel{iid}{\sim}U(-10,10)$ located in the beginning of the signal. We ran each experiment 200 times and report the means and standard deviations.

Metric	Method	(i) $N(0,1)$	(ii) $\text{Lap}(0,1)$	(iii) $U(-2,2)$	(iv) Student $t_{3}$
$\ell_{2}$-error	sparsevb	0.18 $\pm$ 0.05	0.24 $\pm$ 0.04	0.21 $\pm$ 0.03	0.30 $\pm$ 0.06
	varbvs	0.17 $\pm$ 0.03	0.24 $\pm$ 0.04	0.21 $\pm$ 0.03	0.30 $\pm$ 0.06
	EMVS	0.59 $\pm$ 0.03	1.03 $\pm$ 0.14	0.89 $\pm$ 0.16	1.13 $\pm$ 0.43
	SSLASSO	5.99 $\pm$ 0.98	4.07 $\pm$ 1.02	4.88 $\pm$ 0.62	4.87 $\pm$ 0.78
	ebreg	0.26 $\pm$ 0.05	0.26 $\pm$ 0.07	0.23 $\pm$ 0.05	0.23 $\pm$ 0.05
FDR	sparsevb	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00
	varbvs	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.01
	EMVS	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.01
	SSLASSO	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00
	ebreg	0.01 $\pm$ 0.02	0.01 $\pm$ 0.05	0.01 $\pm$ 0.05	0.01 $\pm$ 0.03
TPR	sparsevb	1.00 $\pm$ 0.01	1.00 $\pm$ 0.00	0.95 $\pm$ 0.00	0.90 $\pm$ 0.01
	varbvs	1.00 $\pm$ 0.00	1.00 $\pm$ 0.00	0.95 $\pm$ 0.01	0.90 $\pm$ 0.01
	EMVS	0.95 $\pm$ 0.02	0.92 $\pm$ 0.02	0.89 $\pm$ 0.02	0.81 $\pm$ 0.04
	SSLASSO	0.67 $\pm$ 0.04	0.72 $\pm$ 0.05	0.64 $\pm$ 0.02	0.64 $\pm$ 0.04
	ebreg	1.00 $\pm$ 0.01	1.00 $\pm$ 0.00	0.95 $\pm$ 0.01	0.90 $\pm$ 0.01
runtime (sec)	sparsevb	0.22 $\pm$ 0.03	0.24 $\pm$ 0.06	0.24 $\pm$ 0.06	0.26 $\pm$ 0.08
	varbvs	0.32 $\pm$ 0.05	0.32 $\pm$ 0.05	0.35 $\pm$ 0.08	0.35 $\pm$ 0.09
	EMVS	1.24 $\pm$ 0.15	1.19 $\pm$ 0.24	1.26 $\pm$ 0.27	1.31 $\pm$ 0.34
	SSLASSO	0.16 $\pm$ 0.03	0.28 $\pm$ 0.04	0.22 $\pm$ 0.05	0.28 $\pm$ 0.07
	ebreg	24.37 $\pm$ 7.10	24.89 $\pm$ 3.78	127.72 $\pm$ 4.51	28.19 $\pm$ 4.51

Table 7: Linear regression with correlated Gaussian design $X_{i\cdot}\stackrel{iid}{\sim}N_{p}(0,\Sigma)$, with correlation $\Sigma_{jk}=\rho$ for $j\neq k$ and $\Sigma_{jj}=1$. The noise variance $\varsigma^{2}$ is unknown and the non-zero signal coefficients equal $\theta_{i}=A$. We take the parameters $(n,p,s,A,\rho,\varsigma)$ equal to (i) $(100,400,10,\stackrel{iid}{\sim}U(-3,3),0.3,0.2)$ (non-zero coefficients at the beginning); (ii) $(100,400,10,\stackrel{iid}{\sim}U(-3,3),0.7,0.2)$ (at the beginning); (iii) $(200,800,20,2\log n,0.3,5)$ (at the end); (iv) $(200,800,20,2\log n,0.7,5)$ (at the end). We compare the means and standard deviations over 100 runs.

Metric	Method	(i)	(ii)	(iii)	(iv)
$\ell_{2}$-error	sparsevb	0.12 $\pm$ 0.06	0.89 $\pm$ 1.40	1.97 $\pm$ 0.37	4.85 $\pm$ 1.29
	varbvs	0.13 $\pm$ 0.06	0.30 $\pm$ 0.10	2.10 $\pm$ 0.43	27.18 $\pm$ 23.59
	EMVS	4.80 $\pm$ 0.21	5.29 $\pm$ 0.26	4.04 $\pm$ 0.30	7.04 $\pm$ 0.98
	SSLASSO	1.62 $\pm$ 0.35	0.97 $\pm$ 0.36	56.70 $\pm$ 7.78	79.17 $\pm$ 4.95
	ebreg	0.34 $\pm$ 0.06	0.56 $\pm$ 0.14	5.41 $\pm$ 0.67	6.41 $\pm$ 1.21
FDR	sparsevb	0.00 $\pm$ 0.00	0.18 $\pm$ 0.34	0.00 $\pm$ 0.01	0.00 $\pm$ 0.00
	varbvs	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.01 $\pm$ 0.02	0.31 $\pm$ 0.26
	EMVS	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.00 $\pm$ 0.01	0.14 $\pm$ 0.08
	SSLASSO	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.18 $\pm$ 0.16	0.41 $\pm$ 0.19
	ebreg	0.00 $\pm$ 0.00	0.00 $\pm$ 0.00	0.43 $\pm$ 0.05	0.28 $\pm$ 0.08
TPR	sparsevb	0.96 $\pm$ 0.05	0.95 $\pm$ 0.10	1.00 $\pm$ 0.00	1.00 $\pm$ 0.00
	varbvs	0.95 $\pm$ 0.05	1.00 $\pm$ 0.00	1.00 $\pm$ 0.00	0.69 $\pm$ 0.32
	EMVS	0.01 $\pm$ 0.03	0.02 $\pm$ 0.04	1.00 $\pm$ 0.00	1.00 $\pm$ 0.00
	SSLASSO	0.48 $\pm$ 0.04	0.81 $\pm$ 0.08	0.34 $\pm$ 0.10	0.18 $\pm$ 0.05
	ebreg	0.90 $\pm$ 0.01	0.96 $\pm$ 0.05	1.00 $\pm$ 0.00	1.00 $\pm$ 0.00
runtime (sec)	sparsevb	0.22 $\pm$ 0.06	0.30 $\pm$ 0.06	0.91 $\pm$ 0.11	1.28 $\pm$ 0.17
	varbvs	0.51 $\pm$ 0.19	0.80 $\pm$ 0.52	12.31 $\pm$ 3.34	30.54 $\pm$ 6.38
	EMVS	0.14 $\pm$ 0.07	0.15 $\pm$ 0.06	0.90 $\pm$ 0.18	1.06 $\pm$ 0.26
	SSLASSO	0.33 $\pm$ 0.07	0.38 $\pm$ 0.19	0.16 $\pm$ 0.03	0.16 $\pm$ 0.02
	ebreg	13.90 $\pm$ 1.62	15.06 $\pm$ 2.96	64.14 $\pm$ 6.36	59.98 $\pm$ 4.41

Equations

$$Y=X\theta+Z,$$

$$\|X\|:=\max_{1\leq i\leq p}\|X_{\cdot i}\|_{2}=\max_{1\leq i\leq p}(X^{T}X)_{ii}^{1/2}.$$

$$s\sim\pi_{p}(s),\qquad S\mid |S|=s\sim\text{Unif}_{p,s},\qquad \theta_{i}\stackrel{ind}{\sim}\begin{cases}\text{Lap}(\lambda), & i\in S,\\ \delta_{0}, & i\notin S,\end{cases}$$

$$A_{1}p^{-A_{3}}\pi_{p}(s-1)\leq\pi_{p}(s)\leq A_{2}p^{-A_{4}}\pi_{p}(s-1),\qquad s=1,\dots,p.$$

$$\frac{\|X\|}{p}\leq\lambda\leq 2\bar{\lambda},\qquad \bar{\lambda}=2\|X\|\sqrt{\log p},$$

$$Y/\varsigma=(X/\varsigma)\theta+Z,$$

$$\mathcal{P}_{MF}=\Big\{P_{\mu,\sigma,\gamma}=\bigotimes_{i=1}^{p}\big[\gamma_{i}N(\mu_{i},\sigma_{i}^{2})+(1-\gamma_{i})\delta_{0}\big]:\ \mu_{i}\in\mathbb{R},\ \sigma_{i}\in\mathbb{R}^{+},\ \gamma_{i}\in[0,1]\Big\},$$

$$\tilde{\Pi}=\underset{P_{\mu,\sigma,\gamma}\in\mathcal{P}_{MF}}{\operatorname{argmin}}\ \text{KL}\big(P_{\mu,\sigma,\gamma}\,\|\,\Pi(\cdot\mid Y)\big),$$

$$\mathcal{Q}=\Big\{N_{S}(\mu_{S},\Sigma_{S})\otimes\delta_{S^{c}}:\ S\subseteq\{1,2,\dots,p\},\ \mu_{S}\in\mathbb{R}^{|S|},\ \Sigma_{S}\in\mathbb{R}^{|S|\times|S|}\text{ a positive definite covariance matrix}\Big\},$$

$$\mathcal{Q}_{MF}=\Big\{N_{S}(\mu_{S},D_{S})\otimes\delta_{S^{c}}:\ S\subseteq\{1,2,\dots,p\},\ \mu_{S}\in\mathbb{R}^{|S|},\ D_{S}\in\mathbb{R}^{|S|\times|S|}\text{ a positive definite diagonal matrix}\Big\}.$$

$$\hat{Q}=\underset{Q\in\mathcal{Q}}{\operatorname{argmin}}\ \text{KL}\big(Q\,\|\,\Pi(\cdot\mid Y)\big),\qquad \tilde{Q}=\underset{Q\in\mathcal{Q}_{MF}}{\operatorname{argmin}}\ \text{KL}\big(Q\,\|\,\Pi(\cdot\mid Y)\big).$$

$$\theta_{0}\in\{\theta:\ \#(j:\theta_{j}\neq 0)\leq s_{n}\},\qquad\text{for some } s_{n}=o(n).$$

$$\phi(S)=\inf\Big\{\frac{\|X\theta\|_{2}|S|^{1/2}}{\|X\|\|\theta_{S}\|_{1}}:\ \|\theta_{S^{c}}\|_{1}\leq 7\|\theta_{S}\|_{1},\ \theta_{S}\neq 0\Big\}.$$

$$\bar{\phi}(s)=\inf\Big\{\frac{\|X\theta\|_{2}|S_{\theta}|^{1/2}}{\|X\|\|\theta\|_{1}}:\ 0\neq|S_{\theta}|\leq s\Big\}.$$

$$\tilde{\phi}(s):=\inf\Big\{\frac{\|X\theta\|_{2}}{\|X\|\|\theta\|_{2}}:\ 0\neq|S_{\theta}|\leq s\Big\}.$$

$$\bar{\psi}_{M}(S)=\bar{\phi}\Big(\Big(2+\frac{4M}{A_{4}}\Big(1+\frac{16}{\phi(S)^{2}}\frac{\lambda}{\bar{\lambda}}\Big)\Big)|S|\Big),\qquad \tilde{\psi}_{M}(S)=\tilde{\phi}\Big(\Big(2+\frac{4M}{A_{4}}\Big(1+\frac{16}{\phi(S)^{2}}\frac{\lambda}{\bar{\lambda}}\Big)\Big)|S|\Big).$$

$$\Theta_{\rho_{n},s_{n}}:=\big\{\theta\in\mathbb{R}^{p}:\ \phi(S_{0})\geq c_{0},\ |S_{0}|\leq s_{n},\ \tilde{\psi}_{\rho_{n}}(S_{0})\geq c_{0}\big\},$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ \|X(\theta-\theta_{0})\|_{2}\geq\frac{M\rho_{n}^{1/2}}{\tilde{\psi}_{\rho_{n}}(S_{0})}\frac{\sqrt{|S_{0}|\log p}}{\phi(S_{0})}\Big)\to 0,$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ \|\theta-\theta_{0}\|_{1}>\frac{M\rho_{n}}{\tilde{\psi}_{\rho_{n}}(S_{0})^{2}}\frac{|S_{0}|\sqrt{\log p}}{\|X\|\phi(S_{0})^{2}}\Big)\to 0,$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ \|\theta-\theta_{0}\|_{2}>\frac{M\rho_{n}^{1/2}}{\|X\|\tilde{\psi}_{\rho_{n}}(S_{0})^{2}}\frac{\sqrt{|S_{0}|\log p}}{\phi(S_{0})}\Big)\to 0$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ |S_{\theta}|\geq|S_{0}|+M\rho_{n}\Big(1+\frac{16}{\phi(S_{0})^{2}}\frac{\lambda}{\bar{\lambda}}\Big)|S_{0}|\Big)\to 0,$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ \|X(\theta-\theta_{0})\|_{2}\geq\frac{M\rho_{n}^{1/2}}{\tilde{\psi}_{\rho_{n}}(S_{0})}\Big[\frac{\sqrt{s_{*}\log p}}{\phi(S_{*})}+\|X(\theta_{0}-\theta_{*})\|_{2}\Big]\Big)\to 0,$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ \|\theta-\theta_{0}\|_{1}>\|\theta_{0}-\theta_{*}\|_{1}+\frac{M\rho_{n}}{\tilde{\psi}_{\rho_{n}}(S_{0})^{2}}\Big[\frac{s_{*}\sqrt{\log p}}{\|X\|\phi(S_{*})^{2}}+\frac{\|X(\theta_{0}-\theta_{*})\|_{2}^{2}}{\|X\|\sqrt{\log p}}\Big]\Big)\to 0,$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ \|\theta-\theta_{0}\|_{2}>\frac{M\rho_{n}^{1/2}}{\|X\|\tilde{\psi}_{\rho_{n}}(S_{0})^{2}}\Big[\frac{\sqrt{s_{*}\log p}}{\phi(S_{*})}+\|X(\theta_{0}-\theta_{*})\|_{2}\Big]\Big)\to 0$$

$$\sup_{\theta_{0}}E_{\theta_{0}}\tilde{\Pi}\big(\theta:\ \|\theta-\theta_{0}\|_{2}\geq M\rho_{n}^{1/2}\|\theta_{0}\|_{2}\big)\to 0$$

$$\sup_{\theta_{0}\in\Theta_{\rho_{n},s_{n}}}E_{\theta_{0}}\tilde{\Pi}\Big(\theta:\ |S_{\theta}|\geq|S_{*}|+M\rho_{n}\Big[\Big(1+\frac{16}{\phi(S_{*})^{2}}\frac{\lambda}{\bar{\lambda}}\Big)|S_{*}|+\frac{\|X(\theta_{0}-\theta_{*})\|_{2}^{2}}{\log p}\Big]\Big)\to 0,$$

$$E_{\theta_{0}}\Pi(\theta\in\Theta_{n}\mid Y)1_{A}\leq Ce^{-\delta_{n}},$$

$$E_{\theta_{0}}Q(\theta\in\Theta_{n})1_{A}\leq\frac{2}{\delta_{n}}\Big[E_{\theta_{0}}\text{KL}\big(Q\|\Pi(\cdot|Y)\big)1_{A}+Ce^{-\delta_{n}/2}\Big].$$

$$\text{KL}(Q\|P)=\sup_{f}\Big[\int f\,dQ-\log\int e^{f}\,dP\Big],$$

$$\int f(\theta)\,dQ(\theta)\leq\text{KL}\big(Q\|\Pi(\cdot|Y)\big)+\log\int e^{f(\theta)}\,d\Pi(\theta|Y).$$
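
The variational (duality) formula for the KL divergence listed above can be checked numerically on a finite sample space. The sketch below is only an illustration under that simplifying assumption: the distributions `q` and `p` are arbitrary example values, and the function names are hypothetical. On a finite space the supremum is attained at $f=\log(dQ/dP)$, and any other $f$ gives a lower bound.

```python
import numpy as np

def kl(q, p):
    """KL(Q || P) for two strictly positive probability vectors."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

def dv_objective(f, q, p):
    """Value of  int f dQ - log int e^f dP  on a finite sample space."""
    f = np.asarray(f, dtype=float)
    return float(np.sum(f * q) - np.log(np.sum(np.exp(f) * p)))

q = np.array([0.5, 0.3, 0.2])   # example Q (arbitrary values)
p = np.array([0.2, 0.2, 0.6])   # example P (arbitrary values)

f_star = np.log(q / p)                      # maximiser of the objective
rng = np.random.default_rng(0)
f_random = rng.standard_normal(3)           # any other f gives a lower bound

print(kl(q, p), dv_objective(f_star, q, p), dv_objective(f_random, q, p))
```

Rearranging the inequality for a fixed $f$ and taking $P$ to be the posterior gives the last display in the list, which bounds a variational expectation by the KL divergence plus a posterior log-moment term.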

Full text

Variational Bayes for high-dimensional linear regression with sparse priors

Kolyan Ray (Department of Mathematics, Imperial College London) and Botond Szabó (Department of Mathematics, Vrije Universiteit Amsterdam)

Botond Szabó received funding from the Netherlands Organization for Scientific Research (NWO) under Project number: 639.031.654.

Imperial College London and Vrije Universiteit Amsterdam

Abstract

We study a mean-field spike and slab variational Bayes (VB) approximation to Bayesian model selection priors in sparse high-dimensional linear regression. Under compatibility conditions on the design matrix, oracle inequalities are derived for the mean-field VB approximation, implying that it converges to the sparse truth at the optimal rate and gives optimal prediction of the response vector. The empirical performance of our algorithm is studied, showing that it works comparably well as other state-of-the-art Bayesian variable selection methods. We also numerically demonstrate that the widely used coordinate-ascent variational inference (CAVI) algorithm can be highly sensitive to the parameter updating order, leading to potentially poor performance. To mitigate this, we propose a novel prioritized updating scheme that uses a data-driven updating order and performs better in simulations. The variational algorithm is implemented in the R package sparsevb.

AMS 2000 subject classifications: Primary 62G20; secondary 62G05, 65K10.

Keywords and phrases: Variational Bayes, spike-and-slab prior, model selection, sparsity, oracle inequalities.

1 Introduction

Inference under sparsity constraints has found many applications in statistics and machine learning [31, 40]. Perhaps the most widely applied such model is sparse linear regression, where we observe

$$Y=X\theta+Z,\qquad(1)$$

where $Y\in\mathbb{R}^{n}$, $X$ is a given, deterministic $n\times p$ design matrix, $\theta\in\mathbb{R}^{p}$ is the parameter of interest and $Z\sim N_{n}(0,I_{n})$ is additive Gaussian noise. We are interested in the sparse high-dimensional setting, where $n\leq p$ and typically $n\ll p$, and many of the coefficients $\theta_{i}$ are (close to) zero.
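
To make the setting concrete, the following sketch simulates one dataset from the sparse linear model $Y=X\theta+Z$. It is an illustration only: the sizes $n=100$, $p=200$, $s=20$ and signal value $10$ are borrowed from the Table 1 caption, the Gaussian design is one of the simulation designs rather than a requirement of the model (where $X$ is deterministic), and the function name is hypothetical.

```python
import numpy as np

def simulate_sparse_regression(n=100, p=200, s=20, signal=10.0, seed=0):
    """Draw (X, theta0, Y) from Y = X theta0 + Z with Z ~ N_n(0, I_n)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # i.i.d. N(0, 1) design entries
    theta0 = np.zeros(p)
    theta0[:s] = signal               # non-zero coefficients at the beginning
    Z = rng.standard_normal(n)        # unit-variance Gaussian noise
    Y = X @ theta0 + Z
    return X, theta0, Y

X, theta0, Y = simulate_sparse_regression()
print(X.shape, Y.shape, int(np.count_nonzero(theta0)))
```

Here $n<p$, so least squares is not identifiable without the sparsity assumption; only $s$ of the $p$ coefficients are non-zero.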

From a Bayesian perspective, perhaps the most natural way to impose sparsity is through a model selection prior, which assigns probabilistic weights to each potential model, i.e. each subset of $\{1,\dots,p\}$ corresponding to selecting the non-zero coordinates of $\theta\in\mathbb{R}^{p}$. This is one of the most widely used approaches within the Bayesian community [18, 19, 31, 44] and includes the popular spike-and-slab prior, which is often considered the gold standard in sparse Bayesian linear regression. Such priors have been shown to perform well for estimation and prediction [26, 15, 13, 16], uncertainty quantification [35, 14] and multiple hypothesis testing [12]; see [2] for a recent review.

However, while these priors perform excellently both empirically and theoretically, the discrete model selection component of the prior can make computation very challenging. For $\theta\in\mathbb{R}^{p}$, inference using the spike-and-slab prior generally involves a combinatorial search over all $2^{p}$ possible models, a prohibitively expensive task for even moderate $p$. Fast algorithms for exact posterior computation are thus usually restricted to the diagonal design case [15, 42], while Markov chain Monte Carlo methods are known to have problems mixing for typical problem sizes of interest [22].

A popular scalable alternative is variational Bayes (VB), which recasts posterior approximation as an optimization problem. One minimizes the VB objective function, consisting of the Kullback-Leibler (KL) divergence between a family of tractable distributions, called the variational family, and the posterior. Though the resulting approximation does not provide exact Bayesian inference, picking a computationally convenient variational class can dramatically increase scalability; see for example [6, 23]. An especially popular variational family consists of distributions under which the model parameters are independent, so-called mean-field variational Bayes. For a recent review of VB, see [5].

In this work, we consider a mean-field family consisting of distributions that assign each coordinate of $\theta$ an independent mixture of a Gaussian and a Dirac mass at zero, thereby mirroring the form of the spike-and-slab prior (but crucially not the form of the posterior). Such a computational relaxation is significant, reducing the posterior dimension to a much more tractable $O(p)$. This is a natural approximation since it keeps the discrete model selection aspect and many of the interpretable features of the original posterior, for example access to posterior probabilities of submodels and inclusion probabilities of particular covariates. This sparse variational family has been applied in practice [27, 41, 11, 25, 33], but comes with few theoretical guarantees.
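
A member of this mean-field family is described by three numbers per coordinate, which is where the $O(p)$ count comes from. The sketch below is a minimal illustration, not the sparsevb implementation: it assumes the parameterisation by a Gaussian mean `mu`, standard deviation `sigma` and inclusion probability `gamma` for each coordinate, and the function names are hypothetical.

```python
import numpy as np

def mean_field_mean(mu, gamma):
    """Mean of prod_i [gamma_i N(mu_i, sigma_i^2) + (1 - gamma_i) delta_0]."""
    return np.asarray(gamma, dtype=float) * np.asarray(mu, dtype=float)

def sample_mean_field(mu, sigma, gamma, size, seed=0):
    """Draw `size` independent vectors from the spike-and-slab product law."""
    rng = np.random.default_rng(seed)
    mu, sigma, gamma = (np.asarray(a, dtype=float) for a in (mu, sigma, gamma))
    shape = (size, mu.size)
    included = rng.random(shape) < gamma       # Bernoulli(gamma_i) inclusion
    slab = rng.normal(mu, sigma, size=shape)   # Gaussian component
    return np.where(included, slab, 0.0)       # Dirac mass at zero otherwise

mu = np.array([2.0, 5.0, -1.0])
sigma = np.array([0.1, 0.1, 0.1])
gamma = np.array([1.0, 0.0, 0.5])              # inclusion probabilities
draws = sample_mean_field(mu, sigma, gamma, size=1000)
print(mean_field_mean(mu, gamma), draws.mean(axis=0))
```

The vector `gamma` gives the inclusion probabilities of individual covariates directly, and the probability of any submodel is a product of `gamma[i]` and `1 - gamma[i]` terms.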

We study this VB procedure under the frequentist assumption that the data $Y$ has been generated according to a given sparse parameter $\theta_{0}$. Under standard conditions on the design matrix, we obtain refined oracle type contraction rates for the mean-field VB approximation of model selection priors. As a consequence, these imply that the VB posterior performs optimally both for estimation of a sparse $\theta$ and for prediction of the response vector. This provides a theoretical justification for this attractive approximation algorithm in a sparsity context.

While similar VB approaches have been applied in the methodological literature [27, 41, 11, 25, 33], our contribution also possesses a crucial methodological difference. These existing works typically use Gaussian slabs for the prior, which allows analytic evaluation of certain formulas in the variational algorithm leading to fast optimization. However, Gaussian slabs are inappropriate for recovering the true signal $\theta_{0}$ since the true underlying posterior performs excessive shrinkage causing poor performance [15]. One cannot typically expect a VB approximation based on a poorly performing underlying posterior to perform well for recovery. We instead consider Laplace slabs for the prior, which result in optimal recovery when using the true posterior [15, 13]. We are thus using a similar variational family to estimate a different posterior distribution compared to previous works. Another way to correct the original posterior is to explicitly shift the posterior mean using an empirical Bayes approach [29, 30, 3, 4].

We provide the methodological details for applying the widely-used coordinate-ascent variational inference (CAVI) algorithm [5] with Laplace slabs and investigate our method numerically on both simulated and real world ozone interaction data. As predicted by the theory, our method performs well in a number of settings and typically outperforms VB approaches with prior Gaussian slabs. In fact, we find that our approach generally performs at least as well as other state-of-the-art Bayesian variable selection methods. We have implemented our algorithm in the R-package sparsevb [17].

Our simulations also show that the CAVI algorithm is highly sensitive to the updating order of the parameters. Since the VB objective function is non-convex and typically has multiple local minima, a poorly chosen updating order can trap the algorithm near a highly suboptimal local minimum causing poor performance. To resolve this, we propose a novel prioritized update scheme where we base the CAVI parameter update order on the estimated size of the coefficients via a preliminary estimator. Our simulations indicate that such a data-driven updating order performs better than using either a naive or random update order and provides more robustness against being trapped at a suboptimal local minimum. This idea is applicable beyond the present setting and may be useful for other CAVI approaches.
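
The idea of a data-driven update order can be sketched as follows. This is an assumption-laden illustration rather than the paper's algorithm: this section does not specify the preliminary estimator, so a ridge-type estimate is used here as a hypothetical stand-in, and both function names are invented for the example.

```python
import numpy as np

def ridge_preliminary(X, Y, penalty=1.0):
    """Hypothetical preliminary estimator: (X^T X + penalty * I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(p), X.T @ Y)

def prioritized_order(preliminary):
    """Coordinate indices sorted by decreasing |preliminary estimate|."""
    return np.argsort(-np.abs(np.asarray(preliminary, dtype=float)), kind="stable")

# Small synthetic example (arbitrary sizes, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
theta0 = np.array([0.0, 0.0, 5.0, 0.0, 0.0, -5.0, 0.0, 0.0])
Y = X @ theta0 + rng.standard_normal(50)

order = prioritized_order(ridge_preliminary(X, Y))
print(order)   # a CAVI sweep would visit coordinates in this order
```

A lexicographic scheme would instead use `np.arange(p)` and a random scheme a random permutation; only the visiting order differs, not the coordinate updates themselves.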

Related work. Whilst VB has found increasing usage in practice, its theoretical understanding is still in the early stages. In low dimensional settings, some Bernstein-von Mises type results have been derived [28, 43], while in high-dimensional and nonparametric settings, first results have only recently appeared [47, 48, 34]. There has also been theoretical work on studying variational approximations to fractional posteriors, which down-weight the likelihood [1, 46, 45]. The papers [48, 34, 46] provide general proof methods which employ the classical prior mass and testing approach of Bayesian nonparametrics [21]. However, since it is known that posterior convergence rates, let alone oracle rates as we derive here, for model selection priors cannot easily be established using this approach [15, 13], their results do not apply to our setting. We have extended some of the present results to high-dimensional logistic regression in follow up work [36].

Organization. In Section 2 we give details of the prior, variational approximation and conditions on the design matrix. We present our main results in Section 3, details of the VB algorithm in Section 4, numerical results in Section 5 and conclusions in Section 6. In the supplementary material, we give additional numerical results in Section A, full oracle results and proofs in Section B, additional methodological details in Section C and further discussion of the design matrix assumptions in Section D.

Notation. Let $P_{\theta}$ be the probability distribution of the observation $Y$ arising in model (1) and let $E_{\theta}$ denote the corresponding expectation. For two probability distributions $P,Q$ , $\text{KL}(P\|Q)=\int\log\tfrac{dP}{dQ}dP$ denotes the Kullback-Leibler divergence. For $x\in\mathbb{R}^{d}$ , we write $\|x\|_{2}=(\sum_{i=1}^{d}|x_{i}|^{2})^{1/2}$ for the Euclidean norm. For a vector $\theta\in\mathbb{R}^{p}$ and a subset $S\subseteq\{1,\dots,p\}$ of indices, set $\theta_{S}$ to be the vector $(\theta_{i})_{i\in S}$ in $\mathbb{R}^{|S|}$ , where $|S|$ denotes the cardinality of $S$ . Further let $S_{\theta}=\{i:\theta_{i}\neq 0\}$ be the set of non-zero coefficients of $\theta$ . We will often write $S_{0}=S_{\theta_{0}}$ and $s_{0}=|S_{\theta_{0}}|$ , where $\theta_{0}$ is the true vector. For $X_{\cdot i}$ the $i^{th}$ column of $X$ , set

$$\|X\|=\max_{1\leq i\leq p}\|X_{\cdot i}\|_{2}.\qquad(2)$$

2 Prior, variational families and design matrix

2.1 Model selection priors

We first present the desirable, but computationally challenging, model selection priors that underlie our VB approximation. Consider a prior for $\theta\in\mathbb{R}^{p}$ that first selects a dimension $s$ from a prior $\pi_{p}$ on $\{0,\dots,p\}$ , then uniformly selects a random subset $S\subset\{1,\dots,p\}$ of cardinality $|S|=s$ and lastly a set of non-zero values $\theta_{S}=\{\theta_{i}:i\in S\}$ from a prior density $g_{S}$ on $\mathbb{R}^{|S|}$ . Since it is known that the ‘slab’ distribution should have exponential tails or heavier to achieve good recovery [15], we restrict to the case where $g_{S}=\prod_{i\in S}\text{Lap}(\lambda)$ is a product of centered Laplace densities with parameter $\lambda>0$ on $\mathbb{R}^{s}$ . This yields the hierarchical prior:

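$$s\sim\pi_{p}(s),\qquad S\,|\,|S|=s\sim\text{Unif}_{p,s},\qquad\theta_{i}\stackrel{ind}{\sim}\begin{cases}\text{Lap}(\lambda),&i\in S,\\ \delta_{0},&i\notin S,\end{cases}\qquad(3)$$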

where $\text{Unif}_{p,s}$ selects $S$ from the $p\choose s$ possible subsets of $\{1,\dots,p\}$ of size $s$ with equal probability and $\delta_{0}$ denotes the Dirac mass at zero. Since we wish the prior to perform model selection via the prior $\pi_{p}$ on the dimension $s$ rather than via shrinkage of the Laplace distribution, the choice of prior $\pi_{p}$ is crucial. The aim is to select a distribution which sufficiently downweights large models while simultaneously placing enough mass to the true model. Following [13], we select an exponentially decreasing prior: we assume that there are constants $A_{1},A_{2},A_{3},A_{4}>0$ with

[TABLE]

Assumption (4) is satisfied by a variety of priors, including those of the form $\pi_{p}(s)\propto a^{-s}p^{-bs}$ for constants $a,b>0$ (‘complexity priors’ [15]) and binomial priors. The spike-and-slab prior, where we model $\theta_{i}\stackrel{{\scriptstyle iid}}{{\sim}}r\text{Lap}(\lambda)+(1-r)\delta_{0}$ , falls within this framework by taking $\pi_{p}$ to be $\text{Bin}(p,r)$ . The value $r$ is the prior inclusion probability of the coordinate $i$ and controls the model selection. Taking a hyperprior $r\sim\text{Beta}(1,p^{u})$ for $u>1$ also satisfies (4) ([15], Example 2.2), allows mixing over the sparsity level $r$ and gives a prior that does not depend on unknown hyper-parameters.
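To make the two prior constructions concrete, the following Python sketch draws $\theta$ from a complexity prior of the form $\pi_{p}(s)\propto a^{-s}p^{-bs}$ and from the i.i.d. spike-and-slab prior. It is an illustration only: the function names are hypothetical, the default values $a=b=1$ are arbitrary, and we assume that $\text{Lap}(\lambda)$ denotes the density $(\lambda/2)e^{-\lambda|x|}$.

```python
import math
import random


def sample_laplace(lam, rng):
    """Draw from the centered Laplace density (lam / 2) * exp(-lam * |x|) (assumed form)."""
    # The difference of two independent Exp(lam) variables is Laplace with rate lam.
    return rng.expovariate(lam) - rng.expovariate(lam)


def complexity_prior_weights(p, a=1.0, b=1.0):
    """Normalized weights pi_p(s) proportional to a^(-s) * p^(-b*s) for s = 0, ..., p."""
    log_w = [-s * (math.log(a) + b * math.log(p)) for s in range(p + 1)]
    top = max(log_w)  # subtract the maximum before exponentiating for stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def sample_model_selection_prior(p, lam, rng, a=1.0, b=1.0):
    """Draw s from pi_p, a uniformly random support of size s, then Laplace slab values."""
    weights = complexity_prior_weights(p, a, b)
    s = rng.choices(range(p + 1), weights=weights, k=1)[0]
    support = set(rng.sample(range(p), s))
    return [sample_laplace(lam, rng) if i in support else 0.0 for i in range(p)]


def sample_spike_and_slab(p, lam, r, rng):
    """Draw theta_i iid from r * Lap(lam) + (1 - r) * delta_0."""
    return [sample_laplace(lam, rng) if rng.random() < r else 0.0 for _ in range(p)]
```

With $a=b=1$ the weights decay by a factor $1/p$ per additional coordinate, so most draws from the complexity prior are very sparse.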

The regularization parameter $\lambda$ in the slab distribution in (3) is allowed to vary with $p$ within the range

[TABLE]

where the norm $\|X\|$ is the maximal column norm defined in (2). The quantity $\bar{\lambda}$ is the usual value of the regularization parameter of the LASSO ([9], Chapter 6). Large values of $\lambda$ may shrink many coordinates $\theta_{i}$ in the slab towards zero, which is undesirable in our Bayesian setup since we wish to induce sparsity via $\pi_{p}$ instead. Indeed, since the slab component identifies the non-zero coordinates, it is unnatural to further shrink these values. It is natural to take fixed values of $\lambda$ or $\lambda\to 0$ , both of which are typically allowed by (5) depending on the specific design matrix and regression setting. Specific values of $\|X\|$ for some examples of design matrices are given in Section D in the supplement.

The theoretical frequentist behaviour of the full posterior arising from prior (3) has been studied in [15, 13], who obtain oracle contraction rates amongst other things. We build on their work to show that these results extend to the scalable variational approximation.

We briefly comment on the more realistic situation that the model has unknown variance $\varsigma^{2}$ , in which case we instead observe $Y=X\theta+\varsigma Z$ . Since then

[TABLE]

one may first rescale the data using an estimate $\hat{\varsigma}$ of $\varsigma$ and as before endow $\theta$ with the prior (3), thereby obtaining an empirical Bayes approach. We investigate this empirical Bayes approach numerically in Section 5.2, showing that our method continues to perform well in the more realistic scenario of unknown noise level. One can alternatively use a hierarchical Bayesian approach by endowing $\varsigma$ with a hyper-prior, common choices including the inverse Gamma distribution, $c/\varsigma^{2}$ or the improper prior $1/\varsigma$ .

2.2 Variational approximations

The posterior $\Pi(\cdot|Y)$ arising from the prior (3) and data (1) assigns weights to all the $2^{p}$ possible models, except for very special instances of the design matrix $X$ and prior. Since the posterior is difficult to compute for even moderate $p$ , we take a VB approximation using the mean-field variational family

[TABLE]

with corresponding VB posterior

[TABLE]

the minimizer of the Kullback-Leibler (KL) divergence with respect to the posterior. Under $P_{\mu,\sigma,\gamma}$ , we have $\theta_{i}\sim\gamma_{i}N(\mu_{i},\sigma_{i}^{2})+(1-\gamma_{i})\delta_{0}$ independent. We thus approximate the posterior with a spike-and-slab distribution with Gaussian slabs under which every coordinate is independent. Note that while the prior may take the form (7), the posterior will in general not. The key reduction here is that we replace the $2^{p}$ model weights with the $p$ VB inclusion probabilities $(\gamma_{i})$ , thereby dramatically shrinking the posterior dimension. The VB approximation (8) forces (substantial) additional independence into the resulting distribution, breaking dependencies between the variables. For instance, pairwise information that two coefficients $\theta_{i}$ and $\theta_{j}$ are likely to be selected simultaneously or not at all is lost.
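As an illustration of what a member $P_{\mu,\sigma,\gamma}$ of the mean-field family encodes, the Python sketch below computes its coordinate-wise mean and variance and draws one sample. The function names are hypothetical; the moment formulas follow directly from $\theta_{i}\sim\gamma_{i}N(\mu_{i},\sigma_{i}^{2})+(1-\gamma_{i})\delta_{0}$.

```python
import random


def vb_posterior_mean(mu, gamma):
    """E[theta_i] = gamma_i * mu_i under gamma_i N(mu_i, sigma_i^2) + (1 - gamma_i) delta_0."""
    return [g * m for m, g in zip(mu, gamma)]


def vb_posterior_variance(mu, sigma, gamma):
    """Var(theta_i) = gamma_i (sigma_i^2 + mu_i^2) - (gamma_i mu_i)^2."""
    return [g * (s * s + m * m) - (g * m) ** 2 for m, s, g in zip(mu, sigma, gamma)]


def sample_mean_field(mu, sigma, gamma, rng):
    """One draw of theta; coordinates are independent by construction."""
    return [rng.gauss(m, s) if rng.random() < g else 0.0
            for m, s, g in zip(mu, sigma, gamma)]
```

The whole distribution is summarized by $3p$ numbers, which is the dimension reduction relative to the $2^{p}$ model weights of the full posterior.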

While we use Gaussian slabs in our variational family, it is crucial the true prior has slab distributions with at least exponential tails (e.g. Laplace) [15]. The reason a Gaussian approximation works well here is that the likelihood induces Gaussian tails in the posterior. We emphasize that we use the same variational family to estimate a different posterior compared to previous works [27, 41, 11, 25, 33], which use Gaussian prior slabs. While using Gaussian prior slabs is particularly efficient computationally, it can yield poor performance due to excessive shrinkage of the estimated coefficients, as we demonstrate numerically in Section A.2 in the supplement. Computing the VB estimate (8) is an optimization problem that can be tackled using coordinate-ascent variational inference (CAVI), see Section 4 for details.

While the family $\mathcal{P}_{MF}$ is our main object of interest, our proofs yield similar theoretical results for two other closely related variational families. Consider the family of distributions consisting of products of a single multivariate normal distribution with a Dirac measure:

[TABLE]

where $\delta_{S^{c}}$ denotes the Dirac measure on the coordinates $S^{c}$ . This family is more rigid on the model selection level than $\mathcal{P}_{MF}$ , selecting a distribution with a single fixed support set $S$ . On this set, however, the family permits a richer representation for the non-zero coefficients, allowing non-zero correlations. Next consider the mean field subclass of $\mathcal{Q}$ :

[TABLE]

This family again allows distributions with only a single fixed support set $S$ , but further forces independence of the non-zero coefficients. This class is contained in $\mathcal{P}_{MF}$ by considering distributions $P_{\mu,\sigma,\gamma}$ with inclusion probabilities restricted to $\gamma_{i}\in\{0,1\}$ . We define the corresponding VB posteriors by

[TABLE]

While all our theoretical results also apply to the VB posteriors $\hat{Q}$ and $\widetilde{Q}$ , these seem to perform worse in practice than $\widetilde{\Pi}$ , see Section A.2 in the supplement. This is potentially due to the discrete constraint $\gamma_{i}\in\{0,1\}$ for these two families, which renders the highly non-convex optimization problems (11) difficult to solve.

2.3 Design matrix

The parameter $\theta$ in model (1) is not estimable without further conditions on the regression matrix $X$ . For the high-dimensional case $p>n$ , which is of most interest to us, $\theta$ is not even identifiable without additional assumptions. We thus assume that there is some “true” sparse $\theta_{0}$ generating the observation (1) with at most $s_{n}$ non-zero coefficients:

[TABLE]

In the sparse setting, it suffices for estimation to have ‘local invertibility’ of the Gram matrix $X^{T}X$ . The notion of invertibility can be made more precise using the following definitions, which are based on the sparse high-dimensional literature (e.g. [9]), and have been adapted to the Bayesian setting in [13]. We provide only a brief description, referring the interested reader to Section 2.2 of [13] for further discussion.

Definition 1 (Compatibility).

A model $S\subseteq\{1,\dots,p\}$ has compatibility number

[TABLE]

A model is considered ‘compatible’ if $\phi(S)>0$ , in which case $\|X\theta\|_{2}|S|^{1/2}\geq\phi(S)\|X\|\|\theta_{S}\|_{1}$ for all $\theta$ in the above set. The number 7 is not important and is taken in Definition 2.1 of [13] to provide a specific numerical value; since we use several results from [13], we employ the same convention. The compatibility number does not directly require sparsity, but reduces the problem to approximate sparsity by considering only vectors $\theta$ whose coordinates are small outside $S$ . Conversely, the following two definitions deal only with sparse vectors.

Definition 2 (Uniform compatibility for sparse vectors).

The compatibility number for vectors of dimension $s$ is

[TABLE]

Definition 3 (Smallest scaled sparse singular value).

The smallest scaled sparse singular value of dimension $s$ is

[TABLE]

We shall require that these numbers are bounded away from zero for $s$ a multiple of the true model size. If $\|X\|=1$ , then $\widetilde{\phi}(s)$ is simply the smallest scaled singular value of a submatrix of $X$ of dimension $s$ . Note that Definitions 1-3 are Definitions 2.1-2.3 of [13]. Such compatibility conditions are standard for sparse recovery problems, see Sections 6.13 and 7.15 of [9] for further discussion.

These compatibility type constants are bounded away from zero for many standard design matrices, such as diagonal matrices, orthogonal designs, i.i.d. (including Gaussian) random matrices and matrices satisfying the ‘strong irrepresentability condition’ of [49]. Details of these examples are provided in Section D in the supplement.

3 Main results

We now provide the main theoretical results of this paper concerning the frequentist behaviour of the VB posterior $\widetilde{\Pi}$ in the asymptotic regime $n,p\rightarrow\infty$ . While the results are obtained assuming Gaussian noise in model (1), they are in fact robust to misspecification of the error distribution, see Remark B.1 in Section B. This robustness to misspecification is reflected in practice, see Section A.4 in the supplement for numerical results.

Our first result establishes contraction rates for the VB posterior to a sparse truth in $\ell_{1}$ -loss, $\ell_{2}$ -loss and prediction error $\|X(\theta-\theta_{0})\|_{2}$ . Apart from the sparsity level, the rate also depends on compatibility. For $M>0$ , set

[TABLE]

In the natural case $\lambda\ll\bar{\lambda}$ , these constants are asymptotically bounded from below by $\overline{\phi}((2+\tfrac{4M}{A_{4}})|S|)$ and $\widetilde{\phi}((2+\tfrac{4M}{A_{4}})|S|)$ if $\phi(S)$ is bounded away from zero. Our results are uniform over vectors in sets of the form

[TABLE]

for $S_{0}=S_{\theta_{0}}$ , $s_{n}\geq 1$ , $c_{0}>0$ and $\rho_{n}\to\infty$ (arbitrarily slowly).

Theorem 1 (Recovery).

Suppose the model selection prior (3) satisfies (4), (5) and $\lambda=O(\|X\|\sqrt{\log p}/s_{n})$ . Then the variational Bayes posterior $\widetilde{\Pi}$ satisfies, with $S_{0}=S_{\theta_{0}}$ ,

[TABLE]

for any $\rho_{n}\to\infty$ (arbitrarily slowly), $\Theta_{\rho_{n},s_{n}}$ defined in (13) and where $M>0$ depends only on the prior. Moreover, the same holds true for the variational Bayes posteriors $\hat{Q}$ and $\widetilde{Q}$ .

Theorem 1 follows directly from the oracle type Theorem 3 below upon setting $\theta_{*}=\theta_{0}$ . Recall that we are working under the frequentist model where there is a “true” $\theta_{0}$ generating data $Y$ of the form (1). Since the above rates equal the minimax estimation rates over $|S_{0}|$ -sparse vectors, Theorem 1 states that the VB posterior puts most of its mass in a neighbourhood of optimal size around the truth with high $P_{\theta_{0}}$ -probability in terms of $\ell_{1}$ , $\ell_{2}$ and prediction loss. Thus for estimating $\theta_{0}$ , the VB approximation behaves optimally from a theoretical frequentist perspective. This backs up the empirical evidence that VB can provide excellent scalable estimation.

The VB posterior mean often provides a good point estimator and the VB posterior is known to typically underestimate the marginal posterior variance (see e.g. [5] - this is a result of using the KL divergence as optimization criterion). The combination of good centering point and the posterior shrinking at least as fast as the true posterior explains why the VB posterior still provides optimal recovery, despite the loss of information from using a mean-field approximation.

Since the prior and variational family do not depend on the unknown sparsity level $|S_{0}|$ and the VB estimate contracts around the truth at the minimax rate, the procedure is adaptive. That is, the procedure can recover an $|S_{0}|$ -sparse truth nearly as well as if we knew the exact level of sparsity of the unknown $\theta_{0}$ . However, the choice of tuning parameters still has an effect on the finite-sample performance, see Section A.3 for a numerical investigation of the effect of the hyper-parameter $\lambda$ . Note that Theorem 1 does not imply that the VB posterior $\widetilde{\Pi}$ converges to the true posterior $\Pi(\cdot|Y)$ . Indeed, this is neither a typical situation nor a necessary property since the VB estimate should be substantially simpler than the true posterior to be useful.

Theorem 1 implies the variational families $\mathcal{Q}$ and $\mathcal{Q}_{MF}$ also provide optimal asymptotic estimation of $\theta_{0}$ in $\ell_{1}$ , $\ell_{2}$ and prediction loss. However, the corresponding optimization routine seems to yield worse performance in practice, see Section A.2.

An important motivation for using model selection priors is their ability to perform variable selection. The following result shows that the variational approximation puts most of its mass on models of size at most a multiple of the true dimension, thereby bounding the number of false positives.

Theorem 2 (Dimension).

Suppose the model selection prior (3) satisfies (4), (5) and $\lambda=O(\|X\|\sqrt{\log p}/s_{n})$ . Then the variational Bayes posterior $\widetilde{\Pi}$ satisfies, with $S_{0}=S_{\theta_{0}}$ ,

[TABLE]

for any $\rho_{n}\to\infty$ (arbitrarily slowly), $\Theta_{\rho_{n},s_{n}}$ defined in (13) and where $M>0$ depends only on the prior. Moreover, the same holds true for the variational Bayes posteriors $\hat{Q}$ and $\widetilde{Q}$ .

Theorem 2 follows directly from the oracle type Theorem 4 below upon setting $\theta_{*}=\theta_{0}$ . In the interesting case $\lambda\ll\bar{\lambda}$ , the factor in Theorem 2 can be simplified to $(1+M\rho_{n})$ if the true parameter is compatible. Note also that under the conditions of Theorems 1 and 2, it is not possible to consistently estimate the true support $S_{\theta_{0}}$ of $\theta_{0}$ since one cannot separate small and exactly zero signals.

Since the variational families $\mathcal{Q}$ and $\mathcal{Q}_{MF}$ contain only distributions with a single support set $S$ , the last statement says the resulting VB posteriors will select such a set of size at most a multiple times $|S_{0}|$ with high $P_{\theta_{0}}$ -probability. The VB estimates based on these two variational families perform model selection in a hard-thresholding manner, reporting only whether a variable is selected or not. On the other hand, the more flexible family $\mathcal{P}_{MF}$ quantifies the individual variable selection via the reported non-trivial inclusion probabilities $0\leq\gamma_{i}\leq 1$ , and in this regard provides a richer approximation of the target posterior. Information on pairwise variable inclusion is obviously lost given the mean-field nature of the approximation. Nevertheless, it is interesting to note that all these families still permit good estimation of $\theta_{0}$ .

We now provide more refined oracle-type versions of Theorems 1 and 2 as are known to hold for the true posterior [13].

Theorem 3 (Oracle recovery).

Suppose the model selection prior (3) satisfies (4), (5) and $\lambda=O(\|X\|\sqrt{\log p}/s_{n})$ . For $\theta_{0}\in\mathbb{R}^{p}\backslash\{0\}$ , let $\theta_{*}\in\mathbb{R}^{p}$ be any vector satisfying $1\leq s_{*}=|S_{\theta_{*}}|\leq|S_{\theta_{0}}|=s_{0}$ and $\|X(\theta_{0}-\theta_{*})\|_{2}^{2}\leq(s_{0}-s_{*})\log p.$ Then the variational Bayes posterior $\widetilde{\Pi}$ satisfies, for any $\theta_{*}$ as above,

[TABLE]

for any $\rho_{n}\to\infty$ (arbitrarily slowly), $\Theta_{\rho_{n},s_{n}}$ defined in (13) and where $M>0$ depends only on the prior. Moreover, the same holds true for the variational Bayes posteriors $\hat{Q}$ and $\widetilde{Q}$ .

This can yield better rates than Theorem 1 for certain parameters and choices of $\theta_{*}$ . For example, if $X=I$ is the identity matrix so that $\overline{\psi}_{\rho_{n}}(S)=\phi(S)=1$ for all $S$ , setting $\theta_{*}=0$ yields

[TABLE]

for any $\rho_{n}\to\infty$ . If $\|\theta_{0}\|_{2}^{2}\ll|S_{0}|\log p$ , this improves upon the rate $\sqrt{|S_{0}|\log p}$ in Theorem 1 by accounting for the size of the coefficients of $\theta_{0}$ and not only its sparsity level.

The advantage of the oracle bound is it can take into account small non-zero coefficients of $\theta_{0}$ and capture its ‘effective sparsity’. If $S_{*}\subset S_{0}$ , as one typically takes, the condition $\|X(\theta_{0}-\theta_{*})\|_{2}^{2}=\|X\theta_{0,S_{*}^{c}}\|_{2}^{2}\leq(s_{0}-s_{*})\log p$ implies that the coordinates of $\theta_{0}$ in $S_{0}\backslash S_{*}$ contribute on average at most $\log p$ to the squared prediction error. Thus if the coefficient contributes less than $\log p$ to the squared prediction loss, it is preferable to assign it as bias rather than pay the full $\log p$ term required by the squared minimax rate $s_{0}\log p$ , which accounts only for sparsity irrespective of signal size.

Theorem 4 (Oracle dimension).

Suppose the model selection prior (3) satisfies (4), (5) and $\lambda=O(\|X\|\sqrt{\log p}/s_{n})$ . For $\theta_{0}\in\mathbb{R}^{p}\backslash\{0\}$ , let $\theta_{*}\in\mathbb{R}^{p}$ be any vector satisfying $1\leq s_{*}=|S_{\theta_{*}}|\leq|S_{\theta_{0}}|=s_{0}$ and $\|X(\theta_{0}-\theta_{*})\|_{2}^{2}\leq(s_{0}-s_{*})\log p.$ Then the variational Bayes posterior $\widetilde{\Pi}$ satisfies, for any $\theta_{*}$ as above,

[TABLE]

for any $\rho_{n}\to\infty$ (arbitrarily slowly), $\Theta_{\rho_{n},s_{n}}$ defined in (13) and where $M>0$ depends only on the prior. Moreover, the same holds true for the variational Bayes posteriors $\hat{Q}$ and $\widetilde{Q}$ .

Theorems 3 and 4 are special cases of the finite-sample Theorems B.1 and B.2 in the supplement. Our proofs are based on the following crucial result, which allows one to exploit exponential probability bounds for the posterior to control the corresponding probability under the variational approximation.

Theorem 5.

Let $\Theta_{n}$ be a subset of the parameter space, $A$ be an event and $Q$ be a distribution for $\theta$ . If there exist $C>0$ and $\delta_{n}>0$ such that

[TABLE]

then

[TABLE]

Proof.

Recall the duality formula for the Kullback-Leibler divergence ([7], Corollary 4.15)

[TABLE]

where the supremum is taken over all measurable $f$ such that $\int e^{f}dP<\infty$ . In particular,

[TABLE]

Applying this inequality with $f(\theta)=\tfrac{1}{2}\delta_{n}1_{\Theta_{n}}(\theta)$ and using that $\log(1+x)\leq x$ for $x\geq 0$ ,

[TABLE]

Taking $E_{\theta_{0}}$ -expectations on both sides and using (14) gives the result. ∎
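The steps of this argument can be written out as follows. This is a sketch of the standard calculation, assuming the duality formula is applied with $P=\Pi(\cdot|Y)$ and $f=\tfrac{1}{2}\delta_{n}1_{\Theta_{n}}$ ; the constants in the displays above may be arranged differently.

$$
\begin{aligned}
\int f\,dQ&\leq\text{KL}(Q\|P)+\log\int e^{f}\,dP,\\
\int e^{f}\,d\Pi(\cdot|Y)&=1+\big(e^{\delta_{n}/2}-1\big)\Pi(\Theta_{n}|Y),\\
\tfrac{1}{2}\delta_{n}\,Q(\Theta_{n})&\leq\text{KL}(Q\|\Pi(\cdot|Y))+e^{\delta_{n}/2}\,\Pi(\Theta_{n}|Y).
\end{aligned}
$$

The last line uses $\log(1+x)\leq x$ and $e^{\delta_{n}/2}-1\leq e^{\delta_{n}/2}$ . Consequently, a bound of order $e^{-\delta_{n}}$ on the posterior mass of $\Theta_{n}$ still leaves a factor of order $e^{-\delta_{n}/2}$ after multiplication by $e^{\delta_{n}/2}$ , so the mass that $Q$ assigns to $\Theta_{n}$ is controlled by the KL divergence divided by $\delta_{n}$ plus an exponentially small term.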

When deriving oracle rates for the original posterior, the exponent $e^{-\delta_{n}}$ in (14) depends on the oracle quantity, see Section B.3. To apply Theorem 5, we must thus develop novel oracle type bounds on the KL divergence $\text{KL}(\widetilde{\Pi}\|\Pi(\cdot|Y))$ , which is the main technical difficulty in establishing our results, see Section B.2. The proof uses an iterative structure, using successive posterior localizations to eventually bound the KL divergence (see e.g. [32] for a similar idea).

4 Variational Bayes algorithm

4.1 Coordinate update equations

We now provide a coordinate-ascent variational inference (CAVI) algorithm (see for instance [5]) to compute the mean-field VB posterior $\widetilde{\Pi}$ based on the spike-and-slab prior with Laplace slabs. Since in the literature [27, 11, 25] the VB approximation is typically considered for Gaussian prior slabs, and can therefore take advantage of explicit analytic formulas, our algorithm requires modification.

Introducing binary latent variables $(z_{i})_{i=1}^{p}$ , the spike and slab prior can be rewritten as

[TABLE]

The prior inclusion probability equals $\Pi(z_{i}=1)=\int wd\pi(w)=a_{0}/(a_{0}+b_{0})$ , the expectation of a beta random variable. In CAVI, we sequentially update the parameters $\gamma_{i},\sigma_{i},\mu_{i}$ , $i=1,...,p$ , of the VB posterior by minimizing the KL divergence between the variational class with the rest of the parameters kept fixed and the true posterior. We iterate this algorithm until convergence, measured by the change in entropy.

We now give the component-wise variational updates in the algorithm. Fixing the latent variable $z_{i}=1$ and all variational factors except $\mu_{i}$ or $\sigma_{i}$ (i.e. using vector notation, $\bm{\mu}_{-i},\bm{\sigma},\bm{\gamma}$ or $\bm{\mu},\bm{\sigma}_{-i},\bm{\gamma}$ are all fixed), the minimizer of the conditional KL divergence between $\mathcal{P}_{MF}$ and the posterior is the same as the minimizer of

[TABLE]

respectively (see Section C.1 of the supplement for the proof of the above assertion), where $\Phi$ denotes the cdf of the standard normal distribution. The minimizers of these functions do not have closed form expressions and hence must be computed by optimization; in our R implementation, we used the built-in optimize() function.

The minimizer $\gamma_{i}$ of the conditional KL divergence given $\bm{\mu},\bm{\sigma},\bm{\gamma}_{-i}$ solves

[TABLE]

see Section C.1 of the supplement for the proof.

Following [25], we terminate the procedure once the coordinate-wise maximal change in binary entropy of the posterior inclusion probabilities falls below a prespecified small threshold $\varepsilon$ (e.g. $\varepsilon=10^{-3}$ ), i.e. stop when $\Delta_{H}:=\max_{i=1,...,p}|H(\gamma_{i})-H(\gamma_{i}^{\prime})|\leq\varepsilon$ , where $H(p)=-p\log p-(1-p)\log(1-p)$ , $p\in(0,1)$ , and $\gamma_{i}$ , $\gamma_{i}^{\prime}$ are the $i$ th coordinate of the starting and updated parameters $\bm{\gamma}$ , $\bm{\gamma}^{\prime}$ , respectively. The full algorithm is presented in Algorithm 1.
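The stopping rule can be sketched in a few lines of Python. The function names are hypothetical, and extending $H$ to the endpoints by $H(0)=H(1)=0$ is our assumption for numerical convenience, since $H$ is defined above only on $(0,1)$.

```python
import math


def binary_entropy(prob):
    """H(p) = -p log p - (1 - p) log(1 - p); set to 0 at the endpoints (assumption)."""
    if prob <= 0.0 or prob >= 1.0:
        return 0.0
    return -prob * math.log(prob) - (1.0 - prob) * math.log(1.0 - prob)


def max_entropy_change(gamma_old, gamma_new):
    """Delta_H: largest coordinate-wise change in binary entropy between two iterates."""
    return max(abs(binary_entropy(a) - binary_entropy(b))
               for a, b in zip(gamma_old, gamma_new))


def has_converged(gamma_old, gamma_new, eps=1e-3):
    """Stop when Delta_H <= eps."""
    return max_entropy_change(gamma_old, gamma_new) <= eps
```

In a CAVI loop, `gamma_old` would be the inclusion probabilities before a full sweep over the coordinates and `gamma_new` those after it.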

4.2 Prioritized updating order

The VB objective function is generally non-convex and so CAVI can be sensitive to initialization [5]. It turns out the algorithm is also highly sensitive to the order of the component-wise updates. In fact, naively updating the coordinates in lexicographic order $i=1,...,p$ is typically suboptimal in our setting. We demonstrate in the next section on various simulated data sets that, unless the significant non-zero coefficients are located at the beginning of the signal, the procedure typically converges to a poor local minimum and gives misleading, inconsistent answers. In particular, CAVI returns a solution that is far from the desired VB posterior it is trying to compute. It is clearly undesirable that the algorithm’s performance depends on the arbitrary ordering of the parameter coordinates. A natural fix is to randomize the order of the coordinate-wise updates and use different initializations, choosing the local minimum which provides the smallest overall KL-divergence to the posterior. We show, however, that due to the large number of local minima and their substantially different behaviour, this approach can also perform badly (although somewhat better than the lexicographic approach).

We instead propose a novel prioritized update scheme. In a first preprocessing step, we compute an initial estimator $\hat{\mu}^{(0)}$ of the mean vector $\bm{\mu}$ of the variational class. We then place the coefficients in decreasing order with respect to the absolute value of their estimate and update the parameters coordinate-wise in the corresponding order, i.e. denoting by $\bm{a}=(a_{1},...,a_{p})$ the permutation of the indices $(1,2,...,p)$ such that $|\hat{\mu}^{(0)}_{a_{i}}|\geq|\hat{\mu}^{(0)}_{a_{j}}|$ for every $1\leq i<j\leq p$ , we update the coordinates in the order $\mu_{a_{i}},\sigma_{a_{i}},\gamma_{a_{i}}$ , $i=1,...,p$ .

The intuition behind this method is that when CAVI begins by updating indices whose signal coefficients are small or zero in the target VB posterior, it may incorrectly assign signal strength to such indices to better fit the data (this is especially the case if the initialization value of the signal coefficient is far from its value in the target VB posterior). Consequently, the estimates of the significant non-zero signal components may be overly small since part of the signal strength has already been falsely assigned to signal coefficients that should in fact be small under the VB posterior. This can trap the algorithm near a poor local minimum from which it cannot escape, see the corresponding simulation study in Section 5.

To avoid this, we wish to first update those coefficients which are large in the target VB posterior. Since these are unknown, the idea here is to identify them using a preliminary estimator: if the target VB posterior does a good job of estimating the signal, these large coefficients should roughly match those that are large in the true underlying signal, which can be identified using a reasonable estimator. The algorithm is given in Algorithm 1, where the function $order(|\bm{\mu}|)$ returns the indices of $|\bm{\mu}|$ in descending order.
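The ordering step itself is simple to express. The Python sketch below computes the permutation once from a preliminary estimate and then visits the coordinates in that order. The names `prioritized_order`, `cavi_sweep` and the callback `update_coordinate` are hypothetical; the callback stands in for the coordinate-wise updates of $\mu_{i}$ , $\sigma_{i}$ and $\gamma_{i}$ , which are not implemented here.

```python
def prioritized_order(mu0):
    """Indices sorted by decreasing |mu0_i|; ties keep their original order."""
    return sorted(range(len(mu0)), key=lambda i: -abs(mu0[i]))


def cavi_sweep(order, update_coordinate):
    """Apply a user-supplied coordinate update once per index, in the given order."""
    for i in order:
        update_coordinate(i)
```

For example, `prioritized_order([0.1, -3.0, 2.0, 0.0])` returns `[1, 2, 0, 3]`, so the coordinate with the largest preliminary estimate in absolute value is updated first. Passing `range(p)` or a shuffled list as `order` instead gives the lexicographic and random schemes.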

Instead of the prior (15), one can take $w_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{Beta}(a_{0},b_{0})$ and $z_{i}|w_{i}\sim^{ind}\text{Bernoulli}(w_{i})$ , so that the probabilities $w_{i}$ vary with $i$ . This results in exactly the same variational algorithm since we are using a mean-field approximation. If one instead takes deterministic weights $w_{i}$ , the above algorithm can be easily adapted by using the same update steps for $\mu_{i}$ and $\sigma_{i}$ , while updating $\gamma_{i}$ as the solution to

[TABLE]

The closely related algorithm for computing the VB posterior $\widetilde{Q}$ based on the family $\mathcal{Q}_{MF}$ is given in Algorithm 4 in Section C.2 in the supplement.

5 Numerical study

In this section, we empirically compare the performance of our VB method using Laplace prior slabs, implemented in the sparsevb package [17], with various state-of-the-art Bayesian model selection methods on simulated data. We also demonstrate the importance of the prioritized updating scheme compared with standard CAVI implementations.

Additional numerical results are provided in the supplementary material as follows:

- Section A.1: we apply our method and other Bayesian model selection methods to real world data.
- Section A.2: we show that Laplace prior slabs provide better estimation and model selection than Gaussian prior slabs. We also show that the optimization problem for finding the KL-optimizer for the class $\mathcal{Q}_{MF}$ is substantially harder than for the class $\mathcal{P}_{MF}$ , with the former typically ending up at a poor local minimum.
- Section A.3: we show that although the theory indicates that the VB approach is (asymptotically) robust to the choice of the hyper-parameter $\lambda$ , in finite-sample cases it can still have an effect and it may be helpful to use a data-driven choice in practice (e.g. cross validation).
- Section A.4: we show that several Bayesian model selection methods are robust to noise misspecification.
- Section A.5: we compare different Bayesian model selection methods when the inputs are correlated.

We ran each experiment multiple times and report the average $\ell_{2}$ -distance between the posterior mean (or maximum a posteriori (MAP) estimate for the SSLASSO) and the true parameter $\theta_{0}$ , the false discovery rate (FDR), the true positive rate (TPR) and the computational time in seconds. We also report the standard deviations of these indicators to quantify their spread. For our computations, we used a MacBook Pro laptop with 2.9 GHz Intel Core i5 processor and 8 GB memory. Throughout the numerical study, we use the hyper-parameter choices $a_{0}=1$ , $b_{0}=p$ , $\lambda=1$ (except in Section A.3) and stop once the entropy change satisfies $\Delta_{H}\leq 10^{-5}$ , see Algorithm 1. In each experiment and for every method, we take the ridge regression estimator $\hat{\mu}^{(0)}=(X^{T}X+I)^{-1}X^{T}Y$ as initialization. Given the sparsity framework, it may be tempting to take the LASSO as initialization; however, this is not recommended. The LASSO shrinks some coordinates to exactly zero and so is not suitable for $\mu$ , which represents the estimated coefficients given that they are included in the model, i.e. non-zero [the LASSO solution should be compared to $(\gamma_{1}\mu_{1},\dots,\gamma_{p}\mu_{p})$ rather than $(\mu_{1},\dots,\mu_{p})$ ].
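For concreteness, the ridge initialization $\hat{\mu}^{(0)}=(X^{T}X+I)^{-1}X^{T}Y$ can be computed as in the Python sketch below. The function name is hypothetical, and the plain Gaussian elimination is used only to keep the example self-contained; in practice a linear algebra library would be the natural choice.

```python
def ridge_initializer(X, Y):
    """Solve (X^T X + I) mu = X^T Y, with X given as a list of n rows of length p."""
    n, p = len(X), len(X[0])
    # Augmented system [X^T X + I | X^T Y], one row per coefficient.
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (1.0 if i == j else 0.0)
          for j in range(p)]
         + [sum(X[k][i] * Y[k] for k in range(n))]
         for i in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(col + 1, p):
            factor = A[row][col] / A[col][col]
            for j in range(col, p + 1):
                A[row][j] -= factor * A[col][j]
    # Back substitution.
    mu = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(A[i][j] * mu[j] for j in range(i + 1, p))
        mu[i] = (A[i][p] - tail) / A[i][i]
    return mu
```

Because $X^{T}X+I$ is positive definite, the system has a unique solution even when $p>n$ , and the resulting estimate generally has no exactly zero coordinates, unlike the LASSO.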

5.1 Prioritized updates

We demonstrate here the relevance of our prioritized updating scheme for CAVI by comparing its performance with lexicographic and randomized updating orders, which are standard implementations for CAVI. We take $n=100$ , $p=200$ , $s=20$ , $\theta_{i}=10$ for the non-zero coefficients, $\varsigma=1$ assumed to be known, $X_{ij}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,1)$ , and consider four scenarios for the locations of the non-zero signal components. We place all non-zero coordinates (i) at the beginning of the signal, (ii) in the middle of the signal, (iii) at the end of the signal and (iv) uniformly at random. We ran the experiments 200 times and report the results in Table 1 (for the FDR and TPR, the $i$ th coefficient is selected if $\gamma_{i}>0.5$ ). We also plot the posterior means resulting from a typical run in Figure 1.
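The selection rule $\gamma_{i}>0.5$ and the two selection metrics can be sketched in Python as follows. The function name is hypothetical, and we assume the usual conventions: the FDR is the fraction of selected coefficients that are truly zero (set to 0 when nothing is selected) and the TPR is the fraction of truly non-zero coefficients that are selected.

```python
def selection_metrics(gamma, true_support, threshold=0.5):
    """Return (FDR, TPR) for the model {i : gamma_i > threshold}."""
    selected = {i for i, g in enumerate(gamma) if g > threshold}
    truth = set(true_support)
    true_pos = len(selected & truth)
    # Convention (assumption): FDR is 0 if no coefficient is selected.
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    tpr = true_pos / len(truth) if truth else 0.0
    return fdr, tpr
```

For instance, with `gamma = [0.9, 0.8, 0.2, 0.6]` and true support `{0, 1, 2}`, the selected set is `{0, 1, 3}`, giving an FDR of 1/3 and a TPR of 2/3.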

Apart from the first scenario, where the significant signal coefficients are all located at the beginning of the signal, the prioritized method substantially outperforms both the randomized and lexicographic updating schemes for parameter estimation and model selection (recall that all three methods are trying to compute the same VB estimate). The random updating order also slightly improves upon the lexicographic order, except for the first scenario, where the lexicographic order naturally updates the largest coefficients first. As well as being sensitive to initialization [5], it seems CAVI can also be very sensitive to the updating order of the parameters. Indeed, we see here that without prioritized ordering, the algorithm often terminates at poor local minima of the VB objective function. Since the VB objective is non-convex, naive (or random) update orderings may cause CAVI to return a solution that is far from the true minimizer of the KL divergence that it is trying to compute. Performing updates in a prioritized order can add some robustness against this, see Section 4 for some heuristics behind this idea. We also note that the runtime is comparable for the three updating orders.
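The three updating orders compared here can be sketched as follows. This is an illustration under an assumption: we take the prioritized order to be the coordinates sorted by decreasing $|\hat{\mu}^{(0)}_{i}|$ of the initial estimate, which is consistent with the remark that the lexicographic order updates the largest coefficients first in scenario (i). The function name `update_order` is hypothetical.

```python
import random

def update_order(mu0, scheme="prioritized", rng=None):
    """Return the order in which CAVI visits the coordinates 0..p-1."""
    p = len(mu0)
    if scheme == "lexicographic":
        return list(range(p))
    if scheme == "random":
        order = list(range(p))
        (rng or random.Random()).shuffle(order)
        return order
    if scheme == "prioritized":
        # Assumed rule: largest initial coefficient (in absolute value) first.
        return sorted(range(p), key=lambda i: -abs(mu0[i]))
    raise ValueError("unknown scheme: " + scheme)

mu0_toy = [0.1, -3.0, 2.0, 0.0]
order = update_order(mu0_toy)
```

Computing the prioritized order costs only a single sort of the initial estimate, which fits the observation that the runtimes of the three schemes are comparable.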

5.2 Comparing Bayesian variable selection methods

We consider here the realistic situation of unknown noise variance $\varsigma^{2}$ , that is the model $Y=X\theta+\varsigma Z$ . As mentioned in Section 2 (see (6)), dividing both sides of this model by an empirical estimator $\hat{\varsigma}$ for the noise standard deviation $\varsigma$ gives $\tilde{Y}=\tilde{X}\theta+\tilde{Z},$ where $\tilde{Y}=Y/\hat{\varsigma}$ , $\tilde{X}=X/\hat{\varsigma}$ and $\tilde{Z}=(\varsigma/\hat{\varsigma})Z$ , $Z\sim N(0,I_{n})$ . If we endow $\theta$ with the spike-and-slab prior and the estimator $\hat{\varsigma}$ is close to $\varsigma$ , we should approximately recover the $\varsigma=1$ case studied above. We thus compute our VB estimator as described above based on the design matrix $\tilde{X}$ and data $\tilde{Y}$ . For estimating $\varsigma$ , we have used the R package selectiveInference, see [37].
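The rescaling step can be sketched as below. The estimate $\hat{\varsigma}$ is treated as a given input (in our experiments it comes from the selectiveInference R package, which is not reproduced here); the function name `rescale_by_noise_sd` is an assumption made for this illustration.

```python
def rescale_by_noise_sd(X, Y, sigma_hat):
    """Return (X / sigma_hat, Y / sigma_hat) for a positive noise sd estimate."""
    if sigma_hat <= 0:
        raise ValueError("sigma_hat must be positive")
    X_tilde = [[x / sigma_hat for x in row] for row in X]
    Y_tilde = [y / sigma_hat for y in Y]
    return X_tilde, Y_tilde

# Noiseless toy check (hypothetical values): theta is unchanged by the rescaling.
X_toy = [[1.0, 2.0], [3.0, 4.0]]
theta_toy = [0.5, -1.0]
Y_toy = [sum(x * t for x, t in zip(row, theta_toy)) for row in X_toy]
X_tilde, Y_tilde = rescale_by_noise_sd(X_toy, Y_toy, 2.0)
```

The regression coefficient $\theta$ is the same in the original and rescaled models, so no back-transformation of the estimate is needed.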

We compare the performance of our VB method with various Bayesian (based) variable selection algorithms for sparse linear regression using simulated data. We consider the varbvs R-package (variational Bayes for spike-and-slab priors with Gaussian prior slabs using an importance sampling outer loop for estimating the posterior inclusion probabilities and noise variance [11]), EMVS R-package (an expectation-maximization algorithm for spike-and-slab [38]), SSLASSO R-package (spike-and-slab LASSO [39]) and ‘ebreg.R’ R-function (a fractional likelihood empirical Bayes approach using MCMC for re-centered Gaussian slab priors [29] - the function is available on the first author’s website).

For varbvs, we set $tol=10^{-4}$ and $maxiter=10^{4}$ . For EMVS we took $v_{0}\in\{0.1,0.2,...,2\}$ , $v_{1}=1000$ (these quantities were used in one of the examples provided in the package), $a=1$ , $b=p$ and $\epsilon=10^{-5}$ and report the posterior mean corresponding to the $v_{0}=0.1$ case. For SSLASSO, we took $\lambda_{1}=0.01$ , $\lambda_{0}$ an arithmetic series between $\lambda_{1}$ and $p$ with 200 elements, set the variance “unknown”, $a=1$ , $b=p$ , and penalty=“adaptive”, and report the results corresponding to the stabilized $\lambda_{0}$ value as recommended by the authors [39]. In the ebreg algorithm, we took the default parameters $M=5000$ , $\alpha=0.99$ , $\gamma=0.001$ and used the selectiveInference R-package for the estimation of $\varsigma$ . We note that for most of these methods, additional careful hyper-parameter tuning beyond the default settings can often lead to improved performance, see Section A.3 for our VB method or Section 5 of [20] for discussion concerning the SSLASSO.

We first consider (i) $n=100$ , $p=400$ , $s=20$ , $\varsigma=5$ with the non-zero signal coefficients set to $\theta_{i}=A$ , with $A=\log n$ , and located at the end of the signal. The entries of the design matrix are taken to be iid normal random variables $X_{ij}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,1)$ . In the other experiments, we take $(n,p,s,\varsigma)$ equal to (ii) $(100,1000,40,1)$ (with non-zero coefficients at the beginning of the signal) and set the non-zero parameters to be $1,2,3$ ; (iii) $(200,800,5,0.2)$ (in the middle) and take $\theta_{i}\stackrel{{\scriptstyle iid}}{{\sim}}U(-5,5)$ ; (iv) $(100,400,20,5)$ (at the end) and take $\theta_{i}=2\log n$ . We ran each algorithm 100 times and report the results in Table 2. Our method performs well compared to the other methods, in some cases providing substantially better estimation and model selection.
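A data-generating routine for scenario (i) might look like the following sketch. The function name `simulate`, the seed handling and the exact definition of the "middle" block are assumptions made for this illustration.

```python
import math
import random

def simulate(n, p, s, sigma, signal, location="end", seed=0):
    """Draw X with iid N(0,1) entries and Y = X theta0 + sigma * Z."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Assumed convention for where the block of s non-zero coefficients starts.
    start = {"beginning": 0, "middle": (p - s) // 2, "end": p - s}[location]
    theta0 = [0.0] * p
    for j in range(start, start + s):
        theta0[j] = signal
    Y = [sum(x * t for x, t in zip(row, theta0)) + sigma * rng.gauss(0.0, 1.0)
         for row in X]
    return X, Y, theta0

# Scenario (i): n = 100, p = 400, s = 20, noise sd 5, signal A = log n at the end.
X_sim, Y_sim, theta0_sim = simulate(100, 400, 20, 5.0, math.log(100))
```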

6 Conclusion

We studied theoretical oracle contraction rates of a natural sparsity-inducing mean-field VB approximation to posteriors arising from widely used, but computationally challenging, model selection priors in high-dimensional sparse linear regression. We showed that under compatibility conditions on the design matrix, such an approximation converges to a sparse truth at an oracle rate in $\ell_{1}$ , $\ell_{2}$ and prediction loss, implying optimal (minimax) recovery, and also performs suitable dimension selection. This provides a theoretical justification for this approximation algorithm in a sparsity context. Minimax guarantees for this VB method extend to high-dimensional logistic regression, as we show in the follow-up work [36].

We investigated the empirical performance of our algorithm via simulated and real-world data and showed that it generally performs at least as well as other state-of-the-art Bayesian variable selection methods, including existing VB approaches. We also demonstrated how the widely used coordinate-ascent variational inference (CAVI) algorithm can be highly sensitive to the updating order of the parameters. We therefore proposed a novel prioritized updating scheme that uses a data-driven updating order and performs better in simulations. This idea may be applicable for CAVI approaches in other settings. Our variational algorithm is implemented in the R-package sparsevb [17].

Acknowledgements. We would like to thank two referees for valuable comments that helped considerably improve this manuscript.

SUPPLEMENTARY MATERIAL

In Section A, additional numerical results are given. First, we provide a real-world data example, where we compare Bayesian model selection methods. We then consider various VB methods, demonstrating the advantages of using Laplace instead of Gaussian prior slabs, investigate the effect of the hyper-parameter $\lambda$ and further study Bayesian variable selection methods under noise misspecification and correlated inputs. Section B contains full oracle results and all proofs, Section C contains additional methodological details and Section D contains further discussion of the design matrix assumption, including examples.

Appendix A Additional numerical results

A.1 Ozone interaction data

We apply our method to the real-world ozone interaction data investigated in [8]. The dataset contains $n=203$ readings of maximal daily ozone measured in the Los Angeles basin and $p=134$ variables modeling the pairwise interaction of 9 meteorological and 3 time variables. We firstly normalize the design matrix by centering and rescaling each column to have Euclidean norm equal to $\sqrt{n}$ and then add a column vector of ones to add an intercept to the model (except for EMVS, since adding an intercept resulted in an error message). We apply the four methods investigated above (i.e. our method sparsevb [17], varbvs, EMVS, SSLASSO) with unknown noise variance $\varsigma^{2}$ , using the method settings described in Section 5.2. We also tried to apply the ebreg method, but due to the highly collinear nature of the design matrix, the code gave errors when trying to compute the Cholesky decomposition.
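The preprocessing of the design matrix can be sketched as follows. The function name `standardize_with_intercept` and the choice to place the column of ones first are assumptions made for this illustration.

```python
import math

def standardize_with_intercept(X):
    """Center each column, rescale it to Euclidean norm sqrt(n), prepend ones."""
    n, p = len(X), len(X[0])
    cols = []
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        centered = [v - mean for v in col]
        norm = math.sqrt(sum(v * v for v in centered))
        if norm == 0:
            raise ValueError("constant column cannot be rescaled")
        scale = math.sqrt(n) / norm
        cols.append([v * scale for v in centered])
    # Assumed layout: the intercept column of ones comes first.
    return [[1.0] + [cols[j][i] for j in range(p)] for i in range(n)]

X_raw = [[1.0, 2.0], [3.0, 5.0], [5.0, 11.0]]
X_std = standardize_with_intercept(X_raw)
```

After this step every non-intercept column has mean zero and squared Euclidean norm $n$, the same squared norm as the column of ones.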

As we do not know the underlying truth, we consider the 10-fold cross validation prediction error, i.e. we use nine folds to compute the posterior mean or MAP $\hat{\theta}$ and then use the 10th fold to compute the prediction error $\|Y-X\hat{\theta}\|_{2}$ . We report the averaged cross-validation errors in Table 3, together with the runtimes and number of selected covariates. Our method outperforms the other approaches in cross-validated prediction loss. Furthermore, while there is some overlap between the models selected by the various methods, the results are quite different, see Figure 2.
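The cross-validation loop can be sketched as below. Here `fit` is a hypothetical callable standing in for any of the compared methods (it takes training data and returns a coefficient vector), and the random assignment of observations to folds is an assumption made for this illustration.

```python
import math
import random

def cv_prediction_error(X, Y, fit, k=10, seed=0):
    """Average over k folds of ||Y_fold - X_fold theta_hat||_2."""
    n = len(Y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the other k-1 folds, then predict the held-out fold.
        theta = fit([X[i] for i in train], [Y[i] for i in train])
        sq = sum((Y[i] - sum(x * t for x, t in zip(X[i], theta))) ** 2
                 for i in held_out)
        errors.append(math.sqrt(sq))
    return sum(errors) / k

# Toy check with a trivial "method" that always returns the zero vector.
X_cv = [[1.0] for _ in range(20)]
Y_cv = [1.0] * 20
err_zero = cv_prediction_error(X_cv, Y_cv, lambda X_tr, Y_tr: [0.0])
```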

A.2 Comparing the VB algorithms

We compare our VB method with Laplace slabs (Algorithm 1) with different variations of the VB algorithm. First, we consider the other mean-field VB posterior $\widetilde{Q}$ derived from the variational class $\mathcal{Q}_{MF}$ (Algorithm 4 in Section C.2). Next, we consider the VB method with Gaussian prior slabs, which is the standard choice in the literature, see for instance [27, 11, 25], both with component-wise and batch-wise computational approaches, see Algorithms 2 and 3 in Section C.2. To compensate for the over-shrinkage of the posterior mean caused by the light tail of the Gaussian slabs, we also consider centered Gaussian prior slabs with standard deviation set to the (unknown) oracle $\rho=\|\theta_{0}\|_{2}$ , as proposed by [15] for the sequence model (i.e. $X=I$ the identity matrix).

In all experiments, we placed the non-zero signal components $\theta_{i}=A$ at the beginning of the signal. In the first experiment, (i) we take the identity design matrix $X=I_{n}$ and set $n=p=400$ , $s=40$ , $A=4\sqrt{\log n}$ . In the other three experiments, we consider a Gaussian design matrix with entries $X_{ij}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,\tau^{2})$ and vary the parameters $n,p,s,\tau$ and $A$ . We take (ii) $(n,p,s,\tau)=(100,200,20,1)$ , $A\stackrel{{\scriptstyle iid}}{{\sim}}U(0,2\log n)$ ; (iii) $(n,p,s,\tau)=(200,800,40,0.1)$ , $A=2\log n$ ; (iv) $(n,p,s,\tau)=(100,400,15,0.5)$ , $A\stackrel{{\scriptstyle iid}}{{\sim}}U(-8,8)$ . In all experiments, we take $\varsigma=1$ assumed to be known. The results over 200 runs are reported in Table 4 and we plot the outcome of a typical run in Figure 3.

Our Laplace VB method (sparsevb) with variational class $\mathcal{P}_{MF}$ typically outperforms the other VB algorithms. From the identity design case (i), it is clear that Gaussian prior slabs provide suboptimal recovery for $\theta_{0}$ unless the prior slab variance is rescaled by the norm of $\theta_{0}$ . However, the rescaled Gaussian slabs perform much worse in the Gaussian design cases (ii)-(iv). The other mean-field variational class $\mathcal{Q}_{MF}$ performs similarly to our main method in the identity design case, but significantly worse in the more complicated Gaussian design cases. This is due to the discrete nature of the variational parameter $\gamma\in\{0,1\}$ in this family, which makes the optimization problem even more difficult, causing the method to frequently get stuck at a poor local minimum. We do not report run times as the sparsevb R-package is optimized for computation and therefore runs substantially faster than the other methods, which are more simply implemented.

A.3 The effect of the hyper-parameter $\lambda$

Theorem 1 states that for a wide range of hyper-parameter values $\lambda\in[\frac{\|X\|}{p},\frac{C\|X\|\sqrt{\log p}}{s_{0}}]$ , our VB algorithm has good asymptotic properties. However, the finite-sample performance depends on $\lambda$ as we now investigate. We ran our algorithm for different choices of $\lambda$ , ranging from $1/20$ to $20$ , on simulated data similar to that in the preceding subsections.

We consider four different settings, each with Gaussian design with entries $X_{ij}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,\tau^{2})$ , non-zero signal components set to $\theta_{i}=A$ and noise variance $\varsigma^{2}=1$ assumed to be known. We take (i) $(n,p,s,\tau)=(200,300,15,0.5)$ , $A=2\log n$ ; (ii) $(n,p,s,\tau)=(500,1000,50,1)$ , $A=2\log n$ ; (iii) $(n,p,s,\tau)=(200,500,20,0.2)$ , $A\stackrel{{\scriptstyle iid}}{{\sim}}U(-10,10)$ ; and (iv) $(n,p,s,\tau)=(1000,2000,15,2)$ , $A\stackrel{{\scriptstyle iid}}{{\sim}}U(-8,8)$ . In all cases, the non-zero signal components are located at the beginning of the signal. We ran each algorithm 200 times and report the results in Table 5. The choice of $\lambda$ can indeed significantly influence the finite-sample behaviour of the algorithm (e.g. cases (ii) and (iii)), but not always ((i) and (iv)). There was no clear evidence to support a particular fixed choice of $\lambda$ , since larger values sometimes performed better ((ii) and (iv)) and sometimes worse ((i) and (iii)). This suggests using a data-driven choice of $\lambda$ may be helpful in practice. As expected, larger choices for $\lambda$ , which cause more shrinkage, result in smaller FDR and TPR. The runtimes across hyper-parameter choices were broadly comparable.

A.4 Noise misspecification

We investigate the robustness of the Bayesian model selection methods to misspecification of the noise distribution in practice. Note that our theoretical results are also robust to some misspecification, see Remark B.1 in Section B below. We consider Gaussian design $X_{ij}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,2)$ , set the model parameters $n=200$ , $p=400$ , $s=20$ , and take non-zero signal coefficients $\theta_{i}\stackrel{{\scriptstyle iid}}{{\sim}}U(-10,10)$ located at the beginning of $\theta$ . We compare the correctly-specified Gaussian noise case (i) $Z_{i}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,1)$ in model (1) with the misspecified noise cases: (ii) Laplace noise $Z_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{Lap}(0,1)$ ; (iii) uniform noise $Z_{i}\stackrel{{\scriptstyle iid}}{{\sim}}U(-2,2)$ ; (iv) Student noise with 3 degrees of freedom $Z_{i}\stackrel{{\scriptstyle iid}}{{\sim}}t_{3}$ . We apply the same parametrizations of the methods as in Section 5.2. We ran each experiment 200 times and collect the results in Table 6. Our method (sparsevb) gave similar results to varbvs, ebreg and EMVS, while the SSLASSO performed slightly worse. The noise distribution does not seem to have a major effect on the results, hence these algorithms seem robust to noise misspecification. It is worthwhile to further investigate this phenomenon both empirically and analytically.
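The four noise distributions can be sampled as in the sketch below. We assume here that $\text{Lap}(0,1)$ denotes the Laplace distribution with location 0 and scale 1; the function name `draw_noise` is hypothetical.

```python
import math
import random

def draw_noise(kind, n, rng):
    """Draw n iid noise variables from one of the four scenarios."""
    if kind == "gaussian":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if kind == "laplace":
        # Assumed scale parametrization: the difference of two independent
        # Exp(1) variables has density exp(-|x|) / 2.
        return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]
    if kind == "uniform":
        return [rng.uniform(-2.0, 2.0) for _ in range(n)]
    if kind == "student3":
        out = []
        for _ in range(n):
            z = rng.gauss(0.0, 1.0)
            chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
            out.append(z / math.sqrt(chi2 / 3.0))  # t_3 = N(0,1) / sqrt(chi2_3 / 3)
        return out
    raise ValueError("unknown noise type: " + kind)

rng = random.Random(1)
samples = {kind: draw_noise(kind, 20000, rng)
           for kind in ("gaussian", "laplace", "uniform", "student3")}
```

Under this parametrization the four distributions have variances $1$, $2$, $4/3$ and $3$ respectively, so the misspecified cases differ from the Gaussian case in scale as well as in shape.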

A.5 Bayesian variable selection methods under correlated inputs

We lastly consider the common situation of correlated input variables. We take each row $X_{i\cdot}\stackrel{{\scriptstyle iid}}{{\sim}}N_{p}(0,\Sigma)$ with $\Sigma_{jk}=\rho$ for $j\neq k$ and $\Sigma_{jj}=1$ , giving standard normal predictors with non-zero correlation $\rho$ . We take (i) $(n,p,s,\varsigma)=(100,400,10,0.2)$ , correlation $\rho=0.3$ and non-zero coefficients $\theta_{i}\stackrel{{\scriptstyle iid}}{{\sim}}U(-3,3)$ at the beginning of the signal; (ii) the same setting as in (i), but with higher correlation $\rho=0.7$ ; (iii) $(n,p,s,\varsigma)=(200,800,20,5)$ , correlation $\rho=0.3$ and non-zero coefficients $\theta_{i}=2\log n$ at the end of the signal; (iv) the same setting as in (iii), but with higher correlation $\rho=0.7$ . We apply the same parametrizations of the methods as in Section 5.2. The results are summarized in Table 7.
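Rows with this equicorrelated covariance can be generated without forming $\Sigma$ explicitly, as in the following sketch. It uses the representation $X_{ij}=\sqrt{\rho}\,W_{i}+\sqrt{1-\rho}\,E_{ij}$ with $W_{i},E_{ij}$ iid $N(0,1)$, which gives unit variances and pairwise covariance $\rho$ for $0\leq\rho\leq 1$; the function name `equicorrelated_design` is hypothetical.

```python
import math
import random

def equicorrelated_design(n, p, rho, rng):
    """Draw n rows from N_p(0, Sigma) with Sigma_jj = 1 and Sigma_jk = rho."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this construction requires 0 <= rho <= 1")
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    X = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)  # factor shared by all coordinates of the row
        X.append([a * w + b * rng.gauss(0.0, 1.0) for _ in range(p)])
    return X

X_corr = equicorrelated_design(20000, 2, 0.7, random.Random(0))
emp_cov = sum(row[0] * row[1] for row in X_corr) / len(X_corr)
```

This costs $O(np)$ operations, compared with the $O(p^{3})$ Cholesky factorization a generic multivariate normal sampler would require.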

One might expect that mean-field VB methods should not perform so well under correlated inputs due to their factorizable structure. This was not the case in our simulations, where the VB methods perform competitively with the other methods, often providing the best results (except perhaps in (iv), where varbvs sometimes gave large $\ell_{2}$ error). The correlated design also does not seem to substantially influence the run time.

While our simulations are certainly not extensive, they suggest that mean-field VB can perhaps still be effective in certain correlated input settings. Understanding the exact effect of correlation on VB seems to be a subtle question. It is currently not well understood how VB, or indeed even the true posterior, behaves in general correlated design settings. This important and practically very relevant setting requires further investigation, both theoretically and empirically.

Appendix B Proofs

B.1 Full oracle results

The proofs of the full oracle results in Theorems B.1 and B.2 below rely on Theorem 5, which allows one to exploit exponential probability bounds for the posterior to control the corresponding probability under the variational approximation. To prove our results, it therefore suffices to show that on a suitable event, one can (a) control the KL divergence between the variational approximation and the true posterior and (b) establish the appropriate posterior tail inequality (14). Part (a) is dealt with in Section B.2 and (b) in Section B.3 below. Define the events

[TABLE]

and

[TABLE]

for $\Gamma,\varepsilon,\kappa>0$ . The middle event in $\mathcal{T}_{1}$ says that the posterior puts most of its mass on models of dimension at most $\Gamma$ ; the number $1/4$ is unimportant and any number less than $1/2$ suffices. The third event says the posterior places all but exponentially small probability on an $\ell_{2}$ -ball of radius $\varepsilon$ about the truth and is used for a localization argument when bounding the KL divergence. The proof uses an iterative structure, using successive posterior localizations to eventually bound the KL divergence in Section B.2. This idea is a useful technique from Bayesian nonparametrics, see e.g. [32].

For parameters $\theta_{0},\theta_{*}\in\mathbb{R}^{p}$ , set $S_{*}=S_{\theta_{*}}$ and $s_{*}=|S_{*}|$ and define

[TABLE]

This quantity appears in the posterior exponential probabilities, which take the form $e^{-c\Delta_{*}}$ . We require the following parameter choices for the event $\mathcal{T}_{1}$ in (B.2):

[TABLE]

for some $M>0$ large enough depending only on $A_{1},A_{3},A_{4}$ .

Lemma B.1.

(i) The event $\mathcal{T}_{0}$ defined in (B.1) satisfies

[TABLE]

(ii) Suppose the prior satisfies (4) and (5). For $\theta_{0}\in\mathbb{R}^{p}\backslash\{0\}$ , let $\theta_{*}\in\mathbb{R}^{p}$ be any vector satisfying $1\leq s_{*}=|S_{\theta_{*}}|\leq|S_{\theta_{0}}|=s_{0}$ ,

[TABLE]

Then the event $\mathcal{T}_{1}$ given in (B.2) with parameters $\Gamma,\varepsilon,\kappa$ chosen according to (B.4) satisfies

[TABLE]

uniformly over all $\theta_{0}$ and $\theta_{*}$ as above.

Proof.

(i) Under $P_{\theta_{0}}$ , $X^{T}(Y-X\theta_{0})=X^{T}Z\sim N_{p}(0,X^{T}X)$ . Since $(X^{T}Z)_{i}\sim N(0,(X^{T}X)_{ii})$ and $(X^{T}X)_{ii}\leq\|X\|^{2}$ for all $1\leq i\leq p$ , a union bound and the standard Gaussian tail inequality give

[TABLE]
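The computation behind the last display can be sketched as follows, under two assumptions about notation fixed earlier in the paper: that $\mathcal{T}_{0}$ is the event $\{\|X^{T}(Y-X\theta_{0})\|_{\infty}\leq\bar{\lambda}\}$ with $\bar{\lambda}=2\|X\|\sqrt{\log p}$ , and that $\|X\|$ denotes the largest Euclidean column norm of $X$ . Writing $\sigma_{i}^{2}=(X^{T}X)_{ii}\leq\|X\|^{2}$ and using $P(|N(0,1)|>t)\leq 2e^{-t^{2}/2}$ for $t\geq 0$ ,

$$
P_{\theta_{0}}(\mathcal{T}_{0}^{c})\leq\sum_{i=1}^{p}P\Big(|N(0,1)|>\frac{2\|X\|\sqrt{\log p}}{\sigma_{i}}\Big)\leq\sum_{i=1}^{p}P\big(|N(0,1)|>2\sqrt{\log p}\big)\leq 2p\,e^{-2\log p}=\frac{2}{p}.
$$

Under these assumptions the bound does not depend on $\theta_{0}$ and tends to zero as $p\to\infty$ .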

(ii) Applying Markov’s inequality and Lemma B.5 below with $M=3$ gives

[TABLE]

Since the right-hand side does not depend on $\theta_{0}$ or $\theta_{*}$ , the probability tends to zero uniformly as required.

Under the assumptions on $\theta_{*}$ ,

[TABLE]

Therefore, applying Lemma B.6 with $L\geq 1$ yields

[TABLE]

Using Markov’s inequality and the last display with $L=L_{0}=\max(3+12/A_{4},2+A_{4}/2)$ ,

[TABLE]

Since the right-hand side again does not depend on $\theta_{0}$ or $\theta_{*}$ , the probability tends to zero uniformly as required. ∎

Theorem B.1 (Full oracle recovery).

Suppose the model selection prior (3) satisfies (4) and (5). For $\theta_{0}\in\mathbb{R}^{p}\backslash\{0\}$ , let $\theta_{*}\in\mathbb{R}^{p}$ be any vector satisfying $1\leq s_{*}=|S_{\theta_{*}}|\leq|S_{\theta_{0}}|=s_{0}$ and $\|X(\theta_{0}-\theta_{*})\|_{2}^{2}\leq(s_{0}-s_{*})\log p.$ Then the variational Bayes posterior $\widetilde{\Pi}$ satisfies, uniformly over all $\theta_{0}$ and $\theta_{*}$ as above,

[TABLE]

for any $\rho_{n}>2$ , where $\Gamma,L_{0}$ are given in (B.4). Moreover, both

[TABLE]

satisfy the same inequality. Furthermore, the exact same inequalities hold for the variational Bayes posteriors $\widetilde{Q}$ and $\hat{Q}$ .

Proof.

Suppose first that $s_{*}/\phi(S_{*})^{2}\leq s_{0}/\phi(S_{0})^{2}$ . Let $\mathcal{T}_{1}$ denote the event in (B.2) with parameters (B.4), which by Lemma B.1(ii) satisfies $P_{\theta_{0}}(\mathcal{T}_{1})\to 1$ uniformly over all $\theta_{0},\theta_{*}$ in the theorem hypothesis. Set

[TABLE]

and note $E_{\theta_{0}}\widetilde{\Pi}(\Theta_{n})\leq E_{\theta_{0}}\widetilde{\Pi}(\Theta_{n})1_{\mathcal{T}_{1}}+o(1).$ We now apply Theorem 5 with this choice of $\Theta_{n}$ on the event $\mathcal{T}_{1}$ . For $\Delta_{*}$ defined in (B.3), it holds that $\Delta_{*}\leq(1+\tfrac{16}{\phi(S_{0})^{2}}\tfrac{\lambda}{\bar{\lambda}})s_{0}\log p$ by (B.5). Using Lemma B.6 below with $L+2=\rho_{n}$ thus gives

[TABLE]

for $p$ large enough depending on $A_{1},A_{3},A_{4}$ , and where $C,c>0$ also depend only on the prior parameters. Since $\mathcal{T}_{1}\subset\mathcal{T}_{0}$ by (B.2), condition (14) is satisfied on $\mathcal{T}_{1}$ with $\delta_{n}=c\rho_{n}\Delta_{*}$ . Applying Theorem 5 gives

[TABLE]

Note that the parameters (B.4) satisfy $\Gamma\log p\lesssim\Delta_{*}$ and $\varepsilon\lesssim\frac{\sqrt{s_{0}\log p}}{\|X\|\widetilde{\psi}_{L_{0}+2}(S_{0})^{2}\phi(S_{0})}$ . Using this and Lemma B.4 below,

[TABLE]

as required.

If $s_{*}/\phi(S_{*})^{2}>s_{0}/\phi(S_{0})^{2},$ then $\tfrac{\sqrt{s_{*}\log p}}{\phi(S_{*})}+\|X(\theta_{0}-\theta_{*})\|_{2}>\tfrac{\sqrt{s_{0}\log p}}{\phi(S_{0})}.$ The desired inequality then immediately follows from the stronger inequality with $\theta_{*}=\theta_{0}$ just established above. The results for $\ell_{1}$ and $\ell_{2}$ loss follow exactly as above by using the respective inequalities for the $\ell_{1}$ and $\ell_{2}$ oracle contraction rates in Lemma B.6 to establish (14).

Similarly, the results for the variational Bayes posteriors $\hat{Q}$ and $\widetilde{Q}$ based on the mean-field variational families (9) and (10) follow identically upon using Lemmas B.2 and B.3 instead of Lemma B.4 to control the Kullback-Leibler divergence. ∎

Theorem B.2 (Full oracle dimension).

Suppose the model selection prior (3) satisfies (4) and (5). For $\theta_{0}\in\mathbb{R}^{p}\backslash\{0\}$ , let $\theta_{*}\in\mathbb{R}^{p}$ be any vector satisfying $1\leq s_{*}=|S_{\theta_{*}}|\leq|S_{\theta_{0}}|=s_{0}$ and $\|X(\theta_{0}-\theta_{*})\|_{2}^{2}\leq(s_{0}-s_{*})\log p.$ Then the variational Bayes posterior $\widetilde{\Pi}$ satisfies, uniformly over all $\theta_{0}$ and $\theta_{*}$ as above,

[TABLE]

for any $\rho_{n}>0$ , where $\Gamma,L_{0}$ are given in (B.4). Furthermore, the exact same inequality holds for the variational Bayes posteriors $\widetilde{Q}$ and $\hat{Q}$ .

Proof.

The proof follows similarly to that of Theorem B.1 by applying Theorem 5 with

[TABLE]

again taking the event $A=\mathcal{T}_{1}$ and using Lemma B.5 with $M=\rho_{n}+2$ instead of Lemma B.6 to verify (14). ∎

Remark B.1 (Misspecification of the error distribution).

The Gaussian error distribution is assumed in model (1) for concreteness and can be relaxed. For recovery and dimension control (Theorems 1 and 2), inspection of the contraction rate proofs in [13] and the KL bounds in Section B.2 shows that it suffices that there exists a constant $C>0$ such that

[TABLE]

which holds for much more general noise distributions. This condition is commonly imposed when studying the LASSO, see e.g. [9]. For the full oracle bounds, we further need that Lemma 3 of [13], which concerns a change of measure, holds. This indeed holds under a wider range of noise distributions, see Remark 1 of [13]. The results for VB in this paper are thus robust under noise misspecification as for the true posterior [13], see also Section A.4 for an empirical study of noise misspecification for our method.

B.2 Kullback-Leibler divergences between variational classes and the posterior

We now show that on the event $\mathcal{T}_{1}$ in (B.2), we can bound the (minimized) Kullback-Leibler divergences between the posterior and the approximating variational classes. In particular, we need oracle-type bounds on the KL divergence to obtain our oracle results. This is the major technical difficulty in establishing our result. We first consider the family $\mathcal{Q}$ of distributions (9), which consists of products of non-diagonal multivariate normal distributions with Dirac delta distributions for a single fixed support set $S$ .

For a given model $S\subseteq\{1,\dots,p\}$ , let $X_{S}$ denote the $n\times|S|$ -submatrix of the full regression matrix $X$ , where we keep only the columns $X_{\cdot i}$ , $i\in S$ . Let $\hat{\theta}_{S}=(X_{S}^{T}X_{S})^{-1}X_{S}^{T}Y$ be the least squares estimator in the restricted model $Y=X_{S}\theta_{S}+Z$ . If the restricted model were correctly specified, then $\hat{\theta}_{S}$ would have distribution $N_{S}(\theta_{0,S},(X_{S}^{T}X_{S})^{-1})$ under $P_{\theta_{0}}$ . We approximate the posterior with a $N_{S}(\hat{\theta}_{S},(X_{S}^{T}X_{S})^{-1})\otimes\delta_{S^{c}}$ distribution, where $S$ is a suitable approximating set to which the posterior assigns sufficient probability.

Lemma B.2.

If $4e^{1+\Gamma\log p-\kappa}\leq 1$ , then the variational posterior $\hat{Q}$ arising from the family (9) satisfies

[TABLE]

Proof.

We construct our posterior approximation on the event $\mathcal{T}_{1}$ in (B.2). The posterior takes the form

[TABLE]

where the weights $\hat{q}=(\hat{q}_{S}:\,S\subseteq\{1,...,p\})$ lie in the $2^{p}$ -dimensional simplex and $\Pi_{S}(\cdot|Y)$ is the posterior for $\theta_{S}\in\mathbb{R}^{|S|}$ in the restricted model $Y=X_{S}\theta_{S}+Z$ . Since

[TABLE]

it follows that on $\mathcal{T}_{1}$ ,

[TABLE]

for all $p$ since $\Gamma>0$ . Note further that

[TABLE]

Together, the last two displays show that on $\mathcal{T}_{1}$ and for all $p$ , there exists a set $\tilde{S}$ satisfying

[TABLE]

Since an $N_{S}(\mu_{S},\Sigma_{S})\otimes\delta_{S^{c}}$ distribution is only absolutely continuous with respect to the $\hat{q}_{S}\Pi_{S}(\cdot|Y)\otimes\delta_{S^{c}}$ term of the posterior (B.6),

[TABLE]

where the last Kullback-Leibler divergence is over $|\tilde{S}|$ -dimensional distributions. On $\mathcal{T}_{1}$ , $\log(1/\hat{q}_{\tilde{S}})\leq\log(2ep^{\Gamma})=\log(2e)+\Gamma\log p$ . It thus remains to bound the second term in (B.8).

Let $E_{\mu_{S},\Sigma_{S}}$ denote the expectation under the law $\theta_{S}\sim N_{S}(\mu_{S},\Sigma_{S})$ . Setting

[TABLE]

one can check that the resulting normal distribution has density function proportional to $e^{-\frac{1}{2}\|Y-X_{\tilde{S}}\theta_{\tilde{S}}\|_{2}^{2}}$ , $\theta_{\tilde{S}}\in\mathbb{R}^{|\tilde{S}|}$ . Therefore,

[TABLE]

with $D_{\Pi}=\int_{\mathbb{R}^{|\tilde{S}|}}e^{-\frac{1}{2}\|Y-X_{\tilde{S}}\theta_{\tilde{S}}\|_{2}^{2}-\lambda\|\theta_{\tilde{S}}\|_{1}}d\theta_{\tilde{S}}$ and $D_{N}=\int_{\mathbb{R}^{|\tilde{S}|}}e^{-\frac{1}{2}\|Y-X_{\tilde{S}}\theta_{\tilde{S}}\|_{2}^{2}-\lambda\|\theta_{0,\tilde{S}}\|_{1}}d\theta_{\tilde{S}}$ the normalizing constants.

We firstly upper bound $\log(D_{\Pi}/D_{N})$ . Define

[TABLE]

Let $\bar{\theta}_{\tilde{S}}$ denote the extension of a vector $\theta_{\tilde{S}}\in\mathbb{R}^{|\tilde{S}|}$ to $\mathbb{R}^{p}$ with $\bar{\theta}_{\tilde{S},j}=\theta_{\tilde{S},j}$ for $j\in\tilde{S}$ and $\bar{\theta}_{\tilde{S},j}=0$ for $j\not\in\tilde{S}$ . On $\mathcal{T}_{1}$ , using (B.6) and (B.7),

[TABLE]

where the last inequality holds by assumption. Using Bayes' formula, this yields

[TABLE]

almost surely. In particular, $D_{\Pi}\leq 2\int_{B_{\tilde{S}}}e^{-\frac{1}{2}\|Y-X_{\tilde{S}}\theta_{\tilde{S}}\|_{2}^{2}-\lambda\|\theta_{\tilde{S}}\|_{1}}d\theta_{\tilde{S}}$ on $\mathcal{T}_{1}$ . Therefore on $\mathcal{T}_{1}$ ,

[TABLE]

where in the fourth inequality we have applied Cauchy-Schwarz.

We now turn to the first term in (B.10). On $\mathcal{T}_{1}$ , using the triangle inequality and Cauchy-Schwarz,

[TABLE]

since $E_{0,\Sigma_{\tilde{S}}}\|\theta_{\tilde{S}}\|_{2}^{2}=\text{Tr}(\Sigma_{\tilde{S}})$ . Let $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ denote the smallest and largest eigenvalues, respectively, of a symmetric, positive definite matrix $A$ . Using the variational characterization of maximal/minimal eigenvalues ([24], p. 234), for any $S\subseteq\{1,\dots,p\}$ ,

[TABLE]

Therefore,

[TABLE]

Under $P_{\theta_{0}}$ , using (1) and (B.9), the bias term can be decomposed as

[TABLE]

For $I$ , note first that the $\ell_{2}$ -operator norm of $(X_{\tilde{S}}^{T}X_{\tilde{S}})^{-1}$ is bounded by $1/(\|X\|^{2}\widetilde{\phi}(|{\tilde{S}}|)^{2})$ by (B.12). On $\mathcal{T}_{1}$ , using Cauchy-Schwarz,

[TABLE]

Together with (B.7), this gives

[TABLE]

Using the same bound on the $\ell_{2}$ -operator norm and (1), on the event $\mathcal{T}_{1}\subset\mathcal{T}_{0}$ it holds that

[TABLE]

Combining all of the above bounds and using that $|\tilde{S}|\leq\Gamma$ , on the event $\mathcal{T}_{1}$ ,

[TABLE]

Together with (B.10), the bound $\log(D_{\Pi}/D_{N})\leq 2\lambda\Gamma^{1/2}\varepsilon+\log 2$ derived above and that $\widetilde{\phi}(|\tilde{S}|)\leq\widetilde{\phi}(1)\leq 1$ , this yields

[TABLE]

Combining this with (B.8) and that $\log(1/\hat{q}_{\tilde{S}})\leq\log(2e)+\Gamma\log p$ completes the proof. ∎

We next consider the mean-field subclass $\mathcal{Q}_{MF}$ of $\mathcal{Q}$ given by (10). This again selects a single fixed support $S$ but further requires the fitted normal distribution to have diagonal covariance matrix. We consider a diagonal version of $N_{S}(\hat{\theta}_{S},(X_{S}^{T}X_{S})^{-1})\otimes\delta_{S^{c}}$ considered in Lemma B.2.

Lemma B.3.

If $4e^{1+\Gamma\log p-\kappa}\leq 1$ , then the variational posterior $\widetilde{Q}$ arising from the family (10) satisfies

[TABLE]

Proof.

We showed in the proof of Lemma B.2 that on the event $\mathcal{T}_{1}$ given in (B.2), there exists a set $\tilde{S}$ satisfying (B.7). Arguing as in (B.8),

[TABLE]

where the last Kullback-Leibler divergence is over the $|\tilde{S}|$ -dimensional distributions and $D_{\tilde{S}}$ ranges over diagonal positive definite matrices. On $\mathcal{T}_{1}$ and for all $p$ , we have $\log(1/\hat{q}_{\tilde{S}})\leq\log(2ep^{\Gamma})=\log(2e)+\Gamma\log p$ by (B.7).

The latter Kullback-Leibler divergence equals

[TABLE]

for any covariance matrix $\Sigma_{\tilde{S}}$ . For the first term in (B.13), the formula for the Kullback-Leibler divergence between two multivariate Gaussians gives

[TABLE]

where $|A|$ denotes the determinant of a square matrix $A$ . Set now $\mu_{\tilde{S}}=(X_{\tilde{S}}^{T}X_{\tilde{S}})^{-1}X_{\tilde{S}}^{T}Y$ , $\Sigma_{\tilde{S}}=(X_{\tilde{S}}^{T}X_{\tilde{S}})^{-1}$ as in (B.9) and define the diagonal matrix $D_{\tilde{S}}$ via $(D_{\tilde{S}})_{ii}=1/(\Sigma_{\tilde{S}}^{-1})_{ii}=1/(X_{\tilde{S}}^{T}X_{\tilde{S}})_{ii}$ . This gives $\text{Tr}(\Sigma_{\tilde{S}}^{-1}D_{\tilde{S}})=|\tilde{S}|$ , so that it remains to control $\tfrac{1}{2}\log(|\Sigma_{\tilde{S}}|/|D_{\tilde{S}}|)=\tfrac{1}{2}\log(|\Sigma_{\tilde{S}}||D_{\tilde{S}}^{-1}|)$ . For our choice of $D_{\tilde{S}}$ ,

[TABLE]

while for $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ the smallest and largest eigenvalues, respectively, of a matrix $A$ and using (B.12),

[TABLE]

This yields that $\text{KL}(N_{\tilde{S}}(\mu_{\tilde{S}},D_{\tilde{S}})\|N_{\tilde{S}}(\mu_{\tilde{S}},\Sigma_{\tilde{S}}))\leq|\tilde{S}|\log(1/\widetilde{\phi}(|\tilde{S}|))\leq\Gamma\log(1/\widetilde{\phi}(\Gamma))$ .
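As a small numerical illustration of this step, the sketch below evaluates $\text{KL}(N(\mu,D)\|N(\mu,\Sigma))=\tfrac{1}{2}[\text{Tr}(\Sigma^{-1}D)-k+\log(|\Sigma|/|D|)]$ in dimension $k=2$ for $\Sigma^{-1}=X_{\tilde{S}}^{T}X_{\tilde{S}}$ and $D_{ii}=1/(\Sigma^{-1})_{ii}$ . The function name `kl_diag_vs_full` and the example Gram matrices are assumptions made for this illustration.

```python
import math

def kl_diag_vs_full(gram):
    """KL(N(mu, D) || N(mu, Sigma)) with Sigma = gram^{-1}, D_ii = 1 / gram_ii (2x2)."""
    (a, b), (b2, c) = gram
    det_gram = a * c - b * b2
    if a <= 0 or c <= 0 or det_gram <= 0:
        raise ValueError("gram must be symmetric positive definite")
    # Tr(Sigma^{-1} D) = a * (1/a) + c * (1/c) = 2, which cancels the dimension k = 2.
    trace_term = a * (1.0 / a) + c * (1.0 / c)
    # |Sigma| / |D| = (1 / det_gram) * (a * c).
    log_det_ratio = math.log(a * c / det_gram)
    return 0.5 * (trace_term - 2.0 + log_det_ratio)

kl_correlated = kl_diag_vs_full([[2.0, 1.0], [1.0, 2.0]])
kl_orthogonal = kl_diag_vs_full([[2.0, 0.0], [0.0, 3.0]])
```

With this choice of $D$ only the log-determinant ratio remains: it vanishes when the columns of $X_{\tilde{S}}$ are orthogonal and grows as the Gram matrix becomes closer to singular.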

Note that the second term in (B.13) is identical to the expression (B.10), except that the expectation is taken under $\theta_{\tilde{S}}\sim N_{\tilde{S}}(\mu_{\tilde{S}},D_{\tilde{S}})$ instead of $\theta_{\tilde{S}}\sim N_{\tilde{S}}(\mu_{\tilde{S}},\Sigma_{\tilde{S}})$ . One may therefore use the exact same arguments as in Lemma B.2 with the only difference occurring in the second term in (B.11), where one instead has $\lambda E_{0,D_{\tilde{S}}}\|\theta_{\tilde{S}}\|_{1}\leq\lambda|\tilde{S}|^{1/2}(E_{0,D_{\tilde{S}}}\|\theta_{\tilde{S}}\|_{2}^{2})^{1/2}=\lambda|\tilde{S}|^{1/2}\text{Tr}(D_{\tilde{S}})^{1/2}$ . For $e_{i}$ the $i^{th}$ unit vector in $\mathbb{R}^{p}$ ,

[TABLE]

so that $\lambda|\tilde{S}|^{1/2}\text{Tr}(D_{\tilde{S}})^{1/2}\leq\lambda\Gamma/(\|X\|\widetilde{\phi}(1))$ . Combining the bounds as in Lemma B.2 then gives the result. ∎

Lemma B.4.

If $4e^{1+\Gamma\log p-\kappa}\leq 1$ , then the variational posterior $\widetilde{\Pi}$ arising from the family (7) of spike-and-slab distributions satisfies

[TABLE]

Proof.

Since $\mathcal{Q}_{MF}\subset\mathcal{P}_{MF}$ , we have $\text{KL}(\widetilde{\Pi}\|\Pi(\cdot|Y))\leq\text{KL}(\widetilde{Q}\|\Pi(\cdot|Y))$ . The result then follows from Lemma B.3. ∎

B.3 Oracle contraction rates for the original posterior distribution

Oracle-type contraction rates for the original posterior were established in Castillo et al. [13]. However, their results are not stated with exponential bounds as needed in (14), so we must reformulate them in order to apply our Theorem 5. The required exponential bounds in fact follow from their proofs; we recall here the required results and, since [13] is a rather technical article, we provide a brief explanation of why the exponential bounds hold.

Lemma B.5 (Theorem 10 of [13]).

Suppose the prior satisfies (4) and (5). Then for $p$ large enough depending on $A_{2},A_{4}$ , any $M>0$ and any $\theta_{0},\theta_{*}\in\mathbb{R}^{p}$ ,

[TABLE]

where $S_{*}=S_{\theta_{*}}$ and $\mathcal{T}_{0}$ is the event in (B.1).

Proof.

Following the proof of Theorem 10 of [13], one obtains using (6.3) and the second display on p. 2008 of [13] that for $\bar{\lambda}=2\|X\|\sqrt{\log p}$ , any $\theta_{*}$ and any measurable set $B\subseteq\mathbb{R}^{p}$ ,

[TABLE]

Setting now $B=\{\theta:|S_{\theta}|>R\}$ for $R\geq s_{*}$ , the third display on p. 2008 of [13] shows that

[TABLE]

for $p$ large enough that $4A_{2}/p^{A_{4}}<1$ . Substituting this into the second last display and using that $\bar{\lambda}^{2}=4\|X\|^{2}\log p$ ,

[TABLE]

Choosing $R=(2\delta+1)s_{*}-1+2\eta$ , the right-hand side equals

[TABLE]

Further picking $\delta=2M(1+16\lambda/(\bar{\lambda}\phi(S_{*})^{2}))/A_{4}$ and $\eta=2M\|X(\theta_{0}-\theta_{*})\|_{2}^{2}/(A_{4}\log p)$ , the right-hand side is bounded by

[TABLE]

for $p$ large enough depending on $A_{2},A_{4}$ , as required. ∎

The following result is a modified version of the oracle inequality in Theorem 3 of [13] with $S_{*}=S_{0}$ . Since it is stated somewhat differently in [13], we sketch why this is true.

Lemma B.6 (Theorem 3 of [13]).

Suppose the prior satisfies (4) and (5). Then there exists a constant $M>0$ such that for $p$ large enough, both depending only on $A_{1},A_{3},A_{4}$, any $L\geq 1$, and uniformly over all $\theta_{0},\theta_{*}\in\mathbb{R}^{p}$ with $|S_{\theta_{*}}|\leq|S_{\theta_{0}}|$,

[TABLE]

where $s_{0}=|S_{\theta_{0}}|$ , $s_{*}=|S_{\theta_{*}}|$ and $C=C(A_{2},A_{4})$ . Moreover, both

[TABLE]

satisfy the same inequality.

Proof.

Unless otherwise stated, we use here the notation from [13]. As on p. 2008 of [13], define the event $E=\{\theta:|S_{\theta}|\leq D_{*}\wedge D_{0}\}$ for

[TABLE]

where $\bar{\lambda}=2\|X\|\sqrt{\log p}$ and $D_{0}$ is the same expression with $\theta_{*}$ replaced by $\theta_{0}$ . Note that we take different constants than in (6.7) of [13] to obtain the required exponential tail bound. Lemma B.5 yields, with $M=L+2$ and since $s_{*}\leq s_{0}$ ,

[TABLE]

for every $\theta_{0}\in\mathbb{R}^{p}$ , so we can intersect the desired set with $E$ in what follows.

From definition (12), we have $\overline{\psi}_{L+2}(S_{0})=\overline{\phi}(D_{0}+s_{0})$ . Continuing through the proof, the third last display on p. 2009 of [13] (note that up to this point, the definitions of $D_{*}$ and $D_{0}$ only affect the definition of the compatibility type constants) gives

[TABLE]

where again $\bar{\lambda}=2\|X\|\sqrt{\log p}$ . By condition (4), $\sum_{s=0}^{p}\pi_{p}(s)2^{s}\leq\pi_{p}(0)\sum_{s=0}^{p}(2A_{2}p^{-A_{4}})^{s}\leq\pi_{p}(0)C(A_{2},A_{4})$ for $p$ large enough. Using this and taking $R^{2}=\overline{M}^{2}(D_{*}+s_{*})\log p/\overline{\psi}_{L+2}(S_{0})^{2}$ , the last display is bounded by

[TABLE]

where we have also used $\overline{\psi}_{L+2}(S_{0})\leq\overline{\phi}(1)\leq 1$ for any $S_{0}$ . Using the definition (B.14) of $D_{*}$ , that $\lambda/\bar{\lambda}\leq 2$ and the inequality $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for any $x,y\geq 0$ ,

[TABLE]

for a constant $C>0$ depending only on $A_{4}$ , yielding

[TABLE]

Combining this with the third last display gives

[TABLE]

for some $M>0$ large enough depending only on $A_{1},A_{3},A_{4}$ . Using $\overline{\psi}_{L+2}(S_{0})\leq 1$ and the definition (B.14), the probability in the last display is smaller than that in (B.15) if $4(L+2)/A_{4}\geq L$ . Considering these two cases separately establishes the required inequality for the prediction error $\|X(\theta-\theta_{0})\|_{2}$ .

For $\ell_{1}$ -loss, the result follows from that for prediction error and the first display on p. 2010 of [13].

For $\ell_{2}$ -loss, note that $\|X(\theta-\theta_{0})\|_{2}\geq\widetilde{\phi}(|S_{\theta-\theta_{0}}|)\|X\|\|\theta-\theta_{0}\|_{2}\geq\widetilde{\psi}_{L+2}(S_{0})\|X\|\|\theta-\theta_{0}\|_{2}$ for any $\theta\in E$ . The result then follows from that for prediction error and that $\overline{\psi}_{L+2}(S_{0})\geq\widetilde{\psi}_{L+2}(S_{0})$ by Lemma D.1. ∎

Appendix C Additional methodological details

C.1 Proofs for the variational algorithm

We provide here the derivations of the formulas used in the CAVI update equations of our variational algorithm in Section 4.

Proof of (16): We compute the Kullback-Leibler divergence between $P_{\bm{\mu},\bm{\sigma},\bm{\gamma}}$ and the posterior $\Pi(\cdot|Y)$ , conditional on $z_{i}=1$ , as a function of $\mu_{i}$ and $\sigma_{i}$ . Since the variational probability distribution of $\theta_{i}$ conditional on $z_{i}=1$ (i.e. $P_{\mu_{i},\sigma_{i}|z_{i}=1}$ ) is singular to the Dirac measure $\delta_{0}$ , in the Radon-Nikodym derivative $dP_{\mu_{i},\sigma_{i}|z_{i}=1}/d\Pi_{i}$ , where $\Pi_{i}$ is the prior for $\theta_{i}$ , it suffices to consider only the continuous part of the prior measure in the denominator. Write $d\Pi(\theta|Y)=D_{\Pi}^{-1}e^{-\|Y-X\theta\|_{2}^{2}/2}d\Pi(\theta)$ with $D_{\Pi}$ the normalizing constant. Using all of these and the prior product structure, $\text{KL}(P_{\bm{\mu},\bm{\sigma},\bm{\gamma}|z_{i}=1}\|\Pi(\cdot|Y))$ equals, as a function of $\mu_{i}$ and $\sigma_{i}$ ,

[TABLE]

where $C>0$ is independent of $\mu_{i},\sigma_{i}$ and $\overline{w}_{i}=a_{0}/(a_{0}+b_{0})$ is the prior mean for $w_{i}$ . Recall that the expected value of the folded normal distribution with parameters $\mu\in\mathbb{R}$ and $\sigma>0$ is $\sigma\sqrt{2/\pi}e^{-\mu^{2}/(2\sigma^{2})}+\mu(1-2\Phi(-\mu/\sigma))$ . Using this and explicitly evaluating the expectation of the first term, the last display equals

[TABLE]

where $C^{\prime}>0$ is again independent of $\mu_{i},\sigma_{i}$ . Minimizing the last display with respect to either $\mu_{i}$ or $\sigma_{i}$ (but not jointly) gives the same minimizers as minimizing $f_{i}$ and $g_{i}$ in (16).
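The folded normal expectation recalled in the proof of (16) can be sanity-checked numerically. The sketch below compares the closed form with a midpoint-rule approximation of $E|Z|$ for $Z\sim N(\mu,\sigma^{2})$; the function names, integration range and step count are illustrative choices, not part of the paper.

```python
import math

def folded_normal_mean(mu, sigma):
    """Closed form for E|Z| with Z ~ N(mu, sigma^2), as recalled in the proof."""
    def Phi(x):  # standard normal CDF
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return (sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu**2 / (2.0 * sigma**2))
            + mu * (1.0 - 2.0 * Phi(-mu / sigma)))

def folded_normal_mean_numeric(mu, sigma, half_width=12.0, steps=100_000):
    """Midpoint-rule approximation of the integral of |z| against the N(mu, sigma^2) density."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for j in range(steps):
        z = lo + (j + 0.5) * h
        density = math.exp(-(z - mu)**2 / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
        total += abs(z) * density * h
    return total

# Compare the two for a few (mu, sigma) pairs chosen purely for illustration
pairs = [(0.0, 1.0), (1.5, 0.7), (-2.0, 0.5)]
gaps = [abs(folded_normal_mean(m, s) - folded_normal_mean_numeric(m, s)) for m, s in pairs]
```

For $\mu=0$ the closed form reduces to $\sigma\sqrt{2/\pi}$, the mean of the half-normal distribution.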

Proof of (17): Similarly to the derivation of (16) above, the KL divergence between $P_{\bm{\mu},\bm{\sigma},\bm{\gamma}}$ and $\Pi(\cdot|Y)$ as a function of $\gamma_{i}$ equals

[TABLE]

where $C>0$ is independent of $\gamma_{i}$ and $\overline{w}_{i}=a_{0}/(a_{0}+b_{0})$ . Since on an event of $P_{\bm{\mu},\bm{\sigma},\bm{\gamma}}$ -probability one, $\theta_{i}=0$ if and only if $z_{i}=0$ , the last display equals

[TABLE]

where $C>0$ may change from line to line and is independent of $\gamma_{i}$ . Setting the derivative with respect to $\gamma_{i}$ of this last expression equal to zero and rearranging gives (17).

C.2 Algorithms for Gaussian slabs

We collect here for completeness the variational algorithms for the spike-and-slab prior with Gaussian slabs with which we have compared our method. First we give the component-wise update of the parameters as in [27], see Algorithm 2 below.

In [25] the authors argue that coordinate-wise parameter updates can accumulate error from each step leading to a suboptimal optimization procedure. To resolve this, they propose simultaneously updating the entire parameter vectors $\bm{\mu},\bm{\sigma}$ and $\bm{\lambda}$ without using a CAVI type of algorithm. A version of their proposed algorithm is given in Algorithm 3, where $diag(v)$ , $v\in\mathbb{R}^{p}$ , creates a diagonal square matrix in $\mathbb{R}^{p\times p}$ with diagonal elements $v$ (see also Algorithm 1 of [46] with $\alpha=1$ , $\sigma=1$ and $\nu_{1}=1$ for a related implementation). As in the other cases, we have taken the ridge regression estimator $(X^{T}X+I)^{-1}X^{T}Y$ as our initialization for $\mu$ .
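The ridge initialization mentioned above is straightforward to compute. The following is a minimal sketch, assuming a simulated design and response (the sizes, seed and signal below are hypothetical, loosely mirroring the simulation setup); it covers only the initialization of $\mu$, not the update steps of Algorithms 2 and 3.

```python
import numpy as np

def ridge_init(X, Y):
    """Ridge estimator (X^T X + I)^{-1} X^T Y used to initialise mu."""
    p = X.shape[1]
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)       # hypothetical toy data
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:s] = 10.0                    # a few non-zero coefficients
Y = X @ theta0 + rng.standard_normal(n)

mu_init = ridge_init(X, Y)
diag_example = np.diag(mu_init)      # diag(v): p x p matrix with v on the diagonal
```

Since $X^{T}X+I$ is positive definite even when $p>n$, the linear system always has a unique solution.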

Lastly, we provide the VB algorithm for the $\mathcal{Q}_{MF}$ mean-field variational class using Laplace slabs in the prior.

Appendix D Examples of compatible design matrices

In addition to the compatibility type constants defined in Section 2.3, we also consider a stronger invertibility condition involving the ‘mutual coherence’ of the design matrix, which is the maximal correlation between the different predictors in $X$ .

Definition D.1 (Mutual coherence).

The mutual coherence number is

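$$\text{mc}(X)=\max_{1\leq i\neq j\leq p}\frac{|\langle X_{\cdot i},X_{\cdot j}\rangle|}{\|X_{\cdot i}\|_{2}\|X_{\cdot j}\|_{2}}.$$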

While we do not actually use the mutual coherence in our results, it provides an easy way to understand the compatibility constants in Definitions 1-3 in several well-studied design matrix examples below. The following result relates these notions.

Lemma D.1 (Lemma 1 of [13]).

$\phi(S)^{2}\geq\overline{\phi}(1)^{2}-15|S|\text{mc}(X)$ , $\overline{\phi}(s)^{2}\geq\widetilde{\phi}(s)^{2}\geq\overline{\phi}(1)^{2}-s\text{mc}(X)$ .

By evaluating the infimum in Definition 2 at the unit vectors, one obtains $\widetilde{\phi}(1)=\overline{\phi}(1)=\min_{i}\|X_{\cdot i}\|_{2}/\|X\|=\min_{i\neq j}\|X_{\cdot i}\|_{2}/\|X_{\cdot j}\|_{2}$ , which is bounded away from zero if the columns of $X$ have comparable Euclidean norms. In this case, Lemma D.1 implies that the compatibility numbers and sparse singular values are bounded away from zero for models of size $O(1/\text{mc}(X))$ . The mutual coherence condition is thus the strongest of these notions. These conditions are illustrated via the following well-studied examples.
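The second inequality of Lemma D.1 can be illustrated on a small design. The sketch below makes several assumptions that should be checked against Section 2.3: it takes $\text{mc}(X)$ to be the largest absolute (uncentered) correlation between distinct columns, $\|X\|$ to be the largest column norm, and $\widetilde{\phi}(s)^{2}$ to be the smallest eigenvalue of $X_{S}^{T}X_{S}/\|X\|^{2}$ over models $S$ of size at most $s$. All function names and the toy design are hypothetical.

```python
import itertools
import numpy as np

def column_norms(X):
    return np.linalg.norm(X, axis=0)

def mutual_coherence(X):
    """Largest absolute correlation between two distinct columns of X."""
    norms = column_norms(X)
    corr = np.abs(X.T @ X) / np.outer(norms, norms)
    np.fill_diagonal(corr, 0.0)      # ignore each column's correlation with itself
    return corr.max()

def phi_bar_1(X):
    """min_i ||X_i||_2 / ||X||, assuming ||X|| is the largest column norm."""
    norms = column_norms(X)
    return norms.min() / norms.max()

def phi_tilde_sq(X, s):
    """Brute-force phi_tilde(s)^2 over all models of size s.

    By eigenvalue interlacing, models of size exactly s attain the minimum
    over all non-empty models of size at most s.
    """
    norm_sq = column_norms(X).max() ** 2
    smallest = min(
        np.linalg.eigvalsh(X[:, list(S)].T @ X[:, list(S)])[0]
        for S in itertools.combinations(range(X.shape[1]), s)
    )
    return smallest / norm_sq

rng = np.random.default_rng(3)
X = rng.standard_normal((400, 8))    # small design so that brute force is cheap
s = 3
lower_bound = phi_bar_1(X) ** 2 - s * mutual_coherence(X)   # Lemma D.1
exact = phi_tilde_sq(X, s)
```

The brute-force search is exponential in $s$ and is only meant for illustration; the lower bound becomes vacuous once $s\,\text{mc}(X)$ exceeds $\overline{\phi}(1)^{2}$.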

1. (Sequence model). We observe a vector $Y=(Y_{1},\dots,Y_{n})$ of independent random variables with $Y_{i}\sim N(\theta_{i},1)$. This corresponds to model (1) with $n=p$ and $X=I_{p}$ the identity matrix, so that $\|X\|=\|X_{\cdot i}\|_{2}=1$ for all $i$, the compatibility numbers are 1 and $\text{mc}(X)=0$. In this setting, all results below are valid for all sparsity levels.

2. (Sequence model, multiple observations). We observe $n$ independent $N(\theta_{i},\sigma_{n}^{2})$ random variables with $\sigma_{n}\to 0$. Defining $Y_{i}$ as $\sigma_{n}^{-1}$ times the original observations, this falls within the framework of model (1) with $X=\sigma_{n}^{-1}I_{p}$, so that $\|X\|=\|X_{\cdot i}\|_{2}=\sigma_{n}^{-1}$ for all $i$, the compatibility numbers are 1 and $\text{mc}(X)=0$, similar to Example 1.

3. (Regression with orthogonal design). If $X$ is an orthogonal design matrix such that $\langle X_{\cdot i},X_{\cdot j}\rangle=0$ for $i\neq j$, the regression problem can be transformed into a sequence model.

4. (Response model). Suppose the entries of the original regression matrix are i.i.d. random variables $W_{ij}$. We may then normalize the entries of the design matrix by defining $X_{ij}=W_{ij}/\|W_{\cdot j}\|_{2}$, so that the column lengths satisfy $\|X\|=\|X_{\cdot i}\|_{2}=1$ for all $i$. If $|W_{ij}|\leq C$ for a constant $C>0$ and $\log p=o(n)$, or $Ee^{t_{0}|W_{ij}|^{\alpha}}<\infty$ for some $\alpha,t_{0}>0$ and $\log p=o(n^{\alpha/(4+\alpha)})$, then Theorems 1 and 2 of [10] show that $\sqrt{n/\log p}\,\text{mc}(W)\stackrel{P}{\to}2$ as $n\to\infty$. Since $\text{mc}(W)=\text{mc}(X)$, this shows that for any $\varepsilon>0$, $P(\text{mc}(X)>(2+\varepsilon)\sqrt{(\log p)/n})\to 0$. Thus with probability approaching one, the compatibility numbers are bounded away from zero for sparsity levels $s_{n}=o(\sqrt{n/\log p})$.

A classic example is $W_{ij}\stackrel{iid}{\sim}N(0,1)$. In this case, the above bound on the mutual coherence holds as long as $\log p=o(n^{1/3})$.

5. By rescaling the columns of $X$, one can set the $p\times p$ matrix $C:=X^{T}X/n$ to take value one for all diagonal entries. Then $\|X\|=\|X_{\cdot i}\|_{2}=\sqrt{n}$ for all $i$ and the elements $C_{ij}$, $i\neq j$, are the correlations between columns. For some $m\in\mathbb{N}$, if $C_{ij}=r$ for a constant $0<r<(1+cm)^{-1}$ and all $i\neq j$ or $|C_{ij}|\leq c/(2m-1)$ for every $i\neq j$, then [49] show that models up to dimension $m$ satisfy the ‘strong irrepresentability condition’ and are hence estimable. In particular, $\text{mc}(X)=\max_{i\neq j}C_{ij}=O(1/m)$ and hence the compatibility numbers are bounded away from zero for sparsity levels $s_{n}=o(m)$.
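The Gaussian case of the response model in Example 4 can be explored by simulation. The sketch below draws $W_{ij}\stackrel{iid}{\sim}N(0,1)$, normalizes the columns and computes $\sqrt{n/\log p}\,\text{mc}(X)$. The sizes and seed are hypothetical, and mutual coherence is taken to be the largest absolute (uncentered) correlation between distinct columns, which is an assumption about Definition D.1. Since the limit 2 is asymptotic, the finite-sample value need not be close to it.

```python
import numpy as np

def mutual_coherence(X):
    """Largest absolute correlation between two distinct columns of X."""
    norms = np.linalg.norm(X, axis=0)
    corr = np.abs(X.T @ X) / np.outer(norms, norms)
    np.fill_diagonal(corr, 0.0)      # ignore each column's correlation with itself
    return corr.max()

rng = np.random.default_rng(2)       # hypothetical simulation settings
n, p = 2000, 200
W = rng.standard_normal((n, p))      # W_ij iid N(0, 1)
X = W / np.linalg.norm(W, axis=0)    # X_ij = W_ij / ||W_.j||_2, unit-norm columns

mc_W = mutual_coherence(W)
mc_X = mutual_coherence(X)           # invariant to column rescaling, so equals mc_W
scaled = np.sqrt(n / np.log(p)) * mc_X
```

Repeating the simulation over a grid of $(n,p)$ values gives a rough picture of how quickly the scaled mutual coherence stabilizes.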

Bibliography

[1] Alquier, P., and Ridgway, J. Concentration of tempered posteriors and of their variational approximations. Ann. Statist. 48, 3 (2020), 1475–1497.
[2] Banerjee, S., Castillo, I., and Ghosal, S. Survey paper: Bayesian inference in high-dimensional models.
[3] Belitser, E., and Ghosal, S. Empirical Bayes oracle uncertainty quantification for regression. Ann. Statist., to appear (2020).
[4] Belitser, E., and Nurushev, N. Needles and straw in a haystack: Robust confidence for possibly sparse sequences. Bernoulli 26, 1 (2020), 191–225.
[5] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: a review for statisticians. J. Amer. Statist. Assoc. 112, 518 (2017), 859–877.
[6] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (Mar. 2003), 993–1022.
[7] Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, Oxford, 2013.
[8] Breiman, L., and Friedman, J. H. Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80, 391 (1985), 580–619.
