Bayesian cumulative shrinkage for infinite factorizations

Sirio Legramanti; Daniele Durante; David B. Dunson

arXiv:1902.04349·stat.ME·September 11, 2020


TL;DR

This paper introduces a novel cumulative shrinkage prior for models with unknown dimensions, such as factor analysis, improving dimension recovery and model inference through theoretical and practical advantages.

Contribution

It proposes the cumulative shrinkage process, a new increasing shrinkage prior that enhances dimension inference in over-complete models like factor analysis.

Findings

01. Improved ability to recover the true model dimension.

02. Demonstrated advantages over existing methods in simulations.

03. Effective in real personality traits data analysis.


Tables

Table 1: Performance of cusp and mgp in 25 simulations for different $(p,H_{0})$ scenarios

$(p, H_{0})$	method	mse (median)	mse (iqr)	$E(H^{*} \mid y)$ (median)	$E(H^{*} \mid y)$ (iqr)	averaged ess (median)	runtime in s (median)
(20,5)	cusp	0.75	0.29	5.00	0.00	655.04	310.76
	mgp	0.75	0.32	19.69	0.21	547.23	616.61
(50,10)	cusp	2.25	0.33	10.00	0.00	273.55	716.23
	mgp	2.26	0.28	28.64	1.94	251.35	1845.88
(100,15)	cusp	3.76	0.40	15.00	0.00	175.26	2284.87
	mgp	3.97	0.45	34.38	2.92	116.10	5002.33

Table 2: Sensitivity analysis for cusp hyper-parameters $(\alpha,a_{\theta},b_{\theta},\theta_{\infty})$ in 25 simulations

$(p, H_{0})$	$(\alpha, a_{\theta}, b_{\theta}, \theta_{\infty})$	mse (median)	mse (iqr)	$E(H^{*} \mid y)$ (median)	$E(H^{*} \mid y)$ (iqr)	averaged ess (median)	runtime in s (median)
(20,5)	(2.5,2,2,0.05)	0.74	0.32	5.00	0.00	626.22	317.31
	(10,2,2,0.05)	0.74	0.33	5.00	0.00	636.61	314.82
	(5,2,1,0.05)	0.72	0.34	5.00	0.00	607.61	322.68
	(5,1,2,0.05)	0.79	0.30	5.00	0.00	602.28	309.39
	(5,2,2,0.025)	0.78	0.31	5.00	0.00	655.80	313.21
	(5,2,2,0.1)	0.74	0.30	5.00	0.04	604.88	315.51
(50,10)	(2.5,2,2,0.05)	2.25	0.40	10.00	0.00	280.39	719.11
	(10,2,2,0.05)	2.20	0.36	10.00	0.00	277.89	748.75
	(5,2,1,0.05)	2.16	0.42	10.00	0.00	266.82	722.67
	(5,1,2,0.05)	2.35	0.40	10.00	0.00	272.47	689.70
	(5,2,2,0.025)	2.22	0.35	10.00	0.00	280.60	717.19
	(5,2,2,0.1)	2.22	0.41	10.00	0.00	273.39	698.96
(100,15)	(2.5,2,2,0.05)	3.68	0.47	15.00	0.00	176.31	2247.44
	(10,2,2,0.05)	3.74	0.40	15.00	0.00	172.02	2205.78
	(5,2,1,0.05)	3.64	0.44	15.00	0.00	172.04	2287.32
	(5,1,2,0.05)	3.96	0.52	15.00	0.00	174.74	2178.47
	(5,2,2,0.025)	3.70	0.44	15.00	0.00	172.83	2200.20
	(5,2,2,0.1)	3.77	0.44	15.00	0.00	174.76	2284.80

Equations

$$(\theta_{h}\mid\pi_{h})\sim P_{h}=(1-\pi_{h})P_{0}+\pi_{h}\delta_{\theta_{\infty}},\qquad\pi_{h}=\sum_{l=1}^{h}\omega_{l},\qquad\omega_{l}=v_{l}\prod_{m=1}^{l-1}(1-v_{m}),$$

$$E(v_{h})=\frac{1}{1+\alpha},\qquad E(\omega_{h})=\frac{\alpha^{h-1}}{(1+\alpha)^{h}},\qquad E(\pi_{h})=1-\frac{\alpha^{h}}{(1+\alpha)^{h}}\qquad(h=1,2,\ldots).$$

$$E(\theta_{h})=E\{E(\theta_{h}\mid\pi_{h})\}=\{1-E(\pi_{h})\}\theta_{0}+E(\pi_{h})\theta_{\infty}=\theta_{\infty}+\{\alpha(1+\alpha)^{-1}\}^{h}(\theta_{0}-\theta_{\infty}),$$

$$\mbox{pr}(|\theta_{h}-\theta_{\infty}|>\varepsilon)=P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(\theta_{\infty})\}\{\alpha(1+\alpha)^{-1}\}^{h}.$$

$$E(H^{*})=\sum_{h=1}^{\infty}E(c_{h})=\sum_{h=1}^{\infty}E\{E(c_{h}\mid\pi_{h})\}=\sum_{h=1}^{\infty}E(1-\pi_{h})=\sum_{h=1}^{\infty}\{\alpha(1+\alpha)^{-1}\}^{h}=\alpha.$$

$$\mbox{pr}\{d_{\infty}(\theta,\theta^{(H)})>\varepsilon\}=\mbox{pr}\{\sup(|\theta_{h}|:h=H+1,H+2,\ldots)>\varepsilon\}\leq P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(0)\}\,\alpha\{\alpha(1+\alpha)^{-1}\}^{H},$$

$$\theta_{h}^{-1}=\prod_{l=1}^{h}\vartheta_{l}\quad(h=1,2,\ldots),\qquad\vartheta_{1}\sim\mbox{Ga}(a_{1},1),\qquad\vartheta_{l}\sim\mbox{Ga}(a_{2},1)\quad(l=2,3,\ldots).$$

$$(\theta_{h}\mid\pi_{h})\sim(1-\pi_{h})\mbox{InvGa}(a_{\theta},b_{\theta})+\pi_{h}\delta_{\theta_{\infty}},\qquad\pi_{h}=\sum_{l=1}^{h}\omega_{l},\qquad\omega_{l}=v_{l}\prod_{m=1}^{l-1}(1-v_{m}),$$

$$(\theta_{h}\mid z_{h})\sim\{1-\mathds{1}(z_{h}\leq h)\}\mbox{InvGa}(a_{\theta},b_{\theta})+\mathds{1}(z_{h}\leq h)\delta_{\theta_{\infty}},$$

$$\mbox{pr}(z_{h}=l\mid-)\propto\left\{\begin{array}{ll}\omega_{l}N_{p}(\lambda_{h};0,\theta_{\infty}I_{p}),&\qquad\text{for }l=1,\ldots,h,\\ \omega_{l}t_{2a_{\theta}}\{\lambda_{h};0,(b_{\theta}/a_{\theta})I_{p}\},&\qquad\text{for }l=h+1,\ldots,H,\end{array}\right.$$

$$\sup_{A\in\mathcal{B}(\Re)}|P_{0}(A)-P_{h}(A)|=\sup_{A\in\mathcal{B}(\Re)}|P_{0}(A)-(1-\pi_{h})P_{0}(A)-\pi_{h}\delta_{\theta_{\infty}}(A)|=\pi_{h}\sup_{A\in\mathcal{B}(\Re)}|P_{0}(A)-\delta_{\theta_{\infty}}(A)|.$$

$$E[P_{h}\{\bar{\mathbb{B}}_{\varepsilon}(\theta_{\infty})\}]=E[(1-\pi_{h})P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(\theta_{\infty})\}+\pi_{h}\delta_{\theta_{\infty}}\{\bar{\mathbb{B}}_{\varepsilon}(\theta_{\infty})\}]=P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(\theta_{\infty})\}\{1-E(\pi_{h})\}.$$

$$\mbox{pr}\{\cup_{h>H}(|\theta_{h}|>\varepsilon)\}\leq\sum_{h=H+1}^{\infty}\mbox{pr}(|\theta_{h}|>\varepsilon)=P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(0)\}\sum_{h=H+1}^{\infty}\{\alpha(1+\alpha)^{-1}\}^{h}.$$

$$|\lambda_{r\cdot}\lambda_{j\cdot}^{{\mathrm{\scriptscriptstyle T}}}|\leq\|\lambda_{r\cdot}\|\|\lambda_{j\cdot}\|\leq\max_{1\leq j\leq p}\|\lambda_{j\cdot}\|^{2}.$$

$$E(\|\lambda_{j\cdot}\|^{2})=\sum_{h=1}^{H}E(\lambda_{jh}^{2})=\sum_{h=1}^{H}E\{E(\lambda_{jh}^{2}\mid\theta_{h})\}=\sum_{h=1}^{H}E(\theta_{h}),$$

$$\mbox{pr}\{(\lambda_{jh}-\lambda_{0jh})^{2}<\epsilon_{1}^{2}/(pH),\ \mbox{for all }j=1,\ldots,p;\ h=1,\ldots,H\}=E\Big[\prod_{j=1}^{p}\prod_{h=1}^{H}\mbox{pr}\{(\lambda_{jh}-\lambda_{0jh})^{2}<\epsilon_{1}^{2}/(pH)\mid\theta\}\Big]>0.$$


Code & Models

Repositories

siriolegramanti/CUSP
Official


Full text

Bayesian cumulative shrinkage for infinite factorizations

Sirio Legramanti (Department of Decision Sciences, Bocconi University, 20136 Milan, Italy)

Daniele Durante (Department of Decision Sciences, Bocconi University, 20136 Milan, Italy)

David B. Dunson (Department of Statistical Science, Duke University, Durham, NC 27708, U.S.A.)

Abstract

There is a wide variety of models in which the dimension of the parameter space is unknown. For example, in factor analysis the number of latent factors is typically not known and has to be inferred from the observed data. Although classical shrinkage priors are useful in these contexts, increasing shrinkage priors can provide a more effective option, which progressively penalizes expansions with growing complexity. In this article we propose a novel increasing shrinkage prior, named the cumulative shrinkage process, for the parameters controlling the dimension in over-complete formulations. Our construction has broad applicability, simple interpretation, and is based on a sequence of spike and slab distributions which assign increasing mass to the spike as model complexity grows. Using factor analysis as an illustrative example, we show that this formulation has theoretical and practical advantages over current competitors, including an improved ability to recover the model dimension. An adaptive Markov chain Monte Carlo algorithm is proposed, and the methods are evaluated in simulation studies and applied to personality traits data. Code is available at https://github.com/siriolegramanti/CUSP.

Some key words: Factor analysis; Increasing shrinkage; Multiplicative gamma process; Spike and slab; Stick-breaking

1 Introduction

There has been considerable interest in shrinkage priors for high dimensional parameters (e.g., Ishwaran and Rao, 2005; Carvalho et al., 2010), but most of the focus has been on regression, where there is no natural ordering in the coefficients. There are several settings, however, where an order is present and desirable. Indeed, in statistical models relying on low-rank factorizations or basis expansions, such as factor models and tensor factorizations, it is natural to expect that additional dimensions play a progressively less important role in characterizing the data or model structure, and hence the associated parameters should have a stochastically decreasing effect. Such behavior can be induced through increasing shrinkage priors. For instance, in the context of Bayesian factor models an example of this approach can be found in the multiplicative gamma process developed by Bhattacharya and Dunson (2011) to penalize the effect of additional factor loadings via a cumulative product of gamma priors for their precision. Although this prior has been widely applied, there are practical disadvantages that motivate consideration of alternative solutions (Durante, 2017). In general, despite the importance of increasing shrinkage priors in many factorization models, the methods, theory and computational strategies for these priors remain under-developed.

Motivated by the above considerations, we propose a novel increasing shrinkage prior, the cumulative shrinkage process, which is broadly applicable, while having simple and parsimonious structure. The proposed prior induces increasing shrinkage via a sequence of spike and slab distributions assigning growing mass to the spike as model complexity grows. In Definition 1, we present this prior for the general case in which the effect of the $h$th dimension is controlled by a scalar parameter $\theta_{h}\in\Re$, so that redundant terms can be essentially deleted by progressively shrinking the sequence $\theta=\{\theta_{h}\in\Theta\subseteq\Re:h=1,2,\ldots\}$ towards an appropriate value $\theta_{\infty}\in\Re$. For example, in factor models $\theta_{h}\in\Re_{+}$ may denote the variance of the loadings for the $h$th factor, and the goal is to define a prior on these terms which favors stochastically decreasing impact of the factors via increasing concentration of the loadings near zero as $h$ grows.

Definition 1

Let $\theta=\{\theta_{h}\in\Theta\subseteq\Re:h=1,2,\ldots\}$ denote a countable sequence of parameters. We say that $\theta$ is distributed according to a cumulative shrinkage process with parameter $\alpha>0$ , starting slab distribution $P_{0}$ and target value $\theta_{\infty}$ if, conditionally on $\pi=\{\pi_{h}\in(0,1):h=1,2,\ldots\}$ , each $\theta_{h}$ is independent and has the following spike and slab distribution:

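$$(\theta_{h}\mid\pi_{h})\sim P_{h}=(1-\pi_{h})P_{0}+\pi_{h}\delta_{\theta_{\infty}},\qquad\pi_{h}=\sum_{l=1}^{h}\omega_{l},\qquad\omega_{l}=v_{l}\prod_{m=1}^{l-1}(1-v_{m}),\tag{1}$$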

where $v_{1},v_{2},\ldots$ are independent $\mbox{Beta}(1,\alpha)$ variables and $P_{0}$ is a diffuse continuous distribution.

Equation (1) exploits the stick-breaking construction of the Dirichlet process (Ishwaran and James, 2001). This implies that the probability $\pi_{h}$ assigned to the spike $\delta_{\theta_{\infty}}$ increases with the model dimension $h$, and that $\lim_{h\to\infty}\pi_{h}=1$ almost surely. Hence, as complexity grows, $P_{h}$ increasingly concentrates around $\theta_{\infty}$, which is specified to facilitate the deletion of redundant terms, while the slab $P_{0}$ corresponds to the prior on the active parameters. Definition 1 can be extended to sequences in $\Re^{p}$, and $\delta_{\theta_{\infty}}$ can be replaced with a continuous distribution, without affecting the key properties of the prior, which are presented in § 2. As we will discuss in § 2 and in § 3.1, it is also possible to restrict Definition 1 to finitely many terms $(\theta_{1},\ldots,\theta_{H})$ by letting $v_{H}=1$. In practical implementations, this truncated version typically ensures full flexibility if $H$ is set to a conservative upper bound, but this value can be extremely large in several high dimensional settings, thus motivating our initial focus on the infinite expansion and its theoretical properties.
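As an illustration of the stick-breaking construction in (1), the following minimal Python sketch draws the spike probabilities $\pi_{1},\ldots,\pi_{H}$ under the truncation $v_{H}=1$. The function name `cusp_spike_probabilities`, the seed and the values $\alpha=5$, $H=10$ are assumptions chosen for illustration; this is not the authors' R implementation.

```python
import random

def cusp_spike_probabilities(alpha, H, rng):
    """Draw (pi_1, ..., pi_H) from the stick-breaking construction in (1),
    truncated at H terms by setting v_H = 1 (hypothetical helper)."""
    pis = []
    remaining = 1.0   # prod_{m<l} (1 - v_m): part of the stick not yet assigned
    cumulative = 0.0  # pi_h = sum_{l<=h} omega_l
    for h in range(1, H + 1):
        v = 1.0 if h == H else rng.betavariate(1.0, alpha)  # v_h ~ Beta(1, alpha)
        omega = v * remaining                               # omega_h
        cumulative += omega
        remaining *= 1.0 - v
        pis.append(cumulative)
    return pis

rng = random.Random(0)  # illustrative seed
pi = cusp_spike_probabilities(alpha=5.0, H=10, rng=rng)
```

By construction the returned sequence is non-decreasing and its last element equals one up to floating-point error, which mirrors the statement that the $H$th term is assigned to the spike under truncation.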

2 General properties of the cumulative shrinkage process

We first motivate our cumulative stick-breaking construction for the sequence $\pi$ that controls the mass assigned to the spike in (1) as a function of model dimension. Indeed, one could alternatively consider pre-specified non-decreasing functions bounded between $0$ and $1$. However, we have found that such specifications are overly restrictive and have worse practical performance. The specification in (1) is purposely chosen to be effectively nonparametric, with Proposition 1 showing that the prior has large support on the space of non-decreasing sequences taking values in $(0,1)$. See the Appendix for proofs.

Proposition 1

Let $\Pi$ be the probability measure induced on $\pi=\{\pi_{h}\in(0,1):h=1,2,\ldots\}$ by (1), then $\Pi$ has large support on the whole space of non-decreasing sequences taking values in $(0,1)$ .

Besides being fully flexible, our construction for $\pi$ also has simple interpretation and allows control over shrinkage via an interpretable parameter $\alpha$ , as stated in Proposition 2 and in the subsequent results.

Proposition 2

Each $\pi_{h}$ in (1) coincides with the proportion of the total variation distance between the slab and the spike covered up to step $h$ , in the sense that $\pi_{h}=d_{\textsc{tv}}(P_{0},P_{h})/d_{\textsc{tv}}(P_{0},\delta_{\theta_{\infty}})$ for every $h$ .

Using similar arguments, we can obtain analogous expressions for $\omega_{h}$ and $v_{h}$ , which represent the proportions of the total $d_{\textsc{tv}}(P_{0},\delta_{\theta_{\infty}})$ and the remaining $d_{\textsc{tv}}(P_{h-1},\delta_{\theta_{\infty}})$ , respectively, covered between steps $h-1$ and $h$ . Specifically, $\omega_{h}=d_{\textsc{tv}}(P_{h-1},P_{h})/d_{\textsc{tv}}(P_{0},\delta_{\theta_{\infty}})$ and $v_{h}=d_{\textsc{tv}}(P_{h-1}{,}P_{h})/d_{\textsc{tv}}(P_{h-1},\delta_{\theta_{\infty}})$ for every $h$ . The expectations of these quantities are explicitly available as

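$$E(v_{h})=\frac{1}{1+\alpha},\qquad E(\omega_{h})=\frac{\alpha^{h-1}}{(1+\alpha)^{h}},\qquad E(\pi_{h})=1-\frac{\alpha^{h}}{(1+\alpha)^{h}}\qquad(h=1,2,\ldots).\tag{2}$$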

Moreover, combining (2) with Definition 1, the expectation of $\theta_{h}\ (h=1,2,\ldots)$ is

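$$E(\theta_{h})=E\{E(\theta_{h}\mid\pi_{h})\}=\{1-E(\pi_{h})\}\theta_{0}+E(\pi_{h})\theta_{\infty}=\theta_{\infty}+\{\alpha(1+\alpha)^{-1}\}^{h}(\theta_{0}-\theta_{\infty}),\tag{3}$$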

where $\theta_{0}$ defines the expected value under the slab $P_{0}$ . Hence, as $h$ grows, the prior expectation of $\theta_{h}$ converges exponentially towards the spike location $\theta_{\infty}$ . As stated in Lemma 1, a stronger notion of cumulative shrinkage in distribution, beyond simple concentration in expectation, also holds under (1).
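The closed forms for $E(\omega_{h})$ and $E(\pi_{h})$ can be checked in two steps. The derivation below is a sketch that uses only the independence of the $\mbox{Beta}(1,\alpha)$ variables $v_{1},v_{2},\ldots$ and a geometric sum; the numerical values $\alpha=5$ and $h=5$ are an illustrative choice rather than a result from the paper.

$$E(\omega_{h})=E(v_{h})\prod_{m=1}^{h-1}E(1-v_{m})=\frac{1}{1+\alpha}\Big(\frac{\alpha}{1+\alpha}\Big)^{h-1},\qquad E(\pi_{h})=\sum_{l=1}^{h}E(\omega_{l})=\frac{1}{1+\alpha}\cdot\frac{1-\{\alpha(1+\alpha)^{-1}\}^{h}}{1-\alpha(1+\alpha)^{-1}}=1-\Big(\frac{\alpha}{1+\alpha}\Big)^{h}.$$

For instance, with $\alpha=5$ and $h=5$ one has $\{\alpha(1+\alpha)^{-1}\}^{h}=(5/6)^{5}=3125/7776\approx 0.40$, so that $E(\pi_{5})\approx 0.60$ and $E(\theta_{5})\approx\theta_{\infty}+0.40\,(\theta_{0}-\theta_{\infty})$.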

Lemma 1

Let ${\mathbb{B}}_{\varepsilon}(\theta_{\infty})=\{\theta_{h}\in\Theta\subseteq\Re:|\theta_{h}-\theta_{\infty}|\leq\varepsilon\}$ denote an $\varepsilon$ -neighborhood around $\theta_{\infty}$ with radius $\varepsilon>0$ , and define with $\bar{\mathbb{B}}_{\varepsilon}(\theta_{\infty})$ the complement of ${\mathbb{B}}_{\varepsilon}(\theta_{\infty})$ . Then, for any $h=1,2,\ldots$ and $\varepsilon>0$ ,

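$$\mbox{pr}(|\theta_{h}-\theta_{\infty}|>\varepsilon)=P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(\theta_{\infty})\}\{\alpha(1+\alpha)^{-1}\}^{h}.\tag{4}$$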

Therefore, $\mbox{pr}(|\theta_{h+1}{-}\ \theta_{\infty}|\leq\varepsilon)>\mbox{pr}(|\theta_{h}{-}\ \theta_{\infty}|\leq\varepsilon)$ for any $\alpha>0$ , $h=1,2,\ldots$ and $\varepsilon>0$ .

Equations (2)–(4) highlight how the rate of increasing shrinkage is controlled by $\alpha$. In particular, lower values of $\alpha$ induce faster concentration around $\theta_{\infty}$ and hence more rapid shrinkage of the redundant terms. This control over the rate of increasing shrinkage via $\alpha$ is separated from the specification of the slab $P_{0}$, thereby allowing flexible modelling of the active terms. As discussed in Durante (2017), such a separation does not hold, for example, in the multiplicative gamma process (Bhattacharya and Dunson, 2011), whose hyper-parameters control both the rate of shrinkage and the prior for the active factors. This creates a trade-off between the need to maintain diffuse priors for the active terms and the attempt to shrink the redundant ones. Moreover, under the multiplicative gamma process, increasing shrinkage holds only in expectation and for specific hyper-parameters.

Instead, our prior ensures increasing shrinkage in distribution for any $\alpha$ , and can model any prior expectation on the number of active terms. In fact, $\alpha$ is equal to the prior mean of the number of terms in $\theta$ modelled via the slab $P_{0}$ . This result follows after noticing that $(\theta_{h}\mid\pi_{h})$ in (1) can be alternatively obtained by marginalizing out the augmented indicator $c_{h}\sim\mbox{Bern}(1-\pi_{h})$ in $(\theta_{h}\mid c_{h})\sim c_{h}P_{0}+(1-c_{h})\delta_{\theta_{\infty}}$ . According to this result, $H^{*}=\sum\nolimits_{h=1}^{\infty}c_{h}$ counts the number of active elements in $\theta$ , and its prior mean is

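$$E(H^{*})=\sum_{h=1}^{\infty}E(c_{h})=\sum_{h=1}^{\infty}E\{E(c_{h}\mid\pi_{h})\}=\sum_{h=1}^{\infty}E(1-\pi_{h})=\sum_{h=1}^{\infty}\{\alpha(1+\alpha)^{-1}\}^{h}=\alpha.$$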

Hence, $\alpha$ should be set to the expected number of active terms, while $P_{0}$ should be sufficiently diffuse to model active components, and $\theta_{\infty}$ should be chosen to facilitate the deletion of redundant ones.
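The identity $E(H^{*})=\alpha$ can be checked by simulation. The sketch below is a minimal Monte Carlo illustration in Python, assuming a truncation at $H=100$ terms (where the neglected tail is negligible for $\alpha=5$); the function name, seed and number of draws are hypothetical choices.

```python
import random

def sample_num_active(alpha, H, rng):
    """Draw H* = sum_h c_h with c_h ~ Bern(1 - pi_h), truncated at H terms
    by setting v_H = 1 (hypothetical helper, not the authors' code)."""
    remaining = 1.0  # after step h this equals 1 - pi_h = prod_{m<=h} (1 - v_m)
    active = 0
    for h in range(1, H + 1):
        v = 1.0 if h == H else rng.betavariate(1.0, alpha)
        remaining *= 1.0 - v
        if rng.random() < remaining:  # c_h = 1: theta_h is drawn from the slab
            active += 1
    return active

rng = random.Random(1)  # illustrative seed
alpha = 5.0
draws = [sample_num_active(alpha, H=100, rng=rng) for _ in range(5000)]
mc_mean = sum(draws) / len(draws)  # should be close to alpha
```

The Monte Carlo average `mc_mean` should fall close to $\alpha$, which is the sense in which $\alpha$ encodes the prior expected number of active terms.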

Recalling Bhattacharya and Dunson (2011) and Rousseau and Mengersen (2011), it is useful to define models with more than enough components and then choose shrinkage priors which favor effective deletion of the unnecessary ones. This choice protects against over-fitting and allows estimation of model dimension, bypassing the need for reversible jump (Lopes and West, 2004) or other computationally intensive strategies. Our cumulative shrinkage process in (1) provides a useful prior for this purpose. As discussed in § 1, it is straightforward to modify Definition 1 to instead restrict to $H$ components, by letting $v_{H}=1$, with $H$ a conservative upper bound. Theorem 1 provides theoretical support for such a truncated representation.

Theorem 1

If $\theta$ has prior (1) and $\theta^{(H)}$ denotes the sequence obtained by fixing $\theta_{h}=0$ in $\theta$ for every $h>H$ , then for any truncation index $H$ and $\varepsilon\geq|\theta_{\infty}|$ ,

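$$\mbox{pr}\{d_{\infty}(\theta,\theta^{(H)})>\varepsilon\}=\mbox{pr}\{\sup(|\theta_{h}|:h=H+1,H+2,\ldots)>\varepsilon\}\leq P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(0)\}\,\alpha\{\alpha(1+\alpha)^{-1}\}^{H},$$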

where $d_{\infty}$ is the sup-norm distance and $\bar{\mathbb{B}}_{\varepsilon}(0)$ is the complement of ${\mathbb{B}}_{\varepsilon}(0)=\{\theta_{h}\in\Theta\subseteq\Re:|\theta_{h}|\leq\varepsilon\}$ .

Hence, the prior probability of $\theta^{(H)}$ being close to $\theta$ converges to one at a rate which is exponential in $H$ , thus justifying posterior inference under finite sequences based on a conservative $H$ . Although the above bound holds for $\varepsilon\geq|\theta_{\infty}|$ , in general $\theta_{\infty}$ is set close to zero. Hence, Theorem 1 is valid also for small $\varepsilon$ .
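To get a feel for the exponential rate in Theorem 1, the following Python sketch evaluates the bound with the factor $P_{0}\{\bar{\mathbb{B}}_{\varepsilon}(0)\}$ replaced by its trivial upper limit of one, and searches for the smallest truncation index whose bound falls below a tolerance. The tolerance `tol` and both function names are assumptions introduced for illustration.

```python
def truncation_bound(alpha, H):
    """Upper bound alpha * {alpha / (1 + alpha)}**H from Theorem 1,
    using P0{...} <= 1 (so this is looser than the stated bound)."""
    return alpha * (alpha / (1.0 + alpha)) ** H

def smallest_truncation(alpha, tol):
    """Smallest H whose bound is below tol (tol is a hypothetical tolerance)."""
    H = 1
    while truncation_bound(alpha, H) >= tol:
        H += 1
    return H

H_min = smallest_truncation(alpha=5.0, tol=0.01)
```

Because the bound ignores the slab mass outside the $\varepsilon$-ball, the resulting index is conservative; smaller values of $\alpha$ shorten the required truncation.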

3 Cumulative shrinkage process for Gaussian factor models

3.1 Model formulation and prior specification

Definition 1 provides a general prior which can be used in different models (e.g., Gopalan et al., 2014) under appropriate choices of $P_{0}$ and $\theta_{\infty}$. Here, we focus on Gaussian sparse factor models as an important special case to illustrate our approach. We will compare primarily to the multiplicative gamma process, which has been devised specifically for this class of models and was shown to have practical gains in this context relative to several competitors, including the use of lasso (Tibshirani, 1996), elastic-net (Zou and Hastie, 2005) and banding approaches (Bickel and Levina, 2008). Although there are other priors for sparse factor models (e.g., Carvalho et al., 2008; Knowles and Ghahramani, 2011), these choices have practical disadvantages relative to the multiplicative gamma process, so they will not be considered further here.

The focus will be on performance in learning the structure of the $p\times p$ covariance matrix $\Omega=\Lambda\Lambda^{{\mathrm{\scriptscriptstyle T}}}+\Sigma$ for the data $y_{i}=(y_{i1},\ldots,y_{ip})^{{\mathrm{\scriptscriptstyle T}}}\in\Re^{p}$ generated from the Gaussian factor model ${y_{i}=\Lambda\eta_{i}+\epsilon_{i}}$, with $\eta_{ih}\sim N(0,1)$ $(i=1,\ldots,n;h=1,2,\ldots)$, $\epsilon_{i}\sim N_{p}(0,\Sigma)$ $(i=1,\ldots,n)$ and ${\Sigma=\mbox{diag}(\sigma_{1}^{2},\ldots,\sigma_{p}^{2})}$. To perform Bayesian inference for this model, Bhattacharya and Dunson (2011) assumed $\sigma^{2}_{j}\sim\mbox{InvGa}(a_{\sigma},b_{\sigma})$ $(j=1,\ldots,p)$, and $(\lambda_{jh}\mid\phi_{jh},\theta_{h})\sim N(0,\phi_{jh}\theta_{h})$ $(j=1,\ldots,p;h=1,2,\ldots)$ with scales $\phi_{jh}$ from independent $\mbox{InvGa}(\nu/2,\nu/2)$ priors and global precisions $\theta_{h}^{-1}$ having multiplicative gamma process prior

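$$\theta_{h}^{-1}=\prod_{l=1}^{h}\vartheta_{l}\quad(h=1,2,\ldots),\qquad\vartheta_{1}\sim\mbox{Ga}(a_{1},1),\qquad\vartheta_{l}\sim\mbox{Ga}(a_{2},1)\quad(l=2,3,\ldots).\tag{5}$$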

Specific choices of $(a_{1},a_{2})$ in (5) ensure that $E(\theta_{h})$ decreases with $h$ , thus allowing increasing shrinkage of the loadings as $h$ grows. Instead, we keep $\sigma^{2}_{j}\sim\mbox{InvGa}(a_{\sigma},b_{\sigma})$ $(j=1,\ldots,p)$ , but let $(\lambda_{jh}\mid\theta_{h})\sim N(0,\theta_{h})$ $(j=1,\ldots,p;\ h=1,2,\ldots)$ and place our cumulative shrinkage process prior on $\theta_{h}$ by assuming

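$$(\theta_{h}\mid\pi_{h})\sim(1-\pi_{h})\mbox{InvGa}(a_{\theta},b_{\theta})+\pi_{h}\delta_{\theta_{\infty}},\qquad\pi_{h}=\sum_{l=1}^{h}\omega_{l},\qquad\omega_{l}=v_{l}\prod_{m=1}^{l-1}(1-v_{m}),\tag{6}$$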

where $v_{1},v_{2},\ldots$ are independent $\mbox{Beta}(1,\alpha)$. Integrating out $\theta_{h}$, each loading $\lambda_{jh}$ has the marginal prior $(1-\pi_{h})t_{2a_{\theta}}(0,b_{\theta}/a_{\theta})+\pi_{h}N(0,\theta_{\infty})$, where $t_{2a_{\theta}}(0,b_{\theta}/a_{\theta})$ denotes the Student-$t$ distribution with $2a_{\theta}$ degrees of freedom, location $0$ and scale $b_{\theta}/a_{\theta}$. Hence, $\theta_{\infty}$ should be set close to zero to allow effective shrinkage of redundant factors, while $(a_{\theta},b_{\theta})$ should be specified so as to induce a moderately diffuse prior with scale $b_{\theta}/a_{\theta}$ for the active loadings. Although the choice $\theta_{\infty}=0$ is possible, we follow Ishwaran and Rao (2005) by suggesting $\theta_{\infty}>0$ to induce a continuous shrinkage prior on every $\lambda_{jh}$, which improves mixing and identification of the inactive factors. Exploiting the marginals for $\lambda_{jh}$, it also follows that, if $b_{\theta}/a_{\theta}>\theta_{\infty}$, then $\mbox{pr}(|\lambda_{j,h+1}|\leq\varepsilon)>\mbox{pr}(|\lambda_{jh}|\leq\varepsilon)$ for each $j=1,\ldots,p$, $h=1,2,\ldots$ and $\varepsilon>0$. This allows cumulative shrinkage in distribution also for the loadings, and provides guidelines on $(a_{\theta},b_{\theta})$ and $\theta_{\infty}$. Additional discussion on prior elicitation and empirical studies on sensitivity can be found in § 4.
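The Student-$t$ slab for the loadings can be verified directly. The sketch below assumes the parametrization in which $\mbox{InvGa}(a_{\theta},b_{\theta})$ has density proportional to $\theta_{h}^{-a_{\theta}-1}\exp(-b_{\theta}/\theta_{h})$, and integrates $\theta_{h}$ out of the conditional Gaussian for a single loading:

$$\int_{0}^{\infty}N(\lambda_{jh};0,\theta_{h})\,\mbox{InvGa}(\theta_{h};a_{\theta},b_{\theta})\,d\theta_{h}=\frac{\Gamma(a_{\theta}+1/2)}{\Gamma(a_{\theta})(2\pi b_{\theta})^{1/2}}\Big(1+\frac{\lambda_{jh}^{2}}{2b_{\theta}}\Big)^{-(a_{\theta}+1/2)}.$$

This matches a Student-$t$ density with $\nu=2a_{\theta}$ degrees of freedom in which $b_{\theta}/a_{\theta}$ enters through the product $\nu(b_{\theta}/a_{\theta})=2b_{\theta}$. Under the choice $a_{\theta}=b_{\theta}=2$ the slab is therefore a $t$ distribution with $4$ degrees of freedom and unit scale.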

To implement the analysis, we require a truncation $H$ on the number of factors needed to characterize $\Omega$ , as discussed in § 2. Theorem 2 states that our shrinkage process truncated at $H$ terms induces a well-defined prior for $\Omega$ with full-support, under the sufficient conditions that $H$ is greater than the true $H_{0}$ , and $E(\theta_{h})<\infty$ . These conditions are met when considering up to $p$ active factors, with $a_{\theta}>1$ and $\theta_{\infty}<\infty$ .

Theorem 2

Let $\Omega_{0}$ be any $p\times p$ covariance matrix and define with $\Pi$ the prior probability measure on $p\times p$ covariance matrices $\Omega$ induced by a Bayesian factor model having prior (6) on $\theta$ , truncated at $H$ with $v_{H}=1$ . If $E(\theta_{h})<\infty$ , then ${\Pi\{\Omega\in\Re^{p\times p}:\Omega\mbox{ has finite entries and is positive semi-definite}\}=1}$ . In addition, if there exists a decomposition $\Omega_{0}=\Lambda_{0}\Lambda_{0}^{{\mathrm{\scriptscriptstyle T}}}+\Sigma_{0}$ , such that $\Lambda_{0}\in\Re^{p\times H_{0}}$ and $H_{0}<H$ , then $\Pi\{B_{\varepsilon}^{\infty}(\Omega_{0})\}>0$ for any $\varepsilon>0$ , where $B_{\varepsilon}^{\infty}(\Omega_{0})$ is an $\varepsilon$ -neighborhood of $\Omega_{0}$ under the sup-norm.

Recalling Theorem 2 in Bhattacharya and Dunson (2011), this result is also sufficient to ensure that the posterior of $\Omega$ is weakly consistent (Schwartz, 1965).

3.2 Posterior computation via Gibbs sampling

Posterior inference for the factor model in § 3.1 with cumulative shrinkage process (6) truncated at $H$ terms for the loadings, proceeds via a Gibbs sampler cycling across the steps in Algorithm 1. This sampler relies on a data augmentation which exploits the fact that prior (6) can be obtained by marginalizing out the independent indicators $z_{h}\ (h=1,\ldots,H)$ with probabilities $\mbox{pr}(z_{h}=l\mid\omega_{l})=\omega_{l}$ $(l=1,\ldots,H)$ in

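$$(\theta_{h}\mid z_{h})\sim\{1-\mathds{1}(z_{h}\leq h)\}\mbox{InvGa}(a_{\theta},b_{\theta})+\mathds{1}(z_{h}\leq h)\delta_{\theta_{\infty}},\tag{7}$$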

where $\mathds{1}(z_{h}\leq h)=1$ if $z_{h}\leq h$ and $0$ otherwise. As is clear from Algorithm 1, conditioned on $z_{1},\ldots,z_{H}$, it is possible to sample from conjugate full-conditionals, whereas the updating of the augmented data relies on the full-conditional distribution

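$$\mbox{pr}(z_{h}=l\mid-)\propto\left\{\begin{array}{ll}\omega_{l}N_{p}(\lambda_{h};0,\theta_{\infty}I_{p}),&\qquad\text{for }l=1,\ldots,h,\\ \omega_{l}t_{2a_{\theta}}\{\lambda_{h};0,(b_{\theta}/a_{\theta})I_{p}\},&\qquad\text{for }l=h+1,\ldots,H,\end{array}\right.$$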

where $N_{p}(\lambda_{h};0,\theta_{\infty}I_{p})$ and $t_{2a_{\theta}}\{\lambda_{h};0,(b_{\theta}/a_{\theta})I_{p}\}$ are the densities of $p$-variate Gaussian and Student-$t$ distributions, respectively, evaluated at $\lambda_{h}=(\lambda_{1h},\ldots,\lambda_{ph})^{{\mathrm{\scriptscriptstyle T}}}$. Equations (10) are obtained by marginalizing out $\theta_{h}$, distributed as in (7), from the joint $N_{p}(\lambda_{h};0,\theta_{h}I_{p})$. These calculations are straightforward in a variety of Bayesian models based on conditionally conjugate constructions, thus making (1) a general prior which can be easily incorporated, for instance, in Poisson factorizations (Gopalan et al., 2014).
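The full conditional of $z_{h}$ can be evaluated on the log scale to avoid underflow. The following Python sketch is one possible implementation, not the authors' R code: the function names are hypothetical, the inverse gamma is assumed to have density proportional to $\theta^{-a_{\theta}-1}\exp(-b_{\theta}/\theta)$, and all weights $\omega_{l}$ are assumed strictly positive.

```python
import math

def log_gaussian_spike(lam, theta_inf):
    """log N_p(lam; 0, theta_inf * I_p)."""
    p = len(lam)
    q = sum(x * x for x in lam)
    return -0.5 * p * math.log(2.0 * math.pi * theta_inf) - 0.5 * q / theta_inf

def log_student_slab(lam, a_theta, b_theta):
    """log t_{2 a_theta}{lam; 0, (b_theta / a_theta) I_p}, i.e. N_p(lam; 0, theta I_p)
    with theta ~ InvGa(a_theta, b_theta) integrated out."""
    p = len(lam)
    q = sum(x * x for x in lam)
    return (math.lgamma(a_theta + 0.5 * p) - math.lgamma(a_theta)
            - 0.5 * p * math.log(2.0 * math.pi * b_theta)
            - (a_theta + 0.5 * p) * math.log1p(q / (2.0 * b_theta)))

def z_full_conditional(h, omega, lam, theta_inf, a_theta, b_theta):
    """Normalized pr(z_h = l | -) for l = 1, ..., H with H = len(omega)."""
    log_spike = log_gaussian_spike(lam, theta_inf)
    log_slab = log_student_slab(lam, a_theta, b_theta)
    logw = [math.log(w) + (log_spike if l <= h else log_slab)
            for l, w in enumerate(omega, start=1)]
    m = max(logw)  # subtract the maximum before exponentiating
    unnorm = [math.exp(x - m) for x in logw]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Illustrative inputs: a near-zero column and a clearly non-zero column.
omega = [0.4, 0.3, 0.2, 0.1]
probs_small = z_full_conditional(2, omega, [0.01, -0.02, 0.015], 0.05, 2.0, 2.0)
probs_large = z_full_conditional(2, omega, [1.5, -2.0, 0.8], 0.05, 2.0, 2.0)
```

For the near-zero column most of the mass falls on $l\leq h$, which corresponds to the spike, whereas for the column with large entries it falls on $l>h$, which corresponds to the slab.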

3.3 Tuning the truncation index via adaptive Gibbs sampling

Recalling § 3.1, it is reasonable to perform Bayesian inference with at most $p$ factors. Under our cumulative shrinkage process truncated at $H$ terms this translates into $H\leq p+1$ , since there are at most $H-1$ active factors, with the $H$ th one modelled with the spike by construction. However, this choice is too conservative, since we expect substantially fewer active factors than $p$ , especially when $p$ is very large. Hence, running Algorithm 1 with $H=p+1$ would be computationally inefficient, since most of the columns in $\Lambda$ would be modelled by the spike, thus providing a negligible contribution to the factorization of $\Omega$ .

Bhattacharya and Dunson (2011) addressed this issue via an adaptive Gibbs sampler which tunes $H$ as the sampler proceeds. To satisfy the diminishing adaptation condition in Roberts and Rosenthal (2007), they adapt $H$ at the iteration $t$ with probability $p(t)=\exp(\alpha_{0}+\alpha_{1}t)$, where $\alpha_{0}\leq 0$ and $\alpha_{1}<0$. This adaptation consists of dropping the inactive columns of $\Lambda$, if any, together with the corresponding parameters. If instead all columns are active, an extra factor is added, sampling the associated parameters from the prior.

This idea can be also implemented for the cumulative shrinkage process, as illustrated in Algorithm 2. Under our prior, the inactive $\Lambda$ columns are naturally identified as those modelled by the spike and, hence, have index $h$ such that $z_{h}\leq h$ . Under the multiplicative gamma process, instead, a column is flagged as inactive if all its entries are within distance $\epsilon$ from zero. This $\epsilon$ plays a similar role as our spike location $\theta_{\infty}$ . Indeed, lower values of $\epsilon$ and $\theta_{\infty}$ make it harder to discard inactive columns, thus affecting running time. Hence, although fixing $\theta_{\infty}$ close to zero is a key to enforce shrinkage, excessively low values should be avoided. Since under a truncated cumulative shrinkage process the number of active factors $H^{*}$ is at most $H-1$ , we increase $H$ by one when $H^{*}=H-1$ , and we decrease $H$ to $H^{*}+1$ when $H^{*}<H-1$ .

In our implementation no adaptation is allowed before a fixed number $\bar{t}$ of iterations to let the chain stabilize, while $H$ and $H^{*}$ are initialized to $p+1$ and $p$ , which is the maximum possible rank for $\Omega$ . Further guidance for the choice of $H$ can be obtained by monitoring how close $E(\pi_{H})$ is to 1, via (2).
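The adaptation rule described above can be summarized in a few lines. The Python sketch below is a simplified illustration under stated assumptions: it only computes $H^{*}$ and the new truncation index from the indicators $z_{1},\ldots,z_{H}$, and omits the bookkeeping needed to drop or add the corresponding columns of $\Lambda$ and their parameters. The function names are hypothetical, and the default $(\alpha_{0},\alpha_{1})$ are the values used in the simulations of § 4.

```python
import math

def adaptation_probability(t, alpha0=-1.0, alpha1=-5e-4):
    """p(t) = exp(alpha0 + alpha1 * t), decreasing in t when alpha1 < 0."""
    return math.exp(alpha0 + alpha1 * t)

def adapt_truncation(z, H):
    """Return (H_star, new_H) given indicators z = [z_1, ..., z_H].
    Column h is active when z_h > h, i.e. it is modelled by the slab."""
    H_star = sum(1 for h, z_h in enumerate(z, start=1) if z_h > h)
    if H_star == H - 1:
        return H_star, H + 1    # every column but the last is active: add one
    return H_star, H_star + 1   # keep the active columns plus one spike column

shrink = adapt_truncation([2, 3, 1, 2], H=4)  # columns 3 and 4 are inactive
grow = adapt_truncation([2, 3, 4, 4], H=4)    # only the last column is inactive
```

In a full sampler this step would be attempted at iteration $t$ only after the initial $\bar{t}$ iterations and with probability $p(t)$, so that adaptation becomes progressively rarer.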

4 Performance assessments of Gaussian factor models in simulations

We consider illustrative simulations to assess performance in learning the structure of the true covariance matrix $\Omega_{0}=\Lambda_{0}\Lambda^{{\mathrm{\scriptscriptstyle T}}}_{0}+\Sigma_{0}$ for the data $y=(y_{1},\ldots,y_{n})$ from a Gaussian factor model, with $\Sigma_{0}=I_{p}$ and the entries in $\Lambda_{0}\in\Re^{p\times H_{0}}$ drawn from independent $N(0,1)$. To study performance at varying dimensions, we consider three different combinations of $(p,H_{0})$: $(20,5)$, $(50,10)$ and $(100,15)$. For every pair $(p,H_{0})$, we sample 25 datasets of $n=100$ observations from $N_{p}(0,\Omega_{0})$ and, for each of the 25 replicates, we perform posterior inference on $\Omega$ via the Gaussian factor model in § 3.1 under both prior (5) and (6), exploiting the adaptive Gibbs sampler in Bhattacharya and Dunson (2011) and Algorithm 2, respectively.

For our cumulative shrinkage process, we set $\alpha=5$, $a_{\theta}=b_{\theta}=2$ and $\theta_{\infty}=0.05$, whereas for the multiplicative gamma process, we follow Durante (2017) by considering $(a_{1},a_{2})=(1,2)$, and set $\nu=3$ as done by Bhattacharya and Dunson (2011) in their simulations. For both models, $(a_{\sigma},b_{\sigma})$ are fixed at $(1,0.3)$ as in Bhattacharya and Dunson (2011). The truncation $H$ is initialized at $p$ for the multiplicative gamma process and at $p+1$ for the cumulative shrinkage process, both corresponding to at most $p$ active factors. For the two methods, adaptation is allowed only after $500$ iterations and, following Bhattacharya and Dunson (2011), the parameters $(\alpha_{0},\alpha_{1})$ are set to $(-1,-5\times 10^{-4})$, while the adaptation threshold $\epsilon$ in the multiplicative gamma process is $10^{-4}$. Both algorithms are run for $10000$ iterations after a burn-in of $5000$ and, by thinning every 5, we obtain a final sample of $2000$ draws from the posterior of $\Omega$. For each of the $25$ simulations in every scenario, we compute a Monte Carlo estimate of $\sum_{j=1}^{p}\sum_{q=j}^{p}E\{(\Omega_{jq}-\Omega_{0jq})^{2}\mid y\}/\{p(p+1)/2\}$ and $E(H^{*}\mid y)$. Since $E\{(\Omega_{jq}-\Omega_{0jq})^{2}\mid y\}=\{E(\Omega_{jq}\mid y)-\Omega_{0jq}\}^{2}+\mbox{var}(\Omega_{jq}\mid y)$, the posterior averaged mean square error accounts for both bias and variance in the posterior of $\Omega$.
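The posterior averaged mean square error above is an average over posterior draws and over the upper-triangular entries of $\Omega$. The Python sketch below shows one way to compute its Monte Carlo estimate from a list of sampled covariance matrices stored as nested lists; the function name and the toy $2\times 2$ inputs are assumptions for illustration and do not come from the paper.

```python
def posterior_averaged_mse(omega_draws, omega_true):
    """Monte Carlo estimate of
    sum_{j<=q} E{(Omega_jq - Omega0_jq)^2 | y} / {p (p + 1) / 2},
    averaging squared errors over draws and upper-triangular entries."""
    p = len(omega_true)
    n_draws = len(omega_draws)
    total = 0.0
    for draw in omega_draws:
        for j in range(p):
            for q in range(j, p):  # upper triangle, diagonal included
                total += (draw[j][q] - omega_true[j][q]) ** 2
    return total / (n_draws * p * (p + 1) / 2)

# Toy example with two hypothetical draws of a 2 x 2 covariance matrix.
omega_true = [[1.0, 0.0], [0.0, 1.0]]
omega_draws = [[[1.0, 0.5], [0.5, 1.0]],
               [[2.0, 0.0], [0.0, 1.0]]]
mse = posterior_averaged_mse(omega_draws, omega_true)
```

Because each squared error is averaged over draws before any summary is taken, the estimate reflects both the bias of the posterior mean and the posterior variance, in line with the decomposition stated above.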

Table 1 shows, for each scenario and model, the median and the interquartile range of the above quantities computed from the $25$ measures produced by the different simulations, together with the medians of the averaged effective sample sizes, out of $2000$ samples, and of the running times. Such quantities rely on an R implementation run on an Intel Core i7-3632QM CPU laptop with $7.7$ GB of RAM. The two methods have comparable mean square errors, but these measures and the performance gains of prior (6) over (5) increase with $H_{0}$. Our approach also provides some improvements in mixing and reduced running times. The latter is arguably due to the fact that the multiplicative gamma process overestimates $H^{*}$, hence keeping more parameters to update than necessary. Instead, our cumulative shrinkage process recovers the true dimension $H_{0}$ in all settings, thus efficiently tuning the truncation level $H$. Such an improved learning of the true underlying dimension is confirmed by the $95\%$ credible intervals, which are highly concentrated around $H_{0}$ in all the scenarios considered. The multiplicative gamma process leads instead to wider credible intervals for $H^{*}$, with none of them including $H_{0}$. As shown in Table 2, results are robust to moderate and reasonable changes in the hyper-parameters of the cumulative shrinkage process. We also tried to modify $\epsilon$ in Bhattacharya and Dunson (2011) so as to delete the columns of $\Lambda$ with values on the same scale as our spike. This setting provided lower estimates for $H^{*}$ and, hence, a computational time more similar to that of our cumulative shrinkage process, but led to worse mean square errors and still some difficulties in learning $H_{0}$.

5 Application of Gaussian factor models to personality data

We conclude with an application to a subset of the personality data available in the dataset bfi from the R package psych. Here, we focus on the association structure among $p=25$ personality self-report items collected on a 6-point response scale for $n=126$ individuals older than $50$ years. These variables represent answers to questions organized into five personality traits known as agreeableness, conscientiousness, extraversion, neuroticism, and openness. Recalling common implementations of factor models, we center the 25 items, and then replace variables $1,9,10,11,12,22$ and $25$ with their negated versions, as suggested in the R documentation of the bfi dataset, to have coherent answers within each personality trait. Posterior inference under priors (5)–(6) is performed with the same hyper-parameters and Gibbs settings as in § 4.
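The centering and sign-reversal step just described can be sketched as follows. This is a hedged Python illustration, not the code used for the analysis: the function name `preprocess_items` is hypothetical, the item positions are 1-based as listed in the text, and the sketch assumes complete rows with no missing answers.

```python
# 1-based positions of the items to be negated, as listed in the text.
REVERSED_ITEMS = (1, 9, 10, 11, 12, 22, 25)


def preprocess_items(rows, reversed_items=REVERSED_ITEMS):
    """Center each column, then flip the sign of the columns in `reversed_items`.

    `rows` is a list of observations, each a list of p numeric answers.
    Assumes no missing values (an assumption of this sketch).
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    signs = [-1.0 if (j + 1) in reversed_items else 1.0 for j in range(p)]
    return [[signs[j] * (r[j] - means[j]) for j in range(p)] for r in rows]


# Toy check with p = 3 items, reversing items 1 and 3.
toy_rows = [[1, 2, 3], [3, 4, 5]]
toy_out = preprocess_items(toy_rows, reversed_items=(1, 3))
```

Since negation and centering commute, applying the two steps in the opposite order would give the same result.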

Figure 1 shows posterior means and credible intervals for the absolute value of the entries in the correlation matrix $\bar{\Omega}$, under our model. Samples from $\bar{\Omega}$ are obtained by computing $\bar{\Omega}=(\Omega\odot I_{p})^{-\frac{1}{2}}\Omega(\Omega\odot I_{p})^{-\frac{1}{2}}$ for every sample of $\Omega=\Lambda\Lambda^{{\mathrm{\scriptscriptstyle T}}}+\Sigma$, with $\odot$ denoting the element-wise Hadamard product. Figure 1 highlights associations within each block of five answers measuring a main personality trait, while also showing interesting across-block correlations between agreeableness and extraversion as well as between conscientiousness and neuroticism. Openness has less evident within-block and across-block associations. These results suggest three main factors, as confirmed by the posterior mean and by the $95\%$ credible interval for $H^{*}$ under the cumulative shrinkage process, which are $2.84$ and $(2,3)$, respectively. Such posterior summaries are $24.01$ and $(18,25)$ under the multiplicative gamma process, but the higher $H^{*}$ does not lead to improved learning of $\bar{\Omega}$. In fact, when considering the Monte Carlo estimate of the mean squared deviations $\sum_{j=1}^{p}\sum_{q=j}^{p}E(\bar{\Omega}_{jq}-S_{jq})^{2}/\{p(p+1)/2\}$ from the sample correlation matrix $S$, we obtain $0.01$ under both (6) and (5), suggesting that the multiplicative gamma process might overestimate $H^{*}$ in this application. This leads to more redundant parameters to be updated in the adaptive Gibbs sampler, thus increasing the computational time from $400.69$ to $1321.04$ seconds. Our approach also increases the averaged effective sample size from $901.68$ to $1070.83$.
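The map from a covariance draw $\Omega$ to the correlation matrix $\bar{\Omega}$ amounts to rescaling rows and columns by the inverse square roots of the diagonal entries, since $\Omega\odot I_{p}$ is the diagonal part of $\Omega$. A minimal Python sketch, with the hypothetical function name `cov_to_corr` and a toy input, is given below; it would be applied to each posterior sample of $\Omega$ in turn.

```python
import math


def cov_to_corr(omega):
    """Return (Omega o I)^{-1/2} Omega (Omega o I)^{-1/2} for a covariance matrix.

    `omega` is a list of rows with strictly positive diagonal entries.
    """
    p = len(omega)
    # Inverse square roots of the diagonal of Omega.
    d = [1.0 / math.sqrt(omega[j][j]) for j in range(p)]
    return [[d[j] * omega[j][q] * d[q] for q in range(p)] for j in range(p)]


# Toy 2 x 2 covariance: variances 4 and 9, covariance 2, so the correlation is 2 / (2 * 3).
toy_cov = [[4.0, 2.0], [2.0, 9.0]]
toy_corr = cov_to_corr(toy_cov)
```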

Acknowledgement

The authors are grateful to the Editor, the Associate Editor and the referees for the useful suggestions, and acknowledge the support from MIUR (PRIN 2017 grant) as well as the United States Office of Naval Research and National Institutes of Health in the preparation of the final version of this article.

Appendix

Proof of Proposition 1. Since the mapping from the sequence $w=\{w_{h}\in(0,1):h=1,2,\ldots\}$ to $\pi=\{\pi_{h}\in(0,1):h=1,2,\ldots\}$ is one-to-one, it is sufficient to ensure that the stick-breaking prior for $w$ has full support on the infinite-dimensional simplex. This result is proved by Bissiri and Ongaro (2014) in § 3.2. $\Box$

Proof of Proposition 2. The proof adapts that of Theorem 1 in Canale et al. (2018). In fact, under the prior in Definition 1, the distance $d_{\textsc{tv}}(P_{0},P_{h})$ on the Borel $\sigma$-algebra in $\Re$ is equal to

[TABLE]

Hence $d_{\textsc{tv}}(P_{0},P_{h})=\pi_{h}d_{\textsc{tv}}(P_{0},\delta_{\theta_{\infty}})$, completing the proof. $\Box$

Proof of Lemma 1. Notice that, for each $h$, $\mbox{pr}(|\theta_{h}-\theta_{\infty}|>\varepsilon)$ can be equivalently expressed as

[TABLE]

Therefore, replacing $E(\pi_{h})$ with its expression in equation (2) leads to (4). To prove that $\mbox{pr}(|\theta_{h+1}{-}\theta_{\infty}|\leq\varepsilon)>\mbox{pr}(|\theta_{h}{-}\theta_{\infty}|\leq\varepsilon)$ it is sufficient to note that $\{\alpha(1+\alpha)^{-1}\}^{h+1}<\{\alpha(1+\alpha)^{-1}\}^{h}$. $\Box$
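The monotonicity used in the proof of Lemma 1 can be illustrated by simulation. The sketch below assumes the standard stick-breaking construction referred to in the main text but not restated in this appendix, namely $\pi_{h}=1-\prod_{l=1}^{h}(1-v_{l})$ with independent $v_{l}\sim\mbox{Beta}(1,\alpha)$, under which $E(\pi_{h})=1-\{\alpha(1+\alpha)^{-1}\}^{h}$. The function names, the number of simulations and the seed are hypothetical choices made for illustration.

```python
import random


def expected_spike_prob(alpha, h):
    """Closed form E(pi_h) = 1 - {alpha / (1 + alpha)}^h under the assumed construction."""
    return 1.0 - (alpha / (1.0 + alpha)) ** h


def simulate_spike_prob(alpha, h_max, n_sims=20000, seed=0):
    """Monte Carlo estimate of E(pi_h) for h = 1, ..., h_max via stick-breaking."""
    rng = random.Random(seed)
    totals = [0.0] * h_max
    for _ in range(n_sims):
        remaining = 1.0  # running product of (1 - v_l)
        for h in range(h_max):
            v = rng.betavariate(1.0, alpha)
            remaining *= 1.0 - v
            totals[h] += 1.0 - remaining  # pi_h = 1 - prod_{l<=h} (1 - v_l)
    return [t / n_sims for t in totals]


alpha = 5.0  # value of alpha used in the simulation study
exact = [expected_spike_prob(alpha, h) for h in range(1, 11)]
approx = simulate_spike_prob(alpha, h_max=10)
```

Here `exact` is strictly increasing in $h$, and `approx` should agree with it up to Monte Carlo error, reflecting the growing mass assigned to the spike as $h$ increases.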

Proof of Theorem 1. The proof follows after noting that $\mbox{pr}(\sup_{h>H}|\theta_{h}|>\varepsilon)=\mbox{pr}\{\cup_{h>H}(|\theta_{h}|>\varepsilon)\}$, and that $\delta_{\theta_{\infty}}\{\bar{\mathbb{B}}_{\varepsilon}(0)\}=0$ for any $\varepsilon\geq|\theta_{\infty}|$. Hence, adapting the proof of Lemma 1, we obtain

[TABLE]

To conclude the proof, notice that $\sum_{h=H+1}^{\infty}\{\alpha(1+\alpha)^{-1}\}^{h}=\alpha\{\alpha(1+\alpha)^{-1}\}^{H}$. $\Box$
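For completeness, the closing identity in the proof of Theorem 1 is a geometric series. Writing $r=\alpha(1+\alpha)^{-1}\in(0,1)$, a shorthand introduced here only for this worked step, we have $1-r=(1+\alpha)^{-1}$ and therefore

$$\sum_{h=H+1}^{\infty}r^{h}=\frac{r^{H+1}}{1-r}=(1+\alpha)\,r^{H+1}=(1+\alpha)\,\frac{\alpha}{1+\alpha}\,r^{H}=\alpha\,r^{H},$$

so the sum decays geometrically in the truncation level $H$.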

Proof of Theorem 2. Let us first prove that for the Gaussian factor model in § 3.1, with prior (6) truncated at $H$ terms, we have ${\Pi\{\Omega\in\Re^{p\times p}:\Omega\mbox{ has finite entries and is positive semi-definite}\}=1}$. Since, by construction, $\Sigma$ is diagonal with almost surely finite and non-negative entries, and $\Lambda\Lambda^{{\mathrm{\scriptscriptstyle T}}}$ is trivially positive semi-definite, we only need to ensure that each entry $\lambda_{r\cdot}\lambda_{j\cdot}^{{\mathrm{\scriptscriptstyle T}}}$ in $\Lambda\Lambda^{{\mathrm{\scriptscriptstyle T}}}$ is almost surely finite. By the Cauchy–Schwarz inequality we obtain

[TABLE]

Under the factor model in § 3.1 with prior (6) truncated at $H$ terms, we have that

[TABLE]

for every $j=1,\ldots,p$, including the index of the maximum, thus ensuring that each entry in $\Lambda\Lambda^{{\mathrm{\scriptscriptstyle T}}}$ is almost surely finite under the sufficient condition that $E(\theta_{h})<\infty$ $(h=1,\ldots,H)$. This holds when $a_{\theta}>1$ and $\theta_{\infty}<\infty$.

Let us now prove the full support for $\Pi$. Since $H>H_{0}$, there always exist a $\Lambda\in\Re^{p\times H}$ and a positive diagonal matrix $\Sigma$ such that $\Lambda\Lambda^{\mathrm{\scriptscriptstyle T}}+\Sigma=\Lambda_{0}\Lambda_{0}^{\mathrm{\scriptscriptstyle T}}+\Sigma_{0}$. For instance, one can let $\Sigma=\Sigma_{0}$ and $\Lambda=[\Lambda_{0},0_{p\times(H-H_{0})}]$. Hence, it suffices to prove full support for the priors induced on $\Lambda$ and $\Sigma$ by the truncated version of our cumulative shrinkage process. Such a property easily holds for $\Sigma$, whose diagonal elements $\sigma_{j}^{2}\ (j=1,\ldots,p)$ have independent inverse-gamma priors. Moreover, adapting the proof of Proposition 2 in Bhattacharya and Dunson (2011), full support can also be proved for the prior induced on $\Lambda$. Indeed, recalling § 3.1, we have that $\mbox{pr}\{\sum_{j=1}^{p}\sum_{h=1}^{H}(\lambda_{jh}-\lambda_{0jh})^{2}<\epsilon_{1}^{2}\}\geq\mbox{pr}\{(\lambda_{jh}-\lambda_{0jh})^{2}<\epsilon_{1}^{2}/(pH),\mbox{ for all }j=1,\ldots,p;\ h=1,\ldots,H\}$ with

[TABLE]

In fact, conditioned on $\theta=(\theta_{1},\ldots,\theta_{H})$, the entries $\lambda_{jh}$ are independent, each with a $N(0,\theta_{h})$ distribution. $\Box$
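The zero-padding construction $\Lambda=[\Lambda_{0},0_{p\times(H-H_{0})}]$ used in the proof of Theorem 2 leaves $\Lambda\Lambda^{\mathrm{\scriptscriptstyle T}}$ unchanged, since the added columns contribute only zero terms to each inner product. A small Python check, with hypothetical helper names `pad_loadings` and `gram` and a toy $\Lambda_{0}$, is sketched below.

```python
def gram(lam):
    """Return Lambda Lambda^T for a loadings matrix stored as a list of rows."""
    p = len(lam)
    return [
        [sum(a * b for a, b in zip(lam[j], lam[q])) for q in range(p)]
        for j in range(p)
    ]


def pad_loadings(lam0, h):
    """Append h - H_0 zero columns to Lambda_0, giving a p x h matrix."""
    h0 = len(lam0[0])
    if h < h0:
        raise ValueError("the truncation level must be at least H_0")
    return [row + [0.0] * (h - h0) for row in lam0]


# Toy Lambda_0 with p = 3 rows and H_0 = 2 columns, padded to H = 4 columns.
toy_lam0 = [[1.0, -0.5], [0.3, 2.0], [0.0, 1.5]]
toy_lam = pad_loadings(toy_lam0, h=4)
```

With these inputs, `gram(toy_lam)` and `gram(toy_lam0)` coincide entry by entry.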

Bibliography

Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models. Biometrika, 98:291–306.
Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist., 36(1):199–227.
Bissiri, P. G. and Ongaro, A. (2014). On the topological support of species sampling priors. Electron. J. Statist., 8(1):861–882.
Canale, A., Durante, D., and Dunson, D. B. (2018). Convex mixture regression for quantitative risk assessment. Biometrics, 74:1331–1340.
Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008). High-dimensional sparse factor modeling: applications in gene expression genomics. J. Am. Statist. Assoc., 103:1438–1456.
Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97:465–480.
Durante, D. (2017). A note on the multiplicative gamma process. Statist. Probabil. Lett., 122:198–204.
Gopalan, P., Ruiz, F. J., Ranganath, R., and Blei, D. (2014). Bayesian nonparametric Poisson factorization for recommendation systems. J. Mach. Learn. Res. W&CP, 33:275–283.