Bayesian Factor-adjusted Sparse Regression
Jianqing Fan, Bai Jiang, Qiang Sun

TL;DR
This paper introduces a Bayesian factor-adjusted sparse regression approach for high-dimensional data with correlated covariates, effectively modeling common factors and idiosyncratic components to improve variable selection and prediction.
Contribution
It develops a novel Bayesian method using spike-and-slab priors for factor-adjusted regression, addressing limitations of traditional sparsity assumptions in correlated settings.
Findings
Outperforms lasso in simulations
Insensitive to overestimation of factors
Scales well with data size and sparsity
Abstract
This paper investigates the high-dimensional linear regression with highly correlated covariates. In this setup, the traditional sparsity assumption on the regression coefficients often fails to hold, and consequently many model selection procedures do not work. To address this challenge, we model the variations of covariates by a factor structure. Specifically, strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components of each covariate. This leads to a factor-adjusted regression model with both common factors and idiosyncratic components as covariates. We generalize the traditional sparsity assumption accordingly and assume that all common factors but only a small number of idiosyncratic components contribute to the response. A Bayesian procedure with a spike-and-slab prior is then proposed for…
| Method | estimation ( error) | model selection rate | sure screening rate | average model size | estimation (relative error) |
|---|---|---|---|---|---|
| generic Bayes, | 1.536 | 0.0% | 100.0% | 15.92 | 5.940 |
| Factor-adjusted Bayes, | 0.124 | 88.2% | 100.0% | 5.13 | 2.057 |
| Factor-adjusted Bayes, | 0.124 | 87.0% | 100.0% | 5.14 | 2.058 |
| Factor-adjusted Bayes, | 0.130 | 86.4% | 100.0% | 5.14 | 2.057 |
| Factor-adjusted Bayes, | 0.133 | 86.1% | 100.0% | 5.14 | 2.069 |
| generic lasso, | 1.189 | 0% | 100% | 92.57 | 3.187 |
| Factor-adjusted lasso, | 0.460 | 27% | 100% | 21.20 | 1.899 |
| Factor-adjusted lasso, | 0.463 | 25% | 100% | 21.56 | 1.865 |
| Factor-adjusted lasso, | 0.467 | 27% | 100% | 26.07 | 1.787 |
| Factor-adjusted lasso, | 0.466 | 24% | 100% | 29.21 | 1.657 |
| Method | estimation ( error) | model selection rate | sure screening rate | average model size | estimation (relative error) |
|---|---|---|---|---|---|
| generic Bayes, | 0.092 | 90.5% | 100.0% | 5.10 | 0.881 |
| Factor-adjusted Bayes, | 0.095 | 87.9% | 100.0% | 5.13 | 0.906 |
| Factor-adjusted Bayes, | 0.097 | 88.4% | 100.0% | 5.12 | 0.914 |
| Factor-adjusted Bayes, | 0.098 | 88.5% | 100.0% | 5.12 | 0.932 |
| Factor-adjusted Bayes, | 0.102 | 88.9% | 100.0% | 5.12 | 0.968 |
| generic lasso, | 0.495 | 53% | 100% | 11.72 | 1.302 |
| Factor-adjusted lasso, | 0.498 | 61% | 100% | 11.98 | 1.248 |
| Factor-adjusted lasso, | 0.500 | 56% | 100% | 13.28 | 1.279 |
| Factor-adjusted lasso, | 0.481 | 56% | 100% | 12.48 | 1.141 |
| Factor-adjusted lasso, | 0.487 | 58% | 100% | 13.62 | 1.114 |
| Method | estimation ( error) | model selection rate | sure screening rate | average model size | estimation (relative error) |
|---|---|---|---|---|---|
| generic Bayes, | 0.070 | 91.5% | 100.0% | 5.09 | 0.913 |
| Factor-adjusted Bayes, | 0.090 | 91.1% | 100.0% | 5.09 | 1.690 |
| Factor-adjusted Bayes, | 0.091 | 90.7% | 100.0% | 5.10 | 1.715 |
| Factor-adjusted Bayes, | 0.093 | 90.4% | 100.0% | 5.10 | 1.733 |
| Factor-adjusted Bayes, | 0.095 | 89.8% | 100.0% | 5.11 | 1.763 |
| generic lasso, | 0.734 | 13% | 100% | 10.18 | 3.266 |
| Factor-adjusted lasso, | 0.454 | 53% | 100% | 15.17 | 1.125 |
| Factor-adjusted lasso, | 0.471 | 57% | 100% | 14.96 | 1.193 |
| Factor-adjusted lasso, | 0.465 | 48% | 100% | 16.65 | 1.139 |
| Factor-adjusted lasso, | 0.492 | 55% | 100% | 18.06 | 1.213 |
| Method | 2-yr bond | 3-yr bond | 4-yr bond | 5-yr bond |
| PCR | 0.646 | 0.603 | 0.568 | 0.540 |
| generic Bayes | 0.765 | 0.734 | 0.722 | 0.696 |
| factor-adjusted Bayes | 0.775 | 0.753 | 0.747 | 0.726 |
| generic lasso | 0.719 | 0.717 | 0.701 | 0.688 |
| factor-adjusted lasso | 0.766 | 0.764 | 0.746 | 0.719 |
| Method | 2-yr bond | 3-yr bond | 4-yr bond | 5-yr bond |
| generic Bayes | 12.97 | 12.97 | 13.13 | 13.05 |
| factor-adjusted Bayes | 11.04 | 11.39 | 11.63 | 11.41 |
| generic lasso | 24.06 | 24.25 | 25.62 | 25.71 |
| factor-adjusted lasso | 34.46 | 35.12 | 36.91 | 36.57 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Inference · Monetary Policy and Economic Impact · Advanced Statistical Methods and Models
Bayesian Factor-adjusted Sparse Regression
Jianqing Fan11footnotemark: 1, Bai Jiang, and Qiang Sun Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544; E-mail: [email protected], [email protected] of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3; E-mail: [email protected].
(January 31, 2019)
Abstract
This paper investigates the high-dimensional linear regression with highly correlated covariates. In this setup, the traditional sparsity assumption on the regression coefficients often fails to hold, and consequently many model selection procedures do not work. To address this challenge, we model the variations of covariates by a factor structure. Specifically, strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components of each covariate. This leads to a factor-adjusted regression model with both common factors and idiosyncratic components as covariates. We generalize the traditional sparsity assumption accordingly and assume that all common factors but only a small number of idiosyncratic components contribute to the response. A Bayesian procedure with a spike-and-slab prior is then proposed for parameter estimation and model selection. Simulation studies show that our Bayesian method outperforms its lasso analogue, manifests insensitivity to the overestimates of the number of common factors, pays a negligible price in the no correlation case, and scales up well with increasing sample size, dimensionality and sparsity. Numerical results on a real dataset of U.S. bond risk premia and macroeconomic indicators lend strong support to our methodology.
keywords: factor model, Bayesian sparse regression, posterior convergence rate, model selection.
1 Introduction
High-dimensional linear models are useful for a wide arrays of economic problems (Fan et al., 2011; Belloni et al., 2012). These models typically assume the sparsity of regression coefficients, that is, only a small number of covariates have significant effects on the response. However, the explanatory variables in the panel of an economic dataset are often highly correlated due to the influence of latent common factors, rendering the sparsity assumption unreasonable and restrictive. To address this issue, this paper proposes a general regression model with a factor-adjusted sparsity assumption, and develops a Bayesian method for this model.
To motivate the factor-adjusted model and its corresponding methodology, we start with the standard linear regression model
[TABLE]
where is an response vector, is a design matrix of observations and covariates, is a -dimensional vector of regression coefficients, is an unknown standard deviation, and is an -dimensional standard Gaussian random vector, independent of . Without loss of generality, we assume and include no intercept term in the model. Of interest is the high-dimensional regime in which the dimensionality is much larger than the sample size .
This model has attracted intensive interests in the frequentist community (Tibshirani, 1996; Fan and Li, 2001; Candes and Tao, 2007; Fan and Lv, 2008; Zhang and Huang, 2008; Su and Candes, 2016, among others). All of these methods hinge on at least two basic assumptions. The first one assumes that the correlations between explanatory variables are sufficiently weak. Examples of this assumption are the mutual coherence condition (Donoho and Huo, 2001; Donoho and Elad, 2003; Donoho et al., 2006; Bunea et al., 2007), the irrepresentable condition (Zhao and Yu, 2006), the restricted eigenvalue condition (Bickel et al., 2009; Fan et al., 2018) and the uniform compatibility condition (Bühlmann and van de Geer, 2011, page 157). The second one, referred to as the sparsity assumption, assumes that only a small number of covariates contribute to the response. Formally, the sparsity, defined as , is much smaller than the dimensionality .
Nevertheless, the weak correlation conditions do not necessarily hold in many applications, especially those in economic and financial studies. In an economic or financial dataset, the explanatory variables, e.g., stock returns or macroeconomic indicators over a period of time, are often influenced by similar economic fundamentals and are thus heavily correlated due to the existence of co-movement patterns (Forbes and Rigobon, 2002; Stock and Watson, 2002). In the presence of such strong correlations introduced by common factors, one naturally expects strong effects of common factors on the response. If this is true, many covariates would have non-negligible effects on the response, rendering the traditional sparsity assumption in the standard regression model (1) ideologically unreasonable.
The above argument shows the necessity to take the correlation structure of explanatory variables into account and adjust the sparsity assumption accordingly. For this purpose, we consider using factor models (Stock and Watson, 2002; Bai, 2003; Bai and Ng, 2006; Fan et al., 2013) and assume that each datum (row) of the data matrix exhibits a decomposition of form
[TABLE]
where is a unknown matrix of factor loading coefficients, is a -dimensional random vector of common factors, and is a -dimensional random vector of weakly-correlated idiosyncratic components, uncorrelated with . Without loss of generality, we assume , , and . Both common factors and idiosyncratic components are latent, but they are often estimated by using principal component analysis (PCA) (Bai, 2003; Fan et al., 2013; Wang and Fan, 2017). Model (2) embraces the well-known CAPM model (Sharpe, 1964; Lintner, 1975) and Fama-French model (Fama and French, 1993) as its special cases, with observable common factors. Let be the matrix of common factors, and be the matrix of idiosyncratic components. Then a more compact matrix form reads as
[TABLE]
Each covariate (column) in can be decomposed as a sum of two components and , reflecting the influence of common factors and idiosyncratic variations respectively.
Utilizing this factor structure (3), we generalize the standard sparse regression model (1) to a factor-adjusted sparse regression model of the form
[TABLE]
where and are regression coefficient vectors of and , respectively. We assume that is dense (as it is usually low-dimensional) but is sparse. That is, all common factors but only a small number of idiosyncratic components of the original explanatory variables contribute to the response. A non-zero indicates that the covariate , excluding the strong correlation with other covariates, has a specific effect on the response. Compared to the traditional sparsity assumption, this factor-adjusted sparsity assumption is more tenable as the idiosyncratic components are weakly-correlated.
We remark that our generalized factor-adjusted regression model (4) covers the standard regression model (1) as a special case by restricting the side constraint that . Under this constraint, the factor-adjusted sparsity assumption imposed on regression coefficients of idiosyncratic components in model (4) coincides with the traditional sparsity assumption in model (1). Thus any statistical method for estimating model (4) would estimate model (1). Of course, when such a constraint is not enforced, model (4) provides more flexibility in the regression analysis than model (1).
Model (4) is similar but different from the factor-augmented regression or the augmented principal component regression of Stock and Watson (2002); Bai and Ng (2006). In the factor-augmented models, the factors are usually extracted from a large panel of data via PCA and used as a part of covariates, yet the other variables are introduced from outside of the panel. These models are typically low-dimensional. In contrast, model (4) takes idiosyncratic components as covariates, which are created internally from the panel of the data. This allows to explore additional explanatory power of the data. Our analyses of model (4) in the high-dimensional fashion are applicable to the low-dimensional factor-augmented regression models in the literature, as model (4) can easily incorporate external variables in the part of and/or . For simplicity of presentation, we omit the details.
Kneip and Sarda (2011) gave an insightful discussion on the limitation of the traditional sparse assumption in model (1) with factor-structured covariates and proposed a factor-augmented regression model. Nevertheless, they still need the weak correlation condition on the original covariates, which is unlikely to hold for factor-structured covariates. See equation (5.5) of Kneip and Sarda (2011). Fan et al. (2016) pointed out the failure of classical frequentist methods dealing with model (1) with factor-structured covariates, and proposed a frequentist method for estimating model (4). Specially, they estimated the latent common factors and idiosyncratic components, and then run frequentist sparse regression methods (e.g., lasso) on estimated common factors and idiosyncratic components. Similar to ours, they impose the weak correlation condition on idiosyncratic components, instead of the original covariates. See Example 3.2 of Fan et al. (2016).
This paper focuses on Bayesian solutions to model (4). As shown in Section 2, the fully Bayesian procedure cannot work easily due to the involvement of latent common factors and idiosyncratic components in the posterior computation. Inspired by Fan et al. (2016), we consider estimating these latent variables by PCA and running a Bayesian sparse regression method on their estimates. The arsenal of Bayesian sparse regression methods, including those exploiting shrinkage priors (e.g. Park and Casella, 2008; Polson and Scott, 2012; Armagan et al., 2013; Bhattacharya et al., 2015; Song and Liang, 2017) and those exploiting spike-and-slab priors (among others, Ishwaran and Rao, 2005; Narisetty and He, 2014; Castillo et al., 2015; Ročková and George, 2018), has been developed in parallel to the frequentist methods. However, it is unclear whether these methods would work on estimated common factors and idiosyncratic components in model (4). When it does work, it remains unknown whether the factor model estimation would incur any loss to the convergence rate or model selection consistency of the Bayesian sparse regression method. Given theoretical results in the frequentist setting, these questions are still challenging, because the definitions of estimation errors and technical conditions of frequentist and Bayesian methods are significantly different (Castillo et al., 2015). Even if a Bayesian sparse regression method is theoretically sound, it is unclear whether it performs better or worse than the frequentist methods on finite sample data. We would like answer these questions in the current paper.
Specifically, our Bayesian method imposes a slab prior on the regression coefficients of estimated common factors, and a spike-and-slab prior on the regression coefficients of estimated idiosyncratic components. This procedure results in a pseudo-posterior distribution, which differs from the exact posterior distribution obtained by a Bayesian regression on exact common factors and idiosyncratic components. Interestingly, the pseudo-posterior distribution achieves the contraction rate of the regression coefficients, which matches that of the exact posterior distribution. Byproducts of our analyses include the adaptivity to the unknown sparsity and the unknown standard deviation . We only need a type of sparse eigenvalue condition on the idiosyncratic components to overcome the non-identifiability issue of the parameters. This is easy to hold due to the weak correlation among idiosyncratic components. Moreover, by assuming a beta-min condition that is frequently used in the high-dimensional regression literature, we prove that our method consistently selects the support of the true sparse regression coefficients.
The rest of this paper proceeds as follows. In Section 2, we propose the Bayesian methodology for the factor-adjusted regression model (4). Section 3 establishes the contraction rates and model selection consistency of the pseudo-posterior distribution. These theoretical results rely on a high-level condition concerning the estimation of factor models, which is examined by Section 4. Section 5 presents experimental results on simulation datasets. Section 6 applies our method to a real dataset of U.S. bond risk premia and macroeconomic indicators. Section 7 is devoted to discussions. All technical proofs and algorithmic implementation are detailed in the appendices.
Notation. We write for a diagonal matrix of elements . For a symmetric matrix , we write its largest eigenvalue as and its smallest eigenvalue as . For a matrix , we write to denote its -th column, and lowercase to denote its -th row. For a index set , is the sub-matrix of assembling the columns indexed by . Let be the element-wise maximum norm of , let be its Frobenius norm. For a vector , let denote its sub-vector assembling components indexed by , and let denote its norm. For two sequences and , or means .
2 Model and Methodology
Our goal is to study the factor-adjusted regression model (4), in which both common and idiosyncratic components are unobserved, but are observed through (3). Each datum (row) in admits the factor structure (2) with therein identically distributed as . Note that are not necessarily independently distributed. The dimension of is fixed, but the dimension of may grow as increases. By decomposition, and are uncorrelated. Without loss of generality, we assume that , and . The regression coefficient vector of is sparse in the sense that is small. We allow to grow as increases, but require so that the desired contraction rate as . The Gaussian errors are independent from and .
An inherent difficulty for estimating model (4) is that both common factors and idiosyncratic components are unobserved. Therefore the first step is to estimate these unobserved variables. We follow Bai (2003); Fan et al. (2013); Wang and Fan (2017) and use PCA for this task. Let be the eigenvalues of . A natural estimator of is the concatenation of the square-root--scaled eigenvectors corresponding to the top eigenvalues of , denote by . That is,
[TABLE]
where Then we estimate by
[TABLE]
If is unknown, we may estimate by
[TABLE]
where is any prescribed upper bound for (Luo et al., 2009; Lam and Yao, 2012; Ahn and Horenstein, 2013).
After estimating unobserved variables, we propose a Bayesian sparse regression method for tasks of parameter estimation and model selection. Suppose we are given data generated from true parameter . Let be its running parameter. Let and be the support of and , respectively. We consider a hierarchical prior with a slab prior on the coefficients of common factors and a spike-and-slab prior on the coefficients of idiosyncratic components as follows:
[TABLE]
where is a positive continuous density function on , e.g., the inverse-gamma density; is a “slab” density function on in the sense that as , e.g., the Gaussian density and the Laplace density ; hyperparameters control the scales of running coefficients ; and, hyperparameter controls the sparsity of running models . For the scaling hyperparameters, we set so that the effects of possibly heterogeneous scales of ’s are appropriately adjusted. For the sparsity hyperparameter, we simply set in the simulation experiments. When dealing with a real dataset, one could choose an informative according to expertise knowledges in the specific area, or tune by sophisticated cross-validation or empirical Bayes procedures
The Bayesian sparse regression on response and regressors , with prior (6) obtains a pseudo-posterior distribution
[TABLE]
where denotes the -dimensional normal distribution with mean and covariance . We call it a “pseudo-posterior” distribution and put a hat over to emphasize that it differs from the exact posterior distributions , obtained by a Bayesian regression on observed , and , obtained by a fully Bayesian procedure.
It is worth noting that, even in the simplest setting in which are i.i.d. and are jointly independent, the exact posterior distribution given by a fully Bayesian procedure
[TABLE]
is computationally intractable due to the involvement of latent variables in the complicated integral. Thus a fully Bayesian procedure does not solve model (4) easily.
3 Theory
In this section, we show under commonly-seen assumptions for Bayesian sparse regression methods that the pseudo-posterior distribution (7) achieves the convergence rate of the estimation error for the coefficient vectors . This rate is so far the best rate Bayesian methods can achieve with observed (Song and Liang, 2017). We see that the factor adjustment added by our approach to the Bayesian sparse regression method incurs no loss in terms of estimation error rate. Byproducts of our analysis are the adaptivities of the pseudo-posterior distribution to the unknown sparsity and unknown standard deviation . Finally, when the beta-min condition holds, we establish the model selection consistency of the pseudo-posterior distribution (7).
3.1 Assumptions
In the high-dimensional regime , a common assumption is that is sparse of size . Following the sparse regression literature, we assume that such that the desired error rate as . To recover the sparse coefficient vector at rate , we need the following assumptions.
Assumption 1**.**
There exists a large integer and a constant such that
[TABLE]
holds with probability approaching .
This assumption is commonly referred to as the sparse eigenvalue condition in the frequentist literature (Fan et al., 2018). In a recent study of Bayesian sparse regression with shrinkage priors, Song and Liang (2017) imposed the same assumption on original covariates . Here our assumption is imposed on their idiosyncratic components .
Our next assumption upper bounds the maximum eigenvalue of , which is the Gram matrix corresponding to the true model . Assumptions 1-2 together ensure that is well conditioned.
Assumption 2**.**
There exists a constant such that
[TABLE]
holds with probability approaching .
Raskutti et al. (2010); Dobriban and Fan (2016) gave sufficient conditions for correlated covariates to satisfy Assumptions 1-2. These theories typically allow in Assumption 1. If consists of i.i.d. entries with zero mean, unit variance and only finite fourth moment, Assumption 2 holds by Bai-Yin theorem in the random matrix theory (Bai and Yin, 1988; Yin et al., 1988).
Since we feed a Bayesian sparse regression method with the estimated variables rather than the latent variables , it is necessary to control the error of . This goal is achieved by assumptions on the estimation errors of latent variables and the magnitudes of the true coefficient vectors. For the estimation error of the factor model, we impose a generic high-level condition as follows.
Assumption 3**.**
The latent common factors and idiosyncratic components can be estimated by and as follows.
[TABLE]
for some nearly orthogonal matrix such that and .
Since represents the eigenspace of the top eigenvalues of and mimics the column space of , there is a nearly-orthogonal transformation, represented by , between and . Next section will verify this error rate in factor models under standard assumptions.
Our last assumption requires constant orders of the true parameters .
Assumption 4**.**
is fixed, , and .
This condition is not restrictive. It holds if and only if the response variable has finite variance, under Assumptions 1-2. To see this point, note that the variance of a single response variable is
[TABLE]
where is the sub-vector of corresponding to the true model , and have all eigenvalues bounded away from [math] and , due to Assumptions 1-2. Although our theoretical analyses need bounded magnitude of regression coefficients to avoid the amplification of estimation errors of latent variables, we remark here that, when the underlying true factors, ’s and ’s, and/or more accurate estimates are available, we can allow larger magnitudes of regression coefficients.
3.2 Definition of Posterior Contraction Rate
The definition of convergence rate in the Bayesian setting differs from that in the frequentist setting. We formally define it by following the classical Bayesian literature (Ghosal et al., 2000; Shen and Wasserman, 2001).
Definition 1** (Posterior contraction).**
Consider a parametric model indexed by . Let be a sequence of data generations according to some true parameter . Let be a function of . Let be a loss function between the estimate and the parameter . A sequence of posterior distributions (random measures) is said to achieve convergence rate of estimation error if
[TABLE]
in -probability as for some constant .
Specifically in the factor-adjusted regression model (4) with covariates hidden in (3), we consider
[TABLE]
where is introduced by Assumption 3, and want to show that achieves the contraction rate of estimation error
[TABLE]
As noted on Assumption 3, approximates in the sense that they have almost the same column space and element-wisely for some nearly orthogonal transformation matrix . Thus the pseudo-posterior distribution would concentrate around such that .
3.3 Results
This subsection presents the main results of the paper. Recall that . Let
[TABLE]
where are constants, and are supports of and , respectively, and is the cardinality of the set difference of and .
Theorem 1**.**
Let denote the probability measure under the true parameters. Under Assumptions 1-4, the following statements hold.
- (a)
(estimation error rate) There exist constants and such that
[TABLE]
as . 2. (b)
(prediction error rate) There exist constants and such that
[TABLE]
as . 3. (c)
(model selection consistency) If then there exist constants and such that
[TABLE]
as . It follows that
[TABLE]
as .
Part (a) establishes the convergence rate of the -estimation error of (up to a nearly orthogonal transformation ) and , the adaptivity to the unknown sparsity , and the adaptivity to the unknown standard deviation .
Part (b) shows that predicts the conditional mean with mean squared error for each single datum instance on average.
The first implication in Part (c) asserts that the pseudo-posterior distribution will select all variables in and at most other variables, with high probability. In simulation experiments, we observe that the pseudo-posterior distribution overestimates the true support size by less than . The second implication asserts that
[TABLE]
in probability as , and therefore provides a variable selection rule. Simply speaking, we can consistently select the true model by thresholding the running coefficients at . In simulation experiments, the majority of pseudo-posterior samples of parameters hit the true model correctly even if the thresholding rule is not used.
The additional condition that in part (c) is called “beta-min condition” in the literature on Bayesian sparse regression (Castillo et al., 2015; Song and Liang, 2017). Narisetty and He (2014) use another identifiability condition to achieve the model selection consistency. Their condition can be shown slightly stronger than the beta-min condition in presence of the minimum sparse eigenvalue condition. To see this point, one can compare their Condition 4.4 to our equation (10) in the proof of Lemma A2, part(d).
4 Factor Model Estimation
This section verifies Assumption 3, which concerns the estimation errors of factor models under standard assumptions. Following Bickel and Levina (2008), we define a uniformity class of positive semi-definite matrices as follows
[TABLE]
Assumption 5**.**
are identically (not necessarily independently) distributed as . , ; , with for some , and .
Assumption 6**.**
All entries in the loading matrix are uniformly bounded, i.e., , and all the eigenvalues of is strictly bounded away from [math] and .
Assumption 7**.**
The sample covariance matrices of and converge to the true covariance matrices at rate in the element-wise maximum norm.
[TABLE]
In Assumption 5, is made to avoid the non-identifiability issue of and . If rows of are i.i.d. copies of some -dimensional distribution then converges almost surely to as and Assumption 6 holds when has eigenvalues bounded away from 0 and . Assumptions 5-6 together characterize the “low-rank plus sparse” structure of the covariance matrix of . That is,
[TABLE]
where the first part is of low rank , and the second part is sparse in the sense that the quantity for some is . This decomposition has a “spike plus non-spike” structural interpretation as well: the smallest non-zero eigenvalue of is of order , while the largest eigenvalue of is of order . This eigen-gap plays the key role in estimating and .
Assumption 7 requires that the sample covariance , and converge to their ideal counterparts at an appropriate rate. Kneip and Sarda (2011) provided sufficient conditions for it to hold in case that are i.i.d.. Fan et al. (2013) established the same rate for stationary and weakly-correlated time-series. Our recent work on the concentration inequalities for general Markov chains (Jiang et al., 2018) can verify this assumption in case that are functionals of ergodic Markov chains.
Next theorem summarizes the theoretical results on factor model estimation under Assumptions 5-7. Part (b) of this theorem bounds the difference between column spaces of and in terms of principal angles, which is novel from the previous theory in the literature (Fan et al., 2013) and may be of independent interest. Parts (c) and (d), which are immediate corollaries of part (b), derive Assumption 3.
Definition 2**.**
The principal angles between two linear spaces spanned by orthonormal column vectors of and are defined as
[TABLE]
where are the singular values of or .
Theorem 2**.**
Let consist of -scaled left singular vectors of , which are orthonormal vectors spanning the column space of . Under Assumptions 5-7, the following statements hold.
- (a)
Eigenvalue recovery:
[TABLE] 2. (b)
Eigenspace recovery:
[TABLE] 3. (c)
Common factor recovery:
[TABLE]
for some nearly orthogonal matrix with and . 4. (d)
Idiosyncratic component recovery:
[TABLE]
5 Simulation Experiments
This section reports simulation results. As a basic case, we set , and generate , and . We set true parameters , , , and .
For prior (6), we choose the inverse-gamma density with shape and scale , the Gaussian density and set hyperparameters and . Starting from , we iterate a Gibbs sampler times and drop the first iterations as the burn-in period. The implementation details of the Gibbs sampler is given in the appendix.
The pseudo-posterior distribution are evaluated in terms of five measures. The posterior mean of is compared to in terms of estimation error. The model selection rate and the sure screening rate are also computed. The former is the portion of the posterior samples that select the true model, i.e., , and the latter is the portion of the posterior samples that select all sparse coefficients, i.e., . To evaluate the adaptivity to unknown sparsity , we report the average model size . To evaluate the adaptivity to unknown standard deviation , the posterior mean of is compared to in terms of relative estimation error. These measures are evaluated over 100 replicates of the datasets, and their averages are reported.
For the comparison purpose, the factor-adjusted lasso method is implemented by using R package glmnet (Friedman et al., 2010). The -penalty hyperparameters of the lasso methods are tuned by 10-fold cross-validation. Since the generic Bayesian / lasso with as covariates can be seen as the factor-adjusted Bayesian / lasso with the underestimate of , we also include them in the comparison.
5.1 Comparison of four methods, and insensitivity to misestimates of
Table 1 summarizes the five measures of four methods in the basic case. Results show that the factor-adjusted Bayesian method outperforms the factor-adjusted lasso method in the tasks of estimation and model selection. The poor performance of the factor-adjusted lasso method may partly result from the less satisfactory hyperparameter tuning procedure implemented in the R package glmnet.
We feed the factor-adjusted methods with the various estimates , and observe that their performances are insensitive to the overestimate of (Table 1). In case that there is no correlation among , i.e., , the factor-adjusted Bayesian method performs slightly worse than the generic Bayesian method (Table 2).
We emphasize that the meaning of the model selection rate for the Bayesian methods are slightly different from that for the frequentist methods. For example, 50% model selection rate given by a frequentist method means that it select the true sparse model in 50 out of 100 replicates of the dataset. In contrast, 90% model selection rate given by a Bayesian method means that every 9 of 10 posterior samples of parameters hit the true sparse model in a single replicate of the dataset on average. In the simulation experiments reported by Tables 1-2, at least every 7 of 10 pseudo-posterior samples obtained by our method hit the true sparse model in each of 100 replicates of the dataset.
5.2 Scalability as increase
We vary the sample size , the dimensionality and the sparsity in the basic case, and test the scalability of the proposed methodology.
In Figure 1(a), we fix all parameters in the basic case but vary , , , , , . In Figure 1(b), we fix all parameters in the basic case but vary , , , , , . In Figure 1(c), we fix all parameters in the basic case but vary , , , , , , , . For factor-adjusted methods, are used.
We observe that our method outperforms the other three methods in terms of estimation error and model selection rate under each combination of , and achieves comparable relative error of to the factor-adjusted lasso method.
5.3 Estimating the standard regression model
Recall that, when admits a factor structure (3), the standard regression model (1) is a special case of the factor-adjusted regression model (4) with the parameter constraint . We expect that the factor-adjusted Bayesian method solves model (1) as well. To verify this expectation, we set (or equivalently ), and test four methods on simulation datasets.
We see that the factor-adjusted Bayesian method does solve model (1). Interestingly, while the factor adjustment added to the lasso method significantly increases the model selection rate from 13% to roughly 50%, the generic Bayesian method works comparably well or even better than the factor-adjusted Bayesian method. We will discuss this phenomenon in the discussion section.
6 Predicting U.S. Bond Risk Premia
This section applies our method to predict U.S. bond risk premia with a large panel of macroeconomic variables. The response variables are monthly U.S. bond risk premia with maturity of years spanning the period from January, 1964 to December, 2003 (Ludvigson and Ng, 2009). The -year bond risk premium at period is defined as the (log) holding return from buying an -year bond at period and selling it as an -year boud at period , excessing the (log) return on one-year bond bought at period . The covariates are macroeconomic variables collected in the FRED-MD database (McCracken and Ng, 2016) during the same period.
The covariates over 480 months are strongly correlated. The scree plot of PCA of these covariates (Figure 2) shows that the first principal component accounts for 55.9% of the total variance, and that the first 5 principal components account for 89.7% of the total variance.
We consider the rolling window regression and next value prediction. Specifically, we regress a U.S. bond risk premium on the macroeconomic variables in the last month. For each time window of size ahead of month , we fit
[TABLE]
and do out-of-sample prediction
[TABLE]
The regression function is fitted as by one of the generic lasso method, the factor-adjusted lasso method, the generic Bayesian method, the factor-adjusted Bayesian method and the principal component regression (PCR) method (Ludvigson and Ng, 2009). For the factor-adjusted methods, the number of common factors is estimated by (5). For the Bayesian methods, we set in prior (6). For PCR, the top eight principal components are included in the regression model in a similar vein to (Ludvigson and Ng, 2009). The R package pls (Wehrens and Mevik, 2007) is used for implementation of PCR.
The prediction performance is evaluated by the out-of-sample , which is computed as follows.
[TABLE]
where is one of two-year, three-year, four-year and five-year U.S. bond risk premia, is the prediction of by one of five methods in comparison, and is the average of .
Table 4 summarizes the out-of-sample five methods achieve on this task. Table 5 reports the average size of the sparse models they select. We observe that the factor-adjusted Bayesian method together with the factor-adjusted Bayesian method achieve higher out-of-sample than other methods. But the factor-adjusted Bayesian method select much sparser models than the factor-adjusted lasso method.
7 Discussion
We propose a factor-adjusted regression model to handle the linear relationship between the response variable and possibly highly correlated covariates. We decompose the predictors into common factors and idiosyncratic components, where the common factors explain most of the variations, and assume all common factors but a small number of idiosyncratic components contribute to the response. The corresponding Bayesian methodology is then developed for estimating such a model. Theoretical results suggest that the proposed methodology can consistently estimate the factor-adjusted model and thus obtain consistent predictions, under an easily-to-hold sparse eigenvalue condition on the idiosyncratic components instead of the original covariates.
Our factor-adjusted model covers the standard linear model as a sub-model with the side constraint. Thus, our proposed methodology can easily handle the case when the standard linear regression model is assumed to be the underlying model. In simulation studies on the sub-model, we find that the factor adjustment greatly improves the performance of lasso, while the generic Bayesian sparse regression is comparable to the factor-adjusted Bayesian sparse regression in terms of estimation error and model selection rate (Table 3). This suggests a fundamental difference between the frequentist sparse regression method and the Bayesian sparse regression method. Indeed, one can prove under Assumptions 1-2, 5-7 that
[TABLE]
If , these two terms are of constant order, and then a similar argument to the proof of Theorem 1 would establish the convergence and model selection consistency of the generic Bayesian regression on standard regression model (1).
Nonetheless, we recommend the factor-adjusted Bayesian regression on model (4) over the generic Bayesian regression on model (1) for three reasons. First, the theoretical analyses of the former allow to grow with , in contrast the latter requires fixed . Second, model (4) provides more flexibility than its sub-model (1) in the regression analyses and would potentially explore more explanatory power from the data. On the real dataset of U.S. bond risk premia, the factor adjusted Bayesian regression achieves 1.0%-3.0% more out-of-sample with one or two less variables (Tables 4-5). Third, in the no correlation case (although it is unlike the case in practice), the factor-adjusted Bayesian regression pays a negligible price for model misspecification (Table 2).
Acknowledgement
We would like to thank Yun Yang for helpful discussions.
Appendix A Technical Proofs for Bayesian Sparse Regression
This appendix collects technical proofs for Theorem 1.
A.1 Proof of Theorem 1
The proofs of three parts use the same techniques and have a similar structure. First, we observe under Assumptions 1-3 that, for some constant and any constant ,
[TABLE]
hold with probability approaching . The first three claimed bounds are directly taken from Assumption 3. The last two claimed bound follow from Weyl’s inequality. For any model of size at most , the singular values of differ from those of by at most
[TABLE]
This implies that
[TABLE]
We thereafter need to show that the conditional probabilities of
[TABLE]
given any realization of satisfying (8), vanish . Recall that
[TABLE]
where and are supports of and , respectively, and .
Consider the conditional probability of event (a). Since depends on through and , we have
[TABLE]
Next, by a change of measure trick and Cauchy-Schwarz inequality,
[TABLE]
Proceed to bound two integrals separately. The logarithm of the second integral is the Rényi Divergence of order 2 from to . It follows that
[TABLE]
for some constant , where (8) derives and , and Assumption 4 controls and .
On the other hand, let denote the probability measure under which , then
[TABLE]
which is concerning the posterior convergence rate of Bayesian sparse regression in model with fixed design matrix to identify the true parameter .
This fixe-design regression is analyzed by Theorem 3. By part (a) of Theorem 3, we can find such that and the first Integral . Combining bounds of two integrals completes the proof of part (a) of Theorem 1. Using similar arguments, parts (b) and (c) of Theorem 3 derive parts (b) and (c) of Theorem 1, respectively.
A.2 Bayesian Sparse Regression with Fixed Design
Theorem 3**.**
Recall that denote the probability measure under the model with fixed design . Suppose and true parameters satisfy
[TABLE]
then the following statements hold.
- (a)
(estimation error rate) For any constants , there exist sufficiently large such that
[TABLE] 2. (b)
(prediction error rate) For any constants , there exist sufficiently large such that
[TABLE] 3. (c)
(model selection consistency) Suppose in addition. For any constants , there exist sufficiently large such that
[TABLE]
Remark. To apply Theorem 3 in the proof of Theorem 1, we replace with , and check that in Theorem 1.
The following proposition, which is also (Barron, 1998, Lemma 6) and (Song and Liang, 2017, Lemma A4), is the central technique to prove Theorem 3.
Proposition 1**.**
Consider a parametric model . Let and be two subsets of the parameter space . Let be a sequence of data generations according to true parameter . Let be a prior distribution over . If
- (1)
, 2. (2)
there exists a test function such that
[TABLE] 3. (3)
and
[TABLE]
then for any ,
[TABLE]
The intuition of this proposition is that any less preferred should either excluded by the prior (for ) or distinguished from by a uniformly powerful test (for ).
Lemmas A1-A3 are useful to verify the three conditions in Proposition 1, respectively. Their proofs are collected in the next subsection.
Lemma A1**.**
(Theorem 1.1 in (Pelekis, 2016)) For a Binomial distributed random variable , if then
[TABLE]
where .
Lemma A2**.**
Under the same assumption of Theorem 3,
- (a)
For
[TABLE]
we have
[TABLE] 2. (b)
For
[TABLE]
we have
[TABLE] 3. (c)
For
[TABLE]
we have
[TABLE] 4. (d)
Suppose in addition. For
[TABLE]
we have
[TABLE] 5. (e)
For
[TABLE]
we have
[TABLE]
Lemma A3**.**
Under the same assumption of Theorem 3,
[TABLE]
for sufficiently large and .
Proof of Theorem 1, part (a).
We verify the three conditions in Proposition 1 one by one. Let
[TABLE]
where are defined in Lemma A2. Then . Applying Lemma A1 yields that
[TABLE]
for sufficiently large . From parts (a),(b) of Lemma A2 and Lemma A5, it follows that
[TABLE]
By Lemma A3, the third condition in Proposition 1 hold asymptotically with
[TABLE]
for any sufficiently large . For any , , We can find sufficiently large and suitable such that
[TABLE]
to complete the proof. ∎
Proof of Theorem 3, part(b).
We use a similar argument to the proof of Theorem 3, part (a) but different and . Let
[TABLE]
where are defined in Lemma A2. Then
[TABLE]
The second condition in Proposition 1 follows from parts (a),(c) of Lemma A2 and Lemma A5.
[TABLE]
∎
Proof of Theorem 3, part(c).
We use a similar argument to the proof of Theorem 3, part (a) but different and .
[TABLE]
where are defined in Lemma A2. Then . The second condition in Proposition 1 follows from parts (a),(d),(e) of Lemma A2 and Lemma A5.
[TABLE]
∎
A.3 Technical Proofs of Lemmas
The proofs of Lemmas A2-A3 invoke a few preliminary results. We list them as follows.
Lemma A4** (Probability bounds of chi-squared random variables).**
Let be a chi-squared random variable of degree .
- (a)
For any such that ,
[TABLE]
In addition, if but ,
[TABLE] 2. (b)
[TABLE]
In addition, if then for any such that
[TABLE]
Proof.
For part (a), the first assertion follows from the sub-exponential tail of chi-squared distribution, and the second assertion is due to
[TABLE]
For part (b), the first assertion is a corollary of (Laurent and Massart, 2000, Lemma 1), and the second assertion follows from
[TABLE]
∎
Lemma A5**.**
For a collection of subspace and a collection of test functions
[TABLE]
Proof.
[TABLE]
∎
Lemma A6** (Part of Corollary 2.4 in (Liu, 2005)).**
Let
[TABLE]
be a positive semi-definite matrix with non-singular principal sub-matrix then
[TABLE]
Proof of Lemma A2, part (a).
Under the null hypothesis, write
[TABLE]
(1) follows from the facts that with under and that . For (2), we observe that projection matrices
[TABLE]
for nested models , and thus the term achieves its maximum value at any and its minimum value at some s.t. and . (3) uses the fact that
[TABLE]
Applying Lemma A4, part (a) yields
[TABLE]
Under the alternative hypothesis, observe that , where
[TABLE]
Using Lemma A5,
[TABLE]
For any such that and any , write
[TABLE]
(1) follows from the facts that with under and that . (2) plugs in the restriction
[TABLE]
from the definition of . (3) uses the fact that
[TABLE]
again. Since the final bound in the last display is uniform for any such that and any , we apply Lemma A4, part (a) and yield
[TABLE]
∎
Proof of Lemma A2, part (b).
Under the null hypothesis, write
[TABLE]
(1) follows from the facts that with under and that . (2) is due to
[TABLE]
For (3), we observe that projection matrices
[TABLE]
for nested models , and thus the term achieves its maximum value at some s.t. and . (4) uses the fact that
[TABLE]
Applying Lemma A4, part (b) yields
[TABLE]
Under the alternative hypothesis, observe that , where
[TABLE]
Using Lemma A5,
[TABLE]
For any such that and any , write
[TABLE]
(1) follows from the facts that with under and that . (2) plugs in the restrictions
[TABLE]
from the definition of . (3) uses a similar argument to what we have used for the null hypothesis. Since the final bound in the last display is uniform for any such that and any , we apply Lemma A4, part (b) and yield
[TABLE]
∎
Proof of Lemma A2, part (c).
Under the null hypothesis, write
[TABLE]
(1) follows from the facts that with under and that . For (2), we observe that projection matrices
[TABLE]
for nested models , and thus the term achieves its maximum value at some s.t. and . (3) uses the fact that
[TABLE]
Applying Lemma A4, part (b) yields
[TABLE]
Under the alternative hypothesis, observe that , where
[TABLE]
Using Lemma A5,
[TABLE]
For any such that and any , write
[TABLE]
(1) follows from the facts that with under and that . (2) plugs in the restrictions
[TABLE]
from the definition of . (3) uses the fact that
[TABLE]
Since the final bound in the last display is uniform for any such that and any , we apply Lemma A4, part (b) and yield
[TABLE]
∎
Proof of Lemma A2, part (d).
We first show that
[TABLE]
Indeed, for any s.t. ,
[TABLE]
Note that is the Schur complement of the principal submatrix in the matrix . Thus, by Lemma A6,
[TABLE]
It further implies that
[TABLE]
Under the null hypothesis, write
[TABLE]
(1) follows from the facts that with under and that . (2) plugs in (10). (3) is due to the fact that
[TABLE]
(4) uses the fact that
[TABLE]
Applying Lemma A4, part (b) yields
[TABLE]
Under the alternative hypothesis, observe that , where
[TABLE]
Using Lemma A5,
[TABLE]
For any such that and any , write
[TABLE]
(1) follows from the facts that with under and that . (2) plugs in the restriction
[TABLE]
from the definition of . (3) uses a similar argument to what we have used for the null hypothesis. Since the final bound in the last display is uniform for any such that and any , we apply Lemma A4, part (b) and yield
[TABLE]
∎
Proof of Lemma A2, part (e).
Under the null hypothesis, write
[TABLE]
(1) follows from the facts that with under and that . (2) is due to
[TABLE]
For (3), we observe that projection matrices
[TABLE]
for nested models , and thus the term achieves its maximum value at some s.t. . (4) uses the fact that
[TABLE]
Applying Lemma A4, part (b) yields
[TABLE]
Under the alternative hypothesis, observe that , where
[TABLE]
Using Lemma A5,
[TABLE]
For any such that and any , write
[TABLE]
(1) follows from the facts that with under , that and that . (2) plugs in the restrictions
[TABLE]
from the definition of . (3) uses a similar argument to what we have used for the null hypothesis. Since the final bound in the last display is uniform for any such that and any , we apply Lemma A4, part (b) and yield
[TABLE]
∎
Proof of Lemma A3.
Define
[TABLE]
Step 1. We first choose sufficiently small such that
[TABLE]
Observing that
[TABLE]
We proceed to find sufficiently small such that
[TABLE]
with conditional probability given . To this end, write
[TABLE]
where
[TABLE]
We choose sufficiently small such that .
Step 2. Since
[TABLE]
it is left to show
[TABLE]
Note that , , and for , . For all , we can find constant such that
[TABLE]
hold for sufficiently large . Thus
[TABLE]
if . ∎
Appendix B Technical Proofs for Factor Model Estimation
This section is devoted to the proofs of Theorem 3. Parts (a) and (b) of Theorem 3 are restated as Lemmas B7-B8. The proof of Lemma B7 is straightforward. To prove Lemma B8, we generalize the Davis-Kahan theorem (Davis and Kahan, 1970; Yu et al., 2014) as Proposition 2 and apply it to bound the principal angles from the perturbed eigenspace to the target eigenspace. Two preliminary lemmas, required by the proof of Lemma B8, are stated as Lemmas B9-B10. Parts (c) and (d) of Theorem 3, restated as Lemmas B11-B12, are immediate corollaries of Lemma B8.
Lemma B7** (Theorem 3, part (a)).**
Suppose Assumptions 5-7. Recall that are the eigenvalues of , and that are eigenvalues of . Then
[TABLE]
Proof.
It suffices to show so that Weyl’s inequality applies. To this end, write
[TABLE]
where
[TABLE]
Plugging into it rates in Assumptions 5-7 and completes the proof. ∎
Proposition 2**.**
Let be an symmetric matrix with eigenvalues and corresponding eigenvectors . Fix and assume that , where and . Let . Let and consists of the other eigenvalues of . Let and consists of the other eigenvectors of . Let be an (not necessarily symmetric) matrix with “-approximate” eigenvalues in the sense that
[TABLE]
where and consists of (not necessarily orthonormal) vectors. Then
[TABLE]
Proof.
Write
[TABLE]
Using the facts that and that , we derive that
[TABLE]
which is the numerator on the right hand side of the claimed inequality. On the other hand,
[TABLE]
where the first (in)equality follows from the identities that and that , the second (in)equality uses the orthogonality of to derive that
[TABLE]
and the third (in)equality uses the column orthonormality of again.
Proceed to consider the term . For real matrices , we write as the vectorization of , which is the vector obtained by stacking columns of , and denote by the kronecker product of matrices and . Using the identity for any matrices with appropriate dimensions, we have
[TABLE]
which is the left hand side times the denominator on the right hand side in the claimed inequality. ∎
Lemma B8** (Theorem 3, part (b)).**
Suppose Assumptions 5-7 hold. Recall that consists of -scaled left singular vectors of , and that consists of -scalded top left singular vectors of . Then recovers the column space of the latent common factor matrix in the sense that
[TABLE]
Proof.
Let consist of -scaled left singular vectors of except those in , then
[TABLE]
Thus it suffices to show . For this goal, we first apply Proposition 2 to show that
[TABLE]
Recall that consist of the right singular vectors of , i.e. , and write
[TABLE]
where . Applying Proposition 2 yields
[TABLE]
To prove (11), we are going to bound each term in the last display by Assumptions 5-7, and Lemmas B7,B9.
- (a)
For the term , we have by Assumption 7 that
[TABLE] 2. (b)
Using the facts that , that and that ,
[TABLE] 3. (c)
For the term
[TABLE]
we have, by Assumptions 5 and 7,
[TABLE]
and, by Lemma B9,
[TABLE] 4. (d)
From Lemma B7, it follows that
[TABLE]
Next, recall that is the singular value decomposition of . Write
[TABLE]
where the second (in)equality follows from the fact that, if is diagonal,
[TABLE]
the third (in)equality uses the orthogonality of , and the final (in)equality combines rates given by Lemma B10 and (11). ∎
Lemma B9**.**
Suppose Assumptions 5 and 6 hold. Then
[TABLE]
Proof.
Write
[TABLE]
Applying Markov’s inequalities to completes the proof. ∎
Lemma B10**.**
Suppose Assumption 6 and 7 holds. Let be the singular value decomposition of . Then
[TABLE]
Proof.
Write
[TABLE]
. ∎
Lemma B11** (Theorem 3, part (c)).**
Suppose Assumptions 5-7 hold. For some non-singular matrix with and ,
[TABLE]
Proof.
Recall that is the singular value decomposition of . Note that all singular values of is bounded by . Let and consist of the left and right singular vectors of (the signs of vectors are properly set such that the singular values of are non-negative). Thus
[TABLE]
where the last step uses Lemma B8. Set then
[TABLE]
The eigenvalues of or are the diagonal elements in , which are -close to as shown by Lemma B10. ∎
Lemma B12** (Theorem 3, part (d)).**
Suppose Assumptions 5-7 hold. recovers the latent individual factor matrix in the sense that
[TABLE]
Proof.
Recall that denote the -th row of , . Recall the definition of in Lemma B11. It is elementary that and . Note that . Write
[TABLE]
For the first term,
[TABLE]
For the second term,
[TABLE]
For the third term,
[TABLE]
∎
Appendix C Implementation of Gibbs Samplers
For the prior (6), we set as the Gaussian density function and as the inverse-gamma density function with shape and scale . A Gibbs sampler is implemented to explore the pseudo-posterior distribution (7). This Gibbs sampler runs towards the pseudo-posterior joint distribution of by iterating the following steps: (1) draw given and , (2) draw given , and , (3) draw given and , (4) draw given and .
For simplicity, we illustrate the implementation details with , for . For the first step, we have
[TABLE]
This implies
[TABLE]
where . However, it is computationally prohibitive to directly sample from this conditional distribution, as takes possible values. As a remedy, we flip in Gibbs random scans. In our experiments, we found that just one random scan suffices for the proposed method to perform well. Details of flipping will be given at the end of this section.
For the second step, we derive, by elementary calculus,
[TABLE]
Recall that . Similarly, for the third step,
[TABLE]
The final step uses the conjugacy of normal distribution and inverse-gamma distribution
[TABLE]
In the first step, in order to sample from the conditional distribution (12), we flip with probability
[TABLE]
where . The posterior probability ratio is computed as
[TABLE]
where we derive, by Sylvester’s determinant theorem and properties of Schur complements, that
[TABLE]
and, by Sherman-Morrison-Woodbury identity, that
[TABLE]
As shown in our theoretical analyses, this Gibbs sampler will deal with in most time. The computation of terms in the posterior probability ratio is numerically stable as the Gram matrices involved in the computation has small size. The computation
It is also time-efficient with complexity . The overall time complexity running iterations of Gibbs samplers in our Bayesian method is then . In contrast, the factor-adjusted lasso method costs time. In the simulation studies, we choose , , , as the typical setting, and observe that our Bayesian method runs as fast as its lasso analogue.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ahn and Horenstein (2013) Ahn, S. C. and Horenstein, A. R. (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227.
- 2Armagan et al. (2013) Armagan, A. , Dunson, D. B. and Lee, J. (2013). Generalized double pareto shrinkage. Statistica Sinica 23 119.
- 3Bai (2003) Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171.
- 4Bai and Ng (2006) Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74 1133–1150.
- 5Bai and Yin (1988) Bai, Z.-D. and Yin, Y.-Q. (1988). Necessary and sufficient conditions for almost sure convergence of the largest eigenvalue of a wigner matrix. Annals of Probability 1729–1741.
- 6Barron (1998) Barron, A. R. (1998). Information-theoretic characterization of bayes performance and the choice of priors in parametric and nonparametric problems. Bayesian Statistics 6 27–52.
- 7Belloni et al. (2012) Belloni, A. , Chen, D. , Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
- 8Bhattacharya et al. (2015) Bhattacharya, A. , Pati, D. , Pillai, N. S. and Dunson, D. B. (2015). Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110 1479–1490.
