Bayesian Factor-adjusted Sparse Regression

Jianqing Fan; Bai Jiang; Qiang Sun

arXiv:1903.09741·stat.ME·March 26, 2019

TL;DR

This paper introduces a Bayesian factor-adjusted sparse regression approach for high-dimensional data with correlated covariates, effectively modeling common factors and idiosyncratic components to improve variable selection and prediction.

Contribution

It develops a novel Bayesian method using spike-and-slab priors for factor-adjusted regression, addressing limitations of traditional sparsity assumptions in correlated settings.

Findings

1. Outperforms lasso in simulations
2. Insensitive to overestimation of factors
3. Scales well with data size and sparsity

Abstract

This paper investigates the high-dimensional linear regression with highly correlated covariates. In this setup, the traditional sparsity assumption on the regression coefficients often fails to hold, and consequently many model selection procedures do not work. To address this challenge, we model the variations of covariates by a factor structure. Specifically, strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components of each covariate. This leads to a factor-adjusted regression model with both common factors and idiosyncratic components as covariates. We generalize the traditional sparsity assumption accordingly and assume that all common factors but only a small number of idiosyncratic components contribute to the response. A Bayesian procedure with a spike-and-slab prior is then proposed for…

Tables

Table 1: Simulation results in the basic case with $k=3$.

Method	$𝜷$ estimation ( $ℓ_{2}$ error)	model selection rate	sure screening rate	average model size	$σ^{2}$ estimation (relative error)
generic Bayes, $\hat{k} = 0$	1.536	0.0%	100.0%	15.92	5.940
Factor-adjusted Bayes, $\hat{k} = 3$	0.124	88.2%	100.0%	5.13	2.057
Factor-adjusted Bayes, $\hat{k} = 6$	0.124	87.0%	100.0%	5.14	2.058
Factor-adjusted Bayes, $\hat{k} = 9$	0.130	86.4%	100.0%	5.14	2.057
Factor-adjusted Bayes, $\hat{k} = 12$	0.133	86.1%	100.0%	5.14	2.069
generic lasso, $\hat{k} = 0$	1.189	0%	100%	92.57	3.187
Factor-adjusted lasso, $\hat{k} = 3$	0.460	27%	100%	21.20	1.899
Factor-adjusted lasso, $\hat{k} = 6$	0.463	25%	100%	21.56	1.865
Factor-adjusted lasso, $\hat{k} = 9$	0.467	27%	100%	26.07	1.787
Factor-adjusted lasso, $\hat{k} = 12$	0.466	24%	100%	29.21	1.657

Table 2: Simulation results in the no-correlation case with $k=0$.

Method	$𝜷$ estimation ( $ℓ_{2}$ error)	model selection rate	sure screening rate	average model size	$σ^{2}$ estimation (relative error)
generic Bayes, $\hat{k} = 0$	0.092	90.5%	100.0%	5.10	0.881
Factor-adjusted Bayes, $\hat{k} = 3$	0.095	87.9%	100.0%	5.13	0.906
Factor-adjusted Bayes, $\hat{k} = 6$	0.097	88.4%	100.0%	5.12	0.914
Factor-adjusted Bayes, $\hat{k} = 9$	0.098	88.5%	100.0%	5.12	0.932
Factor-adjusted Bayes, $\hat{k} = 12$	0.102	88.9%	100.0%	5.12	0.968
generic lasso, $\hat{k} = 0$	0.495	53%	100%	11.72	1.302
Factor-adjusted lasso, $\hat{k} = 3$	0.498	61%	100%	11.98	1.248
Factor-adjusted lasso, $\hat{k} = 6$	0.500	56%	100%	13.28	1.279
Factor-adjusted lasso, $\hat{k} = 9$	0.481	56%	100%	12.48	1.141
Factor-adjusted lasso, $\hat{k} = 12$	0.487	58%	100%	13.62	1.114

Table 3: Experimental results on model (1).

Method	$𝜷$ estimation ( $ℓ_{2}$ error)	model selection rate	sure screening rate	average model size	$σ^{2}$ estimation (relative error)
generic Bayes, $\hat{k} = 0$	0.070	91.5%	100.0%	5.09	0.913
Factor-adjusted Bayes, $\hat{k} = 3$	0.090	91.1%	100.0%	5.09	1.690
Factor-adjusted Bayes, $\hat{k} = 6$	0.091	90.7%	100.0%	5.10	1.715
Factor-adjusted Bayes, $\hat{k} = 9$	0.093	90.4%	100.0%	5.10	1.733
Factor-adjusted Bayes, $\hat{k} = 12$	0.095	89.8%	100.0%	5.11	1.763
generic lasso, $\hat{k} = 0$	0.734	13%	100%	10.18	3.266
Factor-adjusted lasso, $\hat{k} = 3$	0.454	53%	100%	15.17	1.125
Factor-adjusted lasso, $\hat{k} = 6$	0.471	57%	100%	14.96	1.193
Factor-adjusted lasso, $\hat{k} = 9$	0.465	48%	100%	16.65	1.139
Factor-adjusted lasso, $\hat{k} = 12$	0.492	55%	100%	18.06	1.213

Table 4: Out-of-sample $R^{2}$ of five methods predicting U.S. bond risk premia.

Method	2-yr bond	3-yr bond	4-yr bond	5-yr bond
PCR	0.646	0.603	0.568	0.540
generic Bayes	0.765	0.734	0.722	0.696
factor-adjusted Bayes	0.775	0.753	0.747	0.726
generic lasso	0.719	0.717	0.701	0.688
factor-adjusted lasso	0.766	0.764	0.746	0.719

Table 5: Average size of the sparse models selected by the four methods.

Method	2-yr bond	3-yr bond	4-yr bond	5-yr bond
generic Bayes	12.97	12.97	13.13	13.05
factor-adjusted Bayes	11.04	11.39	11.63	11.41
generic lasso	24.06	24.25	25.62	25.71
factor-adjusted lasso	34.46	35.12	36.91	36.57

Equations

$$\mathbf{Y}_{n\times 1}=\mathbf{X}_{n\times p}\bm{\beta}_{p\times 1}+\sigma\bm{\varepsilon}_{n\times 1},$$

$$\bm{x}_{i}=\mathbf{B}\bm{f}_{i}+\bm{u}_{i},$$

$$\mathbf{X}=\mathbf{F}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}+\mathbf{U}.$$

$$\mathbf{Y}_{n\times 1}=\mathbf{F}_{n\times k}\bm{\alpha}_{k\times 1}+\mathbf{U}_{n\times p}\bm{\beta}_{p\times 1}+\sigma\bm{\varepsilon}_{n\times 1},$$

$$\frac{\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}}{n}\widehat{\mathbf{F}}=\widehat{\mathbf{F}}\widehat{\mathbf{\Lambda}},\qquad\frac{\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{F}}}{n}=\mathbf{I},\qquad\widehat{\mathbf{B}}=\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{F}}/n,$$

$$\widehat{\mathbf{U}}=\mathbf{X}-\widehat{\mathbf{F}}\widehat{\mathbf{B}}^{\mathrm{\scriptscriptstyle T}}=(\mathbf{I}-\widehat{\mathbf{F}}\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}/n)\mathbf{X}.$$

$$\widehat{k}=\operatorname*{arg\,max}_{k\leq k_{\max}}\frac{k\text{-th eigenvalue of }\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\mathbf{X}/n}{(k+1)\text{-th eigenvalue of }\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\mathbf{X}/n},$$

$$\begin{aligned}
\sigma^{2}&\sim g(\sigma^{2}),\\
\bm{\alpha}\mid\sigma^{2}&\sim\prod_{j=1}^{k}\frac{1}{\sigma}h\left(\frac{\alpha_{j}}{\sigma}\right),\\
1\{j\in\xi\}&\sim\text{Bernoulli}(s_{0}/p),\\
\bm{\beta}_{\xi}\mid\sigma^{2}&\sim\prod_{j\in\xi}\frac{1}{\tau_{j}\sigma}h\left(\frac{\beta_{j}}{\tau_{j}\sigma}\right),\\
\bm{\beta}_{\xi^{c}}\mid\sigma^{2}&=\mathbf{0},
\end{aligned}$$

$$\widehat{\pi}(\sigma^{2},\bm{\alpha},\bm{\beta}\mid\mathbf{X},\mathbf{Y})=\pi(\sigma^{2},\bm{\alpha},\bm{\beta}\mid\widehat{\mathbf{F}},\widehat{\mathbf{U}},\mathbf{Y})\propto\pi(\sigma^{2},\bm{\alpha},\bm{\beta})\,\mathcal{N}(\mathbf{Y}\mid\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta},\sigma^{2}\mathbf{I}),$$

$$\pi(\sigma^{2},\bm{\alpha},\bm{\beta}\mid\mathbf{X},\mathbf{Y})\propto\pi(\sigma^{2},\bm{\alpha},\bm{\beta})\int\mathcal{N}(\mathbf{Y}\mid\mathbf{F}\bm{\alpha}+(\mathbf{X}-\mathbf{F}\mathbf{B}^{\mathrm{\scriptscriptstyle T}})\bm{\beta},\sigma^{2}\mathbf{I})\prod_{i=1}^{n}P_{f}(\bm{f}_{i})P_{u}(\bm{x}_{i}-\mathbf{B}\bm{f}_{i})\,d\bm{f}_{i},$$

$$\min_{\xi:|\xi|\leq\bar{p}}\lambda_{\min}(\mathbf{U}_{\xi}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}_{\xi}/n)\geq\kappa_{0}$$

$$\lambda_{\max}(\mathbf{U}_{\xi^{\star}}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}_{\xi^{\star}}/n)\leq\kappa_{1}$$

$$\max_{1\leq j\leq k}\|(\widehat{\mathbf{F}}\mathbf{H})_{j}-\mathbf{F}_{j}\|$$

$$\max_{1\leq j\leq p}\|\widehat{\mathbf{U}}_{j}-\mathbf{U}_{j}\|$$

$$\operatorname{Var}(y)=\|\bm{\alpha}^{\star}\|^{2}+(\bm{\beta}^{\star}_{\xi^{\star}})^{\mathrm{\scriptscriptstyle T}}\operatorname{Cov}(\bm{u}_{\xi^{\star}})\bm{\beta}^{\star}_{\xi^{\star}}+\sigma^{\star 2},$$

$$\pi(\ell(\gamma(\theta),\gamma^{\star})\geq M\epsilon_{n}\mid D_{n})\to 0$$

$$D_{n}=(\mathbf{X},\mathbf{Y}),\quad\theta=(\mathbf{B},\sigma,\bm{\alpha},\bm{\beta}),\quad\gamma(\theta)=\begin{pmatrix}\bm{\beta}\\ \bm{\alpha}\end{pmatrix},\quad\gamma^{\star}=\begin{pmatrix}\bm{\beta}^{\star}\\ \mathbf{H}\bm{\alpha}^{\star}\end{pmatrix},$$

$$\ell(\gamma(\theta),\gamma^{\star})=\|\gamma(\theta)-\gamma^{\star}\|=\left\|\begin{pmatrix}\bm{\beta}\\ \bm{\alpha}\end{pmatrix}-\begin{pmatrix}\bm{\beta}^{\star}\\ \mathbf{H}\bm{\alpha}^{\star}\end{pmatrix}\right\|.$$

$$A(\sigma^{\prime},\bm{\alpha}^{\prime},\bm{\beta}^{\prime},M_{0},M_{1},M_{2},\epsilon_{n})=\left\{(\sigma,\bm{\alpha},\bm{\beta}):\ |\xi\setminus\xi^{\prime}|\leq M_{0}s,\ \frac{\sigma^{2}}{\sigma^{\prime 2}}\in\left(\frac{1-M_{1}\epsilon_{n}}{1+M_{1}\epsilon_{n}},\frac{1+M_{1}\epsilon_{n}}{1-M_{1}\epsilon_{n}}\right),\ \left\|\begin{pmatrix}\bm{\beta}\\ \bm{\alpha}\end{pmatrix}-\begin{pmatrix}\bm{\beta}^{\prime}\\ \bm{\alpha}^{\prime}\end{pmatrix}\right\|\leq\sigma^{\prime}M_{2}\epsilon_{n}\right\},$$

$$\mathbb{P}^{\star}\left(\widehat{\pi}(A^{c}(\sigma^{\star},\mathbf{H}\bm{\alpha}^{\star},\bm{\beta}^{\star},M_{0},M_{1},M_{2},\epsilon_{n})\mid\mathbf{X},\mathbf{Y})\geq e^{-C_{1}s\log p}\right)\to 0$$

$$\mathbb{P}^{\star}\left(\widehat{\pi}(\|(\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta})-(\mathbf{F}\bm{\alpha}^{\star}+\mathbf{U}\bm{\beta}^{\star})\|\geq\sigma^{\star}M_{3}\sqrt{n}\epsilon_{n}\mid\mathbf{X},\mathbf{Y})\geq e^{-C_{2}s\log p}\right)\to 0$$

$$\mathbb{P}^{\star}\left(\widehat{\pi}(A^{c}(\sigma^{\star},\mathbf{H}\bm{\alpha}^{\star},\bm{\beta}^{\star},M_{0},M_{1},M_{2},\epsilon_{n})\cup\{\xi\not\supseteq\xi^{\star}\}\mid\mathbf{X},\mathbf{Y})\geq e^{-C_{3}s\log p}\right)\to 0$$

$$\mathbb{P}^{\star}\left(\widehat{\pi}(\{|\xi\setminus\xi^{\star}|\leq M_{0}s,\ \xi\supseteq\xi^{\star}\}^{c}\mid\mathbf{X},\mathbf{Y})\geq e^{-C_{3}s\log p}\right)$$

$$\mathbb{P}^{\star}\left(\widehat{\pi}(\{j:|\beta_{j}|\geq\sigma\sqrt{|\xi|\log p/n}\}\neq\xi^{\star}\mid\mathbf{X},\mathbf{Y})\geq e^{-C_{3}s\log p}\right)$$

$$\widehat{\pi}(\{j:|\beta_{j}|\geq\sigma\sqrt{|\xi|\log p/n}\}=\xi^{\star}\mid\mathbf{X},\mathbf{Y})\to 1$$

$$S_{q}^{+}$$

$$S_{1}^{+}$$

$$\begin{aligned}
\|\mathbf{F}^{\mathrm{\scriptscriptstyle T}}\mathbf{F}/n-\mathbf{I}\|_{\max}&=O_{p}(\sqrt{\log p/n}),\\
\|\mathbf{U}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}/n-\mathbf{\Sigma}\|_{\max}&=O_{p}(\sqrt{\log p/n}),\\
\|\mathbf{F}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}/n\|_{\max}&=O_{p}(\sqrt{\log p/n}).
\end{aligned}$$

$$\operatorname{Cov}(\bm{x})=\mathbf{B}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}+\mathbf{\Sigma},$$

$$\angle(\widehat{\mathbf{\Psi}},\mathbf{\Psi})=(\arccos(d_{1}),\dots,\arccos(d_{k}))^{\mathrm{\scriptscriptstyle T}},$$

$$\|\widehat{\mathbf{\Lambda}}-\mathbf{\Lambda}\|_{\max}/p=O_{p}(\sqrt{\log p/n}),\qquad\max_{k+1\leq j\leq n}|\widehat{\lambda}_{j}|/p=O_{p}(\sqrt{\log p/n}).$$

Taxonomy

Topics: Statistical Methods and Inference · Monetary Policy and Economic Impact · Advanced Statistical Methods and Models

Full text

Bayesian Factor-adjusted Sparse Regression

Jianqing Fan, Bai Jiang, and Qiang Sun

Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544; E-mail: [email protected], [email protected]. Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3; E-mail: [email protected].

(January 31, 2019)

Abstract

This paper investigates the high-dimensional linear regression with highly correlated covariates. In this setup, the traditional sparsity assumption on the regression coefficients often fails to hold, and consequently many model selection procedures do not work. To address this challenge, we model the variations of covariates by a factor structure. Specifically, strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components of each covariate. This leads to a factor-adjusted regression model with both common factors and idiosyncratic components as covariates. We generalize the traditional sparsity assumption accordingly and assume that all common factors but only a small number of idiosyncratic components contribute to the response. A Bayesian procedure with a spike-and-slab prior is then proposed for parameter estimation and model selection. Simulation studies show that our Bayesian method outperforms its lasso analogue, manifests insensitivity to the overestimates of the number of common factors, pays a negligible price in the no correlation case, and scales up well with increasing sample size, dimensionality and sparsity. Numerical results on a real dataset of U.S. bond risk premia and macroeconomic indicators lend strong support to our methodology.

Keywords: factor model, Bayesian sparse regression, posterior convergence rate, model selection.

1 Introduction

High-dimensional linear models are useful for a wide array of economic problems (Fan et al., 2011; Belloni et al., 2012). These models typically assume the sparsity of regression coefficients, that is, only a small number of covariates have significant effects on the response. However, the explanatory variables in the panel of an economic dataset are often highly correlated due to the influence of latent common factors, rendering the sparsity assumption unreasonable and restrictive. To address this issue, this paper proposes a general regression model with a factor-adjusted sparsity assumption, and develops a Bayesian method for this model.

To motivate the factor-adjusted model and its corresponding methodology, we start with the standard linear regression model

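$$\mathbf{Y}_{n\times 1}=\mathbf{X}_{n\times p}\bm{\beta}_{p\times 1}+\sigma\bm{\varepsilon}_{n\times 1},\tag{1}$$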

where $\mathbf{Y}_{n\times 1}=(y_{1},\ldots,y_{n})^{\mathrm{\scriptscriptstyle T}}$ is an $n\times 1$ response vector, $\mathbf{X}_{n\times p}=(\bm{x}_{1},\ldots,\bm{x}_{n})^{\mathrm{\scriptscriptstyle T}}=[\mathbf{X}_{1},\dots,\mathbf{X}_{p}]$ is a design matrix of $n$ observations and $p$ covariates, $\bm{\beta}=(\beta_{1},\dots,\beta_{p})^{\mathrm{\scriptscriptstyle T}}$ is a $p$ -dimensional vector of regression coefficients, $\sigma$ is an unknown standard deviation, and $\bm{\varepsilon}$ is an $n$ -dimensional standard Gaussian random vector, independent of $\mathbf{X}$ . Without loss of generality, we assume $\mathbb{E}\mathbf{X}_{j}=\mathbf{0}$ and include no intercept term in the model. Of interest is the high-dimensional regime in which the dimensionality $p$ is much larger than the sample size $n$ .

This model has attracted intensive interest in the frequentist community (Tibshirani, 1996; Fan and Li, 2001; Candes and Tao, 2007; Fan and Lv, 2008; Zhang and Huang, 2008; Su and Candes, 2016, among others). All of these methods hinge on at least two basic assumptions. The first one assumes that the correlations between explanatory variables are sufficiently weak. Examples of this assumption are the mutual coherence condition (Donoho and Huo, 2001; Donoho and Elad, 2003; Donoho et al., 2006; Bunea et al., 2007), the irrepresentable condition (Zhao and Yu, 2006), the restricted eigenvalue condition (Bickel et al., 2009; Fan et al., 2018) and the uniform compatibility condition (Bühlmann and van de Geer, 2011, page 157). The second one, referred to as the sparsity assumption, assumes that only a small number $s$ of covariates contribute to the response. Formally, the sparsity, defined as $s:=|\{j:\beta_{j}\neq 0\}|$ , is much smaller than the dimensionality $p$ .

Nevertheless, the weak correlation conditions do not necessarily hold in many applications, especially those in economic and financial studies. In an economic or financial dataset, the explanatory variables, e.g., stock returns or macroeconomic indicators over a period of time, are often influenced by similar economic fundamentals and are thus heavily correlated due to the existence of co-movement patterns (Forbes and Rigobon, 2002; Stock and Watson, 2002). In the presence of such strong correlations introduced by common factors, one naturally expects strong effects of common factors on the response. If this is true, many covariates would have non-negligible effects on the response, rendering the traditional sparsity assumption in the standard regression model (1) ideologically unreasonable.

The above argument shows the necessity of taking the correlation structure of explanatory variables into account and adjusting the sparsity assumption accordingly. For this purpose, we consider using factor models (Stock and Watson, 2002; Bai, 2003; Bai and Ng, 2006; Fan et al., 2013) and assume that each datum (row) $\bm{x}_{i}\in\mathbb{R}^{p}$ of the data matrix $\mathbf{X}$ exhibits a decomposition of the form

$$\bm{x}_{i}=\mathbf{B}\bm{f}_{i}+\bm{u}_{i},\tag{2}$$

where $\mathbf{B}=[\bm{b}_{1},\dots,\bm{b}_{p}]^{\mathrm{\scriptscriptstyle T}}$ is a $p\times k$ unknown matrix of factor loading coefficients, $\bm{f}_{i}$ is a $k$ -dimensional random vector of common factors, and $\bm{u}_{i}$ is a $p$ -dimensional random vector of weakly-correlated idiosyncratic components, uncorrelated with $\bm{f}_{i}$ . Without loss of generality, we assume $\mathbb{E}\bm{f}_{i}=\mathbf{0}$ , $\mathbb{E}\bm{u}_{i}=\mathbf{0}$ , and $\operatorname*{\rm Cov}(\bm{f}_{i})=\mathbf{I}$ . Both common factors and idiosyncratic components are latent, but they are often estimated by using principal component analysis (PCA) (Bai, 2003; Fan et al., 2013; Wang and Fan, 2017). Model (2) embraces the well-known CAPM model (Sharpe, 1964; Lintner, 1975) and Fama-French model (Fama and French, 1993) as its special cases, with observable common factors. Let $\mathbf{F}_{n\times k}=[\bm{f}_{1},\dots,\bm{f}_{n}]^{\mathrm{\scriptscriptstyle T}}=[\mathbf{F}_{1},\dots,\mathbf{F}_{k}]$ be the matrix of common factors, and $\mathbf{U}_{n\times p}=[\bm{u}_{1},\dots,\bm{u}_{n}]^{\mathrm{\scriptscriptstyle T}}=[\mathbf{U}_{1},\dots,\mathbf{U}_{p}]$ be the matrix of idiosyncratic components. Then a more compact matrix form reads as

$$\mathbf{X}=\mathbf{F}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}+\mathbf{U}.\tag{3}$$

Each covariate (column) $\mathbf{X}_{j}$ in $\mathbf{X}$ can be decomposed as a sum of two components $\mathbf{F}\bm{b}_{j}$ and $\mathbf{U}_{j}$ , reflecting the influence of common factors and idiosyncratic variations respectively.
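To make the decomposition concrete, the following minimal sketch simulates a panel obeying the matrix form (3). The Gaussian draws for factors, loadings, and idiosyncratic components, the function name `simulate_factor_model`, and the chosen sizes are illustrative assumptions, not the simulation design of the paper.

```python
import numpy as np

def simulate_factor_model(n, p, k, seed=0):
    """Draw X = F B^T + U with Gaussian factors, loadings, and idiosyncratic parts.

    All distributional choices here are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((n, k))   # common factors with Cov(f_i) = I
    B = rng.standard_normal((p, k))   # p x k matrix of factor loadings
    U = rng.standard_normal((n, p))   # idiosyncratic components (independent here)
    X = F @ B.T + U                   # matrix form (3)
    return X, F, B, U

X, F, B, U = simulate_factor_model(n=200, p=50, k=3)

# Column j of X is the sum of a factor part F b_j and an idiosyncratic part U_j.
j = 0
factor_part = F @ B[j]
idiosyncratic_part = U[:, j]
```

In this sketch the columns of `X` are strongly correlated through `F`, while the columns of `U` are not, which is the situation the factor-adjusted model is designed for.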

Utilizing this factor structure (3), we generalize the standard sparse regression model (1) to a factor-adjusted sparse regression model of the form

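$$\mathbf{Y}_{n\times 1}=\mathbf{F}_{n\times k}\bm{\alpha}_{k\times 1}+\mathbf{U}_{n\times p}\bm{\beta}_{p\times 1}+\sigma\bm{\varepsilon}_{n\times 1},\tag{4}$$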

where $\bm{\alpha}$ and $\bm{\beta}$ are regression coefficient vectors of $\mathbf{F}$ and $\mathbf{U}$ , respectively. We assume that $\bm{\alpha}$ is dense (as it is usually low-dimensional) but $\bm{\beta}$ is sparse. That is, all common factors but only a small number of idiosyncratic components of the original explanatory variables contribute to the response. A non-zero $\beta_{j}$ indicates that the covariate $\mathbf{X}_{j}$ , excluding the strong correlation with other covariates, has a specific effect on the response. Compared to the traditional sparsity assumption, this factor-adjusted sparsity assumption is more tenable as the idiosyncratic components are weakly-correlated.

We remark that our generalized factor-adjusted regression model (4) covers the standard regression model (1) as a special case by imposing the side constraint $\bm{\alpha}=\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\bm{\beta}$ . Under this constraint, the factor-adjusted sparsity assumption imposed on regression coefficients of idiosyncratic components in model (4) coincides with the traditional sparsity assumption in model (1). Thus any statistical method for estimating model (4) would estimate model (1). Of course, when such a constraint is not enforced, model (4) provides more flexibility in the regression analysis than model (1).
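The special case can be checked numerically: since $\mathbf{X}=\mathbf{F}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}+\mathbf{U}$, the mean $\mathbf{X}\bm{\beta}$ of model (1) equals $\mathbf{F}(\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\bm{\beta})+\mathbf{U}\bm{\beta}$. The short sketch below verifies this identity on simulated data; the sizes, the seed, and the coefficient values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100, 20, 2
F = rng.standard_normal((n, k))   # common factors
B = rng.standard_normal((p, k))   # factor loadings
U = rng.standard_normal((n, p))   # idiosyncratic components
X = F @ B.T + U

beta = np.zeros(p)
beta[:3] = [1.5, -2.0, 0.5]       # sparse coefficient vector (arbitrary values)
alpha = B.T @ beta                # side constraint alpha = B^T beta

mean_model_1 = X @ beta               # regression mean under model (1)
mean_model_4 = F @ alpha + U @ beta   # regression mean under model (4)
```

Dropping the constraint lets `alpha` vary freely, which is the extra flexibility of model (4).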

Model (4) is similar to, but different from, the factor-augmented regression or the augmented principal component regression of Stock and Watson (2002) and Bai and Ng (2006). In the factor-augmented models, the factors are usually extracted from a large panel of data via PCA and used as a part of covariates, yet the other variables are introduced from outside of the panel. These models are typically low-dimensional. In contrast, model (4) takes idiosyncratic components as covariates, which are created internally from the panel of the data. This allows us to explore additional explanatory power of the data. Our analyses of model (4) in the high-dimensional fashion are applicable to the low-dimensional factor-augmented regression models in the literature, as model (4) can easily incorporate external variables in the part of $\mathbf{F}$ and/or $\mathbf{U}$ . For simplicity of presentation, we omit the details.

Kneip and Sarda (2011) gave an insightful discussion on the limitation of the traditional sparsity assumption in model (1) with factor-structured covariates and proposed a factor-augmented regression model. Nevertheless, they still need the weak correlation condition on the original covariates, which is unlikely to hold for factor-structured covariates. See equation (5.5) of Kneip and Sarda (2011). Fan et al. (2016) pointed out the failure of classical frequentist methods dealing with model (1) with factor-structured covariates, and proposed a frequentist method for estimating model (4). Specifically, they estimated the latent common factors and idiosyncratic components, and then ran frequentist sparse regression methods (e.g., lasso) on the estimated common factors and idiosyncratic components. Similar to ours, their method imposes the weak correlation condition on idiosyncratic components, instead of the original covariates. See Example 3.2 of Fan et al. (2016).

This paper focuses on Bayesian solutions to model (4). As shown in Section 2, the fully Bayesian procedure cannot work easily due to the involvement of latent common factors and idiosyncratic components in the posterior computation. Inspired by Fan et al. (2016), we consider estimating these latent variables by PCA and running a Bayesian sparse regression method on their estimates. The arsenal of Bayesian sparse regression methods, including those exploiting shrinkage priors (e.g. Park and Casella, 2008; Polson and Scott, 2012; Armagan et al., 2013; Bhattacharya et al., 2015; Song and Liang, 2017) and those exploiting spike-and-slab priors (among others, Ishwaran and Rao, 2005; Narisetty and He, 2014; Castillo et al., 2015; Ročková and George, 2018), has been developed in parallel to the frequentist methods. However, it is unclear whether these methods would work on estimated common factors and idiosyncratic components in model (4). When they do work, it remains unknown whether the factor model estimation would incur any loss to the convergence rate or model selection consistency of the Bayesian sparse regression method. Even given the theoretical results in the frequentist setting, these questions are still challenging, because the definitions of estimation errors and technical conditions of frequentist and Bayesian methods are significantly different (Castillo et al., 2015). Even if a Bayesian sparse regression method is theoretically sound, it is unclear whether it performs better or worse than the frequentist methods on finite sample data. We would like to answer these questions in the current paper.

Specifically, our Bayesian method imposes a slab prior on the regression coefficients of estimated common factors, and a spike-and-slab prior on the regression coefficients of estimated idiosyncratic components. This procedure results in a pseudo-posterior distribution, which differs from the exact posterior distribution obtained by a Bayesian regression on exact common factors and idiosyncratic components. Interestingly, the pseudo-posterior distribution achieves the $\ell_{2}$ contraction rate $\sqrt{s\log p/n}$ of the regression coefficients, which matches that of the exact posterior distribution. Byproducts of our analyses include the adaptivity to the unknown sparsity $s$ and the unknown standard deviation $\sigma$ . We only need a type of sparse eigenvalue condition on the idiosyncratic components to overcome the non-identifiability issue of the parameters. This condition holds easily because the idiosyncratic components are only weakly correlated. Moreover, by assuming a beta-min condition that is frequently used in the high-dimensional regression literature, we prove that our method consistently selects the support of the true sparse regression coefficients.

The rest of this paper proceeds as follows. In Section 2, we propose the Bayesian methodology for the factor-adjusted regression model (4). Section 3 establishes the contraction rates and model selection consistency of the pseudo-posterior distribution. These theoretical results rely on a high-level condition concerning the estimation of factor models, which is examined in Section 4. Section 5 presents experimental results on simulated datasets. Section 6 applies our method to a real dataset of U.S. bond risk premia and macroeconomic indicators. Section 7 is devoted to discussions. All technical proofs and algorithmic implementation are detailed in the appendices.

Notation. We write ${\rm diag}(a_{1},\dots,a_{m})$ for a diagonal matrix of elements $a_{1},\dots,a_{m}$ . For a symmetric matrix $\mathbf{A}$ , we write its largest eigenvalue as $\lambda_{\max}(\mathbf{A})$ and its smallest eigenvalue as $\lambda_{\min}(\mathbf{A})$ . For a matrix $\mathbf{A}_{m_{1}\times m_{2}}=[a_{ij}]_{1\leq i\leq m_{1},1\leq j\leq m_{2}}$ , we write $\mathbf{A}_{j}$ to denote its $j$ -th column, and lowercase $\bm{a}_{i}$ to denote its $i$ -th row. For an index set $\xi\subseteq\{1,\dots,m_{2}\}$ , $\mathbf{A}_{\xi}=[\mathbf{A}_{j}:j\in\xi]$ is the sub-matrix of $\mathbf{A}$ assembling the columns indexed by $\xi$ . Let $\|\mathbf{A}\|_{\max}=\max_{i,j}|a_{ij}|$ be the element-wise maximum norm of $\mathbf{A}$ , and let $\|\mathbf{A}\|_{\text{F}}$ be its Frobenius norm. For a vector $\bm{v}$ , let $\bm{v}_{\xi}$ denote its sub-vector assembling components indexed by $\xi$ , and let $\|\bm{v}\|$ denote its $\ell_{2}$ norm. For two sequences $a_{n}$ and $b_{n}$ , $a_{n}\prec b_{n}$ or $b_{n}\succ a_{n}$ means $a_{n}={\textnormal{o}}(b_{n})$ .

2 Model and Methodology

Our goal is to study the factor-adjusted regression model (4), in which both common and idiosyncratic components $[\mathbf{F},\mathbf{U}]$ are unobserved, but $\mathbf{X}$ is observed through (3). Each datum (row) $\bm{x}_{i}$ in $\mathbf{X}$ admits the factor structure (2) with $\{(\bm{f}_{i},\bm{u}_{i})\}_{1\leq i\leq n}$ therein identically distributed as $(\bm{f},\bm{u})$ . Note that $\{(\bm{f}_{i},\bm{u}_{i})\}_{1\leq i\leq n}$ are not necessarily independently distributed. The dimension $k$ of $\bm{f}$ is fixed, but the dimension $p$ of $\bm{u}$ may grow as $n$ increases. By decomposition, $\bm{f}$ and $\bm{u}$ are uncorrelated. Without loss of generality, we assume that $\mathbb{E}\bm{f}=\mathbf{0}$ , $\mathbb{E}\bm{u}=\mathbf{0}$ and $\operatorname*{\rm Cov}(\bm{f})=\mathbf{I}$ . The regression coefficient vector $\bm{\beta}$ of $\mathbf{U}$ is sparse in the sense that $s=|\{j:\beta_{j}\neq 0\}|$ is small. We allow $s$ to grow as $n$ increases, but require $s\prec n/\log p$ so that the desired $\ell_{2}$ contraction rate $\sqrt{s\log p/n}\to 0$ as $n\to\infty$ . The Gaussian errors $\bm{\varepsilon}$ are independent of $\mathbf{F}$ and $\mathbf{U}$ .

An inherent difficulty in estimating model (4) is that both common factors and idiosyncratic components are unobserved. Therefore the first step is to estimate these unobserved variables. We follow Bai (2003), Fan et al. (2013) and Wang and Fan (2017) and use PCA for this task. Let $\widehat{\lambda}_{1}\geq\dots\geq\widehat{\lambda}_{n}$ be the $n$ eigenvalues of $\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}/n$ . A natural estimator of $\mathbf{F}$ is the concatenation of the $k$ square-root- $n$ -scaled eigenvectors corresponding to the top $k$ eigenvalues of $\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}/n$ , denoted by $\widehat{\mathbf{F}}$ . That is,

$$\frac{\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}}{n}\widehat{\mathbf{F}}=\widehat{\mathbf{F}}\widehat{\mathbf{\Lambda}},\qquad\frac{\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{F}}}{n}=\mathbf{I},\qquad\widehat{\mathbf{B}}=\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{F}}/n,$$

where $\widehat{\mathbf{\Lambda}}={\rm diag}(\widehat{\lambda}_{1},\dots,\widehat{\lambda}_{k}).$ Then we estimate $\mathbf{U}$ by

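$$\widehat{\mathbf{U}}=\mathbf{X}-\widehat{\mathbf{F}}\widehat{\mathbf{B}}^{\mathrm{\scriptscriptstyle T}}=(\mathbf{I}-\widehat{\mathbf{F}}\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}/n)\mathbf{X}.$$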

If $k$ is unknown, we may estimate $k$ by

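$$\widehat{k}=\operatorname*{arg\,max}_{k\leq k_{\max}}\frac{k\text{-th eigenvalue of }\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\mathbf{X}/n}{(k+1)\text{-th eigenvalue of }\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\mathbf{X}/n},$$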

where $k_{\max}$ is any prescribed upper bound for $k$ (Luo et al., 2009; Lam and Yao, 2012; Ahn and Horenstein, 2013).
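The PCA steps described above can be sketched in NumPy as follows. The function names `estimate_factors` and `estimate_num_factors` and the simulated panel are hypothetical. The sketch works with the eigenvalues of $\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}/n$, whose nonzero eigenvalues coincide with those of $\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\mathbf{X}/n$, and assumes $k_{\max}<\min(n,p)$ so that the eigenvalue ratios are well defined.

```python
import numpy as np

def estimate_factors(X, k):
    """PCA estimates of the factors, loadings, and idiosyncratic components."""
    n = X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(X @ X.T / n)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]             # indices of the k largest eigenvalues
    F_hat = np.sqrt(n) * eigvecs[:, top]            # scaling gives F_hat^T F_hat / n = I
    B_hat = X.T @ F_hat / n                         # estimated loadings
    U_hat = X - F_hat @ B_hat.T                     # equals (I - F_hat F_hat^T / n) X
    return F_hat, B_hat, U_hat

def estimate_num_factors(X, k_max):
    """Eigenvalue-ratio estimate of k; assumes k_max < min(n, p)."""
    n = X.shape[0]
    eigvals = np.sort(np.linalg.eigvalsh(X @ X.T / n))[::-1]  # descending order
    ratios = eigvals[:k_max] / eigvals[1:k_max + 1]           # lambda_k / lambda_{k+1}
    return int(np.argmax(ratios)) + 1

# Hypothetical panel with three strong Gaussian factors.
rng = np.random.default_rng(0)
n, p, k = 200, 100, 3
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, p)) + rng.standard_normal((n, p))

k_hat = estimate_num_factors(X, k_max=10)
F_hat, B_hat, U_hat = estimate_factors(X, k_hat)
```

Forming the $n\times n$ matrix is convenient when $p\gg n$; for $n\gg p$ a singular value decomposition of `X` would be a cheaper route to the same quantities.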

After estimating unobserved variables, we propose a Bayesian sparse regression method for tasks of parameter estimation and model selection. Suppose we are given data $(\mathbf{X},\mathbf{Y})$ generated from true parameter $(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})$ . Let $(\sigma,\bm{\alpha},\bm{\beta})$ be its running parameter. Let $\xi=\{j:\beta_{j}\neq 0\}$ and $\xi^{\star}=\{j:\beta^{\star}_{j}\neq 0\}$ be the support of $\bm{\beta}$ and $\bm{\beta}^{\star}$ , respectively. We consider a hierarchical prior $\pi(\sigma^{2},\bm{\alpha},\bm{\beta})$ with a slab prior on the coefficients of common factors $\mathbf{F}$ and a spike-and-slab prior on the coefficients of idiosyncratic components $\mathbf{U}$ as follows:

$$\begin{aligned}
\sigma^{2}&\sim g(\sigma^{2}),\\
\bm{\alpha}\mid\sigma^{2}&\sim\prod_{j=1}^{k}\frac{1}{\sigma}h\left(\frac{\alpha_{j}}{\sigma}\right),\\
1\{j\in\xi\}&\sim\text{Bernoulli}(s_{0}/p),\\
\bm{\beta}_{\xi}\mid\sigma^{2}&\sim\prod_{j\in\xi}\frac{1}{\tau_{j}\sigma}h\left(\frac{\beta_{j}}{\tau_{j}\sigma}\right),\\
\bm{\beta}_{\xi^{c}}\mid\sigma^{2}&=\mathbf{0},
\end{aligned}\tag{6}$$

where $g$ is a positive continuous density function on $(0,\infty)$ , e.g., the inverse-gamma density; $h$ is a “slab” density function on $(-\infty,+\infty)$ in the sense that $-\log[\inf_{|z|\leq t}h(z)]={\textnormal{O}}(t^{2})$ as $t\to\infty$ , e.g., the Gaussian density $e^{-z^{2}/2}/\sqrt{2\pi}$ and the Laplace density $e^{-|z|}/2$ ; the hyperparameters $\tau_{1},\dots,\tau_{p}$ control the scales of running coefficients $\beta_{1},\dots,\beta_{p}$ ; and the hyperparameter $s_{0}$ controls the sparsity of running models $\xi$ . For the scaling hyperparameters, we set $\tau_{j}^{-1}=\|\widehat{\mathbf{U}}_{j}\|/\sqrt{n}$ so that the effects of possibly heterogeneous scales of $\widehat{\mathbf{U}}_{j}$ ’s are appropriately adjusted. For the sparsity hyperparameter, we simply set $s_{0}=1$ in the simulation experiments. When dealing with a real dataset, one could choose an informative $s_{0}$ according to expert knowledge in the specific area, or tune $s_{0}$ by cross-validation or empirical Bayes procedures.
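As an illustration of the hierarchical prior, the sketch below draws one sample of $(\sigma^{2},\bm{\alpha},\bm{\beta})$ from it. It assumes a Gaussian slab $h$ and an inverse-gamma $g$; the shape `a` and scale `b`, the function name `sample_prior`, and the example inputs are hypothetical choices that the text does not fix.

```python
import numpy as np

def sample_prior(U_hat, k, s0=1.0, a=2.0, b=1.0, seed=0):
    """Draw (sigma^2, alpha, beta, support) from the spike-and-slab prior.

    Assumptions: h is the standard Gaussian density and g is the
    inverse-gamma density with hypothetical shape a and scale b.
    """
    rng = np.random.default_rng(seed)
    n, p = U_hat.shape
    # If G ~ Gamma(shape=a, rate=b), then 1/G ~ Inverse-Gamma(shape=a, scale=b).
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b)
    sigma = np.sqrt(sigma2)
    # Slab on every factor coefficient: alpha_j / sigma ~ h.
    alpha = sigma * rng.standard_normal(k)
    # Scale hyperparameters: 1 / tau_j = ||U_hat_j|| / sqrt(n).
    tau = np.sqrt(n) / np.linalg.norm(U_hat, axis=0)
    # Inclusion indicators: 1{j in xi} ~ Bernoulli(s0 / p).
    in_model = rng.random(p) < s0 / p
    # Slab on included coefficients, point mass at zero (spike) on the rest.
    beta = np.zeros(p)
    beta[in_model] = tau[in_model] * sigma * rng.standard_normal(int(in_model.sum()))
    return sigma2, alpha, beta, in_model

# Example with a stand-in for the estimated idiosyncratic components.
U_demo = np.random.default_rng(3).standard_normal((50, 10))
sigma2, alpha, beta, in_model = sample_prior(U_demo, k=3, s0=1.0)
```

With `s0=1` the prior expected model size is one, so most draws of `beta` contain very few nonzero entries.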

The Bayesian sparse regression on response $\mathbf{Y}$ and regressors $\widehat{\mathbf{F}}$ , $\widehat{\mathbf{U}}$ with prior (6) obtains a pseudo-posterior distribution

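$$\widehat{\pi}(\sigma^{2},\bm{\alpha},\bm{\beta}\mid\mathbf{X},\mathbf{Y})=\pi(\sigma^{2},\bm{\alpha},\bm{\beta}\mid\widehat{\mathbf{F}},\widehat{\mathbf{U}},\mathbf{Y})\propto\pi(\sigma^{2},\bm{\alpha},\bm{\beta})\,\mathcal{N}(\mathbf{Y}\mid\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta},\sigma^{2}\mathbf{I}),\tag{7}$$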

where $\mathcal{N}(\mathbf{Y}|\bm{\mu},\sigma^{2}\mathbf{I})$ denotes the $n$ -dimensional normal distribution with mean $\bm{\mu}_{n\times 1}$ and covariance $\sigma^{2}\mathbf{I}$ . We call it a “pseudo-posterior” distribution and put a hat over $\pi$ to emphasize that it differs from the exact posterior distributions $\pi(\sigma^{2},\bm{\alpha},\bm{\beta}|\mathbf{F},\mathbf{U},\mathbf{Y})$ , obtained by a Bayesian regression on observed $[\mathbf{F},\mathbf{U}]$ , and $\pi(\sigma^{2},\bm{\alpha},\bm{\beta}|\mathbf{X},\mathbf{Y})$ , obtained by a fully Bayesian procedure.
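Up to a normalizing constant, the pseudo-posterior is the prior density times a Gaussian likelihood evaluated at the estimated regressors. The sketch below evaluates its logarithm at a given parameter value, again assuming a Gaussian slab and an inverse-gamma density for $\sigma^{2}$ with hypothetical shape `a` and scale `b`. The function name `log_pseudo_posterior` is illustrative, and this is not the sampling algorithm of the paper's appendix.

```python
import numpy as np

def log_pseudo_posterior(sigma2, alpha, beta, F_hat, U_hat, Y, s0=1.0, a=2.0, b=1.0):
    """Unnormalized log pseudo-posterior density at (sigma2, alpha, beta).

    Assumes a Gaussian slab h, an inverse-gamma(a, b) density g, and s0 < p.
    """
    n, p = U_hat.shape
    sigma = np.sqrt(sigma2)
    tau = np.sqrt(n) / np.linalg.norm(U_hat, axis=0)   # 1 / tau_j = ||U_hat_j|| / sqrt(n)
    xi = beta != 0                                     # running model (support of beta)
    size = int(xi.sum())

    def log_h(z):
        # Log of the standard Gaussian slab density.
        return -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)

    # Log prior: g(sigma^2) up to a constant, slab on alpha, Bernoulli
    # inclusion indicators, and scaled slab on the included beta_j.
    log_prior = -(a + 1) * np.log(sigma2) - b / sigma2
    log_prior += np.sum(log_h(alpha / sigma) - np.log(sigma))
    q = s0 / p
    log_prior += size * np.log(q) + (p - size) * np.log1p(-q)
    log_prior += np.sum(log_h(beta[xi] / (tau[xi] * sigma)) - np.log(tau[xi] * sigma))

    # Gaussian log likelihood N(Y | F_hat alpha + U_hat beta, sigma^2 I).
    resid = Y - F_hat @ alpha - U_hat @ beta
    log_lik = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2
    return log_prior + log_lik
```

A sampler over models $\xi$ and coefficients would compare values of this function across candidate parameter values; only differences of the log density matter, which is why the normalizing constant can be ignored.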

It is worth noting that, even in the simplest setting in which $\{(\bm{f}_{i},\bm{u}_{i})\}_{1\leq i\leq n}$ are i.i.d. and $\bm{f}_{i}\sim P_{f},\bm{u}_{i}\sim P_{u}$ are jointly independent, the exact posterior distribution given by a fully Bayesian procedure

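$$\pi(\sigma^{2},\bm{\alpha},\bm{\beta}\mid\mathbf{X},\mathbf{Y})\propto\pi(\sigma^{2},\bm{\alpha},\bm{\beta})\int\mathcal{N}(\mathbf{Y}\mid\mathbf{F}\bm{\alpha}+(\mathbf{X}-\mathbf{F}\mathbf{B}^{\mathrm{\scriptscriptstyle T}})\bm{\beta},\sigma^{2}\mathbf{I})\prod_{i=1}^{n}P_{f}(\bm{f}_{i})P_{u}(\bm{x}_{i}-\mathbf{B}\bm{f}_{i})\,d\bm{f}_{i},$$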

is computationally intractable due to the involvement of latent variables in the complicated integral. Thus a fully Bayesian procedure does not solve model (4) easily.

3 Theory

In this section, we show under commonly-seen assumptions for Bayesian sparse regression methods that the pseudo-posterior distribution (7) achieves the convergence rate $\epsilon_{n}=\sqrt{s\log p/n}$ of the $\ell_{2}$ estimation error for the coefficient vectors $(\bm{\alpha}^{\star},\bm{\beta}^{\star})$ . This rate is so far the best rate Bayesian methods can achieve with observed $[\mathbf{F},\mathbf{U}]$ (Song and Liang, 2017). We see that the factor adjustment added by our approach to the Bayesian sparse regression method incurs no loss in terms of $\ell_{2}$ estimation error rate. Byproducts of our analysis are the adaptivity of the pseudo-posterior distribution to the unknown sparsity $s$ and unknown standard deviation $\sigma^{\star}$ . Finally, when the beta-min condition holds, we establish the model selection consistency of the pseudo-posterior distribution (7).

3.1 Assumptions

In the high-dimensional regime $p\succ n$ , a common assumption is that $\bm{\beta}^{\star}$ is sparse of size $s$ . Following the sparse regression literature, we assume that $s\prec n/\log p$ such that the desired error rate $\epsilon_{n}=\sqrt{s\log p/n}\to 0$ as $n\to\infty$ . To recover the sparse coefficient vector $\bm{\beta}^{\star}$ at rate $\epsilon_{n}$ , we need the following assumptions.
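For a sense of scale, the rate $\epsilon_{n}=\sqrt{s\log p/n}$ can be evaluated directly. The helper name `contraction_rate` and the values of $s$, $p$ and $n$ below are arbitrary illustrations, not settings taken from the paper.

```python
import math

def contraction_rate(s, p, n):
    """Evaluate epsilon_n = sqrt(s * log(p) / n)."""
    return math.sqrt(s * math.log(p) / n)

# With s and p held fixed, the rate shrinks as the sample size grows,
# which is what the requirement that s grow slower than n / log p ensures.
rates = {n: contraction_rate(s=5, p=500, n=n) for n in (200, 2000, 20000)}
```

Because the dimensionality enters only through $\log p$, a tenfold increase in $p$ inflates the rate far less than a tenfold decrease in $n$.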

Assumption 1.

There exists a large integer $\bar{p}(n,p)\succ s$ and a constant $\kappa_{0}>0$ such that

[TABLE]

holds with probability approaching $1$ .

This assumption is commonly referred to as the sparse eigenvalue condition in the frequentist literature (Fan et al., 2018). In a recent study of Bayesian sparse regression with shrinkage priors, Song and Liang (2017) imposed the same assumption on original covariates $\mathbf{X}$ . Here our assumption is imposed on their idiosyncratic components $\mathbf{U}$ .

Our next assumption upper bounds the maximum eigenvalue of $\mathbf{U}_{\xi^{\star}}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}_{\xi^{\star}}/n$ , which is the Gram matrix corresponding to the true model $\xi^{\star}=\{j:\beta^{\star}_{j}\neq 0\}$ . Assumptions 1-2 together ensure that $\mathbf{U}_{\xi^{\star}}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}_{\xi^{\star}}/n$ is well conditioned.

Assumption 2.

There exists a constant $\kappa_{1}>0$ such that

[TABLE]

holds with probability approaching $1$ .

Raskutti et al. (2010) and Dobriban and Fan (2016) gave sufficient conditions for correlated covariates to satisfy Assumptions 1-2. These theories typically allow $\bar{p}(n,p)\asymp n/\log p$ in Assumption 1. If $\mathbf{U}_{\xi^{\star}}$ consists of i.i.d. entries with zero mean, unit variance and finite fourth moment, Assumption 2 holds by the Bai-Yin theorem in random matrix theory (Bai and Yin, 1988; Yin et al., 1988).
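The Bai-Yin statement can be checked numerically. The sketch below is an illustration only: it draws an $n\times s$ matrix with i.i.d. standard normal entries and compares the largest eigenvalue of the Gram matrix $\mathbf{U}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}/n$ with the Bai-Yin limit $(1+\sqrt{s/n})^{2}$. The sizes $n=2000$ and $s=50$ and the Gaussian entries are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 2000, 50                        # illustrative sizes, not from the paper
U_true = rng.standard_normal((n, s))   # i.i.d. entries, zero mean, unit variance

gram = U_true.T @ U_true / n
lam_max = np.linalg.eigvalsh(gram)[-1]       # eigvalsh returns ascending eigenvalues
bai_yin_limit = (1 + np.sqrt(s / n)) ** 2    # asymptotic value of the top eigenvalue

print(f"largest eigenvalue: {lam_max:.3f}")
print(f"Bai-Yin limit:      {bai_yin_limit:.3f}")
```

A bounded largest eigenvalue of this kind is what Assumption 2 requires of the Gram matrix of the true model.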

Since we feed a Bayesian sparse regression method with the estimated variables $[\widehat{\mathbf{F}},\widehat{\mathbf{U}}]$ rather than the latent variables $[\mathbf{F},\mathbf{U}]$ , it is necessary to control the error of $(\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta})-(\mathbf{F}\bm{\alpha}^{\star}+\mathbf{U}\bm{\beta}^{\star})$ . This goal is achieved by assumptions on the estimation errors of latent variables and the magnitudes of the true coefficient vectors. For the estimation error of the factor model, we impose a generic high-level condition as follows.

Assumption 3.

The latent common factors and idiosyncratic components can be estimated by $\widehat{\mathbf{F}}$ and $\widehat{\mathbf{U}}$ as follows.

[TABLE]

for some nearly orthogonal matrix $\mathbf{H}_{k\times k}$ such that $\|\mathbf{H}^{\mathrm{\scriptscriptstyle T}}\mathbf{H}-\mathbf{I}\|={\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ and $\|\mathbf{H}\mathbf{H}^{\mathrm{\scriptscriptstyle T}}-\mathbf{I}\|={\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ .

Since $\widehat{\mathbf{F}}$ represents the eigenspace of the top $k$ eigenvalues of $\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}$ and mimics the column space of $\mathbf{F}$ , there is a nearly orthogonal transformation, represented by $\mathbf{H}$ , between $\mathbf{F}$ and $\widehat{\mathbf{F}}$ . The next section verifies this error rate in factor models under standard assumptions.
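To make the role of $\widehat{\mathbf{F}}$ and $\widehat{\mathbf{U}}$ concrete, here is a minimal sketch of a principal-component estimator. It assumes the common convention $\widehat{\mathbf{F}}=\sqrt{n}\times$ (top-$k$ eigenvectors of $\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}$), $\widehat{\mathbf{B}}=\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{F}}/n$ and $\widehat{\mathbf{U}}=\mathbf{X}-\widehat{\mathbf{F}}\widehat{\mathbf{B}}^{\mathrm{\scriptscriptstyle T}}$. The function name `estimate_factors`, this normalization, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def estimate_factors(X: np.ndarray, k: int):
    """PCA-style estimates (F_hat, B_hat, U_hat) for the factor model X = F B^T + U."""
    n, _ = X.shape
    # Eigen-decomposition of the n x n matrix X X^T (eigh returns ascending eigenvalues).
    _, eigvecs = np.linalg.eigh(X @ X.T)
    top_k = eigvecs[:, ::-1][:, :k]   # eigenvectors of the k largest eigenvalues
    F_hat = np.sqrt(n) * top_k        # normalized so that F_hat^T F_hat / n = I
    B_hat = X.T @ F_hat / n           # least-squares loadings given F_hat
    U_hat = X - F_hat @ B_hat.T       # estimated idiosyncratic components
    return F_hat, B_hat, U_hat


# Synthetic data with a three-factor structure (illustrative sizes).
rng = np.random.default_rng(1)
n, p, k = 200, 500, 3
F = rng.standard_normal((n, k))
B = rng.uniform(-1.0, 1.0, size=(p, k))
U = rng.standard_normal((n, p))
X = F @ B.T + U

F_hat, B_hat, U_hat = estimate_factors(X, k)
```

Under this convention $\widehat{\mathbf{F}}$ is only identified up to a $k\times k$ transformation, which is why the comparison with $\mathbf{F}$ goes through a matrix such as $\mathbf{H}$.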

Our last assumption requires constant orders of the true parameters $(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})$ .

Assumption 4.

$\sigma^{\star}>0$ is fixed, $\|\bm{\alpha}^{\star}\|={\textnormal{O}}(1)$ , and $\|\bm{\beta}^{\star}\|={\textnormal{O}}(1)$ .

This condition is not restrictive. It holds if and only if the response variable has finite variance, under Assumptions 1-2. To see this point, note that the variance of a single response variable $y=\bm{f}^{\mathrm{\scriptscriptstyle T}}\bm{\alpha}^{\star}+\bm{u}^{\mathrm{\scriptscriptstyle T}}\bm{\beta}^{\star}+\sigma^{\star}\varepsilon$ is

[TABLE]

where $\bm{u}_{\xi^{\star}}$ is the sub-vector of $\bm{u}$ corresponding to the true model $\xi^{\star}$ , and $\operatorname*{\rm Cov}(\bm{u}_{\xi^{\star}})$ has all eigenvalues bounded away from $0$ and $\infty$ , due to Assumptions 1-2. Although our theoretical analyses need bounded magnitudes of the regression coefficients to avoid amplifying the estimation errors of the latent variables, we remark that larger magnitudes of the regression coefficients can be allowed when the underlying true factors, $\mathbf{F}_{j}$ ’s and $\mathbf{U}_{j}$ ’s, and/or more accurate estimates are available.

3.2 Definition of Posterior Contraction Rate

The definition of convergence rate in the Bayesian setting differs from that in the frequentist setting. We formally define it by following the classical Bayesian literature (Ghosal et al., 2000; Shen and Wasserman, 2001).

Definition 1 (Posterior contraction).

Consider a parametric model indexed by $\bm{\theta}$ . Let $\{\mathcal{D}_{n}\}_{n\geq 1}$ be a sequence of data generations according to some true parameter $\bm{\theta}^{\star}$ . Let $\bm{\gamma}(\bm{\theta})$ be a function of $\bm{\theta}$ . Let $\ell(\bm{\gamma}(\bm{\theta}),\bm{\gamma}^{\star})$ be a loss function between the estimate $\bm{\gamma}(\bm{\theta})$ and the parameter $\bm{\gamma}^{\star}$ . A sequence of posterior distributions (random measures) $\{\pi(\bm{\theta}|\mathcal{D}_{n})\}_{n\geq 1}$ is said to achieve convergence rate $\epsilon_{n}$ of estimation error $\ell(\bm{\gamma}(\bm{\theta}),\bm{\gamma}^{\star})$ if

[TABLE]

in $\mathbb{P}_{\bm{\theta}^{\star}}$ -probability as $n\to\infty$ for some constant $M>0$ .

Specifically in the factor-adjusted regression model (4) with covariates hidden in (3), we consider

[TABLE]

where $\mathbf{H}$ is introduced by Assumption 3, and want to show that $\widehat{\pi}(\sigma^{2},\bm{\alpha},\bm{\beta}|\mathbf{X},\mathbf{Y})$ achieves the contraction rate $\epsilon_{n}=\sqrt{s\log p/n}$ of $\ell_{2}$ estimation error

[TABLE]

As noted after Assumption 3, $\widehat{\mathbf{F}}$ approximates $\mathbf{F}$ in the sense that they have almost the same column space and $\widehat{\mathbf{F}}\mathbf{H}\approx\mathbf{F}$ element-wise for some nearly orthogonal transformation matrix $\mathbf{H}$ . Thus the pseudo-posterior distribution would concentrate around $\bm{\alpha}\approx\mathbf{H}\bm{\alpha}^{\star}$ such that $\widehat{\mathbf{F}}\bm{\alpha}\approx\widehat{\mathbf{F}}\mathbf{H}\bm{\alpha}^{\star}\approx\mathbf{F}\bm{\alpha}^{\star}$ .

3.3 Results

This subsection presents the main results of the paper. Recall that $\epsilon_{n}=\sqrt{s\log p/n}$ . Let

[TABLE]

where $M_{0},M_{1},M_{2}$ are constants, $\xi$ and $\xi^{\prime}$ are supports of $\bm{\beta}$ and $\bm{\beta}^{\prime}$ , respectively, and $|\xi\setminus\xi^{\prime}|$ is the cardinality of the set difference $\xi\setminus\xi^{\prime}$ .

Theorem 1.

Let $\mathbb{P}^{\star}=\mathbb{P}_{(\mathbf{B},\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})}$ denote the probability measure under the true parameters. Under Assumptions 1-4, the following statements hold.

(a)

(estimation error rate) There exist constants $M_{0},M_{1},M_{2}$ and $C_{1}$ such that

[TABLE]

as $n\to\infty$ .

(b)

(prediction error rate) There exist constants $M_{3}$ and $C_{2}$ such that

[TABLE]

as $n\to\infty$ .

(c)

(model selection consistency) If $\min_{j\in\xi^{\star}}|\beta^{\star}_{j}|\succ\epsilon_{n}$ then there exist constants $M_{0},M_{1},M_{2}$ and $C_{3}$ such that

[TABLE]

as $n\to\infty$ . It follows that

[TABLE]

as $n\to\infty$ .

Part (a) establishes the convergence rate $\epsilon_{n}$ of the $\ell_{2}$ -estimation error of $\bm{\alpha}^{\star}$ (up to a nearly orthogonal transformation $\mathbf{H}$ ) and $\bm{\beta}^{\star}$ , the adaptivity to the unknown sparsity $s$ , and the adaptivity to the unknown standard deviation $\sigma^{\star}$ .

Part (b) shows that $\widehat{\mathbf{Y}}=\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta}$ predicts the conditional mean $\mathbb{E}[\mathbf{Y}|\mathbf{F},\mathbf{U}]=\mathbf{F}\bm{\alpha}^{\star}\!+\!\mathbf{U}\bm{\beta}^{\star}$ with mean squared error ${\textnormal{O}_{\textnormal{p}}}(\epsilon_{n})$ for each single datum instance on average.

The first implication in Part (c) asserts that the pseudo-posterior distribution will select all variables in $\xi^{\star}$ and at most $M_{0}s$ other variables, with high probability. In simulation experiments, we observe that the pseudo-posterior distribution overestimates the true support size $s=|\xi^{\star}|$ by less than $5\%$ . The second implication asserts that

[TABLE]

in probability as $n\to\infty$ , and therefore provides a variable selection rule. Simply speaking, we can consistently select the true model $\xi^{\star}$ by thresholding the running coefficients $\beta_{j}$ at $\sigma\sqrt{|\xi|\log p/n}$ . In simulation experiments, the majority of pseudo-posterior samples of parameters hit the true model correctly even if the thresholding rule is not used.
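The thresholding rule described above can be sketched for a single pseudo-posterior draw as follows. This is an illustration under stated assumptions: the function name `threshold_support`, the multiplicative constant `c` (set to 1 here), and the toy draw are hypothetical, and the cutoff $\sigma\sqrt{|\xi|\log p/n}$ is taken directly from the sentence above.

```python
import numpy as np


def threshold_support(beta: np.ndarray, sigma: float, n: int, c: float = 1.0) -> np.ndarray:
    """Indices j with |beta_j| above c * sigma * sqrt(|xi| * log(p) / n).

    `c` is a hypothetical tuning constant; |xi| is the support size of this draw.
    """
    p = beta.shape[0]
    support_size = np.count_nonzero(beta)
    cutoff = c * sigma * np.sqrt(support_size * np.log(p) / n)
    return np.flatnonzero(np.abs(beta) > cutoff)


# One hypothetical draw: five signals of size 0.3 plus one small spurious entry.
beta_draw = np.zeros(500)
beta_draw[:5] = 0.3
beta_draw[17] = 0.01
selected = threshold_support(beta_draw, sigma=0.5, n=200)
print(selected)
```

In this toy draw the cutoff is roughly 0.22, so the five entries of size 0.3 are kept and the spurious entry is discarded.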

The additional condition that $\min_{j\in\xi^{\star}}|\beta^{\star}_{j}|\succ\epsilon_{n}$ in part (c) is called the “beta-min condition” in the literature on Bayesian sparse regression (Castillo et al., 2015; Song and Liang, 2017). Narisetty and He (2014) use another identifiability condition to achieve model selection consistency. Their condition can be shown to be slightly stronger than the beta-min condition in the presence of the minimum sparse eigenvalue condition. To see this point, one can compare their Condition 4.4 to our equation (10) in the proof of Lemma A2, part (d).

4 Factor Model Estimation

This section verifies Assumption 3, which concerns the estimation errors of factor models under standard assumptions. Following Bickel and Levina (2008), we define a uniformity class of positive semi-definite matrices as follows

[TABLE]

Assumption 5.

$\{(\bm{f}_{i},\bm{u}_{i})\}_{1\leq i\leq n}$ are identically (not necessarily independently) distributed as $(\bm{f},\bm{u})$ . $\mathbb{E}\bm{f}=\mathbf{0}$ , $\mathbb{E}\bm{u}=\mathbf{0}$ ; $\operatorname*{\rm Cov}(\bm{f})=\mathbf{I}$ , $\operatorname*{\rm Cov}(\bm{u})=\mathbf{\Sigma}\in\mathcal{S}_{q}^{+}$ with $m_{q}(p)={\textnormal{o}}(\log p)$ for some $0\leq q\leq 1$ , and $\operatorname*{\rm Cov}(\bm{f},\bm{u})=\mathbf{0}$ .

Assumption 6.

All entries in the loading matrix $\mathbf{B}$ are uniformly bounded, i.e., $\|\mathbf{B}\|_{\max}={\textnormal{O}}(1)$ , and all the eigenvalues of $\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\mathbf{B}/p$ are bounded away from $0$ and $\infty$ .

Assumption 7.

The sample covariance matrices of $\mathbf{F}$ and $\mathbf{U}$ converge to the true covariance matrices at rate $\sqrt{\log p/n}$ in the element-wise maximum norm.

[TABLE]

In Assumption 5, $\operatorname*{\rm Cov}(\bm{f})=\mathbf{I}$ is made to avoid the non-identifiability issue of $\mathbf{B}$ and $\bm{f}$ . If rows $\bm{b}_{j},j=1,\dots,p$ of $\mathbf{B}$ are $p$ i.i.d. copies of some $k$ -dimensional distribution then $\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\mathbf{B}/p$ converges almost surely to $\operatorname*{\rm Cov}(\bm{b}_{j})$ as $p\to\infty$ and Assumption 6 holds when $\operatorname*{\rm Cov}(\bm{b}_{j})$ has eigenvalues bounded away from 0 and $\infty$ . Assumptions 5-6 together characterize the “low-rank plus sparse” structure of the covariance matrix of $\bm{x}=\mathbf{B}\bm{f}+\bm{u}$ . That is,

$$\operatorname*{\rm Cov}(\bm{x})=\mathbf{B}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}+\mathbf{\Sigma},$$

where the first part $\mathbf{B}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}$ is of low rank $k$ , and the second part is sparse in the sense that the quantity $\max_{1\leq j\leq p}\sum_{i}|\mathbf{\Sigma}_{ij}|^{q}$ for some $q\in[0,1]$ is ${\textnormal{o}}(\log p)$ . This decomposition has a “spike plus non-spike” structural interpretation as well: the smallest non-zero eigenvalue of $\mathbf{B}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}$ is of order $p$ , while the largest eigenvalue of $\mathbf{\Sigma}$ is of order ${\textnormal{o}}(\log p)$ . This eigen-gap plays the key role in estimating $\mathbf{F}$ and $\mathbf{U}$ .
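The eigen-gap can be seen directly on a population covariance of this form. The sketch below is an illustration, assuming loadings drawn as in the simulation design and the simplest idiosyncratic covariance $\mathbf{\Sigma}=\mathbf{I}$; both choices are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 500, 3
B = rng.uniform(-1.0, 1.0, size=(p, k))   # loadings, as in the simulation design
Sigma = np.eye(p)                          # illustrative idiosyncratic covariance
cov_x = B @ B.T + Sigma                    # low-rank part plus sparse part

eigvals = np.linalg.eigvalsh(cov_x)[::-1]  # eigenvalues in descending order
spiked, bulk = eigvals[:k], eigvals[k:]

print("top-k eigenvalues:", np.round(spiked, 1))
print("largest remaining eigenvalue:", round(float(bulk.max()), 3))
```

The top $k$ eigenvalues scale with $p$, while the remaining ones stay at the level of the eigenvalues of $\mathbf{\Sigma}$.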

Assumption 7 requires that the sample covariances $\mathbf{F}^{\mathrm{\scriptscriptstyle T}}\mathbf{F}/n$ , $\mathbf{U}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}/n$ and $\mathbf{F}^{\mathrm{\scriptscriptstyle T}}\mathbf{U}/n$ converge to their ideal counterparts at an appropriate rate. Kneip and Sarda (2011) provided sufficient conditions for it to hold when $\{(\bm{f}_{i},\bm{u}_{i})\}_{1\leq i\leq n}$ are i.i.d. Fan et al. (2013) established the same rate for stationary and weakly correlated time series. Our recent work on concentration inequalities for general Markov chains (Jiang et al., 2018) can verify this assumption when $\{(\bm{f}_{i},\bm{u}_{i})\}_{1\leq i\leq n}$ are functionals of ergodic Markov chains.

The next theorem summarizes the theoretical results on factor model estimation under Assumptions 5-7. Part (b) of this theorem bounds the difference between the column spaces of $\widehat{\mathbf{F}}$ and $\mathbf{F}$ in terms of principal angles, which is new relative to the previous theory in the literature (Fan et al., 2013) and may be of independent interest. Parts (c) and (d), which are immediate corollaries of part (b), derive Assumption 3.

Definition 2.

The principal angles between two linear spaces spanned by orthonormal column vectors of $\widehat{\mathbf{\Psi}}_{n\times k}$ and $\widetilde{\mathbf{\Psi}}_{n\times k}$ are defined as

[TABLE]

where $d_{1},\dots,d_{k}\in[0,1]$ are the singular values of $\widehat{\mathbf{\Psi}}^{\mathrm{\scriptscriptstyle T}}\widetilde{\mathbf{\Psi}}$ or $\widetilde{\mathbf{\Psi}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{\Psi}}$ .
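Principal angles are straightforward to compute from the singular values $d_{1},\dots,d_{k}$. The following sketch assumes the standard convention that the $j$-th principal angle is $\arccos(d_{j})$; the function name `principal_angles` and the random test bases are hypothetical.

```python
import numpy as np


def principal_angles(Psi_hat: np.ndarray, Psi_tilde: np.ndarray) -> np.ndarray:
    """Principal angles (radians) between the column spaces of two orthonormal bases."""
    d = np.linalg.svd(Psi_hat.T @ Psi_tilde, compute_uv=False)
    d = np.clip(d, 0.0, 1.0)   # guard against rounding slightly outside [0, 1]
    return np.arccos(d)


rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((50, 6)))   # six orthonormal columns
same = principal_angles(Q[:, :3], Q[:, :3])          # identical subspaces
orth = principal_angles(Q[:, :3], Q[:, 3:])          # mutually orthogonal subspaces
```

Identical subspaces give angles equal to zero, and mutually orthogonal subspaces give angles equal to $\pi/2$; small angles between the spaces spanned by $\widehat{\mathbf{F}}$ and $\mathbf{F}$ therefore mean that the two column spaces nearly coincide.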

Theorem 2.

Let $\widetilde{\mathbf{F}}$ consist of $\sqrt{n}$ -scaled left singular vectors of $\mathbf{F}$ , which are orthonormal vectors spanning the column space of $\mathbf{F}$ . Under Assumptions 5-7, the following statements hold.

(a)

Eigenvalue recovery:

[TABLE]

(b)

Eigenspace recovery:

[TABLE]

(c)

Common factor recovery:

[TABLE]

for some nearly orthogonal matrix $\mathbf{H}_{k\times k}$ with $\|\mathbf{H}^{\mathrm{\scriptscriptstyle T}}\mathbf{H}-\mathbf{I}\|={\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ and $\|\mathbf{H}\mathbf{H}^{\mathrm{\scriptscriptstyle T}}-\mathbf{I}\|={\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ .

(d)

Idiosyncratic component recovery:

[TABLE]

5 Simulation Experiments

This section reports simulation results. As a basic case, we set $(n,p,s,k)=(200,500,5,3)$ , and generate $\bm{f}_{i}\overset{i.i.d.}{\sim}\mathcal{N}(\mathbf{0},\mathbf{I}_{k\times k})$ , $\bm{u}_{i}\overset{i.i.d.}{\sim}\mathcal{N}(\mathbf{0},\mathbf{I}_{p\times p})$ and $\bm{b}_{j}\overset{i.i.d.}{\sim}\text{Uniform}[-1,+1]^{k}$ . We set true parameters $\bm{\alpha}^{\star}=(0.8,1.0,1.2)$ , $\xi^{\star}=\{1,2,3,4,5\}$ , $\bm{\beta}^{\star}_{\xi^{\star}}=(0.3,0.3,0.3,0.3,0.3)^{\mathrm{\scriptscriptstyle T}}$ , and $\sigma^{\star}=0.5$ .
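The data-generating process of the basic case can be sketched as follows. This is a minimal illustration rather than the authors' script: the seed, the variable names, and the standard normal noise $\bm{\varepsilon}$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, s, k = 200, 500, 5, 3

F = rng.standard_normal((n, k))            # f_i ~ N(0, I_k)
U = rng.standard_normal((n, p))            # u_i ~ N(0, I_p)
B = rng.uniform(-1.0, 1.0, size=(p, k))    # b_j ~ Uniform[-1, 1]^k
X = F @ B.T + U                            # observed covariates, x_i = B f_i + u_i

alpha_star = np.array([0.8, 1.0, 1.2])
beta_star = np.zeros(p)
beta_star[:s] = 0.3                        # xi* = {1, ..., 5} in 1-based indexing
sigma_star = 0.5

eps = rng.standard_normal(n)               # noise, assumed standard normal
Y = F @ alpha_star + U @ beta_star + sigma_star * eps
```

Only $\mathbf{X}$ and $\mathbf{Y}$ would be passed to the estimation procedure; $\mathbf{F}$ and $\mathbf{U}$ are latent.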

For prior (6), we choose the inverse-gamma density $g$ with shape $1$ and scale $1$ , the Gaussian density $h(z)=e^{-z^{2}/2}/\sqrt{2\pi}$ and set hyperparameters $s_{0}=1$ and $\tau_{j}=\|\widehat{\mathbf{U}}_{j}\|/\sqrt{n}$ . Starting from $(\sigma,\bm{\alpha},\bm{\beta})=(1.0,\mathbf{0},\mathbf{0})$ , we iterate a Gibbs sampler $T=20$ times and drop the first $T/2=10$ iterations as the burn-in period. The implementation details of the Gibbs sampler are given in the appendix.

The pseudo-posterior distribution is evaluated in terms of five measures. The posterior mean of $\bm{\beta}$ is compared to $\bm{\beta}^{\star}$ in terms of $\ell_{2}$ estimation error. The model selection rate and the sure screening rate are also computed. The former is the proportion of the posterior samples that select the true model, i.e., $\xi=\xi^{\star}$ , and the latter is the proportion of the posterior samples that select all nonzero coefficients, i.e., $\xi\supseteq\xi^{\star}$ . To evaluate the adaptivity to unknown sparsity $s$ , we report the average model size $|\xi|$ . To evaluate the adaptivity to unknown standard deviation $\sigma^{\star}$ , the posterior mean of $\sigma^{2}$ is compared to $\sigma^{\star 2}$ in terms of relative estimation error. These measures are evaluated over 100 replicates of the datasets, and their averages are reported.
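The five measures can be computed from retained draws of $(\bm{\beta},\sigma^{2})$ along the following lines. This is a sketch under assumptions: the function name `evaluate_samples`, the definition of relative error as $|\widehat{\sigma^{2}}-\sigma^{\star 2}|/\sigma^{\star 2}$, and the toy draws (which are not simulation output) are all hypothetical.

```python
import numpy as np


def evaluate_samples(beta_samples, sigma2_samples, beta_star, sigma_star):
    """Five summary measures for one replicate, from T retained pseudo-posterior draws."""
    true_support = beta_star != 0
    supports = beta_samples != 0                           # T x p boolean matrix
    exact = np.all(supports == true_support, axis=1)       # draws with xi == xi*
    screened = np.all(supports[:, true_support], axis=1)   # draws with xi containing xi*
    rel_err = abs(sigma2_samples.mean() - sigma_star**2) / sigma_star**2
    return {
        "l2_error": float(np.linalg.norm(beta_samples.mean(axis=0) - beta_star)),
        "model_selection_rate": float(exact.mean()),
        "sure_screening_rate": float(screened.mean()),
        "average_model_size": float(supports.sum(axis=1).mean()),
        "sigma2_relative_error": float(rel_err),
    }


# Toy draws with p = 4 and true support {0, 1} (0-based indices).
beta_star = np.array([0.3, 0.3, 0.0, 0.0])
beta_samples = np.array([
    [0.3, 0.3, 0.0, 0.0],   # hits the true model
    [0.3, 0.3, 0.1, 0.0],   # strict superset of the true model
    [0.3, 0.0, 0.0, 0.0],   # misses one true variable
    [0.3, 0.3, 0.0, 0.0],   # hits the true model
])
sigma2_samples = np.array([0.25, 0.30, 0.20, 0.25])
measures = evaluate_samples(beta_samples, sigma2_samples, beta_star, sigma_star=0.5)
```

In the toy example two of four draws hit the true model and three of four contain it, so the model selection rate is 0.5 and the sure screening rate is 0.75.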

For comparison, the factor-adjusted lasso method is implemented using the R package glmnet (Friedman et al., 2010). The $\ell_{1}$ -penalty hyperparameters of the lasso methods are tuned by 10-fold cross-validation. Since the generic Bayesian / lasso with $\mathbf{X}$ as covariates can be seen as the factor-adjusted Bayesian / lasso with the underestimate $\widehat{k}=0$ of $k=3$ , we also include them in the comparison.

5.1 Comparison of four methods, and insensitivity to misestimates of $k$

Table 1 summarizes the five measures of four methods in the basic case. Results show that the factor-adjusted Bayesian method outperforms the factor-adjusted lasso method in the tasks of $\bm{\beta}$ estimation and model selection. The poor performance of the factor-adjusted lasso method may partly result from the less satisfactory hyperparameter tuning procedure implemented in the R package glmnet.

We feed the factor-adjusted methods with the various estimates $\widehat{k}=3,6,9,12$ , and observe that their performances are insensitive to the overestimate of $k$ (Table 1). In the case that there is no correlation among $\mathbf{X}$ , i.e., $k=0$ , the factor-adjusted Bayesian method performs slightly worse than the generic Bayesian method (Table 2).

We emphasize that the meaning of the model selection rate for the Bayesian methods is slightly different from that for the frequentist methods. For example, a 50% model selection rate given by a frequentist method means that it selects the true sparse model in 50 out of 100 replicates of the dataset. In contrast, a 90% model selection rate given by a Bayesian method means that, on average, 9 of every 10 posterior samples of parameters hit the true sparse model in a single replicate of the dataset. In the simulation experiments reported in Tables 1-2, at least 7 of every 10 pseudo-posterior samples obtained by our method hit the true sparse model in each of the 100 replicates of the dataset.

5.2 Scalability as $n,p,s$ increase

We vary the sample size $n$ , the dimensionality $p$ and the sparsity $s$ in the basic case, and test the scalability of the proposed methodology.

In Figure 1(a), we fix all parameters in the basic case but vary $n=100$ , $150$ , $200$ , $250$ , $300$ , $350$ . In Figure 1(b), we fix all parameters in the basic case but vary $p=200$ , $300$ , $400$ , $500$ , $600$ , $700$ . In Figure 1(c), we fix all parameters in the basic case but vary $s=1$ , $3$ , $5$ , $7$ , $9$ , $11$ , $13$ , $15$ . For factor-adjusted methods, $\widehat{k}=k=3$ is used.

We observe that our method outperforms the other three methods in terms of $\beta$ estimation error and model selection rate under each combination of $(n,p,s)$ , and achieves comparable relative error of $\sigma^{2}$ to the factor-adjusted lasso method.

5.3 Estimating the standard regression model

Recall that, when $\mathbf{X}$ admits a factor structure (3), the standard regression model (1) is a special case of the factor-adjusted regression model (4) with the parameter constraint $\bm{\alpha}=\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\bm{\beta}$ . We expect that the factor-adjusted Bayesian method solves model (1) as well. To verify this expectation, we set $\bm{\alpha}^{\star}=\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\bm{\beta}^{\star}$ (or equivalently $\mathbf{Y}=\mathbf{X}\bm{\beta}^{\star}+\sigma^{\star}\bm{\varepsilon}$ ), and test four methods on simulation datasets.
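The equivalence between the two formulations follows from $\mathbf{X}\bm{\beta}=(\mathbf{F}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}+\mathbf{U})\bm{\beta}=\mathbf{F}(\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\bm{\beta})+\mathbf{U}\bm{\beta}$ . A short numerical check of this identity is sketched below; the seed, variable names, and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 200, 500, 3
F = rng.standard_normal((n, k))
U = rng.standard_normal((n, p))
B = rng.uniform(-1.0, 1.0, size=(p, k))
X = F @ B.T + U                                  # factor structure (3)

beta_star = np.zeros(p)
beta_star[:5] = 0.3
alpha_star = B.T @ beta_star                     # the constraint alpha = B^T beta

mean_standard = X @ beta_star                    # regression mean under model (1)
mean_adjusted = F @ alpha_star + U @ beta_star   # regression mean under model (4)

print(np.allclose(mean_standard, mean_adjusted))
```

The two regression means agree up to floating-point rounding, so data generated from model (1) are also data from model (4) with $\bm{\alpha}^{\star}=\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\bm{\beta}^{\star}$ .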

We see that the factor-adjusted Bayesian method does solve model (1). Interestingly, while the factor adjustment added to the lasso method significantly increases the model selection rate from 13% to roughly 50%, the generic Bayesian method works comparably well or even better than the factor-adjusted Bayesian method. We will discuss this phenomenon in the discussion section.

6 Predicting U.S. Bond Risk Premia

This section applies our method to predict U.S. bond risk premia with a large panel of macroeconomic variables. The response variables are monthly U.S. bond risk premia with maturity of $m=2,3,4,5$ years spanning the period from January 1964 to December 2003 (Ludvigson and Ng, 2009). The $m$ -year bond risk premium at period $i+1$ is defined as the (log) holding return from buying an $m$ -year bond at period $i$ and selling it as an $(m-1)$ -year bond at period $i+1$ , in excess of the (log) return on a one-year bond bought at period $i$ . The covariates are $p=131$ macroeconomic variables collected in the FRED-MD database (McCracken and Ng, 2016) during the same period.

The $p=131$ covariates over 480 months are strongly correlated. The scree plot of PCA of these covariates (Figure 2) shows that the first principal component accounts for 55.9% of the total variance, and that the first 5 principal components account for 89.7% of the total variance.

We consider rolling window regression and next value prediction. Specifically, we regress a U.S. bond risk premium on the macroeconomic variables from the previous month. For each time window of size $n=100$ ahead of month $t=n+2,\dots,480$ , we fit

[TABLE]

and do out-of-sample prediction

[TABLE]

The regression function $f$ is fitted as $\widehat{f}$ by one of the following: the generic lasso method, the factor-adjusted lasso method, the generic Bayesian method, the factor-adjusted Bayesian method, or the principal component regression (PCR) method (Ludvigson and Ng, 2009). For the factor-adjusted methods, the number of common factors $k$ is estimated by (5). For the Bayesian methods, we set $s_{0}=10$ in prior (6). For PCR, the top eight principal components are included in the regression model, in a similar vein to Ludvigson and Ng (2009). The R package pls (Wehrens and Mevik, 2007) is used to implement PCR.

The prediction performance is evaluated by the out-of-sample $R^{2}$ , which is computed as follows.

[TABLE]

where $y_{t}$ is one of two-year, three-year, four-year and five-year U.S. bond risk premia, $\widehat{y}_{t}$ is the prediction of $y_{t}$ by one of five methods in comparison, and $\bar{y}_{t}$ is the average of $\{y_{t-n},\dots,y_{t-1}\}$ .
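A minimal sketch of this metric is given below, assuming the usual form $R^{2}=1-\sum_{t}(y_{t}-\widehat{y}_{t})^{2}/\sum_{t}(y_{t}-\bar{y}_{t})^{2}$ with the rolling mean $\bar{y}_{t}$ as benchmark. The function name `out_of_sample_r2`, the 0-based index range, and the synthetic series are assumptions; the index range of the paper's rolling scheme may differ slightly.

```python
import numpy as np


def out_of_sample_r2(y: np.ndarray, y_pred: np.ndarray, n: int) -> float:
    """Out-of-sample R^2 against a rolling-mean benchmark.

    y      : full response series (0-based array)
    y_pred : predictions aligned with y; only entries at index >= n are used
    n      : window size; the benchmark for y[t] is mean(y[t-n:t])
    """
    idx = np.arange(n, len(y))
    y_bar = np.array([y[t - n:t].mean() for t in idx])
    sse_model = np.sum((y[idx] - y_pred[idx]) ** 2)
    sse_bench = np.sum((y[idx] - y_bar) ** 2)
    return float(1.0 - sse_model / sse_bench)


# Synthetic series (not the bond data) to exercise the two reference cases.
rng = np.random.default_rng(6)
y = rng.standard_normal(150)
n = 100

r2_perfect = out_of_sample_r2(y, y.copy(), n)   # perfect predictions

rolling_mean = np.full_like(y, np.nan)          # entries before index n are unused
for t in range(n, len(y)):
    rolling_mean[t] = y[t - n:t].mean()
r2_benchmark = out_of_sample_r2(y, rolling_mean, n)
```

Perfect predictions give $R^{2}=1$ , and predicting with the rolling mean itself gives $R^{2}=0$ , so a positive value means the method beats the historical-average benchmark.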

Table 4 summarizes the out-of-sample $R^{2}$ that the five methods achieve on this task. Table 5 reports the average size of the sparse models they select. We observe that the factor-adjusted Bayesian method and the factor-adjusted lasso method achieve higher out-of-sample $R^{2}$ than the other methods. However, the factor-adjusted Bayesian method selects much sparser models than the factor-adjusted lasso method.

7 Discussion

We propose a factor-adjusted regression model to handle the linear relationship between the response variable and possibly highly correlated covariates. We decompose the predictors into common factors and idiosyncratic components, where the common factors explain most of the variations, and assume that all common factors but only a small number of idiosyncratic components contribute to the response. The corresponding Bayesian methodology is then developed for estimating such a model. Theoretical results suggest that the proposed methodology can consistently estimate the factor-adjusted model and thus obtain consistent predictions, under an easy-to-satisfy sparse eigenvalue condition on the idiosyncratic components instead of the original covariates.

Our factor-adjusted model covers the standard linear model as a sub-model with the side constraint. Thus, our proposed methodology can easily handle the case when the standard linear regression model is assumed to be the underlying model. In simulation studies on the sub-model, we find that the factor adjustment greatly improves the performance of lasso, while the generic Bayesian sparse regression is comparable to the factor-adjusted Bayesian sparse regression in terms of estimation error and model selection rate (Table 3). This suggests a fundamental difference between the frequentist sparse regression method and the Bayesian sparse regression method. Indeed, one can prove under Assumptions 1-2, 5-7 that

[TABLE]

If $s={\textnormal{O}}(1)$ , these two terms are of constant order, and then a similar argument to the proof of Theorem 1 would establish the convergence and model selection consistency of the generic Bayesian regression on standard regression model (1).

Nonetheless, we recommend the factor-adjusted Bayesian regression on model (4) over the generic Bayesian regression on model (1) for three reasons. First, the theoretical analyses of the former allow $s$ to grow with $n$ , whereas the latter requires fixed $s$ . Second, model (4) provides more flexibility than its sub-model (1) in the regression analyses and can potentially extract more explanatory power from the data. On the real dataset of U.S. bond risk premia, the factor-adjusted Bayesian regression achieves 1.0%-3.0% more out-of-sample $R^{2}$ with one or two fewer variables (Tables 4-5). Third, in the no-correlation case (although this is unlikely in practice), the factor-adjusted Bayesian regression pays a negligible price for model misspecification (Table 2).

Acknowledgement

We would like to thank Yun Yang for helpful discussions.

Appendix A Technical Proofs for Bayesian Sparse Regression

This appendix collects technical proofs for Theorem 1.

A.1 Proof of Theorem 1

The proofs of the three parts use the same techniques and have a similar structure. First, we observe under Assumptions 1-3 that, for some constant $C_{4}$ and any constant $M_{0}$ ,

[TABLE]

hold with probability approaching $1$ . The first three claimed bounds are directly taken from Assumption 3. The last two claimed bounds follow from Weyl’s inequality. For any model $\xi$ of size at most $(M_{0}+1)s$ , the singular values of $\widehat{\mathbf{U}}_{\xi}$ differ from those of $\mathbf{U}_{\xi}$ by at most

[TABLE]

This implies that

[TABLE]

We thereafter need to show that the conditional probabilities of

[TABLE]

given any realization of $(\mathbf{F},\mathbf{U},\mathbf{X},\widehat{\mathbf{F}},\widehat{\mathbf{U}},\mathbf{H})$ satisfying (8), vanish as $n\to\infty$ . Recall that

[TABLE]

where $\xi$ and $\xi^{\prime}$ are supports of $\bm{\beta}$ and $\bm{\beta}^{\prime}$ , respectively, and $\epsilon_{n}=\sqrt{s\log p/n}$ .

Consider the conditional probability of event (a). Since $\widehat{\pi}(\sigma,\bm{\alpha},\bm{\beta}|\mathbf{X},\mathbf{Y})$ depends on $\mathbf{X}$ through $\widehat{\mathbf{F}}$ and $\widehat{\mathbf{U}}$ , we have

[TABLE]

Next, by a change of measure trick and Cauchy-Schwarz inequality,

[TABLE]

We proceed to bound the two integrals separately. The logarithm of the second integral is the Rényi divergence of order 2 from $\mathcal{N}(\widehat{\mathbf{F}}\mathbf{H}\bm{\alpha}^{\star}+\widehat{\mathbf{U}}\bm{\beta}^{\star},\sigma^{\star 2}\mathbf{I})$ to $\mathcal{N}(\mathbf{F}\bm{\alpha}^{\star}+\mathbf{U}\bm{\beta}^{\star},\sigma^{\star 2}\mathbf{I})$ . It follows that

[TABLE]

for some constant $C_{4}^{\prime}$ , where (8) derives $\|(\widehat{\mathbf{F}}\mathbf{H})_{j}-\mathbf{F}_{j}\|\leq C_{4}\sqrt{\log p}$ and $\|\widehat{\mathbf{U}}_{\xi^{\star}}-\mathbf{U}_{\xi^{\star}}\|\leq C_{4}\sqrt{s\log p}$ , and Assumption 4 controls $\|\bm{\alpha}^{\star}\|={\textnormal{O}}(1)$ and $\|\bm{\beta}^{\star}\|={\textnormal{O}}(1)$ .
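For two Gaussians with a common covariance $\sigma^{2}\mathbf{I}$ , the order-2 Rényi divergence has the closed form $\log\int p^{2}/q=\|\bm{\mu}_{p}-\bm{\mu}_{q}\|^{2}/\sigma^{2}$ , so the bound above reduces to controlling the squared distance between the two mean vectors. The sketch below checks this standard closed form numerically in one dimension; the function name `renyi2_gaussian` and the numerical values are hypothetical.

```python
import numpy as np


def renyi2_gaussian(mu_p: np.ndarray, mu_q: np.ndarray, sigma: float) -> float:
    """Order-2 Renyi divergence between N(mu_p, sigma^2 I) and N(mu_q, sigma^2 I)."""
    return float(np.sum((mu_p - mu_q) ** 2) / sigma**2)


# One-dimensional check of log( integral of p(y)^2 / q(y) dy ) by a Riemann sum.
mu_p, mu_q, sigma = 0.3, -0.2, 0.5
y = np.linspace(-10.0, 10.0, 200_001)
dy = y[1] - y[0]
log_norm = np.log(sigma * np.sqrt(2 * np.pi))
log_p = -0.5 * ((y - mu_p) / sigma) ** 2 - log_norm
log_q = -0.5 * ((y - mu_q) / sigma) ** 2 - log_norm
numeric = float(np.log(np.sum(np.exp(2 * log_p - log_q)) * dy))

closed_form = renyi2_gaussian(np.array([mu_p]), np.array([mu_q]), sigma)
print(numeric, closed_form)
```

With mean differences of squared norm of order $s\log p$ , as implied by (8) and Assumption 4, the divergence is of order $s\log p$ .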

On the other hand, let $\widehat{\mathbb{P}}_{(\sigma,\bm{\alpha},\bm{\beta})}$ denote the probability measure under which $\mathbf{Y}\sim\mathcal{N}(\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta},\sigma^{2}\mathbf{I})$ , then

[TABLE]

which concerns the posterior convergence rate of Bayesian sparse regression in model $\mathbf{Y}\sim\mathcal{N}(\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta},\sigma^{2}\mathbf{I})$ with fixed design matrix $[\widehat{\mathbf{F}},\widehat{\mathbf{U}}]$ to identify the true parameter $(\sigma^{\star},\mathbf{H}\bm{\alpha}^{\star},\bm{\beta}^{\star})$ .

This fixed-design regression is analyzed by Theorem 3. By part (a) of Theorem 3, we can find $M_{0},M_{1},M_{2},C_{1},C_{1}^{\prime}$ such that $C_{1}^{\prime}>C_{4}^{\prime}$ and the first integral is at most $e^{-C_{1}^{\prime}s\log p}$ . Combining the bounds on the two integrals completes the proof of part (a) of Theorem 1. Using similar arguments, parts (b) and (c) of Theorem 3 derive parts (b) and (c) of Theorem 1, respectively.

A.2 Bayesian Sparse Regression with Fixed Design

Theorem 3.

Recall that $\widehat{\mathbb{P}}_{(\sigma,\bm{\alpha},\bm{\beta})}$ denotes the probability measure under the model $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta}+\sigma\bm{\varepsilon}$ with fixed design $[\widehat{\mathbf{F}},\widehat{\mathbf{U}}]$ . Suppose $[\widehat{\mathbf{F}},\widehat{\mathbf{U}}]$ and true parameters satisfy

[TABLE]

then the following statements hold.

(a)

(estimation error rate) For any constants $C_{1},C_{1}^{\prime}$ , there exist sufficiently large $M_{0},M_{1},M_{2}$ such that

[TABLE]

(b)

(prediction error rate) For any constants $C_{2},C_{2}^{\prime}$ , there exist sufficiently large $M_{3}$ such that

[TABLE]

(c)

(model selection consistency) Suppose $\min_{j\in\xi^{\star}}|\beta^{\star}_{j}|\succ\epsilon_{n}$ in addition. For any constants $C_{3},C_{3}^{\prime}$ , there exist sufficiently large $M_{0},M_{1},M_{2}$ such that

[TABLE]

Remark. To apply Theorem 3 in the proof of Theorem 1, we replace $(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})$ with $(\sigma^{\star},\mathbf{H}\bm{\alpha}^{\star},\bm{\beta}^{\star})$ , and check that $\|\mathbf{H}\bm{\alpha}^{\star}\|={\textnormal{O}}(1)$ in Theorem 1.

The following proposition, which is also (Barron, 1998, Lemma 6) and (Song and Liang, 2017, Lemma A4), is the central technique to prove Theorem 3.

Proposition 1.

Consider a parametric model $\{P_{\bm{\theta}}\}_{\bm{\theta}\in\Theta}$ . Let $\Theta_{0n}$ and $\Theta_{n}$ be two subsets of the parameter space $\Theta$ . Let $\{\mathcal{D}_{n}\}_{n\geq 1}$ be a sequence of data generations according to true parameter $\bm{\theta}^{\star}$ . Let $\pi(\bm{\theta})$ be a prior distribution over $\Theta$ . If

(1)

$\pi(\Theta_{0n})\leq\delta_{0n}$ ,

(2)

there exists a test function $\phi_{n}(\mathcal{D}_{n})$ such that

[TABLE]

(3)

and

[TABLE]

then for any $\delta_{3n}$ ,

[TABLE]

The intuition behind this proposition is that any less preferred $\bm{\theta}\in\Theta_{0n}\cup\Theta_{n}$ should either be excluded by the prior (for $\bm{\theta}\in\Theta_{0n}$ ) or be distinguished from $\bm{\theta}^{\star}$ by a uniformly powerful test $\phi_{n}$ (for $\bm{\theta}\in\Theta_{n}$ ).

Lemmas A1-A3 are useful to verify the three conditions in Proposition 1, respectively. Their proofs are collected in the next subsection.

Lemma A1.

(Theorem 1.1 in (Pelekis, 2016)) For a Binomial distributed random variable $\texttt{Binomial}(p,\mu)$ , if $p\mu<m\leq p-1$ then

[TABLE]

where $\widetilde{m}=\lfloor(m-p\mu)/(1-\mu)\rfloor<m$ .

Lemma A2.

Under the same assumptions as Theorem 3,

(a)

For

[TABLE]

we have

[TABLE]

(b)

For

[TABLE]

we have

[TABLE]

(c)

For

[TABLE]

we have

[TABLE]

(d)

Suppose $\min_{j\in\xi^{\star}}|\beta_{j}^{\star}|\geq M_{4}\sigma^{\star}\epsilon_{n}$ in addition. For

[TABLE]

we have

[TABLE]

(e)

For

[TABLE]

we have

[TABLE]

Lemma A3.

Under the same assumptions as Theorem 3,

[TABLE]

for sufficiently large $C_{5}$ and $C_{5}^{\prime}$ .

Proof of Theorem 3, part (a).

We verify the three conditions in Proposition 1 one by one. Let

[TABLE]

where $\Theta_{1n},\Theta_{2n},\phi_{1n},\phi_{2n}$ are defined in Lemma A2. Then $\Theta_{0n}\cup\Theta_{n}=\Theta_{0n}\cup\Theta_{1n}\cup\Theta_{2n}=A^{c}(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star},M_{0},M_{1},M_{2},\epsilon_{n})$ . Applying Lemma A1 yields that

[TABLE]

for sufficiently large $M_{0}$ . From parts (a),(b) of Lemma A2 and Lemma A5, it follows that

[TABLE]

By Lemma A3, the third condition in Proposition 1 holds asymptotically with

[TABLE]

for any sufficiently large $C_{5},C_{5}^{\prime}$ . For any $C_{1}$ , $C_{1}^{\prime}$ , we can find sufficiently large $M_{0},M_{1},M_{2},C_{5},C_{5}^{\prime}$ and suitable $\delta_{3n}$ such that

[TABLE]

to complete the proof. ∎

Proof of Theorem 3, part (b).

We use an argument similar to the proof of Theorem 3, part (a), but with different $\Theta_{n}$ and $\phi_{n}$ . Let

[TABLE]

where $\Theta_{1n},\Theta_{3n},\phi_{1n},\phi_{3n}$ are defined in Lemma A2. Then

[TABLE]

The second condition in Proposition 1 follows from parts (a),(c) of Lemma A2 and Lemma A5.

[TABLE]

∎

Proof of Theorem 3, part (c).

We use an argument similar to the proof of Theorem 3, part (a), but with different $\Theta_{n}$ and $\phi_{n}$ . Let

[TABLE]

where $\Theta_{1n},\Theta_{4n},\Theta_{5n},\phi_{1n},\phi_{4n},\phi_{5n}$ are defined in Lemma A2. Then $\Theta_{0n}\cup\Theta_{n}=\Theta_{0n}\cup\Theta_{1n}\cup\Theta_{4n}\cup\Theta_{5n}=A^{c}(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star},M_{0},M_{1},M_{2},\epsilon_{n})\cup\{\xi\not\supseteq\xi^{\star}\}$ . The second condition in Proposition 1 follows from parts (a),(d),(e) of Lemma A2 and Lemma A5.

[TABLE]

∎

A.3 Technical Proofs of Lemmas

The proofs of Lemmas A2-A3 invoke a few preliminary results. We list them as follows.

Lemma A4 (Probability bounds of chi-squared random variables).

Let $\chi^{2}_{d}$ be a chi-squared random variable with $d$ degrees of freedom.

(a)

For any $\epsilon_{n}$ such that $n\epsilon_{n}>d_{n}$ ,

[TABLE]

In addition, if $\epsilon_{n}\to 0$ but $n\epsilon_{n}\succ d_{n}$ ,

[TABLE]

(b)

[TABLE]

In addition, if $t_{n}\succ d_{n}$ then for any $\widetilde{t}_{n}$ such that $\widetilde{t}_{n}/t_{n}\to 1$

[TABLE]

Proof.

For part (a), the first assertion follows from the sub-exponential tail of the chi-squared distribution, and the second assertion is due to

[TABLE]

For part (b), the first assertion is a corollary of (Laurent and Massart, 2000, Lemma 1), and the second assertion follows from

[TABLE]

∎
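As a hedged illustration of the inequality cited for part (b): Lemma 1 of Laurent and Massart (2000), specialized to a chi-squared variable with $d$ degrees of freedom, gives $\mathbb{P}(\chi^{2}_{d}-d\geq 2\sqrt{dx}+2x)\leq e^{-x}$ and $\mathbb{P}(d-\chi^{2}_{d}\geq 2\sqrt{dx})\leq e^{-x}$ for every $x>0$. The sketch below checks both bounds by simulation. The values of `d`, `x`, the number of draws and the seed are arbitrary choices for illustration and do not come from the paper.

```python
import math
import random

def chi2_sample(d, rng):
    """Draw one chi-squared(d) variate as a sum of d squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))

def tail_frequencies(d, x, n_draws=5000, seed=0):
    """Empirical frequencies of the upper and lower Laurent-Massart deviation events."""
    rng = random.Random(seed)
    upper_cut = d + 2.0 * math.sqrt(d * x) + 2.0 * x  # threshold for the upper tail
    lower_cut = d - 2.0 * math.sqrt(d * x)            # threshold for the lower tail
    upper = 0
    lower = 0
    for _ in range(n_draws):
        z = chi2_sample(d, rng)
        upper += z >= upper_cut
        lower += z <= lower_cut
    return upper / n_draws, lower / n_draws

d, x = 20, 1.0                      # illustrative values only
upper_freq, lower_freq = tail_frequencies(d, x)
bound = math.exp(-x)                # common bound e^{-x} for both tails
print(upper_freq, lower_freq, bound)
```

Both empirical frequencies should fall well below $e^{-x}$, which reflects that the bound is convenient rather than tight.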

Lemma A5.

For a collection of subspaces $\{\Theta_{j}\}_{j=1}^{m}$ and a collection of test functions $\{\varphi_{j}\}_{j=1}^{m}$

[TABLE]

Proof.

[TABLE]

∎

Lemma A6 (Part of Corollary 2.4 in (Liu, 2005)).

Let

[TABLE]

be a $p\times p$ positive semi-definite matrix with $q\times q$ non-singular principal sub-matrix $\mathbf{G}_{11}$ then

[TABLE]
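The display of Lemma A6 is not reproduced here, so the following is only a sketch of a standard fact of the same kind: for a positive definite $\mathbf{G}$ with leading block $\mathbf{G}_{11}$, the eigenvalues of the Schur complement $\mathbf{G}_{22}-\mathbf{G}_{21}\mathbf{G}_{11}^{-1}\mathbf{G}_{12}$ lie between the smallest and largest eigenvalues of $\mathbf{G}$. The code also checks that, for a Gram matrix, this Schur complement equals $\mathbf{U}_{2}^{\mathrm{T}}(\mathbf{I}-\mathbf{U}_{1}\mathbf{U}_{1}^{\dagger})\mathbf{U}_{2}$, the form used in the proof of Lemma A2, part (d). The matrices `U1`, `U2` and all sizes are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, r = 50, 3, 4                      # hypothetical sizes of the two column blocks
U1 = rng.standard_normal((n, q))        # stand-in for the columns in the first block
U2 = rng.standard_normal((n, r))        # stand-in for the columns in the second block
U = np.hstack([U1, U2])

G = U.T @ U                             # Gram matrix, positive definite for generic U
G11, G12 = G[:q, :q], G[:q, q:]
G21, G22 = G[q:, :q], G[q:, q:]
schur = G22 - G21 @ np.linalg.solve(G11, G12)

# The same matrix written with the projection onto the column space of U1.
P1 = U1 @ np.linalg.pinv(U1)
schur_proj = U2.T @ (np.eye(n) - P1) @ U2

eig_G = np.linalg.eigvalsh(G)           # eigenvalues in ascending order
eig_S = np.linalg.eigvalsh(schur)
print(eig_G[0], eig_S[0], eig_S[-1], eig_G[-1])
```

The lower bound is the one that matters in the proofs: it lets a minimum-eigenvalue condition on the larger Gram matrix carry over to the Schur complement.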

Proof of Lemma A2, part (a).

Under the null hypothesis, write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}^{\star}+\widehat{\mathbf{U}}\bm{\beta}^{\star}+\sigma^{\star}\bm{\varepsilon}$ with $\bm{\beta}^{\star}_{\xi^{\star c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . For (2), we observe that projection matrices

[TABLE]

for nested models $\xi^{\prime}\subseteq\xi^{\prime\prime}$ , and thus the term $\bm{\varepsilon}^{\mathrm{\scriptscriptstyle T}}\left[\mathbf{I}-\widehat{\mathbf{F}}\widehat{\mathbf{F}}^{\dagger}-\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}^{\dagger}\right]\bm{\varepsilon}$ achieves its maximum value at any $\xi\subseteq\xi^{\star}$ and its minimum value at some $\xi$ such that $|\xi|=M_{0}s$ and $\xi\cap\xi^{\star}=\emptyset$ . (3) uses the fact that

[TABLE]

Applying Lemma A4, part (a) yields

[TABLE]

Under the alternative hypothesis, observe that $\phi_{1n}=\max_{\xi^{\prime}:~{}|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s}\phi_{1n}^{\xi^{\prime}}$ , where

[TABLE]

Using Lemma A5,

[TABLE]

For any $\xi^{\prime}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and any $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{1n}\cap\{\xi=\xi^{\prime}\}$ , write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta}+\sigma\bm{\varepsilon}$ with $\bm{\beta}_{\xi^{c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma,\bm{\alpha},\bm{\beta})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . (2) plugs in the restriction

[TABLE]

from the definition of $\Theta_{1n}$ . (3) uses the fact that

[TABLE]

again. Since the final bound in the last display is uniform over all $\xi^{\prime}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and all $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{1n}\cap\{\xi=\xi^{\prime}\}$ , applying Lemma A4, part (a) yields

[TABLE]

∎

Proof of Lemma A2, part (b).

Under the null hypothesis, write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}^{\star}+\widehat{\mathbf{U}}\bm{\beta}^{\star}+\sigma^{\star}\bm{\varepsilon}$ with $\bm{\beta}^{\star}_{\xi^{\star c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . (2) is due to

[TABLE]

For (3), we observe that projection matrices

[TABLE]

for nested models $\xi^{\prime}\subseteq\xi^{\prime\prime}$ , and thus the term $\bm{\varepsilon}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}^{\dagger}\bm{\varepsilon}$ achieves its maximum value at some $\xi$ such that $|\xi|=M_{0}s$ and $\xi\cap\xi^{\star}=\emptyset$ . (4) uses the fact that

[TABLE]

Applying Lemma A4, part (b) yields

[TABLE]

Under the alternative hypothesis, observe that $\phi_{2n}=\max_{\xi^{\prime}:~{}|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s}\phi_{2n}^{\xi^{\prime}}$ , where

[TABLE]

Using Lemma A5,

[TABLE]

For any $\xi^{\prime}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and any $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{2n}\cap\{\xi=\xi^{\prime}\}$ , write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta}+\sigma\bm{\varepsilon}$ with $\bm{\beta}_{\xi^{c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma,\bm{\alpha},\bm{\beta})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . (2) plugs in the restrictions

[TABLE]

from the definition of $\Theta_{2n}$ . (3) uses an argument similar to the one used for the null hypothesis. Since the final bound in the last display is uniform over all $\xi^{\prime}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and all $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{2n}\cap\{\xi=\xi^{\prime}\}$ , applying Lemma A4, part (b) yields

[TABLE]

∎

Proof of Lemma A2, part (c).

Under the null hypothesis, write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}^{\star}+\widehat{\mathbf{U}}\bm{\beta}^{\star}+\sigma^{\star}\bm{\varepsilon}$ with $\bm{\beta}^{\star}_{\xi^{\star c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . For (2), we observe that projection matrices

[TABLE]

for nested models $\xi^{\prime}\subseteq\xi^{\prime\prime}$ , and thus the term $\bm{\varepsilon}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}^{\dagger}\bm{\varepsilon}$ achieves its maximum value at some $\xi$ such that $|\xi|=M_{0}s$ and $\xi\cap\xi^{\star}=\emptyset$ . (3) uses the fact that

[TABLE]

Applying Lemma A4, part (b) yields

[TABLE]

Under the alternative hypothesis, observe that $\phi_{3n}=\max_{\xi^{\prime}:~{}|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s}\phi_{3n}^{\xi^{\prime}}$ , where

[TABLE]

Using Lemma A5,

[TABLE]

For any $\xi^{\prime}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and any $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{3n}\cap\{\xi=\xi^{\prime}\}$ , write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta}+\sigma\bm{\varepsilon}$ with $\bm{\beta}_{\xi^{c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma,\bm{\alpha},\bm{\beta})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . (2) plugs in the restrictions

[TABLE]

from the definition of $\Theta_{3n}$ . (3) uses the fact that

[TABLE]

Since the final bound in the last display is uniform over all $\xi^{\prime}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and all $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{3n}\cap\{\xi=\xi^{\prime}\}$ , applying Lemma A4, part (b) yields

[TABLE]

∎

Proof of Lemma A2, part (d).

We first show that

[TABLE]

Indeed, for any $\xi\not\supseteq\xi^{\star}$ s.t. $|\xi\setminus\xi^{\star}|\leq M_{0}s$ ,

[TABLE]

Note that $\widehat{\mathbf{U}}_{\xi^{\star}\setminus\xi}^{\mathrm{\scriptscriptstyle T}}\left(\mathbf{I}-\widehat{\mathbf{U}}_{\xi}\widehat{\mathbf{U}}_{\xi}^{\dagger}\right)\widehat{\mathbf{U}}_{\xi^{\star}\setminus\xi}$ is the Schur complement of the principal submatrix $\widehat{\mathbf{U}}_{\xi^{\star}\setminus\xi}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}_{\xi^{\star}\setminus\xi}$ in the matrix $\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}_{\xi\cup\xi^{\star}}$ . Thus, by Lemma A6,

[TABLE]

It further implies that

[TABLE]

Under the null hypothesis, write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}^{\star}+\widehat{\mathbf{U}}\bm{\beta}^{\star}+\sigma^{\star}\bm{\varepsilon}$ with $\bm{\beta}^{\star}_{\xi^{\star c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . (2) plugs in (10). (3) is due to the fact that

[TABLE]

(4) uses the fact that

[TABLE]

Applying Lemma A4, part (b) yields

[TABLE]

Under the alternative hypothesis, observe that $\phi_{4n}=\max_{\xi^{\prime}\not\supseteq\xi^{\star}:~{}|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s}\phi_{4n}^{\xi^{\prime}}$ , where

[TABLE]

Using Lemma A5,

[TABLE]

For any $\xi^{\prime}\not\supseteq\xi^{\star}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and any $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{4n}\cap\{\xi=\xi^{\prime}\}$ , write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta}+\sigma\bm{\varepsilon}$ with $\bm{\beta}_{\xi^{c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma,\bm{\alpha},\bm{\beta})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . (2) plugs in the restriction

[TABLE]

from the definition of $\Theta_{4n}$ . (3) uses an argument similar to the one used for the null hypothesis. Since the final bound in the last display is uniform over all $\xi^{\prime}\not\supseteq\xi^{\star}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and all $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{4n}\cap\{\xi=\xi^{\prime}\}$ , applying Lemma A4, part (b) yields

[TABLE]

∎

Proof of Lemma A2, part (e).

Under the null hypothesis, write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}^{\star}+\widehat{\mathbf{U}}\bm{\beta}^{\star}+\sigma^{\star}\bm{\varepsilon}$ with $\bm{\beta}^{\star}_{\xi^{\star c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma^{\star},\bm{\alpha}^{\star},\bm{\beta}^{\star})}$ and that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ . (2) is due to

[TABLE]

For (3), we observe that projection matrices

[TABLE]

for nested models $\xi^{\prime}\subseteq\xi^{\prime\prime}$ , and thus the term $\bm{\varepsilon}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}_{\xi}\widehat{\mathbf{U}}_{\xi}^{\dagger}\bm{\varepsilon}$ achieves its maximum value at some $\xi\supseteq\xi^{\star}$ s.t. $|\xi\setminus\xi^{\star}|=M_{0}s$ . (4) uses the fact that

[TABLE]

Applying Lemma A4, part (b) yields

[TABLE]

Under the alternative hypothesis, observe that $\phi_{5n}=\max_{\xi^{\prime}\supseteq\xi^{\star}:~{}|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s}\phi_{5n}^{\xi^{\prime}}$ , where

[TABLE]

Using Lemma A5,

[TABLE]

For any $\xi^{\prime}\supseteq\xi^{\star}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and any $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{5n}\cap\{\xi=\xi^{\prime}\}$ , write

[TABLE]

(1) follows from the facts that $\mathbf{Y}=\widehat{\mathbf{F}}\bm{\alpha}+\widehat{\mathbf{U}}\bm{\beta}+\sigma\bm{\varepsilon}$ with $\bm{\beta}_{\xi^{c}}=\mathbf{0}$ under $\widehat{\mathbb{P}}_{(\sigma,\bm{\alpha},\bm{\beta})}$ , that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{U}}=\mathbf{0}$ and that $\xi\supseteq\xi^{\star}$ . (2) plugs in the restrictions

[TABLE]

from the definition of $\Theta_{5n}$ . (3) uses an argument similar to the one used for the null hypothesis. Since the final bound in the last display is uniform over all $\xi^{\prime}\supseteq\xi^{\star}$ such that $|\xi^{\prime}\setminus\xi^{\star}|\leq M_{0}s$ and all $(\sigma,\bm{\alpha},\bm{\beta})\in\Theta_{5n}\cap\{\xi=\xi^{\prime}\}$ , applying Lemma A4, part (b) yields

[TABLE]

∎

Proof of Lemma A3.

Define

[TABLE]

Step 1. We first choose sufficiently small $\eta_{1},\eta_{2}$ such that

[TABLE]

Observing that

[TABLE]

we proceed to find sufficiently small $\eta_{1},\eta_{2}$ such that

[TABLE]

with conditional probability $1$ given $\bm{\varepsilon}^{\mathrm{\scriptscriptstyle T}}\left[\widehat{\mathbf{F}}\widehat{\mathbf{F}}^{\dagger}+\widehat{\mathbf{U}}_{\xi^{\star}}\widehat{\mathbf{U}}_{\xi^{\star}}^{\dagger}\right]\bm{\varepsilon}\leq 3C_{5}^{\prime}n\epsilon_{n}^{2}$ . To this end, write

[TABLE]

where

[TABLE]

We choose sufficiently small $\eta_{1},\eta_{2}$ such that $(1+\widehat{\kappa}_{1})\eta_{2}^{2}/2+\sqrt{3C_{5}^{\prime}}(1+\sqrt{\widehat{\kappa}_{1}})\eta_{2}+\eta_{1}\leq C_{5}/2$ .

Step 2. Since

[TABLE]

it is left to show

[TABLE]

Note that $\|\bm{\alpha}^{\star}\|={\textnormal{O}}(1)$ , $\|\bm{\beta}^{\star}\|={\textnormal{O}}(1)$ , and for $j\in\xi^{\star}$ , $\tau_{j}^{-1}=\|\widehat{\mathbf{U}}_{j}\|/\sqrt{n}\in[\sqrt{\widehat{\kappa}_{0}},\sqrt{\widehat{\kappa}_{1}}]$ . For all $(\sigma,\bm{\alpha},\bm{\beta})\in A^{\star}_{n}(\eta_{1},\eta_{2})$ , we can find a constant $C>0$ such that

[TABLE]

hold for sufficiently large $n$ . Thus

[TABLE]

if $C_{5}>3$ . ∎

Appendix B Technical Proofs for Factor Model Estimation

This section is devoted to the proofs of Theorem 3. Parts (a) and (b) of Theorem 3 are restated as Lemmas B7-B8. The proof of Lemma B7 is straightforward. To prove Lemma B8, we generalize the Davis-Kahan theorem (Davis and Kahan, 1970; Yu et al., 2014) as Proposition 2 and apply it to bound the principal angles from the perturbed eigenspace to the target eigenspace. Two preliminary lemmas, required by the proof of Lemma B8, are stated as Lemmas B9-B10. Parts (c) and (d) of Theorem 3, restated as Lemmas B11-B12, are immediate corollaries of Lemma B8.

Lemma B7 (Theorem 3, part (a)).

Suppose Assumptions 5-7 hold. Recall that $\widehat{\lambda}_{1}\geq\dots\geq\widehat{\lambda}_{n}$ are the $n$ eigenvalues of $\mathbf{X}\mathbf{X}^{\mathrm{\scriptscriptstyle T}}/n$ , and that $\lambda_{1}\geq\dots\geq\lambda_{k}$ are the $k$ eigenvalues of $\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\mathbf{B}$ . Then

[TABLE]

Proof.

It suffices to show $\|\mathbf{X}^{\mathrm{\scriptscriptstyle T}}\mathbf{X}/n-\mathbf{B}\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\|={\textnormal{O}_{\textnormal{p}}}(p\sqrt{\log p/n})$ so that Weyl’s inequality applies. To this end, write

[TABLE]

where

[TABLE]

Plugging in the rates from Assumptions 5-7 and $\|\mathbf{\Sigma}\|\leq\|\mathbf{\Sigma}\|_{1}\leq m_{q}(p)C_{0}^{1-q}=o(\log p)$ completes the proof. ∎
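The step that turns a bound on a matrix perturbation into a bound on eigenvalues is Weyl's inequality: for symmetric $\mathbf{A}$ and $\mathbf{E}$, the $i$-th largest eigenvalues satisfy $|\lambda_{i}(\mathbf{A}+\mathbf{E})-\lambda_{i}(\mathbf{A})|\leq\|\mathbf{E}\|$. A minimal numerical sketch follows. The matrices are random stand-ins, not the paper's $\mathbf{X}^{\mathrm{T}}\mathbf{X}/n$ and $\mathbf{B}\mathbf{B}^{\mathrm{T}}$, and the size `m` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6                                           # arbitrary dimension for illustration

A = rng.standard_normal((m, m))
A = (A + A.T) / 2                               # symmetric "target" matrix
E = 0.1 * rng.standard_normal((m, m))
E = (E + E.T) / 2                               # symmetric perturbation

eig_A = np.sort(np.linalg.eigvalsh(A))[::-1]    # eigenvalues in descending order
eig_AE = np.sort(np.linalg.eigvalsh(A + E))[::-1]

max_shift = np.max(np.abs(eig_AE - eig_A))      # largest movement of any ordered eigenvalue
op_norm_E = np.linalg.norm(E, 2)                # spectral norm of the perturbation
print(max_shift, op_norm_E)
```

In the proof, the role of $\mathbf{E}$ is played by the difference whose spectral norm is shown to be ${\textnormal{O}_{\textnormal{p}}}(p\sqrt{\log p/n})$.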

Proposition 2.

Let $\widehat{\mathbf{A}}$ be an $n\times n$ symmetric matrix with eigenvalues $\widehat{\lambda}_{1}\geq\widehat{\lambda}_{2}\geq\dots\geq\widehat{\lambda}_{n}$ and corresponding eigenvectors $\widehat{\psi}_{1},\dots,\widehat{\psi}_{n}$ . Fix $1\leq l\leq r\leq n$ and assume that $\min\{\widehat{\lambda}_{l-1}-\widehat{\lambda}_{l},\widehat{\lambda}_{r}-\widehat{\lambda}_{r+1}\}>0$ , where $\widehat{\lambda}_{0}:=+\infty$ and $\widehat{\lambda}_{n+1}:=-\infty$ . Let $k=r-l+1$ . Let $\widehat{\mathbf{\Lambda}}={\rm diag}(\widehat{\lambda}_{l},\dots,\widehat{\lambda}_{r})$ and let $\widehat{\mathbf{\Lambda}}_{c}$ be the diagonal matrix of the other $n-k$ eigenvalues of $\widehat{\mathbf{A}}$ . Let $\widehat{\mathbf{\Psi}}=(\widehat{\psi}_{l},\dots,\widehat{\psi}_{r})$ and let $\widehat{\mathbf{\Psi}}_{c}$ consist of the other $n-k$ eigenvectors of $\widehat{\mathbf{A}}$ . Let $\mathbf{A}$ be an $n\times n$ (not necessarily symmetric) matrix with “ $\mathbf{\Delta}$ -approximate” eigenvalues $\lambda_{l}\geq\dots\geq\lambda_{r}$ in the sense that

[TABLE]

where $\mathbf{\Lambda}={\rm diag}(\lambda_{l},\dots,\lambda_{r})$ and $\mathbf{\Psi}=(\psi_{l},\dots,\psi_{r})$ consists of $k$ (not necessarily orthonormal) vectors. Then

[TABLE]

Proof.

Write

[TABLE]

Using the facts that $\|\mathbf{T}_{1}\mathbf{T}_{2}\|_{\text{F}}\leq\|\mathbf{T}_{1}\|\|\mathbf{T}_{2}\|_{\text{F}}$ and that $\|\mathbf{T}_{1}\|\leq\|\mathbf{T}_{1}\|_{\text{F}}\leq\sqrt{{\rm rank}(\mathbf{T}_{1})}\|\mathbf{T}_{1}\|$ , we derive that

[TABLE]

which is the numerator on the right hand side of the claimed inequality. On the other hand,

[TABLE]

where the first (in)equality follows from the identities that $\mathbf{I}=\widehat{\mathbf{\Psi}}\widehat{\mathbf{\Psi}}^{\mathrm{\scriptscriptstyle T}}+\widehat{\mathbf{\Psi}}_{c}\widehat{\mathbf{\Psi}}_{c}^{\mathrm{\scriptscriptstyle T}}$ and that $\widehat{\mathbf{A}}=\widehat{\mathbf{\Psi}}\widehat{\mathbf{\Lambda}}\widehat{\mathbf{\Psi}}^{\mathrm{\scriptscriptstyle T}}+\widehat{\mathbf{\Psi}}_{c}\widehat{\mathbf{\Lambda}}_{c}\widehat{\mathbf{\Psi}}_{c}^{\mathrm{\scriptscriptstyle T}}$ , the second (in)equality uses the orthogonality of $[\widehat{\mathbf{\Psi}},\widehat{\mathbf{\Psi}}_{c}]$ to derive that

[TABLE]

and the third (in)equality uses the column orthonormality of $\widehat{\mathbf{\Psi}}_{c}$ again.

We next consider the term $\|\widehat{\mathbf{\Psi}}_{c}^{\mathrm{\scriptscriptstyle T}}\mathbf{\Psi}\widehat{\mathbf{\Lambda}}-\widehat{\mathbf{\Lambda}}_{c}\widehat{\mathbf{\Psi}}_{c}^{\mathrm{\scriptscriptstyle T}}\mathbf{\Psi}\|_{\text{F}}$ . For real matrices $\mathbf{T}_{1},\mathbf{T}_{2},\mathbf{T}_{3}$ , we write ${\rm vec}(\mathbf{T}_{1})$ for the vectorization of $\mathbf{T}_{1}$ , that is, the vector obtained by stacking the columns of $\mathbf{T}_{1}$ , and denote by $\mathbf{T}_{1}\otimes\mathbf{T}_{2}$ the Kronecker product of $\mathbf{T}_{1}$ and $\mathbf{T}_{2}$ . Using the identity ${\rm vec}(\mathbf{T}_{1}\mathbf{T}_{2}\mathbf{T}_{3})=(\mathbf{T}_{3}^{\mathrm{\scriptscriptstyle T}}\otimes\mathbf{T}_{1}){\rm vec}(\mathbf{T}_{2})$ for any matrices $\mathbf{T}_{1},\mathbf{T}_{2},\mathbf{T}_{3}$ with appropriate dimensions, we have

[TABLE]

which is the left hand side times the denominator on the right hand side in the claimed inequality. ∎
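The vectorization identity used in this proof, and the way it yields an eigenvalue-gap lower bound, can be checked numerically. The sketch below first verifies ${\rm vec}(\mathbf{T}_{1}\mathbf{T}_{2}\mathbf{T}_{3})=(\mathbf{T}_{3}^{\mathrm{T}}\otimes\mathbf{T}_{1}){\rm vec}(\mathbf{T}_{2})$ with column stacking. It then shows that for diagonal $\mathbf{\Lambda}$ and $\mathbf{\Lambda}_{c}$ the map $\mathbf{M}\mapsto\mathbf{M}\mathbf{\Lambda}-\mathbf{\Lambda}_{c}\mathbf{M}$ acts on ${\rm vec}(\mathbf{M})$ as a diagonal matrix of eigenvalue differences, so its Frobenius norm is at least the smallest gap times $\|\mathbf{M}\|_{\text{F}}$. All matrices and eigenvalues here (`lam`, `lam_c`, `M`) are made up for illustration.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(2)

# Part 1: vec(T1 T2 T3) = (T3^T kron T1) vec(T2).
T1 = rng.standard_normal((3, 4))
T2 = rng.standard_normal((4, 5))
T3 = rng.standard_normal((5, 2))
lhs = vec(T1 @ T2 @ T3)
rhs = np.kron(T3.T, T1) @ vec(T2)

# Part 2: M Lam - Lam_c M in vectorized form is a diagonal operator.
lam = np.array([5.0, 4.0])               # hypothetical eigenvalues inside the block
lam_c = np.array([1.0, 0.5, -2.0])       # hypothetical eigenvalues outside the block
M = rng.standard_normal((3, 2))
sylv = M @ np.diag(lam) - np.diag(lam_c) @ M
D = np.kron(np.diag(lam), np.eye(3)) - np.kron(np.eye(2), np.diag(lam_c))
gap = np.min(np.abs(lam[None, :] - lam_c[:, None]))   # smallest eigenvalue separation

print(np.linalg.norm(sylv, "fro"), gap * np.linalg.norm(M, "fro"))
```

Since `D` is diagonal with entries $\lambda_{j}-\lambda_{c,i}$, each coordinate of ${\rm vec}(\mathbf{M})$ is scaled by at least the gap in absolute value, which is the mechanism behind the denominator in Davis-Kahan type bounds.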

Lemma B8 (Theorem 3, part (b)).

Suppose Assumptions 5-7 hold. Recall that $\widetilde{\mathbf{F}}$ consists of the $\sqrt{n}$ -scaled left singular vectors of $\mathbf{F}$ , and that $\widehat{\mathbf{F}}$ consists of the $\sqrt{n}$ -scaled top $k$ left singular vectors of $\mathbf{X}$ . Then $\widehat{\mathbf{F}}$ recovers the column space of the latent common factor matrix $\mathbf{F}$ in the sense that

[TABLE]

Proof.

Let $\widehat{\mathbf{F}}_{c}$ consist of the $\sqrt{n}$ -scaled left singular vectors of $\mathbf{X}$ except those in $\widehat{\mathbf{F}}$ . Then

[TABLE]

Thus it suffices to show $\|\widehat{\mathbf{F}}_{c}^{\mathrm{\scriptscriptstyle T}}\widetilde{\mathbf{F}}/n\|_{\text{F}}={\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ . For this goal, we first apply Proposition 2 to show that

[TABLE]

Recall that $\mathbf{R}$ consists of the right singular vectors of $\mathbf{B}$ , i.e. $\mathbf{B}^{\mathrm{\scriptscriptstyle T}}\mathbf{B}=\mathbf{R}\mathbf{\Lambda}\mathbf{R}^{\mathrm{\scriptscriptstyle T}}$ , and write

[TABLE]

where $\mathbf{\Delta}=\mathbf{F}\mathbf{R}\mathbf{\Lambda}\left(\mathbf{R}^{\mathrm{\scriptscriptstyle T}}\mathbf{F}^{\mathrm{\scriptscriptstyle T}}\mathbf{F}\mathbf{R}/n-\mathbf{I}\right)/\sqrt{n}$ . Applying Proposition 2 yields

[TABLE]

To prove (11), we bound each term in the last display using Assumptions 5-7 and Lemmas B7 and B9.

(a)

For the term $\|\mathbf{F}\mathbf{R}/\sqrt{n}\|=\|\mathbf{F}/\sqrt{n}\|$ , we have by Assumption 7 that

[TABLE]

(b)

Using the facts that $\|\mathbf{T}_{1}\mathbf{T}_{2}\|_{\text{F}}\leq\|\mathbf{T}_{1}\|_{\text{F}}\|\mathbf{T}_{2}\|$ , that $\|\mathbf{R}\|=1$ and that $\|\mathbf{T}_{1}\|_{\text{F}}\leq\sqrt{{\rm rank}(\mathbf{T}_{1})}\|\mathbf{T}_{1}\|$ ,

[TABLE]

(c)

For the term

[TABLE]

we have, by Assumptions 5 and 7,

[TABLE]

and, by Lemma B9,

[TABLE]

(d)

From Lemma B7, it follows that

[TABLE]

Next, recall that $\mathbf{F}/\sqrt{n}=(\widetilde{\mathbf{F}}/\sqrt{n})\mathbf{D}\mathbf{O}_{0}^{\mathrm{\scriptscriptstyle T}}$ is the singular value decomposition of $\mathbf{F}/\sqrt{n}$ . Write

[TABLE]

where the second (in)equality follows from the fact that, if $\mathbf{T_{2}}$ is diagonal,

[TABLE]

the third (in)equality uses the orthogonality of $\mathbf{O}_{0}$ , and the final (in)equality combines rates given by Lemma B10 and (11). ∎

Lemma B9.

Suppose Assumptions 5 and 6 hold. Then

[TABLE]

Proof.

Write

[TABLE]

Applying Markov’s inequality to $\|\mathbf{U}\mathbf{B}\|_{\text{F}}^{2}$ completes the proof. ∎

Lemma B10.

Suppose Assumptions 6 and 7 hold. Let $\mathbf{F}/\sqrt{n}=(\widetilde{\mathbf{F}}/\sqrt{n})\mathbf{D}\mathbf{O}_{0}^{\mathrm{\scriptscriptstyle T}}$ be the singular value decomposition of $\mathbf{F}/\sqrt{n}$ . Then

[TABLE]

Proof.

Write

[TABLE]

∎

Lemma B11 (Theorem 3, part (c)).

Suppose Assumptions 5-7 hold. For some non-singular matrix $\mathbf{H}_{k\times k}$ with $\|\mathbf{H}^{\mathrm{\scriptscriptstyle T}}\mathbf{H}-\mathbf{I}\|={\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ and $\|\mathbf{H}\mathbf{H}^{\mathrm{\scriptscriptstyle T}}-\mathbf{I}\|={\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ ,

[TABLE]

Proof.

Recall that $\mathbf{F}/\sqrt{n}=(\widetilde{\mathbf{F}}/\sqrt{n})\mathbf{D}\mathbf{O}_{0}^{\mathrm{\scriptscriptstyle T}}$ is the singular value decomposition of $\mathbf{F}/\sqrt{n}$ . Note that all singular values of $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widetilde{\mathbf{F}}/n$ are bounded by $\|\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widetilde{\mathbf{F}}/n\|\leq 1$ . Let $\mathbf{O}_{1}$ and $\mathbf{O}_{2}$ consist of the left and right singular vectors of $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widetilde{\mathbf{F}}/n$ (the signs of the vectors are set so that the singular values of $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widetilde{\mathbf{F}}/n$ are non-negative). Thus

[TABLE]

where the last step uses Lemma B8. Set $\mathbf{H}=\mathbf{O}_{1}\mathbf{O}_{2}^{\mathrm{\scriptscriptstyle T}}\mathbf{D}\mathbf{O}_{0}^{\mathrm{\scriptscriptstyle T}}$ then

[TABLE]

The eigenvalues of $\mathbf{H}\mathbf{H}^{\mathrm{\scriptscriptstyle T}}$ or $\mathbf{H}^{\mathrm{\scriptscriptstyle T}}\mathbf{H}$ are the diagonal elements in $\mathbf{D}^{2}$ , which are ${\textnormal{O}_{\textnormal{p}}}(\sqrt{\log p/n})$ -close to $1$ as shown by Lemma B10. ∎

Lemma B12 (Theorem 3, part (d)).

Suppose Assumptions 5-7 hold. $\widehat{\mathbf{U}}=(\mathbf{I}-\widehat{\mathbf{F}}\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}/n)\mathbf{X}$ recovers the latent individual factor matrix $\mathbf{U}$ in the sense that

[TABLE]

Proof.

Recall that $\bm{b}_{j}$ denotes the $j$ -th row of $\mathbf{B}$ , $j=1,\dots,p$ . Recall the definition of $\mathbf{H}$ in Lemma B11. It is elementary that $\|\mathbf{H}\|={\textnormal{O}_{\textnormal{p}}}(1)$ and $\|\mathbf{H}^{-1}\|={\textnormal{O}_{\textnormal{p}}}(1)$ . Note that $\widehat{\mathbf{F}}^{\mathrm{\scriptscriptstyle T}}\widehat{\mathbf{F}}/n=\mathbf{I}$ . Write

[TABLE]

For the first term,

[TABLE]

For the second term,

[TABLE]

For the third term,

[TABLE]

∎
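The estimators analyzed in Lemmas B8 and B12 are simple to compute: $\widehat{\mathbf{F}}$ is $\sqrt{n}$ times the top $k$ left singular vectors of $\mathbf{X}$, and $\widehat{\mathbf{U}}=(\mathbf{I}-\widehat{\mathbf{F}}\widehat{\mathbf{F}}^{\mathrm{T}}/n)\mathbf{X}$. The following is a minimal sketch on simulated data. The data-generating choices (Gaussian factors, loadings and noise, the sizes `n`, `p`, `k`, the noise scale) are assumptions for illustration and are not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 200, 50, 3                         # hypothetical sample size, dimension, factor count

F = rng.standard_normal((n, k))              # latent common factors
B = rng.standard_normal((p, k))              # factor loadings
U = 0.5 * rng.standard_normal((n, p))        # idiosyncratic components
X = F @ B.T + U                              # observed covariates

# sqrt(n)-scaled top-k left singular vectors of X.
left, _, _ = np.linalg.svd(X, full_matrices=False)
F_hat = np.sqrt(n) * left[:, :k]

# Residual after projecting X off the estimated factor space.
U_hat = X - F_hat @ (F_hat.T @ X) / n

# Distance between the estimated and true factor column spaces.
P_true = F @ np.linalg.pinv(F)
P_hat = F_hat @ F_hat.T / n
proj_gap = np.linalg.norm(P_hat - P_true, 2)
print(proj_gap)
```

By construction $\widehat{\mathbf{F}}^{\mathrm{T}}\widehat{\mathbf{F}}/n=\mathbf{I}$ and $\widehat{\mathbf{F}}^{\mathrm{T}}\widehat{\mathbf{U}}=\mathbf{0}$, the two identities used repeatedly in Appendix A. Only the column space of $\mathbf{F}$ is recovered, which is why the comparison is made between projection matrices rather than between $\widehat{\mathbf{F}}$ and $\mathbf{F}$ directly.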

Appendix C Implementation of Gibbs Samplers

For the prior (6), we set $h$ as the Gaussian density function $h(z)=e^{-z^{2}/2}/\sqrt{2\pi}$ and $g$ as the inverse-gamma density function with shape $a_{0}=1$ and scale $b_{0}=1$ . A Gibbs sampler is implemented to explore the pseudo-posterior distribution (7). This Gibbs sampler runs towards the pseudo-posterior joint distribution of $(\sigma^{2},\bm{\alpha},\bm{\beta})$ by iterating the following steps: (1) draw $\xi$ given $\bm{\alpha}$ and $\sigma^{2}$ , (2) draw $\bm{\beta}$ given $\xi$ , $\bm{\alpha}$ and $\sigma^{2}$ , (3) draw $\bm{\alpha}$ given $\xi,\bm{\beta}$ and $\sigma^{2}$ , (4) draw $\sigma^{2}$ given $\xi,\bm{\beta}$ and $\bm{\alpha}$ .

For simplicity, we illustrate the implementation details with $s_{0}=1$ , $\tau_{j}=1$ for $j=1,\dots,p$ . For the first step, we have

[TABLE]

This implies

[TABLE]

where $\mathbf{S}_{\xi}=\widehat{\mathbf{U}}_{\xi}\widehat{\mathbf{U}}_{\xi}^{\mathrm{\scriptscriptstyle T}}+\mathbf{I}$ . However, it is computationally prohibitive to directly sample from this conditional distribution, as $\xi$ takes $2^{p}$ possible values. As a remedy, we flip $Z_{j}=1\{j\in\xi\}$ in Gibbs random scans. In our experiments, we found that just one random scan suffices for the proposed method to perform well. Details of flipping $Z_{j}$ will be given at the end of this section.

For the second step, we derive, by elementary calculus,

[TABLE]

Recall that $\bm{\beta}_{\xi^{c}}\equiv 0$ . Similarly, for the third step,

[TABLE]

The final step uses the conjugacy of normal distribution and inverse-gamma distribution

[TABLE]
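The second step (drawing $\bm{\beta}$ given $\xi$, $\bm{\alpha}$ and $\sigma^{2}$) can be sketched as follows. The paper's exact conditional distribution is in a display not reproduced here, so this sketch rests on an assumption: with $s_{0}=1$, $\tau_{j}=1$ and Gaussian $h$, the slab is taken to be $\beta_{j}\mid\sigma^{2}\sim N(0,\sigma^{2})$, which is what the form $\mathbf{S}_{\xi}=\widehat{\mathbf{U}}_{\xi}\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}+\mathbf{I}$ suggests. Under that assumption the conditional of $\bm{\beta}_{\xi}$ is Gaussian with mean $(\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}\widehat{\mathbf{U}}_{\xi}+\mathbf{I})^{-1}\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}(\mathbf{Y}-\widehat{\mathbf{F}}\bm{\alpha})$ and covariance $\sigma^{2}(\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}\widehat{\mathbf{U}}_{\xi}+\mathbf{I})^{-1}$. The function name `draw_beta` and the toy data are hypothetical.

```python
import numpy as np

def draw_beta(Y, F_hat, U_hat, alpha, sigma2, xi, rng):
    """One draw of beta given (xi, alpha, sigma2), assuming a N(0, sigma2) slab."""
    p = U_hat.shape[1]
    beta = np.zeros(p)                          # beta is identically zero outside xi
    xi = np.asarray(xi, dtype=int)
    if xi.size == 0:
        return beta
    U_xi = U_hat[:, xi]
    resid = Y - F_hat @ alpha                   # part of Y not explained by the factors
    precision = U_xi.T @ U_xi + np.eye(xi.size) # posterior precision divided by 1/sigma2
    mean = np.linalg.solve(precision, U_xi.T @ resid)
    # If precision = L L^T, then L^{-T} z has covariance precision^{-1}.
    L = np.linalg.cholesky(precision)
    z = rng.standard_normal(xi.size)
    beta[xi] = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, z)
    return beta

# Toy data; F_hat and U_hat here are random stand-ins, not PCA outputs.
rng = np.random.default_rng(4)
n, p, k = 100, 10, 2
F_hat = rng.standard_normal((n, k))
U_hat = rng.standard_normal((n, p))
alpha = np.array([1.0, -1.0])
beta_true = np.zeros(p)
beta_true[[0, 3]] = [2.0, -1.5]
Y = F_hat @ alpha + U_hat @ beta_true + 0.5 * rng.standard_normal(n)

beta_draw = draw_beta(Y, F_hat, U_hat, alpha, sigma2=0.25, xi=[0, 3], rng=rng)
print(beta_draw)
```

Only a $|\xi|\times|\xi|$ system is solved per draw, which keeps this step cheap when the sampler stays on small models.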

In the first step, in order to sample from the conditional distribution (12), we flip $Z_{j}$ with probability

[TABLE]

where $\omega=\{j^{\prime}\neq j:Z_{j^{\prime}}=1\}$ . The posterior probability ratio is computed as

[TABLE]

where we derive, by Sylvester’s determinant theorem and properties of Schur complements, that

[TABLE]

and, by Sherman-Morrison-Woodbury identity, that

[TABLE]

As shown in our theoretical analyses, this Gibbs sampler deals with $|\omega|\leq(M_{0}+1)s$ most of the time. The computation of the terms in the posterior probability ratio is numerically stable because the Gram matrices involved are small. It is also time-efficient, with complexity ${\textnormal{O}}(n|\omega|^{2})\leq{\textnormal{O}}(ns^{2})$ . The overall time complexity of running $T$ iterations of the Gibbs sampler in our Bayesian method is then ${\textnormal{O}}(Tpns^{2})$ . In contrast, the factor-adjusted lasso method costs ${\textnormal{O}}(p^{3})$ time. In the simulation studies, we choose $T=20$ , $n=200$ , $p=500$ , $s=5$ as the typical setting, and observe that our Bayesian method runs as fast as its lasso analogue.
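The two matrix identities named above explain why only small Gram matrices are needed. For $\mathbf{S}_{\xi}=\widehat{\mathbf{U}}_{\xi}\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}+\mathbf{I}$, Sylvester's determinant theorem gives $\det\mathbf{S}_{\xi}=\det(\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}\widehat{\mathbf{U}}_{\xi}+\mathbf{I})$, and the Sherman-Morrison-Woodbury identity gives $\mathbf{S}_{\xi}^{-1}=\mathbf{I}-\widehat{\mathbf{U}}_{\xi}(\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}\widehat{\mathbf{U}}_{\xi}+\mathbf{I})^{-1}\widehat{\mathbf{U}}_{\xi}^{\mathrm{T}}$. The sketch below computes a quadratic form and a log-determinant this way and compares them with the direct $n\times n$ computation. It is not the paper's flip-probability formula, which is in displays not reproduced here; the helper name `quad_and_logdet` and the data are hypothetical.

```python
import numpy as np

def quad_and_logdet(U_xi, y):
    """Return y^T S^{-1} y and log det S for S = U_xi U_xi^T + I via a |xi| x |xi| matrix."""
    m = U_xi.shape[1]
    small = U_xi.T @ U_xi + np.eye(m)                   # small Gram matrix plus identity
    Uty = U_xi.T @ y
    quad = y @ y - Uty @ np.linalg.solve(small, Uty)    # Woodbury form of y^T S^{-1} y
    _, logdet = np.linalg.slogdet(small)                # Sylvester: det(S) = det(small)
    return quad, logdet

rng = np.random.default_rng(5)
n, m = 60, 4                                            # hypothetical n and model size
U_xi = rng.standard_normal((n, m))
y = rng.standard_normal(n)
quad, logdet = quad_and_logdet(U_xi, y)

# Direct n x n computation, for comparison only.
S = U_xi @ U_xi.T + np.eye(n)
quad_direct = y @ np.linalg.solve(S, y)
_, logdet_direct = np.linalg.slogdet(S)
print(quad, quad_direct, logdet, logdet_direct)
```

Forming `small` costs ${\textnormal{O}}(nm^{2})$ with $m=|\xi|$, which matches the per-flip complexity quoted above, whereas the direct route costs ${\textnormal{O}}(n^{3})$.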

Bibliography


Ahn, S. C. and Horenstein, A. R. (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227.
Armagan, A., Dunson, D. B. and Lee, J. (2013). Generalized double Pareto shrinkage. Statistica Sinica 23 119.
Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171.
Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74 1133–1150.
Bai, Z.-D. and Yin, Y.-Q. (1988). Necessary and sufficient conditions for almost sure convergence of the largest eigenvalue of a Wigner matrix. Annals of Probability 1729–1741.
Barron, A. R. (1998). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. Bayesian Statistics 6 27–52.
Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110 1479–1490.
