Inverse Moment Methods for Sufficient Forecasting using High-Dimensional Predictors

Wei Luo; Lingzhou Xue; Jiawei Yao; Xiufan Yu

arXiv:1705.00395·math.ST·April 22, 2021

TL;DR

This paper introduces inverse moment methods for sufficient forecasting with high-dimensional predictors, combining factor analysis and dimension reduction to effectively model nonlinear relationships in macroeconomic data.

Contribution

It proposes a novel approach using inverse third-moment methods for dimension reduction, accommodating diverging factors and avoiding time reversibility assumptions, enhancing applicability.

Findings

1. Methods outperform traditional approaches in simulations.

2. Effective in forecasting macroeconomic data from 1959 to 2016.

3. Provides theoretical guarantees including invariance and order determination.

Tables

Table 1: Performance of the estimated $\widehat{\phi}$ using median $R^{2}(\widehat{\phi})$ (%) with standard deviations in parentheses over 1000 replications.

Model I		SIR		DR		SEE
$p$	$T$	$R^{2}(\widehat{\phi}_{1})$	$R^{2}(\widehat{\phi}_{2})$	$R^{2}(\widehat{\phi}_{1})$	$R^{2}(\widehat{\phi}_{2})$	$R^{2}(\widehat{\phi}_{1})$	$R^{2}(\widehat{\phi}_{2})$
100	100	75.0(21.3)	28.4(27.4)	82.9(14.8)	79.9(21.9)	80.4(26.8)	27.0(23.5)
100	200	88.7(10.4)	17.7(27.6)	94.5(5.4)	91.5(8.5)	83.4(26.6)	21.7(20.7)
100	500	95.9(3.6)	14.4(28.2)	98.4(1.4)	96.0(3.4)	87.6(26.9)	30.8(20.7)
200	100	63.2(24.5)	26.6(24.8)	74.6(20.3)	67.9(24.4)	40.9(23.4)	13.0(18.5)
500	200	76.6(16.1)	16.1(23.2)	86.8(20.1)	80.2(22.1)	26.6(15.4)	9.4(15.8)
500	500	90.5(5.5)	9.2(22.4)	96.0(29.9)	87.7(26.0)	24.2(13.4)	7.6(13.5)
Model II		SIR		DR		SEE
100	100	95.8(3.5)	21.0(25.7)	95.8(3.5)	26.4(26.6)	89.7(22.5)	33.0(20.1)
100	200	97.8(1.8)	32.4(27.7)	97.9(1.8)	43.4(28.7)	90.4(15.0)	30.2(19.5)
100	500	99.1(0.7)	63.8(27.0)	99.1(0.7)	74.8(23.8)	91.9(20.5)	48.7(20.9)
200	100	94.6(3.6)	17.6(22.4)	94.2(10.6)	21.4(23.4)	81.6(26.8)	21.2(18.7)
500	200	95.9(2.1)	18.2(22.6)	95.5(11.9)	24.7(23.4)	37.8(26.5)	13.9(17.4)
500	500	98.4(0.9)	41.1(25.6)	97.9(15.2)	48.3(26.3)	30.7(24.9)	13.1(17.1)
Model III		SIR		DR		SEE
100	100	33.4(26.7)	26.1(23.4)	83.0(19.7)	47.6(28.2)	40.1(30.7)	29.9(18.4)
100	200	34.8(27.3)	23.8(22.7)	94.9(4.1)	83.2(22.9)	68.4(35.1)	20.2(18.1)
100	500	33.0(28.1)	24.2(23.4)	98.4(1.4)	97.6(2.1)	77.2(34.6)	21.5(16.7)
200	100	29.5(25.9)	19.8(20.4)	75.0(23.3)	36.5(25.7)	37.9(26.8)	12.9(17.9)
500	200	20.3(23.7)	15.2(10.1)	88.9(22.2)	48.8(27.8)	20.5(16.1)	8.6(14.5)
500	500	21.3(23.1)	14.5(18.1)	95.6(29.6)	92.9(28.0)	14.0(13.5)	6.6(13.7)
Model IV		SIR		DR		SEE
100	100	61.8(29.1)	31.3(26.0)	85.6(14.2)	79.1(23.5)	64.4(27.8)	43.9(18.4)
100	200	75.1(26.4)	41.6(27.9)	94.5(4.9)	93.5(5.2)	71.7(34.1)	51.1(19.4)
100	500	89.4(15.0)	67.8(27.4)	98.1(1.7)	97.7(1.9)	88.2(37.7)	66.6(17.0)
200	100	51.9(28.6)	29.0(24.8)	79.6(19.7)	71.0(24.3)	41.5(25.9)	12.2(18.2)
500	200	59.4(27.9)	30.2(24.4)	87.5(21.7)	86.2(20.2)	19.5(15.4)	7.4(13.5)
500	500	83.3(17.8)	54.9(26.9)	95.1(28.3)	94.6(26.4)	10.2(13.3)	4.8(13.1)

Table 2: Comparison of out-of-sample median $R^{2}$ in percentage (%) over 1000 replications.

		Model I				Model II
$p$	$T$	SIR	DR	PC	SEE	SIR	DR	PC	SEE
100	100	-11.7	28.8	-0.4	1.4	94.6	94.8	93.3	78.0
100	200	-3.9	72.1	18.0	9.9	95.7	95.8	94.6	79.4
100	500	0.4	92.2	27.4	11.4	96.1	96.2	94.9	79.3
200	100	-11.4	18.6	-6.9	-4.0	95.3	95.6	94.2	61.8
500	200	-5.3	57.5	-1.1	-0.7	96.2	96.5	94.8	45.9
500	500	-0.9	91.4	13.8	1.7	97.1	97.1	95.8	45.4
		Model III				Model IV
$p$	$T$	SIR	DR	PC	SEE	SIR	DR	PC	SEE
100	100	-9.4	34.8	17.8	17.2	-0.2	23.6	21.2	18.4
100	200	1.0	77.1	30.8	22.7	13.5	53.7	35.8	28.4
100	500	5.2	90.5	38.0	25.5	29.6	57.3	43.2	30.8
200	100	-9.7	21.5	3.8	6.3	-2.3	16.9	6.8	7.6
500	200	-4.4	62.5	6.6	2.6	5.6	46.0	9.7	5.3
500	500	-1.3	89.5	19.1	5.2	22.4	58.3	21.6	48.5

Table 3: RMSE in Out of Sample Forecast (Median/Max/Min): out-of-sample RMSE relative to the linear diffusion index. In each group, the median, maximum and minimum of RMSE are reported. SIR($i$) denotes sufficient forecasting using $i$ indices, DR denotes sufficient directional forecasting, and NL-PC denotes a nonlinear additive model on all the estimated factors.

Group ( $h = 1$ )	SIR(1)	SIR(2)	DR(1)	DR(2)	NL-PC
Output & Income	1.03/1.61/0.96	1.02/1.13/0.94	0.99/1.19/0.92	1.02/1.14/0.90	1.21/1.38/1.05
Consumption	1.00/2.10/0.80	0.95/1.05/0.74	0.92/1.02/0.86	1.00/1.05/0.81	1.16/1.44/1.04
Labor market	1.02/2.27/0.71	1.00/1.21/0.42	0.97/1.13/0.52	0.98/1.16/0.42	1.21/1.53/0.46
Housing	1.04/1.32/0.64	0.92/1.08/0.52	0.83/1.04/0.50	0.79/0.94/0.44	0.83/0.97/0.49
Money & Credit	0.94/1.04/0.86	0.97/1.05/0.90	0.96/1.10/0.86	1.04/1.24/0.92	1.14/1.41/1.07
Stock market	0.99/1.39/0.90	1.02/1.12/0.83	0.92/1.08/0.88	1.04/1.07/0.91	1.36/1.39/1.14
Interest rates	1.04/1.79/0.79	0.93/1.17/0.61	0.90/1.04/0.59	0.92/1.15/0.62	1.12/1.32/0.73
Prices	0.97/1.42/0.80	0.99/1.05/0.83	0.95/1.12/0.81	0.97/1.12/0.88	1.12/1.47/0.92
Group ( $h = 6$ )	SIR(1)	SIR(2)	DR(1)	DR(2)	NL-PC
Output & Income	1.07/1.47/0.93	0.97/1.23/0.81	0.99/1.18/0.89	1.05/1.27/0.95	1.28/1.52/0.97
Consumption	1.16/1.73/0.90	0.90/1.12/0.67	0.94/1.16/0.71	1.03/1.14/0.73	1.28/1.66/0.77
Labor market	1.15/2.02/0.68	0.89/1.22/0.39	0.90/1.26/0.48	0.98/1.39/0.43	1.24/1.42/0.45
Housing	0.96/1.29/0.66	0.85/0.95/0.51	0.73/0.89/0.50	0.69/0.86/0.47	0.78/1.02/0.55
Money & Credit	0.95/3.51/0.76	1.01/3.65/0.83	0.99/1.52/0.76	1.02/1.74/0.78	1.23/2.90/0.92
Stock market	0.91/1.20/0.83	0.94/1.05/0.89	0.89/1.08/0.84	1.00/1.03/0.94	1.23/1.27/0.83
Interest rates	1.01/1.61/0.75	0.90/1.12/0.64	0.84/1.13/0.50	0.88/1.18/0.58	1.11/1.46/0.70
Prices	1.16/1.37/0.51	1.03/1.12/0.82	1.11/1.37/0.94	1.14/1.36/0.95	1.17/1.35/1.11
Group ( $h = 12$ )	SIR(1)	SIR(2)	DR(1)	DR(2)	NL-PC
Output & Income	1.24/1.67/0.79	1.01/1.45/0.76	0.99/1.22/0.76	1.01/1.36/0.86	1.17/1.34/0.92
Consumption	1.27/1.60/0.83	1.08/1.44/0.62	1.09/1.32/0.65	1.06/1.38/0.66	1.16/1.38/0.87
Labor market	1.07/1.76/0.67	0.83/1.40/0.41	0.91/1.44/0.54	0.89/1.41/0.46	1.13/1.39/0.56
Housing	0.85/1.35/0.59	0.69/0.93/0.46	0.67/0.91/0.40	0.68/0.83/0.36	0.89/1.16/0.54
Money & Credit	1.14/2.03/0.41	1.03/2.16/0.80	1.05/1.52/0.85	1.00/1.40/0.82	1.20/1.69/0.87
Stock market	1.09/1.20/0.89	1.01/1.13/0.84	0.96/1.17/0.94	1.08/1.16/0.75	1.06/1.14/0.89
Interest rates	1.00/1.31/0.75	0.82/1.22/0.59	0.80/1.27/0.53	0.85/1.18/0.51	1.07/1.62/0.70
Prices	1.18/1.40/0.53	1.21/1.40/0.66	1.19/1.31/0.71	1.21/1.33/0.77	1.25/1.52/0.94

Equations

$y_{t+1}$

$x_{it}$

$$y_{t+1}\perp\!\!\!\perp f_{t}\mid(\phi_{1},\ldots,\phi_{L})^{\prime}f_{t}.$$

$$y_{t+1}=g(\phi_{1}^{\prime}f_{t}+\psi_{1}^{\prime}\omega_{t},\ldots,\phi_{L}^{\prime}f_{t}+\psi_{L}^{\prime}\omega_{t},\epsilon_{t+1})$$

$$(\widehat{B}_{K},\widehat{F}_{K})=\arg\min_{(B,F)}\|X-BF^{\prime}\|_{F}^{2}\quad\text{subject to}\quad T^{-1}F^{\prime}F=I_{K},\ B^{\prime}B\ \text{is diagonal}$$

$$M_{dr}=E\{2\,\mathrm{var}(f_{t})-E[(f_{t}-g_{s})(f_{t}-g_{s})^{\prime}\mid y_{t+1},\eta_{s+1}]\}^{2},$$

$$\widehat{f}_{t}=f_{t}+u_{t}^{*},$$

$M_{dr}$

$$\|(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})^{\prime}-(\phi_{1},\ldots,\phi_{L})(\phi_{1},\ldots,\phi_{L})^{\prime}\|_{F}=O_{P}(K^{3/2}p^{-1/2}+KT^{-1/2}).$$

$$\|(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})^{\prime}-(\phi_{1},\ldots,\phi_{L})(\phi_{1},\ldots,\phi_{L})^{\prime}\|_{F}=o_{P}(1).$$

$\kappa(\psi_{1},\ldots,\psi_{L})$

$$\|(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})^{\prime}-(\phi_{1},\ldots,\phi_{L})(\phi_{1},\ldots,\phi_{L})^{\prime}\|_{F}=O_{P}(p^{-1/2}+T^{-1/2}).$$

$$\widehat{K}=\arg\min_{0\leq k\leq K_{\max}}\log\big(p^{-1}T^{-1}\|X-T^{-1}X\widehat{F}_{k}\widehat{F}_{k}^{\prime}\|_{F}^{2}\big)+k\cdot q(p,T),$$

$$q(p,T)=(p+T)(pT)^{-1}\log\{pT(p+T)^{-1}\}.$$

$$G(l)=(T/2)\sum_{i=1+\min(\tau,l)}^{K_{c}}\{\log(\lambda_{i}+1)-\lambda_{i}\}-C_{T}\,l(2K-l+1)/2,$$

$$y_{t+1}=g(\phi_{1}^{\prime}f_{t},\phi_{2}^{\prime}f_{t})+\sigma\epsilon_{t+1},\quad\text{and}\quad x_{it}=b_{i}^{\prime}f_{t}+u_{it},$$

$$R^{2}=1-\frac{\sum_{t=T+1}^{T+n_{T}}(y_{t}-\widehat{y}_{t})^{2}}{\sum_{t=T+1}^{T+n_{T}}(y_{t}-\bar{y}_{t})^{2}},$$

$$y_{t+h}^{h}=g(\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t})+\epsilon_{t+h}^{h},$$

$$\mathrm{RMSE}(M)=\frac{\mathrm{MSE}(M)}{\mathrm{MSE}(PC)},\quad\text{where}\quad\mathrm{MSE}(M)=m^{-1}\sum_{t=T+1}^{T+m}(y_{t}-\widehat{y}_{t})^{2},$$

Full text

Inverse Moment Methods for Sufficient Forecasting using High-Dimensional Predictors

Wei Luo1, Lingzhou Xue2, Jiawei Yao3 and Xiufan Yu2

1Zhejiang University, 2Pennsylvania State University and 3Princeton University

Abstract

We consider forecasting a single time series using a large number of predictors in the presence of a possible nonlinear forecast function. Assuming that the predictors affect the response through the latent factors, we propose to first conduct factor analysis and then apply sufficient dimension reduction on the estimated factors, to derive the reduced data for subsequent forecasting. Using directional regression and the inverse third-moment method in the stage of sufficient dimension reduction, the proposed methods can capture the non-monotone effect of factors on the response. We also allow a diverging number of factors and only impose general regularity conditions on the distribution of factors, avoiding the undesired time reversibility of the factors by the latter. These make the proposed methods fundamentally more applicable than the sufficient forecasting method in Fan et al. (2017). The proposed methods are demonstrated in both simulation studies and an empirical study of forecasting monthly macroeconomic data from 1959 to 2016. Also, our theory contributes to the literature of sufficient dimension reduction, as it includes an invariance result, a path to perform sufficient dimension reduction under the high-dimensional setting without assuming sparsity, and the corresponding order-determination procedure.

Key Words: Forecasting; Factor model; Principal components; Sufficient dimension reduction; Invariance property; High-dimensional asymptotics.

1 Introduction

Forecasting using high-dimensional predictors is an increasingly important research topic in statistics, biostatistics, macroeconomics and finance. A large body of literature has contributed to forecasting in a data rich environment, with various applications such as the forecasts of market prices, dividends and bond risks (Sharpe, 1964; Lintner, 1965; Ludvigson and Ng, 2009), macroeconomic outputs (Stock and Watson, 1989; Bernanke et al., 2005), macroeconomic uncertainty and fluctuations (Ludvigson and Ng, 2007; Jurado et al., 2015), and clinical outcomes based on massive genetic, genomic and imaging measurements. Motivated by principal component regression, the pioneering papers by Stock and Watson (2002a, b) systematically introduced the forecasting procedure using factor models, which has played an important role in macroeconomic analysis. Recently, Fan et al. (2017) extended Stock and Watson (2002a, b) to allow for a nonlinear forecast function and multiple nonadditive forecasting indices. Following Fan et al. (2017), we consider the following factor model with a target variable $y_{t+1}$ that we aim to forecast:

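$$y_{t+1}=g(\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t},\epsilon_{t+1}),\qquad(1.1)$$

$$x_{it}=b_{i}^{\prime}f_{t}+u_{it},\quad 1\leq i\leq p,\ 1\leq t\leq T,\qquad(1.2)$$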

where $x_{it}$ is the $i$-th high-dimensional predictor observed at time $t$, $b_{i}$ is a $K\times 1$ vector of factor loadings, $f_{t}$ is a $K\times 1$ vector of common factors driving both predictor and response, $g(\cdot)$ is an unknown forecast function that is possibly nonadditive and nonseparable, $u_{it}$ is an idiosyncratic error, and $\epsilon_{t+1}$ is an independent stochastic error. Here, $\phi_{1},\ldots,\phi_{L}$, $b_{1},\ldots,b_{p}$ and $f_{1},\ldots,f_{T}$ are unobserved vectors. Model (1.1) equivalently assumes

$$y_{t+1}\perp\!\!\!\perp f_{t}\mid(\phi_{1},\ldots,\phi_{L})^{\prime}f_{t}.\qquad(1.3)$$

The linear space spanned by $\phi_{1},\ldots,\phi_{L}$ , denoted by ${\mathcal{S}}_{\scriptscriptstyle y|f}$ , is the parameter of interest that is identifiable and known as the central subspace (Cook, 1998). Fan et al. (2017) introduced the sufficient forecasting scheme to use factor analysis in model (1.2) to estimate $f_{t}$ , and apply the sliced inverse regression (Li, 1991) in model (1.1) with the estimated factors as the predictor. Such a combination provides a promising forecasting technique that not only extracts the underlying commonality of the high-dimensional predictor but also models the complex dependence between the predictor and the forecast target. It allows the dimension of the predictor to diverge and even become much larger than the number of observations.

The consistency result of Fan et al. (2017) is less automatic than it may appear. If we replace the true factors $f_{t}$ with a consistent estimate $\widehat{f}_{t}$ in (1.3) and define the central subspace ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$ similarly, then ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$ may in general differ drastically from ${\mathcal{S}}_{\scriptscriptstyle y|f}$. Thus, the naive approach of applying existing dimension reduction methods to the estimated factors $\widehat{f}_{t}$ does not necessarily lead to consistent estimation of ${\mathcal{S}}_{\scriptscriptstyle y|f}$, even if it consistently estimates ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$. Fan et al. (2017) effectively addressed this issue by developing an important invariance result between $E(f_{t}|y_{t+1})$ and $E(\widehat{f}_{t}|y_{t+1})$; see Proposition 2.1 and Equation (2.9) of Fan et al. (2017). This invariance result provides an essential foundation for using the sliced inverse regression under Models (1.1)–(1.2).

Nonetheless, the applicability of Fan et al. (2017) is restricted by the requirements that the number of factors $K$ must be fixed as $p$ and $T$ grow, and that, for each set of factors, a linearity condition (see (B1) below) must hold. In particular, as ${\mathcal{S}}_{\scriptscriptstyle y|f}$ is unknown, the linearity condition is commonly strengthened to equivalently require an elliptically distributed $f_{t}$, which causes the undesired time reversibility (Xia et al., 2002). In addition, the consistency of Fan et al. (2017) and Yu et al. (2020) hinges on an exhaustive estimation of ${\mathcal{S}}_{\scriptscriptstyle y|f}$ (i.e. detecting all the directions), for which $\phi_{1}^{\prime}\Sigma_{f|y}\phi_{1},\ldots,\phi_{L}^{\prime}\Sigma_{f|y}\phi_{L}$ must be positive (see their Assumption (A2)). This condition is violated, i.e. $\phi^{\prime}\Sigma_{f|y}\phi$ is zero for some $\phi\in{\mathcal{S}}_{\scriptscriptstyle y|f}$, if $\phi^{\prime}f_{t}|y_{t+1}$ has a symmetric distribution, which occurs when the forecast target depends on squared factors, as investigated in Bai and Ng (2008) and Ludvigson and Ng (2007). These limitations motivate us to construct more powerful forecasting methods based on the work of Fan et al. (2017).
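The failure mode described above can be illustrated with a small simulation. This is only a sketch under assumed settings that are not taken from the paper: a single standard normal factor direction `f1`, the link $y=f_{1}^{2}+0.1\epsilon$, and a hypothetical helper `slice_labels` that cuts the response into equal-count slices. Under these assumptions the within-slice means of `f1` (the first inverse moment) are all near zero, while the within-slice second moments vary strongly across slices.

```python
import numpy as np

def slice_labels(y, H):
    """Assign each observation to one of H slices with (nearly) equal counts."""
    ranks = np.argsort(np.argsort(y))          # ranks 0..T-1 of the response
    return (ranks * H) // len(y)               # integer slice labels 0..H-1

rng = np.random.default_rng(0)
T, H = 20000, 10                               # illustrative sample size and slice count
f1 = rng.standard_normal(T)                    # one factor direction (assumed normal)
y = f1**2 + 0.1 * rng.standard_normal(T)       # symmetric, non-monotone link (assumed)

labels = slice_labels(y, H)
# First inverse moment within each slice: close to zero by symmetry.
slice_mean = np.array([f1[labels == h].mean() for h in range(H)])
# Second inverse moment within each slice: changes markedly with the slice.
slice_second = np.array([(f1[labels == h] ** 2).mean() for h in range(H)])
```

Printing `slice_mean` next to `slice_second` shows why a method built only on $E(f_{t}|y_{t+1})$ has nothing to detect in this direction, whereas a second-order inverse moment does.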

In this paper, we propose to use factor analysis and sufficient dimension reduction sequentially for sufficient forecasting, with second- or higher-order inverse moment methods being the working sufficient dimension reduction method. In the main text, we focus on a commonly used second-order inverse moment method called directional regression (Li and Wang, 2007), and defer the development with the third-order inverse moment method to the online supplement. Based on models (1.1) and (1.2), the proposed method includes the following steps:

Step 1. Estimate the factor loadings $B$ and the factors $f_{t}$ in Model (1.2).

Step 2. Use the estimates $\widehat{B}$ and $\widehat{f}_{t}$ in directional regression to estimate ${\mathcal{S}}_{\scriptscriptstyle y|f}$ .

Step 3. Use the nonparametric methods (Fan and Gijbels, 1996; Matzkin, 2003; Yu et al., 2020) to estimate $g(\cdot)$ in Model (1.1) and forecast $y_{t+1}$ , based on the estimate of $(\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t})$ .
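Step 3 can be sketched with a very simple smoother. The paper cites local polynomial and related nonparametric methods; the Nadaraya-Watson estimator below is a swapped-in, simpler stand-in, and the function name `nadaraya_watson`, the Gaussian kernel, and the bandwidth value are all assumptions made for illustration. The inputs play the role of the estimated indices $(\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t})$.

```python
import numpy as np

def nadaraya_watson(index_train, y_train, index_new, bandwidth):
    """Gaussian-kernel Nadaraya-Watson estimate of E(y | index) at index_new.

    index_train: (T, L) array of estimated indices; y_train: length-T response;
    index_new: (n, L) array of indices at which to forecast.
    """
    diff = index_new[:, None, :] - index_train[None, :, :]           # (n, T, L)
    weights = np.exp(-0.5 * np.sum(diff**2, axis=2) / bandwidth**2)  # (n, T)
    return weights @ y_train / weights.sum(axis=1)                   # weighted averages

# Toy check with a single index (L = 1) and a noiseless quadratic link.
index_train = np.linspace(-2.0, 2.0, 401)[:, None]
y_train = index_train[:, 0] ** 2
forecast = nadaraya_watson(index_train, y_train, np.array([[1.0]]), bandwidth=0.1)
```

In practice the bandwidth would be chosen by cross-validation or a plug-in rule, and a local linear fit would reduce boundary bias; neither refinement is shown here.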

By studying both $E(f_{t}|y_{t+1})$ and $E(f_{t}f_{t}^{\prime}|y_{t+1})$ in Step 2, we explore the full power of the factor space. To this end, we first provide an important invariance result (i.e. Lemma 1) for directional regression. With the help of this invariance result, we do not require the coincidence or closeness of the two central subspaces ${\mathcal{S}}_{\scriptscriptstyle y|f}$ and ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$, so the proposed method can be applied to more general data, such as non-normally distributed factors.

Our work extends the method, theory and applicability of the forecasting using factor models. Compared with Fan et al. (2017), we relax the linearity condition to the general moment conditions on $f_{t}$ . From the discussion above, the proposed method does not require time reversibility of the factors, so it can be applied to the generalized forecasting model

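$$y_{t+1}=g(\phi_{1}^{\prime}f_{t}+\psi_{1}^{\prime}\omega_{t},\ldots,\phi_{L}^{\prime}f_{t}+\psi_{L}^{\prime}\omega_{t},\epsilon_{t+1}),\qquad(1.4)$$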

where $\omega_{t}$ is an $m\times 1$ vector of the observed variables (e.g. lags of $y_{t+1}$). In addition, by using the higher-order inverse moments, the proposed method requires weaker conditions than Fan et al. (2017) and Yu et al. (2020) for exhaustive estimation of ${\mathcal{S}}_{\scriptscriptstyle y|f}$. In particular, it can detect non-monotone effects of the factors on the response. Furthermore, we allow the number of underlying factors $K$ to diverge as $p,T\rightarrow\infty$. By Lam and Yao (2012), Li et al. (2017) and Jurado et al. (2015), our method will deliver a more powerful forecast than Stock and Watson (2002a, b) and Fan et al. (2017).

Using the directional regression as an illustration, the proposed method also provides a novel framework for performing sufficient dimension reduction with large panel data under the high-dimensional setting, without the commonly adopted sparsity assumption but with the assumption that the predictor affects the response only through the latent factors. The original directional regression (Li and Wang, 2007) can only deal with independently and identically distributed data under the low-dimensional setting. The proposed framework therefore enhances the applicability of model-free dimension reduction for high-dimensional data when the sparsity assumption is not suitable.

The consistency of the proposed method hinges on the consistency of both factor analysis and directional regression based on the estimated factors, which we study next. For ease of presentation, we assume that both the number of factors $K$ and the dimension $L$ of ${\mathcal{S}}_{\scriptscriptstyle y|f}$ are known a priori. This does not affect the asymptotic development of the resulting estimator, as long as $K$ and $L$ can be consistently estimated; see the supplement for details. The consistent estimation of $K$ and $L$ is deferred to §5. Throughout the article, we assume $L$ to be fixed as $K$ diverges.

2 Consistency of factor analysis

To make forecasts, we need to estimate the factor loadings $B$ and the error covariance matrix $\Sigma_{u}$. Consider the following constrained least squares problem:

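$$(\widehat{B}_{K},\widehat{F}_{K})=\arg\min_{(B,F)}\|X-BF^{\prime}\|_{F}^{2}\quad\text{subject to}\quad T^{-1}F^{\prime}F=I_{K},\ B^{\prime}B\ \text{is diagonal},\qquad(2.1)$$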

where $X=(x_{1},\cdots,x_{T}),F^{\prime}=(f_{1},\cdots,f_{T})$ , and $\|\cdot\|_{F}$ denotes the Frobenius norm of a matrix. The constraints $T^{-1}F^{\prime}F=I_{K}$ and that $B^{\prime}B$ is diagonal address the issue of identifiability during the minimization. As these conditions can always be satisfied for any $BF^{\prime}$ after appropriate matrix operations on $B$ and $F$ , they impose no additional restrictions on the factor model (1.2). It is known that the minimizers $\widehat{F}_{K}$ and $\widehat{B}_{K}$ of (2.1) are such that the columns of $\widehat{F}_{K}/\sqrt{T}$ are the eigenvectors corresponding to the $K$ largest eigenvalues of the $T\times T$ matrix $X^{\prime}X$ and $\widehat{B}_{K}=T^{-1}X\widehat{F}_{K}$ . To simplify notation, let $\widehat{B}=\widehat{B}_{K}$ and $\widehat{F}=\widehat{F}_{K}$ .
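The closed-form minimizer described above can be sketched in a few lines of NumPy. The function name `estimate_factors` and the simulated data (dimensions, noise level, seed) are illustrative assumptions, not part of the paper; the computation itself follows the stated recipe: eigenvectors of $X^{\prime}X$ scaled by $\sqrt{T}$, then $\widehat{B}=T^{-1}X\widehat{F}$.

```python
import numpy as np

def estimate_factors(X, K):
    """Principal-components solution of the constrained least squares problem.

    X is the p x T predictor matrix. Returns (B_hat, F_hat), where B_hat is
    p x K and F_hat is T x K with F_hat' F_hat / T = I_K.
    """
    T = X.shape[1]
    eigval, eigvec = np.linalg.eigh(X.T @ X)     # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:K]           # indices of the K largest eigenvalues
    F_hat = np.sqrt(T) * eigvec[:, top]          # columns of F_hat / sqrt(T) are eigenvectors
    B_hat = X @ F_hat / T                        # B_hat = T^{-1} X F_hat
    return B_hat, F_hat

# Simulated example (all settings are assumptions for illustration).
rng = np.random.default_rng(1)
p, T, K = 50, 200, 3
B = rng.standard_normal((p, K))
F = rng.standard_normal((T, K))
X = B @ F.T + 0.1 * rng.standard_normal((p, T))

B_hat, F_hat = estimate_factors(X, K)
gram = B_hat.T @ B_hat                           # diagonal by construction
Lambda_hat = np.linalg.solve(gram, B_hat.T)      # (B_hat' B_hat)^{-1} B_hat'
```

Both identification constraints hold by construction: `F_hat.T @ F_hat / T` is the identity and `gram` is diagonal. In addition, `Lambda_hat @ X` reproduces `F_hat.T`, so each estimated factor is a linear map of the corresponding predictor vector.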

As both the dimension $p$ of the predictor $x_{t}$ and the number of factors $K$ are diverging, it is necessary to regulate the magnitude of the factor loadings $B$ and the idiosyncratic error $u_{t}$, so that the latter is negligible with respect to the former. We should also regulate the stationarity of the time series. In this paper, we adopt the following assumptions. For simplicity in notation, we let $U=(u_{it})_{p\times T}$, $B=(b_{1},\ldots,b_{p})^{\prime}$, and $\|B\|_{\max}$ be the maximum of the absolute values of all the entries in $B$. Let $\mathcal{F}_{-\infty}^{0}$ and $\mathcal{F}_{T}^{\infty}$ denote the $\sigma$-algebras generated by $\{(f_{t},u_{t},\epsilon_{t+1}):t\leq 0\}$ and $\{(f_{t},u_{t},\epsilon_{t+1}):t\geq T\}$ respectively. Let $\alpha(T)=\sup_{A\in\mathcal{F}_{-\infty}^{0},B\in\mathcal{F}_{T}^{\infty}}|P(A)P(B)-P(AB)|$.

Assumption 1 (Factors and Loadings).

(1) There exists $b>0$ such that $\|b_{i}\|\leq b$ for $i=1,\ldots,p$ and there exist two positive constants $c_{1}$ and $c_{2}$ such that $c_{1}<p^{-1}\lambda_{\min}(B^{\prime}B)<p^{-1}\lambda_{\max}(B^{\prime}B)<c_{2}$ ;

(2) Identification: $T^{-1}F^{\prime}F=I_{K}$ , and $B^{\prime}B$ is a diagonal matrix with distinct entries.

Assumption 2 (Data Generating Process).

$\{f_{t}\}_{t\geq 1}$ , $\{u_{t}\}_{t\geq 1}$ and $\{\epsilon_{t+1}\}_{t\geq 1}$ are three independent groups, and they are strictly stationary. $\{K^{-2}E\|f_{t}\|^{4}:K\in\mathbb{N}\}$ and $\{K^{-1}E(\|f_{t}\|^{2}|y_{t+1}):K\in\mathbb{N}\}$ are bounded sequences, and $\alpha(T)<c\rho^{T}$ for $T\in\mathbb{Z}^{+}$ and some $\rho\in(0,1)$ .

Assumption 3 (Residuals and Dependence).

There is a constant $M>0$ such that (1) $E|u_{it}|^{8}\leq M$; (2) $\|\Sigma_{u}\|_{1}\leq M$; (3) For every $(t,s)$, $E|p^{-1/2}(u_{s}^{\prime}u_{t}-E(u_{s}^{\prime}u_{t}))|^{4}\leq M$; (4) $U=LER$ where $L\in\mathbb{R}^{p\times p}$ and $R\in\mathbb{R}^{T\times T}$ are non-random positive definite matrices and $E=(e_{it})_{p\times T}$ includes independent elements with $E(e_{it})=0$ and $E|e_{it}|^{7}\leq M$.

Assumptions 1 and 3 ensure that signals dominate errors at the population level as $p$ grows. Assumption 1 regulates the signal strength of factors contained in the predictor through the convergence rate of estimated factor loadings, and Assumption 3 regulates the idiosyncratic errors. Assumption 3(4) regulates weak autocorrelation and cross-sectional correlation as in Li et al. (2017). Assumption 2 imposes independence between factors and idiosyncratic errors as in Lam and Yao (2012). Assumption 2 also implies that the observations are only weakly dependent, so that the estimation accuracy grows with $T$. Assumption 2 and Assumption 3(2) imply that for every $i,j,t,s>0$, $\max_{t\leq T}p^{-1}\sum_{i,j}|E(u_{it}u_{jt})|=O(1)$ and $(pT)^{-1}\sum_{i,j,t,s}|E(u_{it}u_{js})|=O(1)$ (see Lemma 6 of Fan et al. (2013)).

Under these assumptions, we have the following consistency result for estimating the factor loadings. Instead of the Frobenius norm used in (2.1), we use the spectral norm to measure the magnitude of a matrix, defined as $\|A\|=\lambda_{\max}^{1/2}(A^{\prime}A)$ , the square root of the largest eigenvalue of the symmetric matrix $A^{\prime}A$ , for any matrix $A$ .

Theorem 1.

Let $\Lambda_{b}=(B^{\prime}B)^{-1}B^{\prime}$ and $\widehat{\Lambda}_{b}=(\widehat{B}^{\prime}\widehat{B})^{-1}\widehat{B}^{\prime}$ . Given $K=o(\min\{p^{1/3},T\})$ and Assumptions 1, 2 and 3(1)-(3), we have

1) $\|\widehat{B}-B\|=O_{p}(p^{1/2}(K^{3/2}p^{-1/2}+K^{1/2}T^{-1/2}))$,

2) $\|\widehat{\Lambda}_{b}-\Lambda_{b}\|=O_{p}(p^{-1/2}(K^{3/2}p^{-1/2}+K^{1/2}T^{-1/2}))$.

Theorem 1 extends the existing consistency result for estimating the factor loadings (Lam et al., 2011; Fan et al., 2013, 2017) by pinpointing the effect of diverging $K$. Because the dimension $p$ of factor loadings $B$ is diverging, the estimation error $\widehat{B}-B$ accumulates as $p$ grows. For a $p$-dimensional vector whose entries are all equal to one, the spectral norm is $p^{1/2}$, which diverges to infinity. Thus, we should treat $p^{1/2}$ as the unit magnitude of the spectral norm of matrices with $p$ rows, in which sense statement 1) of Theorem 1 justifies the estimation consistency of the factor loadings $B$. As the error term $u_{t}$ shrinks as $p$ grows under Assumption 3, the convergence rate of the factor loading estimation largely depends on $p$: a higher-dimensional predictor means a more accurate estimation. The convergence rate in this theorem can be further improved if we impose stronger assumptions on the negligibility of the error terms in the factor model (1.2).
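The normalization by $p^{1/2}$ can be checked numerically. The snippet below is a minimal illustration with an arbitrary choice of $p=400$ (an assumption, not a value from the paper); it computes the spectral norm of the all-ones vector both through NumPy's matrix 2-norm and through the definition $\lambda_{\max}^{1/2}(A^{\prime}A)$.

```python
import numpy as np

p = 400                                       # arbitrary illustrative dimension
ones = np.ones((p, 1))                        # p x 1 matrix with all entries equal to one

# Spectral norm as the largest singular value of the matrix.
spectral = np.linalg.norm(ones, 2)
# Same quantity from the definition: square root of the largest eigenvalue of A'A.
via_eig = np.sqrt(np.linalg.eigvalsh(ones.T @ ones).max())
```

Both quantities equal $p^{1/2}=20$ here, which is the scale against which the bound in statement 1) should be read.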

Given $\widehat{B}$ , it is easy to see $\widehat{f}_{t}=\widehat{\Lambda}_{b}Bf_{t}+\widehat{\Lambda}_{b}u_{t}$ . Thus, together with the negligibility of the error term $u_{t}$ , the consistency of $\widehat{B}$ and $\widehat{\Lambda}_{b}$ indicates the closeness between the true factors $f_{t}$ and the estimated factors $\widehat{f}_{t}$ , of which the latter will be used in the subsequent sufficient dimension reduction. The error covariance matrix $\Sigma_{u}$ can be estimated by thresholding the sample covariance matrix of the estimated residual $x_{t}-\widehat{B}\widehat{f}_{t}$ , denoted by $\widehat{\Sigma}_{u}=(\hat{\sigma}^{u}_{ij})_{p\times p}$ , as in Cai and Liu (2011), Xue et al. (2012) and Fan et al. (2013, 2016).
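The decomposition of $\widehat{f}_{t}$ stated above can be spelled out in one line. The sketch assumes that the estimated factors satisfy $\widehat{f}_{t}=\widehat{\Lambda}_{b}x_{t}$, which follows from $\widehat{B}=T^{-1}X\widehat{F}$ and the eigen-equation for $\widehat{F}$, and that $x_{t}=Bf_{t}+u_{t}$ as in the factor model (1.2):

$$\widehat{f}_{t}=\widehat{\Lambda}_{b}x_{t}=\widehat{\Lambda}_{b}(Bf_{t}+u_{t})=\widehat{\Lambda}_{b}Bf_{t}+\widehat{\Lambda}_{b}u_{t}.$$

If $\widehat{\Lambda}_{b}$ is replaced by $\Lambda_{b}=(B^{\prime}B)^{-1}B^{\prime}$, then $\Lambda_{b}B=I_{K}$ and the first term reduces to $f_{t}$, so the closeness of $\widehat{f}_{t}$ to $f_{t}$ is governed by the closeness of $\widehat{\Lambda}_{b}$ to $\Lambda_{b}$ and the size of $\widehat{\Lambda}_{b}u_{t}$.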

3 Directional regression based on an invariance result

3.1 An invariance result

Had the true factors $f_{t}$ been observed, directional regression would estimate the central subspace ${\mathcal{S}}_{\scriptscriptstyle y|f}$ as the column space of

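$$M_{dr}=E\{2\,\mathrm{var}(f_{t})-E[(f_{t}-g_{s})(f_{t}-g_{s})^{\prime}\mid y_{t+1},\eta_{s+1}]\}^{2},\qquad(3.1)$$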

where $(g_{s},\eta_{s+1})$ is a hypothetical independent copy of $(f_{t},y_{t+1})$ . The term $\mathrm{var}(f_{t})$ can be replaced with the identity matrix as in Li and Wang (2007), but we keep it in this form for the convenience in the theoretical work developed later. For the resulting directions being included in ${\mathcal{S}}_{\scriptscriptstyle y|f}$ , $f_{t}$ needs to satisfy the linearity condition and the constant variance condition; that is,

(B1) $E(b^{\prime}f_{t}|\phi_{1}^{\prime}f_{t},\cdots,\phi_{L}^{\prime}f_{t})$ is a linear function of $(\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t})$ for any $b\in\mathbb{R}^{K}$ .

(B2) $\mathrm{var}(f_{t}\mid\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t})$ is degenerate.

Since ${\mathcal{S}}_{\scriptscriptstyle y|f}$ is unknown, (B1) and (B2) are commonly strengthened such that they are satisfied for basis matrices of any $L$-dimensional subspace of $\mathbb{R}^{K}$. The strengthened conditions equivalently require the factors to be jointly normally distributed. To assess these conditions, one can treat $f_{t}$ as the response and $(\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t})$ as the predictor in a regression; then (B1) is the linearity assumption on the regression function and (B2) is the homoscedasticity assumption on the error term. In this sense, we follow the convention in the regression literature and treat (B2) as less worrisome than (B1) in practice. We tentatively assume (B1) and relax it in §4.

Under general conditions, the column space of $M_{dr}$ is $L$ -dimensional, which, together with the linearity condition (B1) and the constant variance condition (B2), means the exhaustive recovery of ${\mathcal{S}}_{\scriptscriptstyle y|f}$ . These conditions are proposed in Li and Wang (2007) and reviewed in the supplement. They are weaker than those required for the exhaustiveness of sliced inverse regression, as more information about $f_{t}|y_{t+1}$ , i.e. the second moment, is used. We assume these conditions throughout the paper, including §4 where (B1) is violated.

To pinpoint the effect of using the estimated factors in directional regression, we next propose an invariance result for $M_{dr}$ . As mentioned in §1, a similar invariance result for sliced inverse regression can be found in Fan, Xue and Yao (2017) where only the inverse first moment is involved (see their equation (2.6)). To simplify the discussion, in the rest of the subsection we assume an oracle scenario where $B$ is known a priori, which gives

$$\widehat{f}_{t}=f_{t}+u_{t}^{*},\qquad(3.2)$$

where $u^{*}_{t}=\Lambda_{b}u_{t}$ is independent of $f_{t}$ . Let $u_{s}^{*}$ be an independent copy of $u_{t}^{*}$ in (3.2) and let $\widehat{g}_{s}=g_{s}+u_{s}^{*}$ . Since $B$ is known, $\widehat{g}_{s}$ is an independent copy of $\widehat{f}_{t}$ .

Lemma 1.

(The invariance result) Under model (1.2), $M_{dr}$ defined in (3.1) is invariant if the true factors $f_{t}$ and $g_{s}$ are replaced with the estimated factors $\widehat{f}_{t}$ and $\widehat{g}_{s}$ .

Using the estimated factors, one would naturally treat ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$ as the working parameter in the stage of sufficient dimension reduction. However, as no distributional assumptions are imposed on $u^{*}_{t}$, both (B1) and (B2) can be violated for $\widehat{f}_{t}$, which causes inconsistency of directional regression for recovering ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$. In addition, ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$ itself may deviate from the parameter of interest ${\mathcal{S}}_{\scriptscriptstyle y|f}$, as the identity between the two essentially requires the normality of both $f_{t}$ and $u^{*}_{t}$ (Li and Yin, 2007). The invariance result provides the key to address these issues; that is, we can bypass ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$ and directly estimate ${\mathcal{S}}_{\scriptscriptstyle y|f}$ using the estimated factors, as if the true factors were used. As $\mathrm{var}(\widehat{f}_{t})$ is no longer the identity matrix, $M_{dr}$ adopted here modifies its original form in Li and Wang (2007). This modification is crucial as it averages out the effect of the estimation error $u_{t}^{*}$. It also means that the column space of the working $M_{dr}$ does differ from ${\mathcal{S}}_{\scriptscriptstyle y|\widehat{f}}$.
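A brief sketch of why the error term averages out may help; it is an informal argument rather than the paper's proof. It assumes the oracle setting with $\widehat{f}_{t}=f_{t}+u_{t}^{*}$ and $\widehat{g}_{s}=g_{s}+u_{s}^{*}$, that $u_{t}^{*}$ and $u_{s}^{*}$ are independent of each other and of $(f_{t},y_{t+1},g_{s},\eta_{s+1})$, and it writes $\Sigma_{u^{*}}=\mathrm{var}(u_{t}^{*})$, a symbol introduced here only for this sketch. Independence gives

$$\mathrm{var}(\widehat{f}_{t})=\mathrm{var}(f_{t})+\Sigma_{u^{*}},$$

and, since the cross terms between $f_{t}-g_{s}$ and $u_{t}^{*}-u_{s}^{*}$ have zero conditional mean while $E\{(u_{t}^{*}-u_{s}^{*})(u_{t}^{*}-u_{s}^{*})^{\prime}\}=2\Sigma_{u^{*}}$,

$$E\{(\widehat{f}_{t}-\widehat{g}_{s})(\widehat{f}_{t}-\widehat{g}_{s})^{\prime}\mid y_{t+1},\eta_{s+1}\}=E\{(f_{t}-g_{s})(f_{t}-g_{s})^{\prime}\mid y_{t+1},\eta_{s+1}\}+2\Sigma_{u^{*}}.$$

Subtracting the second display from twice the first cancels $2\Sigma_{u^{*}}$, so

$$2\,\mathrm{var}(\widehat{f}_{t})-E\{(\widehat{f}_{t}-\widehat{g}_{s})(\widehat{f}_{t}-\widehat{g}_{s})^{\prime}\mid y_{t+1},\eta_{s+1}\}=2\,\mathrm{var}(f_{t})-E\{(f_{t}-g_{s})(f_{t}-g_{s})^{\prime}\mid y_{t+1},\eta_{s+1}\}.$$

The cancellation would fail if $2\,\mathrm{var}(\widehat{f}_{t})$ were replaced by twice the identity matrix, which is why the $\mathrm{var}(\cdot)$ term is retained in the kernel.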

3.2 Consistency of directional regression

In reality, the hypothetical independent copies $(g_{s},\eta_{s+1})$ and $(f_{t},y_{t+1})$ do not exist in the observed data, so we expand (3.1) and estimate an equivalent form of $M_{dr}$ ,

[TABLE]

By Lemma 1, we can replace $f_{t}$ with $\widehat{f}_{t}$, in which $B$ is replaced with $\widehat{B}$. For ease of estimation, it is common practice in the sufficient dimension reduction literature to employ the slicing technique; that is, we partition the sample of $y_{t+1}$ into $H$ slices with equal sample proportion. At the population level, this corresponds to partitioning the support of $y_{t+1}$ into $H$ slices with equal probability, and using the corresponding indicator, denoted by $y_{t+1}^{D}$, as the new working response variable.

Because the slice indicator $y_{t+1}^{D}$ is a measurable function of the original response $y_{t+1}$, $f_{t}$ must affect $y_{t+1}^{D}$ through $y_{t+1}$. Thus, the working central subspace ${\mathcal{S}}_{y^{D}|f}$ is always a subspace of the central subspace of interest ${\mathcal{S}}_{\scriptscriptstyle y|f}$. The two spaces further coincide for large $H$. Because the dimension $L$ of ${\mathcal{S}}_{\scriptscriptstyle y|f}$ is fixed as $K$ grows, without loss of generality, we fix $H$ as $K$ grows and assume the identity between ${\mathcal{S}}_{y^{D}|f}$ and ${\mathcal{S}}_{\scriptscriptstyle y|f}$. This identity is confirmed by an omitted simulation study that shows the robustness of the proposed method to the choice of $H$ over a reasonable range, e.g. from three to ten. The same phenomenon has also been commonly observed in the literature (Li, 1991; Li and Wang, 2007).

Using $y_{t+1}^{D}$ , the inverse moments $E(\widehat{f}_{t}|y_{t+1})$ and $E(\widehat{f}_{t}\widehat{f}_{t}^{\prime}|y_{t+1})$ in ${M}_{dr}$ become the marginal moments of $\widehat{f}_{t}$ within each slice, and can be estimated by the usual sample moments. Hence, the slicing technique simplifies the estimation. In detail, we have

Implementation of Step 2. Let $y_{(0)/H}=-\infty$, and, for $i=1,\ldots,H$, let $y_{(i)/H}$ be the $(i/H)$th quantile of $\{y_{1},\ldots,y_{T}\}$. Let $y_{t+1}^{D}=i$ if $y_{t+1}\in(y_{(i-1)/H},y_{(i)/H}]$. Estimate $E(\widehat{f}_{t}|y_{t+1}^{D}=i)$ by $\textstyle\sum_{t=1}^{T}\widehat{f}_{t}I(y_{t+1}^{D}=i)/(T/H)$ and $E(\widehat{f}_{t}\widehat{f}_{t}^{\prime}|y_{t+1}^{D}=i)$ by $\textstyle\sum_{t=1}^{T}\widehat{f}_{t}\widehat{f}_{t}^{\prime}I(y_{t+1}^{D}=i)/(T/H)$. Estimate $\mathrm{var}(\widehat{f}_{t})$ by $I_{K}$. Plug these into (3.3) to derive $\widehat{M}_{dr}$. Estimate ${\mathcal{S}}_{\scriptscriptstyle y|f}$ by the space spanned by $(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})$, the leading $L$ eigenvectors of $\widehat{M}_{dr}$.
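The slicing part of Step 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `slice_moments` and its arguments are hypothetical, and the within-slice moments are divided by the realized slice counts rather than by $T/H$ (the two agree when every slice holds exactly $T/H$ observations). The assembly of $\widehat{M}_{dr}$ from (3.3) is deliberately left out.

```python
import numpy as np

def slice_moments(f_hat, y_next, H):
    """Slice y_{t+1} into H equal-proportion slices; return within-slice moments.

    f_hat  : (T, K) array of estimated factors; row t is paired with y_next[t].
    y_next : (T,) array of responses y_{t+1}.
    H      : number of slices.
    """
    f_hat = np.asarray(f_hat, dtype=float)
    y_next = np.asarray(y_next, dtype=float)
    # Interior cut points: the (1/H), ..., ((H-1)/H) sample quantiles.
    edges = np.quantile(y_next, np.arange(1, H) / H)
    # Label i in {1, ..., H} when y falls in (edge_{i-1}, edge_i].
    labels = np.searchsorted(edges, y_next, side="left") + 1
    K = f_hat.shape[1]
    means = np.zeros((H, K))        # estimates of E(f_hat | y^D = i)
    seconds = np.zeros((H, K, K))   # estimates of E(f_hat f_hat' | y^D = i)
    counts = np.zeros(H, dtype=int)
    for i in range(1, H + 1):
        block = f_hat[labels == i]
        counts[i - 1] = len(block)
        if len(block) > 0:
            # Assumption: normalize by the realized slice count, not T/H.
            means[i - 1] = block.mean(axis=0)
            seconds[i - 1] = block.T @ block / len(block)
    return labels, counts, means, seconds
```

The returned slice means and second moments are the ingredients that the step above plugs into (3.3).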

To estimate $\mathrm{var}(\widehat{f}_{t})$ in (3.3), one can alternatively use $I_{K}+\widehat{\Sigma}_{u^{*}}$ by the restriction $\mathrm{var}(f_{t})=I_{K}$ , where $\widehat{\Sigma}_{u^{*}}$ is the thresholding covariance estimator. An omitted simulation study shows that the resulting estimator of $M_{dr}$ performs similarly.

Theorem 2.

Suppose $K=o(\min(p^{1/3},T^{1/2}))$ . Under Assumptions 1, 2 and 3(1)-(3), the linearity condition (B1), and the constant variance condition (B2), $(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})$ span a consistent estimator of ${\mathcal{S}}_{\scriptscriptstyle y|f}$ in the sense that

[TABLE]

In connection with Theorem 1, this theorem shows that the estimation error of ${\mathcal{S}}_{\scriptscriptstyle y|f}$ comes from two parts. The first part, which is of order $O_{P}(K^{3/2}p^{-1/2})$, is inherited from factor analysis. This part represents the price we pay for estimating the factor loadings $B$, and it depends on the dimension $p$ of the original predictor. By contrast, the second part, which is of order $O_{P}(KT^{-1/2})$, does not depend on $p$ and is newly generated in the sufficient dimension reduction stage. From the proof of Theorem 2 (see the supplement), it represents the price we pay for estimating the unknown inverse second moment involved in the kernel matrix. Therefore, this part would persist even if no error were generated in factor analysis.

4 Relaxing the linearity condition

As mentioned in §3, (B1) can be regarded as a parametric assumption and can be violated in real applications. For example, this occurs when one incorporates the lag variables of $y_{t+1}$ in forecasting and considers Model (1.4). In this section, we address this issue in two ways: first, we justify the consistency of the proposed method without (B1) under the setting that the number of factors $K$ must diverge; second, we weaken (B1) and generalize the proposed method accordingly, following the spirit of Dong and Li (2010), under the setting that $K$ is fixed.

When (B1) is violated, Theorem 2 still holds if we treat $(\phi_{1},\ldots,\phi_{L})$ as the $L$ leading eigenvectors of $M_{dr}$ . Thus, the consistency of the proposed methodology depends on the closeness between the column space of $M_{dr}$ and the central subspace ${\mathcal{S}}_{\scriptscriptstyle y|f}$ , which hinges on the approximation of (B1). Fortunately, the latter has been justified in Hall and Li (1993) for all large $K$ .

Theorem 3.

Suppose $K\rightarrow\infty$ and $K=o(\min(p^{1/3},T^{1/2}))$ . Under Assumptions 1, 2, 3(1)-(3), the constant variance condition (B2), and other regularity conditions (see the supplement), $\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L}$ span a consistent estimator of ${\mathcal{S}}_{\scriptscriptstyle y|f}$ in the sense that

[TABLE]

In the literature, Hall and Li’s result on the approximation of (B1) was used heuristically to support the effectiveness of inverse moment methods when (B1) is violated; see, for example, Cook and Weisberg (1991) and Li and Wang (2007). As far as we are aware, this is the first attempt to rigorously establish the consistency of inverse moment methods using Hall and Li’s result.

When $K$ is small and the factors clearly violate (B1), the approximation result in Hall and Li (1993) no longer applies. In this case, we treat $K$ as fixed, and relax (B1) to

(B1’) $E(f_{t}|\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t})$ is a linear combination of $\{h_{i}(\phi_{1}^{\prime}f_{t},\ldots,\phi_{L}^{\prime}f_{t}):i=1,\ldots,q\}$ .

One can set the basis functions in (B1’) to be power functions, trigonometric functions, etc. In addition to (B1’), we require the constant variance condition (B2), which, as mentioned in §1, is quite mild. These conditions closely resemble those in Dong and Li (2010). We generalize directional regression from the eigen-decomposition of $M_{dr}$ to minimizing:

[TABLE]

over all the semi-orthogonal matrices $(\psi_{1},\ldots,\psi_{L})$ , where $v^{\otimes 2}$ denotes $vv^{\prime}$ for any real vector $v$ and $E(f_{t}|\psi_{1}^{\prime}f_{t},\ldots,\psi_{L}^{\prime}f_{t})$ is modeled parametrically as if (B1’) held for $(\psi_{1},\ldots,\psi_{L})$ . Using the estimated factors $\widehat{f}_{t}$ and $\widehat{g}_{s}$ and the slicing strategy, we can similarly construct $\widehat{\kappa}(\cdot)$ .
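To make the parametric modeling of $E(f_{t}|\psi_{1}^{\prime}f_{t},\ldots,\psi_{L}^{\prime}f_{t})$ concrete, here is a minimal sketch that fits it by least squares on a power basis. The function name `fit_basis_conditional_mean`, the `degree` argument, and the specific basis (an intercept plus coordinate-wise powers, with no cross terms) are assumptions made for illustration; (B1’) only requires some finite set of basis functions $h_{1},\ldots,h_{q}$.

```python
import numpy as np

def fit_basis_conditional_mean(f, psi, degree=2):
    """Least-squares fit of E(f_t | psi' f_t) on a power basis.

    f      : (T, K) array of factors (true or estimated).
    psi    : (K, L) candidate matrix (psi_1, ..., psi_L).
    degree : highest power in the basis (hypothetical tuning choice).
    """
    f = np.asarray(f, dtype=float)
    z = f @ np.asarray(psi, dtype=float)      # (T, L) reduced predictors
    # Basis h_i: intercept, z_j, z_j^2, ..., z_j^degree for each coordinate j.
    cols = [np.ones(len(z))]
    for d in range(1, degree + 1):
        cols.append(z ** d)
    basis = np.column_stack(cols)             # (T, 1 + L * degree)
    # Regress every coordinate of f on the basis in one call.
    coef, *_ = np.linalg.lstsq(basis, f, rcond=None)
    return basis @ coef                       # fitted conditional means, (T, K)
```

The fitted values stand in for the conditional expectation when the objective is evaluated at a given $(\psi_{1},\ldots,\psi_{L})$; trigonometric terms could be appended to `cols` in the same way.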

Under fairly general assumptions (Dong and Li, 2010), there exists a unique minimizer of $\kappa(\cdot)$ up to orthogonal column transformations, which spans the central subspace ${\mathcal{S}}_{y|f}$; we omit these assumptions here. Intuitively, a minimizer of $\widehat{\kappa}(\cdot)$ spans a consistent estimator of ${\mathcal{S}}_{y|f}$.

Theorem 4.

Let $(\widehat{\phi}_{1},\ldots,\widehat{\phi}_{L})$ denote any minimizer of $\widehat{\kappa}(\psi_{1},\ldots,\psi_{L})$ . Under Assumptions 1 – 3, condition (B1’), and the constant variance condition (B2), we have

[TABLE]

By Theorem 3 and Theorem 4, we can apply the proposed forecasting method or its generalization without concern for the linearity condition (B1), for both fixed and diverging $K$. For example, we now allow the predictor $x_{t}$, as well as the factors $f_{t}$, to contain discrete components.

5 Determining $K$ and $L$

We now discuss how to determine the number of factors $K$ and the dimension $L$ of the central subspace ${\mathcal{S}}_{\scriptscriptstyle y|f}$ . The problem is commonly called order determination in the literature of dimension reduction (Luo and Li, 2016).

In the literature, various order-determination methods have been proposed to estimate $K$, including those of Bai and Ng (2002, 2008), Onatski (2010), Ahn and Horenstein (2013), Ludvigson and Ng (2009), and Jurado et al. (2015). Recently, Li et al. (2017) extended Bai and Ng’s approach to the case of diverging $K$, and estimated $K$ by

[TABLE]

where $K_{\max}$ is a prescribed upper bound that possibly increases with $p$ and $T$ , and $\widehat{F}_{k}$ denotes the solution to (2.1) with $k$ being the working number of factors. $q(p,T)$ is a penalty function such that $q(p,T)=o(1)$ and $(K_{\max}^{6}/p+K_{\max}^{4}/T)^{-1}q(p,T)\rightarrow\infty$ . We adopt Li et al.’s approach, and follow their suggestion to take

[TABLE]

To estimate the dimension $L$ of the central subspace ${\mathcal{S}}_{\scriptscriptstyle y|f}$, multiple methods have been proposed, including the sequential tests (Li, 1991; Li and Wang, 2007), the bootstrap procedure (Ye and Weiss, 2003), the cross-validation method (Xia et al., 2002; Wang and Xia, 2008), the BIC-type procedure (Zhu et al., 2006), and the ladle estimator (Luo and Li, 2016), among which we adopt the BIC-type procedure and extend it to the high-dimensional case. For a positive semi-definite matrix parameter $M$ whose columns span ${\mathcal{S}}_{\scriptscriptstyle y|f}$ and its sample estimator $\widehat{M}$, let $\{\lambda_{1},\ldots,\lambda_{K}\}$ and $\{\widehat{\lambda}_{1},\ldots,\widehat{\lambda}_{K}\}$ be their eigenvalues in descending order, respectively. By definition, $\lambda_{L}$ must be positive. We introduce a constant $c\in(0,1)$ and set $K_{c}$, the nearest integer to $cK$, as an upper bound of $L$. This is reasonable because $L$ is fixed and usually small in practice. We modify the objective function in Zhu et al. (2006) to $G:\{1,\ldots,K_{c}\}\rightarrow\mathbb{R}$ with

[TABLE]

where $\tau$ is the number of positive $\widehat{\lambda}_{i}$ ’s. We then estimate $L$ as the maximizer $\widehat{L}$ of $G(\cdot)$ . Due to the introduction of the non-trivial upper bound $K_{c}$ , we do not need to impose additional constraints on $K$ or $\|\widehat{M}-M\|$ for the consistency of $\widehat{L}$ . This improves the result in Zhu et al. (2006).

Theorem 5.

Suppose $\|\widehat{M}-M\|=o_{P}(1)$ . If $C_{T}$ satisfies $C_{T}KT^{-1}\rightarrow 0$ and $\|\widehat{M}-M\|^{2}=o_{P}(C_{T}KT^{-1})$ , then $\widehat{L}$ converges to $L$ in probability.

A candidate for $C_{T}$ is $K^{-1}T\|\widehat{M}-M\|$. Referring to Theorem 2, if we apply the BIC-type procedure to directional regression, then we can choose $C_{T}$ to be $K^{1/2}p^{-1/2}T+T^{1/2}$.
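As a quick check of this choice, the following sketch assumes that the two error orders discussed after Theorem 2 also bound $\|\widehat{M}_{dr}-M_{dr}\|$, and writes $r_{p,T}=K^{3/2}p^{-1/2}+KT^{-1/2}$ (a shorthand introduced here, not in the paper). Substituting $r_{p,T}$ for $\|\widehat{M}-M\|$ in the candidate gives

$$K^{-1}T\,r_{p,T}=K^{1/2}p^{-1/2}T+T^{1/2}=C_{T},\qquad C_{T}KT^{-1}=r_{p,T}.$$

Under $K=o(\min(p^{1/3},T^{1/2}))$ we have $K^{3/2}p^{-1/2}=(K/p^{1/3})^{3/2}\rightarrow 0$ and $KT^{-1/2}\rightarrow 0$, so $C_{T}KT^{-1}=r_{p,T}\rightarrow 0$, which is the first condition of Theorem 5. For the second, $\|\widehat{M}-M\|^{2}=O_{P}(r_{p,T}^{2})=o_{P}(r_{p,T})=o_{P}(C_{T}KT^{-1})$ because $r_{p,T}\rightarrow 0$.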

6 Simulation studies

We now present a numerical example to illustrate the performance of the proposed forecasting method that uses directional regression in the sufficient dimension reduction stage. The data generating process is specified as follows:

[TABLE]

We fix $\phi_{1}=(1,1,1,0^{\prime}_{K-3})/\sqrt{3}$ and $\phi_{2}=(1,0^{\prime}_{K-3},1,3)/\sqrt{11}$. Following Li et al. (2017), we set the number of factors $K$ to increase with $p$ in the form of $K=[1.5\log(p)]$, where $[x]$ denotes the integer part of a real number $x$. The factor loadings $b_{i}$ are independently sampled from $U[-1,2]$. We generate the latent factors $f_{j,t}$ and the error terms $u_{it}$ from two AR(1) processes, $f_{j,t}=\alpha_{j}f_{j,t-1}+e_{jt}$ and $u_{it}=\rho_{i}u_{i,t-1}+\nu_{it}$, with $\alpha_{j},\rho_{i}$ drawn from $U[0.2,0.8]$ and fixed during the simulation, and with the noises $e_{jt},\nu_{it}$ distributed as $N(0,1)$. We set $\epsilon_{t+1}\sim N(0,1)$ and $\sigma=0.2$.
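The design just described can be sketched in a few lines. This is an illustrative reading rather than the authors' simulation code: the function name `simulate_factor_data`, the burn-in length, the zero initial values, and the seed are assumptions, as is the linear factor structure $x_{t}=Bf_{t}+u_{t}$, which is taken from the factor model referred to as (1.2). Each entry of a loading vector $b_{i}$ is drawn independently from $U[-1,2]$.

```python
import numpy as np

def simulate_factor_data(p, T, burn_in=200, seed=0):
    """Simulate predictors x_t = B f_t + u_t with AR(1) factors and errors."""
    rng = np.random.default_rng(seed)
    K = int(1.5 * np.log(p))                   # K = [1.5 log(p)], integer part
    B = rng.uniform(-1.0, 2.0, size=(p, K))    # loadings b_i ~ U[-1, 2]
    alpha = rng.uniform(0.2, 0.8, size=K)      # AR(1) coefficients of factors
    rho = rng.uniform(0.2, 0.8, size=p)        # AR(1) coefficients of errors
    n = T + burn_in                            # burn-in is an assumption
    e = rng.normal(size=(n, K))                # factor innovations, N(0, 1)
    nu = rng.normal(size=(n, p))               # error innovations, N(0, 1)
    f = np.zeros((n, K))
    u = np.zeros((n, p))
    for t in range(1, n):
        f[t] = alpha * f[t - 1] + e[t]
        u[t] = rho * u[t - 1] + nu[t]
    f, u = f[burn_in:], u[burn_in:]            # discard the burn-in period
    x = f @ B.T + u                            # (T, p) predictors
    # Directions phi_1 and phi_2 as fixed in the text (requires K >= 3).
    phi1 = np.concatenate([np.ones(3), np.zeros(K - 3)]) / np.sqrt(3)
    phi2 = np.concatenate([[1.0], np.zeros(K - 3), [1.0, 3.0]]) / np.sqrt(11)
    return x, f, B, phi1, phi2
```

Note that the simulated factors are not standardized to have identity covariance; the rotation used to meet the identifiability conditions is a separate step.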

We consider four different choices of the link function $g(\cdot)$ ,

Model I: $y_{t+1}=0.4(\phi_{1}^{\prime}f_{t})^{2}+3\sin(\phi_{2}^{\prime}f_{t}/4)+\sigma\epsilon_{t+1}$ ;

Model II: $y_{t+1}=3\sin(\phi_{1}^{\prime}f_{t}/4)+3\sin(\phi_{2}^{\prime}f_{t}/4)+\sigma\epsilon_{t+1}$ ;

Model III: $y_{t+1}=0.4(\phi_{1}^{\prime}f_{t})^{2}+|\phi_{2}^{\prime}f_{t}|^{1/2}+\sigma\epsilon_{t+1}$ ;

Model IV: $y_{t+1}=(\phi_{1}^{\prime}f_{t})(\phi_{2}^{\prime}f_{t}+1)+\sigma\epsilon_{t+1}$ .
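For reference, the four link functions can be written as one small helper. The function name `link_response` and its arguments are hypothetical; `a` and `b` stand for $\phi_{1}^{\prime}f_{t}$ and $\phi_{2}^{\prime}f_{t}$, and the noise `eps` may be passed in explicitly so that results are reproducible.

```python
import numpy as np

def link_response(model, a, b, sigma=0.2, eps=None, rng=None):
    """Generate y_{t+1} under Models I-IV, with a = phi_1' f_t and b = phi_2' f_t."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if eps is None:
        # Draw eps_{t+1} ~ N(0, 1) when no noise vector is supplied.
        rng = np.random.default_rng() if rng is None else rng
        eps = rng.normal(size=a.shape)
    if model == "I":
        signal = 0.4 * a**2 + 3.0 * np.sin(b / 4.0)
    elif model == "II":
        signal = 3.0 * np.sin(a / 4.0) + 3.0 * np.sin(b / 4.0)
    elif model == "III":
        signal = 0.4 * a**2 + np.sqrt(np.abs(b))
    elif model == "IV":
        signal = a * (b + 1.0)
    else:
        raise ValueError("model must be one of 'I', 'II', 'III', 'IV'")
    return signal + sigma * np.asarray(eps, dtype=float)
```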

The proposed forecasting by directional regression (DR) is compared with the forecasting by sliced inverse regression (SIR) (Fan et al., 2017), the linear PC-estimator (principal components), and the semi-parametric efficient estimator (SEE) proposed by Ma and Zhu (2013). Models I and III include at least one symmetric component, which cannot be estimated well by SIR. Model II is favorable to SIR. Model IV contains an interaction component to examine the ability of each method to detect such a nonlinear effect.

To gauge the quality of the estimated directions, we adopt the squared multiple correlation coefficient $R^{2}(\widehat{\phi})=\max_{\phi\in{\mathcal{S}}_{\scriptscriptstyle y|f},\|\phi\|=1}(\phi^{\prime}\widehat{\phi})^{2}$ , where ${\mathcal{S}}_{\scriptscriptstyle y|f}$ is spanned by $\phi_{1}$ and $\phi_{2}$ . We ensure that the true factors and loadings meet the identifiability conditions by calculating $H$ such that $T^{-1}HF^{\prime}FH^{\prime}=I_{K}$ and $H^{-1}B^{\prime}BH^{-1}$ is diagonal. The rotated central subspace is then understood as $H^{-1}{\mathcal{S}}_{\scriptscriptstyle y|f}$ , which is still denoted as ${\mathcal{S}}_{\scriptscriptstyle y|f}$ (see Fan et al. (2017)).
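The maximization in $R^{2}(\widehat{\phi})$ has a closed form: for a unit vector $\widehat{\phi}$, the maximum of $(\phi^{\prime}\widehat{\phi})^{2}$ over unit vectors $\phi$ in a subspace equals the squared norm of the projection of $\widehat{\phi}$ onto that subspace. A minimal sketch follows; the function name `r2_direction` is hypothetical, and $\widehat{\phi}$ is normalized to unit length inside the function, which is an assumption about how the estimate is scaled.

```python
import numpy as np

def r2_direction(phi_hat, basis):
    """Squared multiple correlation between phi_hat and span(basis).

    phi_hat : (K,) estimated direction.
    basis   : (K, L) matrix whose columns span the central subspace
              (assumed to have full column rank).
    """
    phi_hat = np.asarray(phi_hat, dtype=float)
    # Orthonormal basis of the subspace via a reduced QR decomposition.
    Q, _ = np.linalg.qr(np.asarray(basis, dtype=float))
    unit = phi_hat / np.linalg.norm(phi_hat)
    # Squared length of the projection of the unit vector onto the subspace.
    return float(np.sum((Q.T @ unit) ** 2))
```

A value of 1 means the estimated direction lies in the subspace, and 0 means it is orthogonal to it.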

Table 1 compares the estimation of SIR and DR in simulation studies, where the PC is omitted as it produces only one directional estimate. It is evident that DR improves substantially on SIR in Models I, III and IV, with higher $R^{2}(\widehat{\phi})$ and lower variance. This is not surprising as DR explores higher conditional moments and hence incorporates more information. SEE is slightly better than SIR in these cases, but it also fails to capture $\phi_{2}$ accurately, partially due to its semi-parametric nature, which typically requires lengthy steps to converge. In Model II, SIR, DR and SEE yield comparable results. We also observe that DR performs particularly well in small samples, which makes it favorable in practice.

We next investigate the predictive power of DR through the out-of-sample $R^{2}$ , i.e.,

[TABLE]

where we use a fixed number $n_{T}=50$ of testing samples to evaluate the out-of-sample performance, and $\hat{y}_{t}$ is the predicted value using all information prior to $t$. The fitting is done by building an additive model in Step 3 of the proposed estimator. In the case of the PC-estimator, $\widehat{K}$ smooth functions are constructed for the estimated factors. In contrast, only $\widehat{L}$ smooth functions are applied in the cases of SIR, DR and SEE. $\widehat{K}$ and $\widehat{L}$ are obtained using the procedures introduced in Section 5. It is clear from Table 2 that DR performs well in almost all cases. Similar to DR, SEE is better than SIR as it explores the structural dimension more thoroughly with different forms of the target. However, SEE often requires a large sample size to produce accurate estimates. The PC-estimator is more robust in the presence of symmetric components, but generally fails to capture the interaction effect. We also carry out simulations to assess the accuracy of $\widehat{K}$ and $\widehat{L}$ obtained from the procedures in Section 5, and to examine the sensitivity of forecasting performance with respect to $\widehat{K}$ and $\widehat{L}$. In addition, we conduct experiments to show the effectiveness of the proposed method when the linearity condition is violated for the factors $f_{t}$. Due to space limitations, these numerical results are presented in the supplementary materials.

7 Macro Index Forecast

We now analyze how the diffusion indices constructed by the proposed DR impact real-data forecasts. We use a monthly macro dataset consisting of 134 macroeconomic time series recently composed by McCracken and Ng (2016), which are classified into 8 groups: (1) output and income, (2) labor market, (3) housing, (4) consumption, orders and inventories, (5) money and credit, (6) bond and exchange rates, (7) prices, and (8) stock market. The dataset spans from 1959:01 to 2016:01. For a given target time series, we model the multi-step-ahead variable as:

[TABLE]

where $y_{t+h}^{h}=h^{-1}\sum_{i=1}^{h}y_{t+i}$ is the variable to forecast, as in Stock and Watson (2002a).
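The averaged target $y_{t+h}^{h}$ can be built from a series with a cumulative sum. In this sketch the function name `multi_step_target` is hypothetical and time is indexed from zero, so entry `t` of the output is the mean of `y[t+1], ..., y[t+h]`.

```python
import numpy as np

def multi_step_target(y, h):
    """Return y^h_{t+h} = h^{-1} * sum_{i=1}^{h} y_{t+i} for t = 0, ..., T-h-1."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    # c[k] holds y[0] + ... + y[k-1], so a window sum is a difference of two entries.
    c = np.concatenate([[0.0], np.cumsum(y)])
    return (c[h + 1:] - c[1:T - h + 1]) / h
```

With `h = 1` this reduces to the one-step-ahead series `y[1:]`.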

We follow McCracken and Ng (2016) to preprocess the data. We also employ the Ljung-Box test with various lags to test for uncorrelatedness in residuals, and the results support the use of the proposed methods. Forecasts of $y_{t+h}^{h}$ are constructed based on a moving window with fixed length ($T=120$) to account for timeliness. For each fixed window, the factors in the forecasting equation are estimated by the method of principal components using all time series except the target. As noted by McCracken and Ng (2016), 8 factors have good explanatory power in various cases, so we set $K=8$ throughout the exercise. For each method $M$, we compare out-of-sample forecasting performances using the relative MSE (RMSE) to the PC method,

[TABLE]

which we evaluate on the last $m=240$ months (20 years). The methods we consider here include SIR($i$) and DR($i$) ($i=1,2$), where SIR($i$) denotes sufficient forecasting with $L=i$, and similarly for DR. Both methods use an additive model in specifying the forecasting equation. We also fit an additive model to the estimated factors directly, denoted by NL-PC, to see how much can be gained from nonlinearity alone, without projecting the principal components.
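A minimal sketch of the comparison metric is given below. Since the display equation is not reproduced here, the form used is an assumption: the ratio of the sum of squared forecast errors of a method to that of the PC method over the same evaluation months. The function name `relative_mse` is hypothetical.

```python
import numpy as np

def relative_mse(y, yhat_method, yhat_pc):
    """Sum of squared forecast errors of a method relative to the PC benchmark.

    All three inputs cover the same evaluation period (e.g. the last m months).
    Values below 1 indicate that the method improves on the PC forecast.
    """
    y = np.asarray(y, dtype=float)
    num = np.sum((y - np.asarray(yhat_method, dtype=float)) ** 2)
    den = np.sum((y - np.asarray(yhat_pc, dtype=float)) ** 2)
    return float(num / den)
```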

We report results in Table 3 for $h=1,6,12$ , on the maximum, minimum and median of RMSE in each broad sector. Several features are noteworthy. First, a nonlinear additive model built on estimated factors does not buy us more predictive power, except in the housing sector, where most of the nonlinear methods improve prediction accuracy. Second, the one-step-ahead out-of-sample forecast favors DR(1), as we observe the median RMSEs are uniformly less than 1 and some of the reductions in RMSE are substantial. Moving from short horizon to long horizon changes predictability of the targets, but DR(1) manages to improve the forecast over the PC method in many instances. Finally, as an illustration, we plot the out-of-sample $R^{2}$ for the 6-month-ahead forecast using DR(1) and PC. Notably, macro time series in housing and labor market sectors have higher predictability than in rates and stock market sectors.

Bibliography

Ahn, S. C. and Horenstein, A. R. (2013), ‘Eigenvalue ratio test for the number of factors’, Econometrica 81(3), 1203–1227.
Bai, J. (2003), ‘Inferential theory for factor models of large dimensions’, Econometrica 71(1), 135–171.
Bai, J. and Ng, S. (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221.
Bai, J. and Ng, S. (2008), ‘Forecasting economic time series using targeted predictors’, Journal of Econometrics 146, 304–317.
Bernanke, B., Boivin, J. and Eliasz, P. (2005), ‘Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach’, The Quarterly Journal of Economics 120(1), 387–422.
Billingsley, P. (1999), Convergence of Probability Measures, 2nd edn, John Wiley & Sons.
Cai, T. and Liu, W. (2011), ‘Adaptive thresholding for sparse covariance matrix estimation’, Journal of the American Statistical Association 106(494), 672–684.
