Post-Selection Inference in Three-Dimensional Panel Data

Harold D. Chiang; Joel Rodrigue; Yuya Sasaki

arXiv:1904.00211·econ.EM·May 2, 2019

TL;DR

This paper develops a post-selection inference method for three-dimensional panel data models, improving the accuracy and efficiency of fixed effects estimation after model selection, especially when using lasso techniques.

Contribution

It introduces a novel post-selection inference approach that accommodates many nonzero fixed effects, enhancing model accuracy and efficiency in three-dimensional panel analysis.

Findings

01. More precise than under-fitting fixed effect estimators

02. More efficient than over-fitting fixed effect estimators

03. Achieves inference accuracy comparable to the oracle estimator

Abstract

Three-dimensional panel models are widely used in empirical analysis. Researchers use various combinations of fixed effects for three-dimensional panels. When one imposes a parsimonious model and the true model is rich, then it incurs mis-specification biases. When one employs a rich model and the true model is parsimonious, then it incurs larger standard errors than necessary. It is therefore useful for researchers to know correct models. In this light, Lu, Miao, and Su (2018) propose methods of model selection. We advance this literature by proposing a method of post-selection inference for regression parameters. Despite our use of the lasso technique as means of model selection, our assumptions allow for many and even all fixed effects to be nonzero. Simulation studies demonstrate that the proposed method is more precise than under-fitting fixed effect estimators, is more efficient…

Tables

Table 1: World Import Shares Over Time (%). Notes: The table reports the share of world imports for the 10 largest importers from the WITS database in selected years. ROW represents the combined share of all other countries. $N_{ROW}$ is the number of countries that comprise ROW, and ROW/$N_{ROW}$ is the average import share among the ROW countries.

Country	1990	1995	2000	2005	2010	2015
China	3.08	5.76	7.22	10.74	13.32	15.67
USA	18.47	12.32	12.62	8.79	8.19	8.82
Germany	6.98	9.87	8.18	8.84	7.86	7.69
Japan	8.20	8.95	7.97	6.34	5.46	4.40
Korea	1.86	2.22	2.70	2.86	3.09	3.38
France	6.55	5.54	4.59	4.20	3.49	3.20
Italy	5.36	4.31	3.42	3.31	2.82	2.76
Netherlands	3.93	3.47	3.03	2.97	2.83	2.61
Canada	1.34	4.08	4.47	3.48	2.61	2.60
United Kingdom	4.78	4.66	4.36	3.41	2.57	2.56
ROW	39.47	38.81	41.43	45.07	47.78	46.30
$N_{ROW}$	192	214	224	222	224	227
ROW/$N_{ROW}$	0.21	0.18	0.18	0.20	0.21	0.20

Table 2: World Export Shares Over Time (%). Notes: The table reports the share of world exports for the 10 largest exporters from the WITS database in selected years. ROW represents the combined share of all other countries. $N_{ROW}$ is the number of countries that comprise ROW, and ROW/$N_{ROW}$ is the average export share among the ROW countries.

Country	1990	1995	2000	2005	2010	2015
USA	22.47	14.73	19.25	15.95	12.26	13.79
China	3.34	6.01	6.13	8.33	11.13	12.35
Germany	5.19	9.09	6.64	6.36	5.78	5.45
United Kingdom	5.74	5.35	4.72	4.15	3.37	3.39
Japan	5.72	5.39	4.90	4.30	3.55	3.30
France	6.64	5.56	4.42	4.26	3.63	3.08
Netherlands	4.29	3.93	3.19	3.22	3.21	2.84
Canada	1.13	3.44	3.59	2.85	2.46	2.47
Korea	2.22	2.23	2.01	1.99	2.27	2.39
Switzerland	2.38	1.80	2.31	2.30	2.26	2.29
ROW	40.88	42.47	42.84	46.30	50.07	48.65
$N_{ROW}$	186	209	218	218	219	222
ROW/$N_{ROW}$	0.22	0.20	0.20	0.21	0.23	0.22
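As a quick consistency check on Table 2, the sketch below adds up the 2000 column and recomputes the ROW/$N_{ROW}$ row. It takes the USA entry for 2000 as 19.25 (an assumption about the intended decimal placement); the variable names are illustrative.

```python
# Export shares for the year 2000 column of Table 2 (percent).
# Assumption: the USA entry for 2000 is 19.25.
shares_2000 = {
    "USA": 19.25, "China": 6.13, "Germany": 6.64, "United Kingdom": 4.72,
    "Japan": 4.90, "France": 4.42, "Netherlands": 3.19, "Canada": 3.59,
    "Korea": 2.01, "Switzerland": 2.31, "ROW": 42.84,
}
total_2000 = sum(shares_2000.values())

# ROW share and number of ROW countries by year, copied from Table 2.
row_share = {1990: 40.88, 1995: 42.47, 2000: 42.84, 2005: 46.30, 2010: 50.07, 2015: 48.65}
n_row = {1990: 186, 1995: 209, 2000: 218, 2005: 218, 2010: 219, 2015: 222}
avg_row_share = {year: row_share[year] / n_row[year] for year in row_share}

print(f"2000 column total: {total_2000:.2f}")
for year, avg in avg_row_share.items():
    print(year, f"{avg:.2f}")
```

Under that reading the column total is 100 up to rounding, and the recomputed averages agree with the last row of the table to two decimals.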

Table 3: Monte Carlo simulation results under Model (I) (top panel), Model (II) (middle panel), and Model (III) (bottom panel) with size $N=10$ ($NMT=450$).

$N=10$ ($NMT=450$)	OLS	FE-I	FE-II	FE-III	POST
True Model = (I)		Fixed Effect Estimators
Under-Fitting or Over-Fitting	Under	Correct	Over	Over	Robust
Average	1.466	0.996	0.996	0.996	1.066
Bias	0.466	-0.004	-0.004	-0.004	0.066
Standard Deviation	0.342	0.484	0.486	0.539	0.422
Root Mean Square Error	0.578	0.484	0.486	0.539	0.428
95% Coverage	0.712	0.941	0.938	0.909	0.961
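The rows of Table 3 are linked by the usual decomposition RMSE$^{2}$ = Bias$^{2}$ + SD$^{2}$. The short check below recomputes the RMSE row from the Bias and Standard Deviation rows; small gaps are expected because the inputs are already rounded to three decimals. The helper name `rmse` is illustrative.

```python
import math

# Bias and standard deviation rows of Table 3 (true model (I), N = 10).
bias = {"OLS": 0.466, "FE-I": -0.004, "FE-II": -0.004, "FE-III": -0.004, "POST": 0.066}
sd = {"OLS": 0.342, "FE-I": 0.484, "FE-II": 0.486, "FE-III": 0.539, "POST": 0.422}
reported_rmse = {"OLS": 0.578, "FE-I": 0.484, "FE-II": 0.486, "FE-III": 0.539, "POST": 0.428}

def rmse(b, s):
    """Root mean square error implied by a bias b and a standard deviation s."""
    return math.sqrt(b * b + s * s)

implied_rmse = {name: rmse(bias[name], sd[name]) for name in bias}
for name in bias:
    print(f"{name}: implied {implied_rmse[name]:.3f}, reported {reported_rmse[name]:.3f}")
```

Read this way, OLS has the smallest standard deviation but the largest RMSE because of its bias, while POST has a smaller RMSE than each of the three fixed effect estimators.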

Table 4: Monte Carlo simulation results under Model (I) (top panel), Model (II) (middle panel), and Model (III) (bottom panel) with size $N=15$ ($NMT=1050$).

$N=15$ ($NMT=1050$)	OLS	FE-I	FE-II	FE-III	POST
True Model = (I)		Fixed Effect Estimators
Under-Fitting or Over-Fitting	Under	Correct	Over	Over	Robust
Average	1.385	1.002	1.002	1.000	1.006
Bias	0.385	0.002	0.002	0.000	0.006
Standard Deviation	0.215	0.316	0.317	0.336	0.274
Root Mean Square Error	0.441	0.316	0.317	0.336	0.274
95% Coverage	0.568	0.941	0.940	0.924	0.957

Table 5: Monte Carlo simulation results under Model (I) (top panel), Model (II) (middle panel), and Model (III) (bottom panel) with size $N=20$ ($NMT=1900$).

$N=20$ ($NMT=1900$)	OLS	FE-I	FE-II	FE-III	POST
True Model = (I)		Fixed Effect Estimators
Under-Fitting or Over-Fitting	Under	Correct	Over	Over	Robust
Average	1.213	0.996	0.996	0.996	0.956
Bias	0.213	-0.004	-0.004	-0.004	-0.044
Standard Deviation	0.162	0.228	0.229	0.239	0.199
Root Mean Square Error	0.268	0.228	0.229	0.239	0.204
95% Coverage	0.739	0.948	0.948	0.938	0.955

Equations

$$y_{ijt}=x_{ijt}^{\prime}\beta+\underbrace{\alpha_{i}+\gamma_{j}+\lambda_{t}}_{\text{Fixed Effects}}+\varepsilon_{ijt}$$

$$y_{ijt}=x_{ijt}^{\prime}\beta+\alpha_{i}+\gamma_{j}+\lambda_{t}+\sum_{i^{\prime}=1}^{N}\sum_{t^{\prime}=1}^{T}\alpha_{i^{\prime}t^{\prime}}\mathbbm{1}_{i=i^{\prime}}\mathbbm{1}_{t=t^{\prime}}+\sum_{j^{\prime}=1}^{M}\sum_{t^{\prime}=1}^{T}\gamma_{j^{\prime}t^{\prime}}\mathbbm{1}_{j=j^{\prime}}\mathbbm{1}_{t=t^{\prime}}+\varepsilon_{ijt}$$

$$\left\|\overline{\boldsymbol{\alpha}}\right\|\text{ is bounded and }\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}\left(\mathbf{d}_{1,it}^{\prime}(\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}})\right)^{2}\lesssim\left\|\boldsymbol{\beta}\right\|_{0}+\left\|\overline{\boldsymbol{\alpha}}\right\|_{0}+\left\|\overline{\boldsymbol{\gamma}}\right\|_{0}$$

$$\left\|\overline{\boldsymbol{\gamma}}\right\|\text{ is bounded and }\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}\left(\mathbf{d}_{2,jt}^{\prime}(\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}})\right)^{2}\lesssim\left\|\boldsymbol{\beta}\right\|_{0}+\left\|\overline{\boldsymbol{\alpha}}\right\|_{0}+\left\|\overline{\boldsymbol{\gamma}}\right\|_{0}$$

$$r_{ijt}=\mathbf{d}_{1,it}^{\prime}(\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}})+\mathbf{d}_{2,jt}^{\prime}(\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}})$$

$$\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}r_{ijt}^{2}\lesssim\left\|\boldsymbol{\beta}\right\|_{0}+\left\|\overline{\boldsymbol{\alpha}}\right\|_{0}+\left\|\overline{\boldsymbol{\gamma}}\right\|_{0}.$$

$$Y=X\boldsymbol{\beta}+D_{1}\overline{\boldsymbol{\alpha}}+D_{2}\overline{\boldsymbol{\gamma}}+R+\varepsilon=Z\overline{\boldsymbol{\eta}}+R+\varepsilon,$$

$$\widehat{\boldsymbol{\eta}}\in\arg\min_{\boldsymbol{\eta}\in\mathbb{R}^{p}}\left\|Y-Z\boldsymbol{\eta}\right\|+\mu P(\boldsymbol{\eta}),$$

$$P(\boldsymbol{\eta})=\left\|\widehat{\Upsilon}_{1}\boldsymbol{\beta}\right\|_{1}+\frac{1}{N}\left\|\widehat{\Upsilon}_{2}\boldsymbol{\alpha}\right\|_{1}+\frac{1}{M}\left\|\widehat{\Upsilon}_{3}\boldsymbol{\gamma}\right\|_{1}$$

$$\widehat{\phi}^{\ell}\in\arg\min_{\phi\in\mathbb{R}^{p-1}}\left\|Z^{\ell}-Z^{-\ell}\phi\right\|^{2}+\mu_{\text{node}}^{\ell}\left\|\frac{1}{NM}S_{-\ell}\widehat{\Upsilon}_{\text{node}}^{\ell}\phi\right\|_{1}$$

$$\widehat{\tau}_{\ell}^{2}=\frac{1}{NM}\left\|Z^{\ell}-Z^{-\ell}\widehat{\phi}^{\ell}\right\|^{2}+\frac{\mu_{\text{node}}^{\ell}}{NM}\left\|\frac{1}{NM}S_{-\ell}\widehat{\Upsilon}_{\text{node}}^{\ell}\widehat{\phi}^{\ell}\right\|_{1}$$

$$\widetilde{\boldsymbol{\eta}}_{\ell}=\widehat{\boldsymbol{\eta}}_{\ell}+\frac{1}{NM}\widehat{\Theta}_{\ell}^{\prime}Z^{\prime}(Y-Z\widehat{\boldsymbol{\eta}}),$$

$$\widehat{V}_{\ell\ell}=\widehat{\Theta}_{\ell}^{\prime}\widehat{\Omega}\widehat{\Theta}_{\ell}$$

$$\widehat{\Omega}=\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(\sum_{t=1}^{T}Z_{ijt}\widehat{\varepsilon}_{ijt}\right)\left(\sum_{t=1}^{T}Z_{ijt}\widehat{\varepsilon}_{ijt}\right)^{\prime},$$

$$\widetilde{\boldsymbol{\eta}}=\widehat{\boldsymbol{\eta}}-\frac{\mu}{NM}\widehat{\Theta}P^{\prime}(\widehat{\boldsymbol{\eta}}),$$

$$\bar{\Psi}=\begin{pmatrix}\frac{1}{NM}X^{\prime}X&\frac{1}{MN}X^{\prime}D_{1}&\frac{1}{NM}X^{\prime}D_{2}\\ \frac{1}{MN}D_{1}^{\prime}X&\frac{1}{M}D_{1}^{\prime}D_{1}&\frac{1}{NM}D_{1}^{\prime}D_{2}\\ \frac{1}{NM}D_{2}^{\prime}X&\frac{1}{NM}D_{2}^{\prime}D_{1}&\frac{1}{N}D_{2}^{\prime}D_{2}\end{pmatrix},$$

$$V_{ll}^{-1/2}\widehat{\Theta}_{l}^{\prime}Z^{\prime}\varepsilon/\sqrt{NM}\rightsquigarrow N(0,1).$$

$$\widetilde{\boldsymbol{\eta}}_{l}-\overline{\boldsymbol{\eta}}_{l}=\frac{1}{NM}\widehat{\Theta}_{l}^{\prime}Z^{\prime}\varepsilon+o_{p}(1/\sqrt{NM})$$

$$\sqrt{NM}(\widetilde{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})\rightsquigarrow N(0,V_{ll})$$

$$Z_{l}=Z_{-l}\phi^{l}+r_{l}+\zeta_{l},$$

$$E[Z_{-l}\zeta_{l}]=0$$

$$\varphi_{\min}(A,m)=\inf_{\left\|\xi\right\|=1,\,\left\|\xi\right\|_{0}\leq m}\xi^{\prime}A\xi\quad\text{and}\quad\varphi_{\max}(A,m)=\sup_{\left\|\xi\right\|=1,\,\left\|\xi\right\|_{0}\leq m}\xi^{\prime}A\xi.$$

$$\underline{k}\leq\varphi_{\min}(\bar{\Psi},Cs)\leq\varphi_{\max}(\bar{\Psi},Cs)\leq\overline{k}$$

Full text

Post-Selection Inference in Three-Dimensional Panel Data (first arXiv version: March 30, 2019)

Harold D. Chiang, Joel Rodrigue, Yuya Sasaki

Harold D. Chiang: [email protected]. Department of Economics, Vanderbilt University, VU Station B #351819, 2301 Vanderbilt Place, Nashville, TN 37235-1819, USA

Joel Rodrigue: [email protected]. Department of Economics, Vanderbilt University, VU Station B #351819, 2301 Vanderbilt Place, Nashville, TN 37235-1819, USA

Yuya Sasaki: [email protected]. Department of Economics, Vanderbilt University, VU Station B #351819, 2301 Vanderbilt Place, Nashville, TN 37235-1819, USA

Abstract

Three-dimensional panel models are widely used in empirical analysis. Researchers use various combinations of fixed effects for three-dimensional panels. When one imposes a parsimonious model and the true model is rich, then it incurs mis-specification biases. When one employs a rich model and the true model is parsimonious, then it incurs larger standard errors than necessary. It is therefore useful for researchers to know correct models. In this light, Lu et al. (2018) propose methods of model selection. We advance this literature by proposing a method of post-selection inference for regression parameters. Despite our use of the lasso technique as means of model selection, our assumptions allow for many and even all fixed effects to be nonzero. Simulation studies demonstrate that the proposed method is less biased than under-fitting fixed effect estimators, is more efficient than over-fitting fixed effect estimators, and allows for as accurate inference as the oracle estimator.

Keywords: post-selection inference, three-dimensional panel data.

JEL Code: C23

1 Introduction

Mátyás (1997) suggests the three-dimensional panel model

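$$y_{ijt}=x_{ijt}^{\prime}\beta+\underbrace{\alpha_{i}+\gamma_{j}+\lambda_{t}}_{\text{Fixed Effects}}+\varepsilon_{ijt}\qquad(1.1)$$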

for $(i,j,t)\in\{1,...,N\}\times\{1,...,M\}\times\{0,...,T\}$ , where $y_{ijt}$ denotes an outcome variable of unit $(i,j)$ at time $t$ , ${x}_{ijt}$ denotes $k$ -dimensional explanatory variables of unit $(i,j)$ at time $t$ , and $\alpha_{i}$ , $\gamma_{j}$ , and $\lambda_{t}$ are fixed effects associated with indices $i$ , $j$ , and $t$ , respectively. To fix our ideas, consider the gravity model (Tinbergen, 1962) from the empirical trade literature where $y_{ijt}$ denotes the logarithm of the volume of exports from country $i$ to country $j$ in year $t$ , and the $k$ -dimensional covariates ${x}_{ijt}$ contain observed characteristics of the trade pair $(i,j)$ in year $t$ , including the log GDP of country $i$ in year $t$ ( $GDP_{it}$ ), the log GDP of country $j$ in year $t$ ( $GDP_{jt}$ ), the log distance between countries $i$ and $j$ ( $DIST_{ij}$ ), and the dummy variable of a bilateral trade agreement between countries $i$ and $j$ ( $TA_{ij}$ ), among others. The fixed effects $\alpha_{i}$ , $\gamma_{j}$ , and $\lambda_{t}$ represent the unobserved exporting country effects, destination country effects, and year effects, respectively. Researchers are often interested in the coefficient of $DIST_{ij}$ interpreted as the trade elasticity or the trade cost. Another important parameter of empirical interest is the coefficient of $TA_{ij}$ interpreted as the effect of bilateral trade agreements on trade volumes. See Head and Mayer (2014) for a comprehensive review of gravity models.

To date, variants of the three-dimensional panel model (1.1) have been extensively used in empirical analysis of international trade (see Baltagi et al. (2017) for a survey), housing (see Baltagi and Bresson (2017) for a survey), migration (see Ramos (2017) for a survey), and consumer prices. In these analyses, researchers employ various combinations of fixed effects, including (I) $\alpha_{i}+\gamma_{j}$, (II) $\alpha_{i}+\gamma_{j}+\lambda_{t}$, and (III) $\alpha_{it}+\gamma_{jt}$, among others.

Footnote 1: Parameters $\beta$ of certain types of controls are not identified under more general combinations of fixed effects. For example, the coefficients of $GDP_{it}$ and $GDP_{jt}$ are not identified under the fixed effect model (III) due to collinearity. However, the coefficients of $DIST_{ij}$ and $TA_{ij}$ would be identifiable under any of the three models. In empirical analysis of bilateral trade flows, the latter two coefficients are of more common interest. In fact, substituting fixed effects (such as $\alpha_{it}$ and $\gamma_{jt}$) for observed proxies (such as $GDP_{it}$ and $GDP_{jt}$) is “now common practice and recommended by major empirical trade economists” (Head and Mayer, 2014).

See Balazsi et al. (2017, Tables 1.1–1.3) for a comprehensive list of empirical papers and their specifications of the combinations of fixed effects. Researchers in general do not know which combination of fixed effects correctly specifies the model of their interest. If the true model is parsimonious and a researcher erroneously assumes a rich specification, then naïve fixed effect estimators generally entail exacerbated variances. On the other hand, if the true model is rich and a researcher erroneously assumes a parsimonious specification, then naïve fixed effect estimators generally entail mis-specification biases. The lack of knowledge of the true model specification therefore leads to undesired econometric results in any event.

A recent paper by Lu et al. (2018) develops a method of model selection. Their method serves as a useful guideline for empirical researchers to choose a correct combination of fixed effects in three-dimensional panel models. When a researcher uses a selected model to compute estimates of $\beta$ and their standard errors, it is also important that she takes into account the statistical effects of the model selection. To our knowledge, the existing literature does not provide a method of post-selection inference for three-way panel models. In this light, we extend the frontier of this existing econometric literature (Lu et al., 2018) by providing a method of inference for $\beta$ accounting for the effect of the model selection. We make use of the lasso technique along with de-biasing to this end, but our method does not require exactly sparse fixed effects. In other words, our assumptions do allow for many and even all of the fixed effects to be nonzero in a general combination of fixed effects.

Related Literature: A three-dimensional panel model was suggested by Mátyás (1997). The literature on multi-dimensional panels is extensive today, and is surveyed in the book of article collections edited by Mátyás (2017). Its chapter written by Balazsi et al. (2017) provides a comprehensive list of empirical research papers employing multi-dimensional panel data.

Methods of model selection in three-dimensional panels are developed by Lu et al. (2018), and this paper was motivated by Lu et al. (2018). As stated earlier, we aim to extend this frontier of the literature by developing a post-selection inference for the regression parameters.

We use the lasso technique for model selection and post-selection inference, but our assumptions do allow for all fixed effects to be nonzero. This is because we rely on the approximate sparsity condition as opposed to the conventional sparsity. Post-selection inference via lasso is studied by an extensive body of the literature in various contexts. This literature includes, but is not limited to, Belloni et al. (2012) for IV models, and Belloni et al. (2014), Javanmard and Montanari (2014), Van de Geer et al. (2014), and Zhang and Zhang (2014) for linear regression models.

Lasso estimation for panel models is suggested by Koenker (2004), Lamarche (2010), Kock (2013), Caner and Han (2014), Lu and Su (2016), Li et al. (2016), Qian and Su (2016), Caner et al. (2018), Harding and Lamarche (2019), among others. Classification and estimation by lasso for panel models are proposed by Su et al. (2016); also see Lu and Su (2017), Su and Ju (2018), and Su et al. (2017). For post-selection inference with panel data using lasso, Belloni et al. (2016) work with de-meaned fixed effect models with high-dimensional controls using a post-double-selection estimator. Kock (2016) and Kock and Tang (2019) work with correlated random effect panel models and dynamic panel models with sparse fixed effects via de-biased lasso, respectively. We extend this frontier of the literature to three-dimensional panels. Besides the different framework of three-dimensional panels as opposed to two-dimensional ones, this paper differs from Kock (2016) and Kock and Tang (2019) in the following four technical points. First, we extend the theory of nodewise lasso by allowing for different convergence rates to incorporate a larger class of fixed effect models. Second, we use a different proof strategy with the sparsity requirement of $ss_{l}(\log(p\vee(NM)))^{2}/(N\wedge M)=o(1)$ inspired by Belloni et al. (2012, Lemma 8), whereas an adaptation of the proof strategies of Kock (2016, Assumption A3 (b)) and Kock and Tang (2019, Assumption 5 (c)) to our framework would require $ss^{2}_{l}(\log(p\vee(NM)))^{2}/(N\wedge M)=o(1)$. This feature further extends the class of models that can be handled under our framework. Third, the sub-gaussianity assumption of covariates, which is assumed by the majority of papers in the de-biased lasso literature, is not required. Fourth, we allow for non-sparse coefficients based on the notion of approximate sparsity following that of Belloni et al. (2012) instead of the $L^{v}$ sparsity for $0<v<1$ as in Kock and Tang (2019).

With all these technical relations to the existing literature, we once again emphasize that our main contribution is the robust inference method for three-dimensional panels. Unlike two-dimensional panels, there are a number of alternative combinations of fixed effect specifications in three-dimensional panels, and hence model selection is more important in these models (Lu et al., 2018). We apply and extend state-of-the-art technology (e.g., Belloni et al., 2012; Kock, 2016; Kock and Tang, 2019) to this three-dimensional panel framework which concerns many empirical researchers.

Organization: The rest of this paper is organized as follows. We introduce the model framework in Section 2. An overview of our proposed method is presented in Section 3. The main theoretical result is presented in Section 4, followed by sufficient conditions discussed in Section 5. We discuss the key assumption in the context of gravity analysis of international trade in Section 6. We conduct simulation studies in Section 7. Section 8 concludes the paper.

2 The Model Framework

Consider the following representation of a general class of three-dimensional panel models with large $N$ and large $M$ .

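$$y_{ijt}=x_{ijt}^{\prime}\beta+\alpha_{i}+\gamma_{j}+\lambda_{t}+\sum_{i^{\prime}=1}^{N}\sum_{t^{\prime}=1}^{T}\alpha_{i^{\prime}t^{\prime}}\mathbbm{1}_{i=i^{\prime}}\mathbbm{1}_{t=t^{\prime}}+\sum_{j^{\prime}=1}^{M}\sum_{t^{\prime}=1}^{T}\gamma_{j^{\prime}t^{\prime}}\mathbbm{1}_{j=j^{\prime}}\mathbbm{1}_{t=t^{\prime}}+\varepsilon_{ijt}\qquad(2.1)$$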

This representation consists of a $k$ -dimensional parameter vector $\beta$ , $N$ -dimensional parameter vector $\alpha_{[N]}=(\alpha_{1},...,\alpha_{N})^{\prime}$ , $M$ -dimensional parameter vector $\gamma_{[M]}=(\gamma_{1},...,\gamma_{M})^{\prime}$ , $T$ -dimensional parameter vector $\lambda_{[T]}=(\lambda_{1},...,\lambda_{T})^{\prime}$ , $NT$ -dimensional parameter vector $\alpha_{[NT]}=(\alpha_{11},...,\alpha_{NT})^{\prime}$ , and $MT$ -dimensional parameter vector $\gamma_{[MT]}=(\gamma_{11},...,\gamma_{MT})^{\prime}$ . In total, there are $k+N+M+T+NT+MT$ parameters involved in this representation (2.1).

Recall that conventional fixed effect models include

(I) $\alpha_{i}+\gamma_{j}$,

(II) $\alpha_{i}+\gamma_{j}+\lambda_{t}$, and

(III) $\alpha_{it}+\gamma_{jt}$,

among others. Model (I) entails $k+N+M$ possibly nonzero parameters $(\beta^{\prime},\alpha_{[N]}^{\prime},\gamma_{[M]}^{\prime})^{\prime}$, while the remaining $T+NT+MT$ parameters $(\lambda_{[T]}^{\prime},\alpha_{[NT]}^{\prime},\gamma_{[MT]}^{\prime})^{\prime}$ are all zero. Similarly, Model (II) entails $k+N+M+T$ possibly nonzero parameters $(\beta^{\prime},\alpha_{[N]}^{\prime},\gamma_{[M]}^{\prime},\lambda_{[T]}^{\prime})^{\prime}$, while the remaining $NT+MT$ parameters $(\alpha_{[NT]}^{\prime},\gamma_{[MT]}^{\prime})^{\prime}$ are all zero. Likewise, Model (III) entails $k+NT+MT$ possibly nonzero parameters $(\beta^{\prime},\alpha_{[NT]}^{\prime},\gamma_{[MT]}^{\prime})^{\prime}$, while the remaining $N+M+T$ parameters $(\alpha_{[N]}^{\prime},\gamma_{[M]}^{\prime},\lambda_{[T]}^{\prime})^{\prime}$ are all zero. Furthermore, the representation (2.1) includes many combinations other than these three models.
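The parameter counts above can be tabulated directly. The following sketch uses hypothetical dimensions ($k=1$, $N=M=10$, $T=5$) and an illustrative function name; only the counting formulas come from the text.

```python
def parameter_counts(k, N, M, T):
    """Return {model: (possibly nonzero, restricted to zero)} within representation (2.1)."""
    total = k + N + M + T + N * T + M * T
    nonzero = {
        "I": k + N + M,            # beta, alpha_[N], gamma_[M]
        "II": k + N + M + T,       # ... plus lambda_[T]
        "III": k + N * T + M * T,  # beta, alpha_[NT], gamma_[MT]
    }
    return {model: (n, total - n) for model, n in nonzero.items()}

# Hypothetical dimensions, chosen only for illustration.
counts = parameter_counts(k=1, N=10, M=10, T=5)
for model, (free, zero) in counts.items():
    print(f"Model ({model}): {free} possibly nonzero, {zero} zero")
```

With these dimensions, Model (I) leaves most of the 126 parameters in the general representation at zero, which is the redundancy that model selection is meant to remove.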

When Model (I) is true for example, then the representation (2.1) has $T+NT+MT$ redundant parameters and hence estimating the model (2.1) generally yields much larger standard errors for the parameters $\beta$ of interest than necessary. This motivates the need of model selection. We propose to use the lasso to select such redundant fixed effect parameters out of the representation (2.1), and then conduct inference robustly accounting for the statistical effects of the model selection.

For ease of conducting econometric analysis, we further rewrite the representation (2.1) as

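$$y_{ijt}=\mathbf{x}_{ijt}^{\prime}\boldsymbol{\beta}+\mathbf{d}_{1,it}^{\prime}\boldsymbol{\alpha}+\mathbf{d}_{2,jt}^{\prime}\boldsymbol{\gamma}+\varepsilon_{ijt}\qquad(2.2)$$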

where $\mathbf{x}_{ijt}=({x}_{ijt}^{\prime},\mathbbm{1}_{t=1},...,\mathbbm{1}_{t=T})^{\prime}$ and $\boldsymbol{\beta}=(\beta^{\prime},\lambda_{1},...,\lambda_{T})^{\prime}$ are of dimension $k_{0}=k+T$, $\mathbf{d}_{1,it}=(\mathbbm{1}_{i=1},...,\mathbbm{1}_{i=N},\mathbbm{1}_{i=1}\mathbbm{1}_{t=1},...,\mathbbm{1}_{i=N}\mathbbm{1}_{t=T})^{\prime}$ and $\boldsymbol{\alpha}=(\alpha_{[N]}^{\prime},\alpha_{[NT]}^{\prime})^{\prime}$ are of dimension $N_{0}=N+NT$, and $\mathbf{d}_{2,jt}=(\mathbbm{1}_{j=1},...,\mathbbm{1}_{j=M},\mathbbm{1}_{j=1}\mathbbm{1}_{t=1},...,\mathbbm{1}_{j=M}\mathbbm{1}_{t=T})^{\prime}$ and $\boldsymbol{\gamma}=(\gamma_{[M]}^{\prime},\gamma_{[MT]}^{\prime})^{\prime}$ are of dimension $M_{0}=M+MT$.
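To make the dummy-vector notation concrete, here is a minimal sketch that builds $\mathbf{x}_{ijt}$, $\mathbf{d}_{1,it}$, and $\mathbf{d}_{2,jt}$ for a single observation. The function names, the toy dimensions, and the ordering of the unit-by-time interactions (unit varying slowest) are assumptions made for illustration.

```python
def indicator_block(index, size):
    """Return (1{index=1}, ..., 1{index=size}) as a list of 0/1 entries."""
    return [1 if index == a else 0 for a in range(1, size + 1)]

def augmented_x(x, t, T):
    """Bold x_{ijt}: the k regressors followed by T time dummies (length k + T)."""
    return list(x) + indicator_block(t, T)

def unit_time_dummies(unit, t, n_units, T):
    """d_{1,it} (or d_{2,jt}): unit dummies, then unit-by-time dummies.

    Assumed ordering of the interactions: unit varies slowest, t fastest.
    """
    interactions = [1 if (unit == a and t == b) else 0
                    for a in range(1, n_units + 1) for b in range(1, T + 1)]
    return indicator_block(unit, n_units) + interactions

N, M, T = 4, 3, 2                                   # hypothetical panel dimensions
x_bold = augmented_x([0.7], t=2, T=T)               # k = 1 regressor
d1 = unit_time_dummies(unit=3, t=2, n_units=N, T=T) # exporter i = 3
d2 = unit_time_dummies(unit=1, t=2, n_units=M, T=T) # importer j = 1
print(len(x_bold), len(d1), len(d2))                # k0, N0, M0
```

Each dummy vector has exactly two ones, so $\mathbf{d}_{1,it}^{\prime}\boldsymbol{\alpha}$ picks out $\alpha_{i}+\alpha_{it}$ and $\mathbf{d}_{2,jt}^{\prime}\boldsymbol{\gamma}$ picks out $\gamma_{j}+\gamma_{jt}$.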

Suppose that we can decompose the fixed effects $\boldsymbol{\alpha}$ into $\overline{\boldsymbol{\alpha}}$ and $\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}}$ and decompose the fixed effects $\boldsymbol{\gamma}$ into $\overline{\boldsymbol{\gamma}}$ and $\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}}$ such that

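$$\left\|\overline{\boldsymbol{\alpha}}\right\|\text{ is bounded and }\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}\left(\mathbf{d}_{1,it}^{\prime}(\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}})\right)^{2}\lesssim\left\|\boldsymbol{\beta}\right\|_{0}+\left\|\overline{\boldsymbol{\alpha}}\right\|_{0}+\left\|\overline{\boldsymbol{\gamma}}\right\|_{0}$$

and

$$\left\|\overline{\boldsymbol{\gamma}}\right\|\text{ is bounded and }\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}\left(\mathbf{d}_{2,jt}^{\prime}(\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}})\right)^{2}\lesssim\left\|\boldsymbol{\beta}\right\|_{0}+\left\|\overline{\boldsymbol{\alpha}}\right\|_{0}+\left\|\overline{\boldsymbol{\gamma}}\right\|_{0}$$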

hold, where $\left\|\cdot\right\|_{0}$ denotes the support cardinality (the $L^{0}$ norm). With this said, we emphasize that this decomposition is merely theoretical, and a researcher need not implement such a decomposition in practice. Precise requirements for the decomposition are stated in Assumptions 2 and 5 (4) ahead, followed by discussions in the context of our motivating application (1.1) in Remark 2. In Section 6, we use world trade data to argue that these assumptions are plausible in the application (1.1).

Such a decomposition is constructed for example by setting $\overline{\boldsymbol{\alpha}}_{\ell}$ equal to $\boldsymbol{\alpha}_{\ell}$ for those coordinates $\ell$ for which $\left|\boldsymbol{\alpha}_{\ell}\right|$ is large and setting $\overline{\boldsymbol{\alpha}}_{\ell}$ equal to zero for those coordinates $\ell$ for which $\left|\boldsymbol{\alpha}_{\ell}\right|$ is small, and similarly for $\boldsymbol{\gamma}$ . Consequently, we can further rewrite the representation (2.2) as

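$$y_{ijt}=\mathbf{x}_{ijt}^{\prime}\boldsymbol{\beta}+\mathbf{d}_{1,it}^{\prime}\overline{\boldsymbol{\alpha}}+\mathbf{d}_{2,jt}^{\prime}\overline{\boldsymbol{\gamma}}+r_{ijt}+\varepsilon_{ijt}$$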

where $r_{ijt}$ is the approximation error defined by

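$$r_{ijt}=\mathbf{d}_{1,it}^{\prime}(\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}})+\mathbf{d}_{2,jt}^{\prime}(\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}})$$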

and it satisfies

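$$\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}r_{ijt}^{2}\lesssim\left\|\boldsymbol{\beta}\right\|_{0}+\left\|\overline{\boldsymbol{\alpha}}\right\|_{0}+\left\|\overline{\boldsymbol{\gamma}}\right\|_{0}.$$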
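The decomposition can be illustrated with a small numerical sketch. It assumes, purely for simplicity, that only time-invariant exporter and importer effects are present, and it uses a hypothetical cutoff of 0.1 to decide which effects count as large; neither the cutoff nor the toy values come from the paper.

```python
def threshold_split(effects, cutoff):
    """Split effects into a sparse part (large entries kept) and a small remainder."""
    kept = [a if abs(a) > cutoff else 0.0 for a in effects]
    remainder = [a - b for a, b in zip(effects, kept)]
    return kept, remainder

def l0_norm(v):
    """Support cardinality: the number of nonzero entries."""
    return sum(1 for a in v if a != 0.0)

# Toy fixed effects (hypothetical values): all are nonzero, but only some are large.
alpha = [1.5, 0.02, -0.9]     # exporter effects alpha_i
gamma = [0.01, 2.0, -0.03]    # importer effects gamma_j
T, cutoff = 2, 0.1

alpha_bar, alpha_rem = threshold_split(alpha, cutoff)
gamma_bar, gamma_rem = threshold_split(gamma, cutoff)

# r_ijt = (alpha_i - alpha_bar_i) + (gamma_j - gamma_bar_j), constant in t here.
sum_r2 = sum((a + g) ** 2 for a in alpha_rem for g in gamma_rem for _ in range(T))
sparsity = l0_norm(alpha_bar) + l0_norm(gamma_bar)
print(f"sum of squared approximation errors: {sum_r2:.4f}, support size: {sparsity}")
```

Every fixed effect in this toy example is nonzero, yet the sum of squared approximation errors is far smaller than the number of retained effects. This is the sense in which approximate sparsity can hold without exact sparsity.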

Stacking the three-dimensional panel data across the $NMT$ observations, we in turn construct the matrix representation

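$$Y=X\boldsymbol{\beta}+D_{1}\overline{\boldsymbol{\alpha}}+D_{2}\overline{\boldsymbol{\gamma}}+R+\varepsilon=Z\overline{\boldsymbol{\eta}}+R+\varepsilon,$$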

where $Y=(y_{111},...,y_{NMT})^{\prime}$ , $R=(r_{111},...,r_{NMT})^{\prime}$ , and $\varepsilon=(\varepsilon_{111},...,\varepsilon_{NMT})^{\prime}$ , are vectors of dimension $NMT$ , $X=(\mathbf{x}_{111},...,\mathbf{x}_{NMT})^{\prime}$ is a matrix of size $NMT\times k_{0}$ , $D_{1}=(\mathbf{d}_{1,11},...,\mathbf{d}_{1,NT})^{\prime}$ is a matrix of size $NMT\times N_{0}$ , $D_{2}=(\mathbf{d}_{2,11},...,\mathbf{d}_{2,MT})^{\prime}$ is a matrix of size $NMT\times M_{0}$ , $Z=[X\ D_{1}\ D_{2}]$ , and $\overline{\boldsymbol{\eta}}=[\boldsymbol{\beta}^{\prime}\ \overline{\boldsymbol{\alpha}}^{\prime}\ \overline{\boldsymbol{\gamma}}^{\prime}]^{\prime}$ is a vector of dimension $p=k_{0}+N_{0}+M_{0}$ .

If the true model is parsimonious, like Model (I), then a large number of the elements of the high-dimensional parameters, $\boldsymbol{\alpha}$ and $\boldsymbol{\gamma}$ , will be zero. Thus, a large number of the elements of $\overline{\boldsymbol{\alpha}}$ and $\overline{\boldsymbol{\gamma}}$ will be zero. Furthermore, for those coordinates of $\boldsymbol{\alpha}$ and $\boldsymbol{\gamma}$ that are small in absolute value, the corresponding coordinates of $\overline{\boldsymbol{\alpha}}$ and $\overline{\boldsymbol{\gamma}}$ are set to zero in the decomposition in light of the relatively smaller approximation errors caused by setting them to zero. We propose to use the lasso technique to select such redundant parameters in $\overline{\boldsymbol{\alpha}}$ and $\overline{\boldsymbol{\gamma}}$ out of this high-dimensional model as means of model selection for the purpose of obtaining smaller standard errors. Furthermore, accounting for the statistical effects of this model selection, we then conduct robust inference for the main parameters $\boldsymbol{\beta}$ in the panel model. Section 3 illustrates an overview of our proposed method. A formal theoretical analysis will then follow in Sections 4 and 5.

3 Overview of the Method

Our proposed method consists of four steps. The first step is a lasso estimation of the parameter vector $\boldsymbol{\eta}$ entailing a model selection. The second step is an auxiliary step to calculate an approximate inverse of the Gram matrix to be used in the subsequent two steps. The third step de-biases the regularized lasso estimate from the first step. The fourth step is a calculation of the asymptotic variance of each coordinate of the de-biased lasso estimator of $\boldsymbol{\beta}$ .

Step 1: For the representing equation (2.5), define the lasso estimator

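$$\widehat{\boldsymbol{\eta}}\in\arg\min_{\boldsymbol{\eta}\in\mathbb{R}^{p}}\left\|Y-Z\boldsymbol{\eta}\right\|+\mu P(\boldsymbol{\eta}),$$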

where $\mu\in[0,\infty)$ is a regularization tuning parameter and the penalty function $P$ is defined by

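$$P(\boldsymbol{\eta})=\left\|\widehat{\Upsilon}_{1}\boldsymbol{\beta}\right\|_{1}+\frac{1}{N}\left\|\widehat{\Upsilon}_{2}\boldsymbol{\alpha}\right\|_{1}+\frac{1}{M}\left\|\widehat{\Upsilon}_{3}\boldsymbol{\gamma}\right\|_{1}$$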

for some diagonal normalization matrix $\widehat{\Upsilon}_{\ell}$ for each $\ell\in\{1,2,3\}$ (see Remark 6 in Appendix B.1). In practice, the regularization tuning parameter $\mu$ can be chosen by cross-validation using software packages.
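A minimal sketch of a penalized fit in the spirit of Step 1 follows. It assumes a squared-error loss $\|Y-Z\eta\|^{2}$ (the exponent and scaling of the loss are assumptions) and collapses the loadings in $\widehat{\Upsilon}_{\ell}$ together with the $1/N$ and $1/M$ factors into one weight per coordinate. Coordinate descent is used as one standard solver; the text does not name a solver, and the function names and toy data are illustrative.

```python
def soft_threshold(x, c):
    """Shrink x toward zero by c, returning 0 when |x| <= c."""
    if x > c:
        return x - c
    if x < -c:
        return x + c
    return 0.0

def weighted_lasso(Z, y, mu, weights, sweeps=200):
    """Coordinate descent for ||y - Z eta||^2 + mu * sum_j weights[j] * |eta_j|."""
    n, p = len(Z), len(Z[0])
    eta = [0.0] * p
    resid = list(y)  # equals y - Z eta at the starting value eta = 0
    col_ss = [sum(Z[r][j] ** 2 for r in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes coordinate j.
            rho = sum(Z[r][j] * resid[r] for r in range(n)) + col_ss[j] * eta[j]
            new = soft_threshold(rho, 0.5 * mu * weights[j]) / col_ss[j]
            delta = new - eta[j]
            if delta != 0.0:
                for r in range(n):
                    resid[r] -= Z[r][j] * delta
                eta[j] = new
    return eta

# Toy design with two orthogonal dummy columns (hypothetical data).
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
eta_hat = weighted_lasso(Z, y, mu=1.0, weights=[1.0, 1.0])
print(eta_hat)
```

In this toy design the small effect is set exactly to zero while the large one is shrunk toward zero. The first outcome is the model-selection mechanism; the second is the regularization bias that the later de-biasing step addresses.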

Step 2: The next step is an auxiliary process to obtain a $p\times p$ matrix $\widehat{\Theta}$ of approximate inverse of the Gram matrix to be used in Step 3. We define the nodewise lasso estimator

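$$\widehat{\phi}^{\ell}\in\arg\min_{\phi\in\mathbb{R}^{p-1}}\left\|Z^{\ell}-Z^{-\ell}\phi\right\|^{2}+\mu_{\text{node}}^{\ell}\left\|\frac{1}{NM}S_{-\ell}\widehat{\Upsilon}_{\text{node}}^{\ell}\phi\right\|_{1}$$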

of the $\ell$ -th column $Z^{\ell}$ on all the other $(p-1)$ columns $Z^{-\ell}$ for each $\ell\in\{1,...,p\}$ , where $\mu_{\text{node}}^{\ell}\in[0,\infty)$ is a regularization tuning parameter, $\widehat{\Upsilon}_{\text{node}}^{\ell}$ is some diagonal normalization matrix for each $\ell\in\{1,...,p\}$ , and $S_{-\ell}$ is the $(p-1)\times(p-1)$ matrix obtained by removing the $\ell$ -th row and the $\ell$ -th column of

[TABLE]

In practice, the regularization tuning parameter $\mu_{\text{node}}^{\ell}$ can be chosen using a cross validation via software packages.

Once the nodewise lasso estimates $\widehat{\phi}^{\ell}$ are obtained, a $p\times p$ matrix $\widehat{\Theta}$ approximating the inverse Gram matrix can be constructed by

[TABLE]

with $\widehat{\tau}_{\ell}$ given by

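$$\widehat{\tau}_{\ell}^{2}=\frac{1}{NM}\left\|Z^{\ell}-Z^{-\ell}\widehat{\phi}^{\ell}\right\|^{2}+\frac{\mu_{\text{node}}^{\ell}}{NM}\left\|\frac{1}{NM}S_{-\ell}\widehat{\Upsilon}_{\text{node}}^{\ell}\widehat{\phi}^{\ell}\right\|_{1}$$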

for each $\ell\in\{1,...,p\}$ and $\widehat{\phi}^{\ell}_{l}$ denoting the $l$ -th coordinate of the nodewise lasso estimate $\widehat{\phi}^{\ell}$ for each $\ell\in\{1,...,p\}$ and $l\in\{1,...,p-1\}$ .
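The nodewise step can be sketched as follows. Because the display that assembles $\widehat{\Theta}$ is not reproduced here, the sketch substitutes the standard nodewise construction of Van de Geer et al. (2014): row $\ell$ has a one in position $\ell$ and $-\widehat{\phi}^{\ell}$ elsewhere, divided by $\widehat{\tau}_{\ell}^{2}$. It omits the scaling $S_{-\ell}$ and the loadings $\widehat{\Upsilon}_{\text{node}}^{\ell}$, uses the normalization $(1/n)\|\cdot\|^{2}+2\lambda\|\cdot\|_{1}$, and all names are illustrative, so it may differ from the paper's construction in normalization.

```python
def soft_threshold(x, c):
    return x - c if x > c else x + c if x < -c else 0.0

def lasso(Z, y, lam, sweeps=500):
    """Coordinate descent for (1/n)||y - Z phi||^2 + 2*lam*||phi||_1 (nonzero columns assumed)."""
    n, p = len(Z), len(Z[0])
    phi, resid = [0.0] * p, list(y)
    col_ss = [sum(row[j] ** 2 for row in Z) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            rho = sum(Z[r][j] * resid[r] for r in range(n)) + col_ss[j] * phi[j]
            new = soft_threshold(rho, n * lam) / col_ss[j]
            for r in range(n):
                resid[r] -= Z[r][j] * (new - phi[j])
            phi[j] = new
    return phi, resid

def nodewise_theta(Z, lam):
    """Approximate inverse of the Gram matrix Z'Z/n, built one row per nodewise regression."""
    n, p = len(Z), len(Z[0])
    theta = []
    for l in range(p):
        target = [row[l] for row in Z]
        others = [[row[m] for m in range(p) if m != l] for row in Z]
        phi, resid = lasso(others, target, lam)
        tau2 = sum(e * e for e in resid) / n + lam * sum(abs(v) for v in phi)
        row = [-v for v in phi]
        row.insert(l, 1.0)  # one in position l, minus the nodewise coefficients elsewhere
        theta.append([c / tau2 for c in row])
    return theta

# Toy design (hypothetical): an intercept column and a trend column.
Z = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
theta_hat = nodewise_theta(Z, lam=0.0)
print(theta_hat)
```

With the penalty switched off, the construction returns the exact inverse of $Z^{\prime}Z/n$ in this toy design. With a positive penalty it returns a regularized approximation, which is what makes the approach usable when $p$ is large.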

Step 3: The shrinkage by the regularization $\mu P(\boldsymbol{\eta})$ forces a sub-vector of the lasso estimates $\widehat{\boldsymbol{\eta}}$ to be zero, and this mechanism serves as a means of model selection. Since this regularization biases the lasso estimator $\widehat{\boldsymbol{\eta}}$ from Step 1, we further ‘de-bias’ it according to

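$$\widetilde{\boldsymbol{\eta}}_{\ell}=\widehat{\boldsymbol{\eta}}_{\ell}+\frac{1}{NM}\widehat{\Theta}_{\ell}^{\prime}Z^{\prime}(Y-Z\widehat{\boldsymbol{\eta}}),$$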

for each $\ell\in[p]$, where $\widehat{\Theta}_{\ell}$ is the $\ell$-th column of $\widehat{\Theta}$ and $\widehat{\Theta}$ is the $p\times p$ approximate inverse Gram matrix constructed in Step 2. The sub-vectors of $\widetilde{\boldsymbol{\eta}}$ will be denoted by $\widetilde{\boldsymbol{\eta}}=\left(\widetilde{\boldsymbol{\beta}}^{\prime},\widetilde{\boldsymbol{\alpha}}^{\prime},\widetilde{\boldsymbol{\gamma}}^{\prime}\right)^{\prime}$.
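The de-biasing update is a one-line correction once $\widehat{\Theta}$ is available. The sketch below applies it coordinate by coordinate; the function name, the argument `n_pairs` (standing in for $NM$), and the toy data are illustrative assumptions.

```python
def debias(eta_hat, Theta, Z, y, n_pairs):
    """eta_tilde_l = eta_hat_l + Theta_l' Z'(y - Z eta_hat) / n_pairs for every l."""
    p = len(eta_hat)
    resid = [yi - sum(z * e for z, e in zip(row, eta_hat)) for row, yi in zip(Z, y)]
    # score = Z'(y - Z eta_hat)
    score = [sum(row[j] * r for row, r in zip(Z, resid)) for j in range(p)]
    # Theta[m][l] is row m, column l; Theta_l denotes the l-th column.
    return [eta_hat[l] + sum(Theta[m][l] * score[m] for m in range(p)) / n_pairs
            for l in range(p)]

# Toy data (hypothetical): 4 pairs observed once each, intercept and trend columns.
Z = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.1, 1.9, 3.2, 3.8]
Theta = [[6.0, -2.0], [-2.0, 0.8]]   # exact inverse of Z'Z/4 for this design
eta_hat = [0.0, 0.5]                 # a deliberately shrunken initial estimate
eta_tilde = debias(eta_hat, Theta, Z, y, n_pairs=4)
print(eta_tilde)
```

When $\widehat{\Theta}$ is the exact inverse of the Gram matrix, as in this toy case, the update maps any initial estimate to the least squares solution. With an approximate inverse, the correction removes the leading part of the shrinkage bias instead.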

Step 4: The asymptotic variance of $\sqrt{NM}\left(\widetilde{\boldsymbol{\beta}}_{\ell}-\boldsymbol{\beta}_{\ell}\right)$ for $\ell\in\{1,...,k_{0}\}$ is approximated by

$$\widehat{V}_{\ell\ell}=\widehat{\Theta}_{\ell}^{\prime}\widehat{\Omega}\widehat{\Theta}_{\ell}$$

where $\widehat{\Theta}_{\ell}$ is defined in Step 3,

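$$\widehat{\Omega}=\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(\sum_{t=1}^{T}Z_{ijt}\widehat{\varepsilon}_{ijt}\right)\left(\sum_{t=1}^{T}Z_{ijt}\widehat{\varepsilon}_{ijt}\right)^{\prime},$$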

and $\widehat{\varepsilon}_{ijt}$ is the residual from the lasso in Step 1.
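Step 4 amounts to a variance estimate clustered at the pair $(i,j)$ level: scores are summed over $t$ within each pair before the outer product is taken. A minimal sketch follows, with illustrative function names and toy inputs. The 95% interval uses the normal critical value 1.96 together with the $\sqrt{NM}$ scaling of the limit theory, and taking the number of observed pairs as $NM$ is an assumption.

```python
import math

def cluster_omega(Z, resid, pair_ids):
    """Omega_hat: average over pairs of (sum_t Z_ijt e_ijt)(sum_t Z_ijt e_ijt)'."""
    p = len(Z[0])
    scores = {}
    for row, e, pair in zip(Z, resid, pair_ids):
        s = scores.setdefault(pair, [0.0] * p)
        for j in range(p):
            s[j] += row[j] * e
    n_pairs = len(scores)
    omega = [[sum(s[a] * s[b] for s in scores.values()) / n_pairs for b in range(p)]
             for a in range(p)]
    return omega, n_pairs

def variance_and_ci(beta_tilde_l, theta_col, omega, n_pairs, z=1.96):
    """V_hat_ll = Theta_l' Omega_hat Theta_l and the interval beta_tilde_l +/- z*sqrt(V_hat_ll/NM)."""
    p = len(theta_col)
    v = sum(theta_col[a] * omega[a][b] * theta_col[b] for a in range(p) for b in range(p))
    half = z * math.sqrt(v / n_pairs)
    return v, (beta_tilde_l - half, beta_tilde_l + half)

# Toy inputs (hypothetical): one regressor, two pairs, two periods each.
Z = [[1.0], [1.0], [1.0], [1.0]]
resid = [1.0, -1.0, 2.0, 2.0]
pair_ids = [(1, 2), (1, 2), (2, 1), (2, 1)]
omega_hat, n_pairs = cluster_omega(Z, resid, pair_ids)
v_hat, ci = variance_and_ci(1.0, [0.5], omega_hat, n_pairs)
print(omega_hat, v_hat, ci)
```

Summing within a pair before squaring lets residuals that offset each other over time (the first pair) contribute nothing, while persistent residuals (the second pair) contribute more than they would under an independence assumption.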

4 The Main Theory

Define the de-biased lasso estimator by

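$$\widetilde{\boldsymbol{\eta}}=\widehat{\boldsymbol{\eta}}-\frac{\mu}{NM}\widehat{\Theta}P^{\prime}(\widehat{\boldsymbol{\eta}}),$$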

where $P^{\prime}$ denotes the sub-gradient of $P$. Recall that the sub-vectors of $\widetilde{\boldsymbol{\eta}}$ are denoted by $\widetilde{\boldsymbol{\eta}}=[\widetilde{\boldsymbol{\beta}}^{\prime},\widetilde{\boldsymbol{\alpha}}^{\prime},\widetilde{\boldsymbol{\gamma}}^{\prime}]^{\prime}$, corresponding to $\overline{\boldsymbol{\eta}}=[\boldsymbol{\beta}^{\prime}\ \overline{\boldsymbol{\alpha}}^{\prime}\ \overline{\boldsymbol{\gamma}}^{\prime}]^{\prime}$. This section presents a general limit distribution result for each coordinate of the de-biased lasso estimator $\widetilde{\boldsymbol{\beta}}$ for the coefficients of $x_{ijt}$. We focus on short panels with fixed $T$ and large $(N,M)$, although an extension to large $T$ cases may be feasible with alternative assumptions. While we maintain high-level assumptions in the current section for the sake of generality, we will follow up with lower-level sufficient conditions in Section 5.1. Define the $p\times p$ rate-adjusted Gram matrix

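$$\bar{\Psi}=\begin{pmatrix}\frac{1}{NM}X^{\prime}X&\frac{1}{MN}X^{\prime}D_{1}&\frac{1}{NM}X^{\prime}D_{2}\\ \frac{1}{MN}D_{1}^{\prime}X&\frac{1}{M}D_{1}^{\prime}D_{1}&\frac{1}{NM}D_{1}^{\prime}D_{2}\\ \frac{1}{NM}D_{2}^{\prime}X&\frac{1}{NM}D_{2}^{\prime}D_{1}&\frac{1}{N}D_{2}^{\prime}D_{2}\end{pmatrix}.$$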

Let $[n]=\{1,...,n\}$ for any $n\in\mathbb{N}$ . With these notations, consider the following assumption.

Assumption 1 (Asymptotic Normality).

For all $(N,M)$ , there exists a column random vector $\widehat{\Theta}_{l}$ such that the following conditions hold for an $(N,M)$ -dependent choice of $\mu$ as $N,M\rightarrow\infty$ .

(i) $\max_{l\in[k_{0}]}\left|\sqrt{NM}(\widehat{\Theta}_{l}^{\prime}Q\bar{\Psi}Q-e_{l}^{\prime})(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})\right|=o_{p}(1)$.

(ii) $\max_{l\in[k_{0}]}\left|\widehat{\Theta}_{l}^{\prime}Z^{\prime}R/\sqrt{NM}\right|=o_{p}(1)$.

(iii) For each $l\in[k_{0}]$, there exists $V_{ll}\in(0,\infty)$ that can depend on $(N,M)$ such that

[TABLE]

In the current general theoretical discussion, Assumption 1 merely requires the existence of some $\widehat{\Theta}_{l}$ satisfying the three conditions, and does not say how it should be constructed. Recall that the overview of the method in Section 3 suggests a concrete way to construct such $\widehat{\Theta}_{l}$. Section 5 ahead will discuss lower-level sufficient conditions to guarantee that such a concrete construction of $\widehat{\Theta}_{l}$ satisfies the three high-level conditions in Assumption 1.

Theorem 1 (Asymptotic Normality).

Suppose that Assumption 1 (i)–(ii) are satisfied. Then,

[TABLE]

for each $l\in[p]$ . Furthermore, if Assumption 1 (iii) is satisfied in addition, then we have

[TABLE]

for each $l\in[k_{0}]$ .

A proof is found in Appendix A.1.

Remark 1.

The de-biased lasso estimator $\widetilde{\boldsymbol{\eta}}_{l}=\widehat{\boldsymbol{\eta}}_{l}-\frac{\mu}{NM}\widehat{\Theta}^{\prime}_{l}P^{\prime}(\widehat{\boldsymbol{\eta}})$ can also be rewritten by replacing $\mu P^{\prime}(\widehat{\boldsymbol{\eta}})$ with $-Z^{\prime}(Y-Z\widehat{\boldsymbol{\eta}})$ following the K.K.T. condition, i.e., $\widetilde{\boldsymbol{\eta}}_{l}=\widehat{\boldsymbol{\eta}}_{l}+\frac{1}{NM}\widehat{\Theta}_{l}^{\prime}Z^{\prime}(Y-Z\widehat{\boldsymbol{\eta}})$. This representation yields the de-biased lasso formula proposed in (3.14).

5 Sufficient Conditions and Variance Estimation

In this section, we propose lower-level sufficient conditions for the high-level general statements in Assumption 1. These conditions provide a theoretical guarantee for the concrete practical procedure of Section 3 to work. While the general limit distribution result in Theorem 1 did not specify a concrete form of the asymptotic variance $V_{ll}$ , the current section also provides a formula for it under these sufficient conditions. Furthermore, we propose an analog variance estimator $\widehat{V}_{ll}$ , and show its consistency under these sufficient conditions.

Throughout this section, we will assume $\widehat{\Upsilon}=I_{p}$ and $\widehat{\Upsilon}_{\text{node},l}=I_{p-1}$ for all $l\in[k_{0}]$ for simplicity, although these restrictions are not essential. We use the following notations for the parameter supports: $J_{1}=\text{supp}(\boldsymbol{\beta}),$ $J_{2}=\text{supp}(\overline{\boldsymbol{\alpha}}),$ $J_{3}=\text{supp}(\overline{\boldsymbol{\gamma}}),$ and $J=\text{supp}(\overline{\boldsymbol{\eta}})$. Their cardinalities are denoted by $s_{1}=|J_{1}|,$ $s_{2}=|J_{2}|,$ $s_{3}=|J_{3}|,$ and $s=|J|.$ We note that $s$ is non-decreasing in $N$ and/or $M$. Similarly to the decomposition (2.5) for the main regression model, we also consider the decomposition

[TABLE]

for each coordinate $l\in[k_{0}]$ of the regressors.

5.1 Sufficient Conditions

We present sufficient conditions as five modules, Assumptions 2, 3, 4, 5, and 6, listed below.

Assumption 2 (Approximate Sparsity).

(1) $\|\overline{\boldsymbol{\eta}}\|\leq K$ . (2) $\|R\|\leq c_{s}\lesssim\sqrt{s}$ with probability $1-o(1)$ . (3) $\|Z^{\prime}R\|=o_{p}\Big{(}\sqrt{NM}\Big{)}.$

Recall that the fixed effects $\boldsymbol{\alpha}$ are decomposed into $\overline{\boldsymbol{\alpha}}$ and $\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}}$ such that (2.3) is satisfied, and the fixed effects $\boldsymbol{\gamma}$ are decomposed into $\overline{\boldsymbol{\gamma}}$ and $\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}}$ such that (2.4) is satisfied. These conditions (2.3) and (2.4) are imposed to satisfy Assumption 2 (1) and (2). Assumption 2 (3) can be relaxed to a weaker condition (for example, $\sup_{\begin{subarray}{c}\|\xi\|=1\\ \|\xi\|_{0}=Cs\end{subarray}}\|\xi^{\prime}Z^{\prime}R\|=o_{p}(\sqrt{NM})$ for some finite positive $C$), but we present the current condition because it is easier to interpret.

Remark 2 (Discussion of the Approximate Sparsity Condition).

We emphasize that the approximate sparsity condition of Assumption 2 (together with Assumption 5 (4) to be stated below) allows for many and even all the fixed effects (i.e., $\boldsymbol{\eta}$ as opposed to $\overline{\boldsymbol{\eta}}$) to be nonzero. The assumption should be interpreted as a requirement for how the fixed effects can be decomposed into the sparse components ($\overline{\boldsymbol{\alpha}}$ and $\overline{\boldsymbol{\gamma}}$) and the remaining components ($\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}}$ and $\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}}$) generating $R=D_{1}\left(\boldsymbol{\alpha}-\overline{\boldsymbol{\alpha}}\right)+D_{2}\left(\boldsymbol{\gamma}-\overline{\boldsymbol{\gamma}}\right).$ Indeed, the assumption implicitly imposes a non-trivial restriction on sampling procedures. For example, an i.i.d. sampling of fixed effects is not accommodated, although this feature does not contradict our sampling assumption to be stated below as Assumption 3. That said, the same limitations apply to all the preceding papers (cf. Section 1) that employ (approximate) sparsity conditions on fixed effects in panel data. In fact, approximate sparsity is a rather plausible assumption for the sampling process in the context of our motivating application (1.1). In gravity analysis of trade, researchers initially used only the G7 countries, later added the OECD countries, and smaller economies have been added more recently. Nearly half of all import and export flows are accounted for by the ten largest economies. Countries newly added to the sample tend to have very small trade volumes. This sampling process entails fixed effects taking smaller values as the sample size increases, and it does not contradict the approximate sparsity requirement. Section 6 elaborates on the approximate sparsity in trade volumes based on actual world trade data. $\triangle$

Assumption 3 (Moments).

For each $(N,M)$ , the random vectors $(Y^{\prime}_{ij1},Z^{\prime}_{ij1},...,Y^{\prime}_{ijT},Z^{\prime}_{ijT})^{\prime}$ , $(i,j)\in[N]\times[M]$ , are independently distributed. Furthermore, there exist $q\in(4,\infty)$ and $K\in(0,\infty)$ not depending on $(N,M)$ such that the following conditions hold for all $l\in[k_{0}]$ .

(1) $\Big{(}\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}E\Big{[}\max_{t\leq T}\|X_{ijt}\|^{2q}_{\infty}\Big{]}\Big{)}^{1/2q}\leq B_{NM}$ and $\Big{(}E|X_{ijt,l}|^{2q}\Big{)}^{1/2q}\leq K$ hold for all $i,j,t,l$, where $B_{NM}$ satisfies $B_{NM}\sqrt{\log(p\vee(NM))}\lesssim(NM)^{1/2-1/q}$;

(2) $\|(D_{1},D_{2})\|_{\infty}=1$; and

(3) $\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}E\varepsilon_{ijt}^{2q}\vee\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{t=1}^{T}E(\zeta^{l}_{ijt})^{2q}\leq K^{2q}<\infty$.

For any square matrix $A$, define the sparse eigenvalues by

[TABLE]

With these notations, we state the following assumption of sparse eigenvalues for the rate-adjusted Gram matrix $\bar{\Psi}$ defined in (4.2).

Assumption 4 (Sparse Eigenvalues).

For any $C>0$ , there exist constants $0<\underline{k}<\overline{k}<\infty$ , not depending on $(N,M)$ , such that

[TABLE]

with probability approaching one.
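For intuition, sparse eigenvalues can be computed by brute force on a small matrix. The sketch below assumes the usual definition, as in Belloni et al. (2012): the minimum and maximum of $\xi^{\prime}A\xi$ over unit vectors $\xi$ with at most $m$ nonzero entries, for a symmetric $A$. Under that definition it suffices to scan principal submatrices of size exactly $m$, by eigenvalue interlacing. The function name is hypothetical, and the enumeration is only feasible for small $p$.

```python
from itertools import combinations
import numpy as np

def sparse_eigenvalues(A, m):
    """Brute-force m-sparse minimum and maximum eigenvalues of a symmetric matrix A.

    ASSUMPTION: the sparse eigenvalues are the extreme values of xi' A xi over
    unit vectors xi with at most m nonzero entries.
    """
    A = np.asarray(A, dtype=float)
    p = A.shape[0]
    smallest, largest = np.inf, -np.inf
    for support in combinations(range(p), m):
        idx = np.array(support)
        # eigenvalues (ascending) of the principal submatrix on this support
        eig = np.linalg.eigvalsh(A[np.ix_(idx, idx)])
        smallest = min(smallest, eig[0])
        largest = max(largest, eig[-1])
    return smallest, largest
```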

For each $(N,M)$, we write $\Psi=E\bar{\Psi}$ depending on $(N,M)$. With this notation, the auxiliary decomposition (5.1) is made according to the following conditions.

Assumption 5 (Nuisance Parameters).

The following conditions are satisfied.

(1) $\max_{l\in[k_{0}]}\|\phi^{l}\|_{0}\leq s_{l}$ and $\max_{l\in[k_{0}]}\|\phi^{l}\|+(s_{l})^{-1/2}\|\phi^{l}\|_{1}\leq K$;

(2) For all $l\in[k_{0}]$, $\|r_{l}\|\leq\sqrt{s_{l}}$;

(3) For all $(N,M)$, $0<L<\Lambda_{\min}(\Psi)<\Lambda_{\max}(\Psi)<U<\infty$ for $L$, $U$ independent of $(N,M)$;

(4) $\max_{l\in[k_{0}]}(s_{l}\vee s)\sqrt{\frac{(\log(p\vee(NM)))^{2}}{N\wedge M}}=o(1)$.

Accounting for the possible dependence, we define the cluster-robust variance matrix

[TABLE]

For each $(N,M)$ , we write $\Theta=(E[\frac{Z^{\prime}Z}{NM}])^{-1}$ depending on $(N,M)$ . Let $\Theta_{l}$ denote the $l$ -th column of $\Theta$ . We state the following assumption of finite and non-zero variance.

Assumption 6 (Variance).

For any $(N,M)$ and for all $l\in[k_{0}]$ , $\|\Omega\|<\infty$ and $\Theta_{l}^{\prime}\Omega\Theta_{l}\geq\underline{k}>0$ for a constant $\underline{k}$ which is independent of the sample size.

Remark 3.

Notice that the conditions above are imposed on the Gram matrices, $\bar{\Psi}$ and $\Psi$, re-weighted by effective sample size, rather than the original Gram matrices, $Z^{\prime}Z/NM$ and $EZ^{\prime}Z/NM$. Assumption 3 is weaker than the common assumptions required in the literature, such as sub-gaussianity or uniform boundedness. Assumption 4 is also assumed by Belloni et al. (2012) and Belloni et al. (2016). It requires some small sub-matrices of the big $p\times p$ re-weighted Gram matrix to be well-behaved. Lower-level sufficient conditions are also possible by using Lemma P1 in Belloni et al. (2018), but are not pursued here. Assumption 5 (1) and (2) impose sparsity on the nodewise regression parameters and the approximation errors. Assumption 5 (3) requires $\Psi$, the expectation of the re-weighted Gram matrix, to be positive definite uniformly over $(N,M)$. These are rather standard in the literature. Assumption 5 (4) limits the models that can be handled in terms of their dimensionality and sparsity. Note that we need only $ss_{l}(\log(p\vee(NM)))^{2}/(N\wedge M)=o(1)$, whereas an adaptation of the proof strategies of Kock (2016) and Kock and Tang (2019) to our framework would entail $ss^{2}_{l}(\log(p\vee(NM)))^{2}/(N\wedge M)=o(1)$. Finally, Assumption 6 requires $\Omega$ in the sandwich form to be well-behaved.

The following proposition states that Assumptions 2, 3, 4, 5, and 6 are sufficient for the high-level conditions in Assumption 1, with a concrete variance formula motivating the practical guideline of Section 3.

Proposition 1.

Assumptions 2, 3, 4, 5, and 6 imply Assumption 1 with $V_{ll}=\Theta_{l}^{\prime}\Omega\Theta_{l}$ .

A proof is found in Appendix A.2. Combining Theorem 1 and Proposition 1 together, we state the following corollary.

Corollary 1 (Asymptotic Normality).

If Assumptions 2, 3, 4, 5, and 6 are satisfied, then

[TABLE]

for each $l\in[k_{0}]$ , where $V_{ll}=\Theta_{l}^{\prime}\Omega\Theta_{l}$ .

Remark 4.

We conjecture that one can further enhance the results of Corollary 1 by showing the honesty property (uniform validity over a large set of parameters) of confidence intervals using the proposed procedure with no extra assumption by adapting the proof strategy of Theorem 3 of Caner and Kock (2018) or Theorem 3 of Kock and Tang (2019) to our framework.

5.2 Asymptotic Variance Estimation

Based on the asymptotic variance formula presented in Proposition 1, we suggest computing the cluster-robust asymptotic variance of $\sqrt{NM}\left(\widetilde{\boldsymbol{\beta}}_{\ell}-\boldsymbol{\beta}_{\ell}\right)$ by

[TABLE]

as suggested in Section 3. This estimator is consistent under the current assumptions, as formally stated in the following theorem.

Theorem 2 (Variance Estimator).

If Assumptions 2, 3, 4, 5 and 6 are satisfied, then

[TABLE]

A proof is found in Appendix A.3.

6 Approximate Sparsity in Gravity Analysis of Trade

In this section, we discuss our key assumption, namely the assumption of approximate sparsity (Assumptions 2 and 5 (4); see also Remark 2), in the gravity model (1.1) of international trade. The idea behind the approximate sparsity assumption is that only a small number of observations have large fixed effect values, and the remaining majority of observations have relatively modest fixed effect values that can be summarized into the approximation error term $r_{ijt}$. The assumption is likely satisfied in sampling processes where, after collecting observations with relatively large values of fixed effects (e.g., G7 and OECD countries), the remaining additions tend to have smaller values of fixed effects. We argue that this is plausible in common settings such as gravity analysis in international trade.

To illustrate this point, we retrieved data from the World Integrated Trade Solution (WITS) Database, a common source of trade flows and trade costs used in gravity analysis. (This database was developed by the World Bank in conjunction with the United Nations Conference on Trade and Development (UNCTAD), the International Trade Center, the United Nations Statistical Division (UNSD) and the World Trade Organization (WTO). The database combines information on trade flows from the UN Comtrade database, tariff and non-tariff barriers from the UN TRAINS database, and both preferential and MFN tariffs from the WTO’s Integrated Data Base.) We focus on country-specific import and export flows and aim to make two specific points. First, in any given year, trade is largely dominated by a few large countries. For instance, in 2015, the WITS database contains positive import flows for 237 countries and positive export flows for 232 countries. Nonetheless, nearly half of all import (respectively, export) flows are accounted for by the top 10 largest importers (respectively, exporters) alone. Not surprisingly, the largest importers are also the largest exporters. Second, the importance of these countries has remained stable over time, despite the fact that (a) world trade has grown exponentially over time and (b) WITS records exports and imports for a substantially larger number of countries in recent years than it did even a few years ago. In this sense, the ‘new’ additions to trade databases tend to have very small trade flows.

Using a country’s share of world imports (Table 1) as a measure of importer importance or a country’s share of world exports (Table 2) as a measure of exporter importance, we document the 10 largest trading nations every 5 years starting in 1990. We note the following three attributes of standard trade data: (1) a small number of countries account for the large majority of world trade; (2) whether a country represents a large or small fraction of trade flows changes slowly over time; and (3) even though many developing countries have grown substantially since 1990, the average share of small countries has not changed very much. This last feature is largely due to the fact that the ‘new’ countries which are added to world trade databases are nearly always very small. To make points (1) and (2) particularly clear, note that with a sample of about 200 countries a typical country would be expected to have an import/export share of roughly 0.5%. However, in any given year, fewer than 40 countries have import or export shares of at least 0.5%. Among the countries which have been added to the import database since 1990, the average (median) import share was 0.07% (0.01%). Similarly, among the countries added to the export database since 1990, the average (median) export share was 0.09% (0.03%). Regardless of how one measures the size of these peripheral countries, their overall contribution to world trade is extremely small.

In summary, only a small number of observations have high trade volumes. The large majority of remaining observations have very modest and almost negligible trade shares. This pattern remains stable over time. Since researchers first collect observations with large volumes (e.g., G7 and OECD countries), new additions to the data thereafter entail relatively small volumes. This common sampling process in gravity analysis of international trade is compatible with our key assumption, namely the assumption of approximate sparsity (Assumptions 2 and 5 (4); see also Remark 2).

7 Simulation Studies

7.1 Simulation Setting

Consider the following three fixed effect models of three-dimensional panel data.

[TABLE]

Model (I) is nested in Model (II), and Model (II) is in turn nested in Model (III). Therefore, Model (I) is the most parsimonious and subject to under-fitting, whereas Model (III) is the richest and subject to over-fitting. If a researcher runs a fixed effect estimator under Model (I) when Model (II) or (III) is true, then the estimates generally suffer from mis-specification biases. If a researcher runs a fixed effect estimator under Model (III) when Model (I) or (II) is true, then the estimates generally suffer from larger standard errors than necessary.

We run simulations for varying sizes of $N$ and $M=N-1$, while the length of time is set to $T=5$ throughout. This setting follows from our asymptotic theory where $N$ and $M$ increase but $T$ does not. The $i$ and $j$ fixed effects are generated by $\alpha_{i}\sim N\left(m_{\alpha},s_{\alpha}^{2}\left/\left(\sqrt{i}\cdot(\log(i+1))^{3}\right)\right.\right)$ and $\gamma_{j}\sim N\left(m_{\gamma},s_{\gamma}^{2}\left/\left(\sqrt{j}\cdot(\log(j+1))^{3}\right)\right.\right)$ independently, where $m_{\alpha}=m_{\gamma}=0$ and $s_{\alpha}=s_{\gamma}=1$. The $t$ fixed effects are generated by $\lambda_{t}=0$ for all $t$ except for one year $t$, in which a universal shock of $\lambda_{t}=2$ is applied. The $it$ and $jt$ fixed effects are generated by $\alpha_{it}\sim N\left(m_{\alpha},s_{\alpha}^{2}\left/\left(\sqrt{i}\cdot(\log(i+1))^{3}\right)\right.\right)$, $\gamma_{jt}\sim N\left(m_{\gamma},s_{\gamma}^{2}\left/\left(\sqrt{j}\cdot(\log(j+1))^{3}\right)\right.\right)$, $m_{\alpha}=m_{\gamma}=0$, and $s_{\alpha}=s_{\gamma}=1$. We generate $X$ dependently on the fixed effects according to the mixture

[TABLE]

where $m_{x}=0$ , $s_{x}=2$ , $\rho=0.5$ , $\tilde{x}_{ijt}\sim N(0,1)$ , and $F_{ijt}$ is the standardized sum of fixed effects for the unit $(i,j,t)$ , i.e.,

[TABLE]

for each $(i,j,t)\in\{1,...,N\}\times\{1,...,M\}\times\{1,...,T\}$ . The error term is generated by $\varepsilon_{ijt}\sim N(m_{\varepsilon},s_{\varepsilon}^{2})$ independently where $m_{\varepsilon}=0$ and $s_{\varepsilon}=10$ . The main coefficient of interest is set to $\beta=1$ . Each set of simulations consists of 10,000 Monte Carlo iterations of data generation, estimation, and inference.
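The fixed-effect and error draws described above can be sketched as follows. Only the components whose distributions are stated in the text are generated; the regressor mixture and the outcome equations are omitted. The function names, the dictionary layout, and the `shock_year` argument (the text does not say which year receives the shock) are assumptions for illustration.

```python
import numpy as np

def fe_std(k, s=1.0):
    """Standard deviation implied by the variance s^2 / (sqrt(k) * log(k+1)^3)."""
    k = np.asarray(k, dtype=float)
    return np.sqrt(s**2 / (np.sqrt(k) * np.log(k + 1.0) ** 3))

def draw_fixed_effects(N, M, T=5, shock_year=0, s_eps=10.0, seed=0):
    """Draw i, j, t, it, jt fixed effects and errors with mean zero.

    shock_year is a hypothetical choice of the single year with lambda_t = 2.
    """
    rng = np.random.default_rng(seed)
    sd_i = fe_std(np.arange(1, N + 1))          # shrinks as i grows
    sd_j = fe_std(np.arange(1, M + 1))          # shrinks as j grows
    lam = np.zeros(T)
    lam[shock_year] = 2.0                       # one universal shock
    return {
        "alpha_i": rng.normal(0.0, sd_i),
        "gamma_j": rng.normal(0.0, sd_j),
        "lambda_t": lam,
        "alpha_it": rng.normal(0.0, sd_i[:, None], size=(N, T)),
        "gamma_jt": rng.normal(0.0, sd_j[:, None], size=(M, T)),
        "eps": rng.normal(0.0, s_eps, size=(N, M, T)),
    }
```

Because the variance shrinks with the index, units with larger $i$ or $j$ tend to receive fixed effects closer to zero, which mimics the approximate sparsity discussed in Section 6.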

We compare five methods of estimation and inference. These are the OLS without any individual fixed effects, the fixed effect estimator based on Model (I), the fixed effect estimator based on Model (II), the fixed effect estimator based on Model (III), and our proposed de-biased lasso estimator and post-selection inference. Note that the OLS is always under-fitting the true data generating model, and hence is expected to produce mis-specification biases. The fixed effect estimator based on Model (I) is correctly specified when the true data generating model is Model (I), but is under-fitting Model (II) and Model (III). The fixed effect estimator based on Model (II) is over-fitting Model (I), correctly specified when the true data generating model is Model (II), and under-fitting Model (III). The fixed effect estimator based on Model (III) is over-fitting Model (I) and Model (II), but is correctly specified when the true data generating model is Model (III).

7.2 Simulation Results

Table 3 displays Monte Carlo simulation results under Model (I) (top panel), Model (II) (middle panel), and Model (III) (bottom panel) with the sample size $N=10$ ($NMT=450$). Similarly, Tables 4 and 5 display Monte Carlo simulation results with the sample sizes $N=15$ ($NMT=1050$) and $N=20$ ($NMT=1900$), respectively. The displayed statistics are the averages, biases, standard deviations, and root mean squared errors of estimates. Also displayed are the coverage frequencies of the true value of $\beta$ by the 95% confidence intervals. The first column of each table shows the OLS results without any individual fixed effects. The next three columns of each table show results of fixed effect estimators based on estimating equations of Model (I), Model (II), and Model (III). We shall call them FE-I, FE-II, and FE-III for succinctness. The last column of each table shows results of our proposed de-biased lasso estimator with valid post-selection inference. We shall call it POST for succinctness.
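The displayed statistics can be computed from the Monte Carlo draws as in the sketch below. It assumes that each iteration yields a point estimate and a standard error for $\beta$, and that the 95% interval is the estimate plus or minus 1.96 standard errors; the function name and the returned keys are hypothetical.

```python
import numpy as np

def summarize(estimates, std_errors, beta_true, z=1.96):
    """Average, bias, SD, RMSE, and coverage frequency across Monte Carlo iterations."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    average = estimates.mean()
    # the true value is covered when it lies within z standard errors of the estimate
    covered = np.abs(estimates - beta_true) <= z * std_errors
    return {
        "average": average,
        "bias": average - beta_true,
        "sd": estimates.std(ddof=1),
        "rmse": np.sqrt(np.mean((estimates - beta_true) ** 2)),
        "coverage": covered.mean(),
    }
```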

In the top panel of each table, where the true data generating model is Model (I), OLS is biased while FE-I, FE-II, and FE-III yield little biases. These results are consistent with the current simulation setting as OLS mis-specifies the true model while FE-I, FE-II, and FE-III correctly specify the true model. The bias of POST is in the middle between that of OLS and those of FE-I, FE-II, and FE-III. In other words, POST is de-biased to some extent but not to the full extent so that desired balances between the bias and variance are maintained. OLS yields a smaller standard deviation than FE-I or FE-II, and FE-III yields by far the largest standard deviation. These results are also consistent with the fact that OLS is the most parsimonious while FE-III is the most redundant in specification. POST yields an even smaller standard deviation than OLS. FE-I, as the oracle estimator, yields a smaller root mean square error than OLS, FE-II, or FE-III. Furthermore, POST yields an even smaller root mean square error than the oracle estimator, FE-I. The coverage frequency by FE-I, as the oracle estimator, is closer to the nominal level 95% than those of OLS, FE-II, or FE-III. Furthermore, POST yields the coverage frequency as close to the nominal level as the oracle estimator, FE-I. In summary, we observe that, when the true model is parsimonious, POST is more efficient than redundantly rich models and allows for as accurate inference as the oracle estimator.

In the middle panel of each table, where the true data generating model is Model (II), OLS and FE-I are biased while FE-II and FE-III yield little biases. These results are consistent with the current simulation setting as OLS and FE-I mis-specify the true model while FE-II and FE-III correctly specify the true model. The bias of POST is slightly larger than those of FE-II and FE-III, but much smaller than those of OLS and FE-I. In other words, POST is de-biased to a large extent but not to the full extent so that desired balances between the bias and variance are maintained. FE-II, as the oracle estimator, yields a smaller root mean square error than OLS, FE-I, or FE-III. Furthermore, POST yields an even smaller root mean square error than the oracle estimator, FE-II. The coverage frequency by FE-II, as the oracle estimator, is closer to the nominal level 95% than those of OLS, FE-I, or FE-III. POST yields the coverage frequency as close to the nominal level as the oracle estimator, FE-II. In summary, we observe that POST is more precise than biased parsimonious estimators, is more efficient than redundant estimators, and allows for as accurate inference as the oracle estimator.

In the bottom panel of each table, where the true data generating model is Model (III), OLS, FE-I, and FE-II are biased while FE-III yields little bias. These results are consistent with the current simulation setting as OLS, FE-I, and FE-II mis-specify the true model while FE-III correctly specifies the true model. The bias of POST lies between those of OLS, FE-I, and FE-II and that of FE-III. In other words, POST is de-biased to some extent but not to the full extent so that desired balances between the bias and variance are maintained. POST yields a smaller root mean square error than any other estimator, including the oracle estimator, FE-III. POST also yields a coverage frequency closer to the nominal level than any other estimator, including the oracle estimator, FE-III. In summary, we observe that, when the true model is rich, POST is more precise than parsimonious estimators and allows for as accurate inference as the oracle estimator.

The simulation results reported above demonstrate that the proposed method (POST) can be used as a robustly applicable method of inference when a researcher does not know the correct fixed effect specification in practice. We also implemented many additional sets of simulations under alternative data generating parameters, and confirmed that the qualitative pattern of these additional results remains the same as that of our baseline setting presented above. Specifically, we consistently observe that POST is more precise than biased parsimonious estimators, is more efficient than redundant estimators, and allows for as accurate inference as the oracle estimator.

8 Discussions

Three-dimensional panel models are widely used in empirical analysis of international trade, housing, migration, and consumer prices, among others. Empirical researchers use various combinations of fixed effects for three-dimensional panels. When a researcher imposes a parsimonious model and the true model is rich, then estimation based on the assumed parsimonious model generally incurs mis-specification biases. When a researcher employs a rich model and the true model is parsimonious, then estimation based on the redundantly rich model generally incurs larger standard errors than necessary. It is therefore useful for researchers to know correct models for an application of interest. In this light, Lu et al. (2018) propose methods of model selection in three-dimensional panel data. In this paper, we advance this literature by proposing a method of post-selection inference for regression parameters. We propose to use the lasso technique as a means of model selection and to de-bias the lasso estimate, but our assumptions allow for many and even all fixed effects to be nonzero. Simulation studies demonstrate that the proposed method is more precise than biased estimators by parsimonious models, is more efficient than noisy estimators by redundant models, and allows for as accurate inference as the oracle estimator.

We suggest a couple of directions for future research. First, our model framework does not allow for $ij$ fixed effects, while $i$, $j$, $t$, $it$ and $jt$ fixed effects are allowed. Allowing for $ij$ fixed effects is not of interest in our motivating example: in gravity models for international trade, the main parameters of interest are the coefficient of $DIST_{ij}$, interpreted as the trade elasticity or trade cost, and the coefficient of $TA_{ij}$, interpreted as the effects of bilateral trade agreements on trade volume, and these parameters will not be identified once $ij$ fixed effects enter the model. Nonetheless, it may be possible to allow for such fixed effects provided that the asymptotic setting allows for large $T$ as well as large $N$ and/or large $M$. Formal theoretical development for this case is left for future research. Second, we conjecture that our limit distribution result can be extended to establish honest (uniformly valid) confidence intervals, and formal theoretical investigation of the honesty property is left for future research.

Mathematical Appendix

Throughout, we use the following short-hand notations: $Q=S/\sqrt{NM}$ and $a=p\vee(NM)$ . Also, for a matrix $A$ , denote $\|A\|_{\infty}=\max_{i,j}|A_{i,j}|$ .

Appendix A Proofs of the Main Results

A.1 Proof of Theorem 1

Proof.

The K.K.T. condition for the lasso program (3.1) gives

[TABLE]

Note that we have $S\bar{\Psi}S=Z^{\prime}Z$ by the definition of $\bar{\Psi}$ , and thus

[TABLE]

Multiplying both sides by $\widehat{\Theta}_{l}^{\prime}/\sqrt{NM}$ , we have

[TABLE]

where $Q=S/\sqrt{NM}$ . Therefore, we have

[TABLE]

By Assumption 1 (i)–(ii) and the definition (4.1) of the de-biased lasso, we obtain

[TABLE]

Applying Assumption 1 (iii) for each $l\in[k_{0}]$ yields the weak convergence result. ∎
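As a reading aid, the steps of the proof of Theorem 1 can be collected in a single display. The sketch assumes the decomposition $Y=Z\overline{\boldsymbol{\eta}}+R+\varepsilon$ (implicit in the residual formula $\hat{\varepsilon}=\varepsilon+R-Z(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})$ used in Appendix A.3), the representation of $\widetilde{\boldsymbol{\eta}}_{l}$ given in Remark 1, and the identity $Z^{\prime}Z/NM=Q\bar{\Psi}Q$; it is not a substitute for the formal argument.

```latex
\begin{align*}
\sqrt{NM}\,(\widetilde{\boldsymbol{\eta}}_{l}-\overline{\boldsymbol{\eta}}_{l})
&= \sqrt{NM}\,e_{l}^{\prime}(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})
   + \frac{1}{\sqrt{NM}}\,\widehat{\Theta}_{l}^{\prime}Z^{\prime}
     \bigl(\varepsilon + R - Z(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})\bigr) \\
&= \frac{\widehat{\Theta}_{l}^{\prime}Z^{\prime}\varepsilon}{\sqrt{NM}}
   + \underbrace{\frac{\widehat{\Theta}_{l}^{\prime}Z^{\prime}R}{\sqrt{NM}}}_{o_{p}(1)\text{ by Assumption 1 (ii)}}
   - \underbrace{\sqrt{NM}\,\bigl(\widehat{\Theta}_{l}^{\prime}Q\bar{\Psi}Q-e_{l}^{\prime}\bigr)
     (\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})}_{o_{p}(1)\text{ by Assumption 1 (i)}} .
\end{align*}
```

The leading term $\widehat{\Theta}_{l}^{\prime}Z^{\prime}\varepsilon/\sqrt{NM}$ is the one whose limit distribution Assumption 1 (iii) is designed to control.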

A.2 Proof of Proposition 1

Proof.

The sufficiency of Assumptions 2, 3, 4, and 5 for Assumption 1 (i) is provided in Lemma 5. The sufficiency of Assumptions 2, 3, 4, and 5 for Assumption 1 (ii) is provided in Lemma 6. The sufficiency of Assumptions 3, 4, 5 and 6 for Assumption 1 (iii) is provided in Lemma 7. ∎

A.3 Proof of Theorem 2

Proof.

We introduce the intermediate object defined by

[TABLE]

Lemma 8 under Assumptions 2, 3, 4 and 5 yields $\max_{l\in[p]}\|\hat{\Theta}_{l}\|_{0}\leq Cs_{l}$ with probability $1-o(1)$ for some $C$ large enough for all $l\in[k_{0}]$ . Therefore, we obtain the decomposition

[TABLE]

for all $l\in[k_{0}]$. By Lemma 4 under Assumptions 3, 4, and 5, it suffices to bound $\max_{\begin{subarray}{c}\|\xi\|=1\\ \|\xi\|_{0}\leq Cs_{l}\end{subarray}}\xi^{\prime}(\hat{\Omega}-\tilde{\Omega})\xi$ and $\|\hat{\Theta}_{l}\|^{2}_{1}\|\tilde{\Omega}-\Omega\|_{\infty}$ on the right-hand side.

We first bound $\|\hat{\Theta}_{l}\|^{2}_{1}\|\tilde{\Omega}-\Omega\|_{\infty}$ on the right-hand side of (A.1). Since $\max_{l\in[p]}\|\hat{\Theta}_{l}\|_{0}=O_{p}(s_{l})$ with probability approaching one and $\max_{l\in[p]}\|\hat{\Theta}_{l}\|=O_{p}(1)$ , we have $\|\hat{\Theta}_{l}\|_{1}=O_{p}(\sqrt{s_{l}})$ uniformly over $l\in[k_{0}]$ . By an application of Lemma 2, we have

[TABLE]

with probability at least $1-o(1)$ , where

[TABLE]

under Assumption 3, and

[TABLE]

under Assumption 3. Therefore, we obtain

[TABLE]

where the last rate follows from Assumption 3 (1). Combining these results, we obtain

[TABLE]

We next bound $\max_{\begin{subarray}{c}\|\xi\|=1\\ \|\xi\|_{0}\leq Cs_{l}\end{subarray}}\xi^{\prime}(\hat{\Omega}-\tilde{\Omega})\xi$ on the right-hand side of (A.1). Note that $\hat{\varepsilon}=\varepsilon+R-Z(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})$ . Thus,

[TABLE]

We bound each of the last eight terms separately. First, the Cauchy-Schwarz inequality yields

[TABLE]

Due to the sparsity of all the feasible $\xi$ , we have $\|\xi\|_{1}\leq\sqrt{s_{l}}\|\xi\|$ . Thus, by Assumption 3, Lemma 1, and Lemma 3 with $\mu=C\sqrt{NM\log a}$ under Assumptions 2, 3 (1), and 4, we have

[TABLE]

Therefore, $(8)=O_{p}\Big{(}\frac{s\cdot s_{l}B^{2}_{NM}\log a}{(NM)^{1-1/q}}\Big{)}.$

Similarly, for $(1)$ and $(3)$ , we have

[TABLE]

Thus, by Assumptions 2 and 3 (1),

[TABLE]

for all feasible $\xi$ . Since $Z^{\prime}Z/NM=Q\bar{\Psi}Q$ and $\|Q\xi\|\leq\|\xi\|$ ,

[TABLE]

where the second inequality is due to Assumption 4 and the last uses Assumption 3 (3).

Since all the remaining terms consist of the products of the above three components, by using (A.2), (A.3) and (A.4), we obtain

[TABLE]

Using the rate for $\max_{l\in[k_{0}]}\|\hat{\Theta}_{l}-\Theta_{l}\|$ from Lemma 4 under Assumptions 2, 3, and 6, we have

[TABLE]

as desired. ∎

Remark 5.

As emphasized in the main text, recall that Assumption 5 (4) requires $ss_{l}(\log(p\vee(NM)))^{2}/(N\wedge M)=o(1)$ instead of $ss^{2}_{l}(\log(p\vee(NM)))^{2}/(N\wedge M)=o(1)$. This is due to the fact that we made use of the bound $|\hat{\Theta}_{l}^{\prime}\hat{\Omega}\hat{\Theta}_{l}-\hat{\Theta}_{l}^{\prime}\tilde{\Omega}\hat{\Theta}_{l}|\leq\|\hat{\Theta}_{l}\|^{2}\max_{\begin{subarray}{c}\|\xi\|=1\\ \|\xi\|_{0}\leq Cs_{l}\end{subarray}}\xi^{\prime}(\hat{\Omega}-\tilde{\Omega})\xi$ with probability approaching unity following Lemma 8. On the other hand, in Kock (2016) and Kock and Tang (2019), the bound based on the dual norm inequality $|\hat{\Theta}_{l}^{\prime}\hat{\Omega}\hat{\Theta}_{l}-\hat{\Theta}_{l}^{\prime}\tilde{\Omega}\hat{\Theta}_{l}|\leq\|\hat{\Theta}_{l}\|_{1}^{2}\|\hat{\Omega}-\tilde{\Omega}\|_{\infty}$ is used instead.
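To see where the extra factor of $s_{l}$ in the dual-norm route comes from, the following short calculation may help. It uses only the sparsity bound $\|\hat{\Theta}_{l}\|_{0}\leq Cs_{l}$ from Lemma 8 and the Cauchy-Schwarz inequality; it is a sketch of the comparison and not part of the original proof.

```latex
\begin{align*}
\|\hat{\Theta}_{l}\|_{1}
  &= \sum_{k:\,\hat{\Theta}_{l,k}\neq 0} |\hat{\Theta}_{l,k}|
   \;\leq\; \sqrt{\|\hat{\Theta}_{l}\|_{0}}\;\|\hat{\Theta}_{l}\|
   \;\leq\; \sqrt{Cs_{l}}\;\|\hat{\Theta}_{l}\|, \\
\|\hat{\Theta}_{l}\|_{1}^{2}\,\|\hat{\Omega}-\tilde{\Omega}\|_{\infty}
  &\leq\; Cs_{l}\,\|\hat{\Theta}_{l}\|^{2}\,\|\hat{\Omega}-\tilde{\Omega}\|_{\infty}.
\end{align*}
```

The dual-norm bound therefore multiplies the entrywise rate of $\hat{\Omega}-\tilde{\Omega}$ by an explicit factor of order $s_{l}$, whereas the sparse quadratic-form bound is multiplied only by $\|\hat{\Theta}_{l}\|^{2}=O_{p}(1)$.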

Appendix B Auxiliary Lemmas

B.1 Oracle Inequalities

Assumption 7 (Oracle Inequalities).

For each $(N,M)$ and for some choice of $\mu$ that depends on $(N,M)$ , we have $2\|\widehat{\Upsilon}_{1}^{-1}\varepsilon^{\prime}X\|_{\infty}\leq\mu/c$ , $2\|\widehat{\Upsilon}_{2}^{-1}\varepsilon^{\prime}D_{1}\|_{\infty}\leq\mu/\sqrt{N}c$ and $2\|\widehat{\Upsilon}_{3}^{-1}\varepsilon^{\prime}D_{2}\|_{\infty}\leq\mu/\sqrt{M}c$ with probability $1-o(1)$ for some $c>1$ .

Assumption 8 (Weights for Penalty).

There exist an ideal penalty loading matrix $\widehat{\Upsilon}^{0}_{l}$ with all elements bounded and bounded away from zero uniformly over $(N,M)$, and sequences $u$, $\ell$ with $0<\ell\leq 1\leq u$, $\ell\overset{p}{\to}1$, and $u\overset{p}{\to}u^{\prime}>1$ for some constant $u^{\prime}$, such that

[TABLE]

with probability $1-o(1)$ for $l=1,2,3$ .

Remark 6.

There are many situations in which one may want to impose weights that penalize different parameters differently. These include (1) the case where one incorporates extra information from economic theory; (2) a penalty choice based on the theory of moderate deviation inequalities for self-normalized sums as in Belloni et al. (2012); (3) the case where one conducts an iterated lasso algorithm such as the conservative lasso as in Caner and Kock (2018); and (4) the common practice of normalizing the standard errors of all covariates to one.
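To illustrate case (4), the sketch below checks numerically that a lasso objective with positive penalty loadings $w_{j}$ equals an unweighted lasso objective after dividing column $j$ by $w_{j}$ and multiplying coefficient $j$ by $w_{j}$. The objective here is a generic weighted lasso criterion chosen for illustration, not necessarily the exact criterion used in the paper, and all data and loadings are hypothetical.

```python
def weighted_lasso_objective(y, X, b, w, lam):
    """Residual sum of squares plus lam * sum_j w[j] * |b[j]|."""
    p = len(b)
    rss = 0.0
    for i in range(len(y)):
        fitted = sum(X[i][j] * b[j] for j in range(p))
        rss += (y[i] - fitted) ** 2
    penalty = lam * sum(w[j] * abs(b[j]) for j in range(p))
    return rss + penalty

# Hypothetical data: the second covariate is on a much larger scale.
y = [1.0, 2.0, 0.5, -1.0]
X = [[1.0, 10.0], [2.0, -5.0], [0.0, 20.0], [-1.0, 0.0]]
b = [0.3, -0.02]
w = [1.0, 10.0]   # hypothetical positive penalty loadings
lam = 0.7

# Rescale columns by 1 / w_j and coefficients by w_j.
X_scaled = [[X[i][j] / w[j] for j in range(2)] for i in range(4)]
b_scaled = [w[j] * b[j] for j in range(2)]

weighted = weighted_lasso_objective(y, X, b, w, lam)
unweighted = weighted_lasso_objective(y, X_scaled, b_scaled, [1.0, 1.0], lam)
print(weighted, unweighted)
```

Because the two objectives agree at every coefficient vector, minimizing one is equivalent to minimizing the other, which is why loadings and column normalization can be treated interchangeably in this simple setting.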

Assumption 9 (Restricted Eigenvalues).

For any $C>0$, there exists $\underline{\kappa}_{C}>0$ depending only on $C$ such that $\kappa^{2}_{C}:=\kappa^{2}_{C}(\bar{\Psi},s_{1},s_{2},s_{3})\geq\underline{\kappa}_{C}$ for all $(N,M)$ with probability $1-o(1)$.

Remark 7.

As highlighted in Belloni et al. (2012), Assumption 4 implies Assumption 9 by the argument in Bickel et al. (2009).

The following lemma presents oracle inequalities for the three-dimensional panel lasso. Its proof is closely related to that of Lemma 6 of Belloni et al. (2012). The main difference is that it accounts for the presence of fixed effects with different effective sample sizes.

Lemma 1 (Oracle Inequalities).

If Assumptions 2, 7, 8, and 9 are satisfied, then

[TABLE]

Proof.

From the definition of $\widehat{\boldsymbol{\eta}}$ , we have

[TABLE]

Rewriting this inequality, we get

[TABLE]

Using reverse triangle inequality and the dual norm inequality,

[TABLE]

where the third inequality follows from Assumptions 2 and 7. By the definition of $P$ , we have

[TABLE]

under Assumption 8.

We now branch into two cases. First, suppose that $\|Z(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})\|<2c_{s}$. In this case, the first equation in the statement of the lemma holds trivially, since all the terms on its right-hand side are non-negative. Second, suppose that $\|Z(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})\|\geq 2c_{s}$. In this case,

[TABLE]

and thus

[TABLE]

where $c_{0}=(uc+1)/(\ell c-1)$ . Assumption 9 implies that, for any $\delta$ which is in the choice set of the minimum of restricted eigenvalue definition, we have

[TABLE]

Since $\delta^{\prime}\Psi\delta=\delta^{\prime}S^{-1}Z^{\prime}ZS^{-1}\delta=b^{\prime}Z^{\prime}Zb$ for $b=S^{-1}\delta$ , we can rewrite the condition in terms of $b$ and obtain

[TABLE]

Note that (B.2) implies that we can let $b=\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}}$ . Thus,

[TABLE]

Taking the square root on both sides yields

[TABLE]

Finally, substituting this equation into (B.1) and dropping the negative terms on the right-hand side yields

[TABLE]

This shows the first equation in the statement of the lemma.

We next obtain the $L^{1}$ -norm bounds. We branch into two cases. First, suppose that

[TABLE]

By definition of $\kappa_{2c_{0}}$ , we have

[TABLE]

by applying similar lines of arguments to those of the first part of the proof using $2c_{0}$ in place of $c_{0}$ . Second, suppose that

[TABLE]

In this case, equation (B.1) implies

[TABLE]

where the last inequality is due to the definition of $c_{0}=(uc+1)/(\ell c-1)$ . Equation (B.1) further implies that

[TABLE]

where the first inequality follows from (B.1), the second inequality follows from $\|Z(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})\|(2c_{s}-\|Z(\widehat{\boldsymbol{\eta}}-\overline{\boldsymbol{\eta}})\|)\leq\max_{x\geq 0}x(2c_{s}-x)\leq c_{s}^{2}$ , and the third inequality follows from (B.4). Therefore,

[TABLE]

where the first inequality is due to (B.4) and the second inequality is due to the previous equation. Combining the two cases together, we obtain

[TABLE]

and the remaining three equations in the statement of the lemma follow. ∎
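The second case of the $L^{1}$-norm argument in the proof above uses the elementary bound $\max_{x\geq 0}x(2c_{s}-x)\leq c_{s}^{2}$. For completeness, it can be seen by completing the square:

```latex
x(2c_{s}-x) \;=\; c_{s}^{2}-(x-c_{s})^{2} \;\leq\; c_{s}^{2}
\qquad \text{for all } x\in\mathbb{R},
```

with equality if and only if $x=c_{s}$.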

B.2 Concentration Inequality

The following lemma follows from Chernozhukov et al. (2014) and Chernozhukov et al. (2015).

Lemma 2 (A Concentration Inequality).

Let $(X_{i})_{i\in[n]}$ be $p$ -dimensional independent random vectors, $B=\sqrt{E[\max_{i\in[n]}\|X_{i}\|^{2}_{\infty}]}$ , and $\sigma^{2}=\max_{j\in[p]}\frac{1}{n}\sum_{i=1}^{n}E|X_{ij}|^{2}$ . With probability at least $1-C(\log n)^{-1}$ ,

[TABLE]

Proof.

The claim follows from applying Theorem 5.1 of Chernozhukov et al. (2014) to Lemma 8 of Chernozhukov et al. (2015) with $t=\log n$ , $\alpha=1$ , and $q=2$ . ∎

B.3 Regularized Events

Lemma 3 (Regularized Events).

Fix constants $c>1$ and $C>0$, and let $\widehat{\Upsilon}=I$. If Assumption 3 is satisfied, then we have $2\|\varepsilon^{\prime}X\|_{\infty}\leq\mu/c$, $2\|\varepsilon^{\prime}D_{1}\|_{\infty}\leq\mu/c\sqrt{N}$ and $2\|\varepsilon^{\prime}D_{2}\|_{\infty}\leq\mu/c\sqrt{M}$ with probability at least $1-C(\log(N\wedge M))^{-1}$ where $\mu=C\sqrt{NM\log a}$. Similarly, if Assumption 3 is satisfied, then we have $\|X_{-l}\zeta_{l}\|_{\infty}\leq\mu_{\text{node},l}/2c$, $\|D_{l}\zeta_{l}\|_{\infty}\leq\mu_{\text{node},l}/2c\sqrt{N}$, and $\|D_{2}\zeta_{l}\|_{\infty}\leq\mu_{\text{node},l}/2c\sqrt{M}$ uniformly over $l\in[p]$ with probability at least $1-C(\log(N\wedge M))^{-1}$ where $\mu_{\text{node},l}=C\sqrt{NM\log a}$.
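As a rough illustration of the thresholds in Lemma 3, the sketch below computes $\mu=C\sqrt{NM\log a}$ and the three right-hand sides, and provides a helper that checks an event of the form $2\|\cdot\|_{\infty}\leq\text{threshold}$. It reads $\mu/c\sqrt{N}$ as $\mu/(c\sqrt{N})$, which is an interpretive assumption. The quantity $a$ is defined in the main text and is treated here simply as a number greater than one; the function names and numerical values are hypothetical.

```python
import math

def penalty_levels(N, M, a, C=1.0, c=1.1):
    """Return mu = C * sqrt(N * M * log a) and the three event thresholds."""
    if a <= 1.0 or c <= 1.0:
        raise ValueError("this sketch requires a > 1 and c > 1")
    mu = C * math.sqrt(N * M * math.log(a))
    return {
        "mu": mu,
        "X": mu / c,                      # threshold for 2 * ||eps' X||_inf
        "D1": mu / (c * math.sqrt(N)),    # threshold for 2 * ||eps' D_1||_inf
        "D2": mu / (c * math.sqrt(M)),    # threshold for 2 * ||eps' D_2||_inf
    }

def event_holds(score_sup_norm, threshold):
    """True when 2 * (sup-norm of the score) is at most the threshold."""
    return 2.0 * score_sup_norm <= threshold

# Hypothetical dimensions and a hypothetical value of a.
levels = penalty_levels(N=100, M=50, a=5000.0)
print(levels)
```

The thresholds for the fixed-effect blocks are smaller than the one for the regressors by factors of $\sqrt{N}$ and $\sqrt{M}$, reflecting the different effective sample sizes mentioned before Lemma 1.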

Proof.

Applying Lemma 2, we have

[TABLE]

with probability $1-C(\log NM)^{-1}$ , where $\sigma^{2}=\max_{l\in[k_{0}]}\max_{t\in[T]}\frac{1}{NM}E[X^{2}_{ijt,l}\varepsilon_{ijt}^{2}]\leq O(K^{4})$ . Note that we have

[TABLE]

where the first inequality is due to Jensen’s inequality, the third inequality is due to Hölder’s inequality, and the last equality is due to Assumption 3 (1) and (3). Thus $\frac{B_{NM}\log(p\vee(NM))}{(NM)^{1-1/q}}=O(\sqrt{\frac{\log a}{NM}})$ , and this implies

[TABLE]

with probability at least $1-C(\log(NM))^{-1}$ for $K>0$ large enough.

Since $\|(D_{1},D_{2})\|_{\infty}=1$ under Assumption 3 (2), an application of Lemma 2 gives

[TABLE]

with probability at least $1-C(\log(N\wedge M))^{-1}$ , where $i$ depends on the choice of $l$ . Note that there are at most $MT=O(M)$ nonzero terms in the summand for each $l$ . Analogous arguments hold for $\|D_{2}^{\prime}\varepsilon\|_{\infty}$ with the number of nonzero terms being at most $NT=O(N)$ in place of $MT$ .
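The count of nonzero terms can be checked on a toy panel. The sketch below assumes, for illustration only, that a column of $D_{1}$ is the indicator of a single first-dimension unit $i$ stacked over all $(i,j,t)$ cells; the paper's actual fixed-effect design is specified in the main text and may differ. The function name is hypothetical.

```python
from itertools import product

def unit_dummy_column(N, M, T, i):
    """Indicator of first-dimension unit i over all stacked (i', j, t) cells."""
    return [1 if ii == i else 0
            for ii, _j, _t in product(range(N), range(M), range(T))]

# Small hypothetical panel dimensions.
N, M, T = 4, 3, 2
col = unit_dummy_column(N, M, T, i=1)
nonzeros = sum(col)
print(len(col), nonzeros)
```

Under this assumed design, each such column has exactly $MT$ nonzero entries out of $NMT$, so a sum against that column involves only $MT$ terms.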

Under Assumption 3 and the choice $\mu_{\text{node},l}=C\sqrt{NM\log a}$ , similar lines of argument to those above show that the regularized events $\|X_{-l}\zeta_{l}\|_{\infty}\leq\mu_{\text{node}}/2c$ , $\|D_{l}\zeta_{l}\|_{\infty}\leq\mu_{\text{node}}/2c\sqrt{N}$ , and $\|D_{2}\zeta_{l}\|_{\infty}\leq\mu_{\text{node}}/2c\sqrt{M}$ occur with probability approaching one. Applying Lemma 2, we have

[TABLE]

with probability $1-C(\log(N\wedge M))^{-1}$. ∎

B.4 Rates of Nuisance Parameters

Throughout this section, we use the following notation. For any diagonal matrix $A$, $A_{l}$ denotes the $l$-th diagonal entry and $A_{-l}$ denotes $A$ with the $l$-th column and row removed.

The following lemma establishes the behavior of the nuisance parameters based on the nodewise regressions under the three-dimensional panel setting. It is closely related to Lemma C.9 of Kock and Tang (2019). The main difference is that, in Kock and Tang (2019), the one-way fixed effect modeling assumption implies $D_{2}=\emptyset$ and $D^{\prime}_{1}D_{1}=I$, which in turn implies the diagonal structure of
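The notation above can be made concrete with a small sketch. The helper names are hypothetical, and indices are zero-based in the code whereas the text counts from one.

```python
def diag_entry(A, l):
    """Return the l-th diagonal entry A_l (zero-based index)."""
    return A[l][l]

def drop_row_col(A, l):
    """Return A_{-l}: the matrix A with row l and column l removed."""
    n = len(A)
    return [[A[i][j] for j in range(n) if j != l] for i in range(n) if i != l]

# Hypothetical 3 x 3 diagonal matrix.
A = [[2.0, 0.0, 0.0],
     [0.0, 5.0, 0.0],
     [0.0, 0.0, 7.0]]
A_l = diag_entry(A, 1)
A_minus_l = drop_row_col(A, 1)
print(A_l, A_minus_l)
```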

[TABLE]

and greatly simplifies their estimation procedure. In our case, however, due to the potential presence of multi-way fixed effects, such decomposition is not available. Therefore, the theory of our nodewise regression needs to account for these fixed effects with different convergence rates simultaneously.

Lemma 4 (Nodewise Lasso for Nuisance Parameters).

Suppose Assumptions 3, 4, and 5 are satisfied and $\widehat{\Theta}$ is calculated following (3.13) with $\mu_{\text{node},l}=C\sqrt{NM\log a}$ for a $C>0$ . It holds uniformly over $l\in[k_{0}]$ that

[TABLE]

Proof.

The proof consists of three steps.

Step 1 First, under Assumption 3 and by the choice $\mu_{\text{node},l}=C\sqrt{NM\log a}$ , Lemma 3 gives that the regularized events $\|X_{-l}\zeta_{l}\|_{\infty}\leq\mu_{\text{node}}/2c$ , $\|D_{l}\zeta_{l}\|_{\infty}\leq\mu_{\text{node}}/2c\sqrt{N}$ , $\|D_{2}\zeta_{l}\|_{\infty}\leq\mu_{\text{node}}/2c\sqrt{M}$ occur with probability approaching one uniformly over $[k_{0}]$ . Using the arguments similar to those of Lemma 1, under Assumptions 4 and 5 (1) and (2), we have

[TABLE]

uniformly for $l\in[k_{0}]$ .

To find a bound for $\|\widehat{\phi}^{l}-\phi^{l}\|$ that holds uniformly over $[k_{0}]$ , note that

[TABLE]

by (B.5), where $\|A\|_{\infty}$ denotes the maximal element of a matrix $A$ . We now bound the second term on the right-hand side. Note that

[TABLE]

We want to show all three terms go to zero with $r=C\sqrt{\frac{\log a}{NM}}$ . Assumption 3 (1) and (3) imply

[TABLE]

Thus, with probability at least $1-C(\log(NM))^{-1}$ ,

[TABLE]

by Lemma 2. Similarly, with probability at least $1-C(\log(N\wedge M))^{-1}$,

[TABLE]

by Assumption 3 (2). Thus, with probability at least $1-C(\log(N\wedge M))^{-1}$ ,

[TABLE]

Since $s_{l}\sqrt{\frac{\log a}{N\wedge M}}=o(1)$ by Assumption 5 (4), we have

[TABLE]

uniformly in $l\in[k_{0}]$. Substituting into (B.8), we obtain

[TABLE]

uniformly in $l\in[k_{0}]$ . Since under Assumption 5(3), $\Lambda_{\min}(\Psi)>0$ and

[TABLE]

uniformly in $l\in[k_{0}]$ , we conclude that

[TABLE]

uniformly in $l\in[k_{0}]$. Thus, the triangle inequality and the definition of $Q$ together imply that, uniformly over $l\in[k_{0}]$,

[TABLE]

Step 2 We next bound $\max_{l\in[k_{0}]}|\widehat{\tau}_{l}^{2}-\tau_{l}^{2}|$. By the definition of $\widehat{\tau}_{l}$, the K.K.T. condition, and the decomposition $Z_{l}=Z_{-l}\phi^{l}+r_{l}+\zeta_{l}$, we obtain the following equality:

[TABLE]

Thus,

[TABLE]

It suffices to find bounds for each of the five terms in the last expression.

First we consider $(i)$ . Under Assumption 3 (1), we have

[TABLE]

for all $l\in[k_{0}]$ . Therefore, by Lemma 2 and Assumption 3 (1),

[TABLE]

with probability at least $1-C(\log(NM))^{-1}$ . It follows that

[TABLE]

Second, we consider $(iv)$ in (B.9). By the regularized events established in Step 1, we have

[TABLE]

Thus, by (B.6),

[TABLE]

follows.

We next consider $(ii)$ in (B.9). Note that $\|\phi^{l}\|_{1}=O(\sqrt{s_{l}})$ in Assumption 5(1) implies $\|Q_{-l}\phi^{l}\|_{1}=O(\sqrt{s_{l}})$ . Combining this and the regularized events as before, we have

[TABLE]

Now, we consider $(iii)$ in (B.9). Using Assumptions 4 and 5 (1) and (3), the fact that $Q^{-1}=\sqrt{NM}S^{-1}$ , and the definition of $\bar{\Psi}$ , we obtain

[TABLE]

Furthermore, (B.5) implies,

[TABLE]

Combining these two intermediate results, we have

[TABLE]

Finally, we consider the remaining terms in (B.9) that involve $r_{l}$ . Note that

[TABLE]

follows from Assumption 5 (2) and Assumption 3 (3). Also, under Assumptions 4 and 5 (1) and (2),

[TABLE]

with probability at least $1-o(1)$ . A similar argument under Assumptions 4 and 5 (1) and (2) shows that $\frac{|r_{l}^{\prime}Z_{l}(\widehat{\phi}^{l}-\phi^{l})|}{NM}=O\Big{(}\sqrt{\frac{s_{l}}{NM}}\Big{)}$ .

Combining all the results above, we obtain

[TABLE]

uniformly over $[k_{0}]$.

Step 3 Since $l\in[k_{0}]$ ,

[TABLE]

holds for each $(N,M)$ under Assumption 5 (3), where the first inequality follows from the discussion following (B.30) in the proof of Theorem 1 in Caner and Kock (2018). Therefore, $\widehat{\tau}^{2}_{l}$ is bounded away from zero in probability, and we have

[TABLE]

by Step 2.

Now, we bound $\max_{l\in[k_{0}]}\|\widehat{\Theta}_{l}-\Theta_{l}\|_{1}$ . Since $\|\phi^{l}\|_{1}=O(\sqrt{s_{l}})$ under Assumption 5 (1), we have

[TABLE]

The first and the third terms can be bounded by (B.11) and the second term can be bounded by (B.7). Therefore,

[TABLE]

Similar lines of argument, using Assumption 5 (1) and the bound on $\|\widehat{\phi}^{l}-\phi^{l}\|$ from Step 1, lead to

[TABLE]

Since $\|\Theta_{l}\|_{1}\leq\max_{l\in[k_{0}]}\frac{1}{\tau^{2}_{l}}+\max_{l\in[k_{0}]}\|\frac{\phi^{l}}{\tau^{2}_{l}}\|_{1}=O(\sqrt{s_{l}})$ by (B.10) and Assumption 5 (1), it follows that $\|\widehat{\Theta}_{l}\|_{1}=O_{p}(\sqrt{s_{l}})$ for all $l\in[k_{0}]$ . ∎

B.5 Sufficiency for Assumption 1 (i)

Lemma 5.

If Assumptions 2, 3, 4, and 5 are satisfied, then

[TABLE]

Proof.

Recall $\bar{\Psi}=S^{-1}Z^{\prime}ZS^{-1}$ and $Q=S/\sqrt{NM}$ . Also, if we let $\Gamma=ZS^{-1}$ , then $\bar{\Psi}=\Gamma^{\prime}\Gamma$ and

[TABLE]
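The identities recalled at the start of this proof, together with the relation $Z^{\prime}Z/NM=Q\bar{\Psi}Q$ used in Appendix A, can be verified numerically on a toy example. The sketch below assumes a diagonal scaling matrix $S$ with hypothetical positive entries and a small made-up $Z$; it is only a sanity check of the algebra, not an implementation of the estimator.

```python
import math

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical inputs: N * M = 4 rows, two columns, diagonal S.
N, M = 2, 2
Z = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 0.5]]
s = [2.0, 4.0]

S_inv = diag([1.0 / v for v in s])
Q = diag([v / math.sqrt(N * M) for v in s])           # Q = S / sqrt(NM)

ZtZ = matmul(transpose(Z), Z)
Psi_bar = matmul(matmul(S_inv, ZtZ), S_inv)           # S^{-1} Z'Z S^{-1}
Gamma = matmul(Z, S_inv)                              # Gamma = Z S^{-1}
Gram = matmul(transpose(Gamma), Gamma)                # Gamma' Gamma
QPQ = matmul(matmul(Q, Psi_bar), Q)                   # Q Psi_bar Q
ZtZ_scaled = [[v / (N * M) for v in row] for row in ZtZ]
print(Psi_bar, QPQ)
```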

Since $l\in[k_{0}]$, $Q_{ll}=1$. Let $\bar{\Psi}_{l}$ denote the $l$-th column of $\bar{\Psi}$. Using the K.K.T. condition for the nodewise lasso, we have

[TABLE]

Also using the K.K.T. condition, we have

[TABLE]

Using the property of the sub-gradient $\kappa_{l}$ , we have

[TABLE]

which is the same as

[TABLE]

since $Z_{l}-Z_{-l}\widehat{\phi}^{l}=Z\widehat{C}_{l}$ . Divide both sides by $\widehat{\tau}^{2}_{l}$ by using $\widehat{\Theta}_{l}=\widehat{C}_{l}/\widehat{\tau}_{l}^{2}$ to obtain

[TABLE]
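The relations $Z_{l}-Z_{-l}\widehat{\phi}^{l}=Z\widehat{C}_{l}$ and $\widehat{\Theta}_{l}=\widehat{C}_{l}/\widehat{\tau}_{l}^{2}$ can be illustrated with a small sketch. It assumes the usual nodewise-regression layout in which $\widehat{C}_{l}$ has a one in position $l$ and $-\widehat{\phi}^{l}$ in the remaining positions; this layout, the function name, and all numbers are assumptions made for illustration.

```python
def nodewise_vector(phi, l):
    """Vector with 1.0 at position l and -phi in the remaining positions."""
    c = [-v for v in phi]
    c.insert(l, 1.0)
    return c

# Hypothetical design matrix, target column, coefficients, and tau^2.
Z = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 3.0],
     [2.0, 0.0, 1.0]]
l = 1
phi_hat = [0.4, -0.2]
tau2_hat = 0.8

C_hat = nodewise_vector(phi_hat, l)

# Left side: Z C_hat. Right side: Z_l - Z_{-l} phi_hat, computed row by row.
lhs = [sum(row[k] * C_hat[k] for k in range(len(row))) for row in Z]
rhs = [row[l] - sum(z * f for z, f in zip(row[:l] + row[l + 1:], phi_hat))
       for row in Z]

Theta_hat = [v / tau2_hat for v in C_hat]
print(lhs, rhs, Theta_hat)
```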

With some rewriting

[TABLE]

where $S_{-l}$ is $S$ with both the $l$ -th column and the $l$ -th row removed. $Q_{-l}$ is defined similarly. Applying Lemma 4 under Assumptions 3, 4, and 5, we have $1/\widehat{\tau}_{l}^{2}=O_{p}(1)$ . Therefore, by (B.12) and (B.13),

[TABLE]

Finally, Lemma 1 and Lemma 3 with $\mu=C\sqrt{(NM)\log a}$ under Assumptions 2, 3, and 4 together imply

[TABLE]

as claimed. (Note that Lemma 1, as it is stated, requires Assumptions 2, 7, 8, and 9. While Assumption 2 is directly invoked by the statement of Lemma 5, Assumption 7 is implied by Assumption 3 through Lemma 3, Assumption 8 is trivially satisfied under the current setting with $\widehat{\Upsilon}=I$, and Assumption 9 is implied by Assumption 4.) ∎

B.6 Sufficiency for Assumption 1 (ii)

Lemma 6.

Suppose that Assumptions 2, 3, 4, and 5 are satisfied. Then,

[TABLE]

Proof.

Note that

[TABLE]

by Assumption 5 (1) and (4) and Lemma 4 under Assumptions 3, 4, and 5. Therefore,

[TABLE]

follows under Assumption 2 (3). ∎

B.7 Sufficiency for Assumption 1 (iii)

Lemma 7.

Suppose that Assumptions 3, 4, 5 and 6 are satisfied. Then,

[TABLE]

Proof.

First we show $\frac{1}{\sqrt{NM}}\Theta_{l}^{\prime}Z^{\prime}\varepsilon\leadsto N(0,V_{ll})$ . Note that we have

[TABLE]

and

[TABLE]

under Assumption 6. Furthermore, by Assumption 3

[TABLE]

where $q>4$, the first inequality follows from a dual norm inequality, the second and the third from the fact that $\|\Theta_{l}\|_{1}\lesssim\sqrt{s_{l}}$ and $\|\Theta_{l}\|_{0}\leq s_{l}$ implied by Assumption 5 (1), the fourth from the Cauchy-Schwarz inequality, the fifth from Assumption 3, and the last equality from Assumption 5 (4). This verifies Lyapunov’s condition. Thus, we have $\frac{1}{\sqrt{NM}}\Theta_{l}^{\prime}Z^{\prime}\varepsilon\leadsto N(0,V_{ll})$.

Now, we show $|\frac{1}{\sqrt{NM}}(\widehat{\Theta}_{l}-\Theta_{l})^{\prime}Z^{\prime}\varepsilon|=o_{p}(1)$. Invoking Lemmata 3 and 4 under Assumptions 3, 4, and 5, we have

[TABLE]

Combining these results concludes $\frac{1}{\sqrt{NM}}\widehat{\Theta}_{l}^{\prime}Z^{\prime}\varepsilon\leadsto N(0,V_{ll})$ . ∎

B.8 Empirical Pre-Sparsity

The following lemma is a minor modification of Lemma 8 in Belloni et al. (2012).

Lemma 8 (Empirical Pre-sparsity).

If Assumptions 2, 3, 4, and 5 are satisfied, then we have

[TABLE]

where $\widehat{s}_{l}=\|\widehat{\phi}^{l}\|_{0}$ and $\widehat{s}=\|\widehat{\boldsymbol{\eta}}\|_{0}$ .
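The sparsity counts in Lemma 8, and the quantity $\hat{m}_{l}=|\hat{T}_{l}\setminus T_{l}|$ used in the proof that follows, are simple set operations on supports. The sketch below uses made-up coefficient vectors and a hypothetical helper `support` to show how they are computed.

```python
def support(v, tol=0.0):
    """Indices of entries whose absolute value exceeds tol."""
    return {k for k, x in enumerate(v) if abs(x) > tol}

# Hypothetical true and estimated nodewise coefficient vectors.
phi = [0.8, 0.0, 0.0, -0.3, 0.0]
phi_hat = [0.7, 0.0, 0.1, -0.2, 0.05]

T_l = support(phi)            # support of the true vector
T_hat_l = support(phi_hat)    # support of the estimate

s_hat_l = len(T_hat_l)            # l0 "norm" of the estimate
m_hat_l = len(T_hat_l - T_l)      # selected coordinates outside the true support
print(s_hat_l, m_hat_l)
```

Here the estimate selects four coordinates, two of which lie outside the true support.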

Proof.

Let $\hat{m}_{l}=\left|\hat{T}_{l}\setminus T_{l}\right|$, where $T_{l}=\text{supp}(\phi^{l})$ and $\hat{T}_{l}=\text{supp}(\hat{\phi}^{l})$. From the K.K.T. condition, we have

[TABLE]

for all $l\in[k_{0}]$ and $k\in\hat{T}_{l}\setminus T_{l}$ . Thus,

[TABLE]

We bound the three terms in the last expression separately. First, Lemma 3 under Assumption 3 yields

[TABLE]

with probability at least $1-C(\log(N\wedge M))^{-1}$ . Second,

[TABLE]

follows by Assumptions 4 and 5. Therefore,

[TABLE]

Finally, by Lemma 4 under Assumptions 3, 4, and 5, we obtain

[TABLE]

with probability at least $1-C(\log(N\wedge M))^{-1}$ , where the last inequality is due to Assumption 4 and Lemma 4. Using these bounds and (B.14), we obtain

[TABLE]

with probability $1-o(1)$ . Under Assumptions 2, 3, 4, and 5, the result for $\hat{s}$ can be established following analogous arguments. ∎

Bibliography


Balazsi, L., L. Matyas, and T. Wansbeek (2017): “Fixed Effects Models,” in The Econometrics of Multi-Dimensional Panels, ed. by L. Matyas, Cham: Springer, chap. 1, 1–34.
Baltagi, B. H. and G. Bresson (2017): “Modelling Housing Using Multi-dimensional Panel Data,” in The Econometrics of Multi-Dimensional Panels, ed. by L. Matyas, Cham: Springer, chap. 12, 349–376.
Baltagi, B. H., P. H. Egger, and K. Erhardt (2017): “The Estimation of Gravity Models in International Trade,” in The Econometrics of Multi-Dimensional Panels, ed. by L. Matyas, Cham: Springer, chap. 11, 323–348.
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012): “Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain,” Econometrica, 80, 2369–2429.
Belloni, A., V. Chernozhukov, D. Chetverikov, and Y. Wei (2018): “Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework,” The Annals of Statistics, 46, 3643–3675.
Belloni, A., V. Chernozhukov, and C. Hansen (2014): “Inference on treatment effects after selection among high-dimensional controls,” The Review of Economic Studies, 81, 608–650.
Belloni, A., V. Chernozhukov, C. Hansen, and D. Kozbur (2016): “Inference in high-dimensional panel models with an application to gun control,” Journal of Business & Economic Statistics, 34, 590–605.
Bickel, P. J., Y. Ritov, and A. B. Tsybakov (2009): “Simultaneous analysis of Lasso and Dantzig selector,” The Annals of Statistics, 37, 1705–1732.


Post-Selection Inference in Three-Dimensional Panel Data (first arXiv version: March 30, 2019)
