Detecting Identification Failure in Moment Condition Models

Jean-Jacques Forneron

arXiv:1907.13093·econ.EM·October 4, 2023

TL;DR

This paper introduces a new method to detect identification failure in moment condition models using a quasi-Jacobian matrix, enabling robust inference regardless of the identification strength.

Contribution

It proposes a novel quasi-Jacobian matrix approach and a simple chi-squared test for detection of identification failure in moment models.

Findings

01

The quasi-Jacobian is asymptotically singular when identification fails.

02

The test works for strong, semi-strong, and weak identification.

03

Monte Carlo simulations and empirical application validate the method.

Abstract

This paper develops an approach to detect identification failure in moment condition models. This is achieved by introducing a quasi-Jacobian matrix computed as the slope of a linear approximation of the moments on an estimate of the identified set. It is asymptotically singular when local and/or global identification fails, and equivalent to the usual Jacobian matrix which has full rank when the model is point and locally identified. Building on this property, a simple test with chi-squared critical values is introduced to conduct subvector inferences allowing for strong, semi-strong, and weak identification without a priori knowledge about the underlying identification structure. Monte-Carlo simulations and an empirical application to the Long-Run Risks model illustrate the results.

Tables

Table 1: CAPM - VAR parameters used in the simulations

Rank Failure	Near Rank Failure	Full Rank
$μ_{RF} = {(0.018, 0.013)}^{'}$	$μ_{NRF} = {(0.021, 0.04)}^{'}$	$μ_{FR} = {(0.00, 0.00)}^{'}$
$Φ_{RF} = (\begin{matrix} 0 & 0 \\ 0 & 0 \end{matrix})$	$Φ_{NRF} = (\begin{matrix} - 0.161 & 0.017 \\ 0.414 & 0.117 \end{matrix})$	$Φ_{FR} = (\begin{matrix} - 0.5 & 0 \\ 0 & - 0.5 \end{matrix})$
$Λ_{RF} = (\begin{matrix} 0.0012 & 0.0017 \\ 0.0017 & 0.0146 \end{matrix})$	$Λ_{NRF} = (\begin{matrix} 0.0012 & 0.00177 \\ 0.00177 & 0.014 \end{matrix})$	$Λ_{FR} = (\begin{matrix} 0.01 & 0 \\ 0 & 0.01 \end{matrix})$

Table 2: CAPM – rejection rates, frequency for detecting identification failure

		Rank Failure					Near Rank Failure					Full Rank
n		AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$	AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$	AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$
100	$δ$	0.01	0.01	0.03	0.02	1.00	0.01	0.01	0.03	0.02	1.00	0.04	0.02	0.05	0.07	0.14
100	$γ$	0.05	0.02	0.05	0.00	0.00	0.05	0.02	0.05	0.00	0.00	0.04	0.01	0.04	0.06	0.00
250	$δ$	0.02	0.02	0.05	0.09	1.00	0.02	0.02	0.03	0.07	1.00	0.04	0.02	0.04	0.05	0.01
250	$γ$	0.05	0.02	0.05	0.00	0.00	0.05	0.02	0.05	0.00	0.00	0.04	0.02	0.04	0.06	0.00
500	$δ$	0.02	0.02	0.04	0.17	1.00	0.01	0.01	0.04	0.08	1.00	0.05	0.02	0.05	0.05	0.00
500	$γ$	0.04	0.02	0.04	0.00	0.00	0.04	0.02	0.04	0.00	0.00	0.06	0.02	0.06	0.05	0.00
1000	$δ$	0.02	0.02	0.05	0.22	1.00	0.02	0.02	0.04	0.05	0.98	0.05	0.02	0.05	0.05	0.00
1000	$γ$	0.05	0.02	0.05	0.00	0.00	0.04	0.02	0.04	0.00	0.00	0.05	0.02	0.05	0.05	0.00

Table 3: Long-Run Risks: singular values of Jacobian and quasi-Jacobian

	$λ_{1}$	$λ_{2}$	$λ_{3}$	$λ_{4}$	$λ_{5}$	$λ_{6}$	$λ_{7}$	$λ_{8}$	$λ_{9}$	$λ_{10}$	$λ_{11}$	$λ_{12}$
$\bar{V}_{n}^{-1/2} \partial_{\theta} \bar{g}_{n}(\hat{\theta}_{n}) \Sigma_{n}^{-1/2}$	$8 \cdot 10^{6}$	$4 \cdot 10^{6}$	$7 \cdot 10^{5}$	$1 \cdot 10^{5}$	$8 \cdot 10^{4}$	255	22	1.68	0.30	0.04	$< 10^{-2}$	$< 10^{-2}$
$\bar{V}_{n}^{-1/2} B_{n,\infty} \Sigma_{n}^{-1/2}$	$2 \cdot 10^{8}$	$1 \cdot 10^{7}$	$1 \cdot 10^{6}$	$4 \cdot 10^{5}$	$2 \cdot 10^{4}$	208	0.94	0.42	0.06	0.01	$< 10^{-2}$	$< 10^{-2}$
$\bar{V}_{n}^{-1/2} B_{n,\infty} P_{\theta_{1}}^{\perp} \Sigma_{n}^{-1/2} P_{\theta_{1}}^{\perp}$	$2 \cdot 10^{8}$	$1 \cdot 10^{7}$	$1 \cdot 10^{6}$	$4 \cdot 10^{5}$	$2 \cdot 10^{4}$	208	0.95	0.06	0.04	0.01	0.00	0.00

Table 4 (numbered H4 in the paper): CAPM (larger $\kappa_{n}$) – size of 95% CIs for $\delta$ and $\gamma$, frequency for detecting identification failure

		Rank Failure					Near Rank Failure					Full Rank
n		AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$	AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$	AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$
100	$δ$	0.01	0.01	0.03	0.02	1.00	0.01	0.01	0.03	0.02	1.00	0.02	0.02	0.05	0.07	0.97
100	$γ$	0.05	0.02	0.05	0.00	0.00	0.05	0.02	0.05	0.00	0.00	0.04	0.01	0.04	0.06	0.00
250	$δ$	0.02	0.02	0.05	0.09	1.00	0.02	0.02	0.03	0.07	1.00	0.03	0.02	0.04	0.05	0.35
250	$γ$	0.05	0.02	0.05	0.00	0.00	0.05	0.02	0.05	0.00	0.00	0.04	0.02	0.04	0.06	0.00
500	$δ$	0.02	0.02	0.04	0.17	1.00	0.01	0.01	0.04	0.08	1.00	0.05	0.02	0.05	0.05	0.00
500	$γ$	0.04	0.02	0.04	0.00	0.00	0.04	0.02	0.04	0.00	0.00	0.06	0.02	0.06	0.05	0.00
1000	$δ$	0.02	0.02	0.05	0.22	1.00	0.02	0.02	0.04	0.05	1.00	0.05	0.02	0.05	0.05	0.00
1000	$γ$	0.05	0.02	0.05	0.00	0.00	0.04	0.02	0.04	0.00	0.00	0.05	0.02	0.05	0.05	0.00

Table 5 (numbered H5 in the paper): CAPM (Just-Identified) – size of 95% CIs for $\delta$ and $\gamma$, frequency for detecting identification failure

		Rank Failure					Near Rank Failure					Full Rank
n		AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$	AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$	AR₁	AR₂	AR₃	$t_{n}$	$< {\underline{λ}}_{n}$
100	$δ$	0.01	0.01	0.04	0.01	1.00	0.01	0.01	0.04	0.01	1.00	0.04	0.01	0.05	0.05	0.38
100	$γ$	0.05	0.02	0.05	0.00	0.00	0.05	0.01	0.05	0.00	0.00	0.05	0.01	0.05	0.06	0.00
250	$δ$	0.02	0.02	0.04	0.03	1.00	0.01	0.01	0.03	0.05	1.00	0.05	0.02	0.05	0.04	0.04
250	$γ$	0.05	0.02	0.05	0.00	0.00	0.04	0.01	0.04	0.00	0.00	0.05	0.01	0.05	0.05	0.00
500	$δ$	0.02	0.02	0.05	0.08	1.00	0.01	0.01	0.04	0.11	1.00	0.05	0.01	0.05	0.04	0.00
500	$γ$	0.05	0.02	0.05	0.00	0.00	0.05	0.01	0.05	0.00	0.00	0.06	0.02	0.06	0.05	0.00
1000	$δ$	0.01	0.01	0.05	0.09	1.00	0.01	0.01	0.04	0.14	1.00	0.06	0.01	0.06	0.05	0.00
1000	$γ$	0.05	0.01	0.05	0.00	0.00	0.04	0.01	0.04	0.00	0.00	0.05	0.01	0.05	0.05	0.00

Table 6 (numbered H6 in the paper): Long-Run Risks: singular values of Jacobian and quasi-Jacobian, larger $\kappa_{n}$

	$λ_{1}$	$λ_{2}$	$λ_{3}$	$λ_{4}$	$λ_{5}$	$λ_{6}$	$λ_{7}$	$λ_{8}$	$λ_{9}$	$λ_{10}$	$λ_{11}$	$λ_{12}$
$\bar{V}_{n}^{-1/2} \partial_{\theta} \bar{g}_{n}(\hat{\theta}_{n}) \Sigma_{n}^{-1/2}$	$1 \cdot 10^{6}$	$6 \cdot 10^{5}$	$2 \cdot 10^{5}$	$3 \cdot 10^{4}$	$1 \cdot 10^{4}$	145	20	0.61	0.20	0.02	$< 10^{-2}$	$< 10^{-2}$
$\bar{V}_{n}^{-1/2} B_{n,\infty} \Sigma_{n}^{-1/2}$	$5 \cdot 10^{6}$	$1 \cdot 10^{6}$	$4 \cdot 10^{5}$	$7 \cdot 10^{4}$	$1 \cdot 10^{3}$	169	0.46	0.33	0.12	0.01	$< 10^{-2}$	$< 10^{-2}$
$\bar{V}_{n}^{-1/2} B_{n,\infty} P_{\theta_{1}}^{\perp} \Sigma_{n}^{-1/2} P_{\theta_{1}}^{\perp}$	$5 \cdot 10^{6}$	$1 \cdot 10^{6}$	$4 \cdot 10^{5}$	$7 \cdot 10^{4}$	$1 \cdot 10^{3}$	169	0.43	0.32	0.02	$< 10^{-2}$	0.00	0.00

Equations

H_{0}:\theta_{1}=\theta_{10}\quad\text{vs.}\quad H_{1}:\theta_{1}\neq\theta_{10}.

g(\theta_{0},\gamma_{0})\overset{def}{=}\mathbb{E}_{\gamma_{0}}(\bar{g}_{n}(\theta_{0}))=0,

\hat{\theta}_{n}=\text{argmin}_{\theta\in\Theta}\,\bar{g}_{n}(\theta)^{\prime}W_{n}(\theta)\bar{g}_{n}(\theta),

(A_{n,\infty},B_{n,\infty})=\text{argmin}_{A,B}\left(\sup_{b\in\{1,\dots,B\}}\|\overline{g}_{n}(\theta_{b})-A-B\theta_{b}\|\hat{K}_{n}(\theta_{b})\right),

(\mu_{n},\Sigma_{n})=\text{argmin}_{\Sigma,\mu}\left(\sup_{b\in\{1,\dots,B\}}(\log|\Sigma|+\|\theta_{b}-\mu\|^{2}_{\Sigma^{-1}})\hat{K}_{n}(\theta_{b})\right),

(A_{n,\infty},B_{n,\infty})

AR_{n}(\theta_{10})=\inf_{\theta_{2}\in\Theta_{2}}n\left(\bar{g}_{n}(\theta_{10},\theta_{2})^{\prime}\hat{V}_{n}^{-1}(\theta_{10},\theta_{2})\bar{g}_{n}(\theta_{10},\theta_{2})\right),

\lambda_{jn}=\lambda_{j}\left(P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}B_{n,\infty}^{\prime}\bar{V}_{n}^{-1}B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}\right)^{1/2},

\hat{d}_{n}=\#\{j\in\{d_{\theta_{1}}+1,\dots,d_{\theta}\},\lambda_{jn}>\underline{\lambda}_{n}\},

(A_{\infty},B_{\infty})=\lim_{\kappa\to 0}\left(\text{argmin}_{A,B}\left(\sup_{\|g(\theta,\gamma_{0})\|_{W}\leq\kappa}\|g(\theta,\gamma_{0})-A-B\theta\|\right)\right),

y_{t}=\sigma(e_{t}+\vartheta e_{t-1}),\quad e_{t}\sim iid(0,1),

g(\theta):=\mathbb{E}\left(y_{t}^{2}-\sigma^{2}(1+\vartheta^{2}),\quad y_{t}y_{t-1}-\vartheta\sigma^{2}\right)^{\prime}=0.

(A_{\kappa,\infty},B_{\kappa,\infty})=\text{argmin}_{A,B}\left(\sup_{\theta\in\Theta,\|g(\theta)\|\leq\kappa}\|g(\theta)-A-B\theta\|\right),

\|B_{\kappa,\infty}v_{2}\|\leq\frac{2\kappa}{\|\theta_{0}^{1}-\theta_{0}^{2}\|}\to 0,\text{ as }\kappa\searrow 0.

\limsup_{n\to\infty}\sup_{\gamma\in\Gamma,\theta=(\theta_{10}^{\prime},\theta_{2}^{\prime})^{\prime}\in\overline{\Theta}}\mathbb{P}_{\gamma}\left(AR_{n}(\theta_{10})>\chi^{2}_{1-\alpha}(d_{g}-\hat{d}_{n})\right)\leq\alpha.

\inf_{\theta\in\Theta,\|\theta-\theta_{n}\|\geq\varepsilon}\|g(\theta,\gamma_{n})\|\geq\delta(\gamma_{n})h(\varepsilon).

\inf_{\theta\in\Theta,\|\theta-\theta_{n}\|\geq\varepsilon}\|g(\theta,\gamma_{n})\|\leq C\delta(\gamma_{n})h(\varepsilon).

AR_{n}(\theta_{1n})\overset{d}{\to}\chi^{2}_{d_{g}-d_{\theta_{2}}}.

\inf_{\|\beta_{1}-\beta_{1n}\|\geq\varepsilon,\beta_{2}}\|g(\beta_{1},\beta_{2},\gamma_{n})\|

\inf_{d(\beta_{2},B_{2}^{0})\geq\varepsilon,\beta_{1}}\|g(\beta_{1},\beta_{2},\gamma_{n})\|

\sup_{\beta_{2}\in B_{2}^{0}}\|g(\beta_{1n},\beta_{2},\gamma_{n})\|

AR_{n}(\theta_{1n})\leq\inf_{\phi}\|\bar{g}_{n}(\theta_{1n},\phi,\beta_{22n})\|^{2}_{V_{n}^{-1}}\overset{d}{\to}\chi^{2}_{d_{g}-d_{\phi}}.

[B_{n,\infty}-\partial_{\theta}g(\theta_{n},\gamma_{n})]H_{n}

A_{n,\infty}+B_{n,\infty}\theta_{n}-\bar{g}_{n}(\theta_{n})

\lambda_{\min}(B_{n,\infty}^{\prime}B_{n,\infty})=\lambda_{\min}\Big(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})\Big)\left(1+o_{p}(1)\right).

\sum_{j=1}^{d_{\beta_{2}}}\lambda_{j}(B_{n,\infty}^{\prime}B_{n,\infty})\leq O_{p}(\kappa_{n}^{2}).

\sum_{j=1}^{d_{\theta_{1}}+d_{\beta_{22}}}\lambda_{j}(P_{\theta_{1}}^{\perp}B_{n,\infty}^{\prime}B_{n,\infty}P_{\theta_{1}}^{\perp})\leq O_{p}(\kappa_{n}^{2}).

\limsup_{n\to\infty}\sup_{\gamma\in\Gamma,\theta=(\theta_{10}^{\prime},\theta_{2}^{\prime})^{\prime}\in\overline{\Theta}}\mathbb{P}_{\gamma}\left(AR_{n}(\theta_{10})>\chi^{2}_{d_{\theta}-\hat{d}_{n}}(1-\alpha)\right)\leq\alpha.

\lim_{n\to\infty}\mathbb{P}_{\gamma_{n}}\left(AR_{n}(\theta_{1n})>\chi^{2}_{d_{\theta}-\hat{d}_{n}}(1-\alpha)\right)=\alpha.

\overline{g}_{n}(\theta)=\frac{1}{n}\sum_{t=1}^{n}\left[\delta R_{t+1}(C_{t+1}/C_{t})^{-\gamma}-1\right]Z_{t},

x_{1,t}=\rho x_{1,t-1}+\phi_{e}f(x_{2,t-1})e_{t},\quad x_{2,t}=\sigma^{2}+\nu(x_{2,t-1}-\sigma^{2})+\sigma_{w}w_{t},

g_{t}=\mu+x_{1,t-1}+f(x_{2,t-1})\eta_{t},\quad g_{d,t}=\mu_{d}+\phi x_{1,t-1}+\phi_{d}f(x_{2,t-1})u_{t},

Full text

Detecting Identification Failure in Moment Condition Models

Jean-Jacques Forneron, Department of Economics, Boston University, 270 Bay State Road, Boston, MA 02215 USA.

Email: [email protected], Website: http://jjforneron.com.

I would like to thank Serena Ng for discussions that initiated this project. I thank Francesca Molinari for suggestions that greatly improved this paper. I also greatly benefited from comments and discussions with Tim Christensen, Pavel Cizek, Greg Cox, Ivàn Fernàndez-Val, Hiro Kaido, Nour Meddahi, Arthur Lewbel, Demian Pouzo, Zhongjun Qu, Eric Renault, Yichong Zhang and the participants of the BU-BC econometric workshop, the seminar participants at Brown, Chicago, CREST, NUS, NYU, UC Berkeley, Université de Montréal, University of Rochester, SMU, Toulouse School of Economics and conferences. I would also like to thank Joachim Grammig for kindly sharing his replication files for the long-run risks model.

Abstract

This paper develops an approach to detect identification failure in moment condition models. This is achieved by introducing a quasi-Jacobian matrix computed as the slope of a linear approximation of the moments on an estimate of the identified set. It is asymptotically singular when local and/or global identification fails, and equivalent to the usual Jacobian matrix which has full rank when the model is globally and locally identified. Building on this property, a simple test with chi-squared critical values is introduced to conduct subvector inferences allowing for strong, semi-strong, and weak identification without a priori knowledge about the underlying identification structure. Monte-Carlo simulations and an empirical application to the Long-Run Risks model illustrate the results.

JEL Classification: C11, C12, C13, C32, C36.

Keywords: Asset Pricing, Uniform Inference, Global Identification, Indirect Inference.

1 Introduction

The Generalized Method of Moments (GMM) of Hansen and Singleton (1982) is a powerful estimation framework which does not require the model to be fully specified parametrically. Under regularity conditions, the estimates are consistent and asymptotically Gaussian. In particular, the moments should uniquely identify the finite-dimensional parameters. This is very difficult to verify in practice and, as noted in Newey and McFadden (1994), is often assumed. Yet, when identification fails or nearly fails, the Central Limit Theorem provides a poor finite sample approximation for the distribution of the estimates. This has motivated a vast amount of research on tests which are robust to identification failure. An empirically relevant problem, which remains less explored, is that of determining, for a given set of estimating moments, whether local and global identification actually hold.

The contribution of this paper is two-fold: first, it introduces a quasi-Jacobian matrix which is singular under both local (first-order) and global identification failure and is informative about the coefficients involved in the identification failure. This is the main contribution of the paper as it provides an approach similar to Cragg and Donald (1993) and Stock and Yogo (2005) but in a non-linear setting. Second, the information is used to construct an identification robust subvector test which does not require a priori knowledge of the identification structure. The test is asymptotically non-conservative under strong identification. It is asymptotically efficient for strongly just-identified models.

The quasi-Jacobian matrix is the best linear approximation of the sample moment function over a region of the parameters where these moments are close to zero. To find the best linear approximation, a sup-norm (or $\ell_{\infty}$ -norm) loss is used to minimize the largest deviation from the linear approximation. This is known as a Chebyshev approximation problem which can be solved fairly quickly using convex optimization software. In the population, the quasi-Jacobian has full rank if, and only if, the parameters are both globally and locally identified. When either global or local identification fails, it is singular in all directions associated with the identification failure. (Non)-singularity of the quasi-Jacobian can be used to check whether identification holds numerically when it is not feasible analytically.
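As a sketch of why convex optimization software applies, the sup-norm fit can be written in epigraph form. The notation below is an assumption made for illustration: $\theta_{1},\dots,\theta_{N}$ are generic evaluation points in the region where the moments are close to zero ($N$ is used for their number to avoid a clash with the slope matrix $B$), $t$ is an auxiliary scalar, and the kernel weights used in the procedure are omitted.

```latex
\begin{aligned}
\min_{A,\,B,\,t}\quad & t \\
\text{subject to}\quad & \|\bar{g}_{n}(\theta_{b}) - A - B\theta_{b}\| \leq t, \qquad b = 1,\dots,N.
\end{aligned}
```

The objective is linear in $(A,B,t)$ and each constraint bounds the Euclidean norm of an affine function of $(A,B)$ by $t$, so this is a second-order cone program. If the Euclidean norm is replaced by the largest absolute component, every constraint becomes a pair of linear inequalities per moment and the problem reduces to a linear program.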

The asymptotic behaviour of the quasi-Jacobian matrix is studied under three identification regimes: strong, semi-strong, and weak (or set) identification. Under strong identification, the moment conditions are informative and have a unique solution; under semi-strong identification, they are less informative but sufficiently so for estimates to be consistent and asymptotically Gaussian. Antoine and Renault (2009) and Andrews and Cheng (2012) showed that, under (semi)-strong identification, standard inference methods such as the t-test with standard normal critical values are asymptotically valid. [Footnote 1: The term (semi)-strong will refer to cases where identification can be either strong or semi-strong. Antoine and Renault (2009) further distinguish between nearly-strong and nearly-weak identification. Under the latter, the limiting distribution may be non-Gaussian. Here, when this is the case, it will be referred to as higher-order local identification.] Under weak and set identification, the moments are insufficiently informative compared to sampling uncertainty and multiple distant solutions to the moment conditions appear plausible, even in large samples, so that the parameters cannot be consistently estimated and standard inference methods are not asymptotically valid. The Supplement also considers higher-order local identification, where the solution is unique but not locally identified; it can be consistently estimated but with a non-Gaussian limiting distribution. Under (semi)-strong identification, the quasi-Jacobian is shown to be asymptotically equivalent to the usual Jacobian: after re-scaling, it is asymptotically non-singular. Under higher-order and weak identification, the quasi-Jacobian is asymptotically singular with eigenvalues vanishing in directions where identification fails. It is thus informative about the presence of identification failures and about which directions are not identified.

Building on these results, this paper constructs a simple test procedure for subvector hypotheses on the parameters $\theta=(\theta_{1}^{\prime},\theta_{2}^{\prime})^{\prime}\in\mathbb{R}^{d_{\theta}}$ of the form:

$\displaystyle H_{0}:\theta_{1}=\theta_{10}\quad\text{vs.}\quad H_{1}:\theta_{1}\neq\theta_{10}.$

(1)

Subvector inference as described in (1) is quite prevalent in empirical work where only a few structural parameters $\theta_{1}$ are typically of interest. The remaining $\theta_{2}$ nuisance parameters describe other features of the data generating process needed for estimation. For instance, in the empirical application only $2$ preference parameters are of interest while the remaining $10$ coefficients parameterize the law of motion for consumption and dividends which is not of immediate interest. The paper relies on the Anderson and Rubin (1949, AR) test statistic for simplicity. The critical values take the form $\chi^{2}_{d_{g}-d}$ where $d_{g}$ is the number of moments and $d$ is determined using an Identification Category Selection (ICS) procedure based on the singular values of the quasi-Jacobian matrix. This is a projection inference procedure where the ICS step estimates the number of (semi)-strongly identified nuisance parameters to reduce the degrees of freedom.
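The way the ICS step feeds into the critical value can be illustrated with a minimal sketch. It assumes the singular values of the (projected, re-scaled) quasi-Jacobian have already been computed; the function name `ics_critical_value`, the cutoff and the numbers in the usage lines are hypothetical and are not taken from the paper. Singular values that are exactly zero (such as those created by projecting out $\theta_{1}$) never exceed a positive cutoff, so they do not enter the count.

```python
import numpy as np
from scipy.stats import chi2


def ics_critical_value(singular_values, cutoff, d_g, alpha=0.05):
    """Return (d_hat, critical value) for a chi-squared(d_g - d_hat) test.

    d_hat counts the singular values strictly above the cutoff, i.e. the
    directions treated as (semi)-strongly identified in this sketch.
    """
    sv = np.asarray(singular_values, dtype=float)
    d_hat = int(np.sum(sv > cutoff))
    df = d_g - d_hat
    if df <= 0:
        raise ValueError("need d_g - d_hat > 0 degrees of freedom")
    return d_hat, float(chi2.ppf(1.0 - alpha, df))


# Hypothetical inputs: 4 nuisance directions, 6 moments, cutoff of 1.0.
d_hat, crit = ics_critical_value([50.0, 3.0, 0.2, 0.01], cutoff=1.0, d_g=6)
# Full projection would use all d_g degrees of freedom instead.
crit_projection = float(chi2.ppf(0.95, 6))
```

In this toy input two directions clear the cutoff, so the critical value uses 4 rather than 6 degrees of freedom and is smaller than the full projection value, which is the source of the power gain described above.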

Monte-Carlo simulations illustrate the results for a simple consumption-based asset pricing model. In the empirical application, the procedure is used to conduct joint inference on risk-aversion and the inverse elasticity of substitution in the long-run risks model of Bansal and Yaron (2004). The results suggest that several nuisance parameters are weakly identified but not all; some are (semi)-strongly identified. This implies that standard inferences based on t or Wald statistics are not asymptotically valid and full projection inference is valid, but conservative. Given the number of parameters in the application, the standard approach of performing test inversion using a grid search is very computationally demanding. Instead, an adaptive sampling procedure based on the Population Monte Carlo (PMC) principle draws uniformly on level sets of the objective function. This makes it possible to conduct robust inference on more complex models like the empirical application: the quasi-Jacobian and 5,000 uniform draws on the confidence set are computed in about 4 hours on a desktop computer.

Structure of the Paper

After a review of the literature and an overview of the notation, Section 2 introduces the setting and the procedure, and provides more details about the quasi-Jacobian, the test, and the identification regimes. Section 3 derives the asymptotic behaviour of the quasi-Jacobian matrix, and Section 4 derives results for the test. Section 5 gives Monte-Carlo evidence for the results, and Section 6 presents the empirical application. Appendices A and B provide proofs for the main results. The Supplement includes sample R code to compute the quasi-Jacobian and for inference, a description of the PMC algorithm used to generate draws, and additional results for higher-order identification.

Related Literature

The literature on the identification of economic models is quite vast, and an extensive review is given in Lewbel (2018). Within this literature, this paper mainly relates to three topics: local and global identification of finite-dimensional parameters in the population, detection of identification failure in finite samples, and identification robust inference.

Koopmans and Reiersol (1950) provide one of the earliest general formulations of the identification problem at the population level. To paraphrase the authors, the main problem is to determine whether the distribution of the data, assumed to be generated from a given class of models, is consistent with a unique set of structural parameters. In the likelihood setting, Fisher (1967), Rothenberg (1971) introduced sufficient conditions for local and global identification. Komunjer (2012) provides weaker global identification conditions for GMM.

In linear models, global identification amounts to a rank condition on the slope of the moments. This insight was used in pre-testing linear IV models for identification failure using a first-stage F-statistic or rank tests, Cragg and Donald (1993), Stock and Yogo (2005), Kleibergen and Paap (2006). Pre-tests based on the null of strong identification appear in Hahn and Hausman (2002) for linear IV and Inoue and Rossi (2011), Bravo et al. (2012) for non-linear models. Pre-testing for strong identification could make size control difficult when the pre-test has low power. For non-linear models, Wright (2003) uses a rank test and Antoine and Renault (2020) a distorted J-statistic to detect local identification failure. Arellano et al. (2012) develop a test for underidentification of a single coefficient.

Given the impact of (near) identification failure on standard inferences, a large body of literature has developed identification robust tests. Much of the literature is concerned with inference on the full parameter vector, e.g. Anderson and Rubin (1949), Stock and Wright (2000), Kleibergen (2005), Andrews and Mikusheva (2016). Projection inference can be used to conduct subvector inference from these tests (Dufour, 1997). Alternatively, Bonferroni methods combined with a $C(\alpha)$ test can be used (Chaudhuri and Zivot, 2011; Andrews, 2017). For homoskedastic linear IV models, Guggenberger et al. (2012) propose critical values for a subset Anderson-Rubin test which improve power over full projection inference. In the same setting, Guggenberger et al. (2019) propose a data-driven choice of critical values based on a measure of identification strength of the nuisance parameters, and Kleibergen (2021) considers subvector conditional Likelihood-Ratio inference. This paper relies on the Anderson-Rubin statistic for inference, which is the simplest to implement. More powerful test statistics exist, such as the conditional quasi-Likelihood Ratio. The main challenge there is in computing the critical values by simulation, which requires repeatedly minimizing non-linear and potentially multi-modal objective functions. [Footnote 2: This is difficult for non-convex problems, see e.g. Nemirovsky and Yudin (1983, Section 1.6.2) for the complexity of the minimization problem and Nesterov (2018, p14-16) for the practical implications and software limitations.]

Given knowledge about the source of a potential identification failure, and a specific structure in the underlying model, Andrews and Cheng (2012, 2013, 2014), Cheng (2015), Han and McCloskey (2019), and Cox (2020) propose identification robust tests which are asymptotically non-conservative and powerful under strong identification. These papers rely on a data-driven choice of critical value; it is determined by an ICS statistic built from model-specific knowledge about the source and form of the identification failure. This paper proposes and studies an ICS statistic which does not rely on model-specific information to determine identification status. The choice of robust critical values can coincide with Andrews and Cheng (2012)’s least-favorable critical value, see Appendix H.2 for an example. Andrews (2017) proposes an ICS based on the singular values of the sample Jacobian, which measures local but not global identification strength. His test applies to GMM and likelihood problems.

Under higher-order identification, estimates are consistent but the delta-method is not valid. The limiting distribution is non-standard (Rotnitzky et al., 2000; Dovonon and Hall, 2018). This issue is known but much less studied than weak and set identification. Dovonon et al. (2019) study identification robust tests under second-order identification, and Lee and Liao (2018) conduct standard inference under known second-order identification structure.

Notation

For any matrix (or vector) $A$, $\|A\|=\sqrt{\sum_{i,j}A_{i,j}^{2}}=\sqrt{\text{trace}(AA^{\prime})}$ is the Frobenius (Euclidean) norm of $A$. For any square matrix $A$, $\lambda_{j}(A)$ refers to the j-th eigenvalue of $A$, in increasing order if $A$ is symmetric positive semi-definite; $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ refer to its largest and smallest eigenvalue, respectively, and $\lambda_{1}(A),\dots,\lambda_{d}(A)$ are the first d eigenvalues of $A$ in increasing order. For a weighting matrix $W_{n}(\theta)$, the norm $\|\bar{g}_{n}(\theta)\|^{2}_{W_{n}}$ is computed as $\bar{g}_{n}(\theta)^{\prime}W_{n}(\theta)\bar{g}_{n}(\theta)$. The abbreviation wpa 1 will be used to abbreviate “with probability approaching 1.” For $\varepsilon>0$, $B_{\varepsilon}(\theta)$ is a closed $\varepsilon$-ball around $\theta$.

2 Setting and Assumptions

Following Hansen and Singleton (1982), the econometrician wants to estimate the solution vector $\theta_{0}$ to the system of unconditional moment equations:

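$\displaystyle g(\theta_{0},\gamma_{0})\overset{def}{=}\mathbb{E}_{\gamma_{0}}(\bar{g}_{n}(\theta_{0}))=0,$

(2)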

where $\theta_{0}=(\theta_{10}^{\prime},\theta_{20}^{\prime})^{\prime}\in\overline{\Theta}=\overline{\Theta}_{1}\times\overline{\Theta}_{2}$, a compact subset of $\mathbb{R}^{d_{\theta}}$, and $\text{dim}(\bar{g}_{n})=d_{g}\geq d_{\theta}$. $\bar{g}_{n}(\theta)=1/n\sum_{i=1}^{n}g(z_{i},\theta)$ is the sample vector of moment conditions, and $(z_{i})_{i=1,\dots,n}$ is a sample of iid or stationary random variables. The parameter $\gamma_{0}\in\Gamma$ indexes the true distribution of the data $(z_{i})$, including the true $\theta_{0}$. The parameter space has the form $\Gamma=\{\gamma=(\theta,\omega),\theta\in\overline{\Theta},\omega\in\Omega\}$, where $\Omega$ indexes features of the data generating process beyond $\theta$ that are relevant to identification and weak convergence. $\Gamma=\overline{\Theta}\times\Omega$ is a compact subset of a metric space with a metric $\|\theta-\tilde{\theta}\|+d(\omega,\tilde{\omega})$ between $\gamma=(\theta,\omega)$ and $\tilde{\gamma}=(\tilde{\theta},\tilde{\omega})$ that induces weak convergence for $(z_{i},z_{i+m})$ for any $i,m\geq 1$. [Footnote 3: To reduce the number of coefficients involved in the notation below, this distance will be written as $\|\theta-\tilde{\theta}\|+d(\gamma,\tilde{\gamma})$.] See Andrews and Cheng (2012, p2162) for a discussion of these conditions. The operator $\mathbb{E}_{\gamma_{0}}$ denotes the expectation under $\gamma=\gamma_{0}$. $g(\theta,\gamma_{0})=\mathbb{E}_{\gamma_{0}}(\bar{g}_{n}(\theta))$ is then the population vector of moment conditions evaluated at the true $\gamma_{0}\in\Gamma$ and a coefficient $\theta$. Throughout, it is assumed that $\theta_{0}$ is such that $g(\theta_{0},\gamma_{0})=0$. The function $g(\cdot,\gamma)$ is assumed to be continuously differentiable on $\Theta$ for all $\gamma$.

Given the sample moments $\bar{g}_{n}$ and a sequence of positive definite weighting matrices $W_{n}(\theta)$ converging to $W(\theta)$ , the GMM estimator $\hat{\theta}_{n}$ solves the sample minimization problem:

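$\displaystyle\hat{\theta}_{n}=\text{argmin}_{\theta\in\Theta}\,\bar{g}_{n}(\theta)^{\prime}W_{n}(\theta)\bar{g}_{n}(\theta),$

(3)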

where $\Theta=\Theta_{1}\times\Theta_{2}$ is the optimization space.
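To make the minimization problem concrete, here is a minimal sketch on a toy model that is not from the paper: the two moments pin down the mean and variance of simulated data, the weighting matrix is held fixed at the identity rather than depending on $\theta$, and the names `sample_moments` and `gmm_objective` are hypothetical. A local optimizer is used, which is only reasonable here because the toy model is strongly identified.

```python
import numpy as np
from scipy.optimize import minimize


def sample_moments(theta, z):
    """Toy g_bar_n(theta): moments for the mean (mu) and variance (s2) of z."""
    mu, s2 = theta
    return np.array([np.mean(z - mu), np.mean((z - mu) ** 2 - s2)])


def gmm_objective(theta, z, W):
    """Quadratic form g_bar_n(theta)' W g_bar_n(theta)."""
    g = sample_moments(theta, z)
    return float(g @ W @ g)


rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=2.0, size=500)  # simulated data, illustration only
W = np.eye(2)                                 # fixed weighting matrix (assumption)

result = minimize(
    gmm_objective,
    x0=np.array([0.0, 1.0]),
    args=(z, W),
    method="Nelder-Mead",
    options={"xatol": 1e-8, "fatol": 1e-12, "maxiter": 2000},
)
theta_hat = result.x  # estimates of (mu, s2)
```

Because this toy model is just-identified, the minimizer coincides with the sample mean and the (biased) sample variance. Under weak or set identification the objective can be flat or multi-modal, and a single local search of this kind would not be reliable.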

Assumption 1 (Parameter Space, Sample Moments, Weighting Matrix).

i. $\Gamma$ and $\overline{\Theta}\subset\mathbb{R}^{d_{\theta}}$ are compact; $\Theta$ is a convex, compact subset of $\mathbb{R}^{d_{\theta}}$ such that $\overline{\Theta}\subset\Theta$ and $\cup_{\theta\in\overline{\Theta}}B_{\eta}(\theta)\subseteq\Theta$ for some $\eta>0$; for all $(\theta,\gamma)\in\overline{\Theta}\times\Gamma$ and all $\varepsilon>0$, $B_{\varepsilon}(\theta,\gamma)\cap(\overline{\Theta}\times\Gamma)$ is non-singleton and connected,

ii. for any sequence $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})\in\overline{\Theta}\times\Gamma$: $\sup_{\theta\in\Theta}\sqrt{n}\|\bar{g}_{n}(\theta)-g(\theta,\gamma_{n})\|=O_{p}(1),$ and $\sqrt{n}\bar{g}_{n}(\theta_{n})\overset{d}{\to}\mathcal{N}(0,V_{0})$ where $V_{0}$ is finite and non-singular,

iii. $\sup_{\theta\in\Theta}\|W_{n}(\theta)-W(\theta)\|=o_{p}(1)$. $W_{n}$ and $W$ are Lipschitz continuous in $\theta$; there exists $\underline{\lambda}_{W},\overline{\lambda}_{W}$ such that $0<\underline{\lambda}_{W}\leq\lambda_{\min}(W_{n}(\theta))\leq\lambda_{\max}(W_{n}(\theta))\leq\overline{\lambda}_{W}<\infty$, for all $\theta$.

Assumption 1 i. implies that $\Theta$ strictly contains $\overline{\Theta}$ so that issues arising when a parameter is on the boundary are not considered here. [Footnote 4: See Cox (2020) for results on identification and boundary robust inference.] The connected neighborhood condition plays the role of Assumption ACP iv. in Andrews and Cheng (2012, p2165). It implies that we can find sequences $\gamma_{n}$ along a continuous path in $\Gamma$ leading to $\gamma_{0}$ such that $0<\|\gamma_{n}-\gamma_{0}\|\to 0$. Together with a continuity condition in Assumption 3 below, it allows converging subsequences to be interpolated into converging sequences of parameters in one of the desired identification categories. This is similar to Assumption B2 in Andrews et al. (2020) and Assumption 14 in Cox (2020). Condition ii. is a uniform convergence condition, implied by a uniform CLT. Condition iii. ensures that $\|\cdot\|_{W_{n}}$ is equivalent to $\|\cdot\|$ so that the choice of $W_{n}$ does not alter the identifiability of the parameters.

2.1 Outline of the Procedure

The following steps provide a general overview of the computation of the quasi-Jacobian matrix, the ICS, and the test procedure used in the paper. In the following, the matrix $P_{\theta_{1}}^{\perp}$ is an orthogonal projection matrix, projecting on the space orthogonal to $\theta_{1}$. It can be written as $P_{\theta_{1}}^{\perp}=\text{diag}(0_{d_{\theta_{1}}},1_{d_{\theta_{2}}})$ so that it only selects elements associated with $\theta_{2}$. The matrix $\overline{V}_{n}$ is a weighted average of estimates of $\text{var}[\sqrt{n}\overline{g}_{n}(\theta_{b})]$, with weights proportional to $\hat{K}_{n}(\theta_{b})$, described in more detail below. [Footnote 5: For iid data, $\text{var}[\sqrt{n}\overline{g}_{n}(\theta_{b})]$ is approximated using $\frac{1}{n}\sum_{i=1}^{n}g(z_{i},\theta_{b})g(z_{i},\theta_{b})^{\prime}-\overline{g}_{n}(\theta_{b})\overline{g}_{n}(\theta_{b})^{\prime}$; for dependent data a HAC estimator is used.]

Computing the quasi-Jacobian and the test statistic:

1. Inputs: bandwidth $\kappa_{n}$, kernel $K$, cutoff $\underline{\lambda}_{n}$, number of draws $B$

2. quasi-Jacobian Matrix

i. Draw $(\theta_{b})_{b=1,\dots,B}$ uniformly on the level set $\{\theta\in\Theta,\|\bar{g}_{n}(\theta)\|_{W_{n}}\leq\kappa_{n}\}$

ii. Compute the intercept $A_{n,\infty}$ and slope $B_{n,\infty}$ in the $\ell_{\infty}$-norm regression:

$\displaystyle(A_{n,\infty},B_{n,\infty})=\text{argmin}_{A,B}\left(\sup_{b\in\{1,\dots,B\}}\|\overline{g}_{n}(\theta_{b})-A-B\theta_{b}\|\hat{K}_{n}(\theta_{b})\right),$ (4)

where $\hat{K}_{n}(\theta_{b})=K(\|\overline{g}_{n}(\theta_{b})\|_{W_{n}}/\kappa_{n})$.

iii. Compute the variance $\Sigma_{n}$:

$\displaystyle(\mu_{n},\Sigma_{n})=\text{argmin}_{\Sigma,\mu}\left(\sup_{b\in\{1,\dots,B\}}(\log|\Sigma|+\|\theta_{b}-\mu\|^{2}_{\Sigma^{-1}})\hat{K}_{n}(\theta_{b})\right),$ (5)

3. Identification Category Selection

i. Compute the singular values $(\lambda_{jn})_{j=1,\dots,d_{\theta}}$ of $\overline{V}_{n}^{-1/2}B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}$

ii. Compute $\hat{d}_{n}$, the number of singular values $\lambda_{jn}$ greater than $\underline{\lambda}_{n}$

4. Subvector Inference

i. Compute the test statistic: $\text{AR}(\theta_{10})=\inf_{\theta_{2}\in\Theta_{2}}n\|\bar{g}_{n}(\theta_{10},\theta_{2})\|^{2}_{\hat{V}_{n}^{-1}}$

ii. Reject $H_{0}:\theta_{1}=\theta_{10}$ at the $1-\alpha$ confidence level if $\text{AR}(\theta_{10})>\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)$

In the procedure, $\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)$ is the $1-\alpha$ quantile of a $\chi^{2}$ distribution with $d_{g}-\hat{d}_{n}$ degrees of freedom, and $d_{g}$ is the number of moment conditions. In the following, the number of draws $B$ is assumed to be sufficiently large for the finite-$B$ approximation error to be negligible. The $\ell_{\infty}$ regression (4) is known as a Chebyshev (or minimax) approximation problem and can be cast as a linear programming problem (Boyd and Vandenberghe, 2004, p293). It can be solved with a few lines of code using the cvx convex optimization toolkit. [Footnote 6: See Supplemental Appendix G for sample R code which implements the method.] Problem (5) is also solved using cvx. Finally, note that, in the procedure, the intercept $A_{n,\infty}$ and the mean $\mu_{n}$ are nuisance parameters; only $B_{n,\infty}$ and $\Sigma_{n}$ are used in steps 3-4. On the computation side, Appendix F outlines a sequential algorithm to sample on the level set (step 2i.). The quasi-Jacobian is only computed once, and it is defined whether or not the sample moments are differentiable. In contrast, the standard Jacobian requires differentiability and needs to be evaluated at every grid point. Instead of the $\ell_{\infty}$ loss, one could use the $\ell_{2}$-norm, which yields least-squares solutions $(A_{n,LS},B_{n,LS})$. Some technical difficulties arise because the identified set typically has measure zero, and stronger assumptions are required to derive the properties of $B_{n,LS}$ compared to $B_{n,\infty}$. The re-scaling in step 3 is discussed below. The following provides further details about the steps outlined above.
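To make the minimax regression in (4) concrete, the following is a minimal sketch for the simplest possible case: one moment, one parameter, and the uniform kernel, so that every retained draw has weight one. It is not the R/cvx implementation referred to in the text. Instead of a linear program it uses the fact that, for a fixed slope, the best intercept centres the residuals, and then runs a ternary search over the slope. The function name `chebyshev_line`, the search bracket, and the toy data are illustrative assumptions.

```python
def chebyshev_line(thetas, gvals, lo=-100.0, hi=100.0, iters=200):
    """Minimax (sup-norm) fit of g ~ A + B*theta for scalar g and theta.

    Assumes unit weights (uniform kernel on the retained draws) and that
    the optimal slope lies in [lo, hi]; both are illustrative assumptions.
    """
    def half_range(b):
        # For a fixed slope b the optimal intercept centres the residuals,
        # so the sup-norm error is half the residual range (convex in b).
        res = [g - b * t for t, g in zip(thetas, gvals)]
        return (max(res) - min(res)) / 2.0

    for _ in range(iters):  # ternary search on a convex function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if half_range(m1) <= half_range(m2):
            hi = m2
        else:
            lo = m1
    b = (lo + hi) / 2.0
    res = [g - b * t for t, g in zip(thetas, gvals)]
    a = (max(res) + min(res)) / 2.0
    return a, b, half_range(b)


# Toy check: an exactly linear moment g(theta) = 1 - 2*theta is fitted exactly.
draws = [0.1 * k for k in range(11)]
A, B, err = chebyshev_line(draws, [1.0 - 2.0 * t for t in draws])
```

With vector-valued moments and non-uniform kernel weights this shortcut no longer applies, and a convex solver of the kind mentioned in the text is the natural tool.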

2.2 Linear Approximations and the quasi-Jacobian Matrix

The quasi-Jacobian matrix $B_{n,\infty}$ is defined as the slope of a local linear approximation for $\bar{g}_{n}(\cdot)$ over an estimate of the identified set.

Definition 1.

(Sup-Norm Approximation) Let $K$ be a kernel function and $\kappa_{n}$ a bandwidth. The sup-norm approximation $(A_{n,\infty},B_{n,\infty})$ solves:

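$\displaystyle(A_{n,\infty},B_{n,\infty})=\text{argmin}_{A,B}\left(\sup_{\theta\in\Theta}\|\bar{g}_{n}(\theta)-A-B\theta\|\hat{K}_{n}(\theta)\right),$ (6)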

where $\hat{K}_{n}(\theta)=K\left(\|\bar{g}_{n}(\theta)\|_{W_{n}}/\kappa_{n}\right)$ . The quasi-Jacobian refers to the slope matrix $B_{n,\infty}$ .

In practice, the minimization problem (6) is solved over a finite grid as in (4). The grid can be generated using Monte-Carlo or quasi-Monte-Carlo methods (Robert and Casella, 2004; Lemieux, 2009). In the simulations, the Sobol sequence was used. In the empirical application, $d_{\theta}=12$ is relatively large and the set of $\theta$ where $\hat{K}_{n}(\theta)>0$ is fairly narrow, so the acceptance rate is very low. A very large number of draws would be needed to find sufficiently many $\theta_{b}$ with non-zero weight, i.e. $\hat{K}_{n}(\theta_{b})>0$. The empirical application therefore relies on a sequential sampling principle called Population Monte Carlo (Cappé et al., 2004). It constructs a sequence of proposal distributions that approximate the target distribution with increasing accuracy; see Appendix F for details. These proposals can be re-purposed to compute confidence sets, reducing the additional time required for test inversion. They can also be used to compute $B_{n,\infty}$ for different values of $\kappa_{n}$ as a sensitivity analysis.
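The acceptance-rate problem described above can be illustrated with a small sketch of plain rejection sampling on the level set. It uses pseudo-random uniform draws from the standard library rather than a Sobol sequence or Population Monte Carlo, takes $W_{n}=I$, and uses a hypothetical moment $g(\theta)=\theta$ so that the level set is a ball of radius $\kappa$. The names `level_set_draws` and `toy_moment` are assumptions made for this illustration.

```python
import math
import random


def level_set_draws(gbar, bounds, kappa, n_draws, seed=0):
    """Keep uniform draws theta with ||gbar(theta)|| <= kappa (W_n = I assumed)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        theta = [rng.uniform(lo, hi) for lo, hi in bounds]
        norm = math.sqrt(sum(g * g for g in gbar(theta)))
        if norm <= kappa:  # uniform kernel: weight 1 inside, 0 outside
            kept.append(theta)
    return kept, len(kept) / n_draws


def toy_moment(theta):
    # Hypothetical moment g(theta) = theta: the level set is a ball of radius kappa.
    return list(theta)


# Acceptance rate as the parameter dimension grows, same box and bandwidth.
rates = {}
for d in (1, 2, 4, 8):
    _, rates[d] = level_set_draws(toy_moment, [(-1.0, 1.0)] * d,
                                  kappa=0.5, n_draws=20000)
```

In this toy design the share of accepted draws falls quickly with the dimension, which is the motivation for a sequential sampler when $d_{\theta}$ is large.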

Assumption 2 (Kernel, Bandwidth).

i. $K(x)>0$ if $x\in[0,1)$ , $K(x)=0$ if $x\geq 1$ . $K$ is continuous on $[0,1)$ , ii. $\sqrt{n}\kappa_{n}\to\infty$ , $\sqrt{n}\kappa_{n}^{2}\to 0$ .

The kernel is assumed to have compact support. The uniform kernel, $K(x)=\mathbbm{1}_{x\in[-1,1]}$, was used in the simulations and empirical results. [Footnote 7: The estimated $B_{n,\infty}$ is nearly numerically identical using the cosine or Epanechnikov kernels.] The first bandwidth condition in ii., $\sqrt{n}\kappa_{n}\to\infty$, ensures that $\hat{K}_{n}(\cdot)$ selects the identified set with wpa 1 under weak identification. The second, $\sqrt{n}\kappa_{n}^{2}\to 0$, ensures that $B_{n,\infty}$ only captures the first-order Jacobian term in local expansions under (semi)-strong identification. Otherwise, it would also capture nonlinear terms from the remainder.

2.3 Test Procedure

To illustrate the usefulness of detecting identification failure, consider the following simple data-driven test procedure. It is based on the Anderson-Rubin statistic for non-linear GMM models as described in Stock and Wright (2000). To test null hypotheses of the form $H_{0}:\theta_{1}=\theta_{10}$ , compute the sample statistic:

[TABLE]

where $\hat{V}_{n}(\theta)$ consistently estimates the asymptotic variance $\lim_{n\to\infty}\text{var}(\sqrt{n}\bar{g}_{n}(\theta))$. The test rejects at a nominal level $\alpha\in(0,1)$ if $\text{AR}_{n}(\theta_{10})>\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)$ where $\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)$ is the $1-\alpha$ quantile of a chi-square distribution with $d_{g}-\hat{d}_{n}$ degrees of freedom. $\hat{d}_{n}\in\{0,\dots,d_{\theta_{2}}\}$ is computed using an identification category selection (ICS) procedure based on the quasi-Jacobian and its singular values. The procedure, described below, evaluates the number of nuisance parameters in $\theta_{2}$ which are potentially weakly/set identified. Using $\hat{d}_{n}=0$ yields the largest critical value and amounts to full projection inference (Dufour and Taamouti, 2005). Using $\hat{d}_{n}=d_{\theta_{2}}$ yields the smallest critical value which provides valid, non-conservative inferences when all of the nuisance parameters are strongly identified. Intermediate values of $\hat{d}_{n}$ improve power compared to full projection while ensuring robustness if a subset of the nuisance parameters is weakly identified. A confidence set for $\theta_{1}$ collects all values of $\theta_{1}$ for which $\text{AR}_{n}(\theta_{1})\leq\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)$ using the same $\hat{d}_{n}$.
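The effect of $\hat{d}_{n}$ on the critical value can be seen in a short sketch. To stay within the standard library, the chi-square quantile is obtained by bisection on a series expansion of the CDF rather than from a statistics package; this numerical route, the function names, and the example dimensions $d_{g}=3$, $d_{\theta_{2}}=2$ are assumptions made for illustration only.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_quantile(p, df):
    """Invert the CDF by bisection; adequate for the small df used here."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def critical_values(d_g, d_theta2, alpha=0.05):
    """Quantile chi2_{d_g - d_hat}(1 - alpha) for each admissible d_hat."""
    return {d_hat: chi2_quantile(1.0 - alpha, d_g - d_hat)
            for d_hat in range(d_theta2 + 1) if d_g - d_hat > 0}


# d_hat = 0 is full projection; d_hat = d_theta2 treats all nuisance
# parameters as strongly identified.
cv = critical_values(d_g=3, d_theta2=2)
```

The critical value shrinks as $\hat{d}_{n}$ increases, which is the source of the power gain over full projection.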

The choice of $\hat{d}_{n}$ should be invariant to rescaling the sample moments $\bar{g}_{n}$ and/or the parameters $\theta$. To this end, the procedure relies on two normalization matrices: $\bar{V}_{n}=\int_{\Theta}\hat{V}_{n}(\theta)\hat{\pi}_{n}(\theta)d\theta$ and $\Sigma_{n}$, where $\hat{\pi}_{n}(\theta)=\hat{K}_{n}(\theta)/\int_{\Theta}\hat{K}_{n}(\theta)d\theta$. $\bar{V}_{n}$ is an average of asymptotic variance estimators for $\lim_{n\to\infty}n\text{var}(\bar{g}_{n}(\theta))$. It is used to ensure the procedure is invariant to re-scaling and rotating the sample moments. $\Sigma_{n}$ is the $\ell_{\infty}$-covariance matrix minimizing $\sup_{\theta\in\Theta}\left(\log|\Sigma|+\|\theta-\mu\|_{\Sigma^{-1}}^{2}\right)\hat{K}_{n}(\theta)$ over $\mu$ and $\Sigma$. These quantities are readily available from the steps required to compute $B_{n,\infty}$. It is important to use an estimate of the variance $\Sigma_{n}$ of $\theta$ on $\Theta_{n}$ rather than the variance of $B_{n,\infty}$ or of the sample Jacobian. When the model is set or weakly identified, the variance $\Sigma_{n}$, which measures the size of the set $\Theta_{n}$, does not go to zero in directions where identification fails. [Footnote 8: Lemma D4 shows that $\Sigma_{n}^{-1/2}$ is bounded above under weak identification in directions where identification fails.] The variance of $B_{n,\infty}$ or the Jacobian could be arbitrarily small, however. [Footnote 9: Take $g(z_{i};\theta)=0$ for all $\theta$, $z_{i}$ a.s. The variances of both $B_{n,\infty}$ and the Jacobian are zero; yet, $\Sigma_{n}\neq 0$.] Hence, $B_{n,\infty}\Sigma_{n}^{-1/2}$ is vanishing in directions where identification fails and is invariant to rescaling the coefficients $\theta$.

Let $P_{\theta_{1}}^{\perp}$ be the projection matrix onto the orthogonal complement of the span of $\theta_{1}$, and compute the singular values of the normalized matrix $\bar{V}_{n}^{-1/2}B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}$:

[TABLE]

where $\lambda_{j}$ denotes the j-th eigenvalue in increasing order so that $0\leq\lambda_{1n}\leq\dots\leq\lambda_{d_{\theta}n}$ . By projection, the smallest $d_{\theta_{1}}$ singular values are equal to zero. Take $\underline{\lambda}_{n}\to 0$ , a decreasing sequence such that $\kappa_{n}=o(\underline{\lambda}_{n})$ , and compute:

$\displaystyle\hat{d}_{n}=\#\{j\in\{1,\dots,d_{\theta}\}:\lambda_{jn}>\underline{\lambda}_{n}\},$

where $\#$ counts the number of singular values $\lambda_{jn}$ which are greater than the threshold $\underline{\lambda}_{n}$ .
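The counting step is simple enough to sketch directly. The singular values below are hypothetical numbers chosen to mimic one zero from the projection on $\theta_{1}$, one small value for a weakly identified direction, and one large value; the cutoff and the number of moments are also assumptions for illustration.

```python
def ics_count(singular_values, cutoff):
    """Number of singular values strictly above the cutoff (the ICS estimate d_hat)."""
    return sum(1 for s in singular_values if s > cutoff)


# Hypothetical normalized singular values, in increasing order.
lam = [0.0, 0.03, 4.2]
d_hat = ics_count(lam, cutoff=0.5)

# Degrees of freedom d_g - d_hat for the critical value, assuming d_g = 3 moments.
df = 3 - d_hat
```

Here only one nuisance direction clears the cutoff, so the test would use a chi-squared critical value with two degrees of freedom.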

Choice of Tuning Parameters:

A default choice is the uniform kernel $K(x)=\mathbbm{1}_{x\in[-1,1]}$. Then, the role of the pair $(\kappa_{n},W_{n})$ is to estimate the solution set of parameter(s) $\theta_{0}$ such that $g(\theta_{0},\gamma_{0})=0$. For this choice of kernel, if a law of the iterated logarithm applies, then, pointwise, $\text{liminf}_{n\to\infty}\hat{K}_{n}(\theta_{0})=1$ almost surely using $W_{n}(\theta_{0})=\text{var}[\sqrt{n}\overline{g}_{n}(\theta_{0})]^{-1}$ and $\kappa_{n}=\sqrt{2\log[\log(n)]/n}$. [Footnote 10: A law of the iterated logarithm implies $\text{limsup}_{n\to\infty}\sqrt{n/(2\log[\log(n)])}\|\overline{g}_{n}(\theta_{0})\|_{W_{n}}=1$ almost surely, also $K(x)=1$ for all $x\in[-1,1]$, see e.g. Petrov (1995, Ch7); and Kosorok (2008, p31), van der Vaart and Wellner (1996, p379, footnote b) for references applying to empirical processes, which are not pointwise.] In that sense, efficient weighting and $\kappa_{n}=\sqrt{2\log\log(n)/n}$ are asymptotically optimal and make $\hat{K}_{n}(\theta)$ invariant to linear transformations of the moments.
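As a quick numerical check, the bandwidth $\kappa_{n}=\sqrt{2\log[\log(n)]/n}$ can be evaluated against the two rate conditions of Assumption 2 ii. The function name `lil_bandwidth` and the chosen sample sizes are illustrative assumptions.

```python
import math


def lil_bandwidth(n):
    """kappa_n = sqrt(2 log(log n) / n), the default bandwidth discussed in the text."""
    return math.sqrt(2.0 * math.log(math.log(n)) / n)


for n in (10**2, 10**4, 10**6, 10**8):
    k = lil_bandwidth(n)
    # sqrt(n)*kappa_n should grow with n; sqrt(n)*kappa_n^2 should shrink.
    print(n, k, math.sqrt(n) * k, math.sqrt(n) * k * k)
```

Note that $\sqrt{n}\kappa_{n}=\sqrt{2\log\log n}$ diverges only very slowly, so the level set is only slightly wider than the $n^{-1/2}$ sampling noise at any realistic sample size.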

The role of the normalized $B_{n,\infty}$ and the threshold $\underline{\lambda}_{n}$ is analogous to the ICS procedure in Andrews and Cheng (2012, Section 5.2), and the subsequent literature. Here, it is shown that if $d$ nuisance parameters are weakly identified then at least $d$ singular values are $O_{p}(\kappa_{n})$. Hence, wpa 1, they are smaller than $\underline{\lambda}_{n}$ if $\kappa_{n}=o(\underline{\lambda}_{n})$. As a result, $\hat{d}_{n}$ is no greater than the number of (semi)-strongly identified nuisance parameters wpa 1, which leads to valid inferences under weak identification. Typically, using larger values of $\underline{\lambda}_{n}$ in an ICS procedure is desirable for robust inference since it correctly detects identification failures with greater probability in finite samples. However, it also makes the test more conservative under semi-strong identification since it incorrectly detects identification failure with greater probability. This implies a trade-off between power for semi-strongly identified models and robustness for weakly identified models. The normalization $\Sigma_{n}^{-1/2}$ in the procedure improves on this by making the behaviour of the ICS statistic more distinct between these two regimes. The normalization $B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}$ preserves the asymptotic singularity under weak identification but the normalized matrix diverges at a $\kappa_{n}^{-1}$-rate when identification is strong.

2.4 The quasi-Jacobian

The main component of the procedure is the quasi-Jacobian. To better understand the main differences with the Jacobian, the following derives its properties for $n=\infty$ , using a positive definite $W(\theta)$ and the uniform kernel $K(x)=\mathbbm{1}_{x\in[-1,1]}$ . Take

[TABLE]

where the $\sup$ is taken over $\theta\in\Theta$ with $\kappa>0$ . To compute the Jacobian, $\partial_{\theta}g(\theta_{0},\gamma_{0})$ , one would use the set $\|\theta-\theta_{0}\|\leq\kappa$ ; the main difference is the choice of neighborhood.

This difference suggests that, unlike the Jacobian, the properties of the quasi-Jacobian depend on the set $\Theta_{0}=\{\theta\in\Theta,g(\theta,\gamma_{0})=0\}$, which collects all solutions to the moment condition. For any given value $\gamma_{0}\in\Gamma$, there are three possibilities, either: i. $\Theta_{0}$ is non-singleton, ii. $\Theta_{0}=\{\theta_{0}\}$ is singleton and $\partial_{\theta}g(\theta_{0},\gamma_{0})$ is singular, or iii. $\Theta_{0}=\{\theta_{0}\}$ is singleton and $\partial_{\theta}g(\theta_{0},\gamma_{0})$ has full rank. Under i., $\theta_{0}$ is not globally identified. Under ii. and iii., $\theta_{0}$ is globally identified, but it is locally identified only under iii. Consistency and asymptotic normality require iii., i.e. strong identification, and standard inference need not be asymptotically valid under i. or ii. The following Theorem relates the rank of $B_{\infty}$ to cases i., ii., and iii.

Theorem 1 (quasi-Jacobian, $n=\infty$ ).

Take $\gamma_{0}\in\Gamma$. Suppose $0<\underline{\lambda}_{W}\leq\lambda_{\min}(W(\theta))\leq\lambda_{\max}(W(\theta))\leq\overline{\lambda}_{W}<\infty$ and $g(\cdot,\gamma_{0})$ is continuously differentiable for all $\theta\in\Theta$. Suppose there are $\overline{\varepsilon},\overline{C}>0$ and $\alpha>1$ such that when $\Theta_{0}=\{\theta_{0}\}$ is singleton: $\|g(\theta)-g(\theta_{0})-\partial_{\theta}g(\theta_{0})(\theta-\theta_{0})\|\leq\overline{C}\|\theta-\theta_{0}\|^{\alpha}$ for all $\|\theta-\theta_{0}\|\leq\overline{\varepsilon}$. Then the quasi-Jacobian $B_{\infty}$ is such that:

(1) $B_{\infty}$ is singular if, and only if: $\Theta_{0}$ is non-singleton or, $\Theta_{0}$ is singleton and $\partial_{\theta}g(\theta_{0},\gamma_{0})$ is singular,

(2) For $\Theta_{0}$ singleton and $\partial_{\theta}g(\theta_{0},\gamma_{0})$ full rank: $B_{\infty}=\partial_{\theta}g(\theta_{0},\gamma_{0})$,

(3) For $\Theta_{0}$ non-singleton: $B_{\infty}(\theta^{1}_{0}-\theta_{0}^{2})=0$ for all $\{\theta_{0}^{1},\theta_{0}^{2}\}\subseteq\Theta_{0}$,

(4) For $\Theta_{0}$ singleton and $\partial_{\theta}g(\theta_{0},\gamma_{0})$ singular: $B_{\infty}v=0$ whenever $\partial_{\theta}g(\theta_{0},\gamma_{0})v=0$.

The dependence of $\Theta_{0}$ , $B_{\infty}$ on $\gamma_{0}$ is omitted to simplify notation. The condition for globally identified models holds with $\alpha=2$ if $g(\cdot,\gamma_{0})$ is twice continuously differentiable with bounded second derivative around $\theta_{0}$ . Theorem 1 shows that $B_{\infty}$ is singular as soon as $\gamma_{0}$ is such that global or local identification fails (1). An immediate implication of Theorem 1 is that for all $v\neq 0$ such that $B_{\infty}v\neq 0$ , $P_{v}\Theta_{0}=P_{v}\{\theta_{0}\}$ ; i.e. the parameter is point identified in direction $v$ . This contrasts with the Jacobian which can have full rank without global identification. $B_{\infty}$ is singular in all directions in which global identification fails (3), or local identification fails (4); these directions may vary depending on $\gamma_{0}$ . $B_{\infty}$ has full rank only if $\gamma_{0}$ is such that both global and local identification hold (2). Theorem 1 holds for both just and over-identified models. The identified set $\Theta_{0}$ can be arbitrary, e.g. discrete. The results do require correct specification, $\Theta_{0}$ non-empty, for the quasi-Jacobian $B_{\infty}$ to be well defined. Even though Theorem 1 is fairly general, the main results will be restricted to settings where either i. $\Theta_{0}$ is non-singleton, or iii. $\Theta_{0}$ is singleton and $\partial_{\theta}g(\theta_{0},\gamma_{0})$ has full rank. Additional results for ii. are given in Appendix I.

The Jacobian generally does not have property (1) or (3). When using projection methods for subvector inference, one can concentrate out nuisance parameters that are both globally and locally identified. The Jacobian can only determine the latter which is not sufficient for consistency. The following illustrates (3) and gives a sketch of the proof using a simple non-linear model where $\Theta_{0}$ is non-singleton but the Jacobian has full rank for all $\theta\in\Theta_{0}$ .

Intuition for linear models.

For linear models, the sup-norm approximation is exact with $B_{\infty}=\mathbb{E}(x_{i}x_{i}^{\prime})$ and $\mathbb{E}(z_{i}x_{i}^{\prime})$ for OLS and IV, respectively. The quasi-Jacobian coincides with the Jacobian and it is singular when the regressors are multicollinear or the instruments are not relevant. Both are singular in directions where the rank condition fails.

Non-linear models: a pen and pencil example.

Consider a simple MA(1) process:

[TABLE]

where $\theta=(\vartheta,\sigma^{2})\in\mathbb{R}\times\mathbb{R}_{+}$ are the parameters of interest. The model is estimated using the following set of moment conditions (the dependence on $\gamma$ is omitted in this example):

[TABLE]

Whenever $\vartheta_{0}\not\in\{-1,0,1\}$ and $\sigma_{0}^{2}>0$ , this system of equations has two distinct solutions: $\theta_{0}^{1}=(\vartheta_{0},\sigma_{0}^{2})$ and $\theta_{0}^{2}=(1/\vartheta_{0},\vartheta_{0}^{2}\sigma_{0}^{2})$ . Imposing invertibility (i.e. $|\vartheta_{0}|<1$ ), or non-invertibility (i.e. $|\vartheta_{0}|>1$ ) restores identification so that, intuitively, only one dimension is unidentified. Both solutions are locally identified: the Jacobian $\partial_{\theta}g(\theta)$ has full rank at both values; it is uninformative about the global identification failure in this example. The goal of this example is to show that $B_{\infty}$ is informative about the lack of global identification and the direction in which identification fails. Without the quasi-Jacobian, one would need to check with pen and pencil whether $g(\theta)=0$ has multiple solutions, or not.

The first step is to find a one-to-one linear reparameterization $\beta=(\beta_{1}^{\prime},\beta_{2}^{\prime})^{\prime}$ such that $\beta_{1}$ is uniquely identified but $\beta_{2}$ is not. Let $v_{2}=(\theta_{0}^{1}-\theta_{0}^{2})/\|\theta_{0}^{1}-\theta_{0}^{2}\|$ and pick any orthogonal $v_{1}\perp v_{2}$ such that $\|v_{1}\|=1$. By construction: $v_{1}^{\prime}(\theta_{0}^{1}-\theta_{0}^{2})=0$ and $v_{2}^{\prime}(\theta_{0}^{1}-\theta_{0}^{2})=\|\theta_{0}^{1}-\theta_{0}^{2}\|>0$. This implies that $\theta_{0}^{1}$ and $\theta_{0}^{2}$ are equal in direction $v_{1}$ but distinct in direction $v_{2}$. Pick $\beta_{1}=v_{1}^{\prime}\theta$, $\beta_{2}=v_{2}^{\prime}\theta$. As desired: the mapping is one-to-one, with $\beta_{1}$ uniquely and $\beta_{2}$ set identified. Property (3) in Theorem 1 implies that directions in which $B_{\infty}$ is non-singular must be associated with a unique value for $\theta_{0}$. This first step illustrates how these directions can be constructed from the set $\Theta_{0}$. Importantly, the linear reparameterization need not be computed explicitly in practice, as explained below.
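The two-solution structure and the directions $v_{1}$, $v_{2}$ can be checked numerically. The sketch below assumes the MA(1) process is written as $y_{t}=e_{t}-\vartheta e_{t-1}$ with $\text{var}(e_{t})=\sigma^{2}$ and that the moments match the variance and first autocovariance; this parameterization and the values $\vartheta_{0}=0.5$, $\sigma_{0}^{2}=1$ are assumptions for illustration, and the sign convention does not affect the conclusion.

```python
import math


def ma1_moments(theta):
    """Variance and first autocovariance implied by theta = (vartheta, sigma2).

    Assumes y_t = e_t - vartheta * e_{t-1} with var(e_t) = sigma2.
    """
    v, s2 = theta
    return (s2 * (1.0 + v * v), -v * s2)


def jacobian_det(theta):
    """Determinant of the 2x2 Jacobian of ma1_moments at theta."""
    v, s2 = theta
    # Rows: d/d(vartheta, sigma2) of s2*(1+v^2) and of -v*s2.
    return (2.0 * v * s2) * (-v) - (1.0 + v * v) * (-s2)


theta_a = (0.5, 1.0)                                        # (vartheta_0, sigma_0^2)
theta_b = (1.0 / theta_a[0], theta_a[0] ** 2 * theta_a[1])  # observationally equivalent

diff = [x - y for x, y in zip(theta_a, theta_b)]
norm = math.hypot(*diff)
v2 = [d / norm for d in diff]   # unit direction in which identification fails
v1 = [-v2[1], v2[0]]            # unit vector orthogonal to v2
```

Both parameter values imply the same moments, the Jacobian determinant equals $\sigma^{2}(1-\vartheta^{2})\neq 0$ at each of them, and the two solutions differ only along $v_{2}$.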

The second step is to show that $B_{\infty}$ is informative about the identification failure and contains information about the reparametrization above. In the MA(1) model, the set $\Theta_{0}=\{\theta_{0}^{1},\theta_{0}^{2}\}$ has two points. Take $\kappa>0$ and compute the intercept and slope $A_{\kappa,\infty},B_{\kappa,\infty}$ :

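$\displaystyle(A_{\kappa,\infty},B_{\kappa,\infty})=\text{argmin}_{A,B}\left(\sup_{\theta\in\Theta,\|g(\theta)\|\leq\kappa}\|g(\theta)-A-B\theta\|\right),$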

Here the uniform kernel $K(x)=\mathbbm{1}_{x\in[-1,1]}$ and $W=I$ are used for analytical simplicity. Notice that for $(A,B)=0$, $\sup_{\theta\in\Theta,\|g(\theta)\|\leq\kappa}\|g(\theta)\|\leq\kappa$. Also, because $\|g(\theta_{0}^{1})\|=\|g(\theta_{0}^{2})\|=0\leq\kappa$, the solution $A_{\kappa,\infty},B_{\kappa,\infty}$ is such that $\|g(\theta)-A_{\kappa,\infty}-B_{\kappa,\infty}\theta\|\leq\kappa$ for $\theta\in\{\theta_{0}^{1},\theta_{0}^{2}\}$. Using the triangle inequality and its reverse, this implies $\|B_{\kappa,\infty}(\theta_{0}^{1}-\theta_{0}^{2})\|\leq 2\kappa+\|g(\theta_{0}^{1})\|+\|g(\theta_{0}^{2})\|=2\kappa$. Now, express this in terms of the direction vector $v_{2}$ constructed above:

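$\displaystyle\|B_{\kappa,\infty}v_{2}\|\leq\frac{2\kappa}{\|\theta_{0}^{1}-\theta_{0}^{2}\|}\to 0,\text{ as }\kappa\to 0.$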

In the limit, the quasi-Jacobian $B_{\infty}$ is singular in the direction $v_{2}$ where identification fails. This implies that $v_{2}$ is a right-singular vector associated with the singular value $0$. The singular value decomposition of $B_{\infty}$ is informative about the directions of identification failure and the linear reparameterization from the first step. While the linear reparameterization requires knowledge of $\Theta_{0}$ and computing all possible $\theta_{0}^{1}-\theta_{0}^{2}$ with $\{\theta_{0}^{1},\theta_{0}^{2}\}\subseteq\Theta_{0}$, Theorem 1 implies that the right-singular vectors of $B_{\infty}$ associated with the singular value $0$ span all directions of identification failure $\theta_{0}^{1}-\theta_{0}^{2}$.

In large samples and under Assumptions 1-2, $\|B_{n,\infty}v_{2}\|\leq 2\kappa_{n}/\|\theta_{0}^{1}-\theta_{0}^{2}\|$ wpa $1$ (using the same $K$ , $W$ ). As a result, for any sequence $\underline{\lambda}_{n}$ such that $\kappa_{n}=o(\underline{\lambda}_{n})$ , $\sigma_{\min}(B_{n,\infty})\leq\underline{\lambda}_{n}$ wpa $1$ which signals the identification failure, as desired. To illustrate, Figure 1 compares the distribution of the largest and smallest singular values of the Jacobian $\partial_{\theta}\overline{g}_{n}(\hat{\theta}_{n})$ , quasi-Jacobian $B_{n,\infty}$ , and scaled quasi-Jacobian $\overline{V}_{n}^{-1/2}B_{n,\infty}\Sigma_{n}^{-1/2}$ with the same cutoff $\underline{\lambda}_{n}$ . The scaling makes the singular values scale invariant. The Jacobian fails to detect the lack of identification, even for large $n$ (left panel) and also with the scaling $\overline{V}_{n}^{-1/2}\partial_{\theta}\overline{g}_{n}(\hat{\theta}_{n})\Sigma_{n}^{-1/2}$ . The quasi-Jacobian detects the identification failure since the smallest singular value is below the cutoff. However, the largest singular value is also close to the cutoff. With the scaling, the largest singular value diverges while the smallest one shrinks to zero (right panel).

2.5 Drifting Sequences of Parameters, Identification Regimes

The test procedure described above is said to be robust to identification failure if it has asymptotic null rejection probability bounded above by the nominal size, i.e.:

[TABLE]

In the limit, the worst-case rejection rate should be no greater than the nominal size $\alpha$ . Following Andrews and Cheng (2012), this can be determined from the asymptotic properties of the test for specific sequences of parameters $(\theta_{n},\gamma_{n})\in\overline{\Theta}\times\Gamma$ .

Assumption 3 (Identification).

There exists a continuous function $\delta(\cdot)\geq 0$ and a strictly positive function $h(\cdot)>0$ such that for any $(\theta_{n},\gamma_{n})\in\overline{\Theta}\times\Gamma$ where $g(\theta_{n},\gamma_{n})=0$ and $\varepsilon>0$ :

[TABLE]

There exists a $\overline{\varepsilon}>0$ and a constant $C>0$ such that for $0<\varepsilon\leq\overline{\varepsilon}$ :

[TABLE]

The function $\delta$ indicates whether the solution $\theta_{0}$ to the moment condition $g(\theta,\gamma_{0})=0$ is unique for a given $\gamma=\gamma_{0}$ . The second part of the assumption implies that when $\delta(\gamma_{0})=0$ , there is at least one $\theta\neq\theta_{0}$ such that $g(\theta,\gamma_{0})=0$ . Sequences such that $\gamma_{n}\to\gamma_{0}$ , $\delta(\gamma_{0})=0$ , satisfy $\delta(\gamma_{n})\to 0$ since $\delta$ is continuous. The properties of $\hat{\theta}_{n}$ depend on the rate at which $\delta(\gamma_{n})$ converges to zero. Under Assumptions 1 and 3, Lemma A1 shows that $\|\hat{\theta}_{n}-\theta_{n}\|=o_{p}(1)$ , the estimator is consistent, if $\sqrt{n}\delta(\gamma_{n})\to\infty$ . When $\sqrt{n}\delta(\gamma_{n})=O(1)$ , the estimator is generally not consistent, see e.g. Stock and Wright (2000).

In the MA(1) example, pick $\overline{\Theta}=\{\vartheta_{0},\sigma_{0}^{2}\}\in(\mathbb{R}\setminus\{-1,0,1\})\times(\mathbb{R}_{+}\setminus\{0\})$, then $\delta(\gamma)=0$ regardless of $\gamma$ as long as the two distinct solutions $\theta_{0}^{1},\theta_{0}^{2}\in\Theta$. The second inequality only holds for $\varepsilon<\|\theta_{0}^{1}-\theta_{0}^{2}\|$, with $C=1$, which implies $\overline{\varepsilon}\in(0,\|\theta_{0}^{1}-\theta_{0}^{2}\|)$.

To give another example, consider a linear IV regression: $g(\theta,\gamma)=\mathbb{E}_{\gamma}[z_{i}(y_{i}-x_{i}^{\prime}\theta)]=\mathbb{E}_{\gamma}[z_{i}x_{i}^{\prime}](\theta_{n}-\theta)$ using $y_{i}=x_{i}^{\prime}\theta_{n}+u_{i}$ and $\mathbb{E}_{\gamma}(u_{i}z_{i})=0$. Here $\|g(\theta,\gamma)\|\geq\sigma_{\min}(\mathbb{E}_{\gamma}[z_{i}x_{i}^{\prime}])\|\theta_{n}-\theta\|$ so that $\delta(\gamma)=\sigma_{\min}(\mathbb{E}_{\gamma}[z_{i}x_{i}^{\prime}])$ and $h(\varepsilon)=\varepsilon$. The inequality holds with equality, i.e. $C=1$, when $\theta_{n}-\theta$ is the right singular vector of $\mathbb{E}_{\gamma}[z_{i}x_{i}^{\prime}]$ associated with the smallest singular value. Here $\delta(\gamma_{0})=0$ implies $\mathbb{E}_{\gamma_{0}}[z_{i}x_{i}^{\prime}]$ singular, and the model is underidentified. [Footnote 11: Additional derivations for a non-linear regression model are given in Appendix H.]
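The linear IV bound can be verified on a small example. The sketch below takes a hypothetical, nearly singular $2\times 2$ matrix for $\mathbb{E}_{\gamma}[z_{i}x_{i}^{\prime}]$ and computes its smallest singular value in closed form from the eigenvalues of $M^{\prime}M$, rather than calling a library SVD. The helper names and the matrix entries are assumptions made for this illustration.

```python
import math


def smallest_singular(M):
    """Smallest singular value and its right-singular vector for a 2x2 matrix."""
    (m11, m12), (m21, m22) = M
    # Entries of the symmetric matrix S = M'M.
    a = m11 * m11 + m21 * m21
    b = m11 * m12 + m21 * m22
    c = m12 * m12 + m22 * m22
    lam = (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    if abs(b) > 1e-15:
        v = (lam - c, b)            # eigenvector of S for its smallest eigenvalue
    else:
        v = (1.0, 0.0) if a <= c else (0.0, 1.0)
    nv = math.hypot(*v)
    return math.sqrt(max(lam, 0.0)), (v[0] / nv, v[1] / nv)


def g_norm(M, delta_theta):
    """||E[z x'](theta_n - theta)|| for a 2x2 E[z x'] (population moment)."""
    return math.hypot(M[0][0] * delta_theta[0] + M[0][1] * delta_theta[1],
                      M[1][0] * delta_theta[0] + M[1][1] * delta_theta[1])


Ezx = ((1.0, 0.5), (0.5, 0.3))          # hypothetical E[z_i x_i'], nearly singular
delta, v_min = smallest_singular(Ezx)   # delta(gamma) = sigma_min(E[z_i x_i'])
```

Evaluating `g_norm(Ezx, v_min)` reproduces `delta`, the case where the inequality binds, while other unit directions give a larger norm.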

The dichotomy between $\delta$ and $h$ in Assumption 3 makes it possible to construct a measure of global identification strength used to categorize the sequences $\gamma_{n}$. [Footnote 12: A similar decomposition can be found in Chen (2007, p5589) to isolate the effect of the sieve dimension $k$ on the shape of the objective in nonparametric estimation.] Let $\Gamma_{0}=\{\gamma\in\Gamma,\delta(\gamma)=0\}$ and $\Gamma_{1}=\Gamma\setminus\Gamma_{0}$. $\Gamma_{0}$ collects all DGPs such that $\theta_{0}$ is not uniquely identified, and $\Gamma_{1}$ those for which it is point identified. Let $\Gamma_{0}(\infty)=\{\gamma_{n}\in\Gamma,\gamma_{n}\to\gamma_{0}\in\Gamma_{0},\sqrt{n}\delta(\gamma_{n})\to\infty\}$, $\Gamma_{0}(b)=\{\gamma_{n}\in\Gamma,\gamma_{n}\to\gamma_{0}\in\Gamma_{0},\lim_{n\to\infty}\sqrt{n}\delta(\gamma_{n})=b<\infty\}$. In the following, any converging sequence $\gamma_{n}$ will be assumed either to belong to $\Gamma_{0}(b)$ for some $b\geq 0$, to belong to $\Gamma_{0}(\infty)$, or to converge in $\Gamma_{1}$. These will be referred to as weak, semi-strong, and strong sequences, respectively.
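The three regimes can be illustrated with a stylized parametric family, assuming $\delta(\gamma_{n})=c\,n^{-r}$ with $c>0$ and $r\geq 0$. This functional form and the function name `classify_sequence` are assumptions for illustration; the definitions in the text apply to general sequences.

```python
def classify_sequence(c, rate):
    """Classify delta(gamma_n) = c * n**(-rate), with c > 0 and rate >= 0.

    Returns the regime name and the limit of sqrt(n) * delta(gamma_n).
    """
    if rate == 0.0:
        # delta(gamma_0) = c > 0, so gamma_0 lies in Gamma_1.
        return "strong", float("inf")
    if rate < 0.5:
        # delta -> 0 but sqrt(n) * delta -> infinity: Gamma_0(infinity).
        return "semi-strong", float("inf")
    if rate == 0.5:
        # sqrt(n) * delta -> b = c: Gamma_0(b) with b > 0.
        return "weak", c
    # sqrt(n) * delta -> b = 0: Gamma_0(0).
    return "weak", 0.0
```

For example, `classify_sequence(2.0, 0.25)` falls in the semi-strong regime, whereas `classify_sequence(2.0, 0.5)` is weak with $b=2$.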

Assumption 4 (Strong and Semi-Strong Sequences).

Let $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ where $\gamma_{0}\in\Gamma_{1}$, or $\gamma_{n}\in\Gamma_{0}(\infty)$. Let $H_{n}=\left(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})\right)^{-1/2}$. For any $r_{n}=o(1)$, suppose the following holds: i. $\partial_{\theta}g(\theta,\gamma)$ is continuous in $\theta$ and $\gamma$; $\partial_{\theta}g(\theta_{n},\gamma_{n})$ has full rank for all $n\geq 1$, ii. $n\times\lambda_{\min}\left(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})\right)\to\infty$, $\lambda_{\max}\left(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})\right)\leq\overline{\lambda}<\infty$, iii. $\sup_{\|\partial_{\theta}g(\theta_{n},\gamma_{n})(\theta-\theta_{n})\|\leq r_{n}}\sqrt{n}\|[\bar{g}_{n}(\theta)-\bar{g}_{n}(\theta_{n})]-[g(\theta,\gamma_{n})-g(\theta_{n},\gamma_{n})]\|\overset{p}{\to}0$, iv. there exists $\varepsilon>0$, $\underline{C}>0$ such that for $\|\theta-\theta_{n}\|\leq\varepsilon$, $\|g(\theta,\gamma_{n})\|\geq\underline{C}\|\partial_{\theta}g(\theta_{n},\gamma_{n})(\theta-\theta_{n})\|$, and $\sup_{\|\partial_{\theta}g(\theta_{n},\gamma_{n})(\theta-\theta_{n})\|\leq r_{n}}\|g(\theta,\gamma_{n})-g(\theta_{n},\gamma_{n})-\partial_{\theta}g(\theta_{n},\gamma_{n})(\theta-\theta_{n})\|=O(r_{n}^{2})$, v. $\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n}\to R_{0}$, where $R_{0}$ is a full rank matrix.

Assumption 4 provides sufficient conditions to establish asymptotic normality of $\hat{\theta}_{n}-\theta_{n}$ at a potentially slower than $\sqrt{n}$-rate. Condition i. is standard and ensures the model is locally identified. Condition ii. allows the Jacobian to be vanishing at a slower than $\sqrt{n}$-rate in some directions. Condition iii. is a stochastic equicontinuity condition. Condition iv. implies that the Taylor remainder is quadratic under the weaker norm $\|\partial_{\theta}g(\theta_{n},\gamma_{n})(\cdot)\|$, which is the relevant norm for convergence when $\gamma_{n}\in\Gamma_{0}(\infty)$. Indeed, Lemma A2 establishes that $\sqrt{n}\|\partial_{\theta}g(\theta_{n},\gamma_{n})(\hat{\theta}_{n}-\theta_{n})\|=O_{p}(1)$. Condition iv. excludes settings where the non-linear remainder dominates the first-order term. [Footnote 13: These second or higher-order identification issues are not considered in the main text; additional results for the quasi-Jacobian under higher-order identification are given in the Supplement.] Condition v. is analogous to Assumption 3iv in Antoine and Renault (2012). It requires a rescaling for which the Jacobian is non-singular in the limit. For instance, under a singular value decomposition of the form $\partial_{\theta}g(\theta_{n},\gamma_{n})=UD_{n}V^{\prime}$, we have $\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n}=UV^{\prime}=R_{0}$. The rescaling corrects for the possibly vanishing, but non-zero, terms in the diagonal $D_{n}$. Antoine and Renault (2021, Sec2.2) discuss conditions relating to Assumption 4 in more detail.
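The identity $\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n}=UV^{\prime}$ follows in three lines from the definition of $H_{n}$. The sketch assumes a thin singular value decomposition in which $U$ has orthonormal columns, $V$ is a $d_{\theta}\times d_{\theta}$ orthogonal matrix, and $D_{n}$ is diagonal with strictly positive entries (full rank, as in condition i.); these conventions are assumptions made here for the illustration.

```latex
\begin{aligned}
\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})
  &= V D_{n} U^{\prime} U D_{n} V^{\prime} = V D_{n}^{2} V^{\prime}, \\
H_{n} &= \left(V D_{n}^{2} V^{\prime}\right)^{-1/2} = V D_{n}^{-1} V^{\prime}, \\
\partial_{\theta}g(\theta_{n},\gamma_{n}) H_{n}
  &= U D_{n} V^{\prime} V D_{n}^{-1} V^{\prime} = U V^{\prime}.
\end{aligned}
```

The diagonal entries of $D_{n}$ cancel exactly, so the limit $R_{0}$ has full rank even when some of them vanish as $n$ grows.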

Proposition 1 (Asymptotic Distribution for (Semi)-Strong Sequences).

Let $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$. Let $\text{AR}_{n}(\theta_{1n})=\inf_{\theta_{2}\in\Theta_{2}}n\|\bar{g}_{n}(\theta_{1n},\theta_{2})\|_{\hat{V}_{n}^{-1}}^{2}$. If Assumptions 1, 3 and 4 hold, then:

[TABLE]

Proposition 1 implies that the test is asymptotically valid for any choice of $\hat{d}_{n}\in\{0,\dots,d_{\theta_{2}}\}$ and asymptotically non-conservative if $\hat{d}_{n}=d_{\theta_{2}}$ wpa 1. Furthermore, for just-identified models $\text{QLR}_{n}(\theta_{1})=\text{AR}_{n}(\theta_{1})$ , and the test is asymptotically efficient if $\hat{d}_{n}=d_{\theta_{2}}$ wpa 1.

Linear reparameterization.

As in the MA(1) example, the derivations rely on a one-to-one linear reparameterization $\beta=M\theta=(\beta_{1}^{\prime},\beta_{2}^{\prime})^{\prime}$ with $\beta_{1}$ uniquely and $\beta_{2}$ set identified. The following steps construct the reparameterization, which is not implemented in practice: the span of right-singular vectors associated with singular values below $\underline{\lambda}_{n}$ consistently estimates the span of identification failure. The following applies to just and over-identified models.

First, take $\gamma_{0}\in\Gamma$ and collect all solutions to the moment conditions $\Theta_{0}=\{\theta\in\Theta,g(\theta,\gamma_{0})=0\}$ . Let $V_{2}=\text{span}(\{v_{2}=\theta_{0}^{1}-\theta_{0}^{2},(\theta_{0}^{1},\theta_{0}^{2})\in\Theta_{0}\times\Theta_{0}\})$ and $V_{1}=V_{2}^{\perp}$ . If $V_{2}=\{0\}$ , then $V_{1}=\mathbb{R}^{d_{\theta}}$ which implies that $\Theta_{0}$ is a singleton; i.e. the parameters are uniquely identified. This is the case when $\gamma_{0}\in\Gamma_{1}$ . If $\{0\}\subset V_{2}$ strictly, then $V_{1}\subset\mathbb{R}^{d_{\theta}}$ strictly; i.e. the parameters are set identified. This is the case when $\gamma_{0}\in\Gamma_{0}$ . As in the MA(1) example, by projection $P_{V_{1}}(\theta_{0}^{1}-\theta_{0}^{2})=0$ for any two $\theta_{0}^{1},\theta_{0}^{2}\in\Theta_{0}$ ; i.e. the solution is unique on $V_{1}$ . In contrast, for any non-zero $v_{2}\in V_{2}$ , there exist two distinct solutions $(\theta_{0}^{1},\theta_{0}^{2})\in\Theta_{0}\times\Theta_{0}$ such that $v_{2}^{\prime}(\theta_{0}^{1}-\theta_{0}^{2})\neq 0$ , by construction. Define $\beta_{1}$ as the projection of $\theta$ on $V_{1}$ and $\beta_{2}$ the projection on $V_{2}$ . The matrix $M$ combines the bases of $V_{1}$ and $V_{2}$ . As illustrated by the MA(1) example, it may not be possible to improve on this linear reparameterization with a non-linear one without some further structure on the moments or the model. The reparameterization is defined up to a rotation on $V_{1}$ and $V_{2}$ , respectively.
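As a rough illustration of this construction, the sketch below builds $V_{1}$ , $V_{2}$ and the projections $(\beta_{1},\beta_{2})$ for a hypothetical two-parameter model whose solution set is the line $\theta_{1}+\theta_{2}=1$ . The moment function, the three solutions and all function names are assumptions made for the example; they are not taken from the paper.

```python
import math

# Hypothetical example: g(theta) = theta_1 + theta_2 - 1, so every theta on the
# line theta_1 + theta_2 = 1 solves the moment condition.
solutions = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]  # three points of Theta_0

def normalize(v):
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# V_2: span of differences between solutions (a single direction here).
diff = (solutions[0][0] - solutions[1][0], solutions[0][1] - solutions[1][1])
v2 = normalize(diff)
# V_1: orthogonal complement of V_2 in R^2 (rotate v2 by 90 degrees).
v1 = (-v2[1], v2[0])

def reparameterize(theta):
    """beta = M theta, where the rows of M are the bases of V_1 and V_2."""
    return (dot(v1, theta), dot(v2, theta))

betas = [reparameterize(theta) for theta in solutions]
for theta, beta in zip(solutions, betas):
    print(theta, "-> beta_1 = %.4f, beta_2 = %.4f" % beta)
```

In this toy case $\beta_{1}$ takes the same value at every solution, while $\beta_{2}$ varies along the identified set.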

For testing $H_{0}:\theta_{1}=\theta_{10}$ , the identification status of the nuisance parameters $\theta_{2}$ matters. Consider a further sub-decomposition $(\beta_{1},\beta_{21},\beta_{22})$ where only $\beta_{22}$ is unidentified under the restriction $\theta_{1}=\theta_{10}$ . To find it, take $V_{22}=\text{span}(\{\theta_{0}^{1}-\theta_{0}^{2},(\theta_{0}^{1},\theta_{0}^{2})\in\Theta_{0}\times\Theta_{0},P_{\theta_{1}}\theta_{0}^{1}=P_{\theta_{1}}\theta_{0}^{2}=\theta_{10}\})$ and follow the same steps as above. By construction, $V_{22}$ is a subset of $V_{2}$ ; also, $\theta_{1}$ is in $V_{22}^{\perp}$ and $\beta_{22}$ is the subset of $\theta_{2}$ which is unidentified under $H_{0}$ . [Footnote 14: Note that, by linearity and by construction, $\text{span}(P_{V_{22}})=\text{span}(P_{V_{22}}P_{\theta_{1}}^{\perp})\subseteq\text{span}(P_{V_{2}}P_{\theta_{1}}^{\perp})$ .]

Now consider sequences $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ with $\gamma_{0}\in\Gamma_{0}$ . Combine the linear reparameterization with the continuity of $g$ with respect to $\theta$ and $\gamma$ to find, using the Maximum Theorem [Footnote 15: To apply the Maximum Theorem, note that by continuity of $g(\cdot,\gamma_{0})$ and compactness of $\Theta$ , both $\Theta_{0}$ and $\mathcal{B}_{2}^{0}$ are compact subsets of $\mathbb{R}^{d_{\theta}}$ and $\mathbb{R}^{d_{\beta_{2}}}$ , respectively. Similar equations can be derived for $(\beta_{1},\beta_{21},\beta_{22})$ with the added constraint $\theta_{1}=\theta_{1n}$ .], that for all $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ , any $\varepsilon>0$ , and letting $\beta_{n}=M\theta_{n}$ :

[TABLE]

where $\mathcal{B}_{2}^{0}=P_{V_{2}}\Theta_{0}$ is the identified set for $\beta_{2}$ when $(\theta,\gamma)=(\theta_{0},\gamma_{0})$ . The first limit implies $\beta_{1}$ is consistently estimable, while the second and third imply that the population objective function becomes flat (only) on $\mathcal{B}_{2}^{0}$ . The decomposition so far separates $\beta_{1}$ point identified from $\beta_{2}$ set unidentified when $\gamma=\gamma_{0}$ . [Footnote 16: Note that for the class of models considered in Andrews and Cheng (2012), their parameter $\beta$ which is point identified and determines identification strength is included in the vector $\beta_{1}$ constructed here.]

If there is a single source of identification failure, then $\sup_{\beta_{2}\in\mathcal{B}_{2}^{0}}\sqrt{n}\|g(\beta_{1n},\beta_{2},\gamma_{n})\|$ is determined by a scalar subset of $\gamma_{n}$ , and is bounded above for weak sequences. To illustrate, consider the linear IV example again with a single endogenous regressor $x_{i}$ and one instrument $z_{i}$ . In this case $\sqrt{n}\|g(\beta_{1n},\beta_{2},\gamma_{n})\|=\sqrt{n}|\text{cov}(x_{i},z_{i})|\times|\beta_{2}-\beta_{2n}|$ depends on the scalar $\sqrt{n}|\text{cov}(x_{i},z_{i})|$ being bounded which characterizes weak sequences. In the case with multiple sources of identification failure, there may be mixed identification strength, and some components of $\beta_{2}$ may be (semi)-strongly identified, so the reparameterization needs to be further refined. This is deferred to Appendix E.
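The role of the scalar $\sqrt{n}|\text{cov}(x_{i},z_{i})|$ can be seen in a small numerical sketch. The drifting covariances below ( $c/\sqrt{n}$ , $c\,n^{-1/4}$ , and a fixed $c$ ) are hypothetical choices used only to mimic weak, semi-strong, and strong sequences; the constant $c$ and the distance $|\beta_{2}-\beta_{2n}|$ are likewise assumptions.

```python
import math

def scaled_moment(n, cov_xz, beta_gap):
    """sqrt(n) * |cov(x, z)| * |beta_2 - beta_2n| for the scalar linear IV moment."""
    return math.sqrt(n) * abs(cov_xz) * abs(beta_gap)

c = 0.5          # hypothetical constant
beta_gap = 1.0   # hypothetical distance |beta_2 - beta_2n|
sample_sizes = [100, 1000, 10000, 100000]

# Covariance shrinking at the sqrt(n) rate: the scaled moment stays bounded.
weak = [scaled_moment(n, c / math.sqrt(n), beta_gap) for n in sample_sizes]
# Covariance shrinking more slowly: the scaled moment diverges, but slowly.
semi_strong = [scaled_moment(n, c * n ** -0.25, beta_gap) for n in sample_sizes]
# Fixed covariance: the scaled moment diverges at the sqrt(n) rate.
strong = [scaled_moment(n, c, beta_gap) for n in sample_sizes]

for n, w, s, t in zip(sample_sizes, weak, semi_strong, strong):
    print(n, round(w, 3), round(s, 3), round(t, 3))
```

Only the first sequence keeps $\sqrt{n}\|g(\beta_{1n},\beta_{2},\gamma_{n})\|$ bounded as $n$ grows, which is the behaviour that the text associates with weak sequences.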

Assumption 5 (Weak Sequences).

Let $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ with $\gamma_{n}\in\Gamma_{0}(b)$ . Let $\mathcal{B}_{n}=\{\beta\in\mathcal{B},M^{-1}\beta=(\theta_{1n}^{\prime},\theta_{2}^{\prime})^{\prime},\theta_{2}\in\Theta_{2}\}$ be the null-constrained space for $\beta$ . There exists $\tilde{\delta}(\cdot)\geq 0$ continuous satisfying $\sqrt{n}\tilde{\delta}(\gamma_{n})\to\infty$ and $\tilde{h}(\cdot)$ strictly positive, and two non-empty and non-singleton sets $\mathcal{B}_{2}^{0}\subset\mathbb{R}^{d_{\beta_{2}}}$ and $\mathcal{B}_{22}^{0}\subset\mathbb{R}^{d_{\beta_{22}}}$ such that for any $\varepsilon>0$ :

i. $\inf_{\beta_{2},\|\beta_{1}-\beta_{1n}\|\geq\varepsilon}\|g(\beta_{1},\beta_{2},\gamma_{n})\|\geq\tilde{\delta}(\gamma_{n})\tilde{h}(\varepsilon)$ , $\limsup_{n\to\infty}\sup_{\beta_{2}\in\mathcal{B}_{2}^{0}}\sqrt{n}\|g(\beta_{1n},\beta_{2},\gamma_{n})\|<\infty$ , and $\inf_{\beta_{1},d(\beta_{2},\mathcal{B}_{2}^{0})\geq\varepsilon}\|g(\beta_{1},\beta_{2},\gamma_{n})\|\geq\tilde{\delta}(\gamma_{n})\tilde{h}(\varepsilon)$ ,

ii. $\inf_{\beta_{22},\|\beta_{1}-\beta_{1n}\|+\|\beta_{21}-\beta_{21n}\|\geq\varepsilon}\|g(\beta_{1},\beta_{21},\beta_{22},\gamma_{n})\|\geq\tilde{\delta}(\gamma_{n})\tilde{h}(\varepsilon)$ , $\limsup_{n\to\infty}\sup_{\beta_{22}\in\mathcal{B}_{22}^{0}}\sqrt{n}\|g(\beta_{1n},\beta_{21n},\beta_{22},\gamma_{n})\|<\infty$ , and $\inf_{\beta_{1},\beta_{21},d(\beta_{22},\mathcal{B}_{22}^{0})\geq\varepsilon}\|g(\beta_{1},\beta_{2},\gamma_{n})\|\geq\tilde{\delta}(\gamma_{n})\tilde{h}(\varepsilon)$ , where the infs are taken over the constrained space $(\beta_{1}^{\prime},\beta_{21}^{\prime},\beta_{22}^{\prime})^{\prime}\in\mathcal{B}_{n}$ .

Assumption 5 adds this additional structure to (7)-(9), where $\beta_{1}$ is assumed semi-strongly and $\beta_{2}$ weakly identified. The first part of Assumption 5 i. implies $\beta_{1}$ is consistently estimable, allowing for some components to be semi-strongly identified. The second and third parts imply the objective function is flat with respect to $\beta_{2}$ but only on the identified set $\mathcal{B}_{2}^{0}$ . For the quasi-Jacobian, the condition $\limsup_{n\to\infty}\sup_{\beta_{2}\in\mathcal{B}_{2}^{0}}\sqrt{n}\|g(\beta_{1n},\beta_{2},\gamma_{n})\|<\infty$ implies that $\|\overline{g}_{n}(\beta_{1n},\beta_{2})\|_{W_{n}}\leq\kappa_{n}$ uniformly in $\beta_{2}\in\mathcal{B}_{2}^{0}$ with probability approaching one, so that Step 2.i of the procedure consistently estimates the identified set and all directions of identification failure. Similarly, condition ii. repeats the conditions under the restriction that $\theta_{1}=\theta_{1n}$ . The parameters $(\beta_{1},\beta_{21})$ correspond to the directions that are consistently estimable. To simplify notation, the Proposition below denotes as $\phi$ these $d_{\theta}-d_{\theta_{1}}-d_{\beta_{22}}=d_{\phi}$ coefficients that are consistently estimable and semi-strongly identified under $H_{0}:\theta_{1}=\theta_{1n}$ .

Proposition 2 (Asymptotic Distribution for Weak Sequences).

Let $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ . Suppose there is a linear reparameterization $M_{\phi}$ invertible, $M_{\phi}\theta=(\theta_{1}^{\prime},\phi^{\prime},\beta_{22}^{\prime})^{\prime}$ , such that the moment function $\phi\to\bar{g}_{n}(\theta_{1n},\phi,\beta_{22n})$ satisfies Assumptions 1, 3 and 4, then:

[TABLE]

Proposition 2 implies that the test procedure has limiting null rejection probability bounded by the nominal size for weak sequences as long as $\hat{d}_{n}\leq d_{\beta_{1}}+d_{\beta_{21}}-d_{\theta_{1}}=d_{\phi}$ wpa 1, since $\mathbb{P}_{\gamma_{n}}\left(\text{AR}_{n}(\theta_{1n})\geq\chi^{2}_{1-\alpha}(d_{g}-\hat{d}_{n})\right)\leq\mathbb{P}_{\gamma_{n}}\left(\inf_{\phi\in\Phi}\|\bar{g}_{n}(\theta_{1n},\phi,\beta_{22n})\|^{2}_{V_{n}^{-1}}\geq\chi^{2}_{1-\alpha}(d_{g}-d_{\phi})\right)+o(1)\to\alpha$ . Note that Assumption 3 with respect to $\phi$ is implied by Assumption 5 ii.

3 Asymptotic Behaviour of the quasi-Jacobian

As discussed above, the properties of the ICS and test procedure are tied to those of the quasi-Jacobian under different identification regimes. The following derives the large sample behaviour of the sup-norm and least-squares quasi-Jacobian matrices $B_{n,\infty}$ under strong, semi-strong, and weak identification.

3.1 Strong and Semi-Strong Sequences

Theorem 2 (quasi-Jacobian and Jacobian Equivalence).

Let $B_{n,\infty}$ denote the quasi-Jacobian. Suppose $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ with $\gamma_{0}\in\Gamma_{1}$ or $\gamma_{n}\in\Gamma_{0}(\infty)$ . Suppose that Assumptions 1, 2, and 4 hold. If $\kappa_{n}^{-1}\delta(\gamma_{n})\to\infty$ and $\kappa_{n}^{2}=o\left(\lambda_{\min}(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n}))\right)$ , then:

[TABLE]

where $n^{-1/2}\kappa_{n}^{-1}\to 0$ by assumption and $H_{n}=\left(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})\right)^{-1/2}$ .

The proof is given in Appendix B. Theorem 2 implies that, for (semi)-strong sequences, the quasi-Jacobian and the Jacobian are asymptotically equivalent after re-scaling to a non-singular limit. For non-smooth moments, where the sample Jacobian is not defined, as in quantile-IV regression or SMM estimation of discrete choice models, $B_{n,\infty}$ can be used in the sandwich formula to compute standard errors for $\hat{\theta}_{n}$ . Assumption 4 v. implies $\lambda_{\min}(H_{n}\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n})=\lambda_{\min}(R_{0}^{\prime}R_{0})+o(1)\to 1$ , hence:

[TABLE]

For sufficiently strong sequences such that $\underline{\lambda}_{n}^{2}=o\left(\lambda_{\min}(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n}))\right)$ , where $\underline{\lambda}_{n}$ is the cutoff in Section 2.3, this implies that $\hat{d}_{n}=d_{\theta_{2}}$ wpa 1.
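A short derivation, using only the singular value decomposition mentioned after Assumption 4 and the definition of $H_{n}$ above, may help to see why the rescaled Jacobian has a non-singular limit. It is a sketch under two assumptions: the Jacobian has full column rank for each $n$ , so that $D_{n}$ is invertible, and the decomposition is the thin one with $U^{\prime}U=I$ and $V$ orthogonal. The shorthand $G_{n}=\partial_{\theta}g(\theta_{n},\gamma_{n})$ is introduced here for convenience only.

$$
\begin{aligned}
G_{n} &= UD_{n}V^{\prime}, \\
H_{n} &= (G_{n}^{\prime}G_{n})^{-1/2} = (VD_{n}^{2}V^{\prime})^{-1/2} = VD_{n}^{-1}V^{\prime}, \\
G_{n}H_{n} &= UD_{n}V^{\prime}VD_{n}^{-1}V^{\prime} = UV^{\prime}, \\
H_{n}G_{n}^{\prime}G_{n}H_{n} &= VU^{\prime}UV^{\prime} = I.
\end{aligned}
$$

Every eigenvalue of the rescaled matrix therefore equals one, whatever the rate at which the diagonal of $D_{n}$ vanishes, which is consistent with $\lambda_{\min}(R_{0}^{\prime}R_{0})=1$ in the limit.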

3.2 Weak Sequences

Theorem 3 (Asymptotic Singularity of the quasi-Jacobian).

Suppose $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ with $\gamma_{n}\in\Gamma_{0}(b),b\in[0,\infty)$ and Assumptions 1, 2, 3, 5 hold. For any $v=(0_{d_{\beta_{1}}},\beta_{2}^{1\prime}-\beta_{2}^{2\prime})^{\prime}$ , with $\beta_{2}^{1},\beta_{2}^{2}\in\mathcal{B}_{2}^{0}\times\mathcal{B}_{2}^{0}$ in the identified set for $\beta_{2}$ , $\|B_{n,\infty}M^{-1}v\|\leq O_{p}(\kappa_{n})$ . Let $\lambda_{j}(B_{n,\infty}^{\prime}B_{n,\infty})\geq 0$ denote the eigenvalues of $B_{n,\infty}^{\prime}B_{n,\infty}$ in increasing order, then:

[TABLE]

In particular, $\lambda_{\min}(B_{n,\infty}^{\prime}B_{n,\infty})\leq O_{p}(\kappa_{n}^{2})$ .

Theorem 3 shows that when $\theta$ is not uniquely identified, the quasi-Jacobian vanishes at a $\kappa_{n}$ rate in all directions associated with the identification failure. The span of these directions has dimension $d_{\beta_{2}}$ so that $B_{n,\infty}$ vanishes on a subspace of dimension $d_{\beta_{2}}$ . Hence, small singular values indicate an identification failure, and their number indicates the number of weakly identified coefficients. The constants involved in the $O_{p}$ terms are made explicit in the proof. The following Proposition extends these results to $B_{n,\infty}P_{\theta_{1}}^{\perp}$ , which focuses on the identification status of the nuisance parameters only. For both results, the proof is similar to the derivations used for the MA(1) example.

Proposition 3 (quasi-Jacobian after Projection).

Suppose $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ with $\gamma_{n}\in\Gamma_{0}(b),b\in[0,\infty)$ and Assumptions 1, 2, 3, 5 hold. For any $v=(0_{d_{\beta_{1}}+d_{\beta_{21}}},\beta_{22}^{1\prime}-\beta_{22}^{2\prime})^{\prime}$ , with $\beta_{22}^{1},\beta_{22}^{2}\in\mathcal{B}_{22}^{0}\times\mathcal{B}_{22}^{0}$ the identified set for $\beta_{22}$ under the null, $\|B_{n,\infty}M^{-1}v\|\leq O_{p}(\kappa_{n})$ . Let $\lambda_{j}(P_{\theta_{1}}^{\perp}B_{n,\infty}^{\prime}B_{n,\infty}P_{\theta_{1}}^{\perp})\geq 0$ denote the eigenvalues of $P_{\theta_{1}}^{\perp}B_{n,\infty}^{\prime}B_{n,\infty}P_{\theta_{1}}^{\perp}$ in increasing order:

[TABLE]

4 Asymptotic Properties of the Test Procedure

As discussed above, the ICS procedure used to compute $\hat{d}_{n}$ relies on two normalizations that ensure invariance to rescaling of the sample moments and/or the parameters. The first normalizing matrix is $\Sigma_{n}$ computed in the procedure outlined above. $\Sigma_{n}^{-1/2}$ is shown to be bounded above in directions associated with the identification failure in Lemma D4, so that Proposition 3 extends to the normalized $B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}$ . Under strong identification, Lemma D3 implies that $\Sigma_{n}^{-1/2}=O(\kappa_{n}^{-1})$ so that $B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}$ diverges at a $\kappa_{n}^{-1}$ rate in $d_{\theta}-d_{\theta_{1}}$ directions. As a result, $B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}$ vanishes at a $\kappa_{n}$ -rate in directions where identification fails, and diverges at a $\kappa_{n}^{-1}$ -rate when all parameters are strongly identified.

The second normalizing matrix is $\overline{V}_{n}=\int_{\Theta}\hat{V}_{n}(\theta)\hat{\pi}_{n}(\theta)d\theta$ , where $\hat{V}_{n}(\theta)$ is an estimator of the asymptotic variance $\lim_{n\to\infty}\text{var}_{\gamma_{n}}(\sqrt{n}\overline{g}_{n}(\theta))$ . The Assumption below requires $\hat{V}_{n}(\theta)$ to be consistent and asymptotically non-singular so that the normalization does not alter the asymptotic properties of $B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}$ .

Assumption 6.

$V(\theta,\gamma)=\lim_{n\to\infty}\text{var}_{\gamma}(\sqrt{n}\overline{g}_{n}(\theta))$ is non-singular and $0<\underline{\lambda}_{V}\leq\lambda_{\min}(V(\theta,\gamma))\leq\lambda_{\max}(V(\theta,\gamma))\leq\overline{\lambda}_{V}<\infty$ for all $\theta\in\Theta$ , $\gamma\in\Gamma$ , and $\sup_{\theta\in\Theta,\gamma\in\Gamma}\|\hat{V}_{n}(\theta)-V(\theta,\gamma)\|=o_{p}(1)$ .

Theorem 4 (Asymptotic Size).

Suppose Assumptions 1-5 hold. Let $\underline{\lambda}_{n}\to 0$ such that $\kappa_{n}=o(\underline{\lambda}_{n})$ . Let $\hat{d}_{n}=\#\big{\{}j\in\{d_{\theta_{1}}+1,\dots,d_{\theta}\},\,\lambda_{j}(P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}B_{n,\infty}^{\prime}\overline{V}_{n}^{-1}B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp})>\underline{\lambda}_{n}^{2}\big{\}},$ then for any $\alpha\in(0,1)$ :

[TABLE]

For any sequence $\gamma_{n}\in\Gamma_{0}(\infty)\cup\Gamma_{1}$ such that $\underline{\lambda}^{2}_{n}=o(\lambda_{\min}(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})))$ :

[TABLE]

Theorem 4 establishes the uniform validity of the test procedure described in Section 2.3 under strong, semi-strong, and weak sequences. First, it is shown that the normalizations do not affect the predictions of Theorems 2, 3, and Proposition 3. Then, since $\overline{\Theta}$ and $\Gamma$ are compact, the worst-case rejection probability is attained by a converging subsequence which, using the stated assumptions, can be interpolated into a converging sequence in either $\Gamma_{0}(b)$ for some $b\in[0,\infty)$ , $\Gamma_{0}(\infty)$ , or converging in $\Gamma_{1}$ . The result then relies on two properties. The first is that $\hat{d}_{n}\geq d_{\beta_{22}}$ under weak identification, and the second is that $\text{AR}_{n}(\theta_{1n})=\inf_{\theta_{2}\in\Theta_{2}}\text{AR}_{n}(\theta_{1n},\theta_{2})\leq\text{AR}_{n}(\theta_{1n},\hat{\phi}_{n},\beta_{22n})$ which has a standard chi-squared limiting distribution with degrees of freedom that only depend on the dimension of $\bar{g}_{n}$ and the number of identified nuisance parameters. For just-identified models, the resulting procedure is efficient under strong identification since it uses the smallest valid critical value, and is equivalent to a quasi-Likelihood ratio test. For over-identified models, the test uses the smallest valid critical value for the projected AR test so it is non-conservative within that class. The results above can be extended to some other existing robust test statistics. For instance, the K-statistic of Kleibergen (2005) is such that, under additional regularity conditions, $\text{K}_{n}(\theta_{1n})=\inf_{\theta_{2}\in\Theta_{2}}\text{K}_{n}(\theta_{1n},\theta_{2})\leq\text{K}_{n}(\theta_{1n},\hat{\phi}_{n},\beta_{22n})$ which also has a chi-squared limiting distribution with reduced degrees of freedom.

5 Monte-Carlo Simulations

The finite-sample properties of the quasi-Jacobian matrix and the test procedure are illustrated using a consumption capital asset pricing model (CAPM) as in Wright (2003, Sec3).

Let $\delta$ , $\gamma$ measure time preference and relative risk aversion. $C_{t},D_{t},R_{t}$ are real consumption, dividends, and the gross asset return at time $t$ . The Euler equation is: $\mathbb{E}_{t}[\delta R_{t+1}(C_{t+1}/C_{t})^{-\gamma}-1]=0$ , where $C_{t+1}/C_{t}$ measures consumption growth. $R_{t}$ depends endogenously on $y_{t+1}=(c_{t+1},d_{t+1})^{\prime}$ , where $c_{t+1}=\log(C_{t+1}/C_{t})$ and $d_{t+1}=\log(D_{t+1}/D_{t})$ , which follows a first-order vector autoregressive (VAR) process: $y_{t+1}=\mu+\Phi y_{t}+u_{t+1}$ , where $u_{t+1}\overset{iid}{\sim}\mathcal{N}(0,\Lambda)$ . The sample moments are:

[TABLE]

where $Z_{t}=(1,R_{t},C_{t}/C_{t-1})^{\prime}$ . Tauchen (1986) illustrates how $(\mu,\Phi,\Lambda)$ affect the finite-sample properties of $\hat{\theta}_{n}=(\hat{\delta}_{n},\hat{\gamma}_{n})$ . The following considers three DGPs: Rank Failure (RF), Near Rank Failure (NRF) and Full Rank (FR). [Footnote 17: RF, NRF and FR correspond to RF1, NRF1 and FR in Wright (2003, p326).] Wright (2003) explains that they correspond to $\theta=(\delta,\gamma)$ being set, weakly, and strongly identified. NRF is calibrated to match annual U.S. data (Kocherlakota, 1990, Sec3).

Table 2 reports rejection rates for the method in Section 2.1 (Proj1), full projection inference using $\chi^{2}_{3}$ (Proj2) and $\chi^{2}_{2}$ (Proj3) critical values, as well as a t-test with standard normal critical value ( $t_{n}$ ). The empirically relevant sample sizes are $n=100,250$ , while $n=500,1000$ illustrate large-sample properties. The parameter space is $\Theta=[0.7,1.1]\times[0,10]$ . The t-test does not control size in RF and NRF. It is closer to nominal size for FR. However, as Figure H5 in Appendix H.2 shows, another global solution $\hat{\theta}_{n}\simeq(0.7,10)$ is estimated in about 1% and 0.05% of the replications for $n=100,250$ . Here, the parameters are locally strongly identified, but not globally. The sample Jacobian would not detect this issue, which leads to some over-rejection for the t-test. In comparison, the proposed procedure (Proj1) has null rejection rates below nominal size across sample sizes and DGPs.

Wright (2003, Sec3) and Antoine and Renault (2009, Sec5) explain that one coefficient is always strongly identified. Table 2 and Figure 2 confirm this. The procedure finds $\gamma$ to be weakly identified in nearly all replications for RF and NRF, and $\delta$ strongly identified. For FR, the procedure finds $\gamma$ weakly identified in $14\%$ and $0.5\%$ of replications when $n=100,250$ .

Figure 3 compares the power of the proposed procedure (AR1) with full projection inference (AR3), and projection inference with the nuisance parameter concentrated out (AR2) as well as the t-test when appropriate (FR with $n=250,500,1000$ ). The results show power improvement over full projection inference when the nuisance parameter is strongly identified, i.e. when testing hypotheses about $\gamma$ . When the model is strongly identified (FR), the procedure is less powerful than the t-test because of over-identification. Results for a just-identified specification with $Z_{t}=(1,R_{t})^{\prime}$ and a larger $\kappa_{n}$ are given in Appendix H.2. Another example in that Appendix compares the procedure with Andrews and Cheng (2012) for a non-linear regression.

6 Application to the Long-Run Risks Model

To illustrate the empirical content which can be gained from the quasi-Jacobian for inference, consider a simulated method of moments estimation of the long-run risks (LRR) model (Bansal and Yaron, 2004). There are two latent variables representing a persistent component to the level of consumption growth $x_{1,t}$ and stochastic volatility $x_{2,t}$ :

[TABLE]

where $f(x)=\sqrt{x}$ if $x\geq\sigma^{2}$ and $f(x)=\sigma^{2}/\sqrt{2\sigma^{2}-x}$ otherwise, as in Calvet and Czellar (2015, p346). Consumption and dividend growth $g_{t},g_{d,t}$ are then given by:

[TABLE]

where $(e_{t},w_{t},\eta_{t},u_{t})\sim\mathcal{N}(0,I)$ iid. Given an Epstein-Zin utility function, equilibrium conditions imply that financial variables, log-price dividend ratio $z_{m,t}$ , market return $r_{m,t}$ and the risk-free rate $r_{a,t}$ can be written as:

[TABLE]

where the coefficients $(A_{0,m},A_{1,m},A_{2,m},\kappa_{0,m},\kappa_{1,m},A_{0,r},A_{1,r},A_{2,r})$ are computed numerically as a solution of a non-linear system of equations involving the full vector of 12 parameters $\theta=(\rho,\phi_{e},\sigma,\nu,\sigma_{w},\mu,\mu_{d},\phi,\phi_{d},\delta,\gamma,\psi^{-1})$ where $\delta$ is the discount factor, $\gamma$ risk-aversion, and $\psi^{-1}$ the inverse intertemporal elasticity of substitution (IES). See Bansal and Yaron (2004) for details. The variables above need to be further time-aggregated from the monthly decision interval to match the quarterly frequency of the data. There are a number of estimations of this model using SMM and Indirect Inference [Footnote 18: See Bansal et al. (2007); Hasseltoft (2012); Calvet and Czellar (2015); Grammig and Küchlin (2018).], GMM [Footnote 19: See Constantinides and Ghosh (2011); Bansal et al. (2012, 2016).], or Bayesian estimation [Footnote 20: See Schorfheide et al. (2018).]. There are, however, several concerns for the identifiability of the parameters. Calvet and Czellar (2015) show that the latent variables $(x_{1,t},x_{2,t})$ cannot be recovered from the data for uncountably many values of $\theta$ , resulting in highly irregular GMM and likelihood objective functions. Grammig and Küchlin (2018) find that the stochastic volatility component is poorly identified and calibrate $\nu=\sigma_{w}=0$ . However, stochastic volatility in long-term consumption growth has important implications for asset prices (Schorfheide et al., 2018). Several papers report estimates with very small standard errors (see Grammig and Küchlin, 2018, Table 7, p24), but estimates can vary a lot across estimations. This suggests that some parameters are likely not globally identified but might be locally identified.
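The function $f$ defined above replaces the square root below the threshold $\sigma^{2}$ by $\sigma^{2}/\sqrt{2\sigma^{2}-x}$ , which is positive and well defined for any $x<2\sigma^{2}$ , including negative values. As a quick sanity check (a sketch only: the value of `sigma` is an arbitrary assumption and the Python function name is hypothetical), the two branches agree in level and in slope at $x=\sigma^{2}$ :

```python
import math

def f(x, sigma):
    """sqrt(x) above sigma^2; sigma^2 / sqrt(2 sigma^2 - x) below (needs x < 2 sigma^2)."""
    s2 = sigma ** 2
    if x >= s2:
        return math.sqrt(x)
    return s2 / math.sqrt(2.0 * s2 - x)

sigma = 0.01      # hypothetical value, for illustration only
s2 = sigma ** 2
h = 1e-9          # small step for one-sided finite differences

level_at_threshold = f(s2, sigma)                    # upper branch, equals sigma
level_just_below = f(s2 - h, sigma)                  # lower branch, close to sigma
slope_right = (f(s2 + h, sigma) - f(s2, sigma)) / h  # upper-branch slope
slope_left = (f(s2, sigma) - f(s2 - h, sigma)) / h   # lower-branch slope

print(level_at_threshold, level_just_below)
print(slope_right, slope_left, 1.0 / (2.0 * sigma))
```

Both one-sided slopes are close to $1/(2\sigma)$ , so the modification is smooth at the threshold.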

The following considers joint inference for the two preference parameters $\theta_{1}=(\gamma,\psi^{-1})$ . The remaining coefficients are $\theta_{2}=(\rho,\phi_{e},\sigma,\nu,\sigma_{w},\mu,\mu_{d},\phi,\phi_{d},\delta)$ . Amongst these nuisance parameters, it seems reasonable to think that several are (semi)-strongly identified. However, the asset pricing coefficients $(A_{0,m},\dots)$ are highly non-linear functions of $\theta$ so it is arguably more difficult to pin down exactly how many and which ones are well identified. Nevertheless, the results in this paper imply that $B_{n,\infty}P_{\theta_{1}}^{\perp}$ can determine how many nuisance parameters are weakly identified with high probability.

The moment conditions used for inference are based on matching the following sample with simulated moments: means of all variables, variances of $g_{t},g_{d,t},z_{m,t}$ , AR $(2)$ coefficients of $g_{t}$ , and autocorrelation of $g_{t}^{2}$ . [Footnote 21: A quasi-difference $z_{m,t}-0.95z_{m,t-1}$ is applied beforehand because $z_{m,t}$ is very persistent, making $\hat{V}_{n}$ nearly singular; the quasi-differencing solves this issue and makes the estimation below more stable.] These just-identified moments match quantities of interest that are commonly reported in calibrations or post-estimation, see e.g. Beeler and Campbell (2012). The estimation is conducted using U.S. data shared by Grammig and Küchlin (2018) for $(g_{t},g_{d,t},z_{m,t},r_{m,t},r_{a,t})$ over 1947Q2-2014Q4, totalling $n=271$ observations. The simulated moments are computed over $S=2$ samples. The bounds for the optimization space $\Theta$ are $\rho\in[0.9,0.995],\phi_{e}\in[0,0.1],\sigma\in[10^{-4},0.1],\nu\in[0,0.995],10^{5}\times\sigma_{w}\in[0,2],\mu\in[-0.035,0.035],\mu_{d}\in[-0.035,0.035],\phi\in[0,10],\phi_{d}\in[0,10],\delta\in[0.93,1.2],\gamma\in[0.05,25],\psi^{-1}\in[0.01,3]$ . Computations are conducted in R and C++ using Rcpp.

Table 3 compares the spectrum of the normalized Jacobian and quasi-Jacobian. [Footnote 22: The estimate $\hat{\theta}_{n}$ used for the Jacobian is computed by using the calibration in Bansal and Yaron (2004) as starting value, and alternating between the Nelder-Mead and bobyqa optimizers until convergence. Note that different seeds for the simulated samples yield very different estimates but similar fitted moments. Also, the sample gradient is not available analytically; it is computed by finite differences which here is quite sensitive to the choice of step size.] Using the threshold $\underline{\lambda}_{n}=\sqrt{2\log(n_{S})/n}=0.21$ implies that $B_{n,\infty}$ detects $5$ directions of identification failure, with an additional singular value just above the threshold. In comparison, the gradient is small in $3$ directions. After projecting out $\theta_{1}=(\gamma,\psi^{-1})$ , there are $7$ singular values above the threshold, indicating $7$ (semi)-strongly identified parameters. Hence, inference for $(\gamma,\psi^{-1})$ relies on a $\chi^{2}_{5}(0.95)=11.1$ critical value. In comparison, full projection relies on $\chi^{2}_{12}(0.95)=21$ , and standard inference $\chi^{2}_{2}(0.95)=6$ .
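The critical values quoted above follow from the degrees-of-freedom rule $d_{g}-\hat{d}_{n}$ with $d_{g}=12$ moments. The sketch below recomputes them with a small standard-library routine (a series expansion of the regularized incomplete gamma function combined with bisection). The function names are hypothetical, and the routine is an illustration rather than the code used in the paper.

```python
import math

def chi2_cdf(x, df):
    """Chi-squared CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = a
    for _ in range(10000):
        k += 1.0
        term *= z / k
        total += term
        if term < total * 1e-16:
            break
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_quantile(p, df):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < p:   # expand the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

d_g = 12  # number of moments (just-identified model with 12 parameters)
for label, d_hat in [("proposed procedure, d_hat = 7", 7),
                     ("full projection, d_hat = 0", 0),
                     ("standard inference, d_hat = 10", 10)]:
    print(label, round(chi2_quantile(0.95, d_g - d_hat), 2))
```

The three quantiles are approximately 11.07, 21.03 and 5.99, which round to the 11.1, 21 and 6 reported in the text.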

Figure 4 reports 5000 draws of $\theta_{1}$ such that $\text{AR}_{n}(\theta_{1})\leq\chi^{2}_{5}(0.95)$ using the Population Monte Carlo algorithm in Appendix F, plus their convex hull in blue. Values for $\gamma$ are contained in $[5.41,25]$ and $\psi^{-1}\in[0.01,0.90]$ . This excludes several regions of interest. First, we can reject $H_{0}:\psi\leq 1$ at the 95% confidence level, i.e. the IES is strictly greater than unity. Second, we can reject $H_{0}:\gamma=\psi^{-1}$ and conclude that the utility function is not CRRA. Finally, the confidence set favours $H_{1}:\gamma>\psi^{-1}$ over $H_{0}:\gamma\leq\psi^{-1}$ . Under $H_{1}$ , households prefer an early resolution of uncertainty; their preference for consumption smoothing is less than their relative risk aversion. Although not reported here, note that full projection inference cannot reject some of these null hypotheses. As a robustness check with respect to tuning parameters, Appendix H.3 finds the same results using $\chi^{2}_{6}$ critical values (Figure H14) and using a larger value for $\kappa_{n}$ (Table H6).

7 Conclusion

This paper introduces a quasi-Jacobian matrix which is asymptotically equivalent to the usual Jacobian matrix under strong and semi-strong identification but is asymptotically singular when global identification fails. This can be useful because the Jacobian is not always informative about global identification failures. While the inference procedure relies on the AR statistic, extending the results to the robust score test is straightforward, as discussed earlier. For overidentified models, it could be interesting to extend the theory to more powerful test statistics such as the CQLR/AR test in Andrews (2017). Another concern could be that a given choice of moments does not identify the parameters but another set of moments might. This is a moment selection problem. In that case, it could be interesting to extend the quasi-Jacobian to a continuum of moment conditions which can be used for conditional GMM estimation (Carrasco and Florens, 2000), allowing the use of all available information rather than selecting finite dimensional moments.

Appendix A Preliminary Results

A.1 Preliminary results for Section 2

Lemma A1 (Strong and Semi-Strong Sequences: Consistency).

Let $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ . If $\gamma_{n}\in\Gamma_{0}(\infty)$ or $\gamma_{0}\in\Gamma_{1}$ and Assumptions 1, 3 hold, then $\|\hat{\theta}_{n}-\theta_{n}\|=o_{p}(1)$ .

Lemma A2 (Strong and Semi-Strong Sequences: Asymptotic Normality).

Let $(\theta_{n},\gamma_{n})\to(\theta_{0},\gamma_{0})$ . If $\gamma_{n}\in\Gamma_{0}(\infty)$ or $\gamma_{0}\in\Gamma_{1}$ and Assumptions 1, 3, 4 hold, then

[TABLE]

where $H_{n}=\left(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})\right)^{-1/2}$ , $\Sigma_{0}=(R_{0}^{\prime}W_{0}R_{0})^{-1}R_{0}^{\prime}W_{0}R_{0}(R_{0}^{\prime}W_{0}R_{0})^{-1}$ , $W_{0}=W(\theta_{0})$ .

Appendix B Proofs for the main results

B.1 Proofs for Section 2

Proof of Theorem 1:

For simplicity, the derivations for this Theorem rely on $K(x)=\mathbbm{1}_{x\in(-1,1)}$ ; see the proofs for Section 3 for derivations with other kernels. For $\kappa>0$ , let $(A_{\kappa,\infty},B_{\kappa,\infty})=\text{argmin}_{A,B}(\sup_{\|g(\theta)\|_{W}\leq\kappa}\|g(\theta)-A-B\theta\|)$ . By construction, $B_{\infty}=\lim_{\kappa\to 0}B_{\kappa,\infty}$ . To simplify notation, denote $g(\theta)=g(\theta,\gamma_{0})$ and $\partial_{\theta}g(\theta)=\partial_{\theta}g(\theta,\gamma_{0})$ . There are three cases to consider:

Case 1) Take $\{\theta^{1}_{0},\theta^{2}_{0}\}\subseteq\Theta_{0}$ non-singleton with $\theta^{1}_{0}\neq\theta_{0}^{2}$ . Take $\kappa>0$ , then $0=\|g(\theta^{1}_{0})\|_{W}=\|g(\theta_{0}^{2})\|_{W}\leq\kappa$ , by construction. Also by construction, $\sup_{\|g(\theta)\|_{W}\leq\kappa}\|g(\theta)-A_{\kappa,\infty}-B_{\kappa,\infty}\theta\|\leq\sup_{\|g(\theta)\|_{W}\leq\kappa}\|g(\theta)\|\leq\underline{\lambda}^{-1/2}_{W}\kappa$ . As a result, $\|g(\theta)-A_{\kappa,\infty}-B_{\kappa,\infty}\theta\|\leq\underline{\lambda}^{-1/2}_{W}\kappa$ for $\theta\in\{\theta^{1}_{0},\theta_{0}^{2}\}$ and the triangle inequality implies $\|B_{\kappa,\infty}(\theta^{1}_{0}-\theta_{0}^{2})\|\leq 2\underline{\lambda}^{-1/2}_{W}\kappa.$ Take the limit as $\kappa\to 0$ to find $B_{\infty}(\theta^{1}_{0}-\theta_{0}^{2})=0$ where $\theta^{1}_{0}-\theta_{0}^{2}\neq 0$ . Hence, $B_{\infty}$ is singular.

Case 2) $\Theta_{0}=\{\theta_{0}\}$ is singleton and $\partial_{\theta}g(\theta_{0})$ is singular. Take any vector $v\in\text{span}(\partial_{\theta}g(\theta_{0}))^{\perp}$ with $\|v\|=1$ . For the following, consider $\theta=\theta_{0}+\kappa^{1/\alpha}rv$ for some $r\in\mathbb{R}$ such that $\kappa^{1/\alpha}|r|\leq\overline{\varepsilon}$ . Then $\|g(\theta)\|_{W}=\|g(\theta)-g(\theta_{0})-\kappa^{1/\alpha}r\partial_{\theta}g(\theta_{0})v\|\leq\overline{\lambda}_{W}\overline{C}\kappa|r|^{\alpha}\leq\kappa$ for all $|r|\leq(\overline{\lambda}_{W}\overline{C})^{-1/\alpha}$ . As in Case 1), $\|g(\theta)-A_{\kappa,\infty}-B_{\kappa,\infty}\theta\|\leq\underline{\lambda}^{-1/2}_{W}\kappa$ for all $\theta=\theta_{0}+\kappa^{1/\alpha}rv$ with $|r|\leq(\overline{\lambda}_{W}\overline{C})^{-1/\alpha}$ . Then $\|B_{\kappa,\infty}(\theta-\theta_{0})\|\leq\|g(\theta)-A_{\kappa,\infty}-B_{\kappa,\infty}\theta\|+\|g(\theta_{0})-A_{\kappa,\infty}-B_{\kappa,\infty}\theta_{0}\|\leq 2\underline{\lambda}^{-1/2}_{W}\kappa$ . Take $r\neq 0$ , fixed, then $\theta-\theta_{0}=r\kappa^{1/\alpha}v$ and $\|B_{\kappa,\infty}v\|\leq 2r^{-1}\underline{\lambda}^{-1/2}_{W}\kappa^{1-1/\alpha}\to 0$ as $\kappa\to 0$ since $\alpha>1$ . This implies $B_{\infty}v=0$ ; $B_{\infty}$ is singular.

Case 3) $\Theta_{0}=\{\theta_{0}\}$ is singleton and $\partial_{\theta}g(\theta_{0})$ has full rank. Continuity and global identification imply $\|g(\theta)\|_{W}\geq\overline{\kappa}$ for some $\overline{\kappa}>0$ and all $\|\theta-\theta_{0}\|\geq\overline{\varepsilon}$ . Consider $0<\kappa<\overline{\kappa}$ so that $\|g(\theta)\|_{W}\leq\kappa$ implies $\|\theta-\theta_{0}\|\leq\overline{\varepsilon}$ . Let $0<\underline{\sigma}=\sigma_{\min}(\partial_{\theta}g(\theta_{0}))\leq\sigma_{\max}(\partial_{\theta}g(\theta_{0}))=\overline{\sigma}<\infty$ . For these values of $\theta$ , $\underline{\lambda}_{W}^{1/2}\underline{\sigma}\|\theta-\theta_{0}\|-\overline{\lambda}_{W}^{1/2}\overline{C}\|\theta-\theta_{0}\|^{\alpha}\leq\|g(\theta)\|_{W}\leq\overline{\lambda}_{W}^{1/2}\overline{\sigma}\|\theta-\theta_{0}\|+\overline{\lambda}_{W}^{1/2}\overline{C}\|\theta-\theta_{0}\|^{\alpha}$ . We can further assume, without loss of generality, that $\overline{\kappa}$ and thus $\overline{\varepsilon}$ are sufficiently small that $1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma}\|\theta-\theta_{0}\|\leq\underline{\lambda}_{W}^{1/2}\underline{\sigma}\|\theta-\theta_{0}\|-\overline{\lambda}_{W}^{1/2}\overline{C}\|\theta-\theta_{0}\|^{\alpha}$ and $\overline{\lambda}_{W}^{1/2}\overline{\sigma}\|\theta-\theta_{0}\|+\overline{\lambda}_{W}^{1/2}\overline{C}\|\theta-\theta_{0}\|^{\alpha}\leq 2\overline{\lambda}_{W}^{1/2}\overline{\sigma}\|\theta-\theta_{0}\|$ . Re-write $\theta=\theta_{0}+\kappa v$ for some vector $v$ , then $\|v\|\leq(2\overline{\lambda}_{W}^{1/2}\overline{\sigma})^{-1}$ implies $\|g(\theta)\|_{W}\leq\kappa$ . Likewise, $\|v\|>(1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma})^{-1}$ implies $\|g(\theta)\|_{W}>\kappa$ . 
Pick $(A,B)=(-\partial_{\theta}g(\theta_{0})\theta_{0},\partial_{\theta}g(\theta_{0}))$ , then by construction: $\sup_{\|g(\theta)\|_{W}\leq\kappa}\|g(\theta)-A_{\kappa,\infty}-B_{\kappa,\infty}\theta\|\leq\sup_{\|g(\theta)\|_{W}\leq\kappa}\|g(\theta)-g(\theta_{0})-\partial_{\theta}g(\theta_{0})(\theta-\theta_{0})\|\leq\sup_{\|g(\theta)\|_{W}\leq\kappa}\overline{C}\|\theta-\theta_{0}\|^{\alpha}\leq\overline{C}(1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma})^{-\alpha}\kappa^{\alpha}$ . Pick any $\|v\|\leq(2\overline{\lambda}_{W}^{1/2}\overline{\sigma})^{-1}$ , then $\|g(\theta)-A_{\kappa,\infty}-B_{\kappa,\infty}\theta\|\leq\overline{C}(1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma})^{-\alpha}\kappa^{\alpha}$ and $\|g(\theta_{0})-A_{\kappa,\infty}-B_{\kappa,\infty}\theta_{0}\|\leq\overline{C}(1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma})^{-\alpha}\kappa^{\alpha}$ . Then $\|g(\theta)-g(\theta_{0})-B_{\kappa,\infty}[\theta-\theta_{0}]\|\leq 2\overline{C}(1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma})^{-\alpha}\kappa^{\alpha}$ . This implies $\|[\partial_{\theta}g(\theta_{0})-B_{\kappa,\infty}]\kappa v\|\leq[2\overline{C}(1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma})^{-\alpha}+\overline{C}\|v\|^{\alpha}]\kappa^{\alpha}$ since $\theta-\theta_{0}=\kappa v$ . Then $\|[\partial_{\theta}g(\theta_{0})-B_{\kappa,\infty}]v\|\leq[2\overline{C}(1/2\underline{\lambda}_{W}^{1/2}\underline{\sigma})^{-\alpha}+\overline{C}\|v\|^{\alpha}]\kappa^{\alpha-1}\to 0$ using $\alpha>1$ . Since this holds for any vector $v\neq 0$ with $\|v\|\leq(2\overline{\lambda}_{W}^{1/2}\overline{\sigma})^{-1}$ , this implies that $B_{\infty}=\partial_{\theta}g(\theta_{0})$ and, in addition, $B_{\kappa,\infty}=\partial_{\theta}g(\theta_{0})+O(\kappa^{\alpha-1})$ .

For the statements in the Theorem: Case 1) implies result (3), Case 2) implies result (4), and Case 3) implies result (2). For result (1), if $\Theta_{0}$ is non-singleton, or if $\Theta_{0}$ is singleton and $\partial_{\theta}g(\theta_{0})$ is singular, then $B_{\infty}$ is singular. If $\Theta_{0}$ is singleton and $\partial_{\theta}g(\theta_{0})$ has full rank, then $B_{\infty}=\partial_{\theta}g(\theta_{0})$ has full rank. ∎
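The contrast between Cases 2) and 3) can be illustrated numerically in a scalar toy problem. The sketch below is not the paper's procedure: it replaces the sup-norm linear approximation with an ordinary least-squares fit, takes $W=1$, and uses two hypothetical moment functions (`flat` with $g'(0)=0$ and `regular` with $g'(0)=1$) on a grid restricted to $[-0.5,0.5]$ so that $\theta_{0}=0$ is the only root.

```python
import numpy as np

def ls_slope_on_level_set(g, grid, kappa):
    """Least-squares slope of g over {theta in grid : |g(theta)| <= kappa}.

    Assumption: least squares stands in for the sup-norm fit used in the text.
    """
    theta = grid[np.abs(g(grid)) <= kappa]
    slope, _intercept = np.polyfit(theta, g(theta), 1)
    return slope

grid = np.linspace(-0.5, 0.5, 200001)

# Case 2 analogue: g'(0) = 0, so the fitted slope should vanish as kappa -> 0.
flat = lambda t: t**2 + t**3
# Case 3 analogue: g'(0) = 1, so the fitted slope should approach 1.
regular = lambda t: t + t**2

for kappa in (1e-2, 1e-3, 1e-4):
    print(kappa,
          ls_slope_on_level_set(flat, grid, kappa),
          ls_slope_on_level_set(regular, grid, kappa))

slope_flat = ls_slope_on_level_set(flat, grid, 1e-4)
slope_regular = ls_slope_on_level_set(regular, grid, 1e-4)
```

As the bandwidth shrinks, the fitted slope for `flat` goes to zero while the slope for `regular` approaches the derivative at $\theta_{0}$, mirroring the singular and full-rank limits of $B_{\kappa,\infty}$.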

Proof of Proposition 1:

Note that Assumptions 1, 3 and 4 hold for the moment function $\theta_{2}\to\bar{g}_{n}(\theta_{1n},\theta_{2})$ . Applying Lemma A2, we have:

[TABLE]

where $H_{2n}=[\partial_{\theta_{2}}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta_{2}}g(\theta_{n},\gamma_{n})]^{-1/2}$ . By construction of the test statistic, we have: $\text{AR}_{n}(\theta_{1n})=\|\bar{g}_{n}(\theta_{1n},\hat{\theta}_{2n})\|_{V_{n}^{-1}}^{2}$ . We also have:

[TABLE]

The leading term converges to $(I-R_{20}(R_{20}^{\prime}V_{0}^{-1}R_{20})^{-1}R_{20}^{\prime}V_{0}^{-1})$ , where $\partial_{\theta_{2}}g(\theta_{n},\gamma_{n})H_{2n}\to R_{20}$ which has rank $d_{\theta_{2}}$ . This limit is an orthogonal projection matrix with rank $d_{g}-d_{\theta_{2}}$ . Hence, by the continuous mapping theorem: $\|\bar{g}_{n}(\theta_{1n},\hat{\theta}_{2n})\|_{V_{n}^{-1}}^{2}\overset{d}{\to}\chi^{2}_{d_{g}-d_{\theta_{2}}}$ . ∎
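The rank claim behind the $\chi^{2}_{d_{g}-d_{\theta_{2}}}$ limit can be checked numerically. The sketch below uses randomly drawn stand-ins for $R_{20}$ and $V_{0}$ (hypothetical values, not taken from the paper) and verifies that the limiting matrix is idempotent with trace $d_{g}-d_{\theta_{2}}$, and that it becomes a symmetric projection once expressed in $V_{0}^{-1/2}$-scaled coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_g, d_theta2 = 5, 2

R = rng.standard_normal((d_g, d_theta2))   # stand-in for R_20, full column rank
L = rng.standard_normal((d_g, d_g))
V = L @ L.T + d_g * np.eye(d_g)            # stand-in for V_0, positive definite
V_inv = np.linalg.inv(V)

# M = I - R (R' V^{-1} R)^{-1} R' V^{-1}
M = np.eye(d_g) - R @ np.linalg.solve(R.T @ V_inv @ R, R.T @ V_inv)

# In scaled coordinates, P = V^{-1/2} M V^{1/2} is symmetric and idempotent.
w, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(w)) @ Q.T
V_mhalf = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
P = V_mhalf @ M @ V_half

# For an idempotent matrix, the trace equals the rank.
rank_M = int(round(np.trace(M)))
```

For $Z\sim\mathcal{N}(0,V_{0})$ , $\|MZ\|^{2}_{V_{0}^{-1}}=\|PV_{0}^{-1/2}Z\|^{2}$ , which is why the rank of the projection gives the degrees of freedom.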

B.2 Proofs for Section 3

B.2.1 Strong and semi-strong sequences.

Proof of Theorem 2 for $B_{n,\infty}$ :

Pick an $\varepsilon>0$ such that Assumption 4 iv. holds, then using $\kappa_{n}^{-1}\delta(\gamma_{n})\to\infty$ :

[TABLE]

which implies $\sup_{\|\theta-\theta_{n}\|\geq\varepsilon}\hat{K}_{n}(\theta)=0$ wpa 1. Take $\|\theta-\theta_{n}\|\leq\varepsilon$ , using Assumption 4 iv. and using the change of variable $\theta=\theta_{n}+\kappa_{n}H_{n}h$ with $\|\kappa_{n}H_{n}h\|\leq\varepsilon$ we have:

[TABLE]

The term on the right-hand side is $o_{p}(1)$ by assumption. The squared norm $\|\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n}h\|^{2}=\text{trace}\left(h^{\prime}H_{n}\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n}h\right)=\|h\|^{2}$ by construction of $H_{n}$ . Hence, $\|\bar{g}_{n}(\theta)/\kappa_{n}\|_{W_{n}}>1$ wpa 1 uniformly in $\|h\|\geq 2$ so that $\hat{K}_{n}(\theta)=0$ wpa 1.
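The normalization identity $\|\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n}h\|=\|h\|$ can be verified directly. The sketch below assumes $H_{n}=[\partial_{\theta}g^{\prime}\partial_{\theta}g]^{-1/2}$ , by analogy with the definition of $H_{2n}$ in the proof of Proposition 1, and uses a randomly generated full-rank Jacobian `J` as a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((6, 3))            # hypothetical Jacobian, full column rank

# H = (J'J)^{-1/2}, from the eigendecomposition of the symmetric matrix J'J.
w, Q = np.linalg.eigh(J.T @ J)
H = Q @ np.diag(w ** -0.5) @ Q.T

h = rng.standard_normal(3)
lhs = np.linalg.norm(J @ H @ h)            # ||J H h||
rhs = np.linalg.norm(h)                    # ||h||
```

The equality holds because $H_{n}\partial_{\theta}g^{\prime}\partial_{\theta}gH_{n}=I$ , so the change of variable $\theta=\theta_{n}+\kappa_{n}H_{n}h$ measures distances in units where the Jacobian is an isometry.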

For any $\theta$ such that $\|h\|\leq 2$ , $\|\partial_{\theta}g(\theta_{n},\gamma_{n})(\theta-\theta_{n})\|=\kappa_{n}\|\partial_{\theta}g(\theta_{n},\gamma_{n})H_{n}h\|\leq 2\kappa_{n}$ so that Assumption 4 iv. applies with $r_{n}=2\kappa_{n}$ . For any two candidates $A,B$ we have wpa 1:

[TABLE]

for $\inf_{x\in[0,1/2]}K(x)=\underline{K}>0$ by Assumption 1 ii., using $\hat{K}_{n}(\theta)\geq\inf_{x\in[0,1/2]}K(x)$ wpa 1 for $\|h\|\leq 1/4$ by similar derivations as above.

Pick $B_{n}=\partial_{\theta}g(\theta_{n},\gamma_{n})$ and $A_{n}=\bar{g}_{n}(\theta_{n})-B_{n}\theta_{n}$ then $\sup_{\theta\in\Theta}\|\bar{g}_{n}(\theta)-A_{n}-B_{n}\theta\|\hat{K}_{n}(\theta)=o_{p}(n^{-1/2})$ . By contradiction, suppose $\sqrt{n}\|A_{n,\infty}+B_{n,\infty}\|\not\to 0$ and/or $\sqrt{n}\kappa_{n}\|[B_{n}-B_{n,\infty}]H_{n}\|\not\to 0$ , in probability. Then for any $\theta=\theta_{n}+\kappa_{n}H_{n}h$ with $\|h\|<1/4$ , we have wpa 1:

[TABLE]

in probability for at least one $\|h\|<1/4$ while the same quantity converges in probability to zero when evaluated at $A_{n},B_{n}$ . For instance if $\sqrt{n}\|A_{n,\infty}+B_{n,\infty}\|\not\to 0$ , pick $h=0$ . This contradicts the approximate minimizer property of $A_{n,\infty},B_{n,\infty}$ . We conclude that $[\partial_{\theta}g(\theta_{n},\gamma_{n})-B_{n,\infty}]H_{n}=o_{p}(n^{-1/2}\kappa_{n}^{-1})$ and $A_{n,\infty}+B_{n,\infty}\theta_{n}=\bar{g}_{n}(\theta_{n})+o_{p}(n^{-1/2})$ .

∎

B.2.2 Weak sequences.

Definition B2.

Define the span of the identification failure in the full space $\mathcal{B}$ and the constrained space $\mathcal{B}_{n}$ respectively as:

[TABLE]

Proof of Theorem 3:

Let $\tilde{B}_{n,\infty}=B_{n,\infty}M^{-1}$ . After applying the reparameterization, we have:

[TABLE]

For any $\beta$ such that $\hat{K}_{n}(\beta)>0$ , we have: $\|\overline{g}_{n}(\beta)\|_{W_{n}}\leq\kappa_{n}$ and then $\|\overline{g}_{n}(\beta)\|\leq\underline{\lambda}_{W}^{-1}\kappa_{n}.$ By continuity of $K$ on $[0,1]$ we have $\hat{K}_{n}(\beta)\leq\overline{K}$ for some constant $\overline{K}>0$ so that:

[TABLE]

for any $\beta\in\mathcal{B}$ . Then, using the reverse triangular inequality:

[TABLE]

By definition of $V_{\star}$ , we can find pairs $(\beta_{n}^{j},\tilde{\beta}_{n}^{j})$ $j=1,\dots,d_{\beta_{2}}$ with $\beta_{n}^{j}=(\beta_{1n},\beta_{2}^{j})$ , $\tilde{\beta}_{n}^{j}=(\beta_{1n},\tilde{\beta}_{2}^{j})$ for two $(\beta_{2}^{j},\tilde{\beta}_{2}^{j})\in\mathcal{B}_{2}^{0}\times\mathcal{B}_{2}^{0}$ such that the vectors $v^{j}=\beta_{n}^{j}-\tilde{\beta}_{n}^{j}$ , $j=1,\dots,d_{\beta_{2}}$ are linearly independent. By assumption, we have:

[TABLE]

which is $O_{p}(n^{-1/2})=o_{p}(\kappa_{n})$ . This implies that $\|\overline{g}_{n}(\beta)/\kappa_{n}\|_{W_{n}}\leq 1/2$ wpa 1 uniformly in $\beta=(\beta_{1n},\beta_{2}),\beta_{2}\in\mathcal{B}_{2}^{0}$ so that $\hat{K}_{n}(\beta)\geq\inf_{x\in[0,1/2]}K(x)=\underline{K}$ wpa 1 uniformly on the same set. In turn, we have wpa 1 for all $j$ :

[TABLE]

Using the triangular inequality, we have wpa 1 and uniformly in $j$ :

[TABLE]

Let $V=M^{-1}(v^{1},\dots,v^{d_{\beta_{2}}})$ . By linear independence, $P_{V}=V(V^{\prime}V)^{-1}V^{\prime}$ is well defined and:

[TABLE]

wpa 1. For any $v\in V_{\star}$ , $P_{V}v=v$ hence $\|B_{n,\infty}v\|\leq\|B_{n,\infty}P_{V}\|\,\|v\|\leq O_{p}(\kappa_{n})$ wpa 1. To find the other two results note that $B_{n,\infty}^{\prime}B_{n,\infty}$ is Hermitian, and $P_{V}$ is an orthogonal projection matrix by construction. Hence $P_{V}$ admits an eigendecomposition of the form $O\text{bckdiag}(I_{d_{\beta_{2}}},0_{d_{\theta}-d_{\beta_{2}}})O^{*}$ with $OO^{*}=I_{d}$ ; $O^{*}$ is the conjugate transpose of $O$ and bckdiag builds a block-diagonal matrix. Using this decomposition we have:

[TABLE]

where $O_{d_{\beta_{2}}},O_{d_{\beta_{2}}}^{*}$ are the first $d_{\beta_{2}}$ columns/rows of $O$ and $O^{*}$ , respectively, which satisfy $O_{d_{\beta_{2}}}^{*}O_{d_{\beta_{2}}}=I_{d_{\beta_{2}}}$ . As an implication of the minimax principle (Bhatia, 1997, Problem III.6.11, p77) and the equality above, we have the following inequality:

[TABLE]

wpa 1. This concludes the proof. ∎
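The eigenvalue step can be illustrated numerically. The sketch below is a toy check, not the paper's display: it builds a hypothetical matrix `B` that is small on an $r$-dimensional subspace with orthonormal basis `O_r`, and verifies that the sum of the $r$ smallest eigenvalues of $B^{\prime}B$ is bounded by $\text{trace}(O_{r}^{\prime}B^{\prime}BO_{r})$ , which is itself at most $r\|BP_{V}\|^{2}$ .

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, eps = 5, 2, 1e-3

# Orthonormal basis of R^d; the first r columns span the weak directions.
O, _ = np.linalg.qr(rng.standard_normal((d, d)))
O_r = O[:, :r]
P_V = O_r @ O_r.T

# B is of order 1 off the subspace and of order eps on it.
B = (rng.standard_normal((6, d)) @ (np.eye(d) - P_V)
     + eps * rng.standard_normal((6, d)) @ P_V)

eigs = np.sort(np.linalg.eigvalsh(B.T @ B))
sum_smallest = eigs[:r].sum()                       # sum of the r smallest eigenvalues
trace_bound = np.trace(O_r.T @ B.T @ B @ O_r)       # equals ||B O_r||_F^2
op_norm_bound = r * np.linalg.norm(B @ P_V, 2) ** 2
```

With $\|B_{n,\infty}P_{V}\|\leq O_{p}(\kappa_{n})$ , the same chain of inequalities gives an $O_{p}(\kappa_{n}^{2})$ bound on the smallest eigenvalues.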

Proof of Proposition 3:

Following the steps in the proof of Theorem 3, we can construct a basis for $V^{0}_{\star}\subseteq V_{\star}$ using $v^{j}=(0,\theta_{22}^{j}-\tilde{\theta}_{22}^{j})$ with pairs $(\theta_{22}^{j},\tilde{\theta}_{22}^{j})\in\mathcal{B}_{22}^{0}\times\mathcal{B}_{22}^{0}$ . Since $\theta_{1}=\theta_{1n}$ is fixed, we have $P_{\theta_{1}}P_{V_{\star}}=0$ and $P_{\theta_{1}}^{\perp}P_{V_{\star}}=P_{V_{\star}}$ for the basis $V_{\star}=M^{-1}(v^{1},\dots,v^{d_{\beta_{22}}})$ . Hence, $\|B_{n,\infty}P_{\theta_{1}}^{\perp}P_{V_{\star}}\|\leq O_{p}(\kappa_{n})$ and $\|B_{n,\infty}P_{\theta_{1}}^{\perp}P_{\theta_{1}}\|=0$ . By the minimax principle, these imply the desired inequality: $\sum_{j=1}^{d_{\theta_{1}}+d_{\beta_{22}}}\lambda_{j}(P_{\theta_{1}}^{\perp}B_{n,\infty}^{\prime}B_{n,\infty}P_{\theta_{1}}^{\perp})\leq O_{p}(\kappa_{n}^{2}).$ ∎

B.3 Proofs for Section 4

Proof of Theorem 4:

First, we show that normalizations do not affect the results of Proposition 3 for weak sequences. This amounts to showing that $B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}$ has $d_{\theta_{1}}+d_{\beta_{22}}$ singular values that are $O_{p}(\kappa_{n})$ . From Proposition 3, there exists a linearly independent family $V_{\star}=M^{-1}(v^{1},\dots,v^{d_{\beta_{22}}})$ such that $MV_{\star}\in V_{\star}^{0}\subseteq V^{0}$ (from Definition B2) and $\|B_{n,\infty}P_{V_{\star}}\|=O_{p}(\kappa_{n})$ where $P_{V_{\star}}=V_{\star}(V_{\star}^{\prime}V_{\star})^{-1}V_{\star}^{\prime}$ . Similarly $\|\Sigma_{n}^{-1/2}M^{-1}P_{V_{\star}}\|\leq\tilde{C}$ from Lemma D4. Also because $I_{d}=P_{\theta_{1}}+P_{\theta_{1}}^{\perp}$ where $P_{\theta_{1}}M^{-1}P_{V_{\star}}=0$ by design, we have: $\|\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}M^{-1}P_{V_{\star}}\|\leq\tilde{C}$ . Then using the minimax characterization of singular values (Bhatia, 1997, Problem III.6.5), we have:

[TABLE]

Then $\sigma_{j}(\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}M^{-1}P_{V_{\star}})\leq\sigma_{\max}(\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}M^{-1}P_{V_{\star}})\leq\tilde{C}$ , and $\sigma_{i}(B_{n,\infty}P_{\theta_{1}}^{\perp})\leq O_{p}(\kappa_{n})$ for $1\leq i\leq d_{\theta_{1}}+d_{\beta_{22}}$ from Proposition 3. Since $P_{V_{\star}}$ has rank $d_{\beta_{22}}$ and is orthogonal to the rank $d_{\theta_{1}}$ matrix $P_{\theta_{1}}$ for which $B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}P_{\theta_{1}}=0$ , we have that $\sigma_{\ell}(B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}M^{-1})\leq O_{p}(\kappa_{n})$ for $1\leq\ell\leq d_{\theta_{1}}+d_{\beta_{22}}$ , in increasing order. [Footnote 23: Pick $i=\ell$ and $j=1$ and notice that the matrix is bounded above on a subspace of dimension $d_{\theta_{1}}+d_{\beta_{22}}$ .] Also $\sigma_{\min}(M^{-1})$ is strictly positive and bounded below, because $M$ is invertible, so that: $\sigma_{\ell}(B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp})\leq O_{p}(\kappa_{n})$ as well. Likewise, Assumption 6 implies that $\sigma_{\min}(\overline{V}_{n}^{-1/2})\geq\overline{\lambda}_{V}^{-1/2}+o(1)$ which then also implies that $\sigma_{\ell}(\overline{V}_{n}^{-1/2}B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp})\leq O_{p}(\kappa_{n})$ for $1\leq\ell\leq d_{\theta_{1}}+d_{\beta_{22}}$ as desired.

Now we are interested in establishing the asymptotic size of the test. Let $(\theta_{n},\gamma_{n})$ be a sequence in $\overline{\Theta}\times\Gamma$ such that

[TABLE]

as noted in Andrews et al. (2020, p501), such a sequence always exists. There always exists at least one subsequence of $(\theta_{n},\gamma_{n})$ which achieves the $\limsup$ above, i.e. for some $\varphi_{1}:\mathbb{N}\to\mathbb{N}$ strictly increasing: $\lim_{n\to\infty}\mathbb{P}_{\gamma_{\varphi_{1}(n)}}\left(\text{AR}_{\varphi_{1}(n)}(\theta_{1\varphi_{1}(n)})>\chi^{2}_{d_{g}-\hat{d}_{\varphi_{1}(n)}}(1-\alpha)\right)=\limsup_{n\to\infty}\mathbb{P}_{\gamma_{n}}\left(\text{AR}_{n}(\theta_{1n})>\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)\right)$ . Assumption 1 i. implies that $\overline{\Theta}\times\Gamma$ is sequentially compact so that this subsequence admits a convergent sub-subsequence in $\overline{\Theta}\times\Gamma$ , i.e. for some $\varphi_{2}:\mathbb{N}\to\mathbb{N}$ strictly increasing: $(\theta_{\varphi_{2}\circ\varphi_{1}(n)},\gamma_{\varphi_{2}\circ\varphi_{1}(n)})\to(\theta_{0},\gamma_{0})\in\overline{\Theta}\times\Gamma$ and $\lim_{n\to\infty}\mathbb{P}_{\gamma_{\varphi_{2}\circ\varphi_{1}(n)}}\left(\text{AR}_{\varphi_{2}\circ\varphi_{1}(n)}(\theta_{1\varphi_{2}\circ\varphi_{1}(n)})>\chi^{2}_{d_{g}-\hat{d}_{\varphi_{2}\circ\varphi_{1}(n)}}(1-\alpha)\right)$ has the same limit.

Now, if we can find a converging sequence $(\theta_{m},\gamma_{m})$ , $m\geq 1$ , in one of $\Gamma_{0}(b)$ , for some $b\geq 0$ , $\Gamma_{0}(\infty)$ , or converging in $\Gamma_{1}$ such that $(\theta_{m},\gamma_{m})=(\theta_{\varphi_{2}\circ\varphi_{1}(n)},\gamma_{\varphi_{2}\circ\varphi_{1}(n)})$ when $m=\varphi_{2}\circ\varphi_{1}(n)$ then the limiting rejection probability for the subsequence can be derived from the limiting rejection probability of the full sequence $(\theta_{m},\gamma_{m})$ . Suppose $(\theta_{\varphi_{2}\circ\varphi_{1}(n)},\gamma_{\varphi_{2}\circ\varphi_{1}(n)})\to(\theta_{0},\gamma_{0})\in\overline{\Theta}\times\Gamma_{1}$ . Pick $(\theta_{m},\gamma_{m})=(\theta_{\varphi_{2}\circ\varphi_{1}(n)},\gamma_{\varphi_{2}\circ\varphi_{1}(n)})$ when $m=\varphi_{2}\circ\varphi_{1}(n)$ and $(\theta_{m},\gamma_{m})=(\theta_{0},\gamma_{0})$ otherwise. $(\theta_{m},\gamma_{m})$ is a converging sequence with $\gamma_{0}\in\Gamma_{1}$ . If $(\theta_{0},\gamma_{0})\in\overline{\Theta}\times\Gamma_{0}$ , then $\sqrt{\varphi_{2}\circ\varphi_{1}(n)}\delta(\gamma_{\varphi_{2}\circ\varphi_{1}(n)})$ is a sequence taking values in $[0,+\infty)\cup\{+\infty\}$ , the positive part of the extended real line which is a compact space. This implies that $\sqrt{\varphi_{2}\circ\varphi_{1}(n)}\delta(\gamma_{\varphi_{2}\circ\varphi_{1}(n)})$ admits at least one subsequence $\sqrt{\varphi_{3}\circ\varphi_{2}\circ\varphi_{1}(n)}\delta(\gamma_{\varphi_{3}\circ\varphi_{2}\circ\varphi_{1}(n)})$ which converges in $[0,+\infty)\cup\{+\infty\}$ . Let $\varphi=\varphi_{3}\circ\varphi_{2}\circ\varphi_{1}$ index the resulting subsequence of $(\theta_{n},\gamma_{n})$ . There are now two possibilities: either $\sqrt{\varphi(n)}\delta(\gamma_{\varphi(n)})\to\infty$ or $\sqrt{\varphi(n)}\delta(\gamma_{\varphi(n)})\to b\in[0,\infty)$ .

Suppose $\sqrt{\varphi(n)}\delta(\gamma_{\varphi(n)})\to\infty$ . Pick $(\theta_{m},\gamma_{m})=(\theta_{\varphi(n)},\gamma_{\varphi(n)})$ when $m=\varphi(n)$ . For $\varphi(n)<m<\varphi(n+1)$ , pick $(\theta_{m},\gamma_{m})=(\theta_{\varphi(n)},\gamma_{\varphi(n)})$ as well. By construction $\sqrt{m}\delta(\gamma_{m})=\sqrt{m}\delta(\gamma_{\varphi(n)})>\sqrt{\varphi(n)}\delta(\gamma_{\varphi(n)})\to\infty$ so that $\gamma_{m}\in\Gamma_{0}(\infty)$ . By construction $\hat{d}_{n}\in\{0,\dots,d_{\theta_{2}}\}$ , hence Proposition 1 implies that

[TABLE]

for any $\alpha\in(0,1)$ . This in turn implies that:

[TABLE]

Suppose $\sqrt{\varphi(n)}\delta(\gamma_{\varphi(n)})\to b\in[0,\infty)$ . Pick $(\theta_{m},\gamma_{m})=(\theta_{\varphi(n)},\gamma_{\varphi(n)})$ for any $m=\varphi(n)$ . For $\varphi(n)<m<\varphi(n+1)$ , define $b_{m}=\min[\sqrt{\varphi(n)}\delta(\gamma_{\varphi(n)}),\sqrt{\varphi(n+1)}\delta(\gamma_{\varphi(n+1)})]$ ; note that $\lim_{m\to\infty}b_{m}=b$ . Suppose, without loss of generality, that $b_{m}=\sqrt{\varphi(n)}\delta(\gamma_{\varphi(n)})$ . Take $\varepsilon=d(\gamma_{0},\gamma_{\varphi(n)})$ . If $\varepsilon=0$ , then $\gamma_{\varphi(n)}=\gamma_{0}\in\Gamma_{0}$ and $b_{m}=0$ . If $b_{m}=0$ , pick $(\theta_{m},\gamma_{m})=(\theta_{\varphi(n)},\gamma_{\varphi(n)})$ . If $\varepsilon>0$ and $b_{m}>0$ , Assumption 1 i. implies that the closure of $B_{\varepsilon}(\gamma_{0})\cap\Gamma$ is connected. Hence, there exists a continuous map: $(\theta,\gamma):[0,1]\to\overline{\Theta}\times\Gamma$ such that $(\theta(0),\gamma(0))=(\theta_{0},\gamma_{0})$ and $(\theta(1),\gamma(1))=(\theta_{\varphi(n)},\gamma_{\varphi(n)})$ and $\|\theta(u)-\theta_{0}\|+d(\gamma(u),\gamma(0))\leq\varepsilon$ for any $u\in[0,1]$ . By continuity of $\delta:\Gamma\to\mathbb{R}_{+}$ , the image of $u\to\delta\circ\gamma(u)$ is a closed interval which contains $0=\delta(\gamma_{0})$ and $\delta(\gamma_{\varphi(n)})>0$ . For each $m$ , the values [math] and $b_{m}$ are both contained in the image $\sqrt{m}[\delta\circ\gamma([0,1])]$ , so that there exists a $u_{m}$ such that $\sqrt{m}\delta\circ\gamma(u_{m})=b_{m}$ . Pick $(\theta_{m},\gamma_{m})=(\theta(u_{m}),\gamma(u_{m}))$ . If $b_{m}$ is attained at $\varphi(n+1)$ , repeat the above with $\varphi(n+1)$ instead of $\varphi(n)$ .
By construction $\|\theta_{m}-\theta_{0}\|+d(\gamma_{m},\gamma_{0})\leq\max[\|\theta_{\varphi(n)}-\theta_{0}\|+d(\gamma_{\varphi(n)},\gamma_{0}),\|\theta_{\varphi(n+1)}-\theta_{0}\|+d(\gamma_{\varphi(n+1)},\gamma_{0})]\to 0$ and $\lim_{m\to\infty}\sqrt{m}\delta\circ\gamma(u_{m})=\lim_{m\to\infty}b_{m}=b\in[0,\infty)$ . This implies that $\gamma_{m}\in\Gamma_{0}(b)$ with $b\in[0,\infty)$ . As shown above, for this converging sequence we have $\hat{d}_{m}\leq d_{\phi}$ wpa 1. Using Proposition 2:

[TABLE]

Then, we have: $\lim_{n\to\infty}\mathbb{P}_{\gamma_{\varphi(n)}}\left(\text{AR}_{\varphi(n)}(\theta_{1\varphi(n)})>\chi^{2}_{d_{g}-\hat{d}_{\varphi(n)}}(1-\alpha)\right)\leq\alpha$ . Putting everything together, we have: $\limsup_{n\to\infty}\mathbb{P}_{\gamma_{n}}\left(\text{AR}_{n}(\theta_{1n})>\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)\right)\leq\alpha$ for the original sequence $(\theta_{n},\gamma_{n})$ .

For the second part of the Theorem, note that $\kappa^{2}_{n}=o(\underline{\lambda}_{n}^{2})=o(\lambda_{\min}(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}\partial_{\theta}g(\theta_{n},\gamma_{n})))$ so Theorem 2 applies. Now, from the proof of Lemma D3: $\kappa_{n}^{-2}H_{n}^{1/2}\Sigma_{n}H_{n}^{1/2}\overset{p}{\to}\tilde{\Sigma},$ which is the arg-minimizer of the limiting sup-norm minimization and is non-singular because of the log-determinant. Hence, $\lambda_{\min}(\Sigma_{n}^{-1/2})\geq\kappa_{n}^{-1}\lambda_{\min}(H_{n}^{-1})\lambda_{\min}(\tilde{\Sigma})+o_{p}(\kappa_{n}^{-1})$ . Now, this implies:

[TABLE]

where the last inequality follows from Assumption 6 and the discussion after Theorem 2. Since $\lambda_{\min}(\Sigma_{n}^{-1/2})$ is bounded below, we have $\lambda_{d_{\theta_{1}}+1}(P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp}B_{n,\infty}^{\prime}\overline{V}_{n}^{-1}B_{n,\infty}P_{\theta_{1}}^{\perp}\Sigma_{n}^{-1/2}P_{\theta_{1}}^{\perp})>\underline{\lambda}_{n}^{2}$ wpa 1. This implies $\hat{d}_{n}=d_{\theta_{2}}$ wpa 1 and:

[TABLE]

which concludes the proof. ∎

Appendix C Proofs for the preliminary results

C.1 Preliminary results for Section 2

Proof of Lemma A1:

First, using $(a-b)^{2}\geq a^{2}/2-b^{2}$ for any $(a,b)\in\mathbb{R}^{2}$ we have:

[TABLE]

uniformly in $\theta\in\Theta$ . The second inequality is:

[TABLE]

since $g(\theta_{n},\gamma_{n})=0$ . Pick any $\varepsilon>0$ . For any approximate minimizer $\hat{\theta}_{n}$ such that $\|\bar{g}_{n}(\hat{\theta}_{n})\|^{2}_{W_{n}}\leq\inf_{\theta\in\Theta}\|\bar{g}_{n}(\theta)\|^{2}_{W_{n}}+o(n^{-1})$ , using the two inequalities above:

[TABLE]

since $\sqrt{n}\delta(\gamma_{n})\to\infty$ for sequences converging in $\Gamma_{0}(\infty)$ or $\Gamma_{1}$ . ∎
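The elementary inequality $(a-b)^{2}\geq a^{2}/2-b^{2}$ used at the start of this proof follows from $2xy\leq x^{2}+y^{2}$ . A short derivation, included only for completeness:

```latex
\begin{aligned}
a^{2}=\big((a-b)+b\big)^{2}
&=(a-b)^{2}+2(a-b)b+b^{2} \\
&\leq 2(a-b)^{2}+2b^{2}
\quad\Longrightarrow\quad
(a-b)^{2}\geq \tfrac{1}{2}a^{2}-b^{2}.
\end{aligned}
```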

Proof of Lemma A2:

For any approximate minimizer $\hat{\theta}_{n}$ such that $\|\bar{g}_{n}(\hat{\theta}_{n})\|^{2}_{W_{n}}\leq\inf_{\theta\in\Theta}\|\bar{g}_{n}(\theta)\|^{2}_{W_{n}}+o(n^{-1})$ , we have $\|\hat{\theta}_{n}-\theta_{n}\|=o_{p}(1)$ by Lemma A1 and:

[TABLE]

By assumption, $W(\theta_{n})\to W(\theta_{0})$ positive definite so the above implies:

[TABLE]

As in Newey and McFadden (1994), completing the square above implies $[\sqrt{n}\|\partial_{\theta}g(\theta_{n},\gamma_{n})(\hat{\theta}_{n}-\theta_{n})\|+O_{p}(1)]^{2}\leq O(1)$ . Taking the square root on both sides yields:

[TABLE]

Define $\tilde{\theta}_{n}=\theta_{n}-\Big{(}\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}W(\theta_{n})\partial_{\theta}g(\theta_{n},\gamma_{n})\Big{)}^{-1}\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}W(\theta_{n})\bar{g}_{n}(\theta_{n}).$ By continuity of $\partial_{\theta}g(\theta,\gamma)$ and $W$ , we have: $\sqrt{n}H_{n}^{-1}(\tilde{\theta}_{n}-\theta_{n})=(R_{0}^{\prime}W_{0}R_{0})^{-1}R_{0}^{\prime}\sqrt{n}\bar{g}_{n}(\theta_{n})+o_{p}(1)\overset{d}{\to}\mathcal{N}(0,\Sigma_{0})$ . To conclude the proof we need to prove that $\sqrt{n}H_{n}^{-1}(\tilde{\theta}_{n}-\hat{\theta}_{n})=o_{p}(1)$ . Using similar calculations as above, we have:

[TABLE]

By construction of $\tilde{\theta}_{n}$ , $-\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}W(\theta_{n})\bar{g}_{n}(\theta_{n})=\left(\partial_{\theta}g(\theta_{n},\gamma_{n})^{\prime}W(\theta_{n})\partial_{\theta}g(\theta_{n},\gamma_{n})\right)\left(\tilde{\theta}_{n}-\theta_{n}\right).$ This implies the following equalities:

[TABLE]

Since $\hat{\theta}_{n}$ is an approximate minimizer, we have:

[TABLE]

which implies $\sqrt{n}\partial_{\theta}g(\theta_{n},\gamma_{n})(\hat{\theta}_{n}-\tilde{\theta}_{n})=o_{p}(1)$ and concludes the proof. ∎
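The auxiliary point $\tilde{\theta}_{n}$ is a one-step Gauss-Newton update from $\theta_{n}$ . The sketch below illustrates it under a simplifying assumption that is not made in the paper: the sample moments `g_bar` are exactly linear in $\theta$ , with hypothetical intercept `b`, Jacobian `J`, and $W=I$ . In that special case the one-step update coincides with the exact minimizer of $\|\bar{g}_{n}(\theta)\|^{2}_{W}$ from any starting point.

```python
import numpy as np

rng = np.random.default_rng(3)
d_g, d_theta = 4, 2
J = rng.standard_normal((d_g, d_theta))    # hypothetical Jacobian
b = rng.standard_normal(d_g)               # hypothetical intercept
W = np.eye(d_g)                            # weighting matrix (identity for simplicity)

def g_bar(theta):
    """Hypothetical sample moments, exactly linear in theta."""
    return b + J @ theta

def one_step(theta_n):
    """theta_n - (J'WJ)^{-1} J'W g_bar(theta_n), the update defining theta-tilde."""
    return theta_n - np.linalg.solve(J.T @ W @ J, J.T @ W @ g_bar(theta_n))

theta_tilde = one_step(np.array([0.3, -1.2]))
# Exact minimizer of ||b + J theta||^2 for comparison.
theta_hat = np.linalg.lstsq(J, -b, rcond=None)[0]
```

With nonlinear moments the two differ, and the proof shows that the difference is negligible at the $\sqrt{n}H_{n}^{-1}$ scale.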

Appendix D Supplemental Results

The following results concern the matrix $\Sigma_{n}$ used for re-scaling in the procedure. The derivations follow very closely those in Theorems 2 and 3.

Lemma D3.

Suppose $K$ is the uniform kernel and the Assumptions for Theorem 2 hold, then $H_{n}^{1/2}\Sigma_{n}H_{n}^{1/2}\leq O_{p}(\kappa_{n}^{2})$ .

Proof of Lemma D3.

As in the proof of Theorem 2, let $\theta=\theta_{n}+\kappa_{n}H_{n}h$ . Take $\Sigma_{n}=\kappa_{n}^{2}H_{n}^{-1/2}\tilde{\Sigma}_{n}H_{n}^{-1/2}$ for $\tilde{\Sigma}_{n}\geq 0$ , $\mu_{n}=\theta_{n}+\kappa_{n}H_{n}\tilde{h}_{n}$ . Then wpa 1:

[TABLE]

using the argmax Theorem and taking the p-limit on the right-hand-side. The $-\log|\kappa_{n}H_{n}^{-1}|$ term can be removed because $\hat{K}_{n}\in\{0,1\}$ for the uniform kernel so it does not alter the supremum and the infimum. This implies that $\kappa_{n}^{-2}H_{n}^{1/2}\Sigma_{n}H_{n}^{1/2}=\tilde{\Sigma}_{n}=O_{p}(1)$ as desired. ∎

Lemma D4.

Suppose $K$ is the uniform kernel and the Assumptions for Theorem 3 hold, then there exists $C>0$ such that $\|\Sigma_{n}^{-1/2}M^{-1}v\|_{2}\leq C,$ wpa 1 for any $v=(0,\beta_{2}^{1}-\beta_{2}^{2})$ with $\beta_{2}^{1},\beta_{2}^{2}\in\mathcal{B}_{2}^{0}$ . This implies that $\|\Sigma_{n}^{-1/2}M^{-1}P_{V_{2}}\|\leq\tilde{C}$ , wpa 1 for some finite constant $\tilde{C}$ .

Proof of Lemma D4.

Pick $v\neq 0$ as stated in the Lemma. Let $\beta^{1}=(\beta_{1n},\beta_{2}^{1})$ , $\beta^{2}=(\beta_{1n},\beta_{2}^{2})$ so that $v=\beta^{1}-\beta^{2}$ . Because $K$ is the uniform kernel, after a change of variable, $(\tilde{\mu}_{n},\tilde{\Sigma}_{n})=(M^{-1}\mu_{n},M^{\prime-1}\Sigma_{n}M^{-1})$ are the minimizers of:

[TABLE]

wpa 1, because $\hat{K}_{n}(\beta^{1})=\hat{K}_{n}(\beta^{2})=1$ wpa 1 under the Assumptions, and the infimum is no larger than the value attained at $(\tilde{\mu},\tilde{\Sigma})=(0,I)$ . This implies that $|\|\beta^{1}-\tilde{\mu}_{n}\|^{2}_{\tilde{\Sigma}_{n}^{-1}}-\|\beta^{2}-\tilde{\mu}_{n}\|^{2}_{\tilde{\Sigma}_{n}^{-1}}|\leq 2\sup_{\beta}\|\beta\|$ . Using $\|\beta^{1}-\tilde{\mu}_{n}\|^{2}_{\tilde{\Sigma}_{n}^{-1}}=\|v\|^{2}_{\tilde{\Sigma}_{n}^{-1}}+\|\beta^{2}-\tilde{\mu}_{n}\|^{2}_{\tilde{\Sigma}_{n}^{-1}}+2\langle\tilde{\Sigma}_{n}^{-1/2}v,\tilde{\Sigma}_{n}^{-1/2}(\beta^{2}-\tilde{\mu}_{n})\rangle$ , we have $|\|\beta^{1}-\tilde{\mu}_{n}\|^{2}_{\tilde{\Sigma}_{n}^{-1}}-\|\beta^{2}-\tilde{\mu}_{n}\|^{2}_{\tilde{\Sigma}_{n}^{-1}}|=|\|v\|^{2}_{\tilde{\Sigma}_{n}^{-1}}+2\langle\tilde{\Sigma}_{n}^{-1/2}v,\tilde{\Sigma}_{n}^{-1/2}(\beta^{2}-\tilde{\mu}_{n})\rangle|=|\|v\|^{2}_{\tilde{\Sigma}_{n}^{-1}}-2\langle\tilde{\Sigma}_{n}^{-1/2}v,\tilde{\Sigma}_{n}^{-1/2}(\beta^{1}-\tilde{\mu}_{n})\rangle|$ . Apply the triangular inequality to find wpa 1: $\|v\|^{2}_{\tilde{\Sigma}_{n}^{-1}}\leq 4\sup_{\beta}\|\beta\|^{2}_{2}:=C.$ To get the first inequality, note that $\|v\|^{2}_{\tilde{\Sigma}_{n}^{-1}}=\|v^{\prime}M^{\prime-1}\Sigma_{n}^{-1/2}\Sigma_{n}^{-1/2}M^{-1}v\|_{2}=\|\Sigma_{n}^{-1/2}M^{-1}v\|_{2}^{2}$ . The second inequality can be derived using the same steps used in the proof of Theorem 3 and the minimax principle. ∎

Appendix E Linear Reparameterization, Continued

The following gives additional details about the linear reparameterization in Section 2, and describes the additional steps to use when there are multiple sources of identification. To simplify the discussion, it will focus on two specific examples.

The main idea is that if there are multiple but finitely many sources of identification failure, we can construct a finite partition of $\mathcal{B}_{2}$ where each subset is associated with a common rate (semi-strong, weak). Then, refine the reparameterization by using only the subset(s) corresponding to weak identification. When there is a single (scalar) source of identification failure, the partition $\beta_{1},\beta_{2}$ presented in the main text systematically has $\beta_{2}$ weakly identified for weak sequences because the objective function becomes flat at the same rate on the entire set $\mathcal{B}_{2}$ . The partition only has one element which is $\mathcal{B}_{2}$ itself.

Example 1: Linear IV regression

First, consider the linear IV regression, now with multiple instruments. Let $y_{i}=x_{i}^{\prime}\theta_{0}+u_{i}$ with moment condition $\mathbb{E}_{\gamma}(z_{i}[y_{i}-x_{i}^{\prime}\theta])=0$ . It can be re-written as $\mathbb{E}_{\gamma}(z_{i}x_{i}^{\prime}[\theta_{0}-\theta])=0$ , if $\mathbb{E}_{\gamma}(z_{i}u_{i})=0$ . As seen from the discussion of Assumption 3, identification fails for any $\gamma_{0}$ such that $\sigma_{\min}[\mathbb{E}_{\gamma_{0}}(z_{i}x_{i}^{\prime})]=0$ . Because the moment condition is linear in $\theta$ for this example, the linear reparameterization described in Section 2 is such that $V_{2}=\text{kern}[\mathbb{E}_{\gamma_{0}}(z_{i}x_{i}^{\prime})]$ , where kern is the kernel, or null space, of the matrix. When the matrix has full rank $V_{2}=\{0\}$ and the solution $\theta_{0}$ is unique.
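For a concrete picture of $V_{2}=\text{kern}[\mathbb{E}_{\gamma_{0}}(z_{i}x_{i}^{\prime})]$ , the null space can be read off the singular value decomposition. The sketch below uses a hypothetical $3\times 2$ matrix `Ezx` whose second column is twice the first, so that one direction of $\theta$ is unidentified; the tolerance `tol` is an arbitrary illustrative choice.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis of kern(A): right singular vectors whose
    singular value is below tol (or that have no singular value at all)."""
    _, s, Vt = np.linalg.svd(A)
    rank = int((s > tol).sum())
    return Vt[rank:].T

# Hypothetical E(z_i x_i'): 3 instruments, 2 regressors, rank 1.
Ezx = np.array([[1.0, 2.0],
                [0.5, 1.0],
                [2.0, 4.0]])

V2 = null_space_basis(Ezx)   # one column, proportional to (2, -1)
```

Any $\theta=\theta_{0}+cV_{2}$ satisfies the moment condition, which is exactly the non-singleton identified set described above.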

Consider sequences $\gamma_{n}\to\gamma_{0}$ such that $\mathbb{E}_{\gamma_{n}}(z_{i}x_{i}^{\prime})=U\Lambda_{n}V^{\prime}$ where $\Lambda_{n}=\text{diag}(\lambda_{1n},\dots,\lambda_{kn})$ is diagonal, and $U,V$ are semi-unitary: $U^{\prime}U=V^{\prime}V=I_{k}$ . The span $V_{2}$ covers directions associated with the singularity, i.e. all columns $V_{j}$ , $j\in\{1,\dots,k\}$ , of $V$ where $\lim_{n\to\infty}\lambda_{jn}=\lambda_{j0}=0$ . Consider only sequences such that the limit $b_{j}=\lim_{n\to\infty}\sqrt{n}\lambda_{jn}$ exists in $\mathbb{R}_{+}\cup\{+\infty\}$ . [Footnote 1: Note that $\sqrt{n}\lambda_{jn}$ takes values in the extended real line $\mathbb{R}_{+}\cup\{+\infty\}$ which is compact so we can always find a converging subsequence in the extended real line. This step appears in the proof of Theorem 4.] Split the indices in two sets: $J_{1}=\{1\leq j\leq k,b_{j}=+\infty\}$ and $J_{2}=\{1\leq j\leq k,b_{j}<+\infty\}$ . Clearly $J_{1}\cap J_{2}=\emptyset$ and $J_{1}\cup J_{2}=\{1,\dots,k\}$ . Take $V_{2}$ to be the span associated with the columns $V_{j}$ , $j\in J_{2}$ . Then complete the reparameterization by taking $\beta_{1}$ in the orthogonal of $V_{2}$ . Since the reparameterization is defined up to a rotation, suppose for simplicity that $V$ is ordered such that $\beta=V^{\prime}\theta$ , and note that: $\|g(\beta,\gamma_{n})\|^{2}=(\beta_{n}-\beta)^{\prime}\Lambda_{n}(\beta_{n}-\beta)$ , where $\beta_{n}=V^{\prime}\theta_{n}$ . Assumption 5 can now be verified from this representation. Here the sources of identification failure are indexed by the singular values $\lambda_{j}$ , $j\in\{1,\dots,k\}$ , and the parameter space is partitioned into $k$ different directions: $V_{j}^{\prime}\theta$ , $j\in\{1,\dots,k\}$ , associated with the $\lambda_{j}$ .

Example 2: Non-Linear regression

Consider the regression setup in Cheng (2015): $y_{i}=\sum_{j=1}^{k}f_{j}(x_{i},\pi_{j})\delta_{j}+w_{i}^{\prime}\xi+u_{i}$ , with $\theta=(\pi,\delta,\xi)$ and each $\delta_{j}$ scalar. The coefficient $\pi_{j}$ is unidentified if the corresponding $\delta_{j}=0$ . This is related to the example used in Section 5. For a vector of instruments $z_{i}$ , take the moment condition $\mathbb{E}_{\gamma}(z_{i}[y_{i}-\sum_{j=1}^{k}f_{j}(x_{i},\pi_{j})\delta_{j}-w_{i}^{\prime}\xi])=0$ which can be re-written as:

[TABLE]

Take $\gamma=\gamma_{0}$ such that $\delta_{j0}=0$ for at least one $j$ . Then $\Theta_{0}$ is non-singleton and includes all possible values of $\pi_{j}$ for which $\delta_{j0}=0$ . Suppose $\gamma$ , $z_{i},x_{i},w_{i}$ , and the functions $f_{j}$ are such that only the coefficients $\pi$ are potentially unidentified. The linear reparameterization based on $\gamma_{0}$ is such that $\beta_{2}$ includes all coefficients $\pi_{j}$ for which $\delta_{j0}=0$ , while $\beta_{1}$ includes $\delta,\xi$ and the remaining $\pi_{j}$ , for which $\delta_{j0}\neq 0$ .

Take a converging sequence $\gamma_{n}\to\gamma_{0}$ . Following the same steps as in the previous example, let $b_{j}=\lim_{n\to\infty}\sqrt{n}|\delta_{jn}|\in\mathbb{R}_{+}\cup\{+\infty\}$ , define $J_{1}$ and $J_{2}$ in the same way as above. As before, apply the reparameterization but now $\beta_{2}$ includes the $\pi_{j}$ with $j\in J_{2}$ and $\beta_{1}$ collects all remaining coefficients. Here the sources of identification failure are indexed by $|\delta_{j}|$ , $j\in\{1,\dots,k\}$ . This time, the partition separates the directions $\pi_{j}$ , associated with the different $\delta_{j}$ .

Linear Reparameterization with Mixed Identification Strength

The goal of the following is to refine the linear reparameterization given in the main text when there is mixed identification strength, so as to have $\beta_{1}$ semi-strongly and $\beta_{2}$ weakly identified. The procedure relies on having finitely many sources of identification failure as in the above examples.

In the previous two examples, there were $1\leq k<\infty$ sources of identification failure. There, for a sequence $\gamma_{n}$ associated with weak identification, there are $K=2^{k}-1$ possibilities for identification strength. For instance, in Example 2 we have $(b_{1},\dots,b_{k})\in(\mathbb{R}_{+}\cup\{+\infty\})^{k}$ with at least one $b_{j}<+\infty$. For each $b_{j}$ there are two possibilities ($b_{j}<\infty$ or $b_{j}=\infty$), leading to $2^{k}$ outcomes, minus the one where $b_{j}=\infty$ for all $j$: that case precludes weak identification, since all parameters are then (semi)-strongly identified.
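The count $K=2^{k}-1$ can be checked by brute-force enumeration. The sketch below is illustrative only; the function name `weak_configurations` and the labels `"finite"` and `"infinite"` are assumptions introduced here, standing for $b_{j}<\infty$ and $b_{j}=+\infty$.

```python
from itertools import product

def weak_configurations(k):
    """List every finite/infinite pattern for (b_1, ..., b_k) with at least one finite b_j."""
    configs = []
    for combo in product(("finite", "infinite"), repeat=k):
        # Drop the single pattern where every b_j is infinite:
        # all parameters are then (semi)-strongly identified.
        if "finite" in combo:
            configs.append(combo)
    return configs

for k in (1, 2, 3):
    # Second and third printed columns should agree: enumerated count vs 2^k - 1
    print(k, len(weak_configurations(k)), 2**k - 1)
```

Each retained pattern corresponds to one of the subsets $S_{1},\dots,S_{K}$ on which parameters can be weakly identified.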

With these $K\geq 1$ possible combinations, there are $K$ possible subsets $S_{1},\dots,S_{K}\subseteq\mathcal{B}_{2}$ on which the parameters can be weakly identified. In Example 2, one possible subset is associated with $b_{1}<\infty$ and $b_{j}=+\infty$ for $j>1$ ; here $S_{1}=\{\pi_{1}\in\mathbb{R},\pi_{j}=\pi_{j0},j>1\}$ . Then there are $\delta_{j}(\cdot),\overline{\delta}_{j}(\cdot)$ continuous and $h_{j}(\cdot)>0$ such that:

[TABLE]

Now take $j^{\star}\in\{1,\dots,K\}$ such that $\sqrt{n}\delta_{j^{\star}}(\gamma_{n})\to\infty$ and $\limsup_{n\to\infty}\sqrt{n}\,\overline{\delta}_{j^{\star}}(\gamma_{n})<\infty$.

If $S_{j^{\star}}=\mathcal{B}_{2}$ , all parameters in $\beta_{2}$ are weakly identified and Assumption 5 i. follows from the properties of linear reparameterization and the Maximum Theorem, as explained in the main text. Otherwise, $S_{j^{\star}}\subset\mathcal{B}_{2}$ , strictly; only some parameters in $\beta_{2}$ are weakly identified.

Let $\tilde{V}_{2}=\text{span}\left(\{v_{2}=(\beta_{1n},\beta_{2}^{1})-(\beta_{1n},\beta_{2}^{2}),(\beta_{2}^{1},\beta_{2}^{2})\in S_{j^{\star}}\times S_{j^{\star}}\}\right)$ and $\tilde{V}_{1}=\tilde{V}_{2}^{\perp}$ . Let $\tilde{\beta}_{1}=P_{\tilde{V}_{1}}\beta$ and $\tilde{\beta}_{2}=P_{\tilde{V}_{2}}\beta$ . By construction, $S_{j^{\star}}$ is at most a singleton on $\tilde{V}_{1}$ and a set of dimension $\text{rank}(P_{\tilde{V}_{2}})$ on $\tilde{V}_{2}$ , denoted $\tilde{S}_{j^{\star}}$ .

If $\text{rank}(P_{\tilde{V}_{2}})=d_{\beta_{2}}$ then all directions of $\beta_{2}$ are weakly identified and $\tilde{\beta}_{2}=\beta_{2}$ is unchanged. By construction, $\limsup_{n\to\infty}\sqrt{n}\sup_{\tilde{\beta}_{2}\in\tilde{S}_{j^{\star}}}\|g(\tilde{\beta}_{1n},\tilde{\beta}_{2},\gamma_{n})\|_{W}=\limsup_{n\to\infty}\sqrt{n}\overline{\delta}_{j^{\star}}(\gamma_{n})<\infty$. To reduce notation, suppose $\beta_{2n}\in S_{j^{\star}}$; then $\|\beta_{1}-\tilde{\beta}_{1n}\|\geq\varepsilon\Rightarrow d(\beta,\{\beta_{1n}\}\times(S_{j^{\star}}\cup\{\beta_{2n}\}))\geq\varepsilon$, using $\|\beta^{1}-\beta^{2}\|^{2}=\|P_{\tilde{V}_{1}}(\beta^{1}-\beta^{2})\|^{2}+\|P_{\tilde{V}_{2}}(\beta^{1}-\beta^{2})\|^{2}$. This implies that $\inf_{\|\tilde{\beta}_{1}-\tilde{\beta}_{1n}\|\geq\varepsilon,\tilde{\beta}_{2}}\|g(\tilde{\beta}_{1},\tilde{\beta}_{2},\gamma_{n})\|_{W}\geq\delta_{j^{\star}}(\gamma_{n})h_{j^{\star}}(\varepsilon)$ with $\sqrt{n}\delta_{j^{\star}}(\gamma_{n})\to\infty$, which yields Assumption 5 i., i.e. $\tilde{\beta}_{1}$ is semi-strongly identified and $\tilde{\beta}_{2}$ is weakly identified on the set $\tilde{S}_{j^{\star}}$.

Appendix F Uniform Sampling on Level Sets

As shown in Section 2.1, the computation of the quasi-Jacobian requires uniform draws over the level set $\Theta_{n}=\{\theta\in\Theta,\|\bar{g}_{n}(\theta)\|_{W_{n}}\leq\kappa_{n}\}$ and similarly test inversion amounts to finding the level set $\{\theta\in\Theta,\|\bar{g}_{n}(\theta)\|^{2}_{\hat{V}_{n}^{-1}}\leq\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)\}$ and projecting it onto $\theta_{1}$ .

Direct approach:

The approach used in Section 5 amounts to importance sampling. Draw $\theta_{1},\dots,\theta_{B}$ uniformly distributed on $\Theta$ and assign weights proportional to $\mathbbm{1}_{\|\bar{g}_{n}(\theta_{b})\|_{W_{n}}\leq\kappa_{n}}$. The weighted sample is uniformly distributed on the level set. The draws $(\theta_{b})_{b=1,\dots,B}$ can be random or quasi-random, using quasi-Monte Carlo sequences such as the Sobol or Halton sequence (see Lemieux, 2009, Section 5). The main drawback of this approach is that the effective sample size can be very small, i.e. few draws have non-zero weight, when the level set is small relative to the parameter space. In particular, the effective sample size is approximately $B\times\text{volume}(\Theta_{n})/\text{volume}(\Theta)$, which tends to be small when the dimension of $\theta$ is moderately large.
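The direct approach and its effective sample size can be illustrated with a minimal Python sketch. This is not the paper's implementation: `toy_criterion` is a hypothetical stand-in for $\|\bar{g}_{n}(\theta)\|_{W_{n}}$ (the Euclidean norm of $\theta$, so the level set is a disc), and the box $[-1,1]^{2}$, $\kappa=0.1$ and $B=20000$ are assumed values.

```python
import math
import random

def direct_level_set_sample(criterion, bounds, kappa, B, seed=0):
    """Draw B points uniformly on a box and keep those with criterion(theta) <= kappa."""
    rng = random.Random(seed)
    draws = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(B)]
    # Indicator weights: a draw has weight 1 inside the level set and 0 outside,
    # so keeping the weight-1 draws gives a uniform sample on the level set.
    return [theta for theta in draws if criterion(theta) <= kappa]

def toy_criterion(theta):
    # Hypothetical stand-in for the norm of the sample moments
    return math.hypot(theta[0], theta[1])

B, kappa = 20000, 0.1
bounds = [(-1.0, 1.0), (-1.0, 1.0)]
kept = direct_level_set_sample(toy_criterion, bounds, kappa, B)

# Predicted effective sample size: B * volume(level set) / volume(box)
predicted_ess = B * (math.pi * kappa**2) / 4.0
print(len(kept), round(predicted_ess, 1))
```

In this toy setting fewer than one percent of the draws survive, which is the drawback that motivates the adaptive scheme described next.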

Adaptive Sampling by Population Monte Carlo:

The main idea here is to preserve the simplicity of importance sampling while constructing a sequence of proposal distributions with a higher acceptance rate. Algorithm 1 below is adapted from the Population Monte Carlo principle laid out in Cappé et al. (2004). Consider a sequence of level sets: $\Theta_{jn}=\{\theta\in\Theta,\|\bar{g}_{n}(\theta)\|_{W_{n}}\leq\kappa_{jn}\}$ with $\kappa_{1n}>\kappa_{2n}>\dots>\kappa_{Jn}=\kappa_{n}$ for some $J\geq 1$. By construction $\Theta_{n}=\Theta_{Jn}\subseteq\dots\subseteq\Theta_{2n}\subseteq\Theta_{1n}$ and $\text{volume}(\Theta_{1n})\geq\dots\geq\text{volume}(\Theta_{Jn})=\text{volume}(\Theta_{n})$. This implies that it is easier to generate uniform draws on $\Theta_{1n}$ than on $\Theta_{n}$.

The following summarizes the algorithm in plain terms. The initialization step is a simple accept-reject algorithm to generate iid draws on $\Theta_{1n}$. Then, given the draws from iteration $j-1\geq 1$, draw $\theta^{j\star}_{b}$ from the weighted sample $(\theta^{j-1}_{b},w^{j-1}_{b})_{b=1,\dots,B}$ and generate $\theta^{j}_{b}$ using a transition kernel $q_{jb}$, for instance a random-walk step $\theta^{j}_{b}\sim\mathcal{N}(\theta^{j\star}_{b},\Sigma^{j}_{b})$. Re-draw both $\theta^{j\star}_{b}$ and $\theta^{j}_{b}$ until the criterion $\|\bar{g}_{n}(\theta^{j}_{b})\|_{W_{n}}\leq\kappa_{jn}$ is met and then set the weight according to the sampling probability $w^{j}_{b}\propto w(\theta^{j\star}_{b})/q_{jb}(\theta^{j}_{b}|\theta_{b}^{j\star})$. Repeat this process for each $b=1,\dots,B$ and each $j=2,\dots,J$. The final weighted sample $(\theta^{J}_{b},w^{J}_{b})_{b=1,\dots,B}$ targets the desired distribution.

There are several choices of tuning parameters in the steps above. First, $\kappa_{jn}$ can be chosen adaptively to avoid decreasing it too quickly or too slowly, either of which would result in poor computational performance. In the empirical application, $\kappa_{1n}$ is set according to the median value of $\|\bar{g}_{n}(\theta^{1}_{b})\|_{W_{n}}^{2}$ from uniform draws $\theta^{1}_{b}$ on $\Theta$; this yields $\kappa_{1n}^{2}=5500$. Then $\kappa_{jn}$ is set according to $\kappa_{jn}^{2}=\min(0.9\kappa_{j-1n}^{2},q_{j-1}(0.6))$ where $q_{j-1}(0.6)$ is the $60\%$ quantile of $\|\bar{g}_{n}(\theta_{b}^{j-1})\|_{W_{n}}^{2}$. This guarantees that $\kappa_{jn}$ is strictly decreasing but declines slowly enough to maintain a reasonable acceptance rate. To adapt to the shape of each $\Theta_{jn}$, the proposal $q_{jn}$ is also constructed adaptively. For each $j\geq 2$, a clustering algorithm is applied to the draws $(\theta^{j}_{b})_{b=1,\dots,B}$ to split the draws into $K=3$ clusters. Then $\Sigma_{b}^{j}$ is $2$ times the variance of the draws from the cluster to which $\theta^{j}_{b}$ belongs. This accommodates multimodality in the objective function. The inner loop, over $b=1,\dots,B$, is run in parallel, which speeds up the computation significantly. In the application, the final $n\times\kappa_{n}^{2}=\|\bar{g}_{n}(\hat{\theta}_{n})\|_{W_{n}}^{2}+2\log\log(n)=10.34$ is attained from the initial $n\times\kappa_{1n}^{2}=5500$ after $J=45$ iterations.
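The adaptive rule for $\kappa_{jn}^{2}$ can be sketched as follows. This is a hedged illustration rather than the code used in the application: the helper names `quantile` and `next_kappa_sq` are introduced here, the floor at the target level is an added assumption so that the schedule ends exactly at the final level, and the criterion values of the previous draws are replaced by hypothetical uniform numbers instead of actual Population Monte Carlo output.

```python
import random

def quantile(values, prob):
    """Empirical quantile by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = prob * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def next_kappa_sq(prev_kappa_sq, prev_criteria_sq, target_kappa_sq):
    """kappa_j^2 = min(0.9 * kappa_{j-1}^2, 60% quantile of the previous criterion values)."""
    proposal = min(0.9 * prev_kappa_sq, quantile(prev_criteria_sq, 0.6))
    # Assumption added for this sketch: never step below the final level
    return max(proposal, target_kappa_sq)

rng = random.Random(1)
target = 10.34       # final level reported in the text
schedule = [5500.0]  # initial level reported in the text
while schedule[-1] > target:
    # Hypothetical stand-in for the criterion values of the previous draws:
    # uniform on [0, current level]
    prev = [rng.uniform(0.0, schedule[-1]) for _ in range(1000)]
    schedule.append(next_kappa_sq(schedule[-1], prev, target))

print(len(schedule) - 1, schedule[-1])
```

With real draws the $60\%$ quantile depends on the shape of the level sets, so the number of iterations will differ from this toy run; the factor $0.9$ only guarantees a strict decrease.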

The output of Algorithm 1 is used both to compute $B_{n,\infty}$ and later for test inversion by picking $\hat{j}$ such that $\hat{j}=\inf\{j=1,\dots,J,\kappa_{jn}^{2}\geq\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)\}$ and running one more iteration with $\kappa_{\hat{j}+1n}^{2}=\chi^{2}_{d_{g}-\hat{d}_{n}}(1-\alpha)$. This yields the 5000 draws shown in Figure 4.

Appendix G Sample Code to Implement the Procedure

The following provides some sample R code to perform the steps outlined in Section 2.1 for the Monte Carlo example in Appendix H.2.

require(randtoolbox) # Used to generate the integration grid

library(pracma) # Used to compute matrix square root

library(CVXR) # CVX for R

library(Rmosek) # To use the MOSEK solver in CVX

set.seed(123)

n = 1e3 # Sample size

B = 1e4 # Number of draws

# Robust and standard critical values:

critical_R = qchisq(0.95,2)

critical_S = qchisq(0.95,1)

# **************************************************************

# Simulate Data, Define Moment Conditions

# **************************************************************

c = 1 # c determines identification strength

b1 = c/sqrt(n) # theta1 = c/sqrt(n)

b2 = 5 # theta2 is fixed

# Simulate data: x1, x2, e, and y = b1*x1 + b1*b2*x2 + e

x1 = rnorm(n)

x2 = rnorm(n)

e = rnorm(n)

y = b1*x1 + b1*b2*x2 + e

moments <- function(b,y,x1,x2) {

  # computes the sample moments and the variance of the moments

  e_hat = y - b[1]*x1 - b[1]*b[2]*x2 # residuals

  mom   = cbind(e_hat,e_hat)*cbind(x1,x2)

  mom_m = apply(mom,2,mean) # g_bar

  V     = var(mom)          # V_hat

  return( list( mom = mom_m, V = V ) )

}

objective <- function(b,y,x1,x2) {

  # computes the GMM objective function

  mm = moments(b,y,x1,x2)

  return( t(mm$mom)%*%solve(mm$V,mm$mom) )

}

# **************************************************************

# Compute the quasi-Jacobian Matrix

# **************************************************************

# Set the integration grid:

s = sobol(B,2,scrambling=1)

p = cbind(rep(b1,B),rep(b2,B)) + 2*(s-1/2)

objs = rep(NA,B) # Store GMM objective values

moms = matrix(NA,B,2) # Store sample moments mom

Vs = array(NA,dim=c(2,2,B)) # Store variances V

for (b in 1:B) { # Evaluate the moments on the grid

  mm = moments(p[b,],y,x1,x2)

  objs[b] = t(mm$mom)%*%solve(mm$V,mm$mom)

  moms[b,] = mm$mom

  Vs[,,b] = mm$V

}

# Select draws on the level set

ind = which(objs - min(objs) <= 2*log(log(n))/n)

grid_sub = p[ind,]

moms_sub = moms[ind,]

Vs_sub = Vs[,,ind]

X = cbind(1,grid_sub) # regressors: intercept and theta_b

# Write the optimization problem for CVX

beta = Variable(dim(X)[2],dim(moms_sub)[2]) # matrix of coefficients (A,B)

objc <- Minimize(norm( moms_sub - X %*% beta,"I")) # l-infinity loss

prob <- Problem(objc) # compile the problem

result <- solve(prob,solver="ECOS_BB") # compute the solution

coef = result$getValue(beta) # extract solution

Bn = t(coef[2:3,]) # quasi-Jacobian matrix

# Now compute the normalization matrix for the left-hand-side

V = matrix(0,2,2) # Compute V_bar the average variance matrix

for (b in 1:length(ind)) {

V = V + Vs_sub[,,b]/length(ind)

}

# Now compute the normalization matrix for the right-hand-side

mu <- Variable(1,2) # vector of means

one = matrix(1,length(ind),1)

VV = Variable(2,2) # matrix of variances

objc <- Minimize( - log_det(VV) + 0.5*norm( (grid_sub%*%VV - kronecker(one,mu))^2,"I")) # setup the minimization problem in CVX

prob <- Problem(objc) # compile

result2 <- solve(prob,solver="MOSEK") # solve using MOSEK solver

phi = result2$getValue(VV) # extract solution

# Note that phi = Sigma^(-1/2), the problem was reparameterized

# **************************************************************

# Identification Category Selection

# **************************************************************

v = c(1,0) # vector which spans theta1

M = diag(2)-v%*%t(v) # Projection matrix onto the span of theta2

# Normalized quasi-Jacobian matrix

# sqrtm computes the matrix square root and Binv its inverse

Bnorm = ( sqrtm(V)$Binv )%*%( Bn%*%M )%*%phi

# singular values in decreasing order

sing = svd(Bnorm)$d

cutoff = sqrt(2*log(n)/n) # cutoff lambda_n for ICS

print('Singular values without projecting out theta1:')

print( round(svd(( sqrtm(V)$Binv )%*%( Bn )%*%phi)$d,3) )

print('Singular values after projecting out theta1:')

print(round(sing,3))

print('Cutoff:')

print(cutoff)

# Set critical value depending on the singular value and cutoff

cr = 1*(sing[1]>cutoff)*critical_S + 1*(sing[1]<cutoff)*critical_R

if (sing[1]>cutoff) {

  print('Nuisance parameter is semi-strongly identified')

} else {

  print('Nuisance parameter is weakly identified')

}

# **************************************************************

# Subvector Inference

# **************************************************************

# Test H0: b1 = b10 at the 5% significance level

b10 = 0

obj <- function(b2,b10,y,x1,x2) {

  return( objective(c(b10,b2),y,x1,x2) )

}

# Anderson-Rubin test statistic

AR = n*optimize(obj,c(-20,20),b10=b10,y=y,x1=x1,x2=x2)$objective

if (AR > cr) {

  print('Reject H0')

} else {

  print('Cannot reject H0')

}

# Compute a 95% confidence set:

ind = which(n*objs <= cr) # objs is not scaled by n, unlike AR above

print('Confidence Interval for theta1:')

print(c(min(p[ind,1]),max(p[ind,1])))

print('True value:')

print(b1)

Appendix H Additional Results for Section 5

H.1 Verification of the Main Assumptions

We now verify the main assumptions for the NLS example in Appendix H.2:

[TABLE]

where $(x_{1i},x_{2i},u_{i})\sim\mathcal{N}(0,I)$ iid. The optimization space is $\Theta=\Theta_{1}\times\Theta_{2}=[\underline{\theta}_{1},\overline{\theta}_{1}]\times[\underline{\theta}_{2},\overline{\theta}_{2}]$, where $-\infty<\underline{\theta}_{1,2}<0<\overline{\theta}_{1,2}<\infty$. We can then set $\overline{\Theta}=\overline{\Theta}_{1}\times\overline{\Theta}_{2}=[\underline{\theta}_{1}+\varepsilon,\overline{\theta}_{1}-\varepsilon]\times[\underline{\theta}_{2}+\varepsilon,\overline{\theta}_{2}-\varepsilon]$ for any $0<\varepsilon<\min_{j=1,2}(|\underline{\theta}_{j}|,|\overline{\theta}_{j}|)$. The parameter space is then $\Gamma=\{\gamma=(\theta,\omega)\in\overline{\Theta}\times\Omega\}$, where $\Omega$ indexes the distribution $F$ of $(x_{1i},x_{2i},u_{i})$; here it is very simple since $\Omega=\{\Phi\}$, the normal distribution above. More general distribution spaces could take the form: $\Omega=\{F,\mathbb{E}_{F}(x_{1i},x_{2i},u_{i})=(\mu_{1},\mu_{2},0),\|(\mu_{1},\mu_{2})\|\leq c,\mathbb{E}_{F}((x_{1i}-\mu_{1},x_{2i}-\mu_{2},u_{i})(x_{1i}-\mu_{1},x_{2i}-\mu_{2},u_{i})^{\prime})=\Sigma,0<c\leq\lambda_{\min}(\Sigma)\leq\lambda_{\max}(\Sigma)\leq C<\infty,\mathbb{E}_{F}(\|(x_{1i},x_{2i},u_{i})\|^{4})\leq C\}$. See Andrews and Cheng (2012) for more examples. Assumption 1 i., ii. hold for this choice of $\overline{\Theta},\Theta$, and $\Gamma$.

The sample moments are $\bar{g}_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}(y_{i}x_{1i}-\theta_{1},y_{i}x_{2i}-\theta_{1}\theta_{2})^{\prime}$ and their population counterpart is $g(\theta,\gamma_{0})=(\theta_{10}-\theta_{1},\theta_{10}\theta_{20}-\theta_{1}\theta_{2})^{\prime}.$ They can be re-written as:

[TABLE]

The lower triangular matrix has two eigenvalues: $1$ and $\theta_{10}$ . Hence, we have the following inequality: $\|g(\theta,\gamma_{0})\|\geq\min(1,|\theta_{10}|)\times\|\theta-\theta_{0}\|$ . This implies that Assumption 3 i. holds with $\delta(\gamma_{0})=\min(1,|\theta_{10}|)$ which is continuous in $\gamma=(\theta,\omega)$ , and $h(\varepsilon)=\varepsilon$ . To verify Assumption 3 ii., take $\theta=(\theta_{10},\theta_{2})$ , then $\|g(\theta,\gamma_{0})\|=|\theta_{10}|\times\|\theta-\theta_{0}\|$ . We have $|\theta_{10}|=\min(1,|\theta_{10}|)\times\frac{|\theta_{10}|}{\min(1,|\theta_{10}|)}\leq\max(1,|\overline{\theta}_{10}|,|\underline{\theta}_{10}|)\times\min(1,|\theta_{10}|)$ .

Assumption 4 i. holds for any $\theta_{1n}\neq 0$ . Condition ii. holds if $\sqrt{n}|\theta_{1n}|\to\infty$ . Condition iii. is a stochastic equicontinuity condition which can be verified by Lipschitz continuity and conditions on the parameter space and the distribution of the covariates and the errors. Condition iv holds because the quadratic term vanishes at the same rate as the first-order term in the Taylor expansion ( $g(\theta,\gamma_{n})$ is a polynomial of order $2$ which becomes flat wrt $\theta_{2}$ when $\theta_{1n}\to 0$ ). Condition v. can be verified numerically.

For Assumption 5 i., note that $\|g(\theta,\gamma_{n})\|^{2}=\|\theta_{1}-\theta_{1n}\|^{2}+\|\theta_{1}\theta_{2}-\theta_{1n}\theta_{2n}\|^{2}\geq\|\theta_{1}-\theta_{1n}\|^{2}$. Here we can use $\tilde{\delta}(\gamma_{n})=1$, $\tilde{h}(\varepsilon)=\varepsilon$. Moreover, $\sqrt{n}\|g(\theta_{1n},\theta_{2},\gamma_{n})\|=\sqrt{n}|\theta_{1n}|\times\|\theta_{2}-\theta_{2n}\|\leq\sqrt{n}|\theta_{1n}|\times 2\max(|\overline{\theta}_{2}|,|\underline{\theta}_{2}|)\to(\lim_{n\to\infty}\sqrt{n}|\theta_{1n}|)2\max(|\overline{\theta}_{2}|,|\underline{\theta}_{2}|)<\infty$ for weak sequences. Hence, Assumption 5 i. and ii. hold with $\mathcal{B}_{2}^{0}=\Theta_{2}$.
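The bound used for weak sequences can be checked numerically from the population moments $g(\theta,\gamma_{n})=(\theta_{1n}-\theta_{1},\theta_{1n}\theta_{2n}-\theta_{1}\theta_{2})^{\prime}$. The sketch below is illustrative; the function names and the values $c=1$, $\theta_{2n}=5$ and $\theta_{2}=-3$ are assumptions chosen for the example.

```python
import math

def population_moments(theta, theta_n):
    """g(theta, gamma_n) for the NLS example: (theta_1n - theta_1, theta_1n*theta_2n - theta_1*theta_2)."""
    t1, t2 = theta
    t1n, t2n = theta_n
    return (t1n - t1, t1n * t2n - t1 * t2)

def scaled_norm_at_theta1n(theta2, c, theta2n, n):
    """sqrt(n) * ||g(theta_1n, theta_2, gamma_n)|| along the weak sequence theta_1n = c / sqrt(n)."""
    theta1n = c / math.sqrt(n)
    g = population_moments((theta1n, theta2), (theta1n, theta2n))
    return math.sqrt(n) * math.hypot(g[0], g[1])

for n in (100, 10_000, 1_000_000):
    # The printed value should stay close to c * |theta_2 - theta_2n| for every n
    print(n, scaled_norm_at_theta1n(theta2=-3.0, c=1.0, theta2n=5.0, n=n))
```

Along this sequence the scaled moment norm does not grow with $n$, which is what places $\theta_{2}$ in the weakly identified block.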

H.2 Additional Simulation Results

Consumption Capital Asset Pricing Model (CAPM)

Figure H5 shows the sampling distribution of the CAPM estimates $\hat{\theta}_{n}=(\hat{\delta}_{n},\hat{\gamma}_{n})$ .

Table LABEL:t2xCAPM_JI and Figure H10 replicate the results in the main text for a just-identified specification where $Z_{t}=(1,R_{t})^{\prime}$ .

Non-Linear Regression Model.

To illustrate the finite-sample properties of the quasi-Jacobian matrix and the test procedure, consider the following nonlinear regression model:

[TABLE]

where $(x_{1i},x_{2i},u_{i})\sim\mathcal{N}(0,I)$ iid. The sample moment conditions are $\bar{g}_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}(y_{i}x_{1i}-\theta_{1},y_{i}x_{2i}-\theta_{1}\theta_{2})^{\prime}$ with population counterpart $g(\theta,\gamma_{0})=(\theta_{10}-\theta_{1},\theta_{10}\theta_{20}-\theta_{1}\theta_{2})^{\prime}$ . For $\theta_{10}=0$ , $\theta_{2}$ is unidentified and for $\theta_{1n}=cn^{-1/2}$ , $c>0$ , $\theta_{2}$ is weakly identified, even if $\theta_{1}=\theta_{1n}$ is known and fixed. The reparameterization $\beta=M\theta$ here is $\beta_{1}=\theta_{1}$ , $\beta_{2}=\beta_{22}=\theta_{2}$ , $M=I$ , and $\mathcal{B}_{2}^{0}=\mathcal{B}_{22}^{0}=[\underline{\theta_{2}},\overline{\theta}_{2}]$ where $[\underline{\theta}_{1},\overline{\theta}_{1}]\times[\underline{\theta}_{2},\overline{\theta}_{2}]=\Theta_{1}\times\Theta_{2}=\Theta$ . The assumptions used for the main results are verified for this model in Appendix H.1.

In this simple example, the source of the identification failure is known so that the type I test procedure in Andrews and Cheng (2012, AC12) will be used as a benchmark. Let $\text{ICS}_{n}=|\hat{\theta}_{1n}|/\hat{\sigma}_{\hat{\theta}_{1n}}$, where $\hat{\theta}_{n}=(\hat{\theta}_{1n},\hat{\theta}_{2n})^{\prime}$ is the sample minimizer of $\|\bar{g}_{n}(\theta)\|$ and $\hat{\sigma}_{\hat{\theta}_{1n}}^{2}$ estimates the asymptotic variance of $\hat{\theta}_{1n}$ using the sandwich formula. The test statistic is $\text{QLR}_{n}(\theta_{1})=\text{AR}_{n}(\theta_{1})$ since the model is just-identified. Let $\underline{\lambda}_{n}$ be as in Section 2.3. When $\text{ICS}_{n}>\underline{\lambda}_{n}$, the test rejects $H_{0}$ if $\text{AR}_{n}(\theta_{1})>\chi^{2}_{1}(1-\alpha)$. When $\text{ICS}_{n}\leq\underline{\lambda}_{n}$, the test rejects $H_{0}$ if $\text{AR}_{n}(\theta_{1})>c_{LF,1-\alpha}$ where $c_{LF,1-\alpha}$ is the least-favorable $1-\alpha$ quantile of $\text{AR}_{n}(\theta_{1})$ over $(\theta_{2},\gamma)\in\Theta_{2}\times\Gamma$. Note that under $H_{0}:\theta_{10}=0$, $\text{AR}_{n}(\theta_{10})\overset{d}{\to}\chi^{2}_{2}$, regardless of $\theta_{20}$. Hence, the projection-based critical value in Section 2.3 is the least-favorable critical value, $c_{LF,1-\alpha}=\chi^{2}_{2}(1-\alpha)$. (Footnote: A null-imposed least-favorable critical value can also be computed by simulating the distribution of $\text{AR}_{n}(\theta_{10})$ for each $H_{0}:\theta_{1}=\theta_{10}$ and all possible $\theta_{20}$. This will not be used here to keep computation manageable.) To summarize, this implementation of the Andrews and Cheng (2012) procedure relies on the same test statistic and critical values as in Section 2.3; the only difference is the choice of ICS statistic.
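The two-step rule described above (pick the critical value from the ICS statistic, then compare the AR statistic to it) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: the function names are introduced here, the default cutoff $\sqrt{2\log(n)/n}$ is borrowed from the sample code in Appendix G, and the quantile helper only covers the one and two degrees of freedom needed in this example.

```python
import math
from statistics import NormalDist

def chi2_quantile(prob, df):
    """Chi-squared quantile for df = 1 or 2 only, using closed forms."""
    if df == 1:
        # If Z is standard normal, Z^2 is chi-squared(1): q = (Phi^{-1}((1 + p) / 2))^2
        z = NormalDist().inv_cdf((1.0 + prob) / 2.0)
        return z * z
    if df == 2:
        # chi-squared(2) is exponential with mean 2: q = -2 * log(1 - p)
        return -2.0 * math.log(1.0 - prob)
    raise ValueError("only df = 1 or 2 are supported in this sketch")

def ics_test(ar_stat, ics_stat, n, alpha=0.05, cutoff=None):
    """Choose the critical value from the ICS statistic, then run the AR test."""
    if cutoff is None:
        cutoff = math.sqrt(2.0 * math.log(n) / n)  # assumed default, as in the Appendix G code
    semi_strong = ics_stat > cutoff
    critical = chi2_quantile(1.0 - alpha, 1 if semi_strong else 2)
    return {"semi_strong": semi_strong, "critical": critical, "reject": ar_stat > critical}

# Same AR statistic, different ICS outcomes (hypothetical numbers)
print(ics_test(ar_stat=4.5, ics_stat=1.0, n=1000))
print(ics_test(ar_stat=4.5, ics_stat=0.01, n=1000))
```

The same AR statistic can lead to different decisions depending on which side of the cutoff the ICS statistic falls, which is why the behaviour of the ICS step matters for coverage and power.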

Figure H11 reports the finite sample properties of several tests and ICS procedures. The top panel shows coverage for $H_{0}:\theta_{1}=\theta_{1n}=cn^{-1/2}$, $c\in[0,10]$, $n=1000$, using a Wald statistic, full projection inference, AC12, and the test procedure from Section 2.3 using the normalized and unnormalized quasi-Jacobian $B_{n,\infty}$. The Wald test suffers from severe size distortion for $c\in[0,2]$ but is accurate for larger values of $c$. Full projection inference is robust regardless of $c$ but conservative for $c>0$. AC12 and the present procedures have coverage above the 95% nominal level; the unnormalized procedure is more conservative and AC12 is non-monotonic. To better understand these patterns, the bottom two panels provide further information on the ICS procedures. The left panel shows how often $\text{ICS}_{n}\leq\underline{\lambda}_{n}$. The normalized statistic sees a large decline around $c=2$ when size distortion is less severe. AC12 is non-monotonic around $c=1$ where the Wald statistic, on which it is based, has large size distortion. The unnormalized statistic declines sharply but later than the normalized one. To further understand these differences, the right panel plots the distribution of $\log(1+\text{ICS}_{n})$. The solid horizontal line indicates the cutoff $\log(1+\underline{\lambda}_{n})$. The normalized statistic diverges quickly with $c$, as identification becomes stronger. This matches the above discussion on the role of post-multiplying the quasi-Jacobian by $\Sigma_{n}^{-1/2}$. AC12 is more dispersed, resulting in more variable outcomes for the ICS procedure as seen in the slow decline in the left panel. AC12 increases with $c$ at a similar rate as the unnormalized statistic. Finite sample power properties of these test procedures are reported in Appendix H.2 as well as results using a larger $\kappa_{n}$.

Figure H12 below presents the finite-sample power properties of the test procedures used in Section 5. It shows rejection rates against local alternatives $H_{0}:\theta_{1}=\theta_{1n}+an^{-1/2}$ where the true $\theta_{1n}=cn^{-1/2}$ . The nuisance parameter $\theta_{2}$ is unidentified for $c=0$ and weakly identified for $c\simeq 0$ . Each panel summarizes the finite-sample power properties for a specific level of identification strength $c$ .

For $c=0$ the Projection, AC12, and (un)normalized procedures have identical properties. For $c=1$, AC12 does not always detect weak identification (see Figure H11, bottom left panel), which leads to a smaller critical value and higher rejection rates than the other methods. For $c\in[2,3]$, the normalized test procedure relies on $\chi^{2}_{1}$ critical values and has comparable power to the Wald test except for $a+c\simeq 0$. Recall that for just-identified models, the test procedure in Section 2.3 is equivalent to a standard QLR test when $\hat{d}_{n}=d_{\theta_{2}}$, which can be more powerful than the Wald test in finite samples. The normalized ICS procedure is thus more powerful since it almost always picks $\hat{d}_{n}=d_{\theta_{2}}$ when $c\geq 2$ (see Figure H11, bottom left panel). The Wald test is not reported for $c<2$ where it suffers from substantial size distortion. AC12 has lower power for $c\in[2,4]$ and similar power properties for $c\geq 5$. The unnormalized procedure is comparable to AC12 for $c\in[2,3]$ and is more powerful for $c=3$.

H.3 Additional Empirical Results

Confidence sets for $\gamma$ and $\psi^{-1}$ with a $\chi^{2}_{6}$ critical value: $[5.28,25]$ and $[0.01,0.87]$ , respectively. Using a $\chi^{2}_{6}$ critical value amounts to using $\underline{\lambda}_{n}\in[0.25,387)$ in the baseline results (Table 3) and $\underline{\lambda}_{n}\in[0.51,130)$ with the larger value for $\kappa_{n}$ (Table H6 below).

Appendix I Asymptotic Properties of the quasi-Jacobian under Higher-Order Identification

The following provides pointwise asymptotic results for the quasi-Jacobian matrix when the model is globally but not locally identified.

Assumption I7 (Higher-Order Identification).

Let $(\theta_{0},\gamma_{0})\in\overline{\Theta}\times\Gamma$ be such that for some $\varepsilon>0$ the moments satisfy:

[TABLE]

where $\underline{\delta}>0$. For some $r\geq 2$, there exist orthogonal projection matrices $P_{1},\dots,P_{r}$ and constants $C_{1}\geq 0,\dots,C_{r-1}\geq 0,C_{r}>0$ where $\sum_{j}C_{j}P_{j}$ has rank $d_{\theta}$ and $C_{j}C_{\ell}P_{j}P_{\ell}=0$ for any $1\leq j<\ell\leq r$. These constants and projection matrices are such that for some $\overline{C}>0$ and any $\|\theta-\theta_{0}\|\leq\varepsilon$:

[TABLE]

Assumption I7 implies that the model is globally identified but local identification fails so that around $\theta=\theta_{0}$ , the moment function is not linear but approximately polynomial of order $r\geq 2$ . If $C_{j}>0$ then $\|g(\theta,\gamma_{0})\|$ is approximately a polynomial of order $j$ in the directions spanned by $P_{j}$ . This contrasts with locally identified models where $g(\theta,\gamma_{0})\approx\partial_{\theta}g(\theta_{0},\gamma_{0})(\theta-\theta_{0})$ which is locally linear when $\partial_{\theta}g(\theta_{0},\gamma_{0})$ is full rank and the non-linear remainder terms are negligible. Under this type of local identification failure, the parameters are consistently estimable but $\hat{\theta}_{n}$ has non-standard limiting distribution. Full vector inference using the Anderson and Rubin (1949) statistic remains valid. As in weakly identified models, concentrating out locally identified nuisance parameters leads to more powerful and asymptotically valid inferences.

Theorem I5.

Suppose Assumption 1 ii-iii, 2, and I7 hold for $\gamma=\gamma_{0}$ , then:

[TABLE]

For any $v_{j}$ such that $P_{j}v_{j}=v_{j}$ and $C_{j}>0$ : $\|B_{n,\infty}v_{j}\|=O_{p}(\kappa_{n}^{1-1/j})$ .

Proof of Theorem I5 for $B_{n,\infty}$ :

Pick $h\in\mathbb{R}$ and $v_{j}\in\text{span}(P_{j})$ with $\|v_{j}\|=1$ for some $j\in\{2,\dots,r\}$ with $C_{j}\neq 0$ . Let $\theta_{jn}=\theta_{0}+\kappa_{n}^{1/j}hv_{j}$ , by Assumption I7 we have:

[TABLE]

wpa 1 for all $|h|\leq 1/2[\overline{\lambda}_{W}C_{j}]^{-1/j}$. This implies that $\hat{K}_{n}(\theta_{jn})\geq\underline{K}=\inf_{x\in[0,3/4]}K(x)>0$, wpa 1 uniformly in $|h|\leq 1/2[\overline{\lambda}_{W}C_{j}]^{-1/j}$. Using similar arguments as in the proof of Theorem 3, we have: $\|\bar{g}_{n}(\theta)-A_{n,\infty}-B_{n,\infty}\theta\|\hat{K}_{n}(\theta)\leq\overline{K}\underline{\lambda}_{W}^{-1}\kappa_{n}+o(\kappa_{n})$, for all $\theta\in\Theta$. Using the triangle inequality, we have for any $h_{1}\neq h_{2}$ such that $|h_{1,2}|\leq 1/2[\overline{\lambda}_{W}C_{j}]^{-1/j}$:

[TABLE]

wpa 1. Since $j>1$ , this implies that:

[TABLE]

wpa 1 for each $j$ such that $C_{j}\neq 0$ . In particular, we have for $j=r$ that: $\|B_{n,\infty}v_{r}\|\leq O_{p}(\kappa_{n}^{1-1/r})$ so that $\lambda_{\min}(B_{n,\infty}^{\prime}B_{n,\infty})\leq v_{r}^{\prime}B_{n,\infty}^{\prime}B_{n,\infty}v_{r}\leq O_{p}(\kappa_{n}^{2[1-1/r]})$ . ∎
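To get a feel for the rate $\kappa_{n}^{1-1/j}$ in Theorem I5, the sketch below evaluates it for a few sample sizes and orders $j$. It is only illustrative: the choice $\kappa_{n}^{2}=2\log\log(n)/n$ is an assumption borrowed from the level-set cutoff in the Appendix G sample code, and the function names are introduced here.

```python
import math

def kappa_n(n):
    # Assumption: kappa_n^2 = 2 * log(log(n)) / n, as in the Appendix G level-set cutoff
    return math.sqrt(2.0 * math.log(math.log(n)) / n)

def singular_direction_rate(n, j):
    """Order kappa_n^(1 - 1/j) of ||B v_j|| along a direction identified at order j."""
    return kappa_n(n) ** (1.0 - 1.0 / j)

for n in (10**3, 10**5, 10**7):
    # j = 1 corresponds to a first-order identified direction (rate of order one)
    rates = [round(singular_direction_rate(n, j), 4) for j in (1, 2, 3)]
    print(n, rates)
```

For $j=1$ the rate is of order one, consistent with a full-rank Jacobian, while for $j\geq 2$ it shrinks with $n$, and shrinks faster the higher the order of identification.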

Bibliography

Anderson, T. W. and H. Rubin (1949): “Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations,” The Annals of Mathematical Statistics, 20, 46–63.
Andrews, D. W. (2017): “Identification-robust subvector inference,” Cowles Foundation Discussion Paper.
Andrews, D. W. and X. Cheng (2012): “Estimation and Inference With Weak, Semi-Strong, and Strong Identification,” Econometrica, 80, 2153–2211.
Andrews, D. W. and X. Cheng (2013): “Maximum likelihood estimation and uniform inference with sporadic identification failure,” Journal of Econometrics, 173, 36–56.
Andrews, D. W. and X. Cheng (2014): “GMM Estimation and Uniform Subvector Inference with Possible Identification Failure,” Econometric Theory, 30, 287–333.
Andrews, D. W., X. Cheng, and P. Guggenberger (2020): “Generic results for establishing the asymptotic size of confidence sets and tests,” Journal of Econometrics, 218, 496–531.
Andrews, I. and A. Mikusheva (2016): “Conditional Inference With a Functional Nuisance Parameter,” Econometrica, 84, 1571–1612.
Antoine, B. and E. Renault (2009): “Efficient GMM with nearly-weak instruments,” Econometrics Journal, 12, S135–S171.
