A Proof of Orthogonal Double Machine Learning with $Z$-Estimators

Vasilis Syrgkanis

arXiv:1704.03754·stat.ML·April 18, 2017


TL;DR

This paper provides an alternative proof for the asymptotic properties of orthogonal Z-estimators in two-stage estimation, simplifying the understanding of their consistency and normality under certain conditions.

Contribution

It offers a simplified, expository proof of the asymptotic normality of orthogonal Z-estimators in two-stage models, extending prior results to a variant based on empirical moment conditions.

Findings

01. Orthogonal Z-estimators are $\sqrt{n}$-consistent and asymptotically normal.

02. Sample splitting and $n^{1/4}$-consistency of the first stage are sufficient.

03. The proof simplifies understanding of the estimator's asymptotic behavior.

Abstract

We consider two-stage estimation with a non-parametric first stage and a generalized method of moments second stage, in a simpler setting than (Chernozhukov et al. 2016). We give an alternative proof of the theorem given in (Chernozhukov et al. 2016) that orthogonal second-stage moments, sample splitting and $n^{1/4}$-consistency of the first stage imply $\sqrt{n}$-consistency and asymptotic normality of second-stage estimates. Our proof is for a variant of their estimator, which is based on the empirical version of the moment condition (Z-estimator), rather than a minimization of a norm of the empirical vector of moments (M-estimator). This note is meant primarily for expository purposes, rather than as a new technical contribution.

Equations

$$\mathbb{E}\left[m(Z,\theta_{0},h_{0}(X))\right]=0$$

$$\hat{\theta}\ \text{solves:}\quad\frac{1}{n}\sum_{t=1}^{n}m(Z_{t},\hat{\theta},\hat{h}(X_{t}))=0$$

$$\sqrt{n}\,(\hat{\theta}-\theta_{0})\rightarrow N(0,\Sigma)$$

$$\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\right]=0$$

$$n^{1/2}\,\mathbb{E}_{X}\left[\|\hat{h}(X)-h_{0}(X)\|^{2}\right]\rightarrow_{p}0$$

$$\sqrt{n}\,(\hat{\theta}-\theta_{0})=-\underbrace{\left[\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\right]^{-1}}_{A}\ \underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n}m(Z_{t},\theta_{0},\hat{h}(X_{t}))}_{B}$$

$$\sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\theta,\hat{h}(X_{t}))-\mathbb{E}\left[\nabla_{\theta}m(Z,\theta,\hat{h}(X))\right]\right\|\rightarrow_{p}0$$

$$\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\rightarrow_{p}\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},\hat{h}(X))\right]$$

$$\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\rightarrow_{p}\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},h_{0}(X))\right]$$

$$\left[\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\right]^{-1}\rightarrow_{p}\left[\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},h_{0}(X))\right]\right]^{-1}=J^{-1}$$

$$B=\underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n}m(Z_{t},\theta_{0},h_{0}(X_{t}))}_{C}+\underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla_{\gamma}m(Z_{t},\theta_{0},h_{0}(X_{t}))\cdot(\hat{h}(X_{t})-h_{0}(X_{t}))}_{D}+\underbrace{\frac{1}{2\sqrt{n}}\sum_{t=1}^{n}(\hat{h}(X_{t})-h_{0}(X_{t}))^{T}\nabla_{\gamma\gamma}m(Z_{t},\theta_{0},\tilde{h}(X_{t}))\cdot(\hat{h}(X_{t})-h_{0}(X_{t}))}_{E}$$

$$|E|\leq\frac{\lambda^{*}}{2}\sqrt{n}\left(\frac{1}{n}\sum_{t=1}^{n}\|\hat{h}(X_{t})-h_{0}(X_{t})\|^{2}\right)$$

$$\mathbb{E}[D\mid\hat{h}]=\sqrt{n}\,\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\mid\hat{h}\right]=0$$

$$\mathbb{E}[D^{2}\mid\hat{h}]=\frac{1}{n}\sum_{t\neq t'}\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\mid\hat{h}\right]^{2}+\frac{1}{n}\sum_{t=t'}\mathbb{E}\left[\|\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\|^{2}\mid\hat{h}\right]$$

$$\mathbb{E}[D^{2}\mid\hat{h}]=\mathbb{E}\left[\|\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\|^{2}\mid\hat{h}\right]\leq\sigma^{2}\,\mathbb{E}\left[\|\hat{h}(X)-h_{0}(X)\|^{2}\mid\hat{h}\right]$$

$$\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\mid X\right]=0$$

$$\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\right]=\mathbb{E}\left[\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\mid X\right]\cdot(\hat{h}(X)-h_{0}(X))\right]=0$$


Taxonomy

Topics: Statistical Methods and Inference · Advanced Statistical Process Monitoring · Fault Detection and Control Systems

Full text

A Proof of Orthogonal Double Machine Learning with $Z$-Estimators

Vasilis Syrgkanis

Microsoft Research

Abstract

We consider two-stage estimation with a non-parametric first stage and a generalized method of moments second stage, in a simpler setting than [CCD+16]. We give an alternative proof of the theorem given in Chernozhukov et al. [CCD+16] that orthogonal second-stage moments, sample splitting and $n^{1/4}$-consistency of the first stage imply $\sqrt{n}$-consistency and asymptotic normality of second-stage estimates. Our proof is for a variant of their estimator, which is based on the empirical version of the moment condition (Z-estimator), rather than a minimization of a norm of the empirical vector of moments (M-estimator). This note is meant primarily for expository purposes, rather than as a new technical contribution.

1 Two-Stage Estimation

Suppose we have a model which predicts the following set of moment conditions:

$$\mathbb{E}\left[m(Z,\theta_{0},h_{0}(X))\right]=0$$

where $\theta_{0}\in R^{d}$ is a finite dimensional parameter of interest, $h_{0}:S\rightarrow R^{\ell}$ is a nuisance function we do not know, $Z$ are the observed data which are drawn from some distribution and $X\in S$ is a subvector of the observed data.

We want to understand the asymptotic properties of the following two-stage estimation process:

1. First stage. Estimate $h_{0}(\cdot)$ from an auxiliary data set (e.g. by running some non-parametric regression), yielding an estimate $\hat{h}$.

2. Second stage. Use the first stage estimate $\hat{h}$ to compute an estimate $\hat{\theta}$ of $\theta_{0}$ from an empirical version of the moment condition, i.e.

$$\hat{\theta}\ \text{solves:}\quad\frac{1}{n}\sum_{t=1}^{n}m(Z_{t},\hat{\theta},\hat{h}(X_{t}))=0$$
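The two stages can be sketched in a few lines of Python. Everything specific below is an assumption made for illustration and does not come from the note: the partially linear model $Y=\theta_{0}T+g_{0}(X)+\varepsilon$, $T=p_{0}(X)+\eta$, the nuisance $h_{0}=(q_{0},p_{0})$ with $q_{0}(X)=\mathbb{E}[Y\mid X]$ and $p_{0}(X)=\mathbb{E}[T\mid X]$, the moment $m(Z,\theta,\gamma)=(Y-\gamma_{1}-\theta(T-\gamma_{2}))(T-\gamma_{2})$, the binned regression used as first stage, and all function names.

```python
import math
import random


def fit_binned_mean(xs, ys, n_bins):
    """Piecewise-constant regression of y on x in [0, 1): a simple non-parametric first stage."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for empty bins
    means = [s / c if c > 0 else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def simulate(n, theta0, rng):
    """Draw n observations Z = (X, T, Y) from the hypothetical partially linear model."""
    data = []
    for _ in range(n):
        x = rng.random()
        t = math.sin(2 * math.pi * x) + rng.gauss(0.0, 1.0)  # T = p0(X) + eta
        y = theta0 * t + x * x + rng.gauss(0.0, 1.0)         # Y = theta0*T + g0(X) + eps
        data.append((x, t, y))
    return data


def two_stage_estimate(aux, main, n_bins):
    # First stage: estimate h_hat = (q_hat, p_hat) on the auxiliary sample only.
    xs = [x for x, _, _ in aux]
    q_hat = fit_binned_mean(xs, [y for _, _, y in aux], n_bins)
    p_hat = fit_binned_mean(xs, [t for _, t, _ in aux], n_bins)
    # Second stage: solve (1/n) sum_t m(Z_t, theta, h_hat(X_t)) = 0 on the main sample.
    # This moment is linear in theta, so the root has a closed form.
    num = sum((y - q_hat(x)) * (t - p_hat(x)) for x, t, y in main)
    den = sum((t - p_hat(x)) ** 2 for x, t, _ in main)
    return num / den


rng = random.Random(0)
theta0 = 1.5
aux = simulate(4000, theta0, rng)    # auxiliary data set (first stage)
main = simulate(4000, theta0, rng)   # data used in the empirical moment condition
theta_hat = two_stage_estimate(aux, main, n_bins=20)
print(theta_hat)
```

For a general moment the second stage would call a numerical root finder; the closed form is available here only because the assumed moment is linear in $\theta$.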

The question we want to ask is whether $\hat{\theta}$ is $\sqrt{n}$-consistent and asymptotically normal. More formally, is it true that:

$$\sqrt{n}\,(\hat{\theta}-\theta_{0})\rightarrow N(0,\Sigma)$$

for some constant covariance matrix $\Sigma$. We will assume that the moment conditions that we use satisfy the following orthogonality property:

Definition 1 (Orthogonality).

For any fixed estimate $\hat{h}$ that can be the outcome of the first stage estimation, the moment conditions are orthogonal if:

$$\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\right]=0$$

where $\nabla_{\gamma}m(\cdot,\cdot,\cdot)$ denotes the gradient of $m$ with respect to its third argument.
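The orthogonality property can be checked numerically on a toy model. The sketch below is only an illustration under stated assumptions: a hypothetical partially linear model $Y=\theta_{0}T+g_{0}(X)+\varepsilon$, $T=p_{0}(X)+\eta$ with $p_{0}(x)=\sin(2\pi x)$, a fixed (deliberately wrong) first-stage error, and two candidate moments. The residual-on-residual moment $m=(Y-\gamma_{1}-\theta(T-\gamma_{2}))(T-\gamma_{2})$ has gradient $(-\eta,\ \theta_{0}\eta-\varepsilon)$ at the truth, while the naive moment $m=(Y-\theta T-\gamma)\,T$ with nuisance $g_{0}$ has gradient $-T$. None of these names or choices come from the note.

```python
import math
import random


def orthogonal_gradient_term(theta0, n, rng):
    """Monte Carlo estimate of E[grad_gamma m . (h_hat(X) - h0(X))] for the orthogonal moment."""
    total = 0.0
    for _ in range(n):
        x = rng.random()
        eta = rng.gauss(0.0, 1.0)  # T - p0(X)
        eps = rng.gauss(0.0, 1.0)  # Y - theta0*T - g0(X)
        dq = 0.3 * math.sin(2 * math.pi * x)  # assumed error q_hat(X) - q0(X)
        dp = 0.2 * x                          # assumed error p_hat(X) - p0(X)
        grad_q = -eta                 # derivative of m w.r.t. gamma_1 at the truth
        grad_p = theta0 * eta - eps   # derivative of m w.r.t. gamma_2 at the truth
        total += grad_q * dq + grad_p * dp
    return total / n


def naive_gradient_term(n, rng):
    """Same quantity for the naive moment (Y - theta*T - g(X)) * T, whose gradient in g is -T."""
    total = 0.0
    for _ in range(n):
        x = rng.random()
        t = math.sin(2 * math.pi * x) + rng.gauss(0.0, 1.0)
        dg = 0.3 * math.sin(2 * math.pi * x)  # assumed error g_hat(X) - g0(X)
        total += -t * dg
    return total / n


rng = random.Random(1)
orth = orthogonal_gradient_term(theta0=1.5, n=20000, rng=rng)
naive = naive_gradient_term(n=20000, rng=rng)
print(orth, naive)
```

Under these assumptions the first average should be close to zero for any fixed first-stage error, whereas the second should be close to $-0.3\,\mathbb{E}[\sin^{2}(2\pi X)]=-0.15$, so first-stage errors enter the naive moment at first order.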

2 Orthogonality Implies Root-$n$ Consistency

Assumption 1.

We will make the following regularity assumptions:

• Rate of First Stage. The first stage estimate is $n^{1/4}$-consistent in the root-mean-squared-error sense, i.e. its mean squared error is $o_{p}(n^{-1/2})$:

$$n^{1/2}\,\mathbb{E}_{X}\left[\|\hat{h}(X)-h_{0}(X)\|^{2}\right]\rightarrow_{p}0$$

where the convergence in probability statement is with respect to the auxiliary data set.

• Regularity of First Stage. The first stage estimate and the nuisance function are uniformly bounded by a constant, i.e. $\|\hat{h}(x)\|,\|h_{0}(x)\|\leq C$ for all $x\in S$.

• Regularity of Moments. The following smoothness conditions hold for the moments:

1. For any $z,x,\gamma$ the function $m(z,\theta,\gamma)$ is continuous in $\theta$. Also $\|m(z,\theta,\gamma)\|\leq d(z)$ and $\mathbb{E}[d(Z)]<\infty$.

2. The same conditions hold for $\nabla_{\theta}m(z,\theta,\gamma)$.

3. $\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},h_{0}(X))\right]$ is non-singular.

4. The Hessian $\nabla_{\gamma\gamma}m(z,\theta,\gamma)$ has its largest eigenvalue (in absolute value) bounded by some constant $\lambda^{*}$, uniformly for all $\theta$ and $\gamma$.

5. The derivative $\nabla_{\gamma}m(z,\theta,\gamma)$ has norm uniformly bounded by $\sigma$.

Theorem 2.

Under Assumption 1 and assuming that $\hat{\theta}$ is consistent, if the moment conditions satisfy the orthogonality property then $\hat{\theta}$ is also $\sqrt{n}$ -consistent and asymptotically normal.

Proof.

By doing a first-order Taylor expansion of the empirical moment condition around $\theta_{0}$ and by the mean value theorem, we have:

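$$\sqrt{n}\,(\hat{\theta}-\theta_{0})=-\underbrace{\left[\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\right]^{-1}}_{A}\ \underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n}m(Z_{t},\theta_{0},\hat{h}(X_{t}))}_{B}$$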

where $\tilde{\theta}$ is a convex combination of $\theta_{0}$ and $\hat{\theta}$. We will show that $A$ converges in probability to a constant matrix $J^{-1}$ and that $B$ converges in distribution to a normal $N(0,V)$, for some constant covariance matrix $V$. The theorem then follows by invoking Slutsky's theorem, which shows convergence in distribution to $N(0,J^{-1}V(J^{-1})^{T})$.
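In the scalar case the limiting variance from Slutsky's theorem reduces to the sandwich form $V/J^{2}$, which suggests a plug-in estimate. The following is a minimal sketch under assumptions that are not in the note: a scalar $\theta$, the residual-on-residual moment $m=(r^{Y}-\theta r^{T})\,r^{T}$ with $r^{Y}=Y-\hat{q}(X)$ and $r^{T}=T-\hat{p}(X)$ already computed from some first stage, and the hypothetical function name `sandwich_interval`.

```python
import math


def sandwich_interval(res_y, res_t, z=1.96):
    """Plug-in sandwich variance and normal-approximation interval for a scalar Z-estimator.

    res_y[i] = y_i - q_hat(x_i) and res_t[i] = t_i - p_hat(x_i) are first-stage residuals.
    """
    n = len(res_y)
    # Root of the empirical moment (1/n) sum (res_y - theta*res_t) * res_t = 0.
    theta_hat = sum(a * b for a, b in zip(res_y, res_t)) / sum(b * b for b in res_t)
    moments = [(a - theta_hat * b) * b for a, b in zip(res_y, res_t)]
    j_hat = -sum(b * b for b in res_t) / n   # empirical derivative of the moment in theta
    v_hat = sum(m * m for m in moments) / n  # empirical second moment of m at theta_hat
    sigma_hat = v_hat / (j_hat * j_hat)      # scalar version of J^{-1} V J^{-T}
    half_width = z * math.sqrt(sigma_hat / n)
    return theta_hat, sigma_hat, (theta_hat - half_width, theta_hat + half_width)


theta_hat, sigma_hat, interval = sandwich_interval(
    res_y=[2.1, -1.9, 3.9, -4.1],
    res_t=[1.0, -1.0, 2.0, -2.0],
)
print(theta_hat, sigma_hat, interval)
```

The four data points are made up purely to exercise the function; with so few observations the interval has no statistical meaning.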

Convergence of $A$ to inverse derivative.

By the regularity of the moments, conditional on the auxiliary data set we have a uniform law of large numbers for the quantity $\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\theta,\hat{h}(X_{t}))$, i.e.:

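$$\sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\theta,\hat{h}(X_{t}))-\mathbb{E}\left[\nabla_{\theta}m(Z,\theta,\hat{h}(X))\right]\right\|\rightarrow_{p}0$$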

Since $\hat{\theta}$ is consistent, we also have that $\tilde{\theta}$ is consistent, i.e. $\tilde{\theta}\rightarrow_{p}\theta_{0}$. Combining the latter two properties, we get that conditional on the auxiliary data set:

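$$\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\rightarrow_{p}\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},\hat{h}(X))\right]$$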

Moreover, since $\hat{h}$ is consistent we get that:

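$$\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\rightarrow_{p}\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},h_{0}(X))\right]$$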

Since the matrix $\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},h_{0}(X))\right]$ is non-singular, by continuity of the inverse we get:

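$$\left[\frac{1}{n}\sum_{t=1}^{n}\nabla_{\theta}m(Z_{t},\tilde{\theta},\hat{h}(X_{t}))\right]^{-1}\rightarrow_{p}\left[\mathbb{E}\left[\nabla_{\theta}m(Z,\theta_{0},h_{0}(X))\right]\right]^{-1}=J^{-1}$$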

Asymptotic normality of $B$.

To argue asymptotic normality of $B$ we take a second-order Taylor expansion of $B$ around $h_{0}(X_{t})$ for each $X_{t}$ :

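$$B=\underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n}m(Z_{t},\theta_{0},h_{0}(X_{t}))}_{C}+\underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla_{\gamma}m(Z_{t},\theta_{0},h_{0}(X_{t}))\cdot(\hat{h}(X_{t})-h_{0}(X_{t}))}_{D}+\underbrace{\frac{1}{2\sqrt{n}}\sum_{t=1}^{n}(\hat{h}(X_{t})-h_{0}(X_{t}))^{T}\nabla_{\gamma\gamma}m(Z_{t},\theta_{0},\tilde{h}(X_{t}))\cdot(\hat{h}(X_{t})-h_{0}(X_{t}))}_{E}$$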

First we observe that $C$ is the sum of $n$ i.i.d. random variables, which have mean zero by the moment condition, divided by $\sqrt{n}$. Thus by the Central Limit Theorem we get that $C\rightarrow N(0,V)$, where $V=\mathbb{E}\left[m(Z,\theta_{0},h_{0}(X))\,m(Z,\theta_{0},h_{0}(X))^{T}\right]$ is a constant covariance matrix. We then conclude by showing that $D,E\rightarrow_{p}0$.

Second, we argue that $n^{1/4}$-consistency of the first stage implies that $E\rightarrow_{p}0$. Since $\nabla_{\gamma\gamma}m(z,\theta,\gamma)$ has a largest eigenvalue uniformly bounded by $\lambda^{*}$, the quantity $E$ is bounded by

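$$|E|\leq\frac{\lambda^{*}}{2}\sqrt{n}\left(\frac{1}{n}\sum_{t=1}^{n}\|\hat{h}(X_{t})-h_{0}(X_{t})\|^{2}\right)$$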

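Fixing the auxiliary data set, the quantity $\frac{1}{n}\sum_{t=1}^{n}\|\hat{h}(X_{t})-h_{0}(X_{t})\|^{2}$ is an average of i.i.d. terms that are bounded by $4C^{2}$ (by regularity of the first stage) and have mean $\mathbb{E}_{X}[\|\hat{h}(X)-h_{0}(X)\|^{2}]$, so $\sqrt{n}$ times its deviation from this mean has variance at most $4C^{2}\,\mathbb{E}_{X}[\|\hat{h}(X)-h_{0}(X)\|^{2}]$. By $n^{1/4}$-consistency of the first stage, both this variance and $\sqrt{n}\,\mathbb{E}_{X}[\|\hat{h}(X)-h_{0}(X)\|^{2}]$ converge to zero in probability, and we get that $E\rightarrow_{p}0$.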
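As a worked illustration of the rate requirement, suppose (purely as an assumption for this example) that the first stage has mean squared error $\mathbb{E}_{X}[\|\hat{h}(X)-h_{0}(X)\|^{2}]=c\,n^{-2a}$ for some constant $c>0$ and exponent $a>0$, i.e. a root-mean-squared error of order $n^{-a}$. Then the leading part of the bound on $E$ behaves as

$$\frac{\lambda^{*}}{2}\sqrt{n}\cdot c\,n^{-2a}=\frac{\lambda^{*}c}{2}\,n^{1/2-2a}\rightarrow0\quad\Longleftrightarrow\quad a>\frac{1}{4}.$$

For instance, $a=1/3$ gives a bound of order $n^{-1/6}$, which vanishes, while $a=1/4$ exactly gives the constant $\lambda^{*}c/2$, which does not. This is why the first stage error must be of smaller order than $n^{-1/4}$.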

Finally, we argue that orthogonality implies that $D\rightarrow_{p}0$. We show that both the mean and the trace of the covariance of $D$ converge to zero. The mean conditional on the auxiliary data set is:

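$$\mathbb{E}[D\mid\hat{h}]=\sqrt{n}\,\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\mid\hat{h}\right]=0$$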

Each diagonal entry of the covariance, conditional on the auxiliary data set, is:

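$$\mathbb{E}[D^{2}\mid\hat{h}]=\frac{1}{n}\sum_{t\neq t'}\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\mid\hat{h}\right]^{2}+\frac{1}{n}\sum_{t=t'}\mathbb{E}\left[\|\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\|^{2}\mid\hat{h}\right]$$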

All the cross terms are zero by orthogonality, giving:

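$$\mathbb{E}[D^{2}\mid\hat{h}]=\mathbb{E}\left[\|\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\|^{2}\mid\hat{h}\right]\leq\sigma^{2}\,\mathbb{E}\left[\|\hat{h}(X)-h_{0}(X)\|^{2}\mid\hat{h}\right]$$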

Since $\hat{h}$ is consistent, we get that the latter converges to zero. Since the mean of $D$ and the trace of its co-variance converge to zero, we get that $D\rightarrow_{p}0$ .

Consistency of the estimator also follows easily from standard arguments, if one makes Assumption 1 and the extra condition that the moment condition in the limit is satisfied only for the true parameters, which is needed for identification (see e.g. [NM94] for the formal set of extra regularity assumptions needed for consistency).
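The conclusion of the theorem can be probed with a small Monte Carlo experiment. The sketch below rests on assumptions that are not part of the note: a hypothetical partially linear model $Y=\theta_{0}T+g_{0}(X)+\varepsilon$, $T=p_{0}(X)+\eta$ with unit-variance independent noises, the orthogonal moment $m=(Y-\gamma_{1}-\theta(T-\gamma_{2}))(T-\gamma_{2})$, a binned-mean first stage fitted on a separate auxiliary sample, and all function names. For this model $J=-\mathbb{E}[\eta^{2}]=-1$ and $V=\mathbb{E}[\varepsilon^{2}\eta^{2}]=1$, so the scaled errors $\sqrt{n}(\hat{\theta}-\theta_{0})$ should have mean near 0 and standard deviation near 1.

```python
import math
import random
import statistics


def binned_mean(xs, ys, n_bins):
    """Piecewise-constant regression of y on x in [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def draw(n, theta0, rng):
    """Sample (X, T, Y) from the assumed partially linear model."""
    out = []
    for _ in range(n):
        x = rng.random()
        t = math.sin(2 * math.pi * x) + rng.gauss(0.0, 1.0)
        y = theta0 * t + x * x + rng.gauss(0.0, 1.0)
        out.append((x, t, y))
    return out


def estimate(aux, main, n_bins):
    """First stage on aux, then the root of the empirical orthogonal moment on main."""
    xs = [x for x, _, _ in aux]
    q_hat = binned_mean(xs, [y for _, _, y in aux], n_bins)
    p_hat = binned_mean(xs, [t for _, t, _ in aux], n_bins)
    num = sum((y - q_hat(x)) * (t - p_hat(x)) for x, t, y in main)
    den = sum((t - p_hat(x)) ** 2 for x, t, _ in main)
    return num / den


def scaled_errors(n, reps, theta0, seed=0):
    """Replicate sqrt(n) * (theta_hat - theta0) over independent data sets."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        aux, main = draw(n, theta0, rng), draw(n, theta0, rng)
        errors.append(math.sqrt(n) * (estimate(aux, main, n_bins=10) - theta0))
    return errors


errors = scaled_errors(n=1000, reps=100, theta0=1.5)
print(statistics.mean(errors), statistics.stdev(errors))
```

With only 100 replications the printed moments are themselves noisy, so this is a sanity check rather than a verification of the theorem.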

3 Orthogonal Moments for Conditional Moment Problems

One special case in which the orthogonality condition is satisfied is the following stronger, but easier to check, property of conditional orthogonality:

Definition 2 (Conditional Orthogonality).

The moment conditions are conditionally orthogonal if:

$$\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\mid X\right]=0$$

Lemma 3.

Conditional orthogonality implies orthogonality, when an auxiliary data set is used to estimate $\hat{h}$ .

Proof.

By the law of iterated expectations we have:

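$$\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\right]=\mathbb{E}\left[\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\cdot(\hat{h}(X)-h_{0}(X))\mid X\right]\right]=\mathbb{E}\left[\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\mid X\right]\cdot(\hat{h}(X)-h_{0}(X))\right]=0$$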

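In the last step we used the conditional orthogonality property. The factor $\hat{h}(X)-h_{0}(X)$ can be pulled out of the inner expectation because $\hat{h}$ is estimated on the auxiliary data set and is therefore a fixed function of $X$.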
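As a worked example, consider a partially linear model, which is an assumption made here for illustration and is not discussed in the note: $Y=\theta_{0}T+g_{0}(X)+\varepsilon$ and $T=p_{0}(X)+\eta$, with $\mathbb{E}[\varepsilon\mid X,T]=0$ and $\mathbb{E}[\eta\mid X]=0$. Take the nuisance $h_{0}(X)=(q_{0}(X),p_{0}(X))$ with $q_{0}(X)=\mathbb{E}[Y\mid X]$ and $p_{0}(X)=\mathbb{E}[T\mid X]$, and the moment $m(Z,\theta,\gamma)=(Y-\gamma_{1}-\theta(T-\gamma_{2}))(T-\gamma_{2})$. Since $Y-q_{0}(X)=\theta_{0}\eta+\varepsilon$ and $T-p_{0}(X)=\eta$, differentiating with respect to $\gamma$ and evaluating at the truth gives

$$\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))=\left(-(T-p_{0}(X)),\ \theta_{0}(T-p_{0}(X))-\left(Y-q_{0}(X)-\theta_{0}(T-p_{0}(X))\right)\right)=\left(-\eta,\ \theta_{0}\eta-\varepsilon\right)$$

$$\mathbb{E}\left[\nabla_{\gamma}m(Z,\theta_{0},h_{0}(X))\mid X\right]=\left(-\mathbb{E}[\eta\mid X],\ \theta_{0}\mathbb{E}[\eta\mid X]-\mathbb{E}[\varepsilon\mid X]\right)=(0,0)$$

so this moment is conditionally orthogonal, and by Lemma 3 it is orthogonal whenever $\hat{h}$ comes from an auxiliary data set.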

For conditional moment problems studied in [Cha92], [CCD+16] shows how one can transform, in an algorithmic manner, an initial set of moments to a vector of orthogonal moments.

Bibliography

[CCD+16] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey. Double Machine Learning for Treatment and Causal Parameters. ArXiv e-prints, July 2016.
[Cha92] Gary Chamberlain. Efficiency bounds for semiparametric regression. Econometrica, 60(3):567–596, 1992.
[NM94] Whitney K. Newey and Daniel McFadden. Chapter 36: Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994.
