A Proof of Orthogonal Double Machine Learning with $Z$-Estimators
Vasilis Syrgkanis

TL;DR
This paper provides an alternative proof for the asymptotic properties of orthogonal Z-estimators in two-stage estimation, simplifying the understanding of their consistency and normality under certain conditions.
Contribution
It offers a simplified, expository proof of the asymptotic normality of orthogonal Z-estimators in two-stage models, extending prior results to a variant based on empirical moment conditions.
Findings
Orthogonal Z-estimators are $\sqrt{n}$-consistent and asymptotically normal.
Sample splitting and $n^{1/4}$-consistency of the first stage are sufficient.
The proof simplifies understanding of the estimator's asymptotic behavior.
Abstract
We consider two stage estimation with a non-parametric first stage and a generalized method of moments second stage, in a simpler setting than (Chernozhukov et al. 2016). We give an alternative proof of the theorem given in (Chernozhukov et al. 2016) that orthogonal second stage moments, sample splitting and $n^{1/4}$-consistency of the first stage imply $\sqrt{n}$-consistency and asymptotic normality of second stage estimates. Our proof is for a variant of their estimator, which is based on the empirical version of the moment condition (Z-estimator), rather than a minimization of a norm of the empirical vector of moments (M-estimator). This note is meant primarily for expository purposes, rather than as a new technical contribution.
Microsoft Research
1 Two-Stage Estimation
Suppose we have a model which predicts the following set of moment conditions:
$$\mathbb{E}\left[m(Z, \theta_0, h_0(X))\right] = 0$$
where $\theta_0$ is a finite dimensional parameter of interest, $h_0$ is a nuisance function we do not know, $Z$ are the observed data, which are drawn from some distribution, and $X$ is a subvector of the observed data.
We want to understand the asymptotic properties of the following two-stage estimation process:
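1. First stage. Estimate $h_0$ from an auxiliary data set (e.g. by running some non-parametric regression), yielding an estimate $\hat{h}$.
2. Second stage. Use the first stage estimate $\hat{h}$ and compute an estimate $\hat{\theta}$ of $\theta_0$ from an empirical version of the moment condition, i.e. $\hat{\theta}$ solves
$$\frac{1}{n}\sum_{t=1}^{n} m(Z_t, \hat{\theta}, \hat{h}(X_t)) = 0$$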
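The two stages can be sketched on a concrete example. The following is a minimal illustration, not the note's own setup: it assumes a hypothetical partially linear model $Y = \theta_0 T + g_0(X) + \varepsilon$, $T = p_0(X) + \eta$, takes the nuisance to be $h_0 = (q_0, p_0)$ with $q_0(x) = \mathbb{E}[Y \mid X = x]$, uses the residual-on-residual moment $m = (Y - q(X) - \theta (T - p(X)))(T - p(X))$, and uses a simple binned-mean regression as the first stage. All function names and the data-generating process are assumptions made for illustration.

```python
import random

def fit_binned_mean(xs, ys, n_bins):
    """Regressogram: average of ys within equal-width bins of xs in [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for empty bins
    means = [s / c if c > 0 else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def simulate(n, theta0, rng):
    """Hypothetical model: T = p0(X) + eta, Y = theta0*T + g0(X) + eps."""
    xs, ts, ys = [], [], []
    for _ in range(n):
        x = rng.random()
        t = (1.0 + x) + rng.gauss(0.0, 1.0)           # p0(x) = 1 + x
        y = theta0 * t + x * x + rng.gauss(0.0, 1.0)  # g0(x) = x^2
        xs.append(x)
        ts.append(t)
        ys.append(y)
    return xs, ts, ys

def two_stage_estimate(aux, main, n_bins=20):
    # First stage: estimate (E[Y|X], E[T|X]) on the auxiliary sample only.
    ax, at, ay = aux
    q_hat = fit_binned_mean(ax, ay, n_bins)
    p_hat = fit_binned_mean(ax, at, n_bins)
    # Second stage: solve (1/n) sum_t m(Z_t, theta, h_hat(X_t)) = 0 on the main
    # sample. This moment is linear in theta, so the root has a closed form.
    mx, mt, my = main
    num = sum((y - q_hat(x)) * (t - p_hat(x)) for x, t, y in zip(mx, mt, my))
    den = sum((t - p_hat(x)) ** 2 for x, t in zip(mx, mt))
    return num / den

rng = random.Random(0)
theta0 = 2.0
aux = simulate(2000, theta0, rng)    # auxiliary data set for the first stage
main = simulate(2000, theta0, rng)   # independent sample for the second stage
theta_hat = two_stage_estimate(aux, main)
print(theta_hat)
```

For moments that are not linear in the parameter, the closed-form root would be replaced by a numerical root finder; that choice is outside what this sketch covers.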
The question we want to ask is: is $\hat{\theta}$ $\sqrt{n}$-consistent? More formally, is it true that
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \Sigma)$$
for some constant co-variance matrix $\Sigma$? We will assume that the moment conditions that we use satisfy the following orthogonality property:
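**Definition 1 (Orthogonality).**
For any fixed estimate $\hat{h}$ that can be the outcome of the first stage estimation, the moment conditions are orthogonal if:
$$\mathbb{E}\left[\nabla_\gamma m(Z, \theta_0, h_0(X)) \cdot (\hat{h}(X) - h_0(X))\right] = 0$$
where $\nabla_\gamma m(\cdot, \cdot, \cdot)$ denotes the gradient of $m$ with respect to its third argument.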
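A small Monte Carlo check can make the definition concrete. The sketch below is an illustration under assumptions that are not part of the note: a hypothetical partially linear model $Y = \theta_0 T + g_0(X) + \varepsilon$, $T = p_0(X) + \eta$, and two candidate moments. For the residual-on-residual moment $m = (Y - q - \theta(T - p))(T - p)$, the gradient with respect to $(q, p)$ at the truth is $(-\eta,\ \theta_0 \eta - \varepsilon)$. For the naive moment $m = (Y - \theta T - g)\,T$, the gradient with respect to $g$ is $-T$. The perturbation functions `delta_q` and `delta_p` are hypothetical stand-ins for a first-stage error $\hat{h} - h_0$.

```python
import random

rng = random.Random(1)
theta0 = 1.0
N = 200_000

def delta_q(x):
    """Hypothetical first-stage error in the outcome nuisance."""
    return 0.5 * x

def delta_p(x):
    """Hypothetical first-stage error in the treatment nuisance."""
    return 0.3 * (1.0 - x)

orth_sum = 0.0
naive_sum = 0.0
for _ in range(N):
    x = rng.random()
    eta = rng.gauss(0.0, 1.0)
    eps = rng.gauss(0.0, 1.0)
    t = (1.0 + x) + eta  # p0(x) = 1 + x
    # Residualized moment: gradient (-eta, theta0*eta - eps) dotted with the error.
    orth_sum += -eta * delta_q(x) + (theta0 * eta - eps) * delta_p(x)
    # Naive moment: gradient -T times the error in g (delta_q reused as that error).
    naive_sum += -t * delta_q(x)

orth_term = orth_sum / N    # Monte Carlo estimate of E[grad m . (h_hat - h0)]
naive_term = naive_sum / N  # same quantity for the non-orthogonal moment
print(orth_term, naive_term)
```

Under these assumptions the first average should be close to zero, while the second stays bounded away from zero because $\mathbb{E}[T \mid X]$ is not zero.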
2 Orthogonality Implies Root-$n$ Consistency
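**Assumption 1.**
We will make the following regularity assumptions:
- **Rate of First Stage.** The first stage estimation is $n^{1/4}$-consistent in the squared mean-square-error sense, i.e.
$$\sqrt{n}\; \mathbb{E}\left[\|\hat{h}(X) - h_0(X)\|^2\right] \xrightarrow{p} 0$$
where the convergence in probability statement is with respect to the auxiliary data set.
- **Regularity of First Stage.** The first stage estimate and the nuisance function are uniformly bounded by a constant, i.e. $\|\hat{h}(x)\|, \|h_0(x)\| \le C$ for all $x$.
- **Regularity of Moments.** The following smoothness conditions hold for the moments:
  1. For any $\gamma$, the function $m(z, \theta, \gamma)$ is continuous in $\theta$. Also $\|m(z, \theta, \gamma)\| \le d(z)$ and $\mathbb{E}[d(Z)] < \infty$.
  2. Similarly, the same conditions hold for $\nabla_\theta m(z, \theta, \gamma)$.
  3. $\mathbb{E}\left[\nabla_\theta m(Z, \theta_0, h_0(X))\right]$ is non-singular.
  4. The Hessian $\nabla_{\gamma\gamma} m_i(z, \theta, \gamma)$ of each coordinate $m_i$ of the moment vector has largest eigenvalue bounded by some constant $\sigma$, uniformly for all $\theta$ and $\gamma$.
  5. The derivative $\nabla_\gamma m_i(z, \theta, \gamma)$ has norm uniformly bounded by $\sigma$.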
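As a quick arithmetic illustration of the first-stage rate condition, suppose (hypothetically) that the first-stage mean squared error decays like $n^{-2a}$, i.e. the root-mean-square error decays like $n^{-a}$. Then $\sqrt{n}$ times the mean squared error behaves like $n^{1/2 - 2a}$, which vanishes only when $a > 1/4$. The function name and the power-law form are assumptions made for illustration.

```python
def scaled_mse(n, a):
    """sqrt(n) * MSE when the first-stage MSE is assumed to equal n**(-2*a)."""
    return n ** 0.5 * n ** (-2.0 * a)

# a = 0.2 is too slow, a = 0.25 is the boundary case, a = 1/3 is fast enough.
for a in (0.2, 0.25, 1.0 / 3.0):
    print(round(a, 3), [round(scaled_mse(n, a), 4) for n in (10**2, 10**4, 10**6)])
```

In this stylized setting the scaled error grows with $n$ for $a = 0.2$, stays constant at the boundary $a = 1/4$, and shrinks for $a = 1/3$, so a rate of exactly $n^{-1/4}$ would not by itself satisfy the condition.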
**Theorem 2.**
Under Assumption 1 and assuming that $\hat{\theta}$ is consistent, if the moment conditions satisfy the orthogonality property then $\hat{\theta}$ is also $\sqrt{n}$-consistent and asymptotically normal.
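Proof.
By doing a first-order Taylor expansion of the empirical moment condition around $\theta_0$ and by the mean value theorem, we have:
$$\sqrt{n}\,(\hat{\theta} - \theta_0) = -\underbrace{\left[\frac{1}{n}\sum_{t=1}^{n} \nabla_\theta m(Z_t, \tilde{\theta}, \hat{h}(X_t))\right]^{-1}}_{A}\; \underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n} m(Z_t, \theta_0, \hat{h}(X_t))}_{B}$$
where $\tilde{\theta}$ is a convex combination of $\theta_0$ and $\hat{\theta}$. We will show that $A$ converges in probability to the constant matrix $J^{-1}$, where $J = \mathbb{E}\left[\nabla_\theta m(Z, \theta_0, h_0(X))\right]$, and that $B$ converges in distribution to a normal $N(0, V)$, for some constant co-variance matrix $V$. Then the theorem follows by invoking Slutsky's theorem, which shows convergence in distribution to $N\left(0, J^{-1} V (J^{-1})^\top\right)$.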
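The limit produced by this Slutsky argument has a "sandwich" form: inverse Jacobian, moment co-variance, inverse Jacobian transposed. A minimal sketch of a plug-in estimate of that co-variance for a scalar parameter follows. It is an illustration under assumptions that are not in the note: the residual-on-residual moment $m = (r_Y - \theta\, r_T)\, r_T$ with residuals $r_Y = Y - \mathbb{E}[Y \mid X]$ and $r_T = T - \mathbb{E}[T \mid X]$, simulated directly with the true nuisances standing in for a first-stage estimate, and standard normal noise.

```python
import math
import random

rng = random.Random(3)
theta0, n = 1.0, 5000

# Simulated residuals after partialling out X (true nuisances, an assumption).
res_t = [rng.gauss(0.0, 1.0) for _ in range(n)]
res_y = [theta0 * rt + rng.gauss(0.0, 1.0) for rt in res_t]

# Z-estimator: root of (1/n) sum (res_y - theta * res_t) * res_t = 0.
theta_hat = sum(ry * rt for ry, rt in zip(res_y, res_t)) / sum(rt * rt for rt in res_t)

# Plug-in Jacobian and moment variance at theta_hat.
j_hat = -sum(rt * rt for rt in res_t) / n
v_hat = sum(((ry - theta_hat * rt) * rt) ** 2 for ry, rt in zip(res_y, res_t)) / n

# Scalar sandwich: J^{-1} V J^{-1}.
sigma_hat = v_hat / (j_hat * j_hat)
std_err = math.sqrt(sigma_hat / n)
conf_int = (theta_hat - 1.96 * std_err, theta_hat + 1.96 * std_err)
print(theta_hat, sigma_hat, conf_int)
```

In this simulated design the population values are $J = -1$ and $V = 1$, so the estimated co-variance should be near one; with an estimated first stage the same plug-in formula would be evaluated at $\hat{h}$.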
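Convergence of $A$ to inverse derivative.
By the regularity of the moments, we have a uniform law of large numbers for the quantity $\nabla_\theta m(Z, \theta, \hat{h}(X))$, i.e.:
$$\sup_{\theta} \left\| \frac{1}{n}\sum_{t=1}^{n} \nabla_\theta m(Z_t, \theta, \hat{h}(X_t)) - \mathbb{E}\left[\nabla_\theta m(Z, \theta, \hat{h}(X))\right] \right\| \xrightarrow{p} 0$$
Since $\hat{\theta}$ is consistent, we also have that $\tilde{\theta}$ is consistent, i.e. $\tilde{\theta} \xrightarrow{p} \theta_0$. Combining the latter two properties, we get that conditional on the auxiliary data set:
$$\frac{1}{n}\sum_{t=1}^{n} \nabla_\theta m(Z_t, \tilde{\theta}, \hat{h}(X_t)) - \mathbb{E}\left[\nabla_\theta m(Z, \theta_0, \hat{h}(X))\right] \xrightarrow{p} 0$$
Moreover, since $\hat{h}$ is consistent we get that:
$$\mathbb{E}\left[\nabla_\theta m(Z, \theta_0, \hat{h}(X))\right] \xrightarrow{p} \mathbb{E}\left[\nabla_\theta m(Z, \theta_0, h_0(X))\right]$$
Since the matrix $\mathbb{E}\left[\nabla_\theta m(Z, \theta_0, h_0(X))\right]$ is non-singular, by continuity of the inverse we get:
$$A = \left[\frac{1}{n}\sum_{t=1}^{n} \nabla_\theta m(Z_t, \tilde{\theta}, \hat{h}(X_t))\right]^{-1} \xrightarrow{p} \left(\mathbb{E}\left[\nabla_\theta m(Z, \theta_0, h_0(X))\right]\right)^{-1}$$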
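Asymptotic normality of $B$.
To argue asymptotic normality of $B = \frac{1}{\sqrt{n}}\sum_{t=1}^{n} m(Z_t, \theta_0, \hat{h}(X_t))$ we take a second-order Taylor expansion of $m(Z_t, \theta_0, \hat{h}(X_t))$ around $h_0(X_t)$ for each $t$. For each coordinate $i$ of the moment vector:
$$B_i = \underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n} m_i(Z_t, \theta_0, h_0(X_t))}_{B_{0,i}} + \underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n} \nabla_\gamma m_i(Z_t, \theta_0, h_0(X_t)) \cdot (\hat{h}(X_t) - h_0(X_t))}_{B_{1,i}} + \underbrace{\frac{1}{2\sqrt{n}}\sum_{t=1}^{n} (\hat{h}(X_t) - h_0(X_t))^\top \nabla_{\gamma\gamma} m_i(Z_t, \theta_0, \tilde{h}_i(X_t))\, (\hat{h}(X_t) - h_0(X_t))}_{B_{2,i}}$$
where $\tilde{h}_i(X_t)$ is a convex combination of $\hat{h}(X_t)$ and $h_0(X_t)$. We write $B_0$, $B_1$, $B_2$ for the corresponding vectors.
First we observe that $B_0$ is the sum of $n$ i.i.d. mean-zero random variables, divided by $\sqrt{n}$. Thus by the Central Limit Theorem, we get that $B_0 \xrightarrow{d} N(0, V)$, for some constant co-variance matrix $V$. Then we conclude by showing that $B_1, B_2 \xrightarrow{p} 0$.
Second we argue that $n^{1/4}$-consistency of the first stage implies that $B_2 \xrightarrow{p} 0$. Since each Hessian $\nabla_{\gamma\gamma} m_i$ has a largest eigenvalue uniformly bounded by $\sigma$, we have that the quantity $|B_{2,i}|$ is bounded by
$$|B_{2,i}| \le \frac{\sigma}{2}\, \sqrt{n}\; \frac{1}{n}\sum_{t=1}^{n} \|\hat{h}(X_t) - h_0(X_t)\|^2$$
Fixing the auxiliary data set, the quantity $\frac{1}{n}\sum_{t=1}^{n} \|\hat{h}(X_t) - h_0(X_t)\|^2$ converges to $\mathbb{E}\left[\|\hat{h}(X) - h_0(X)\|^2\right]$. Subsequently by $n^{1/4}$-consistency of the first stage, and regularity of the first stage, we get that $B_2 \xrightarrow{p} 0$.
Finally, we argue that orthogonality implies that $B_1 \xrightarrow{p} 0$. We show that both the mean and the trace of the co-variance of $B_1$ converge to $0$. The mean conditional on the auxiliary data set is:
$$\mathbb{E}\left[B_1 \mid \hat{h}\right] = \sqrt{n}\; \mathbb{E}\left[\nabla_\gamma m(Z, \theta_0, h_0(X)) \cdot (\hat{h}(X) - h_0(X)) \mid \hat{h}\right] = 0$$
The diagonal entries of the co-variance conditional on the auxiliary data set are:
$$\mathbb{E}\left[B_{1,i}^2 \mid \hat{h}\right] = \frac{1}{n}\sum_{t, t'} \mathbb{E}\left[\left(\nabla_\gamma m_i(Z_t, \theta_0, h_0(X_t)) \cdot (\hat{h}(X_t) - h_0(X_t))\right)\left(\nabla_\gamma m_i(Z_{t'}, \theta_0, h_0(X_{t'})) \cdot (\hat{h}(X_{t'}) - h_0(X_{t'}))\right) \mid \hat{h}\right]$$
All the cross terms (those with $t \neq t'$) are zero by orthogonality and independence of the samples, giving:
$$\mathbb{E}\left[B_{1,i}^2 \mid \hat{h}\right] = \mathbb{E}\left[\left(\nabla_\gamma m_i(Z, \theta_0, h_0(X)) \cdot (\hat{h}(X) - h_0(X))\right)^2 \mid \hat{h}\right] \le \sigma^2\, \mathbb{E}\left[\|\hat{h}(X) - h_0(X)\|^2 \mid \hat{h}\right]$$
Since $\hat{h}$ is consistent, we get that the latter converges to zero. Since the mean of $B_1$ and the trace of its co-variance converge to zero, we get that $B_1 \xrightarrow{p} 0$.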
Consistency of the estimator also follows easily from standard arguments, if one makes Assumption 1 and the extra condition that the moment condition in the limit is satisfied only for the true parameters, which is needed for identification (see e.g. [NM94] for the formal set of extra regularity assumptions needed for consistency).
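A small simulation can illustrate the role of orthogonality in this argument. It is a sketch under assumptions that are not part of the note: a hypothetical partially linear model $Y = \theta_0 T + g_0(X) + \varepsilon$, $T = p_0(X) + \eta$, and a first-stage error imitated by a deterministic perturbation of order $n^{-1/3}$ (faster than $n^{-1/4}$) rather than by an actual fit on auxiliary data. It compares the orthogonal residual-on-residual moment with the naive moment $(Y - \theta T - g(X))\,T$, which is not orthogonal.

```python
import math
import random

def one_replication(n, theta0, rng):
    """Return sqrt(n)*(theta_hat - theta0) for the orthogonal and naive moments."""
    scale = n ** (-1.0 / 3.0)  # assumed size of the first-stage error
    num_o = den_o = num_n = den_n = 0.0
    for _ in range(n):
        x = rng.random()
        eta = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        p0 = 1.0 + x
        g0 = x * x
        t = p0 + eta
        y = theta0 * t + g0 + eps
        # Perturbed nuisances standing in for a first-stage estimate.
        p_hat = p0 + scale * 0.3 * (1.0 - x)
        g_hat = g0 + scale * 0.5 * x
        q_hat = theta0 * p0 + g0 + scale * 0.5 * x  # perturbed E[Y|X]
        # Orthogonal moment: (Y - q - theta*(T - p)) * (T - p).
        num_o += (y - q_hat) * (t - p_hat)
        den_o += (t - p_hat) ** 2
        # Naive moment: (Y - theta*T - g) * T.
        num_n += (y - g_hat) * t
        den_n += t * t
    root_n = math.sqrt(n)
    return root_n * (num_o / den_o - theta0), root_n * (num_n / den_n - theta0)

rng = random.Random(2)
reps = [one_replication(2000, 1.0, rng) for _ in range(200)]
mean_orth = sum(r[0] for r in reps) / len(reps)
mean_naive = sum(r[1] for r in reps) / len(reps)
print(mean_orth, mean_naive)
```

Under these assumptions the scaled error of the orthogonal estimator should be centered near zero. The naive estimator should carry a first-order bias of order $\sqrt{n} \cdot n^{-1/3} = n^{1/6}$, which does not vanish as $n$ grows.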
3 Orthogonal Moments for Conditional Moment Problems
One special case of when the orthogonality condition is satisfied is the following stronger, but easier to check property of conditional orthogonality:
**Definition 2 (Conditional Orthogonality).**
The moment conditions are conditionally orthogonal if:
$$\mathbb{E}\left[\nabla_\gamma m(Z, \theta_0, h_0(X)) \mid X\right] = 0$$
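**Lemma 3.**
Conditional orthogonality implies orthogonality, when an auxiliary data set is used to estimate $h_0$.
Proof.
Since $\hat{h}$ is estimated on an auxiliary data set, it is fixed with respect to the sample $Z$, so $\hat{h}(X) - h_0(X)$ is a function of $X$ alone. By the law of iterated expectations we have:
$$\mathbb{E}\left[\nabla_\gamma m(Z, \theta_0, h_0(X)) \cdot (\hat{h}(X) - h_0(X))\right] = \mathbb{E}\left[\mathbb{E}\left[\nabla_\gamma m(Z, \theta_0, h_0(X)) \mid X\right] \cdot (\hat{h}(X) - h_0(X))\right] = 0$$
where in the last step we used the conditional orthogonality property.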
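As an illustrative example that is not part of the original note, consider a partially linear model $Y = \theta_0 T + g_0(X) + \varepsilon$, $T = p_0(X) + \eta$, with $\mathbb{E}[\varepsilon \mid X, T] = 0$ and $\mathbb{E}[\eta \mid X] = 0$. Take the nuisance to be $h_0 = (q_0, p_0)$ with $q_0(X) = \mathbb{E}[Y \mid X]$, so that $Y - q_0(X) = \theta_0 \eta + \varepsilon$. Under these assumptions, a sketch of why the residual-on-residual moment is conditionally orthogonal:

```latex
\begin{align*}
m(Z, \theta, (q, p)) &= \bigl(Y - q - \theta\,(T - p)\bigr)\,(T - p), \\
\partial_q m \,\big|_{\theta_0,\, h_0(X)} &= -(T - p_0(X)) = -\eta, \\
\partial_p m \,\big|_{\theta_0,\, h_0(X)} &= \theta_0\,(T - p_0(X)) - \bigl(Y - q_0(X) - \theta_0\,(T - p_0(X))\bigr) = \theta_0\,\eta - \varepsilon, \\
\mathbb{E}\bigl[\nabla_\gamma m(Z, \theta_0, h_0(X)) \mid X\bigr]
  &= \bigl(-\mathbb{E}[\eta \mid X],\; \theta_0\,\mathbb{E}[\eta \mid X] - \mathbb{E}[\varepsilon \mid X]\bigr) = (0, 0).
\end{align*}
```

Both components vanish because the residuals $\eta$ and $\varepsilon$ have conditional mean zero given $X$ in this assumed model.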
For conditional moment problems studied in [Cha92], [CCD+16] shows how one can transform, in an algorithmic manner, an initial set of moments into a vector of orthogonal moments.
References
- [CCD+16] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey. Double Machine Learning for Treatment and Causal Parameters. ArXiv e-prints, July 2016.
- [Cha92] Gary Chamberlain. Efficiency bounds for semiparametric regression. Econometrica, 60(3):567–596, 1992.
- [NM94] Whitney K. Newey and Daniel McFadden. Chapter 36: Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994.
