Semiparametric Difference-in-Differences with Potentially Many Control Variables

Neng-Chieh Chang

arXiv:1812.10846·econ.GN·January 9, 2019

TL;DR

This paper introduces three new semiparametric difference-in-differences estimators that effectively handle many control variables, including high-dimensional cases, achieving bias reduction, $\sqrt{N}$-consistency, and valid inference.

Contribution

It proposes novel DID estimators that incorporate machine learning methods for high-dimensional controls, overcoming bias and inconsistency issues of traditional approaches.

Findings

01

New estimators achieve $\sqrt{N}$-consistency and asymptotic normality.

02

Estimators have the small bias property, with bias diminishing faster than nonparametric estimators.

03

Method enables valid inference with many control variables, including high-dimensional settings.

Tables

Table 1.

| | Sequeira (2016) | Abadie (kernel) | $\tilde{\theta}$ (kernel) | Abadie (Lasso) | $\tilde{\theta}$ (Lasso) |
|---|---|---|---|---|---|
| ATT | -2.928** (0.944) | -7.986** (3.028) | -8.670** (3.643) | -7.499** (2.746) | -9.191* (4.854) |

Equations

$$Y_{i}(t)=\mu+X_{i}^{\prime}\pi(t)+\tau\cdot D_{i}+\delta\cdot t+\alpha\cdot D_{i}(t)+\varepsilon_{i}(t),$$

$$\theta_{0}\coloneqq E\left[Y_{i}^{1}(1)-Y_{i}^{0}(1)\mid D_{i}=1\right].$$

$$\theta_{0}=E\left[\frac{Y_{i}(1)-Y_{i}(0)}{P(D_{i}=1)}\,\frac{D_{i}-P(D_{i}=1\mid X_{i})}{1-P(D_{i}=1\mid X_{i})}\right].$$

$$\theta_{0}=E\left[\frac{T_{i}-\lambda_{0}}{\lambda_{0}(1-\lambda_{0})}\,\frac{Y_{i}}{P(D_{i}=1)}\,\frac{D_{i}-P(D_{i}=1\mid X_{i})}{1-P(D_{i}=1\mid X_{i})}\right],$$

$$\theta_{0}^{w}\coloneqq E\left[Y^{w}(1)-Y^{0}(1)\mid W=w\right].$$

$$E\left[Y_{i}^{0}(1)-Y_{i}^{0}(0)\mid X_{i},W_{i}=w\right]=E\left[Y_{i}^{0}(1)-Y_{i}^{0}(0)\mid X_{i},W_{i}=0\right]$$

$$\theta_{0}^{w}=E\left[\frac{Y(1)-Y(0)}{P(W=w)}\,\frac{I(W=w)\cdot P(W=0\mid X)-I(W=0)\cdot P(W=w\mid X)}{P(W=0\mid X)}\right],$$

$$\hat{\theta}=\frac{1}{N}\sum_{i=1}^{N}\frac{Y_{i}(1)-Y_{i}(0)}{\hat{p}}\,\frac{D_{i}-\hat{g}(X_{i})}{1-\hat{g}(X_{i})}.$$

$$\partial_{g}E\left[\varphi(W,\theta_{0},p_{0},g_{0})\right]\left[g-g_{0}\right]\neq 0,$$

$$\psi_{1}(W,\theta_{0},p_{0},\eta_{10})=\frac{Y(1)-Y(0)}{P(D=1)}\,\frac{D-P(D=1\mid X)}{1-P(D=1\mid X)}-\theta_{0}-\underbrace{\frac{D-P(D=1\mid X)}{P(D=1)\left(1-P(D=1\mid X)\right)}\,E\left[Y(1)-Y(0)\mid X,D=0\right]}_{c_{1}},$$

$$\eta_{10}=\left(P(D=1\mid X),\,E\left[Y(1)-Y(0)\mid X,D=0\right]\right)\eqqcolon\left(g_{0},\ell_{10}\right).$$

$$\psi_{2}(W,\theta_{0},p_{0},\lambda_{0},\eta_{20})=\frac{T-\lambda_{0}}{\lambda_{0}(1-\lambda_{0})}\,\frac{Y}{P(D=1)}\,\frac{D-P(D=1\mid X)}{1-P(D=1\mid X)}-\theta_{0}-c_{2},$$

$$c_{2}=\frac{D-P(D=1\mid X)}{\lambda_{0}(1-\lambda_{0})\cdot P(D=1)\cdot\left(1-P(D=1\mid X)\right)}\times E\left[(T-\lambda_{0})Y\mid X,D=0\right].$$

$$\eta_{20}=\left(P(D=1\mid X),\,E\left[(T-\lambda_{0})Y\mid X,D=0\right]\right)\eqqcolon\left(g_{0},\ell_{20}\right).$$

$$\psi_{w}(W,\theta_{w0},p_{w0},\eta_{w0})=\frac{Y(1)-Y(0)}{P(W=w)}\,\frac{I(W=w)\cdot P(W=0\mid X)-I(W=0)\cdot P(W=w\mid X)}{P(W=0\mid X)}-\theta_{w0}-c_{w},$$

$$c_{w}=\frac{I(W=w)\cdot P(W=0\mid X)-I(W=0)\cdot P(W=w\mid X)}{P(W=w)\cdot P(W=0\mid X)}\times E\left[Y(1)-Y(0)\mid X,I(W=0)=1\right].$$

$$\eta_{w0}=\left(P(W=w\mid X),\,P(W=0\mid X),\,E\left[Y(1)-Y(0)\mid X,I(W=0)=1\right]\right)\eqqcolon\left(g_{0w},g_{0z},\ell_{30}\right).$$

$$\tilde{\theta}_{k}=\frac{1}{n}\sum_{i\in I_{k}}\frac{D_{i}-\hat{g}_{k}(X_{i})}{\hat{p}_{k}\left(1-\hat{g}_{k}(X_{i})\right)}\times\left(Y_{i}(1)-Y_{i}(0)-\hat{\ell}_{1k}(X_{i})\right)\quad\text{(repeated outcomes)}$$

$$\tilde{\theta}_{k}=\frac{1}{n}\sum_{i\in I_{k}}\frac{D_{i}-\hat{g}_{k}(X_{i})}{\hat{p}_{k}\hat{\lambda}_{k}(1-\hat{\lambda}_{k})\left(1-\hat{g}_{k}(X_{i})\right)}\times\left((T_{i}-\hat{\lambda}_{k})Y_{i}-\hat{\ell}_{2k}(X_{i})\right)\quad\text{(repeated cross sections)}$$

$$q_{i}\coloneqq\left(q_{i1}(X_{i}),...,q_{ip}(X_{i})\right)^{\prime}.$$

$$\hat{g}_{k}(x_{i})\coloneqq\Lambda\left(q_{i}^{\prime}\hat{\beta}_{k}\right),$$

$$\hat{\beta}_{k}\coloneqq\arg\min_{\beta\in\mathbb{R}^{p}}\frac{1}{M}\sum_{i\in I_{k}^{c}}\left\{-D_{i}\left(q_{i}^{\prime}\beta\right)+\log\left(1+\exp\left(q_{i}^{\prime}\beta\right)\right)\right\}+\lambda_{k}\parallel\beta\parallel_{1}$$

$$\hat{\ell}_{1k}(x_{i})\coloneqq q_{i}^{\prime}\hat{\beta}_{1k},$$

$$\hat{\ell}_{2k}(x_{i})\coloneqq q_{i}^{\prime}\hat{\beta}_{2k},$$

$$\hat{\beta}_{1k}\in\arg\min_{\beta\in\mathbb{R}^{p}}\frac{1}{M_{k}}\sum_{i\in I_{kz}^{c}}\left(Y_{i}(1)-Y_{i}(0)-q_{i}^{\prime}\beta\right)^{2}+\frac{\lambda_{1k}}{M_{k}}\parallel\hat{\Upsilon}_{1k}\beta\parallel_{1}$$

$$\hat{\beta}_{2k}\in\arg\min_{\beta\in\mathbb{R}^{p}}\frac{1}{M_{k}}\sum_{i\in I_{kz}^{c}}\left((T_{i}-\hat{\lambda}_{k})Y_{i}-q_{i}^{\prime}\beta\right)^{2}+\frac{\lambda_{2k}}{M_{k}}\parallel\hat{\Upsilon}_{2k}\beta\parallel_{1}$$

$$D_{r}\left[\eta-\eta_{0}\right]\coloneqq\partial_{r}\left\{E_{P}\left[\psi\left(W,\theta_{0},\rho_{0},\eta_{0}+r(\eta-\eta_{0})\right)\right]\right\},\quad\eta\in\mathcal{T},$$

$$\partial_{\eta}E_{P}\psi(W,\theta_{0},\rho_{0},\eta_{0})\left[\eta-\eta_{0}\right]\coloneqq D_{0}\left[\eta-\eta_{0}\right],\quad\eta\in\mathcal{T}.$$

$$\partial_{\eta}E_{P}\psi(W,\theta_{0},\rho_{0},\eta_{0})\left[\eta-\eta_{0}\right]=0,\quad\text{for all }\eta\in\mathcal{T}_{N}.$$

Full text

Neng-Chieh Chang (Department of Economics, University of California Los Angeles, 315 Portola Plaza, Los Angeles, CA 90095, USA. email: [email protected])

Abstract

This paper discusses difference-in-differences (DID) estimation when there exist many control variables, potentially more than the sample size. In this case, traditional estimation methods, which require a limited number of variables, do not work. One may consider using statistical or machine learning (ML) methods. However, by the well-known theory of inference of ML methods proposed in Chernozhukov et al. (2018), directly applying ML methods to the conventional semiparametric DID estimators will cause significant bias and make these DID estimators fail to be $\sqrt{N}$-consistent. This article proposes three new DID estimators for three different data structures, which are able to shrink the bias and achieve $\sqrt{N}$-consistency and asymptotic normality with mean zero when applying ML methods. This leads to straightforward inferential procedures. In addition, I show that these new estimators have the small bias property (SBP), meaning that their bias will converge to zero faster than the pointwise bias of the nonparametric estimator on which it is based.

Keywords: difference-in-differences, causal inference, high-dimensional data, Neyman orthogonality, $\sqrt{N}$-consistency, undersmoothing

JEL Classification: C13, C14

1 Introduction

The difference-in-differences (DID) estimator has been widely used in empirical economics to evaluate causal effects when there exists a natural experiment with a treated group and an untreated group. By comparing the variation over time in an outcome variable between the treated group and the untreated group, the DID estimator can be used to calculate the effect of treatment on the outcome variable. Applications of DID include but are not limited to studies of the effects of immigration on labor markets (Card, 1990), the effects of minimum wage law on wages (Card & Krueger, 1994), the effect of tariff liberalization on corruption (Sequeira, 2016), the effect of household income on children’s personalities (Akee, Copeland, Costello, & Simeonova, 2018), and the effect of corporate tax on wages (Fuest, Peichl, & Siegloch, 2018).

The traditional linear DID estimator depends on a parallel trend assumption: in the absence of treatment, the difference in outcomes between the treated and untreated groups remains constant over time. In many situations, however, this assumption may not hold because other individual characteristics may be associated with the variation in the outcomes. The treatment may be taken as exogenous only after controlling for these characteristics. To address this problem, Abadie (2005) proposed the semiparametric DID estimators. Compared to the traditional linear DID estimators, the advantages of Abadie’s estimators are threefold. First, the characteristics are treated nonparametrically, so that any estimation error caused by functional specification is avoided. Second, the effect of treatment is allowed to vary among individuals, while the traditional linear DID estimator does not allow this heterogeneity. Third, the estimation framework proposed in Abadie (2005) allows researchers to estimate how the effect of treatment varies with changes in the characteristics.

This paper extends Abadie (2005), who considered the case in which the number of control variables has to be limited. A practical difficulty empirical researchers encounter is choosing which variables to include when the data set is rich. Although economic intuition can help narrow down the choice set, it will not select all the important variables. This variable selection problem may lead to omitted variables in practice. In this paper, I consider DID estimation with many control variables, potentially more than the sample size. The classical estimation methods, which require a fixed number of variables, do not work in this situation. One has to consider ML methods such as Lasso, Logit Lasso, random forests, boosted trees, or various hybrids. However, by the well-known theory of inference of ML methods developed in Chernozhukov et al. (2018), directly applying ML methods to the conventional semiparametric DID estimators proposed in Abadie (2005) leads to significant bias and invalid inference. In particular, the regularization bias embedded in ML methods causes the conventional semiparametric DID estimators to fail to be $\sqrt{N}$-consistent.

I contribute to the literature by proposing three new DID estimators for three different data structures: repeated outcomes, repeated cross-sections, and multilevel treatment. These new estimators mitigate the regularization bias of ML methods and achieve $\sqrt{N}$-consistency. The key is to find the so-called Neyman-orthogonal scores (Chernozhukov et al., 2018) of Abadie (2005)’s estimands. A Neyman-orthogonal score is a function that identifies the parameter of interest and whose derivatives with respect to the nuisance parameters are zero. This property removes the first-order bias caused by ML methods so that only the second-order bias remains, which is much smaller and easier to control than the first-order bias present in the conventional semiparametric DID estimators. Using the cross-fitting algorithm in Chernozhukov et al. (2018), I show that the new DID estimators can be $\sqrt{N}$-consistent and asymptotically normal when using ML methods. Figure 1 presents a Monte Carlo simulation that illustrates the negative effect of directly combining ML methods with Abadie’s estimator and the benefit of using the newly proposed DID estimator.

The second contribution concerns the conventional semiparametric DID estimators with a limited number of control variables considered in Abadie (2005). In this case, the conventional semiparametric DID estimators are able to achieve $\sqrt{N}$-consistency using kernel estimators, but they require undersmoothing. Undersmoothing is a condition that requires the pointwise bias of the kernel estimators to converge to zero faster than the pointwise standard deviation. This condition will be violated if researchers use standard data-driven methods, such as cross-validation (CV), to choose the bandwidths of kernel estimators because those methods do not undersmooth.

In this paper, I show that the new estimators do not require undersmoothing to achieve $\sqrt{N}$-consistency. Specifically, I will show that the new estimators have the small bias property (SBP) in the sense of Newey, Hsieh, and Robins (2004), meaning that the bias of the new estimators will converge to zero faster than the pointwise bias of the nonparametric estimator on which it is based. The SBP, as shown in Chernozhukov, Escanciano, Ichimura, and Newey (2016), is a sufficient condition to remove the undersmoothing requirement. Figure 2 shows the Monte Carlo simulation results of Abadie’s estimator and the new estimator with bandwidths chosen by CV. We can observe that Abadie’s estimator is biased since CV does not undersmooth, and the newly proposed estimator can correct this bias.

As an empirical example, I study the effect of tariff reduction on corruption behavior using the trade data between South Africa and Mozambique from 2006 to 2014. The treatment is the large tariff reduction on certain commodities occurring in 2008. This natural experiment was previously studied by Sequeira (2016) using the traditional linear DID estimator. I apply my proposed semiparametric DID estimator and Abadie (2005)’s semiparametric DID estimator to the same data set (Table 9 of Sequeira (2016)). Sequeira (2016) finds that a decrease in the tariff rate decreases corruption behavior; the two semiparametric estimators consistently suggest that the effect is substantially larger than previously reported. A potential explanation for this difference is that the true data generating process violates the linear specification assumed in the traditional linear DID estimator. In addition, when compared to Abadie (2005)’s estimator, my proposed estimator shows that the effect is even larger.

The new estimators proposed in this paper rely heavily on the recent high-dimensional and ML literature: Belloni, Chen, Chernozhukov, and Hansen (2012), Belloni, Chernozhukov, and Hansen (2014), Chernozhukov, Hansen, and Spindler (2015), Belloni, Chernozhukov, Fernández-Val, and Hansen (2017), and Chernozhukov et al. (2018); and on the literature on the SBP in semiparametric estimation: Newey, Hsieh, and Robins (1998, 2004) and Chernozhukov, Escanciano, Ichimura, and Newey (2016).

Plan of the paper. Section 2 describes the conventional semiparametric DID estimators and discusses their limitations when applying ML methods. Section 3 presents the new DID estimators and discusses their theoretical properties. Section 4 conducts Monte Carlo simulation to shed some light on the finite sample performance of the proposed estimators. Section 5 provides an application, and Section 6 concludes the paper.

2 The Conventional Semiparametric DID Estimators

Let $Y_{i}\left(t\right)$ be the outcome of interest for individual $i$ at time $t$ and $D_{i}\left(t\right)\in\left\{0,1\right\}$ the treatment status. The population is observed in a pre-treatment period $t=0$ and in a post-treatment period $t=1$. With potential outcome notation (Rubin, 1974), we have $Y_{i}\left(t\right)=Y_{i}^{0}\left(t\right)+\left(Y_{i}^{1}\left(t\right)-Y_{i}^{0}\left(t\right)\right)D_{i}\left(t\right)$, where $Y_{i}^{0}\left(t\right)$ is the outcome that individual $i$ would attain at time $t$ in the absence of the treatment, and $Y_{i}^{1}\left(t\right)$ represents the outcome that individual $i$ would attain at time $t$ if exposed to the treatment. Since individuals are only exposed to treatment at $t=1$, we have $D_{i}\left(0\right)=0$ for all $i$. To reduce notation, I define $D_{i}\coloneqq D_{i}\left(1\right)$. Also, let $X_{i}\in\mathbb{R}^{d}$ be a vector of control variables with dimension $d$ potentially larger than the sample size $N$.

The traditional linear DID estimator is the parameter $\alpha$ in the following linear model

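$$Y_{i}(t)=\mu+X_{i}^{\prime}\pi(t)+\tau\cdot D_{i}+\delta\cdot t+\alpha\cdot D_{i}(t)+\varepsilon_{i}(t),$$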

where $\varepsilon_{i}\left(t\right)$ is an exogenous shock that has mean zero and $\left(\mu,\pi\left(t\right),\tau,\delta\right)$ are the corresponding parameters. Clearly, the linear specification assumed here is a strong assumption since the true data generating process may be nonlinear. In addition, Meyer, Viscusi, and Durbin (1995) noticed that including control variables in this linear form may not be appropriate if the treatment has different effects for different groups in the population. To deal with these problems, Abadie (2005) proposed the semiparametric DID estimators, which can identify the average treatment effect on the treated (ATT)

$$\theta_{0}\coloneqq E\left[Y_{i}^{1}(1)-Y_{i}^{0}(1)\mid D_{i}=1\right].$$

Depending on the structure of the data, there are three cases.

Case 1: Random sample with repeated outcomes

Consider the case in which researchers can observe both pre-treatment and post-treatment outcomes for each individual of interest. That is, researchers observe $\left\{Y_{i}\left(0\right),Y_{i}\left(1\right),D_{i},X_{i}\right\}_{i=1}^{N}$. In this case, the ATT can be identified under the following assumptions (Abadie, 2005):

Assumption 2.1.

$E\left[Y_{i}^{0}\left(1\right)-Y_{i}^{0}\left(0\right)\mid X_{i},D_{i}=1\right]=E\left[Y_{i}^{0}\left(1\right)-Y_{i}^{0}\left(0\right)\mid X_{i},D_{i}=0\right]$ .

Assumption 2.2.

$P\left(D_{i}=1\right)>0$ and with probability one $P\left(D_{i}=1\mid X_{i}\right)<1$ .

Assumption (2.1) is the conditional parallel trend assumption. It states that, conditional on an individual’s characteristics, the average outcomes for the treated and untreated groups would have followed parallel paths in the absence of treatment. With these two assumptions, the ATT is identified (Abadie, 2005) as

$$\theta_{0}=E\left[\frac{Y_{i}(1)-Y_{i}(0)}{P(D_{i}=1)}\,\frac{D_{i}-P(D_{i}=1\mid X_{i})}{1-P(D_{i}=1\mid X_{i})}\right].\tag{2.1}$$

Case 2: Random sample with repeated cross sections

Often, researchers are not able to observe both pre-treatment and post-treatment outcomes of the same individual. Instead, they observe repeated cross-section data sets. Let $T_{i}$ be a time indicator that takes value one if the observation belongs to the post-treatment sample. Researchers observe $\left\{Y_{i},D_{i},T_{i},X_{i}\right\}_{i=1}^{N}$, where $Y_{i}=Y_{i}\left(0\right)+T_{i}\left(Y_{i}\left(1\right)-Y_{i}\left(0\right)\right)$.

Assumption 2.3.

Conditional on $T=0$, the data are i.i.d. from the distribution of $\left(Y\left(0\right),D,X\right)$; conditional on $T=1$, the data are i.i.d. from the distribution of $\left(Y\left(1\right),D,X\right)$.

If Assumptions (2.1)-(2.3) hold, the ATT is identified (Abadie, 2005) as

$$\theta_{0}=E\left[\frac{T_{i}-\lambda_{0}}{\lambda_{0}(1-\lambda_{0})}\,\frac{Y_{i}}{P(D_{i}=1)}\,\frac{D_{i}-P(D_{i}=1\mid X_{i})}{1-P(D_{i}=1\mid X_{i})}\right],\tag{2.2}$$

where $\lambda_{0}\coloneqq P\left(T_{i}=1\right).$

Case 3: Multilevel treatments

In many cases, individuals can be exposed to different levels of treatment. Let $W\in\left\{0,w_{1},...,w_{J}\right\}$ be the level of treatment, where $W=0$ denotes the untreated individuals. Researchers observe $\left\{Y_{i}\left(0\right),Y_{i}\left(1\right),W_{i},X_{i}\right\}_{i=1}^{N}$.

For $w\in\left\{0,w_{1},...,w_{J}\right\}$ and $t\in\left\{0,1\right\}$, let $Y^{w}\left(t\right)$ be the potential outcome for treatment level $w$ at period $t$. Denote the ATT for each level of treatment $w$ by

$$\theta_{0}^{w}\coloneqq E\left[Y^{w}(1)-Y^{0}(1)\mid W=w\right].$$

Suppose that Assumptions (2.1) and (2.2) hold for each level of treatment:

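$$E\left[Y_{i}^{0}(1)-Y_{i}^{0}(0)\mid X_{i},W_{i}=w\right]=E\left[Y_{i}^{0}(1)-Y_{i}^{0}(0)\mid X_{i},W_{i}=0\right]$$

for $w\in\left\{w_{1},...,w_{J}\right\}$, and $P\left(W_{i}=w\right)>0$ and with probability one $P\left(W_{i}=w\mid X_{i}\right)<1$ for $w\in\left\{w_{1},...,w_{J}\right\}$. Then we have (Abadie, 2005)

$$\theta_{0}^{w}=E\left[\frac{Y(1)-Y(0)}{P(W=w)}\,\frac{I(W=w)\cdot P(W=0\mid X)-I(W=0)\cdot P(W=w\mid X)}{P(W=0\mid X)}\right],\tag{2.3}$$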

where $I\left(\cdot\right)$ is an indicator function.

Let us focus on Case 1, in which researchers confront repeated outcomes data. To use the identification result (2.1), the first step is to estimate the two nuisance parameters: $P\left(D_{i}=1\right)\eqqcolon p_{0}$ and $P\left(D_{i}=1\mid X_{i}\right)\eqqcolon g_{0}\left(X_{i}\right)$. The estimator of $p_{0}$ is just a sample average $\hat{p}=N^{-1}\sum_{i=1}^{N}D_{i}$, while the propensity score $g_{0}$ is infinite-dimensional and needs to be estimated nonparametrically. Denoting by $\hat{g}$ the estimator of $g_{0}$, the plug-in estimator based on equation (2.1) is

$$\hat{\theta}=\frac{1}{N}\sum_{i=1}^{N}\frac{Y_{i}(1)-Y_{i}(0)}{\hat{p}}\,\frac{D_{i}-\hat{g}(X_{i})}{1-\hat{g}(X_{i})}.$$

When $\hat{g}$ is estimated using classical nonparametric methods such as kernel or series estimators, the estimator $\hat{\theta}$ can be $\sqrt{N}$-consistent and asymptotically normal under certain conditions provided in the semiparametric estimation literature (Newey, 1994; Newey & McFadden, 1994).
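As an illustration, the plug-in estimator based on (2.1) can be sketched in a few lines. The sketch below is not the paper's code: the data generating process, the single binary control `x`, and the helper names (`simulate`, `cell_propensity`, `plug_in_att`, `naive_did`) are all assumptions made for the example, and the propensity score is estimated by within-cell treated shares, a simple classical estimator that only works because `x` is discrete.

```python
import random

def simulate(n, seed=0):
    """Toy repeated-outcomes data with one binary control X (hypothetical design)."""
    rng = random.Random(seed)
    y0, y1, d, x = [], [], [], []
    for _ in range(n):
        xi = rng.randint(0, 1)
        di = 1 if rng.random() < 0.3 + 0.4 * xi else 0   # propensity depends on X
        base = rng.gauss(0.0, 1.0)
        trend = 1.0 + 2.0 * xi                           # time trend depends on X
        y0.append(base)
        y1.append(base + trend + 3.0 * di + rng.gauss(0.0, 1.0))  # true ATT = 3
        d.append(di)
        x.append(xi)
    return y0, y1, d, x

def cell_propensity(d, x):
    """Estimate g(x) = P(D=1 | X=x) by the treated share within each X cell."""
    g = {}
    for v in set(x):
        cell = [di for di, xi in zip(d, x) if xi == v]
        g[v] = sum(cell) / len(cell)
    return g

def plug_in_att(y0, y1, d, x):
    """Sample analogue of (2.1): mean of (Y(1)-Y(0))/p_hat * (D - g_hat)/(1 - g_hat)."""
    n = len(d)
    p_hat = sum(d) / n
    g = cell_propensity(d, x)
    terms = [(b - a) / p_hat * (di - g[xi]) / (1.0 - g[xi])
             for a, b, di, xi in zip(y0, y1, d, x)]
    return sum(terms) / n

def naive_did(y0, y1, d):
    """Unconditional DID: mean change among treated minus mean change among untreated."""
    treated = [b - a for a, b, di in zip(y0, y1, d) if di == 1]
    control = [b - a for a, b, di in zip(y0, y1, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

y0, y1, d, x = simulate(20000)
att_plug_in = plug_in_att(y0, y1, d, x)
att_naive = naive_did(y0, y1, d)
print(round(att_plug_in, 2), round(att_naive, 2))
```

In this toy design the trend depends on `x`, so parallel trends holds only conditionally: the unconditional comparison drifts away from the true value of 3, while the reweighted estimator stays close to it.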

When $\hat{g}$ is an ML estimator, however, the estimator $\hat{\theta}$ will fail to be $\sqrt{N}$-consistent in general. By the general theory of inference of ML methods developed in Chernozhukov et al. (2018), the reason is twofold: (1) the score function based on (2.1), $\varphi\left(W,\theta_{0},p_{0},g_{0}\right)\coloneqq\frac{Y\left(1\right)-Y\left(0\right)}{P\left(D=1\right)}\frac{D-g_{0}\left(X\right)}{1-g_{0}\left(X\right)}-\theta_{0}$, has a non-zero directional (Gateaux) derivative with respect to the propensity score $g_{0}$:

$$\partial_{g}E\left[\varphi(W,\theta_{0},p_{0},g_{0})\right]\left[g-g_{0}\right]\neq 0,$$

where the directional (Gateaux) derivative is formally defined in Section 3; (2) ML estimators usually have a convergence rate slower than $N^{-1/2}$ due to regularization bias. Similarly, the estimators obtained by directly plugging ML estimators into (2.2) and (2.3) will not be $\sqrt{N}$ -consistent in general. The Monte Carlo simulation in Section 4 supports this theoretical insight and reveals significant bias on the estimators based on (2.1)-(2.3) when using ML estimators in the first-stage nonparametric estimation.
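To make point (1) concrete, the derivative can be worked out directly. The sketch below differentiates the score based on (2.1) along the direction $g-g_{0}$ and then uses iterated expectations; the shorthand $\ell_{10}(X)=E[Y(1)-Y(0)\mid X,D=0]$ is the notation of Section 3, and the interchange of derivative and expectation is assumed to be valid.

```latex
\begin{aligned}
\partial_{g}E\left[\varphi(W,\theta_{0},p_{0},g_{0})\right]\left[g-g_{0}\right]
&= E\left[\frac{Y(1)-Y(0)}{p_{0}}\,
   \left.\frac{\partial}{\partial g}\frac{D-g}{1-g}\right|_{g=g_{0}(X)}
   \left(g(X)-g_{0}(X)\right)\right] \\
&= -E\left[\frac{(1-D)\left(Y(1)-Y(0)\right)}{p_{0}\left(1-g_{0}(X)\right)^{2}}
   \left(g(X)-g_{0}(X)\right)\right] \\
&= -E\left[\frac{\ell_{10}(X)}{p_{0}\left(1-g_{0}(X)\right)}
   \left(g(X)-g_{0}(X)\right)\right].
\end{aligned}
```

The second line uses $\partial_{g}\left[(D-g)/(1-g)\right]=(D-1)/(1-g)^{2}$, and the third uses $E[(1-D)(Y(1)-Y(0))\mid X]=(1-g_{0}(X))\,\ell_{10}(X)$. The result is zero only in special cases, for example when $\ell_{10}(X)=0$ almost surely, so a first-stage error $g-g_{0}$ passes through to the estimator at first order.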

The next section proposes three new score functions that mitigate the regularization bias of the first-stage ML estimators. These three new score functions are derived under the same identification assumptions as those in Abadie (2005), so no extra assumption is made. Heuristically, a distinctive feature of the new score functions is that their derivatives with respect to their infinite-dimensional nuisance parameters are zero. This property removes the first-order bias of the first-stage estimation, so the bias of the estimators based on these new score functions will be much smaller. In addition, I will use the cross-fitting algorithm to mitigate the overfitting that frequently arises when using highly adaptive ML methods (Chernozhukov et al., 2018).

3 The New DID Estimators

3.1 The Main Algorithm

Supposing Assumptions (2.1)-(2.3) hold, consider the following three new score functions.

Case 1: Random sample with repeated outcomes

The new score function for repeated outcomes is

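$$\psi_{1}(W,\theta_{0},p_{0},\eta_{10})=\frac{Y(1)-Y(0)}{P(D=1)}\,\frac{D-P(D=1\mid X)}{1-P(D=1\mid X)}-\theta_{0}-\underbrace{\frac{D-P(D=1\mid X)}{P(D=1)\left(1-P(D=1\mid X)\right)}\,E\left[Y(1)-Y(0)\mid X,D=0\right]}_{c_{1}},\tag{3.1}$$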

with the unknown constant $p_{0}$ and the infinite-dimensional nuisance parameter

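$$\eta_{10}=\left(P(D=1\mid X),\,E\left[Y(1)-Y(0)\mid X,D=0\right]\right)\eqqcolon\left(g_{0},\ell_{10}\right).$$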

Case 2: Random sample with repeated cross sections

The new score function for repeated cross sections is

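$$\psi_{2}(W,\theta_{0},p_{0},\lambda_{0},\eta_{20})=\frac{T-\lambda_{0}}{\lambda_{0}(1-\lambda_{0})}\,\frac{Y}{P(D=1)}\,\frac{D-P(D=1\mid X)}{1-P(D=1\mid X)}-\theta_{0}-c_{2},\tag{3.2}$$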

where the adjustment term is

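$$c_{2}=\frac{D-P(D=1\mid X)}{\lambda_{0}(1-\lambda_{0})\cdot P(D=1)\cdot\left(1-P(D=1\mid X)\right)}\times E\left[(T-\lambda_{0})Y\mid X,D=0\right].$$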

The nuisance parameters are the unknown constants $p_{0}$ and $\lambda_{0}$ , and the infinite-dimensional parameter

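$$\eta_{20}=\left(P(D=1\mid X),\,E\left[(T-\lambda_{0})Y\mid X,D=0\right]\right)\eqqcolon\left(g_{0},\ell_{20}\right).$$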

Case 3: Multilevel treatment

For each $w\in\left\{w_{1},...,w_{J}\right\}$ , the new score function for multilevel treatment is

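$$\psi_{w}(W,\theta_{w0},p_{w0},\eta_{w0})=\frac{Y(1)-Y(0)}{P(W=w)}\,\frac{I(W=w)\cdot P(W=0\mid X)-I(W=0)\cdot P(W=w\mid X)}{P(W=0\mid X)}-\theta_{w0}-c_{w},\tag{3.3}$$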

where the adjustment term is

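$$c_{w}=\frac{I(W=w)\cdot P(W=0\mid X)-I(W=0)\cdot P(W=w\mid X)}{P(W=w)\cdot P(W=0\mid X)}\times E\left[Y(1)-Y(0)\mid X,I(W=0)=1\right].$$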

The nuisance parameters are the unknown constant $p_{w0}\coloneqq P\left(W=w\right)$ and the infinite-dimensional parameter

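$$\eta_{w0}=\left(P(W=w\mid X),\,P(W=0\mid X),\,E\left[Y(1)-Y(0)\mid X,I(W=0)=1\right]\right)\eqqcolon\left(g_{0w},g_{0z},\ell_{30}\right).$$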

Notice that the above three new functions are equal to the original score functions (2.1)-(2.3) plus the adjustment terms, $\left(c_{1},c_{2},c_{w}\right)$ , which have zero expectations. Thus, the new score functions (3.1)-(3.3) still identify the ATT in each case. I will use these new scores to construct new DID estimators.
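The zero-expectation claim follows from the law of iterated expectations. As a short check for the first two cases, write $g_{0}(X)=P(D=1\mid X)$, $p_{0}=P(D=1)$, and let $\ell_{10}$ and $\ell_{20}$ denote the conditional expectations that enter $c_{1}$ and $c_{2}$ (the forms listed among the equations above):

```latex
\begin{aligned}
E\left[c_{1}\right]
&= E\left[\frac{E\left[D-g_{0}(X)\mid X\right]}{p_{0}\left(1-g_{0}(X)\right)}\,\ell_{10}(X)\right]=0, \\
E\left[c_{2}\right]
&= E\left[\frac{E\left[D-g_{0}(X)\mid X\right]}{\lambda_{0}(1-\lambda_{0})\,p_{0}\left(1-g_{0}(X)\right)}\,\ell_{20}(X)\right]=0.
\end{aligned}
```

Both lines use only that $\ell_{10}$ and $\ell_{20}$ are functions of $X$ and that $E[D\mid X]=g_{0}(X)$. Note that the argument does not require $\ell_{10}$ or $\ell_{20}$ to be correctly specified, which is one way to see why errors in estimating them do not affect the estimand at first order.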

To avoid repetition, I will focus on the estimation of the ATT for repeated outcomes and repeated cross sections. The estimation for multilevel treatment is provided in the appendix. Now I combine the score functions described above with the cross-fitting estimation algorithm of Chernozhukov et al. (2018).

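**Algorithm 1**

1. Take a $K$-fold random partition $\left(I_{k}\right)_{k=1}^{K}$ of observation indices $\left[N\right]=\left\{1,...,N\right\}$. For simplicity, assume that each fold $I_{k}$ has the same size $n=N/K$. For each $k\in\left[K\right]=\left\{1,...,K\right\}$, define the auxiliary sample $I_{k}^{c}\coloneqq\left\{1,...,N\right\}\setminus I_{k}$.

2. For each $k$, construct the intermediate ATT estimators

$$\tilde{\theta}_{k}=\frac{1}{n}\sum_{i\in I_{k}}\frac{D_{i}-\hat{g}_{k}(X_{i})}{\hat{p}_{k}\left(1-\hat{g}_{k}(X_{i})\right)}\times\left(Y_{i}(1)-Y_{i}(0)-\hat{\ell}_{1k}(X_{i})\right)\quad\text{(repeated outcomes)}$$

$$\tilde{\theta}_{k}=\frac{1}{n}\sum_{i\in I_{k}}\frac{D_{i}-\hat{g}_{k}(X_{i})}{\hat{p}_{k}\hat{\lambda}_{k}(1-\hat{\lambda}_{k})\left(1-\hat{g}_{k}(X_{i})\right)}\times\left((T_{i}-\hat{\lambda}_{k})Y_{i}-\hat{\ell}_{2k}(X_{i})\right)\quad\text{(repeated cross sections)}$$

   where $\hat{p}_{k}=\frac{1}{n}\sum_{i\in I_{k}^{c}}D_{i}$, $\hat{\lambda}_{k}=\frac{1}{n}\sum_{i\in I_{k}^{c}}T_{i}$, and $\left(\hat{g}_{k},\hat{\ell}_{1k},\hat{\ell}_{2k}\right)$ are the estimators of $\left(g_{0},\ell_{10},\ell_{20}\right)$ constructed using the auxiliary sample $I_{k}^{c}$.

3. Construct the final ATT estimator $\tilde{\theta}=\frac{1}{K}\sum_{k=1}^{K}\tilde{\theta}_{k}.$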
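The three steps of Algorithm 1 for repeated outcomes can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the toy data, the single binary control `x`, and the names `simulate`, `cell_mean`, and `dml_did` are hypothetical; the nuisance functions are estimated by within-cell averages on the auxiliary sample instead of an ML method; and $\hat{p}_{k}$ is taken to be the auxiliary-sample mean of $D$ (that is, normalized by the auxiliary-sample size, which coincides with $n$ when $K=2$).

```python
import random

def simulate(n, seed=0):
    """Toy repeated-outcomes data with one binary control X (hypothetical design)."""
    rng = random.Random(seed)
    y0, y1, d, x = [], [], [], []
    for _ in range(n):
        xi = rng.randint(0, 1)
        di = 1 if rng.random() < 0.3 + 0.4 * xi else 0
        base = rng.gauss(0.0, 1.0)
        y0.append(base)
        y1.append(base + 1.0 + 2.0 * xi + 3.0 * di + rng.gauss(0.0, 1.0))  # true ATT = 3
        d.append(di)
        x.append(xi)
    return y0, y1, d, x

def cell_mean(values, cells):
    """Average of `values` within each cell label."""
    sums, counts = {}, {}
    for v, c in zip(values, cells):
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def dml_did(y0, y1, d, x, K=2, seed=0):
    """Cross-fitted ATT estimate for repeated outcomes (sketch of Algorithm 1)."""
    idx = list(range(len(d)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]            # step 1: K-fold random partition
    theta_k = []
    for k in range(K):
        main = folds[k]
        in_main = set(main)
        aux = [i for i in idx if i not in in_main]   # auxiliary sample I_k^c
        p_hat = sum(d[i] for i in aux) / len(aux)    # assumed: mean of D on I_k^c
        g_hat = cell_mean([d[i] for i in aux], [x[i] for i in aux])
        ctrl = [i for i in aux if d[i] == 0]         # untreated part of I_k^c
        ell_hat = cell_mean([y1[i] - y0[i] for i in ctrl], [x[i] for i in ctrl])
        total = 0.0
        for i in main:                               # step 2: fold-level estimator
            g, ell = g_hat[x[i]], ell_hat[x[i]]
            total += (d[i] - g) / (p_hat * (1.0 - g)) * (y1[i] - y0[i] - ell)
        theta_k.append(total / len(main))
    return sum(theta_k) / K                          # step 3: average over folds

y0, y1, d, x = simulate(20000, seed=1)
att_dml = dml_did(y0, y1, d, x, K=4)
print(round(att_dml, 2))
```

The nuisance estimates for fold $k$ never see the observations in $I_{k}$, which is the point of cross-fitting; replacing `cell_mean` with a Lasso or another learner fitted on `aux` would leave the rest of the routine unchanged.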

The estimators $\left(\hat{g}_{k},\hat{\ell}_{1k},\hat{\ell}_{2k}\right)$ can be constructed using any ML methods or classical estimators such as kernel or series estimators. For completeness, I present the Logit Lasso and Lasso estimators here.

Consider a class of approximating functions of $X_{i}$ ,

$$q_{i}\coloneqq\left(q_{i1}(X_{i}),...,q_{ip}(X_{i})\right)^{\prime}.$$

For example, $q_{i}$ can be polynomials or B-splines. Let $\Lambda\left(u\right)\coloneqq 1/\left(1+\exp\left(-u\right)\right)$ be the cumulative distribution function of the standard logistic distribution. Construct the estimator of the propensity score $g_{0}$ by

$$\hat{g}_{k}(x_{i})\coloneqq\Lambda\left(q_{i}^{\prime}\hat{\beta}_{k}\right),$$

where

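$$\hat{\beta}_{k}\coloneqq\arg\min_{\beta\in\mathbb{R}^{p}}\frac{1}{M}\sum_{i\in I_{k}^{c}}\left\{-D_{i}\left(q_{i}^{\prime}\beta\right)+\log\left(1+\exp\left(q_{i}^{\prime}\beta\right)\right)\right\}+\lambda_{k}\parallel\beta\parallel_{1}$$

is the Logit Lasso estimator and $M=N-n$ is the sample size of the auxiliary sample $I_{k}^{c}$. Next, define $I_{kz}^{c}\coloneqq I_{k}^{c}\cap\left\{i:D_{i}=0\right\}$, and let $M_{k}$ be the sample size of $I_{kz}^{c}$. Construct the estimators of $\ell_{10}$ and $\ell_{20}$ by

$$\hat{\ell}_{1k}(x_{i})\coloneqq q_{i}^{\prime}\hat{\beta}_{1k},\qquad\hat{\ell}_{2k}(x_{i})\coloneqq q_{i}^{\prime}\hat{\beta}_{2k},$$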

where

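$$\hat{\beta}_{1k}\in\arg\min_{\beta\in\mathbb{R}^{p}}\frac{1}{M_{k}}\sum_{i\in I_{kz}^{c}}\left(Y_{i}(1)-Y_{i}(0)-q_{i}^{\prime}\beta\right)^{2}+\frac{\lambda_{1k}}{M_{k}}\parallel\hat{\Upsilon}_{1k}\beta\parallel_{1}$$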

and

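$$\hat{\beta}_{2k}\in\arg\min_{\beta\in\mathbb{R}^{p}}\frac{1}{M_{k}}\sum_{i\in I_{kz}^{c}}\left((T_{i}-\hat{\lambda}_{k})Y_{i}-q_{i}^{\prime}\beta\right)^{2}+\frac{\lambda_{2k}}{M_{k}}\parallel\hat{\Upsilon}_{2k}\beta\parallel_{1}$$

are the modified Lasso estimators proposed in Belloni, Chen, Chernozhukov, and Hansen (2012). The choices of the penalty levels and loadings $\left(\lambda_{1k},\lambda_{2k},\hat{\Upsilon}_{1k},\hat{\Upsilon}_{2k}\right)$ suggested by Belloni, Chen, Chernozhukov, and Hansen (2012) are provided in the appendix.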
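For readers who want to see how a penalized least-squares problem of this shape can be solved, here is a small coordinate-descent sketch for the objective $\frac{1}{M}\sum_{i}(y_{i}-q_{i}^{\prime}\beta)^{2}+\frac{\lambda}{M}\sum_{j}\upsilon_{j}\lvert\beta_{j}\rvert$, where $\upsilon_{j}$ are the diagonal entries of a loading matrix. It is an assumption-laden illustration: the function names are hypothetical, coordinate descent is a standard solver swapped in for the example (the source does not say how the Lasso is computed), and the loadings default to one instead of the data-driven choices referred to above.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(q, y, lam, loadings=None, sweeps=200):
    """Coordinate descent for (1/M)*sum_i (y_i - q_i'b)^2 + (lam/M)*sum_j loadings_j*|b_j|."""
    p = len(q[0])
    loadings = loadings or [1.0] * p     # assumed default: unit penalty loadings
    beta = [0.0] * p
    resid = list(y)                      # residual y - q beta at beta = 0
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in q]
            norm_sq = sum(c * c for c in col)
            if norm_sq == 0.0:
                continue
            # Inner product of column j with the partial residual
            # (the residual with coordinate j's own contribution added back).
            rho = sum(c * r for c, r in zip(col, resid)) + norm_sq * beta[j]
            new_bj = soft_threshold(rho, lam * loadings[j] / 2.0) / norm_sq
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new_bj
    return beta

# One regressor equal to 1 and y = 2 for all four observations.
q_one = [[1.0], [1.0], [1.0], [1.0]]
y_one = [2.0, 2.0, 2.0, 2.0]
beta_shrunk = lasso_cd(q_one, y_one, lam=4.0)    # shrunk below the least-squares value 2
beta_zeroed = lasso_cd(q_one, y_one, lam=20.0)   # penalty large enough to set it to zero
print(beta_shrunk, beta_zeroed)
```

The threshold is $\lambda\upsilon_{j}/2$ because the squared-error term is not divided by two in this objective. A larger loading on a coordinate penalizes it more heavily, which is how the loading matrices enter.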

3.2 Theoretical Properties

In this section, I discuss the theoretical properties of the new DID estimator $\tilde{\theta}$. In particular, I will show that the estimator $\tilde{\theta}$ can achieve $\sqrt{N}$-consistency and asymptotic normality as long as the first-stage estimators converge at rates faster than $N^{-1/4}$. This rate of convergence can be achieved by many ML methods, including Lasso and Logit Lasso. Further, I will show that when using kernel estimators in the first-stage estimation, the estimator $\tilde{\theta}$ has the SBP while the conventional semiparametric DID estimators do not.

3.2.1 The Neyman Orthogonality

The difference between the new DID estimators and the conventional semiparametric DID estimators in Abadie (2005) lies in the score functions on which they are based. The key property of the new score functions (3.1)-(3.3) is that their directional (or Gateaux) derivatives with respect to their infinite-dimensional nuisance parameters are zero, while the scores based on (2.1)-(2.3) do not have this property. This property is the so-called Neyman orthogonality in Chernozhukov et al. (2018). Neyman orthogonality removes the first-order bias of the first-stage estimation so that the estimators based on these Neyman-orthogonal scores can achieve $\sqrt{N}$-consistency under less restrictive conditions.

The definition of the Neyman-orthogonal score provided here differs slightly from that in Chernozhukov et al. (2018): instead of being orthogonal against all nuisance parameters, the Neyman-orthogonal score defined here is orthogonal against only the infinite-dimensional nuisance parameters. Formally, let $\theta_{0}\in\Theta$ be the low-dimensional parameter of interest, $\rho_{0}$ be the true value of the finite-dimensional nuisance parameter $\rho$, and $\eta_{0}$ the true value of the infinite-dimensional nuisance parameter $\eta\in\mathcal{T}$. Suppose that $W$ is a random element taking values in a measurable space $\left(\mathcal{W},\mathcal{A}_{\mathcal{W}}\right)$ with probability measure $P$. Define the directional (or Gateaux) derivative against the infinite-dimensional nuisance parameter $D_{r}:\tilde{\mathcal{T}}\rightarrow\mathbb{R}$, where $\tilde{\mathcal{T}}=\left\{\eta-\eta_{0}:\eta\in\mathcal{T}\right\}$,

$$D_{r}\left[\eta-\eta_{0}\right]\coloneqq\partial_{r}\left\{E_{P}\left[\psi\left(W,\theta_{0},\rho_{0},\eta_{0}+r(\eta-\eta_{0})\right)\right]\right\},\quad\eta\in\mathcal{T},$$

for all $r\in[0,1)$ . For convenience, denote

[TABLE]

In addition, let $\mathcal{T}_{N}\subset\mathcal{T}$ be a nuisance realization set such that the estimator of $\eta_{0}$ takes values in this set with high probability.

Definition

(The Neyman Orthogonality)

The score $\psi$ obeys the Neyman orthogonality condition at $\left(\theta_{0},\rho_{0},\eta_{0}\right)$ with respect to the nuisance parameter realization set $\mathcal{T}_{N}\subset\mathcal{T}$ if the directional derivative map $D_{r}\left[\eta-\eta_{0}\right]$ exists for all $r\in[0,1)$ and $\eta\in\mathcal{T}_{N}$ and vanishes at $r=0$ :

[TABLE]

Lemma 1

The new score functions (3.1)-(3.3) obey the Neyman orthogonality.

This property of (3.1)-(3.3) is what allows the asymptotic distribution and the SBP results below to be established under less restrictive assumptions.
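Lemma 1 can be illustrated numerically. The sketch below is a toy check, not the paper's proof. It assumes a two-point covariate, and it assumes the score forms $\frac{D-g(X)}{p(1-g(X))}\left(Y(1)-Y(0)\right)-\theta$ for the conventional estimator and $\frac{D-g(X)}{p(1-g(X))}\left(Y(1)-Y(0)-\ell_{1}(X)\right)-\theta$ for the new one. These forms are stand-ins chosen to be consistent with the nuisance functions $g_{0}$ and $\ell_{10}$ of Assumption 3.1; the exact expressions are those in (2.1) and (3.1). All numerical values and function names are hypothetical.

```python
# Exact expected scores on a two-point covariate, so no sampling noise enters.
# ASSUMPTION: the two score forms are illustrative stand-ins for (2.1) and (3.1).

P_X = {0: 0.5, 1: 0.5}   # distribution of X (hypothetical values)
G0 = {0: 0.3, 1: 0.6}    # true propensity score g_0(x)
L0 = {0: 1.0, 1: 2.0}    # true l_10(x) = E[Y(1) - Y(0) | X = x, D = 0]
TAU = {0: 3.0, 1: 3.0}   # extra mean outcome change among the treated
P0 = sum(P_X[x] * G0[x] for x in P_X)                     # P(D = 1)
THETA0 = sum(P_X[x] * G0[x] * TAU[x] for x in P_X) / P0   # the ATT


def expected_score(g, ell, orthogonal):
    """E_P[psi(W, theta_0, p_0, (g, ell))], computed by exact summation."""
    total = 0.0
    for x, px in P_X.items():
        mean_dy = {1: L0[x] + TAU[x], 0: L0[x]}  # E[Y(1) - Y(0) | X = x, D = d]
        for d, pd in ((1, G0[x]), (0, 1.0 - G0[x])):
            weight = (d - g[x]) / (P0 * (1.0 - g[x]))
            resid = mean_dy[d] - (ell[x] if orthogonal else 0.0)
            total += px * pd * weight * resid
    return total - THETA0


def gateaux_derivative(direction_g, direction_ell, orthogonal, r=1e-6):
    """Central-difference approximation to D_r[eta - eta_0] at r = 0."""
    def at(step):
        g = {x: G0[x] + step * direction_g[x] for x in P_X}
        ell = {x: L0[x] + step * direction_ell[x] for x in P_X}
        return expected_score(g, ell, orthogonal)
    return (at(r) - at(-r)) / (2.0 * r)


if __name__ == "__main__":
    dg = {0: 0.1, 1: -0.1}    # direction of perturbation for g
    dl = {0: 0.5, 1: -0.5}    # direction of perturbation for l_1
    print("conventional:", gateaux_derivative(dg, dl, orthogonal=False))
    print("orthogonal:  ", gateaux_derivative(dg, dl, orthogonal=True))
```

Perturbing the nuisance functions moves the expected conventional score at first order, while the expected orthogonal score is flat at $r=0$ up to numerical error.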

3.2.2 Asymptotic Distribution

In the following, I discuss the theoretical properties of the new estimator $\tilde{\theta}$ when the data are repeated outcomes or repeated cross sections. The results for multilevel treatment can be proven using the same arguments. Let $\kappa$ and $C$ be strictly positive constants, $K\geq 2$ be a fixed integer, and $\varepsilon_{N}$ be a sequence of positive constants approaching zero. Denote by $\parallel\cdot\parallel_{P,q}$ the $L^{q}$ norm with respect to a probability measure $P$: $\parallel f\parallel_{P,q}\coloneqq\left(\int\mid f\left(w\right)\mid^{q}dP\left(w\right)\right)^{1/q}$ and $\parallel f\parallel_{P,\infty}\coloneqq\sup_{w}\mid f\left(w\right)\mid$.

Assumption 3.1

(Regularity Conditions for Repeated Outcomes)

Let $P$ be the probability law for $\left(Y\left(0\right),Y\left(1\right),D,X\right)$. Let $D=g_{0}\left(X\right)+U$ and $Y\left(1\right)-Y\left(0\right)=\ell_{10}\left(X\right)+V_{1}$ with $E_{P}\left[U\mid X\right]=0$ and $E_{P}\left[V_{1}\mid X,D=0\right]=0$. Define $G_{1p0}\coloneqq E_{P}\left[\partial_{p}\psi_{1}\left(W,\theta_{0},p_{0},\eta_{10}\right)\right]$ and $\Sigma_{10}\coloneqq E_{P}\left[\left(\psi_{1}\left(W,\theta_{0},p_{0},\eta_{10}\right)+G_{1p0}\left(D-p_{0}\right)\right)^{2}\right]$. Suppose the following conditions hold: (a) $Pr\left(\kappa\leq g_{0}\left(X\right)\leq 1-\kappa\right)=1$; (b) $\parallel UV_{1}\parallel_{P,4}\leq C$; (c) $E\left[U^{2}\mid X\right]\leq C$; (d) $E\left[V_{1}^{2}\mid X\right]\leq C$; (e) $\Sigma_{10}>0$; and (f) given the auxiliary sample $I_{k}^{c}$, the estimator $\hat{\eta}_{1k}=\left(\hat{g}_{k},\hat{\ell}_{1k}\right)$ obeys the following conditions. With probability $1-o\left(1\right)$, $\parallel\hat{\eta}_{1k}-\eta_{10}\parallel_{P,2}\leq\varepsilon_{N}$, $\parallel\hat{g}_{k}-1/2\parallel_{P,\infty}\leq 1/2-\kappa$, and $\parallel\hat{g}_{k}-g_{0}\parallel_{P,2}^{2}+\parallel\hat{g}_{k}-g_{0}\parallel_{P,2}\times\parallel\hat{\ell}_{1k}-\ell_{10}\parallel_{P,2}\leq\left(\varepsilon_{N}\right)^{2}$.
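The norms and the bound on $\hat{g}_{k}$ in condition (f) have simple sample analogues. The sketch below replaces $P$ with the empirical measure of a finite list of function values, which is an assumption made for illustration; the helper names `lq_norm`, `sup_norm`, and `within_overlap` are hypothetical.

```python
def lq_norm(values, q):
    """Empirical analogue of ||f||_{P,q}: (mean of |f(w_i)|^q)^(1/q)."""
    return (sum(abs(v) ** q for v in values) / len(values)) ** (1.0 / q)


def sup_norm(values):
    """Empirical analogue of ||f||_{P,infinity}: the largest |f(w_i)|."""
    return max(abs(v) for v in values)


def within_overlap(g_hat_values, kappa):
    """Check ||g_hat - 1/2||_inf <= 1/2 - kappa, i.e. kappa <= g_hat <= 1 - kappa."""
    return sup_norm([g - 0.5 for g in g_hat_values]) <= 0.5 - kappa


if __name__ == "__main__":
    fitted = [0.2, 0.5, 0.8]             # hypothetical fitted propensity scores
    print(lq_norm(fitted, 2))
    print(within_overlap(fitted, 0.1))   # every value lies in [0.1, 0.9]
```

Writing the bound as a sup-norm distance from $1/2$ is a compact way of requiring that the fitted propensity score stays inside $[\kappa,1-\kappa]$.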

Assumption 3.2

(Regularity Conditions for Repeated Cross Sections)

Let $P$ be the probability law for $\left(Y,T,D,X\right)$. Let $D=g_{0}\left(X\right)+U$ and $\left(T-\lambda_{0}\right)Y=\ell_{20}\left(X\right)+V_{2}$ with $E_{P}\left[U\mid X\right]=0$ and $E_{P}\left[V_{2}\mid X,D=0\right]=0$. Define $G_{2p0}\coloneqq E_{P}\left[\partial_{p}\psi_{2}\left(W,\theta_{0},p_{0},\lambda_{0},\eta_{20}\right)\right]$, $G_{2\lambda 0}\coloneqq E_{P}\left[\partial_{\lambda}\psi_{2}\left(W,\theta_{0},p_{0},\lambda_{0},\eta_{20}\right)\right]$, and $\Sigma_{20}\coloneqq E_{P}\left[\left(\psi_{2}\left(W,\theta_{0},p_{0},\lambda_{0},\eta_{20}\right)+G_{2p0}\left(D-p_{0}\right)+G_{2\lambda 0}\left(T-\lambda_{0}\right)\right)^{2}\right]$. Suppose the following conditions hold: (a) $Pr\left(\kappa\leq g_{0}\left(X\right)\leq 1-\kappa\right)=1$; (b) $\parallel UV_{2}\parallel_{P,4}\leq C$; (c) $E\left[U^{2}\mid X\right]\leq C$; (d) $E\left[V_{2}^{2}\mid X\right]\leq C$; (e) $E_{P}\left[Y^{2}\mid X\right]\leq C$; (f) $\mid E_{P}\left[YU\right]\mid\leq C$; (g) $\Sigma_{20}>0$; and (h) given the auxiliary sample $I_{k}^{c}$, the estimator $\hat{\eta}_{2k}=\left(\hat{g}_{k},\hat{\ell}_{2k}\right)$ obeys the following conditions. With probability $1-o\left(1\right)$, $\parallel\hat{\eta}_{2k}-\eta_{20}\parallel_{P,2}\leq\varepsilon_{N}$, $\parallel\hat{g}_{k}-1/2\parallel_{P,\infty}\leq 1/2-\kappa$, and $\parallel\hat{g}_{k}-g_{0}\parallel_{P,2}^{2}+\parallel\hat{g}_{k}-g_{0}\parallel_{P,2}\times\parallel\hat{\ell}_{2k}-\ell_{20}\parallel_{P,2}\leq\left(\varepsilon_{N}\right)^{2}$.

Theorem 1

For repeated outcomes, suppose Assumptions (2.1), (2.2) and (3.1) hold. For repeated cross sections, suppose Assumptions (2.1)-(2.3) and (3.2) hold. If $\varepsilon_{N}=o\left(N^{-1/4}\right)$ , then the new ATT estimator $\tilde{\theta}$ satisfies

$$\sqrt{N}\left(\tilde{\theta}-\theta_{0}\right)\rightarrow_{d}N\left(0,\Sigma\right),$$

with $\Sigma=\Sigma_{10}$ for repeated outcomes and $\Sigma=\Sigma_{20}$ for repeated cross sections.
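The role of the $N^{-1/4}$ threshold can be read off a heuristic bound for the repeated-outcomes case. The display below is a sketch, not the paper's proof: it assumes that, once Neyman orthogonality has removed the first-order term, the remaining bias of the score evaluated at the first-stage estimate is controlled, up to a constant, by the product-type term in condition (f) of Assumption 3.1.

```latex
\sqrt{N}\,\Bigl| E_{P}\bigl[\psi_{1}\left(W,\theta_{0},p_{0},\hat{\eta}_{1k}\right)\mid I_{k}^{c}\bigr]\Bigr|
\;\lesssim\;
\sqrt{N}\Bigl(\parallel\hat{g}_{k}-g_{0}\parallel_{P,2}^{2}
+\parallel\hat{g}_{k}-g_{0}\parallel_{P,2}\times\parallel\hat{\ell}_{1k}-\ell_{10}\parallel_{P,2}\Bigr)
\;\leq\;
\sqrt{N}\,\varepsilon_{N}^{2}
\;=\; o\left(1\right)
\quad\text{if } \varepsilon_{N}=o\left(N^{-1/4}\right).
```

Under this reading, each first-stage estimator may converge more slowly than $N^{-1/2}$, because only products of first-stage errors enter the bias.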

Theorem 2 (Variance Estimator)

Construct the estimators of the asymptotic variances as

[TABLE]

where $\mathbb{E}_{n,k}\left[f\left(W\right)\right]=n^{-1}\sum_{i\in I_{k}}f\left(W_{i}\right)$ , $\hat{G}_{1p}=\hat{G}_{2p}=-\tilde{\theta}/\hat{p}_{k}$ , and $\hat{G}_{2\lambda}$ is a consistent estimator of $G_{2\lambda 0}$ . If the assumptions of Theorem 1 hold, then $\hat{\Sigma}_{1}=\Sigma_{10}+o_{P}\left(1\right)$ and $\hat{\Sigma}_{2}=\Sigma_{20}+o_{P}\left(1\right).$

The interpretation of Theorems 1 and 2 is that the new DID estimator $\tilde{\theta}$ can achieve $\sqrt{N}$-consistency and asymptotic normality provided that the first-stage estimators of the infinite-dimensional nuisance parameters converge at a rate faster than $N^{-1/4}$. This rate of convergence can be achieved by many ML methods. In particular, Van de Geer (2008) and Belloni, Chen, Chernozhukov, and Hansen (2012) provide detailed conditions under which Logit Lasso and the modified Lasso estimators satisfy this rate of convergence. It is also worth noting that even when the first-stage estimators do not converge as fast as $N^{-1/4}$, the new estimator $\tilde{\theta}$ still has smaller bias than the original estimator because the Neyman orthogonality removes the first-order bias of the first-stage estimators.
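The cross-fitting construction behind Theorems 1 and 2 can be sketched for repeated outcomes as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the covariate is discrete so that cell averages can stand in for the ML first stage, the score has the assumed form $\frac{D-g(X)}{p(1-g(X))}\left(Y(1)-Y(0)-\ell_{1}(X)\right)-\theta$ (see (3.1) for the exact expression), $\hat{p}_{k}$ is computed on the auxiliary sample, and the variance estimator averages $\left(\psi_{1}+\hat{G}_{1p}\left(D-\hat{p}_{k}\right)\right)^{2}$ over folds with $\hat{G}_{1p}=-\tilde{\theta}/\hat{p}_{k}$. The function names are hypothetical.

```python
import math
import random


def cell_means(rows, value, keep=lambda r: True):
    """Mean of value(r) within each covariate cell r['x'], over rows passing keep."""
    sums, counts = {}, {}
    for r in rows:
        if keep(r):
            sums[r["x"]] = sums.get(r["x"], 0.0) + value(r)
            counts[r["x"]] = counts.get(r["x"], 0) + 1
    return {x: sums[x] / counts[x] for x in sums}


def cross_fit_att(data, n_folds=2, seed=0):
    """Return (theta_tilde, standard_error) for repeated-outcomes data.

    data: list of dicts with keys 'x' (discrete covariate), 'd' (0/1 treatment)
    and 'dy' (the outcome change Y(1) - Y(0)).
    """
    rows = list(data)
    random.Random(seed).shuffle(rows)
    folds = [rows[k::n_folds] for k in range(n_folds)]
    fitted = []  # one (p_hat, [(d, score term without theta), ...]) per fold
    for k, main in enumerate(folds):
        # First stage uses only the auxiliary sample I_k^c.
        aux = [r for j, fold in enumerate(folds) if j != k for r in fold]
        p_hat = sum(r["d"] for r in aux) / len(aux)
        g_hat = cell_means(aux, lambda r: r["d"])
        ell_hat = cell_means(aux, lambda r: r["dy"], keep=lambda r: r["d"] == 0)
        terms = []
        for r in main:
            g, ell = g_hat[r["x"]], ell_hat[r["x"]]
            weight = (r["d"] - g) / (p_hat * (1.0 - g))
            terms.append((r["d"], weight * (r["dy"] - ell)))
        fitted.append((p_hat, terms))
    # Average the fold-level estimates.
    theta = sum(sum(t for _, t in terms) / len(terms) for _, terms in fitted) / n_folds
    # Variance estimator: fold average of (psi + G_p (D - p_hat))^2.
    sigma = 0.0
    for p_hat, terms in fitted:
        g_p = -theta / p_hat
        sigma += sum((t - theta + g_p * (d - p_hat)) ** 2 for d, t in terms) / len(terms)
    sigma /= n_folds
    return theta, math.sqrt(sigma / len(rows))


if __name__ == "__main__":
    rng = random.Random(1)
    g0 = {0: 0.3, 1: 0.5, 2: 0.7}   # hypothetical propensity scores by cell
    sample = []
    for _ in range(3000):
        x = rng.choice([0, 1, 2])
        d = 1 if rng.random() < g0[x] else 0
        sample.append({"x": x, "d": d, "dy": x + 3.0 * d + rng.gauss(0.0, 0.3)})
    print(cross_fit_att(sample))
```

A 95% confidence interval would then be `theta ± 1.96 * se`. The sketch raises an error if a covariate cell is missing from an auxiliary sample or has an estimated propensity score of one, which is where the overlap condition (a) matters in practice.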

3.2.3 The Small Bias Property

Consider the conventional semiparametric DID estimators with a limited number of control variables studied in Abadie (2005). Let $\widehat{g}_{h}$ be the kernel estimator of $g_{0}$ with bandwidth $h\rightarrow 0$ in (2.1) and (2.2). Under the standard assumptions of kernel estimation (Assumption (3.3) below), one can show that the pointwise bias of $\hat{g}_{h}$ is of order $O\left(h^{m}\right)$, where $m$ can be interpreted as the minimum number of derivatives of $g_{0}$; and the pointwise standard deviation is $sd\left(\hat{g}_{h}\left(x\right)\right)=O(\left(Nh^{d+2s}\right)^{-1/2})$. By Theorem 8.11 of Newey and McFadden (1994), one can show that the $\sqrt{N}$-consistency of the plug-in estimators based on (2.1) and (2.2) requires $\sqrt{N}h^{m}\rightarrow 0$. That is, the pointwise bias of the kernel estimator has to converge to zero faster than $N^{-1/2}$. Since the pointwise standard deviation converges to zero slower than $N^{-1/2}$, undersmoothing is required. In this case, standard data-driven bandwidth selection methods which do not undersmooth, such as cross-validation, are invalid.

To avoid undersmoothing, by the analysis of SBP in Newey, Hsieh, and Robins (1998, 2004), the estimator of the parameter of interest needs to have smaller bias than the pointwise bias of the first-stage nonparametric estimators. That is, the SBP requires that the bias of the estimator of $\theta_{0}$ converges to zero faster than $h^{m}$.

In the following, I will show that the new DID estimator $\tilde{\theta}$ has the SBP. Let $\left(\hat{g}_{kh},\hat{\ell}_{1kh},\hat{\ell}_{2kh}\right)$ be the kernel estimators of $\left(g_{0},\ell_{10},\ell_{20}\right)$ using auxiliary sample $I_{k}^{c}$ . I assume here that they have the same bandwidth $h$ and kernel $K\left(u\right)$ for convenience.

Assumption 3.3

(Newey and McFadden, 1994)

1. $K\left(u\right)$ is differentiable of order $s$, the derivatives of order $s$ are bounded, $K\left(u\right)$ is zero outside a bounded set, $\int K\left(u\right)du=1$, and there is a positive integer $m$ such that for all $j<m$, $\int K\left(u\right)\left[\bigotimes_{\ell=1}^{j}u\right]du=0$.

2. Define $\gamma_{0}\left(x\right)=f_{0}\left(x\right)E\left(z\mid x\right)$, where $z\in\left(1,D,Y\left(1\right)-Y\left(0\right)\mid D=0,\left(T-\lambda_{0}\right)Y\mid D=0\right)$ and $f_{0}\left(x\right)$ is the true density of $x$. Assume that $\gamma_{0}\left(x\right)$ is continuously differentiable to order $s$ with bounded derivatives on an open set containing $\mathcal{X}$, where $\mathcal{X}$ is the support of $x$.

3. There is $\alpha\geq 4$ such that $E\left[\mid z\mid^{\alpha}\right]<\infty$ and $E\left[\mid z\mid^{\alpha}\mid x\right]f_{0}\left(x\right)$ is bounded.

Theorem 3

For repeated outcomes, suppose Assumptions (2.1), (2.2), (3.1), and (3.3) hold. For repeated cross sections, suppose Assumptions (2.1)-(2.3), (3.2), and (3.3) hold. Suppose that $\inf_{x\in\mathcal{X}}f_{0}\left(x\right)\neq 0$ , $h=h\left(N\right)$ with $\log N/\left(\sqrt{N}h^{d+2s}\right)\rightarrow 0$ . If $\sqrt{N}h^{2m}\rightarrow 0$ , then

$$\sqrt{N}\left(\tilde{\theta}-\theta_{0}\right)\rightarrow_{d}N\left(0,\Sigma\right),$$

with $\Sigma=\Sigma_{10}$ for repeated outcomes and $\Sigma=\Sigma_{20}$ for repeated cross sections.

The interpretation of Theorem 3 is that the new estimator $\tilde{\theta}$ only requires $\sqrt{N}h^{2m}\rightarrow 0$ to achieve $\sqrt{N}$-consistency, while the conventional semiparametric DID estimators require $\sqrt{N}h^{m}\rightarrow 0$ under the same assumptions. With the Neyman orthogonality, the bias of $\tilde{\theta}$ is of second order in the pointwise bias of the first-stage kernel estimators: it is of order $h^{2m}$ instead of $h^{m}$. Hence, $\tilde{\theta}$ satisfies the SBP. In particular, a bandwidth $h$ such that $\log N/\left(\sqrt{N}h^{d+2s}\right)\rightarrow 0$ and $\sqrt{N}h^{2m}\rightarrow 0$ exists only if $2m>d+2s$. Under this condition, the optimal bandwidth selected by minimizing the mean-squared error (CV), $h=N^{-1/\left(d+2s+2m\right)}$, satisfies the conditions for $\sqrt{N}$-consistency.
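The bandwidth arithmetic in this paragraph can be checked exactly. The helper below, whose name and dictionary keys are hypothetical, plugs $h=N^{-1/\left(d+2s+2m\right)}$ into the three rate conditions and returns the resulting exponent of $N$ in each; a negative exponent means the quantity tends to zero.

```python
from fractions import Fraction


def cv_rate_exponents(d, s, m):
    """Exponents of N in the rate conditions when h = N^(-1/(d + 2s + 2m)).

    A negative exponent means the corresponding quantity tends to zero; the
    log N factor in the variance condition cannot offset a negative power of N.
    """
    rate = Fraction(1, d + 2 * s + 2 * m)  # h = N^(-rate)
    return {
        # log N / (sqrt(N) h^(d+2s)) behaves like N^(-1/2 + (d+2s) rate) log N
        "variance": Fraction(-1, 2) + (d + 2 * s) * rate,
        # sqrt(N) h^(2m) = N^(1/2 - 2m rate), needed by the new estimator
        "new": Fraction(1, 2) - 2 * m * rate,
        # sqrt(N) h^m = N^(1/2 - m rate), needed by the conventional estimator
        "conventional": Fraction(1, 2) - m * rate,
    }


if __name__ == "__main__":
    print(cv_rate_exponents(d=1, s=0, m=2))
    print(cv_rate_exponents(d=3, s=0, m=1))
```

With $d=1$, $s=0$, $m=2$, so that $2m>d+2s$, the exponents for the variance condition and for the new estimator are both $-3/10$, while the conventional condition has exponent $+1/10$ and fails. With $d=3$, $s=0$, $m=1$ the condition $2m>d+2s$ is violated and the exponent for the new estimator is positive as well.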

Theorem 4

Construct the estimators of the asymptotic variances as

[TABLE]

where $\hat{G}_{1p}=\hat{G}_{2p}=-\tilde{\theta}/\hat{p}_{k}$ and $\hat{G}_{2\lambda}$ is a consistent estimator of $G_{2\lambda 0}$ . If the assumptions of Theorem 3 hold, then $\hat{\Sigma}_{1}=\Sigma_{10}+o_{P}\left(1\right)$ and $\hat{\Sigma}_{2}=\Sigma_{20}+o_{P}\left(1\right).$

4 Simulation

In this section, I present Monte Carlo simulation results of the conventional semiparametric DID estimators and the new DID estimator $\tilde{\theta}$ in three different data structures: repeated outcomes, repeated cross sections, and multilevel treatment. I use both ML methods and kernel estimators in the first-stage estimation. For ML estimation, I generate high-dimensional (HD) data and estimate the propensity score by Logit Lasso (Multi-Logit Lasso for multilevel treatment). To choose the penalty parameter for Logit Lasso (Multi-Logit Lasso), I use $K$-fold CV (as recommended by Van de Geer (2008)) with $K=10$. Alternatively, one could use a method developed in Belloni, Chernozhukov, Chetverikov, and Wei (2018). The other infinite-dimensional nuisance parameters are estimated by random forests with 500 regression trees. For kernel estimation, all the infinite-dimensional nuisance parameters are estimated using the standard Gaussian kernel.

Figures 3-20 in the appendix show the simulation results. I find that the conventional semiparametric DID estimators are biased when using ML methods, while the new DID estimator $\tilde{\theta}$ corrects the bias. For kernel estimation, the conventional DID estimator with bandwidth selected by CV is biased, while the new DID estimator $\tilde{\theta}$ is centered at the true value. The data generating processes are presented below.

4.1 Repeated Outcomes

4.1.1 ML Estimation

Let $N\in\left\{200,500\right\}$ be the sample size and $p\in\left\{100,300\right\}$ the dimension of control variables, $X_{i}\sim N\left(0,I_{p\times p}\right)$ . Also, let $\gamma_{0}=\left(1,1/2,1/3,1/4,1/5,0,...,0\right)\in\mathbb{R}^{p}$ and $D_{i}$ is generated by the propensity score

[TABLE]

At $t=0$ , the potential outcome is generated

[TABLE]

and at $t=1$ ,

[TABLE]

where $\beta_{0}=\gamma_{0}+0.5$ and $\theta_{0}=3$, and all error terms follow $N\left(0,0.1\right)$. Researchers observe $\left\{Y_{i}\left(0\right),Y_{i}\left(1\right),D_{i},X_{i}\right\}$ for $i=1,...,N$, where $Y_{i}\left(0\right)=Y_{i}^{0}\left(0\right)$ and $Y_{i}\left(1\right)=Y_{i}^{0}\left(1\right)\left(1-D_{i}\right)+Y_{i}^{1}\left(1\right)D_{i}$. Figures 3-6 present the results.
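The parts of this design that are stated in the text can be set up as follows. The sketch is illustrative: the function names are hypothetical, and adding 0.5 to $\gamma_{0}$ is read as an elementwise shift, which is an assumption. The propensity score and the potential-outcome equations are not coded here, so potential outcomes enter as arguments.

```python
import random


def sparse_coefficients(p):
    """gamma_0 = (1, 1/2, 1/3, 1/4, 1/5, 0, ..., 0) in R^p and beta_0 = gamma_0 + 0.5.

    ASSUMPTION: the shift by 0.5 is applied to every coordinate.
    """
    gamma0 = [1.0 / (j + 1) if j < 5 else 0.0 for j in range(p)]
    beta0 = [g + 0.5 for g in gamma0]
    return gamma0, beta0


def draw_covariates(n, p, rng):
    """X_i ~ N(0, I_p): n rows of p independent standard normal coordinates."""
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]


def observed_outcomes(y0_t0, y0_t1, y1_t1, d):
    """Y(0) = Y^0(0) and Y(1) = Y^0(1)(1 - D) + Y^1(1) D for one unit."""
    return y0_t0, y0_t1 * (1 - d) + y1_t1 * d


if __name__ == "__main__":
    rng = random.Random(0)
    gamma0, beta0 = sparse_coefficients(p=100)
    X = draw_covariates(n=200, p=100, rng=rng)
    print(gamma0[:6], beta0[:6], len(X), len(X[0]))
    print(observed_outcomes(1.0, 2.0, 5.0, d=1))   # treated unit reveals Y^1(1)
```

Only five coordinates of $\gamma_{0}$ are nonzero, so the propensity score depends on a sparse subset of the $p$ controls, which is the setting in which Lasso-type first stages are designed to work.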

4.1.2 Kernel Estimation

Let $N\in\left\{200,500\right\}$ be the sample size, $D_{i}\sim Bernoulli(0.5)$ , and $X_{i}\mid D_{i}\sim N\left(D_{i},1\right)$ . At $t=0$ , the potential outcome is generated

[TABLE]

and at $t=1$ ,

[TABLE]

where $\theta_{0}=3$ and all error terms follow $N\left(0,0.1\right)$. Researchers observe $\left\{Y_{i}\left(0\right),Y_{i}\left(1\right),D_{i},X_{i}\right\}$ for $i=1,...,N$, where $Y_{i}\left(0\right)=Y_{i}^{0}\left(0\right)$ and $Y_{i}\left(1\right)=Y_{i}^{0}\left(1\right)\left(1-D_{i}\right)+Y_{i}^{1}\left(1\right)D_{i}$. Figures 7-8 present the results.

4.2 Repeated Cross Sections

4.2.1 ML Estimation

Let $N\in\left\{200,500\right\}$ be the sample size and $p\in\left\{100,300\right\}$ the dimension of control variables, $X_{i}\sim N\left(0.3,I_{p\times p}\right)$ . Also, let $\gamma_{0}=\left(1,1/2,1/3,1/4,1/5,0,...,0\right)\in\mathbb{R}^{p}$ and $D$ is generated by the propensity score

[TABLE]

At $t=0$ , the potential outcome is generated

[TABLE]

and at $t=1$ ,

[TABLE]

where $\beta_{0}=\gamma_{0}+0.5$ and $\theta_{0}=3$, and all error terms follow $N\left(0,0.1\right)$. Define $Y_{i}\left(0\right)=Y_{i}^{0}\left(0\right)$ and $Y_{i}\left(1\right)=Y_{i}^{0}\left(1\right)\left(1-D_{i}\right)+Y_{i}^{1}\left(1\right)D_{i}$. Let $T_{i}$ follow a Bernoulli distribution with parameter $0.5$. Researchers observe $\left\{Y_{i},T_{i},D_{i},X_{i}\right\}$ for $i=1,...,N$, where $Y_{i}=Y_{i}\left(0\right)+T_{i}\left(Y_{i}\left(1\right)-Y_{i}\left(0\right)\right)$. Figures 9-12 present the results.
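The sampling rule for repeated cross sections, in which only one period's outcome is seen per unit, can be sketched as follows. The function name and the `prob_post` argument are hypothetical; the default of 0.5 matches the Bernoulli parameter stated above.

```python
import random


def observe_cross_section(y_t0, y_t1, rng, prob_post=0.5):
    """Draw T ~ Bernoulli(prob_post) and return (Y, T) with Y = Y(0) + T (Y(1) - Y(0))."""
    t = 1 if rng.random() < prob_post else 0
    return y_t0 + t * (y_t1 - y_t0), t


if __name__ == "__main__":
    rng = random.Random(0)
    # Each unit contributes either its pre-period or its post-period outcome.
    for _ in range(3):
        print(observe_cross_section(y_t0=1.0, y_t1=4.0, rng=rng))
```

Because $T_{i}$ is drawn independently of the outcomes, each unit reveals $Y_{i}\left(0\right)$ or $Y_{i}\left(1\right)$ but never both, which is why the repeated-cross-sections estimator works with $\left(T-\lambda_{0}\right)Y$ rather than with $Y\left(1\right)-Y\left(0\right)$.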

4.2.2 Kernel Estimation

Let $N\in\left\{200,500\right\}$ be the sample size, $D_{i}\sim Bernoulli(0.5)$ , and $X_{i}\mid D_{i}\sim N\left(D_{i},1\right)$ . At $t=0$ , the potential outcome is generated

[TABLE]

and at $t=1$ ,

[TABLE]

where $\theta_{0}=3$ and all error terms follow $N\left(0,0.1\right)$. Let $Y_{i}\left(0\right)=Y_{i}^{0}\left(0\right)$ and $Y_{i}\left(1\right)=Y_{i}^{0}\left(1\right)\left(1-D_{i}\right)+Y_{i}^{1}\left(1\right)D_{i}$. Let $T_{i}\sim Bernoulli(0.5)$. Researchers observe $\left\{Y_{i},T_{i},D_{i},X_{i}\right\}$ for $i=1,...,N$, where $Y_{i}=Y_{i}\left(0\right)+T_{i}\left(Y_{i}\left(1\right)-Y_{i}\left(0\right)\right)$. Figures 13-14 present the results.

4.3 Multilevel Treatment

4.3.1 ML Estimation

Suppose there are two levels of treatment so that $W\in\left\{0,1,2\right\}$ . Let $N\in\left\{200,500\right\}$ be the sample size and $p\in\left\{100,300\right\}$ the dimension of control variables, $X_{i}\sim N\left(0,I_{p\times p}\right)$ . Also, let $\gamma_{0}=\left(1,1/2,1/3,1/4,1/5,0,...,0\right)\in\mathbb{R}^{p}$ and

[TABLE]

At $t=0$ , the potential outcome is generated

[TABLE]

and at $t=1$ ,

[TABLE]

where $\beta_{0}=\gamma_{0}+0.5$, $\theta_{10}=3$, and $\theta_{20}=6$, and all error terms follow $N\left(0,0.1\right)$. Researchers observe $\left\{Y_{i}\left(0\right),Y_{i}\left(1\right),W_{i},X_{i}\right\}$ for $i=1,...,N$, where $Y_{i}\left(0\right)=Y_{i}^{0}\left(0\right)$ and $Y_{i}\left(1\right)=Y_{i}^{0}\left(1\right)I\left(W_{i}=0\right)+Y_{i}^{1}\left(1\right)I\left(W_{i}=1\right)+Y_{i}^{2}\left(1\right)I\left(W_{i}=2\right)$. I focus on the estimation of the second-level ATT $\theta_{20}$. Figures 15-18 present the results.

4.3.2 Kernel Estimation

Suppose there are two levels of treatment so that $W\in\left\{0,1,2\right\}$ . Let $N$ be the sample size, $X_{i}\mid W_{i}\sim N\left(W_{i},1\right)$ , and

[TABLE]

At $t=0$ , the potential outcome is generated

[TABLE]

and at $t=1$ ,

[TABLE]

where $\theta_{10}=3$, $\theta_{20}=6$, and all error terms follow $N\left(0,0.1\right)$. Let $Y_{i}\left(0\right)=Y_{i}^{0}\left(0\right)$ and $Y_{i}\left(1\right)=Y_{i}^{0}\left(1\right)I\left(W_{i}=0\right)+Y_{i}^{1}\left(1\right)I\left(W_{i}=1\right)+Y_{i}^{2}\left(1\right)I\left(W_{i}=2\right)$. Researchers observe $\{Y_{i}\left(0\right),Y_{i}\left(1\right),W_{i},X_{i}\}$ for $i=1,...,N$. I focus on the estimation of the second-level ATT $\theta_{20}$. Figures 19-20 present the results.

5 Empirical Example

In this example, I analyze the effect of tariff reductions on corrupt behavior using the bribe payment data collected by Sequeira (2016) between South Africa and Mozambique. There have been theoretical and empirical debates on whether higher tariff rates increase incentives for corruption (Clotfelter, 1983; Sequeira and Djankov, 2014) or lower tariffs encourage agents to pay higher bribes through an income effect (Feinstein, 1991; Slemrod and Yitzhaki, 2002). The former argues that an increase in the tariff rate makes it more profitable to evade taxes on the margin. The latter argues that an increased tariff rate makes the taxpayer less wealthy and that this, under decreasing risk aversion to being penalized, tends to reduce evasion (Allingham and Sandmo, 1972).

Sequeira (2016) collected primary data on bribe payments between the ports in Mozambique and South Africa from 2007 to 2013. The treatment is the large reduction in the average nominal tariff rate (of 5 percent) occurring in 2008. Since not all products were on the tariff reduction list, a credible control group of products is available. This allows for a DID estimation.

This natural experiment between South Africa and Mozambique was previously studied by Sequeira (2016) by pooling the cross-section data between 2007 and 2013, with sample size $N=1084$, and estimating the effect of treatment using the traditional linear DID. Here I focus on the specification of one of the main results (Table 9 of Sequeira (2016)):

[TABLE]

where $y_{it}$ is the natural log of the amount of bribe paid for shipment $i$ in period $t$, conditional on paying a bribe. $TariffChangeCategory\in\left\{0,1\right\}$ denotes the treatment status of commodities, $POST\in\left\{0,1\right\}$ is an indicator for the years following 2008, and $BaselineTariff$ is the tariff rate before the tariff reduction. The specification also includes a vector of characteristics $\Gamma_{i}$, and time and individual fixed effects $p_{i}$, $w_{t}$, and $\delta_{i}$. The parameter $\gamma_{1}$ is the parameter of interest in the traditional linear DID estimation. Sequeira (2016) found that the amount of bribes paid dropped after the tariff reduction ($\hat{\gamma}_{1}=-2.928^{**}$).

I use the same data set, but instead of the traditional linear DID estimation, I estimate the ATT by the DID estimator of Abadie (2005) and by my proposed DID estimator $\tilde{\theta}$. Since the data are repeated cross sections, I construct the estimators based on (2.2) and (3.2), respectively. The estimators with first-stage kernel estimation contain one individual characteristic (the natural log of shipment value per ton), which is a significant characteristic in Table 9 of Sequeira (2016). The estimators with first-stage Lasso estimation contain a list of the significant characteristics in Table 9 of Sequeira (2016), which includes product, shipment, and firm-level characteristics and their interaction terms. Table 1 below shows the results. I find that all these estimators consistently suggest that a decrease in the tariff rate leads to lower bribe payments, but the effect of treatment may actually be substantially larger than previously reported by Sequeira (2016).

6 Conclusion

In this article, I have introduced three new DID estimators based on newly derived Neyman-orthogonal scores. These new scores do not require any conditions beyond the original conditions in Abadie (2005). The new DID estimators are particularly appropriate when researchers would like to use ML methods in the first-stage nonparametric estimation. When using kernel estimators in the first-stage estimation, the new DID estimators do not require undersmoothing to achieve $\sqrt{N}$-consistency. Hence, researchers can use standard data-driven methods, such as CV, to select bandwidths.

Bibliography

1. Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of Economic Studies, 72(1), 1–19.
2. Akee, R., Copeland, W., Costello, E. J., & Simeonova, E. (2018). How does household income affect child personality traits and behaviors? American Economic Review, 108(3), 775–827.
3. Allingham, M. G., & Sandmo, A. (1972). Income tax evasion: A theoretical analysis. Journal of Public Economics, 1, 323–338.
4. Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369–2429.
5. Belloni, A., Chernozhukov, V., Chetverikov, D., & Wei, Y. (2018). Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework. The Annals of S
6. Belloni, A., Chernozhukov, V., Fernández-Val, I., & Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1), 233–298.
7. Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608–650.
8. Card, D. (1990). The impact of the Mariel boatlift on the Miami labor market. ILR Review, 43(2), 245–257.