Nonseparable Multinomial Choice Models in Cross-Section and Panel Data

Victor Chernozhukov; Iván Fernández-Val; Whitney Newey

arXiv:1706.08418·stat.ME·May 10, 2018

TL;DR

This paper develops new identification strategies for nonseparable multinomial choice models using cross-section and panel data, addressing endogenous heterogeneity and time stationarity issues.

Contribution

It extends identification results to nonseparable models with endogenous heterogeneity and analyzes the limitations of time stationarity for identifying structural derivatives.

Findings

01

Derivatives of choice probabilities are weighted averages of utility derivatives.

02

In random coefficient models, the probability derivative at zero relates to the mean coefficients.

03

Time differences in derivatives identify utility derivatives on the diagonal, but not off the diagonal.

Equations

Y = 1 (U_{1} (X, ε) \geq U_{0} (X, ε)) .

Y = 1 (δ (X, ε) \geq 0), δ (X, ε) = U_{1} (X, ε) - U_{0} (X, ε) .

δ (X, ε) = β_{0}^{'} X + ε .

δ (X, ε) = η^{'} X + v .

P (X) := Pr (Y = 1 ∣ X) = Pr (δ (X, ε) \geq 0 ∣ X) = \int 1 (δ (X, ε) \geq 0) F_{ε} (d ε),

\partial_{x} P (x) = E [\partial_{x} δ (x, ε) ∣ δ (x, ε) = 0] \cdot f_{δ (x, ε)} (0) .

\partial_{x} P (x) = - \partial_{x} Pr (δ (x, ε) \leq y) ∣_{y = 0} .

τ_{u} (u) = \partial_{u} τ (u) = E [\partial_{u} h (u, ε) ∣ h (u, ε) = 0] \cdot f_{h (u, ε)} (0) .

\partial_{x} P (x) = β_{0} \cdot τ_{u} (β_{0}^{'} x) .

δ (x, η, v) \geq 0 ⟺ h (x, η) + v \geq 0,

P (x) = E [Pr (v \geq - h (x, η) ∣ η)] = E [1 - F_{v} (- h (x, η) ∣ η)] = \int [1 - F_{v} (- h (x, η) ∣ η)] F_{η} (d η),

\partial_{x} P (x) = E [\{\partial_{x} h (x, η)\} f_{v} (- h (x, η) ∣ η)] = \int [\partial_{x} h (x, η)] f_{v} (- h (x, η) ∣ η) d F_{η} (η) .

Pr (\tilde{Y} = 1 ∣ X) = Pr (\tilde{v} \leq P (X) ∣ X) = P (X) .

\partial_{x} P (x) = E [f_{v} (- η^{'} x ∣ η) η] .

\partial_{x} P (x) ∣_{x = 0} = E [f_{v} (0) η] = f_{v} (0) \cdot E [η] .

[\partial_{x_{j}} P (x) / \partial_{x_{k}} P (x)]_{x = 0} = E [η_{j}] / E [η_{k}] .

\left. \frac{\partial^{2} P (x)}{\partial x \partial x^{'}} \right|_{x = 0} = - E [η η^{'}] f_{v v} (0),

E [w (X) \partial_{x} P (X)] = E [w (X) f_{v} (- h (X, η) ∣ η) \{\partial_{x} h (X, η)\}] .

\partial_{x} P (x) = E [\frac{e^{ξ + x^{'} η}}{(1 + e^{ξ + x^{'} η})^{2}} η] .

Y_{j} = 1 (\{U_{j} (X, ε) \geq U_{k} (X, ε); k = 1, ..., J\}) .

P_{j} (x) = Pr \{U_{j} (x, ε) \geq U_{k} (x, ε); k = 1, ..., J\} = \int 1 (\{U_{j} (x, ε) \geq U_{k} (x, ε); k = 1, ..., J\}) F_{ε} (d ε),

U_{j} (x, ε) = u_{j} (x, η) + v_{j},

p_{j} (u ∣ η) := Pr (u_{j} + v_{j} \geq u_{k} + v_{k}; k = 1, ..., J ∣ η) .

p_{j k} (u ∣ η) := \partial p_{j} (u ∣ η) / \partial u_{k}, u (x, η) := (u_{1} (x, η), ..., u_{J} (x, η))^{'} .

\partial_{x} P_{j} (x) = E [\sum_{k = 1}^{J} p_{j k} (u (x, η) ∣ η) \partial_{x} u_{k} (x, η)] = \int [\sum_{k = 1}^{J} p_{j k} (u (x, η) ∣ η) \partial_{x} u_{k} (x, η)] F_{η} (d η) .

p_{j} (u ∣ η) = \frac{e^{u_{j}}}{\sum_{k = 1}^{J} e^{u_{k}}} .

\partial_{x} P_{j} (x) = E [\tilde{p}_{j} (x, η) \{\partial_{x} u_{j} (x, η) - \sum_{k = 1}^{J} \tilde{p}_{k} (x, η) \partial_{x} u_{k} (x, η)\}] .

\partial_{x^{ℓ}} P_{j} (x) = \int \tilde{p}_{j} (x, η) \{1 (j = j_{ℓ}) - \tilde{p}_{j_{ℓ}} (x, η)\} \partial_{x^{ℓ}} u_{j_{ℓ}} (x, η) F_{η} (d η) .

\partial_{x^{k}} P_{j} (x) = E [p_{j k} (u (x, η) ∣ η) η] .

\partial_{x^{k}} P_{j} (x) = p_{j k} (u (x, β_{0})) \cdot β_{0},

Taxonomy

Topics: Economic and Environmental Valuation · Spatial and Panel Data Analysis · Housing Market and Economics

Full text

Nonseparable Multinomial Choice Models in Cross-Section and Panel Data

Victor Chernozhukov

MIT Department of Economics, MIT, Cambridge, MA 02139, U.S.A.

Iván Fernández-Val

Boston University Department of Economics, Boston University, Boston, MA 02215, U.S.A.

Whitney K. Newey

MIT Department of Economics, MIT, Cambridge, MA 02139, U.S.A.

Abstract

Multinomial choice models are fundamental for empirical modeling of economic choices among discrete alternatives. We analyze identification of binary and multinomial choice models when the choice utilities are nonseparable in observed attributes and multidimensional unobserved heterogeneity with cross-section and panel data. We show that derivatives of choice probabilities with respect to continuous attributes are weighted averages of utility derivatives in cross-section models with exogenous heterogeneity. In the special case of random coefficient models with an independent additive effect, we further characterize that the probability derivative at zero is proportional to the population mean of the coefficients. We extend the identification results to models with endogenous heterogeneity using either a control function or panel data. In time stationary panel models with two periods, we find that differences over time of derivatives of choice probabilities identify utility derivatives “on the diagonal,” i.e. when the observed attributes take the same values in the two periods. We also show that time stationarity does not identify structural derivatives “off the diagonal” both in continuous and multinomial choice panel models.

Keywords: Multinomial choice, binary choice, nonseparable model, random coefficients, panel data, control function.

1 Introduction

Multinomial choice models are fundamental for empirical modeling of economic choices among discrete alternatives. Our starting point is the assumption that much of what determines preferences is unobserved to the econometrician. This assumption is consistent with many empirical demand and other studies where prices, income, and other observed variables explain only a small fraction of the variation in the data. From the beginning unobserved preference heterogeneity has had an important role in multinomial choice models. The classic formulation of McFadden (1974) allowed for unobserved heterogeneity through an additive term in the utility of each alternative. Hausman and Wise (1978) developed a more general specification where coefficients of regressors vary in unobserved ways among agents. Our results build on this pioneering work as well as other contributions to be discussed in what follows.

Economic theory does not generally restrict the way unobserved heterogeneity affects preferences. This observation motivates allowing for general forms of heterogeneity, as we do in this paper. We allow choice utilities to depend on observed characteristics and unobserved heterogeneity in general ways that need not be additively or multiplicatively separable. The specifications we consider allow for random coefficients but also more general specifications.

In this paper we show that derivatives of choice probabilities with respect to continuous observed attributes are weighted averages of utility derivatives. These results allow us to identify signs of utility derivatives as well as relative utility effects for different attributes. We also find that probability derivatives can be even more informative in special cases, such as random coefficients. For example, we find that for linear random coefficients with an independent additive effect the probability derivative at zero is proportional to the population mean of the coefficients.

We give choice probability derivative results for binary and multinomial choice. We do this for cross-section data where unobserved heterogeneity is independent of observed attributes. We also give derivative formulas for two cases with endogeneity. One is where the heterogeneity and utility variables are independent conditional on a control function. There we show that derivatives of choice probabilities conditional on the control function have a utility derivative interpretation. We also verify that under a common support condition, averaging over the control function gives structural function derivatives.

We also allow for endogeneity by using panel data. We give derivative formulas for discrete choice in panel data under the time stationarity condition of Manski (1987). For the constant coefficient case these give new identification results for ratios of coefficients of continuously distributed variables in panel data without requiring infinite support for any regressor or disturbance. The panel data results are partly based on Hoderlein and White (2012) as extended to the time stationary case by Chernozhukov et al. (2015). These results use the “diagonal” where regressors in two time periods are equal to each other.

We also consider identification “off the diagonal,” where regressors in different time periods are not equal to each other. For the case of a single regressor and two time periods we construct an alternative, observationally equivalent model that is linear in the regressor. This alternative model can have a different average utility derivative off the diagonal, showing that utility average derivatives are not identified there.

The model and goal of this paper are different from those of Burda, Harding, and Hausman (2008, 2010) and Gautier and Kitamura (2013). Their goal is to recover the distribution of heterogeneity in a linear random coefficients model. We consider a more general nonseparable model and a more modest goal of obtaining weighted average effects from probability derivatives. Our results provide a way of recovering certain averages of utility derivatives. Also, our results are simpler in only depending on nonparametric regressions rather than the Bayesian or deconvolution methods required to identify distributions of random coefficients.

Section 2 gives derivative formulae for binary choice. Section 3 extends these results to multinomial choice models. Section 4 obtains derivative results in the presence of a control function. Section 5 gives identification results for multinomial choice in panel data. Section 6 shows nonidentification off the diagonal. Section 7 concludes. The Appendix gives proofs.

2 Binary Choice Model

We first consider a binary choice model in cross-section data where we observe $(Y_{i},X_{i}),(i=1,...,n)$ with $Y\in\{0,1\}$ a binary choice variable and $X$ a vector of observed characteristics (regressors). Let $\varepsilon$ be a vector that is possibly infinite dimensional, representing unobserved aspects of agents’ preferences. We will assume that the utility of choices $0$ and $1$ is given by $U_{0}(X,\varepsilon)$ and $U_{1}(X,\varepsilon)$ respectively. The binary choice variable $Y$ is

$$Y=1\left(U_{1}(X,\varepsilon)\geq U_{0}(X,\varepsilon)\right).$$

Here we impose no restrictions on the way that $X$ and $\varepsilon$ interact. As we will discuss, this specification includes but is not limited to random coefficient models. This specification is general enough to be like the stochastic revealed preference setting of McFadden and Richter (1991).

We begin our analysis under the assumption that $X$ and $\varepsilon$ are independently distributed:

Assumption 1: (Independence) $X$ and $\varepsilon$ are independently distributed.

In what follows we will relax this condition when we have a control function or when we have panel data. It is helpful to think about this model as a threshold crossing model where

$$Y=1(\delta(X,\varepsilon)\geq 0),\quad\delta(X,\varepsilon)=U_{1}(X,\varepsilon)-U_{0}(X,\varepsilon).$$

The classic constant coefficients model is a special case where $\varepsilon$ is a scalar and

$$\delta(X,\varepsilon)=\beta_{0}^{\prime}X+\varepsilon.$$

This model only allows for additive unobserved heterogeneity. An important generalization is a random coefficients model where $\varepsilon=(v,\eta^{\prime})^{\prime}$ is a vector and

$$\delta(X,\varepsilon)=\eta^{\prime}X+v.$$

This specification allows for the coefficients of the regressors to vary with the individual. Hausman and Wise (1978) proposed such a specification for Gaussian $\varepsilon$ . Berry (1994) proposed a mixed logit/Gaussian specification where $v$ is the difference of Type I extreme value variables plus a constant and $\eta$ is Gaussian. Gautier and Kitamura (2013) gave results on identification and estimation of the distribution of $\eta$ when that distribution is unknown. The nonseparable specification we consider is more general in allowing for $\delta(X,\varepsilon)$ to be nonlinear in $X$ and/or $\varepsilon$ .

In this binary choice setting the choice probability is given by

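$$P(X):=\Pr(Y=1\mid X)=\Pr(\delta(X,\varepsilon)\geq 0\mid X)=\int 1(\delta(X,\varepsilon)\geq 0)F_{\varepsilon}(d\varepsilon),$$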

where $F_{\varepsilon}$ denotes the CDF of $\varepsilon$ . Here we derive a formula that relates the derivatives of the choice probability with respect to $X$ to the derivatives of $\delta(X,\varepsilon).$ Let $\partial_{x}$ denote the vector of partial derivatives with respect to all the continuously distributed components of $X$ and $\partial_{v}$ the partial derivative with respect to a scalar $v.$

Assumption 2: (Monotonicity) (i) For some $\eta$ and $v$ , $\varepsilon=(\eta^{\prime},v)^{\prime}$ where $v$ is a scalar, $\delta(x,\varepsilon)=\delta(x,\eta,v)$ is continuously differentiable in $x$ and $v$ , and there is $C>0$ such that $\partial_{v}\delta(x,\eta,v)\geq 1/C$ and $\|\partial_{x}\delta(x,\eta,v)\|\leq C$ everywhere. (ii) The variable $v$ is continuously distributed conditional on $\eta$ with a conditional density $f_{v}(v\mid\eta)$ that is bounded and continuous in $v$ .

As discussed below, for binary choice this condition will be equivalent to $\delta(x,\varepsilon)$ being additive in $v$ that is continuously distributed with a density satisfying the above condition. Let $f_{\delta(x,\varepsilon)}$ denote the density of $\delta(x,\varepsilon)$ .

Theorem 1: If Assumptions 1 and 2 are satisfied then,

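$$\partial_{x}P(x)=E[\partial_{x}\delta(x,\varepsilon)\mid\delta(x,\varepsilon)=0]\cdot f_{\delta(x,\varepsilon)}(0).$$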

Theorem 1 shows that derivatives of the choice probability are scalar multiples of averages of the derivative $\partial_{x}\delta(x,\varepsilon)$ conditional on being at the zero threshold, i.e. conditional on being indifferent between the two choices. Here the choice probability is one minus the CDF of $\delta(x,\varepsilon)$ at zero, so that the choice probability derivative is the negative of the CDF derivative at zero, i.e.

$$\partial_{x}P(x)=-\left.\partial_{x}\Pr(\delta(x,\varepsilon)\leq y)\right|_{y=0}.$$

The formula in Theorem 1 corresponds to the derivative of the CDF of a nonseparable model derived in Blomquist et al. (2014), which builds on the quantile derivative result of Hoderlein and Mammen (2007). The conclusion of Theorem 1 is an important application of this formula to the choice probability derivative in a nonseparable model.
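As a worked special case, consider the linear random coefficients specification $\delta(x,\varepsilon)=\eta^{\prime}x+v$ introduced above, and assume for illustration that $v$ has conditional density $f_{v}(v\mid\eta)$ as in Assumption 2. Conditional on $\eta$, the density of $\delta(x,\varepsilon)$ at a point $d$ is $f_{v}(d-\eta^{\prime}x\mid\eta)$, so the two factors in Theorem 1 can be written out and multiplied. This is only a sketch of how the general formula specializes, not an additional result.

```latex
\begin{align*}
f_{\delta(x,\varepsilon)}(0)
  &= E\left[f_{v}(-\eta^{\prime}x\mid\eta)\right], \\
E\left[\partial_{x}\delta(x,\varepsilon)\mid\delta(x,\varepsilon)=0\right]
  &= \frac{E\left[\eta\, f_{v}(-\eta^{\prime}x\mid\eta)\right]}
          {E\left[f_{v}(-\eta^{\prime}x\mid\eta)\right]}, \\
\partial_{x}P(x)
  &= E\left[\partial_{x}\delta(x,\varepsilon)\mid\delta(x,\varepsilon)=0\right]
     \cdot f_{\delta(x,\varepsilon)}(0)
   = E\left[\eta\, f_{v}(-\eta^{\prime}x\mid\eta)\right].
\end{align*}
```

The density at the threshold cancels the normalizing constant of the conditional expectation, which is why the product is a density-weighted average of the coefficients $\eta$.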

Assumption 2 restricts our model somewhat relative to the stochastic revealed preference model of McFadden and Richter (1991). It is possible to obtain another informative derivative formula under regularity conditions like those of Sasaki (2015) and Chernozhukov, Fernandez-Val, and Luo (2015), that are different than Assumption 2. Those conditions lead to a more general formula for $\partial_{x}P(x)$ . That formula allows for multiple crossings of the threshold $0$ while the monotonicity condition in Assumption 2 implies that there is only one threshold crossing conditional on $\eta$ . It is not clear how restrictive Assumption 2 or the alternative conditions are relative to the stochastic revealed preference setting of McFadden and Richter (1991). For brevity we omit further discussion of this issue.

Another special case of a nonseparable model is an index model where $\delta(x,\varepsilon)=h(\beta_{0}^{\prime}x,\varepsilon)$ for some constant coefficients $\beta_{0}$ . Here $P(X)=\tau(\beta_{0}^{\prime}X)$ for $\tau(u)=\Pr(h(u,\varepsilon)\geq 0)=\int 1(h(u,\varepsilon)\geq 0)F_{\varepsilon}(d\varepsilon).$ This model results in a choice probability that depends on $X$ only through the index $\beta_{0}^{\prime}X,$ similarly to Stoker (1986), Ichimura (1993), and Ai (1997). By Theorem 1 it follows that

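$$\tau_{u}(u)=\partial_{u}\tau(u)=E[\partial_{u}h(u,\varepsilon)\mid h(u,\varepsilon)=0]\cdot f_{h(u,\varepsilon)}(0).$$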

Differentiating with respect to the continuous components of $X$ gives the well known index derivative formula,

$$\partial_{x}P(x)=\beta_{0}\cdot\tau_{u}(\beta_{0}^{\prime}x).$$

Here the derivatives of the choice probability are scalar multiples of the components of $\beta_{0}$ .

There is an alternative version of Theorem 1 that provides further insight and motivates our multinomial choice results that follow in Section 3. By the monotonicity condition of Assumption 2

$$\delta(x,\eta,v)\geq 0\iff h(x,\eta)+v\geq 0,$$

where $h(x,\eta):=-\left.\delta^{-1}(x,\eta,r)\right|_{r=0}$ and the function inverse is with respect to the $v$ argument in $\delta(x,\eta,v).$ Then the choice probability is

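$$P(x)=E[\Pr(v\geq-h(x,\eta)\mid\eta)]=E[1-F_{v}(-h(x,\eta)\mid\eta)]=\int[1-F_{v}(-h(x,\eta)\mid\eta)]F_{\eta}(d\eta),\qquad(2.1)$$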

where $F_{v}(v\mid\eta)$ is the conditional CDF of $v$ given $\eta$ , and $F_{\eta}(\eta)$ is the CDF of $\eta.$ Differentiating the expression of $P(x)$ in (2.1) with respect to $x$ gives the following result:

Theorem 2: If Assumptions 1 and 2 are satisfied then

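$$\partial_{x}P(x)=E[\{\partial_{x}h(x,\eta)\}f_{v}(-h(x,\eta)\mid\eta)]=\int[\partial_{x}h(x,\eta)]f_{v}(-h(x,\eta)\mid\eta)\,dF_{\eta}(\eta).$$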

This formula is easier to interpret than the formula in Theorem 1. Here we clearly see that the derivative of the choice probability is a weighted average of the derivative $\partial_{x}h(x,\eta)$ where the weight is the conditional pdf of $v$ given $\eta$ evaluated at $-h(x,\eta).$
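The weighted-average interpretation can be checked numerically. The sketch below is purely illustrative: the three-point distribution of $\eta$, the function $h(x,\eta)=a x+b\sin(x)$, and the normal conditional distribution of $v$ are hypothetical choices, not taken from the paper. It compares the expression $E[\partial_{x}h(x,\eta)f_{v}(-h(x,\eta)\mid\eta)]$ with a central finite difference of $P(x)=E[1-F_{v}(-h(x,\eta)\mid\eta)]$.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical heterogeneity: eta = (a, b) takes three values.
# Each row is (a, b, sigma, probability), with v | eta ~ N(0, sigma^2).
ETA_SUPPORT = [
    (1.0, 0.5, 1.0, 0.3),
    (-0.5, 2.0, 0.7, 0.5),
    (2.0, -1.0, 1.5, 0.2),
]


def h(x, a, b):
    # Illustrative nonseparable index: nonlinear in x, varying with eta.
    return a * x + b * math.sin(x)


def dh_dx(x, a, b):
    return a + b * math.cos(x)


def choice_probability(x):
    # P(x) = E[1 - F_v(-h | eta)]; for a centered normal this is Phi(h / sigma).
    return sum(p * normal_cdf(h(x, a, b) / s) for a, b, s, p in ETA_SUPPORT)


def theorem2_derivative(x):
    # E[dh/dx * f_v(-h | eta)], where f_v(v | eta) = phi(v / sigma) / sigma.
    return sum(
        p * dh_dx(x, a, b) * normal_pdf(-h(x, a, b) / s) / s
        for a, b, s, p in ETA_SUPPORT
    )


def finite_difference(x, step=1e-5):
    # Central difference approximation to dP/dx.
    return (choice_probability(x + step) - choice_probability(x - step)) / (2.0 * step)


if __name__ == "__main__":
    for x in (-1.0, 0.0, 0.4, 2.0):
        print(x, theorem2_derivative(x), finite_difference(x))
```

The two columns printed for each `x` should agree up to finite-difference error. Making `sigma` vary with the support point is a deliberate choice, so that the density weight genuinely depends on $\eta$.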

It is well known that the binary choice model is observationally equivalent to a threshold crossing model where $h(x,\eta)$ is nonrandom. Let $\tilde{h}(x)=P(x),$ $\tilde{v}$ be distributed $U(0,1)$ independently of $X$ , and $\tilde{Y}=1(P(X)-\tilde{v}\geq 0)$ . Then

$$\Pr(\tilde{Y}=1\mid X)=\Pr(\tilde{v}\leq P(X)\mid X)=P(X).$$
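This observational equivalence can be illustrated with a small simulation. The sketch assumes a hypothetical scalar random coefficient model $\delta=\eta x+v$ with a three-point $\eta$ and standard normal $v$ (none of these values come from the paper), and compares the choice frequency from the structural model with that from the threshold model $\tilde{Y}=1(P(x)-\tilde{v}\geq 0)$ with uniform $\tilde{v}$.

```python
import math
import random


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical distribution of the scalar random coefficient: (eta, probability).
ETA_VALUES = [(-1.0, 0.25), (0.5, 0.5), (2.0, 0.25)]


def choice_probability(x):
    # P(x) = E[Pr(v >= -eta * x | eta)] with v ~ N(0, 1) independent of eta.
    return sum(p * normal_cdf(eta * x) for eta, p in ETA_VALUES)


def simulate_structural(x, n, rng):
    """Frequency of Y = 1(eta * x + v >= 0) with heterogeneous eta."""
    etas = [eta for eta, _ in ETA_VALUES]
    probs = [p for _, p in ETA_VALUES]
    count = 0
    for _ in range(n):
        eta = rng.choices(etas, weights=probs)[0]
        v = rng.gauss(0.0, 1.0)
        count += eta * x + v >= 0.0
    return count / n


def simulate_equivalent(x, n, rng):
    """Frequency of Y~ = 1(P(x) - v~ >= 0) with v~ ~ U(0, 1) and nonrandom index."""
    p = choice_probability(x)
    return sum(p - rng.random() >= 0.0 for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.8
    print("P(x)            :", choice_probability(x))
    print("structural freq :", simulate_structural(x, 100_000, rng))
    print("equivalent freq :", simulate_equivalent(x, 100_000, rng))
```

Both simulated frequencies approximate the same $P(x)$ even though the second model has no heterogeneity in the index, so the data alone cannot distinguish the two.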

This feature of binary choice models is not important for our purposes. Our purpose is to provide interpretations of $P(x)$ and its derivatives in the case where choice utilities vary across individuals in more complicated ways than through an additive effect. Essentially we have strong, a priori views that the utilities of different individuals are not just additive shifts of one another.

An important kind of varying utility is one with random coefficients, where $h(x,\eta)=x^{\prime}\eta.$ In this case the conclusion of Theorem 2 is that

$$\partial_{x}P(x)=E[f_{v}(-\eta^{\prime}x\mid\eta)\eta].$$

It is interesting to note when $v$ is independent of $\eta$ that at $x=0$

$$\left.\partial_{x}P(x)\right|_{x=0}=E[f_{v}(0)\eta]=f_{v}(0)\cdot E[\eta].$$

Thus, when $X$ has positive density around zero and $v$ and $\eta$ are independent, the derivative of the choice probability at zero estimates the expected value of the random coefficients up to scale. Consequently

$$\left[\partial_{x_{j}}P(x)/\partial_{x_{k}}P(x)\right]_{x=0}=E[\eta_{j}]/E[\eta_{k}].$$

This equation is a binary choice analog of the result that in a linear random coefficients model the regression of $Y$ on $X$ estimates the expectation of the coefficients. With binary choice only ratios of coefficients are identified, so here only ratios of expected values are identified.
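A minimal numerical sketch of this ratio result follows. It assumes a hypothetical three-point joint distribution for $\eta=(\eta_{1},\eta_{2})$ and takes $v$ to be a logistic variable with location `V_LOCATION`, independent of $\eta$; these are illustrative choices only. The gradient of $P(x)$ at $x=0$ is approximated by central differences and its ratio is compared with $E[\eta_{1}]/E[\eta_{2}]$.

```python
import math

# Hypothetical joint distribution of (eta_1, eta_2): (eta_1, eta_2, probability).
ETA_SUPPORT = [(1.0, 2.0, 0.40), (3.0, -1.0, 0.35), (-0.5, 4.0, 0.25)]

# v = V_LOCATION + standard logistic noise, independent of eta (an assumption).
V_LOCATION = 0.3


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def choice_probability(x1, x2):
    # Pr(v >= -(eta' x)) = Lambda(V_LOCATION + eta' x), averaged over eta.
    return sum(
        p * logistic_cdf(V_LOCATION + e1 * x1 + e2 * x2)
        for e1, e2, p in ETA_SUPPORT
    )


def gradient_at_zero(step=1e-5):
    # Central differences for the two partial derivatives of P at x = 0.
    d1 = (choice_probability(step, 0.0) - choice_probability(-step, 0.0)) / (2.0 * step)
    d2 = (choice_probability(0.0, step) - choice_probability(0.0, -step)) / (2.0 * step)
    return d1, d2


def mean_coefficients():
    m1 = sum(p * e1 for e1, _, p in ETA_SUPPORT)
    m2 = sum(p * e2 for _, e2, p in ETA_SUPPORT)
    return m1, m2


if __name__ == "__main__":
    d1, d2 = gradient_at_zero()
    m1, m2 = mean_coefficients()
    print("derivative ratio at zero:", d1 / d2)
    print("ratio of mean coefficients:", m1 / m2)
```

Only the ratio is compared because the common scale factor $f_{v}(0)$ is not separately identified.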

Corollary 3: If Assumptions 1 and 2 are satisfied, $\delta(x,\varepsilon)=\eta^{\prime}x+v,$ and $v$ is independent of $\eta,$ then equation (2.2) is satisfied.

Higher order derivatives of the choice probability are also informative about the distribution of random coefficients. For example, when $v$ is independent of $\eta$ then differentiating twice with respect to $x$ gives

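$$\left.\frac{\partial^{2}P(x)}{\partial x\,\partial x^{\prime}}\right|_{x=0}=-E[\eta\eta^{\prime}]f_{vv}(0),$$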

where $f_{vv}(v)=\partial_{v}f_{v}(v)$ . Thus the second derivative of the probability is the second moment matrix of the random coefficients, up to scale. When $f_{vv}(0)$ is nonzero this result allows us to identify correlations among random coefficients as well as relative variances. It follows similarly that higher order derivatives will be scalar multiples of higher-order moments of $\eta.$
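The steps behind the second-derivative statement can be written out as follows, assuming $h(x,\eta)=x^{\prime}\eta$, $v$ independent of $\eta$ with a differentiable density, and that differentiation under the expectation is permitted (a regularity condition taken for granted in this sketch).

```latex
\begin{align*}
P(x) &= E\left[1-F_{v}(-\eta^{\prime}x)\right], \\
\partial_{x}P(x) &= E\left[f_{v}(-\eta^{\prime}x)\,\eta\right], \\
\frac{\partial^{2}P(x)}{\partial x\,\partial x^{\prime}}
  &= -E\left[f_{vv}(-\eta^{\prime}x)\,\eta\eta^{\prime}\right], \\
\left.\frac{\partial^{2}P(x)}{\partial x\,\partial x^{\prime}}\right|_{x=0}
  &= -f_{vv}(0)\,E\left[\eta\eta^{\prime}\right].
\end{align*}
```

The minus sign comes from the chain rule applied to the argument $-\eta^{\prime}x$. One practical caveat: if the density of $v$ is differentiable and symmetric about zero (for example a centered normal or logistic), then $f_{vv}(0)=0$ and the second derivative at the origin carries no information about $E[\eta\eta^{\prime}]$, which is why the condition that $f_{vv}(0)$ be nonzero matters.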

Weighted average derivatives of the choice probability can be used to summarize the effect of $x$ on $h(x,\eta)$ . From Theorem 2 we can see that weighted average derivatives will be weighted averages of $\partial_{x}h(x,\eta)$ conditional on $v=-h(x,\eta)$ . In particular, for any bounded nonnegative function $w(x)$ it follows that

$$E[w(X)\partial_{x}P(X)]=E[w(X)f_{v}(-h(X,\eta)\mid\eta)\{\partial_{x}h(X,\eta)\}].$$

Here the derivative is weighted by both $w(X)$ and the density $f_{v}(-h(X,\eta)\mid\eta)$ . The density weight is present because the derivatives of $h(x,\eta)$ have been “filtered” through the discrete choice and so the probability derivative only recovers effects where $h(x,\eta)+v=0$ .

An example that we will consider repeatedly is random coefficients logit. For binary choice this model has $v=\xi+\rho,$ where $\xi$ is a constant and $\rho$ is the difference of two Type I extreme value disturbances that are independent of $\eta.$ Let $f_{v}(v)=e^{\xi-v}/[1+e^{\xi-v}]^{2}$ be the logistic pdf with location $\xi$ . Then the conclusion of Theorem 2 gives

$$\partial_{x}P(x)=E\left[\frac{e^{\xi+x^{\prime}\eta}}{(1+e^{\xi+x^{\prime}\eta})^{2}}\eta\right].$$

Here the probability derivative is a weighted average of the random coefficients, with the weight being the logistic pdf values evaluated at the regression $\xi+\eta^{\prime}x.$

3 Multinomial Choice Models

In this Section we extend the analysis to the nonseparable multinomial choice model. Here there are $J$ choices $j=1,...,J.$ Each choice has a utility $U_{j}(X,\varepsilon)$ associated with it, depending on observed characteristics $X$ and unobserved characteristics $\varepsilon$ . Let $Y_{j}$ denote the choice indicator that is equal to one if the $j^{th}$ alternative is chosen and zero otherwise. Then

$$Y_{j}=1(\{U_{j}(X,\varepsilon)\geq U_{k}(X,\varepsilon);\ k=1,...,J\}).$$

The probability $P_{j}(x):=\Pr(Y_{j}=1\mid X=x)$ that $j$ is chosen conditional on $X=x$ is the probability that $U_{j}(x,\varepsilon)$ is the maximum utility among the $J$ choices, i.e.

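$$P_{j}(x)=\Pr\{U_{j}(x,\varepsilon)\geq U_{k}(x,\varepsilon);\ k=1,...,J\}=\int 1(\{U_{j}(x,\varepsilon)\geq U_{k}(x,\varepsilon);\ k=1,...,J\})F_{\varepsilon}(d\varepsilon),$$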

where we maintain Assumption 1 and assume that the probability of ties is zero.

We can obtain a useful formula for the derivative of this probability under a condition analogous to Assumption 2. Recall that the monotonicity condition of Assumption 2 is equivalent to the existence of a scalar additive disturbance. Here we will impose scalar additive disturbances from the outset.

Assumption 3: (Multinomial Choice) There are $\eta,v_{j},u_{j}(x,\eta),(j=1,...,J)$ such that $\varepsilon=(\eta^{\prime},v^{\prime})^{\prime}$ for $v:=(v_{1},...,v_{J}),$

$$U_{j}(x,\varepsilon)=u_{j}(x,\eta)+v_{j},$$

and $u_{j}(x,\eta)$ is continuously differentiable in $x$ with bounded derivative.

In this condition we assume directly an additive disturbance condition that we showed is equivalent to Assumption 2 in the binary case. Assumption 3 generalizes that additive disturbance condition to multinomial choice. Similarly to binary choice, we are not sure what restrictions this additive specification would impose in the stochastic revealed preference setting of McFadden and Richter (1991). For brevity we do not give a result for a multinomial version of Assumption 1 which is quite complicated.

As in the binary choice case, we could formulate the results in terms of differences of utilities. However, we find it convenient to work directly with choice specific utilities $U_{j}(x,\varepsilon)=u_{j}(x,\eta)+v_{j}$ rather than differences. Let $u:=(u_{1},...,u_{J})$ denote a $J\times 1$ vector of constants and

$$p_{j}(u\mid\eta):=\Pr(u_{j}+v_{j}\geq u_{k}+v_{k};\ k=1,...,J\mid\eta).$$

This $p_{j}(u\mid\eta)$ is just the usual multinomial choice probability, conditioned on $\eta$ . When the conditional density $f_{v}(v\mid\eta)$ is continuous, $p_{j}(u\mid\eta)$ will be continuously differentiable in each $u_{k}.$ Let

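$$p_{jk}(u\mid\eta):=\partial p_{j}(u\mid\eta)/\partial u_{k},\quad u(x,\eta):=(u_{1}(x,\eta),...,u_{J}(x,\eta))^{\prime}.$$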

Theorem 4: If Assumptions 1 and 3 are satisfied, the conditional density $f_{v}(v\mid\eta)$ of $v$ given $\eta$ is continuous in $v,$ and $p_{jk}(u\mid\eta),(j,k=1,...,J)$ are bounded, then $P_{j}(x)$ is differentiable in $x$ and

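$$\partial_{x}P_{j}(x)=E\left[\sum_{k=1}^{J}p_{jk}(u(x,\eta)\mid\eta)\partial_{x}u_{k}(x,\eta)\right]=\int\left[\sum_{k=1}^{J}p_{jk}(u(x,\eta)\mid\eta)\partial_{x}u_{k}(x,\eta)\right]F_{\eta}(d\eta).$$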

Example 1: (Multinomial Logit Model) Here $v$ is a vector of i.i.d. Type I extreme value random variables independent of $\eta$ . The conditional choice probabilities $p_{j}$ have the multinomial logit form

$$p_{j}(u\mid\eta)=\frac{e^{u_{j}}}{\sum_{k=1}^{J}e^{u_{k}}}.$$

Define $\tilde{p}_{j}(x,\eta):=e^{u_{j}(x,\eta)}/\sum_{k=1}^{J}e^{u_{k}(x,\eta)}.$ Then,

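$$\partial_{x}P_{j}(x)=E\left[\tilde{p}_{j}(x,\eta)\left\{\partial_{x}u_{j}(x,\eta)-\sum_{k=1}^{J}\tilde{p}_{k}(x,\eta)\partial_{x}u_{k}(x,\eta)\right\}\right].$$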

For example, if some $x^{\ell}$ affects only $u_{j_{\ell}}(x,\eta)$ for some $j_{\ell}$ , then

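$$\partial_{x^{\ell}}P_{j}(x)=\int\tilde{p}_{j}(x,\eta)\{1(j=j_{\ell})-\tilde{p}_{j_{\ell}}(x,\eta)\}\partial_{x^{\ell}}u_{j_{\ell}}(x,\eta)F_{\eta}(d\eta).$$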
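The multinomial logit derivative in Example 1 can be checked numerically. The sketch below assumes a hypothetical setup with $J=3$ alternatives, a scalar regressor $x$, a two-point scalar $\eta$, and made-up utility functions; it evaluates $E[\tilde{p}_{j}\{\partial_{x}u_{j}-\sum_{k}\tilde{p}_{k}\partial_{x}u_{k}\}]$ and compares it with a central finite difference of $P_{j}(x)$.

```python
import math

# Hypothetical distribution of a scalar eta: (eta, probability).
ETA_VALUES = [(0.5, 0.6), (1.5, 0.4)]
NUM_ALTERNATIVES = 3


def utilities(x, eta):
    # Illustrative utilities u_j(x, eta); x does not enter alternative 3.
    return [eta * x, 0.3 - 0.5 * eta * x * x, 0.0]


def utility_derivatives(x, eta):
    # Derivatives of the utilities above with respect to x.
    return [eta, -eta * x, 0.0]


def softmax(u):
    # Multinomial logit probabilities, computed stably.
    m = max(u)
    weights = [math.exp(value - m) for value in u]
    total = sum(weights)
    return [w / total for w in weights]


def choice_probabilities(x):
    # P_j(x) = E[p~_j(x, eta)] over the discrete distribution of eta.
    probs = [0.0] * NUM_ALTERNATIVES
    for eta, p in ETA_VALUES:
        p_tilde = softmax(utilities(x, eta))
        for j in range(NUM_ALTERNATIVES):
            probs[j] += p * p_tilde[j]
    return probs


def logit_derivative_formula(x):
    # E[p~_j * (du_j/dx - sum_k p~_k * du_k/dx)] for each alternative j.
    grads = [0.0] * NUM_ALTERNATIVES
    for eta, p in ETA_VALUES:
        p_tilde = softmax(utilities(x, eta))
        du = utility_derivatives(x, eta)
        average_du = sum(pk * dk for pk, dk in zip(p_tilde, du))
        for j in range(NUM_ALTERNATIVES):
            grads[j] += p * p_tilde[j] * (du[j] - average_du)
    return grads


def finite_difference(x, step=1e-5):
    up = choice_probabilities(x + step)
    down = choice_probabilities(x - step)
    return [(a - b) / (2.0 * step) for a, b in zip(up, down)]


if __name__ == "__main__":
    print(logit_derivative_formula(0.7))
    print(finite_difference(0.7))
```

Because the choice probabilities sum to one, the derivatives across alternatives sum to zero, which gives a quick sanity check on any implementation of the formula.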

Another important class of examples are those where $u_{j}(x,\eta)=\eta^{\prime}x^{j}+\xi_{j}$ for choice specific observable characteristics $x^{j}$ and constant $\xi_{j}.$ This example is similar to Berry, Levinsohn and Pakes (1995) where $x^{j}$ could be thought of as the characteristics of an object for choice $j$ , such as characteristics of the $j^{th}$ car type. Here an additional unit of some component of $x^{j}$ affects the utility the same for each alternative $j$ . In this class of examples,

$$\partial_{x^{k}}P_{j}(x)=E[p_{jk}(u(x,\eta)\mid\eta)\eta].$$

Here we see that the derivative of the $j^{th}$ choice probability with respect to the regressor vector $x^{k}$ for the $k^{th}$ alternative is an expectation of the random coefficients multiplied by a scalar $\partial p_{j}(u(x,\eta)\mid\eta)/\partial u_{k}$ . As in the binary case if $\eta$ is a constant vector $\beta_{0}$ then

$$\partial_{x^{k}}P_{j}(x)=p_{jk}(u(x,\beta_{0}))\cdot\beta_{0},$$

so that the derivative of the choice probability is proportional to $\beta_{0}$ for all $x^{k}$ . Also if $v$ is independent of $\eta$ so that $p_{j}(u\mid\eta)=p_{j}(u)$ , and each of the characteristic vectors is zero, then the scalar is a constant and

[TABLE]

Similarly to the binary case the derivative of the probability at the origin is a scalar multiple of the expectation of the random coefficients. Moreover, it can be shown as in the binary case that higher-order derivatives identify higher-order moments of $\eta$ , up to scale.

4 Control Functions

A model where it is possible to allow for nonindependence between $\varepsilon$ and $X$ is one where there is an observable or estimable control function $w$ satisfying

Assumption 4: (Control Function) $X$ and $\varepsilon$ are independently distributed conditional on $w$ .

As shown in Blundell and Powell (2004) and Imbens and Newey (2009), conditioning on a control function helps to identify objects of interest. (Footnote 1: Berry and Haile (2010) considered an alternative approach based on the availability of “special regressors” and instrumental variables satisfying completeness conditions in multinomial choice demand models where the endogenous part of the unobserved heterogeneity is scalar. This approach identifies the entire distribution of random utilities.) Here we show how a control function can be used to estimate averages of utility derivatives. These derivatives will be exactly analogous to those considered previously, except that we also condition on the control function.

The choices $Y_{j}$ are determined as before but now we consider choice probabilities that condition on $w$ as well as $X.$ These probabilities are given by

[TABLE]

Let $u:=(u_{1},...,u_{J})$ denote a $J\times 1$ vector of constants and

$$p_{j}(u\mid\eta,w):=\Pr(u_{j}+v_{j}\geq u_{k}+v_{k};\ k=1,...,J\mid\eta,w).$$

This $p_{j}(u\mid\eta,w)$ is just the usual multinomial choice probability, conditioned on $\eta$ and $w.$ When the conditional density of $v$ given $\eta$ and $w$ is continuous, $p_{j}(u\mid\eta,w)$ will be continuously differentiable in each $u_{k}.$ Let $p_{jk}(u\mid\eta,w):=\partial p_{j}(u\mid\eta,w)/\partial u_{k}$ and $u(x,\eta):=(u_{1}(x,\eta),...,u_{J}(x,\eta))^{\prime}$ as before.

Theorem 5: If Assumptions 3 and 4 are satisfied, the conditional density $f_{v}(v\mid\eta,w)$ of $v$ given $\eta$ and $w$ is continuous in $v,$ and $p_{jk}(u\mid\eta,w),(j,k=1,...,J)$ are bounded, then $P_{j}(x,w)$ is differentiable in $x$ and

[TABLE]

Example 1 (cont.): Consider the multinomial logit model where $v$ is also independent of $w.$ Then,

[TABLE]

For example, if some $x^{\ell}$ affects only $u_{j_{\ell}}(x,\eta)$ for some $j_{\ell}$ , then

[TABLE]

We can also obtain results for the random coefficient model, conditional on the control variable, analogous to those in the previous section. We do not present these results for the sake of brevity.

As is known from the previous literature, integrating over the marginal distribution of the control function gives probability derivatives identical to those for $X$ and $\varepsilon$ independent, when a common support condition is satisfied:

Corollary 6: If Assumptions 3 and 4 are satisfied, the conditional density $f_{v}(v\mid\eta,w)$ of $v$ given $\eta$ and $w$ is continuous in $v$ and bounded, and the conditional support for $w$ given $X=x$ equals the marginal support for $w$ , then $P_{j}(x,w)$ is differentiable in $x$ and

[TABLE]

where $F_{w}(w)$ is the CDF of $w$ .

It is interesting to note that the common support condition is not needed for identification of interesting effects. Averages of utility derivatives are identified from probability derivatives, conditional on the control function, as in Theorem 5. Also, because $\eta$ is independent of $X$ conditional on $w$ , averages over $\eta$ conditional on $X$ can be identified by integrating the objects in Theorem 5 over the conditional distribution of $w$ given $X$ . This integration gives local average probability responses analogous to the local average response given in Altonji and Matzkin (2005). In addition, averaging over the joint distribution of $X$ and $w$ gives average derivatives analogous to those considered by Imbens and Newey (2009). None of these effects rely on the common support condition.

5 Panel Data

Panel data can also help us identify averages of utility derivatives when $X$ and $\varepsilon$ are not independent. Invariance over time of the distribution of $\varepsilon$ conditional on the observed $X$ for all time periods can allow us to identify utility derivative averages analogous to those we have considered. This invariance over time of the distribution of $\varepsilon$ conditional on regressors is the basis of previous panel identification results by Manski (1987), Honore (1992), Abrevaya (2000), Chernozhukov et al. (2013), Graham and Powell (2012), Chernozhukov et al. (2015), and is an important hypothesis in Hoderlein and White (2012). Pakes and Porter (2014) and Shi et al. (2017) have given identification results for multinomial choice models under this condition. These papers allow for some time effects while Evdokimov (2010) allowed for general time effects while imposing independence and additivity among disturbances.

In this Section we consider panel binary and multinomial choice models. We focus on the case of two time periods. We start with the panel version of the general nonseparable binary choice model of Section 2. Here $Y_{t}\in\{0,1\}$ is the binary choice variable and $X_{t}$ the vector of observed characteristics (regressors) at time $t$ . We consider the threshold crossing model

$$Y_{t}=1(\delta(X_{t},\varepsilon_{t})\geq 0),\qquad t=1,2,\tag{5.1}$$

where $\delta(X_{t},\varepsilon_{t})$ represents the difference in utility between choices $0$ and $1$ at time $t$, and $\varepsilon_{t}$ is a vector of unobserved heterogeneity at time $t$ which includes time variant and time invariant components such as individual effects. The time stationarity of the difference in utility is important to our results, though it may be possible to relax that condition similarly to Chernozhukov et al. (2015).

With panel data we can replace the assumption of independence of $\varepsilon_{t}$ and $X_{t}$ with the following time stationarity condition that is automatically satisfied by the time invariant components of $\varepsilon_{t}$ :

Assumption 5: (Time Stationarity) The distribution of $\varepsilon_{t}$ given $\mathbf{X}:=(X_{1},X_{2})$ does not depend on $t.$

To identify averages of utility derivatives we use the choice probability conditional on the regressors for both time periods, given by

[TABLE]

where the CDF of $\varepsilon_{t},$ $F_{\varepsilon}$ , does not depend on $t$ by Assumption 5. Assume that $\delta(X_{t},\varepsilon_{t})$ and $F_{\varepsilon}(d\varepsilon\mid X_{1},X_{2})$ are differentiable in $X_{t}$ . Then an argument similar to Theorem 1 yields

[TABLE]

The first term is a scalar multiple of the average utility derivative conditional on the regressors at both periods and on being indifferent between the two choices at time $t$ . The second term is heterogeneity bias coming from the dependence between $X_{t}$ and $\varepsilon_{t}$ .

The next result shows that differences of derivatives of the choice probability identify up to a constant the average utility derivative on the diagonal where the regressors do not change over the two periods. As in Chernozhukov et al. (2015), time stationarity allows us to difference out the confounding effect of $X_{t}$ that acts through the correlation of $X_{t}$ with $\varepsilon_{t}$ .

Theorem 7: If Assumptions 2(i) and 5 are satisfied, $v_{t}$ is continuously distributed conditional on $\eta_{t}$ and $\mathbf{X}$ with a conditional density $f_{v}(v\mid\eta_{t},\mathbf{X})$ that is bounded and continuous in $v,$ the conditional density $f_{\eta}(\eta\mid X_{1},X_{2})$ of $\eta_{t}$ given $(X_{1},X_{2})$ is continuous in $X_{2}$, and there is $\delta>0$ such that

[TABLE]

then,

[TABLE]
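The differencing argument can be illustrated numerically. The sketch below assumes a hypothetical Gaussian design that is not from the paper: $\delta(X_{t},\varepsilon_{t})=\beta X_{t}+a+v_{t}$ with an individual effect $a\mid\mathbf{X}\sim N(\gamma(X_{1}+X_{2}),1)$ and $v_{t}\sim N(0,1)$ independent of everything else, so that Assumption 5 holds and $\Pr(Y_{t}=1\mid\mathbf{X})=\Phi((\beta X_{t}+\gamma(X_{1}+X_{2}))/\sqrt{2})$. The names `BETA`, `GAMMA`, `P`, and `d_dx2` are illustrative.

```python
import math

BETA, GAMMA = 1.0, 0.7  # hypothetical utility slope and strength of endogeneity

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def P(t, x1, x2):
    """Pr(Y_t = 1 | X_1 = x1, X_2 = x2) in the assumed design."""
    xt = x1 if t == 1 else x2
    return Phi((BETA * xt + GAMMA * (x1 + x2)) / math.sqrt(2.0))

def d_dx2(t, x1, x2, h=1e-5):
    """Central-difference derivative of P(t, x1, x2) with respect to x2."""
    return (P(t, x1, x2 + h) - P(t, x1, x2 - h)) / (2.0 * h)

x = 0.4  # a point on the diagonal X_1 = X_2 = x
naive = d_dx2(2, x, x)                      # still contains the heterogeneity bias
differenced = d_dx2(2, x, x) - d_dx2(1, x, x)
# Utility derivative BETA times the density of delta at zero, in this design.
target = BETA * phi((BETA + 2.0 * GAMMA) * x / math.sqrt(2.0)) / math.sqrt(2.0)
```

In this design `differenced` matches `target`, while `naive` also picks up the term proportional to `GAMMA` that comes from the dependence of the individual effect on the regressors.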

We now turn to the multinomial choice model. Let $Y_{jt}$ denote the choice indicator, equal to $1$ if alternative $j$ is chosen in time period $t$. We assume that choice is based on a time stationary utility function $U_{j}(x,\varepsilon)=u_{j}(x,\eta)+v_{j}$ having the additive form considered in the previous Section. Again, it may be possible to relax the time stationarity of the utility similarly to Chernozhukov et al. (2015).

It is assumed that the individual makes the choice that maximizes utility in each time period, so that

[TABLE]

To identify averages of utility derivatives we use again choice probabilities conditional on the regressors for both time periods, given by

[TABLE]

For a constant vector $u:=(u_{1},...,u_{J})$ let

[TABLE]

This is like the usual choice probability, as discussed earlier, only now it depends on $\mathbf{X}$ as well as $\eta_{t}.$ What allows us to identify derivative effects despite the dependence of $p_{j}$ on $\mathbf{X}$ is that $p_{j}$ does not depend on $t$ because of the time stationarity condition of Assumption 5. Time stationarity allows us again to difference out the confounding effect of $\mathbf{X}$ that acts through the correlation of $\mathbf{X}$ with $\varepsilon$ . By iterated expectations the choice probability is

$$P_{jt}(\mathbf{X})=\mathrm{E}[p_{j}(u(X_{t},\eta_{t})\mid\eta_{t},\mathbf{X})\mid\mathbf{X}].$$

The difference over two time periods is

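$$P_{j2}(\mathbf{X})-P_{j1}(\mathbf{X})=\mathrm{E}[p_{j}(u(X_{2},\eta_{2})\mid\eta_{2},\mathbf{X})-p_{j}(u(X_{1},\eta_{2})\mid\eta_{2},\mathbf{X})\mid\mathbf{X}],$$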

where we have used the time stationarity in replacing $\eta_{1}$ by $\eta_{2}$ in $P_{j1}(\mathbf{X}).$ When we differentiate this difference with respect to $X_{2}$ and evaluate at points where $X_{1}=X_{2},$ the presence of $P_{j1}(\mathbf{X})$ removes all the derivatives with respect to $X_{2}$ except the utility derivatives.

If the conditional density $f_{v}(v\mid\eta_{t},\mathbf{X})$ is continuous in $v$ then $p_{j}(u\mid\eta_{t},\mathbf{X})$ will be continuously differentiable in $u.$ Let $p_{jk}(u\mid\eta_{t},\mathbf{X}):=\partial p_{j}(u\mid\eta_{t},\mathbf{X})/\partial u_{k}$ .

Theorem 8: If Assumptions 3 and 5 are satisfied, the conditional density $f_{v}(v\mid\eta_{t},\mathbf{X})$ of $v_{t}$ given $\eta_{t}$ and $\mathbf{X}$ is continuous in $v,$ $p_{jk}(u\mid\eta_{t},\mathbf{X}),(j,k=1,...,J)$ are bounded, and $\mathrm{E}[p_{j}(u(x,\eta_{t})\mid\eta_{t},X_{1},X_{2})\mid X_{1},X_{2}]$ is differentiable in $x$ and $X_{2}$, then

[TABLE]

Example 1 (cont.): Consider the multinomial logit where $v_{t}$ consists of i.i.d. Type I extreme value random variables that are independent of $\eta_{t}$ and of $\mathbf{X}.$ Then,

[TABLE]

For example, if some $X_{2}^{\ell}$ affects only $u_{j_{\ell}}(X_{2},\eta)$ , then

[TABLE]
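For the logit case, the scalars $p_{jk}$ that weight the utility derivatives have the standard closed form $p_{jk}(u)=p_{j}(u)\{1(j=k)-p_{k}(u)\}$, where $p_{j}(u)$ is the usual logit probability. The following minimal sketch checks this closed form against finite differences; the function names and the utility vector are illustrative and are not taken from the paper.

```python
import math

def softmax(u):
    """Logit choice probabilities p_j(u) for a list of utilities u."""
    m = max(u)  # subtract the maximum for numerical stability
    e = [math.exp(ui - m) for ui in u]
    s = sum(e)
    return [ei / s for ei in e]

def logit_jacobian(u):
    """Closed form p_jk(u) = p_j(u) * (1(j == k) - p_k(u))."""
    p = softmax(u)
    n = len(u)
    return [[p[j] * ((1.0 if j == k else 0.0) - p[k]) for k in range(n)]
            for j in range(n)]

def numeric_jacobian(u, h=1e-6):
    """Central-difference approximation of d p_j / d u_k."""
    n = len(u)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        up = list(u)
        dn = list(u)
        up[k] += h
        dn[k] -= h
        p_up, p_dn = softmax(up), softmax(dn)
        for j in range(n):
            jac[j][k] = (p_up[j] - p_dn[j]) / (2.0 * h)
    return jac

u_example = [0.3, -1.2, 0.8]  # arbitrary utilities for three alternatives
analytic = logit_jacobian(u_example)
numeric = numeric_jacobian(u_example)
```

Each row of the Jacobian sums to zero because the choice probabilities sum to one, so a regressor that raises the utility of one alternative shifts probability away from the others.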

An important class of examples is a panel version of Berry (1994) where $u_{j}(x,\eta)=\eta^{\prime}x^{j}$ for choice specific observable characteristics $x^{j}$ . In this class of examples,

[TABLE]

Here we see that the derivative of the $j^{th}$ choice probability difference with respect to $X_{2}^{k}$ for the $k^{th}$ alternative is an expectation of the random coefficients multiplied by a scalar $p_{jk}(u(X_{2},\eta)\mid\eta)$ . The choice probabilities need not have the logit form for this result to hold. Also, analogous to the cross-section case, if $v_{t}$ is independent of $\eta_{t}$ conditional on $\mathbf{X}$ so that $p_{j}(u\mid\eta_{t},\mathbf{X})=p_{j}(u\mid\mathbf{X})$ , and each of the characteristic vectors is zero at both time periods, then the scalar is a constant and

[TABLE]

Hence the derivative of the probability at the origin is a scalar multiple of the expectation of the random coefficients conditional on the regressor being zero at both time periods. Again it can be shown as in the cross-section case that higher-order derivatives identify higher-order moments of $\eta$ conditional on $\mathbf{X}$ , up to scale.

Time stationary panel data provides a way of controlling for endogeneity of prices in imperfectly competitive markets where the price is one element of $X_{t}$ . The time stationarity condition of Assumption 5 allows for unobserved features of preferences corresponding to $\varepsilon_{t}$ to be correlated with $\mathbf{X}$ in unspecified ways, as long as that relationship is the same for each time period. In particular, as mentioned earlier, components of $\varepsilon$ that do not vary over time automatically satisfy this condition. In this sense Assumption 5 is a very general condition for preferences that do not vary over time. It can also be extended to settings where the dimension $t$ corresponds to different markets or locations instead of time periods.

Similar to the cross-section case, if $\eta$ is a constant vector $\beta_{0}$ then

[TABLE]

Thus we find that the derivative of the choice probability is proportional to $\beta_{0}$ for all $X_{1}=X_{2}$ in a panel data multinomial choice model where $u_{j}(x,\eta)=\beta_{0}^{\prime}x^{j}$ .

Theorem 9: If Assumption 5 is satisfied, $U_{j}(x,\varepsilon)=\beta_{0}^{\prime}x^{j}+v_{j},$ and $\mathrm{E}[p_{j}(u(x,\eta_{t})\mid\eta_{t},X_{1},X_{2})\mid X_{1},X_{2}]$ is differentiable in $x$ and $X_{2}$, then for each $j$ and $k$, equation (5.3) is satisfied. Also, if $\left.\mathrm{E}[p_{jk}(u(X_{1},\eta_{t})\mid\eta_{t},\mathbf{X})\mid\mathbf{X}]\right|_{X_{1}=X_{2}}\neq 0$ for some $j,$ $k,$ and $X_{1}$, then $\beta_{0}$ is identified up to scale.

This gives an identification result for multinomial choice models in panel data. It shows that the vector of coefficients of continuous regressors in a multinomial choice model with an additive fixed effect is identified up to scale from the diagonal where $X_{1}=X_{2}.$ This identification result holds even if $X_{t}$ is bounded, unlike that of Manski (1987). It can also hold even with $v_{t}$ having bounded support, unlike that of Shi et al. (2017). In independent work, Chen and Wang (2017) have recently shown that in panel binary choice the entire vector $\beta_{0}$ can be identified up to scale if just one component of $X_{t}$ is continuously distributed.
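The up-to-scale result can be illustrated with a small numerical sketch. It assumes a hypothetical design that is not from the paper: three alternatives, $U_{jt}=\beta_{0}^{\prime}X_{t}^{j}+\alpha_{j}+e_{jt}$ with i.i.d. Type I extreme value $e_{jt}$, and a time invariant fixed effect $\alpha$ taking two values whose probabilities depend on $\mathbf{X}$. The names `BETA`, `ALPHA`, `TILT`, `P`, and `grad_x2` are illustrative.

```python
import math

BETA = (1.0, -2.0)                              # hypothetical beta_0
ALPHA = [(0.0, 0.5, -0.5), (1.0, -1.0, 0.0)]    # hypothetical fixed-effect values
TILT = (0.8, -0.8)                              # dependence of their weights on X

def softmax(u):
    m = max(u)
    e = [math.exp(ui - m) for ui in u]
    s = sum(e)
    return [ei / s for ei in e]

def weights(X1, X2):
    """Pr(alpha = ALPHA[m] | X), symmetric in (X1, X2), so time stationary."""
    s = sum(sum(r) for r in X1) + sum(sum(r) for r in X2)
    raw = [math.exp(c * s) for c in TILT]
    tot = sum(raw)
    return [r / tot for r in raw]

def P(j, t, X1, X2):
    """Pr(Y_jt = 1 | X) under the assumed fixed-effect logit design."""
    Xt = X1 if t == 1 else X2
    total = 0.0
    for pm, a in zip(weights(X1, X2), ALPHA):
        u = [sum(b * x for b, x in zip(BETA, xj)) + aj for xj, aj in zip(Xt, a)]
        total += pm * softmax(u)[j]
    return total

def grad_x2(f, X, k, h=1e-5):
    """Central-difference gradient of f(X1, X2) in X_2^k, at X_1 = X_2 = X."""
    g = []
    for l in range(len(BETA)):
        up = [r[:] for r in X]
        dn = [r[:] for r in X]
        up[k][l] += h
        dn[k][l] -= h
        g.append((f(X, up) - f(X, dn)) / (2.0 * h))
    return g

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

X = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]]      # a point on the diagonal
j, k = 0, 1
g_diff = grad_x2(lambda A, B: P(j, 2, A, B) - P(j, 1, A, B), X, k)
g_naive = grad_x2(lambda A, B: P(j, 2, A, B), X, k)
cos_diff = cosine(g_diff, BETA)
cos_naive = cosine(g_naive, BETA)
```

In this design the gradient of the differenced probability is collinear with `BETA` (up to sign, since the cross derivative $p_{jk}$ is negative for $j\neq k$), whereas the gradient of the period-two probability alone is not, because it also contains the derivative of the fixed-effect distribution with respect to $X_{2}$.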

6 Nonidentification Off the Diagonal

The panel data results show identification of utility derivatives on the diagonal where $X_{1}=X_{2}.$ We can also show that off the diagonal, where $X_{1}\neq X_{2},$ utility derivatives are not identified with two time periods. Specifically, off the diagonal one can obtain multiple values of conditional expectations of utility derivatives from the same data.

To provide intuition we first show nonidentification for the smooth case where

$$Y_{t}=\phi(X_{t},\varepsilon_{t}),\qquad t=1,2,\tag{6.1}$$

$X_{t}$ is a scalar, $\phi(x,\varepsilon)$ is continuously differentiable in $x$ , and the distribution of $\varepsilon_{t}$ given $\mathbf{X}=(X_{1},X_{2})^{\prime}$ is time stationary. Suppose that equation (6.1) is true. We can construct an alternative, observationally equivalent nonseparable model with time stationary disturbances as

[TABLE]

By construction $\tilde{\varepsilon}$ does not vary with $t$ , so that it is time stationary. Also, $Y_{t}=\tilde{\phi}(X_{t},\tilde{\varepsilon})$ so that the alternative model is observationally equivalent to the original one. Furthermore, the expected value of $\tilde{\phi}_{x}\left(X_{2},\tilde{\varepsilon}\right):=\partial_{x}\tilde{\phi}(x,\tilde{\varepsilon})|_{x=X_{2}}$ conditional on $\mathbf{X}$ is

[TABLE]

In contrast

[TABLE]

In general the expected derivative in equation (6.2) will not equal the expected derivative in equation (6.3) when $\mathrm{E}[\phi(x,\varepsilon_{2})\mid\mathbf{X}]$ is nonlinear in $x$ over the set where $X_{1}\neq X_{2}.$ Thus we have constructed an observationally equivalent nonseparable model with $\mathrm{E}[\tilde{\phi}_{x}\left(X_{2},\tilde{\varepsilon}\right)\mid\mathbf{X}]\neq\mathrm{E}[\phi_{x}\left(X_{2},\varepsilon_{2}\right)\mid\mathbf{X}],$ implying that $\mathrm{E}[\phi_{x}\left(X_{2},\varepsilon_{2}\right)\mid\mathbf{X}]$ is not identified, on the set where $X_{1}\neq X_{2}.$ The following is a precise statement of this nonidentification result.

Theorem 10: If i) $Y_{t}=\phi(X_{t},\varepsilon_{t}),$ $\varepsilon_{t}$ is time stationary conditional on $\mathbf{X}$; ii) $\phi(x,\varepsilon)$ is continuously differentiable in $x$ with bounded derivative; and iii)

[TABLE]

for $X_{1}\neq X_{2},$ then $\mathrm{E}[\phi_{x}\left(X_{2},\varepsilon_{2}\right)\mid\mathbf{X}]$ is not identified on the set $X_{1}\neq X_{2}.$

For example suppose $\phi(x,\varepsilon)$ is quadratic in $x$ with $\varepsilon=(\varepsilon_{a},\varepsilon_{b},\varepsilon_{c})^{\prime}$ and $\phi(x,\varepsilon)=\varepsilon_{a}+\varepsilon_{b}x+\varepsilon_{c}x^{2}.$ Then

[TABLE]

Theorem 10 implies that in this quadratic model $\mathrm{E}[\phi_{x}\left(X_{2},\varepsilon_{2}\right)\mid\mathbf{X}]$ is not identified on $X_{1}\neq X_{2}$ when $\mathrm{E}[\varepsilon_{c2}\mid\mathbf{X}]\neq 0.$

It is interesting that the form of the alternative, observationally equivalent model $Y_{t}=\tilde{\varepsilon}_{a}+\tilde{\varepsilon}_{b}X_{t}$ is linear in $X_{t}$ . This is the model considered by Graham and Powell (2012). Observational equivalence of this model to the true model means that it is impossible to distinguish from the data a linear in $x$ model from a nonlinear one, when there is one regressor and two time periods. Furthermore, the proof given above also shows that the object estimated by the Graham and Powell (2012) estimator will be the expected difference quotient

[TABLE]

This could be an interesting object. Of course one might also be interested in the expected derivative on the diagonal given by $\left.\partial_{X_{2}}\mathrm{E}[Y_{2}-Y_{1}\mid\mathbf{X}]\right|_{X_{1}=X_{2}}$ ; see Hoderlein and White (2012) and Chernozhukov et al. (2015). It might be best to report both kinds of effects in practice, given the impossibility of distinguishing a linear from a nonlinear model when there is a scalar $X_{t}$ and two time periods.
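The quadratic example can be simulated to see the observational equivalence and the gap between the two effects. The sketch below makes several assumptions that are not from the paper: $\varepsilon$ is time invariant (a special case of time stationarity), its components are independent normals with illustrative means, and the linear alternative is built as $\tilde{\varepsilon}_{b}=(Y_{2}-Y_{1})/(X_{2}-X_{1})$ and $\tilde{\varepsilon}_{a}=Y_{1}-\tilde{\varepsilon}_{b}X_{1}$, which is one construction consistent with the linear form $Y_{t}=\tilde{\varepsilon}_{a}+\tilde{\varepsilon}_{b}X_{t}$ stated above.

```python
import random

def simulate(x1, x2, n=100_000, seed=0):
    """Compare the quadratic model with a linear observationally equivalent one.

    Returns (mean true derivative at x2, mean difference quotient, largest
    absolute error of the linear model in reproducing Y_1 and Y_2).
    """
    rng = random.Random(seed)
    true_sum = 0.0
    alt_sum = 0.0
    max_err = 0.0
    for _ in range(n):
        # Hypothetical heterogeneity: E[eps_b] = 1, E[eps_c] = 0.5.
        ea = rng.gauss(0.0, 1.0)
        eb = rng.gauss(1.0, 1.0)
        ec = rng.gauss(0.5, 1.0)
        y1 = ea + eb * x1 + ec * x1 * x1
        y2 = ea + eb * x2 + ec * x2 * x2
        # Linear alternative: slope is the individual difference quotient.
        tb = (y2 - y1) / (x2 - x1)
        ta = y1 - tb * x1
        max_err = max(max_err, abs(ta + tb * x1 - y1), abs(ta + tb * x2 - y2))
        true_sum += eb + 2.0 * ec * x2   # phi_x(X_2, eps) in the quadratic model
        alt_sum += tb                    # derivative in the linear model
    return true_sum / n, alt_sum / n, max_err

true_deriv, diff_quotient, max_err = simulate(0.5, 1.5)
gap = true_deriv - diff_quotient
```

Both models reproduce $(Y_{1},Y_{2})$ for every simulated unit, yet the average derivatives differ by approximately $\mathrm{E}[\varepsilon_{c}](X_{2}-X_{1})$, which is $0.5$ in this design. The gap vanishes as $X_{2}$ approaches $X_{1}$, in line with identification on the diagonal.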

We can give an analogous result for binary choice. Consider the binary choice panel model in (5.1) where $Y_{t}=1(\delta(X_{t},\varepsilon_{t})\geq 0)$ with scalar $X_{t}$ . Suppose that this model satisfies Assumption 2. As in the smooth case, we can construct an alternative, observationally equivalent nonseparable model with time stationary disturbances as

[TABLE]

Note that this model also satisfies Assumption 2 because $\tilde{\delta}\left(x,\tilde{\varepsilon}\right)$ is monotonic in $\tilde{\varepsilon}_{a}$ . The nonidentification result that we give here shows that the object of interest in Theorem 7 is different for the two observationally equivalent models when $X_{1}\neq X_{2}$ . Thus,

[TABLE]

in general. Here we use again the notation $g_{x}(X_{2},\varepsilon_{2}):=\partial_{x}g(x,\varepsilon_{2})|_{x=X_{2}}$ for $g=\delta,\tilde{\delta}$ . The result then follows by

[TABLE]

The following is a precise statement of this nonidentification result.

Theorem 11: Under the assumptions of Theorem 7 and

[TABLE]

for $X_{1}\neq X_{2},$ then

[TABLE]

is not identified on the set $X_{1}\neq X_{2}.$

7 Conclusion

Jerry Hausman pioneered the introduction of flexible forms of unobserved heterogeneity in structural economic models for multinomial choice. This paper follows this tradition by considering identification of nonseparable multinomial choice models with unobserved heterogeneity that is unrestricted in both the dimension and its interaction with observed attributes. Some of our results are local in nature. For example, we show that derivatives of choice probabilities identify average utility derivatives only for marginal units that are indifferent between two choices with cross-section data and for units that have time invariant attributes with time stationary panel data. It would be interesting to characterize minimal conditions that permit extending the identification of average utility derivatives to larger populations. We leave this extension to future work.

8 Appendix: Proofs of Theorems

Proof of Theorem 1: The proof is similar to the proof of Lemma 1 of Chernozhukov et al. (2015). Let $F_{v}(v\mid\eta)=\int_{-\infty}^{v}f_{v}(u\mid\eta)du$ . Under Assumptions 1 and 2,

[TABLE]

Differentiating with respect to $x$

[TABLE]

where the conditions of Assumption 2 allow us to differentiate under the integral. Note that by the inverse and implicit function theorems,

[TABLE]

Also, by a change of variable

[TABLE]

where $f_{\delta(x,\eta,v)}(\cdot\mid\eta)$ is the conditional density of $\delta(x,\eta,v)$ given $\eta$ . Then substituting in gives

[TABLE]

since

[TABLE]

by the Bayes rule. Q.E.D.

Proof of Theorem 2: Given in text.

Proof of Corollary 3: Given in text.

Proof of Theorem 4: By iterated expectations,

[TABLE]

Also by Assumption 3 and the chain rule, $p_{j}(u(x,\eta)\mid\eta)$ is continuously differentiable in $x$ with bounded derivative

[TABLE]

Interchanging the order of differentiation and integration is then allowed, and the conclusion follows. Q.E.D.

Proof of Theorem 5: By iterated expectations and independence of $v$ and $x$ given $w$

[TABLE]

Also, because $f_{v}(v\mid\eta,w)$ is continuous in $v$ and bounded, the chain rule implies that $p_{j}(u(x,\eta)\mid\eta,w)$ is continuously differentiable in $x$ with bounded derivative

[TABLE]

Interchanging the order of differentiation and integration is then allowed, and the conclusion follows. Q.E.D.

Proof of Corollary 6: Given in text.

Proof of Theorem 7: The proof is similar to the proof of Theorem 2 of Chernozhukov et al. (2015). Let $H(x,X_{2})=\Pr(\delta(x,\varepsilon)\geq 0\mid X_{1},X_{2})$ . By the same argument as in the proof of Theorem 1 conditional on $(X_{1},X_{2}),$ $H(x,X_{2})$ is differentiable in $x$ with

[TABLE]

From (5.2), $H(x,X_{2})$ is also differentiable in $X_{2}$ with

[TABLE]

The result follows by

[TABLE]

taking differences with $t=1$ and $t=2$ , and evaluating at $X_{1}=X_{2}$ . Q.E.D.

Proof of Theorem 8: By iterated expectations,

[TABLE]

where $F(\eta\mid\mathbf{X})$ denotes the CDF of $\eta_{t}$ conditional on $\mathbf{X}$ . Also by Assumption 3 and the chain rule $p_{j}(u(x,\eta)\mid\eta,\mathbf{X})$ is continuously differentiable in $x$ with bounded derivative and

[TABLE]

It follows by the previous equation that the order of differentiation and integration can be interchanged to obtain

[TABLE]

Let $\mu_{j}(x\mid\mathbf{X}):=\mathrm{E}[p_{j}(u(x,\eta_{t})\mid\eta_{t},\mathbf{X})\mid\mathbf{X}],$ and note that $P_{jt}(\mathbf{X})=\mu_{j}(X_{t}\mid\mathbf{X}).$ Then by the chain rule we have

[TABLE]

Interchanging the order of differentiation and integration is then allowed, and the conclusion follows. Q.E.D.

Proof of Theorem 9: Given in text.

Proof of Theorem 10: Given in text.

Proof of Theorem 11: Given in text.

Acknowledgements

We thank the editor, referee, and participants at Cambridge-INET and Cemmap Panel Data Workshop for comments and Siyi Luo for capable research assistance. We gratefully acknowledge research support from the NSF. We appreciate the hospitality of the Cowles Foundation where much of the work for this paper was accomplished.

REFERENCES

Abrevaya, J. (2000): “Rank Estimation of a Generalized Fixed-Effects Regression Model,” Journal of Econometrics 95, 1-23.

Ai, C. (1997): “A Semiparametric Maximum Likelihood Estimator,” Econometrica 65, 933-963.

Altonji, J., and R. Matzkin (2005): “Cross Section and Panel Data Estimators for Nonseparable Models with Endogenous Regressors,” Econometrica 73, 1053-1102.

Berry, S. (1994): “Estimating Discrete Choice Models of Product Differentiation,” Rand Journal of Economics 25, 242-262.

Berry, S., and P. Haile (2010): “Nonparametric Identification of Multinomial Choice Demand Models with Heterogeneous Consumers,” Cowles Foundation Discussion Paper 1718.

Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in Market Equilibrium,” Econometrica 63, 841-890.

Blundell, R.W., and J.L. Powell (2004): “Endogeneity in Semiparametric Binary Response Models,” Review of Economic Studies 71, 655-679.

Blomquist, S., A. Kumar, C.-Y. Liang, and W.K. Newey (2014): “Individual Heterogeneity, Nonlinear Budget Sets, and Taxable Income,” CEMMAP working paper CWP21/14.

Burda, M., M. Harding, and J.A. Hausman (2008): “A Bayesian Mixed Logit-Probit Model for Multinomial Choice,” Journal of Econometrics 147, 232-246.

Burda, M., M. Harding, and J.A. Hausman (2010): “A Poisson Mixture Model of Discrete Choice,” Journal of Econometrics 166, 184-203.

Chen, S., and X. Wang (2017): “Semiparametric Estimation of a Panel Data Model without Monotonicity or Separability,” unpublished manuscript, Hong Kong University of Science and Technology.

Chernozhukov, V., I. Fernandez-Val, J. Hahn, and W.K. Newey (2013): “Average and Quantile Effects in Nonseparable Panel Models,” Econometrica 81, 535-580.

Chernozhukov, V., I. Fernandez-Val, S. Hoderlein, H. Holzman, and W.K. Newey (2015): “Nonparametric Identification in Panels Using Quantiles,” Journal of Econometrics 188, 378-392.

Chernozhukov, V., I. Fernandez-Val, and Y. Luo (2015): “The Sorted Effects Method: Discovering Heterogeneous Effects Beyond Their Averages,” working paper.

Evdokimov, K. (2010): “Identification and Estimation of a Nonparametric Panel Data Model with Unobserved Heterogeneity,” unpublished manuscript, Princeton University.

Gautier, E., and Y. Kitamura (2013): “Nonparametric Estimation in Random Coefficients Binary Choice Models,” Econometrica 81, 581-607.

Graham, B.W., and J.L. Powell (2012): “Identification and Estimation of Average Partial Effects in ‘Irregular’ Correlated Random Coefficient Panel Data Models,” Econometrica 80, 2105-2152.

Hausman, J.A., and D. Wise (1978): “A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences,” Econometrica 46, 403-426.

Hausman, J.A., and W.K. Newey (2016): “Individual Heterogeneity and Average Welfare,” Econometrica 84, 1225-1248.

Hoderlein, S., and E. Mammen (2007): “Identification of Marginal Effects in Nonseparable Models without Monotonicity,” Econometrica 75, 1513-1519.

Hoderlein, S., and H. White (2012): “Nonparametric Identification in Nonseparable Panel Data Models with Generalized Fixed Effects,” Journal of Econometrics 168, 300-314.

Honore, B.E. (1992): “Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression Models with Fixed Effects,” Econometrica 60, 533-565.

Ichimura, H. (1993): “Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single-Index Models,” Journal of Econometrics 58, 71-120.

Imbens, G., and W.K. Newey (2009): “Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity,” Econometrica 77, 1481-1512.

Manski, C.F. (1987): “Semiparametric Analysis of Random Effects Linear Models from Binary Panel Data,” Econometrica 55, 357-362.

McFadden, D. (1974): “Conditional Logit Analysis of Qualitative Choice Behavior,” in P. Zarembka (ed.) Frontiers of Econometrics, Academic Press, 105-142.

McFadden, D., and K. Richter (1991): “Stochastic Rationality and Revealed Stochastic Preference,” in J. Chipman, D. McFadden, K. Richter (eds.) Preferences, Uncertainty, and Rationality, Westview Press, 161-186.

Pakes, A., and J. Porter (2014): “Moment Inequalities for Semi-parametric Multinomial Choice with Fixed Effects,” working paper, Harvard University.

Sasaki, Y. (2015): “What Do Quantile Regressions Identify for General Structural Functions?” Econometric Theory 31, 1102-1116.

Shi, X., M. Shum, and W. Song (2017): “Estimating Semi-parametric Panel Multinomial Choice Models using Cyclic Monotonicity,” working paper, University of Wisconsin-Madison.

Stoker, T. (1986): “Consistent Estimation of Scaled Coefficients,” Econometrica 54, 1461-1482.