Multivariate Conditional Transformation Models

Nadja Klein; Torsten Hothorn; Luisa Barbanti; Thomas Kneib

arXiv:1906.03151 · stat.ME · June 27, 2023

TL;DR

This paper introduces a flexible, likelihood-based framework for multivariate conditional transformation models that capture complex dependencies and nonlinear effects of covariates, improving upon existing simplistic models.

Contribution

It proposes a general, scalable framework for multivariate conditional transformation models that allows covariate-dependent dependence structures and nonlinear effects.

Findings

01. Framework scales beyond bivariate responses

02. Empirical benefits shown in childhood undernutrition analysis

03. Allows flexible, interpretable modeling of complex multivariate data

Equations

$$h(\boldsymbol{Y})=(h_{1}(\boldsymbol{Y}),\dots,h_{J}(\boldsymbol{Y}))^{\top}\stackrel{d}{=}(Z_{1},\dots,Z_{J})^{\top}=\boldsymbol{Z}\in\mathbb{R}^{J}.$$

$$\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|>0.$$

$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\left[\prod_{j=1}^{J}f_{Z}(h_{j}(\boldsymbol{y}))\right]\cdot\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|.$$

$$h_{j}(\boldsymbol{y})=h_{j}(y_{1},\dots,y_{J})=h_{j}(y_{1},\dots,y_{j})$$

$$\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|=\prod_{j=1}^{J}\frac{\partial h_{j}(y_{1},\dots,y_{j})}{\partial y_{j}}.$$

$$h_{j}(y_{1},\dots,y_{j})=\lambda_{j1}\tilde{h}_{1}(y_{1})+\dots+\lambda_{jj}\tilde{h}_{j}(y_{j})$$

$$h_{j}(y_{1},\dots,y_{j})=\lambda_{j1}\tilde{h}_{1}(y_{1})+\dots+\lambda_{j,j-1}\tilde{h}_{j-1}(y_{j-1})+\tilde{h}_{j}(y_{j}).$$

$$\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|=\prod_{j=1}^{J}\frac{\partial\tilde{h}_{j}(y_{j})}{\partial y_{j}}>0,$$

$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\prod_{j=1}^{J}f_{Z}\left(\lambda_{j1}\tilde{h}_{1}(y_{1})+\dots+\lambda_{j,j-1}\tilde{h}_{j-1}(y_{j-1})+\tilde{h}_{j}(y_{j})\right)\frac{\partial\tilde{h}_{j}(y_{j})}{\partial y_{j}}.$$

$$\mathbf{\Lambda}=\begin{pmatrix}1&&&&0\\ \lambda_{21}&1&&&\\ \lambda_{31}&\lambda_{32}&1&&\\ \vdots&\vdots&&\ddots&\\ \lambda_{J1}&\lambda_{J2}&\dots&\lambda_{J,J-1}&1\end{pmatrix}.$$

$$\tilde{h}_{j}(Y_{j})=\Phi_{0,\sigma_{j}^{2}}^{-1}(F_{j}(Y_{j}))=\tilde{Z}_{j}$$

$$\mathbb{P}(\boldsymbol{Y}\leq\boldsymbol{y})$$

$$\tilde{Z}_{j}/\sigma_{j}\sim\operatorname{N}(0,1)\quad\text{and}\quad\Phi_{0,\sigma_{j}^{2}}(\tilde{z}_{j})=\Phi_{0,1}(\tilde{z}_{j}/\sigma_{j})$$

$$F_{j}(y_{j})=\mathbb{P}(Y_{j}\leq y_{j})=\Phi_{\mathbf{0},\mathbf{\Sigma}}(\infty,\dots,\infty,\tilde{h}_{j}(y_{j}),\infty,\dots,\infty)=\Phi_{0,\sigma_{j}^{2}}(\tilde{h}_{j}(y_{j}))$$

$$\mathbb{E}(Y_{j})=\int F_{j}^{-1}(\Phi_{0,\sigma_{j}^{2}}(\tilde{z}))\,\phi_{0,\sigma_{j}^{2}}(\tilde{z})\,d\tilde{z},$$

$$h_{j}(y_{1},\dots,y_{j})=\sum_{\jmath=1}^{j-1}\lambda_{j\jmath}\,\Phi_{0,\sigma_{\jmath}^{2}}^{-1}\left[F_{Z}\left\{\tilde{h}_{\jmath}(y_{\jmath})\right\}\right]+\Phi_{0,\sigma_{j}^{2}}^{-1}\left[F_{Z}\left\{\tilde{h}_{j}(y_{j})\right\}\right].$$

$$F_{j}(y_{j})=\Phi_{\mathbf{0},\mathbf{\Sigma}}\left(\infty,\dots,\infty,\Phi_{0,\sigma_{j}^{2}}^{-1}\left\{F_{Z}\left[\tilde{h}_{j}(y_{j})\right]\right\},\infty,\dots,\infty\right)=F_{Z}(\tilde{h}_{j}(y_{j})).$$

$$\mathcal{H}=\bigg\{h:\mathbb{R}^{J}\ \dots$$

$$\ell_{i}(\boldsymbol{\theta})=\sum_{j=1}^{J}\left\{-\frac{1}{2}\left(\sum_{\jmath=1}^{j-1}\lambda_{j\jmath}\,\boldsymbol{a}_{\jmath}(y_{i\jmath})^{\top}\boldsymbol{\vartheta}_{\jmath}+\boldsymbol{a}_{j}(y_{ij})^{\top}\boldsymbol{\vartheta}_{j}\right)^{2}+\log\left(\boldsymbol{a}_{j}^{\prime}(y_{ij})^{\top}\boldsymbol{\vartheta}_{j}\right)\right\}$$

$$\frac{\partial\ell_{i}(\boldsymbol{\theta})}{\partial\boldsymbol{\vartheta}_{k}}$$

$$\frac{\partial\ell_{i}(\boldsymbol{\theta})}{\partial\lambda_{\tilde{k}k}}$$

$$\hat{\boldsymbol{\theta}}_{n}$$

$$\hat{F}_{\boldsymbol{Y}}(\boldsymbol{y})=\Phi_{\mathbf{0},\hat{\mathbf{\Sigma}}}\left(\boldsymbol{a}_{1}(y_{1})^{\top}\hat{\boldsymbol{\vartheta}}_{1},\dots,\boldsymbol{a}_{J}(y_{J})^{\top}\hat{\boldsymbol{\vartheta}}_{J}\right).$$

$$\sup_{\{\boldsymbol{\theta}\,\mid\,|\boldsymbol{\theta}-\boldsymbol{\theta}_{0}|>\epsilon\}}\mathbb{E}_{\boldsymbol{\theta}_{0}}\left[\log\{f_{\boldsymbol{Y}}(\boldsymbol{Y}\mid\boldsymbol{\theta})\}\right]<\mathbb{E}_{\boldsymbol{\theta}_{0}}\left[\log\{f_{\boldsymbol{Y}}(\boldsymbol{Y}\mid\boldsymbol{\theta}_{0})\}\right].$$

$$\mathbb{E}_{\boldsymbol{\theta}_{0}}\left(\sup_{\boldsymbol{\theta}}\left\|\frac{\partial\log(f_{\boldsymbol{Y}}(\boldsymbol{Y}\mid\boldsymbol{\theta}))}{\partial\boldsymbol{\theta}}\right\|^{2}\right)<\infty.$$

$$\left(\mathbb{E}_{\boldsymbol{\theta}_{0}}\left(-\frac{\partial^{2}\log(f_{\boldsymbol{Y}}(\boldsymbol{Y}\mid\boldsymbol{\theta}))}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{\top}}\right)\right)^{-1}.$$

$$\mathbb{E}_{\boldsymbol{\theta}_{0}}\left(\left[\frac{\partial\log(f_{\boldsymbol{Y}}(\boldsymbol{Y}\mid\boldsymbol{\theta}))}{\partial\boldsymbol{\theta}}\right]\left[\frac{\partial\log(f_{\boldsymbol{Y}}(\boldsymbol{Y}\mid\boldsymbol{\theta}))}{\partial\boldsymbol{\theta}}\right]^{\top}\right)$$

$$\boldsymbol{Z}_{1b},\dots,\boldsymbol{Z}_{nb},\quad b=1,\dots,B,\quad\boldsymbol{Z}_{ib}\sim\operatorname{N}(\mathbf{0},\boldsymbol{I}_{J}),\quad i=1,\dots,n.$$

$$\tilde{\boldsymbol{Z}}_{1b},\dots,\tilde{\boldsymbol{Z}}_{nb},\quad\tilde{\boldsymbol{Z}}_{ib}=\hat{\mathbf{\Lambda}}^{-1}\boldsymbol{Z}_{ib}.$$

$$\boldsymbol{Y}_{1b},\dots,\boldsymbol{Y}_{nb},\quad Y_{ijb}=\hat{\tilde{h}}_{j}^{-1}(\tilde{Z}_{ijb}),\quad j=1,\dots,J.$$

Full text

Multivariate Conditional Transformation Models

Nadja Klein$^{1,\star}$, Torsten Hothorn$^{2}$, Luisa Barbanti$^{2}$ and Thomas Kneib$^{3}$

$^{1}$Humboldt-Universität zu Berlin, $^{2}$Universität Zürich, $^{3}$Georg-August-Universität Göttingen

Abstract

Regression models describing the joint distribution of multivariate response variables conditional on covariate information have become an important aspect of contemporary regression analysis. However, a limitation of such models is that they often rely on rather simplistic assumptions, e.g. a constant dependency structure that is not allowed to vary with the covariates or the restriction to linear dependence between the responses only. We propose a general framework for multivariate conditional transformation models that overcomes these limitations and describes the entire distribution in a tractable and interpretable yet flexible way conditional on nonlinear effects of covariates. The framework can be embedded into likelihood-based inference, including results on asymptotic normality, and allows the dependence structure to vary with covariates. In addition, the framework scales well beyond bivariate response situations, which were the main focus of most earlier investigations. We illustrate the application of multivariate conditional transformation models in a trivariate analysis of childhood undernutrition and demonstrate empirically that our approach can be beneficial compared to existing benchmarks such that complex truly multivariate data-generating processes can be inferred from observations.

Key words: Constrained optimization; copula; marginal distributions; multivariate regression; most likely transformations; normalizing flows; seemingly unrelated regression.

$^{\star}$ Correspondence should be directed to Prof. Dr. Nadja Klein at Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin. Email: [email protected].

1 Introduction

In a broad sense, regression models describe the distribution of a response conditional on a set of covariates. Such models are a versatile tool to understand how changes in the covariates propagate to changes in the distribution of the response. Distributional and multivariate regression models have received much interest during the last decade. Rather than focusing on the conditional mean, distributional regression strives to describe relevant features of the complete conditional distribution of a usually univariate response by flexible functions of the covariates. Multivariate regression models employ covariates to express the joint conditional distribution of a multivariate response. Known for half a century, transformation models have recently received renewed interest in statistics as an important technique for distributional regression and, under the term normalizing flows, in machine learning (Papamakarios et al., 2019) for modelling high-dimensional responses in an unconditional way. The core idea of flows or transformation models is to apply a data-driven transformation to the response such that the transformed variable is standard normal or follows some other convenient distribution. In this paper, we propose a framework of multivariate conditional transformation models (MCTMs) that apply this principle to define a novel class of multivariate distributional regression models. We review relevant developments in multivariate distributional regression first before highlighting some special features of the new method.

The most prominent multivariate regression model is seemingly unrelated regression (SUR), which uses a vector of correlated normal error terms to combine several linear model regression specifications with a common correlation structure that does not depend on any of the covariates (Zellner, 1962). By construction, the model is restricted to capture linear dependencies. Lang et al. (2003) extended SUR models by replacing the frequently used linear predictor with a structured additive predictor, while retaining the assumption that the linear correlation structure does not depend on the covariates. Multivariate probit models use a latent SUR model for a multivariate set of latent utilities that, via thresholds, are transformed to the observed binary response vector (Heckman, 1978). The approach of Klein, Kneib, Klasen and Lang (2015) embeds bivariate SUR-type specifications into generalised additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos, 2005) by allowing all distribution parameters, including the correlations, to be related to additive predictors.

Beyond SUR-type models, copulas provide a flexible approach to the construction of multivariate distributions and regression models. As a major advantage, the model-building process is conveniently decomposed into the specification of the marginals and the selection of an appropriate copula function that defines the dependence structure (see Joe, 1997; Nelsen, 2006, for reviews on copula models and their properties).

There is a rich literature on conditional copula modelling. To name just a few, Veraverbeke et al. (2011) use kernel methods to estimate the copula parameter after having determined the marginal distributions empirically. Assuming that the marginal distributions are known, a copula can be fitted based on a local likelihood using the approach of Acar et al. (2011). Bayesian inference in bivariate conditional copula models with homoscedastic Gaussian marginals has been proposed in Sabeti et al. (2014) and Levi and Craiu (2018).

Analogous to GAMLSS, bivariate copula models with parametric marginal distributions, one-parameter copulas, as well as joint semiparametric specifications for the predictors of all parameters of both marginal and copula models have been developed by Marra and Radice (2017) in a penalised likelihood framework and by Klein and Kneib (2016) using a Bayesian approach. Following these lines, Marra and Radice (2020) recently extended the framework to copula link-based survival models, while Sun and Ding (2019) develop a copula-based semiparametric regression method for bivariate data under general interval censoring. Alternatives to simultaneous estimation are two-step procedures that first estimate the marginals and then the copula given the marginals; such procedures have been proposed by, e.g., Vatter and Chavez-Demoulin (2015) and Yee (2015) for parametric marginal distributions and bivariate one-parameter copulas. However, these approaches are mostly limited to the bivariate case. Vatter and Nagler (2018) recently proposed a sequential method for conditional pair-copula constructions.

Nonparametric attempts to simultaneously study multivariate response variables have been reported in the context of multivariate quantiles. Because no natural ordering exists beyond univariate settings, definitions of multivariate quantiles are challenging and there has been considerable debate regarding their desirable properties (see Serfling, 2002, for an introduction to the different definitions). For example, one group of approaches draws on the concept of data depths (see for example Mosler, 2010), utilising options for multivariate depth functions based on distances such as Mahalanobis and Oja depths, weighted distances or on half-spaces. However, potential quantile crossings need further investigations to ensure a coherent model for the joint distribution, because single quantiles only relate to local properties of a response to covariates. For more information on depth functions and multivariate quantiles, we refer the reader to Chernozhukov et al. (2017) and Carlier et al. (2016, 2017).

MCTMs constitute a novel and coherent approach to multivariate regression analysis that differs in many respects from existing approaches in copula or nonparametric regression. In particular, this framework makes six important contributions, which are not available simultaneously in any existing regression method for multivariate responses:

i. MCTMs allow for direct estimation and inference of the entire multivariate conditional cumulative distribution function (CDF) $F_{\text{\boldmath$ Y $}}(\text{\boldmath$ y $}\mid\text{\boldmath$ x $})=\text{$ \mathds{P} $}(\text{\boldmath$ Y $}\leq\text{\boldmath$ y $}\mid\text{\boldmath$ x $})$ of a $J$-dimensional response vector $Y$ given covariate information $x$ under rather weak assumptions. A key feature of MCTMs is that they extend likelihood-based inference in univariate conditional transformation models (CTMs, Hothorn et al., 2018) to the multivariate situation in a natural way.

ii. MCTMs can capture nonlinear effects of covariates on all aspects of the distribution, e.g. marginal moments, marginal and joint quantiles, dependence structures etc. As in the case of copulas, a feature of the model specification process is that joint distributions are constructed by their decomposition into marginals and the dependence structure. Most existing approaches assume a constant dependence structure not varying over the covariate space.

iii. Model estimation can be performed simultaneously for all model components, thus avoiding the need for two-step estimators that are commonly applied in most copula-based approaches.

iv. Theoretical results on optimality properties, such as consistency and asymptotic normality, are available, building on the achievements in univariate CTMs.

v. Unlike multivariate GAMLSS, MCTMs neither require strong parametric assumptions nor separate the model estimation process into local properties, as in multivariate quantile regression.

vi. The method scales well to situations beyond the bivariate case $J=2$ and readily allows for the determination of both the marginal distributions of subsets of the response vector and the conditional distributions of some response elements, given the others.

MCTMs are not equivalent to copulas; however, Gaussian copulas (Song, 2000) with arbitrary marginal distributions are treated as a special case in this paper. Both the marginal distributions and the correlation parameters of the copula can depend on covariates when such a copula model is specified by means of an MCTM.

The paper is structured as follows: Section 2 provides details on the specification of multivariate transformation models for the unconditional case of absolutely continuous responses. Likelihood-based inference and optimality properties are derived in Section 3, along with an illustration on multivariate density estimation with highly non-Gaussian marginal distributions. Section 4 considers how multivariate conditional transformation models may depend on covariates, and the approach is illustrated by a trivariate analysis of childhood undernutrition indicators. Section 5 presents simulation-based empirical evidence on the performance of MCTMs, including examples with up to 10 response dimensions. Finally, Section 6 proposes directions for future research.

2 Multivariate Transformation Models

2.1 Basic Model Setup

First, unconditional transformation models are developed for the joint multivariate distribution of a $J$ -dimensional, absolutely continuous random vector $\text{\boldmath$ Y $}=(Y_{1},\ldots,Y_{J})^{\top}\in\text{$ \mathds{R} $}^{J}$ with density $f_{\text{\boldmath$ Y $}}(\text{\boldmath$ y $})$ and CDF $F_{\text{\boldmath$ Y $}}(\text{\boldmath$ y $})=\text{$ \mathds{P} $}(\text{\boldmath$ Y $}\leq\text{\boldmath$ y $})$ . These unconditional models are then extended to the regression case in Section 4.

The key component of multivariate transformation models is an unknown, bijective, strictly monotonically increasing transformation function $h:\text{$ \mathds{R} $}^{J}\rightarrow\text{$ \mathds{R} $}^{J}$ . This function maps the vector $Y$ , whose distribution is unknown and shall be estimated from data, to a set of $J$ independent and identically distributed, absolutely continuous random variables $Z_{j}\sim\mathbb{P}_{Z},j=1,\dots,J$ with an a priori defined distribution $\mathbb{P}_{Z}$ , such that

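$$h(\boldsymbol{Y})=(h_{1}(\boldsymbol{Y}),\dots,h_{J}(\boldsymbol{Y}))^{\top}\stackrel{d}{=}(Z_{1},\dots,Z_{J})^{\top}=\boldsymbol{Z}\in\mathbb{R}^{J}. \tag{1}$$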

For an absolutely continuous distribution $\mathbb{P}_{Z}$ with log-concave density $f_{Z}$ , it can easily be shown that a unique, monotonically increasing transformation function $h$ exists for arbitrary, absolutely continuous distributions of $Y$ (Hothorn et al., 2018). Thus, the model class is effectively limited only by the flexibility of the specific choice of $h$ in an actual model. As a default for $\mathbb{P}_{Z}$ , we consider the standard normal distribution, i.e. $Z_{j}\sim\operatorname{N}(0,1)$ with $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ and thus $f_{Z}=\phi_{0,1}$ and $F_{Z}=\Phi_{0,1}$ . Enforcing independent standard normality of the transformed response $h(\text{\boldmath$ Y $})$ is computationally attractive and allows the dependence structure to be described by a Gaussian copula (see Section 2.3). Alternative choices for $\mathbb{P}_{Z}$ are discussed in Section 2.6.

Under this transformation model, the task of estimating the distribution of $Y$ simplifies to the task of estimating $h$ . Because $h$ is strictly monotonically increasing in each element, it has a positive definite Jacobian, i.e.

$$\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|>0. \tag{2}$$

The density of $Y$ implied by the transformation model is then

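$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\left[\prod_{j=1}^{J}f_{Z}(h_{j}(\boldsymbol{y}))\right]\cdot\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|.$$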

This form of an unconditional multivariate density with $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ is called a normalizing flow in machine learning (Papamakarios et al., 2019). However, in this generality, the model is cumbersome in terms of both interpretation and tractability. Thus, in the following, we introduce simplified parameterisations of $h$ that lead to interpretable models.

2.2 Models with Recursive Structure

In a first step, we impose a triangular structure on the transformation function $h$ by assuming

$$h_{j}(\boldsymbol{y})=h_{j}(y_{1},\dots,y_{J})=h_{j}(y_{1},\dots,y_{j}),$$

i.e. the $j$ th component of the transformation function depends only on the first $j$ elements of its argument $y$ . Consequently, this model formulation depends inherently on the ordering of the elements in $Y$ . Because any multivariate distribution can be factored into a sequence of conditional distributions, the triangular structure does not pose a limitation in the representation of $J$ -dimensional continuous distributions, as long as the transformation functions are appropriately chosen. Furthermore, the triangular structure of $h$ considerably simplifies the determinant of the Jacobian (2), which reduces to

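$$\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|=\prod_{j=1}^{J}\frac{\partial h_{j}(y_{1},\dots,y_{j})}{\partial y_{j}}.$$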

In a second step, we assume that the triangularly structured transformation functions are linear combinations of marginal transformation functions $\tilde{h}_{j}:\text{$ \mathds{R} $}\rightarrow\text{$ \mathds{R} $}$, i.e.

$$h_{j}(y_{1},\dots,y_{j})=\lambda_{j1}\tilde{h}_{1}(y_{1})+\dots+\lambda_{jj}\tilde{h}_{j}(y_{j}),$$

where each $\tilde{h}_{j}$ increases strictly monotonically and $\lambda_{jj}>0$ for all $j=1,\ldots,J$ to ensure the bijectivity of $h$ . Because the last coefficient, $\lambda_{jj}$ , cannot be separated from the marginal transformation function $\tilde{h}_{j}(y_{j})$ , we use the restriction $\lambda_{jj}\equiv 1$ . Thus, our parameterisation of the transformation function $h$ finally reads

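$$h_{j}(y_{1},\dots,y_{j})=\lambda_{j1}\tilde{h}_{1}(y_{1})+\dots+\lambda_{j,j-1}\tilde{h}_{j-1}(y_{j-1})+\tilde{h}_{j}(y_{j}). \tag{3}$$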

Each of the marginal transformation functions $\tilde{h}_{j}(y_{j})$ includes an intercept, such that no additional intercept term can be inserted in (3). The Jacobian of $h$ now further simplifies to

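$$\left|\frac{\partial h(\boldsymbol{y})}{\partial\boldsymbol{y}}\right|=\prod_{j=1}^{J}\frac{\partial\tilde{h}_{j}(y_{j})}{\partial y_{j}}>0,$$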

and the model-based density function for $Y$ is therefore

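$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\prod_{j=1}^{J}f_{Z}\left(\lambda_{j1}\tilde{h}_{1}(y_{1})+\dots+\lambda_{j,j-1}\tilde{h}_{j-1}(y_{j-1})+\tilde{h}_{j}(y_{j})\right)\frac{\partial\tilde{h}_{j}(y_{j})}{\partial y_{j}}.$$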
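As a minimal sketch of how this density can be evaluated, the following Python snippet computes the log-density under a standard normal reference distribution. The function name `mctm_log_density`, its argument layout and the identity marginal transformations in the usage lines are illustrative assumptions and do not come from the paper.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def mctm_log_density(y, lam, h_tilde, h_tilde_prime):
    """Log-density of y under a triangular transformation model with N(0, 1) reference.

    y             -- list of J response values
    lam           -- J x J lower-triangular list of lists with unit diagonal
    h_tilde       -- list of J increasing marginal transformation functions
    h_tilde_prime -- list of the J corresponding derivatives (all positive)
    """
    z_tilde = [h(v) for h, v in zip(h_tilde, y)]
    total = 0.0
    for j in range(len(y)):
        # h_j(y_1, ..., y_j) = sum_{k<j} lam[j][k] * h~_k(y_k) + h~_j(y_j)
        h_j = sum(lam[j][k] * z_tilde[k] for k in range(j)) + z_tilde[j]
        # log standard normal density plus log of the j-th Jacobian factor
        total += -0.5 * h_j ** 2 - LOG_SQRT_2PI + math.log(h_tilde_prime[j](y[j]))
    return total

# Hypothetical bivariate example with identity marginal transformations.
identity = lambda v: v
one = lambda v: 1.0
lam = [[1.0, 0.0], [-0.5, 1.0]]
log_f = mctm_log_density([1.0, 2.0], lam, [identity, identity], [one, one])
```

With identity marginal transformations the model reduces to a bivariate normal distribution, which gives a simple way to check such an implementation against a known density.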

Summarising the model specification, our multivariate transformation model is characterised by a set of marginal transformations $\tilde{h}_{j}(y_{j})$ , $j=1,\ldots,J$ , each applying to only a single component of the vector $Y$ , and by a lower triangular $(J\times J)$ matrix of transformation coefficients

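$$\mathbf{\Lambda}=\begin{pmatrix}1&&&&0\\ \lambda_{21}&1&&&\\ \lambda_{31}&\lambda_{32}&1&&\\ \vdots&\vdots&&\ddots&\\ \lambda_{J1}&\lambda_{J2}&\dots&\lambda_{J,J-1}&1\end{pmatrix}.$$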

Under the standard normal reference distribution $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ , the coefficients in $\mathbf{\Lambda}$ characterise the dependence structure via a Gaussian copula, while the marginal transformation functions $\tilde{h}_{j}$ allow the generation of arbitrary marginal distributions for the components of $Y$ . Furthermore, the entries of $\mathbf{\Lambda}$ can be interpreted as the entries of a triangular factor of the precision matrix (the inverse covariance matrix) $\mathbf{\Lambda}^{\top}\mathbf{\Lambda}$ of the marginally transformed components of $Y$ , as we derive in the following.

2.3 Relation to Gaussian Copula Models

The relationship between multivariate transformation models and Gaussian copulas can be made more precise by defining random variables $\tilde{Z}_{j}=\tilde{h}_{j}(Y_{j})$ . Under a standard normal reference distribution $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ , the vector $\text{\boldmath$ \tilde{Z} $}=(\tilde{Z}_{1},\ldots,\tilde{Z}_{J})^{\top}$ follows a zero mean multivariate normal distribution $\text{\boldmath$ \tilde{Z} $}\sim\operatorname{N}_{J}(\mathbf{0}_{J},\mathbf{\Sigma})$ with covariance matrix $\mathbf{\Sigma}=\mathbf{\Lambda}^{-1}\mathbf{\Lambda}^{-\top}$ . As a consequence, the elements of $\tilde{Z}$ are marginally normally distributed as $\tilde{Z}_{j}\sim\operatorname{N}(0,\sigma_{j}^{2})$ , where the variances $\sigma_{j}^{2}$ can be determined from the diagonal elements of $\mathbf{\Sigma}$ .
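The mapping from the coefficient matrix to the implied covariance can be sketched in a few lines of plain Python. The helper names (`invert_unit_lower`, `implied_covariance`, `correlation`) and the numerical values of the coefficients are hypothetical and chosen only for illustration.

```python
import math

def invert_unit_lower(lam):
    """Invert a lower-triangular matrix with unit diagonal by forward substitution."""
    J = len(lam)
    inv = [[0.0] * J for _ in range(J)]
    for col in range(J):
        for row in range(col, J):
            rhs = 1.0 if row == col else 0.0
            inv[row][col] = rhs - sum(lam[row][k] * inv[k][col] for k in range(col, row))
    return inv

def implied_covariance(lam):
    """Sigma = Lambda^{-1} Lambda^{-T}, the covariance of the transformed vector."""
    inv = invert_unit_lower(lam)
    J = len(lam)
    return [[sum(inv[r][k] * inv[c][k] for k in range(J)) for c in range(J)]
            for r in range(J)]

def correlation(sigma):
    """Rescale a covariance matrix to the corresponding correlation matrix."""
    sd = [math.sqrt(sigma[j][j]) for j in range(len(sigma))]
    return [[sigma[r][c] / (sd[r] * sd[c]) for c in range(len(sigma))]
            for r in range(len(sigma))]

# Hypothetical trivariate coefficient matrix.
lam = [[1.0, 0.0, 0.0],
       [-0.5, 1.0, 0.0],
       [0.2, -0.3, 1.0]]
sigma = implied_covariance(lam)
variances = [sigma[j][j] for j in range(3)]   # sigma_j^2 on the diagonal
corr = correlation(sigma)
```

Note that the first variance is always one, whereas the later variances generally differ from one because the corresponding rows of the inverse of the coefficient matrix have non-zero off-diagonal entries.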

For the transformation functions $\tilde{h}_{j}$ , the explicit representation

$$\tilde{h}_{j}(Y_{j})=\Phi_{0,\sigma_{j}^{2}}^{-1}(F_{j}(Y_{j}))=\tilde{Z}_{j}$$

is obtained, where $F_{j}(\cdot)$ is the univariate marginal CDF of $Y_{j}$ . In summary,

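$$\mathbb{P}(\boldsymbol{Y}\leq\boldsymbol{y})=\Phi_{\mathbf{0},\mathbf{\Sigma}}\left(\tilde{h}_{1}(y_{1}),\dots,\tilde{h}_{J}(y_{J})\right)=\Phi_{\mathbf{0},\mathbf{\Sigma}}\left(\Phi_{0,\sigma_{1}^{2}}^{-1}[F_{1}(y_{1})],\dots,\Phi_{0,\sigma_{J}^{2}}^{-1}[F_{J}(y_{J})]\right),$$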

and therefore the CDF of $Y$ has exactly the same structure as a Gaussian copula, except that our representation relies on a different parameterisation of $\mathbf{\Sigma}$ through $\mathbf{\Sigma}=\mathbf{\Lambda}^{-1}\mathbf{\Lambda}^{-\top}$ rather than a covariance matrix with unit diagonal (i.e. a correlation matrix). This is compensated for by applying univariate Gaussian CDFs with variances $\sigma_{j}^{2}$ different from one to the marginals. However, because

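$$\tilde{Z}_{j}/\sigma_{j}\sim\operatorname{N}(0,1)\quad\text{and}\quad\Phi_{0,\sigma_{j}^{2}}(\tilde{z}_{j})=\Phi_{0,1}(\tilde{z}_{j}/\sigma_{j}),$$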

unconditional MCTMs with $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ are equivalent to a Gaussian copula with flexible marginal distributions. This is no longer true in the conditional regression case considered in Section 4, where our approach can capture nonlinear effects of covariates on all aspects of the distribution, e.g. marginal moments, marginal and joint quantiles, dependence structures etc.
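The rescaling step behind this equivalence can be spelled out as a short worked derivation. The diagonal matrix $\mathbf{D}$ and the correlation matrix $\mathbf{R}$ are notation introduced here for illustration only. With

$$\mathbf{D}=\operatorname{diag}(\sigma_{1},\dots,\sigma_{J}),\qquad\mathbf{R}=\mathbf{D}^{-1}\mathbf{\Sigma}\mathbf{D}^{-1},$$

the vector $\mathbf{D}^{-1}\tilde{\boldsymbol{Z}}$ is multivariate normal with unit variances and correlation matrix $\mathbf{R}$, so that

$$\Phi_{\mathbf{0},\mathbf{\Sigma}}(\tilde{z}_{1},\dots,\tilde{z}_{J})=\Phi_{\mathbf{0},\mathbf{R}}(\tilde{z}_{1}/\sigma_{1},\dots,\tilde{z}_{J}/\sigma_{J}).$$

Inserting $\tilde{z}_{j}=\Phi_{0,\sigma_{j}^{2}}^{-1}(F_{j}(y_{j}))=\sigma_{j}\,\Phi_{0,1}^{-1}(F_{j}(y_{j}))$ gives the usual Gaussian copula form

$$\mathbb{P}(\boldsymbol{Y}\leq\boldsymbol{y})=\Phi_{\mathbf{0},\mathbf{R}}\left(\Phi_{0,1}^{-1}(F_{1}(y_{1})),\dots,\Phi_{0,1}^{-1}(F_{J}(y_{J}))\right).$$

In the bivariate case with $\lambda_{21}=\lambda$, for example,

$$\mathbf{\Sigma}=\begin{pmatrix}1&-\lambda\\ -\lambda&1+\lambda^{2}\end{pmatrix},\qquad R_{12}=\frac{-\lambda}{\sqrt{1+\lambda^{2}}},$$

so a negative coefficient $\lambda$ corresponds to a positive correlation between the transformed responses.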

2.4 Some Properties of the Dependence Structure

We highlight some properties of our MCTM that can be of interest in applied studies.

$\bullet$ The transformed vector $\tilde{\text{\boldmath$ Z $}}$ is jointly multivariate normally distributed such that pairwise dependencies are restricted to linear dependence through linear correlations. Importantly, however, $Y$ is allowed to have a nonlinear dependence structure due to the inverse marginal transformation functions $Y_{j}=\tilde{h}_{j}^{-1}(\tilde{Z}_{j})$ . We illustrate this feature in Figure 1, which shows a bivariate scatterplot with one marginally normally and one marginally gamma distributed response component for the vector $(Y_{1},Y_{2})^{\top}$ (on the left) together with the bivariate normally distributed variables $\tilde{\text{\boldmath$ Z $}}$ (on the right).

$\bullet$ Assuming $\text{$ \mathds{P} $}_{Z}=\operatorname{N}(0,1)$ as reference distribution, the entries in $\mathbf{\Lambda}$ determine the conditional independence structure between the transformed responses $\tilde{Z}_{j}$ and therefore, implicitly, also the observed responses $Y_{j}$ , as is the case for a multivariate Gaussian distribution; see the Appendix, Part A.1, for details.

$\bullet$ Rather than looking at linear correlations, common measures of dependence in the context of multivariate modelling are Spearman’s rho $\rho^{S}$ , Kendall’s tau $\tau^{K}$ and lower/upper quantile dependence $\lambda^{L}/\lambda^{U}$ . These can be computed in closed form using the results known for a Gaussian copula; see again the Appendix, Part A.1, for formulas. One appealing property of these measures is that they are invariant with respect to monotonic transformations of the marginals, and we will use $\rho^{S}$ later in our trivariate application on childhood undernutrition.
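For a bivariate Gaussian copula with correlation $\rho$, the standard closed forms are $\rho^{S}=(6/\pi)\arcsin(\rho/2)$ and $\tau^{K}=(2/\pi)\arcsin(\rho)$. The sketch below evaluates them in Python; the function name and the example coefficient are illustrative assumptions, and the quantile dependence measures are left out because their exact definition is given only in the Appendix.

```python
import math

def gaussian_copula_rank_measures(rho):
    """Spearman's rho and Kendall's tau of a bivariate Gaussian copula with correlation rho."""
    spearman = 6.0 / math.pi * math.asin(rho / 2.0)
    kendall = 2.0 / math.pi * math.asin(rho)
    return spearman, kendall

# Hypothetical bivariate model with lambda_21 = -0.5; the implied correlation of
# the transformed responses is -lambda_21 / sqrt(1 + lambda_21^2).
lam21 = -0.5
rho = -lam21 / math.sqrt(1.0 + lam21 ** 2)
spearman_rho, kendall_tau = gaussian_copula_rank_measures(rho)
```

Because both measures depend only on the copula, they are unaffected by the choice of the marginal transformation functions.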

2.5 Model-Implied Marginal and Conditional Distributions

The relationship of unconditional transformation models and Gaussian copulas can now be employed to facilitate the derivation of model-implied marginal and conditional distributions. The univariate marginal distributions of elements $Y_{j}$ are given by

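$$F_{j}(y_{j})=\mathbb{P}(Y_{j}\leq y_{j})=\Phi_{\mathbf{0},\mathbf{\Sigma}}(\infty,\dots,\infty,\tilde{h}_{j}(y_{j}),\infty,\dots,\infty)=\Phi_{0,\sigma_{j}^{2}}(\tilde{h}_{j}(y_{j})),$$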

but more general versions (i.e. marginals of a subvector of $Y$ ) and conditional distributions are also easily obtained, see the Appendix Part A.2.

Finally, using the marginal CDFs and densities, the marginal quantiles or moments can be derived. The latter can be computed by solving simple univariate numerical integrals, for example:

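$$\mathbb{E}(Y_{j})=\int F_{j}^{-1}(\Phi_{0,\sigma_{j}^{2}}(\tilde{z}))\,\phi_{0,\sigma_{j}^{2}}(\tilde{z})\,d\tilde{z},$$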

for the marginal mean.
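A minimal numerical sketch of this integral, using Simpson's rule and only the Python standard library, is given below. The function name `marginal_mean`, the truncation of the integration range to seven standard deviations and the exponential example marginal are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def marginal_mean(quantile_fn, sigma, n_intervals=2000, width=7.0):
    """Approximate E(Y_j) = int F_j^{-1}(Phi_{0,sigma^2}(z)) phi_{0,sigma^2}(z) dz.

    quantile_fn -- marginal quantile function F_j^{-1} on (0, 1)
    sigma       -- standard deviation of the transformed variable
    n_intervals -- even number of Simpson intervals
    width       -- integration range is [-width * sigma, width * sigma]
    """
    dist = NormalDist(0.0, sigma)
    lower, upper = -width * sigma, width * sigma
    step = (upper - lower) / n_intervals
    total = 0.0
    for i in range(n_intervals + 1):
        z = lower + i * step
        # Simpson weights 1, 4, 2, 4, ..., 2, 4, 1
        weight = 1 if i in (0, n_intervals) else (4 if i % 2 == 1 else 2)
        total += weight * quantile_fn(dist.cdf(z)) * dist.pdf(z)
    return total * step / 3.0

# Hypothetical marginal: unit-rate exponential, F^{-1}(u) = -log(1 - u), true mean 1.
exp_quantile = lambda u: -math.log1p(-u)
mean_estimate = marginal_mean(exp_quantile, sigma=1.5)
```

The result does not depend on the value of `sigma`, since the variance used in the CDF and in the density cancel after a change of variables.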

2.6 Alternative Reference Distributions

Although we have discussed our model specification in the context of a normal reference distribution and a Gaussian copula, these choices can be readily modified. In particular, if a reference distribution $\mathbb{P}_{Z}\neq\operatorname{N}(0,1)$ is chosen, the transformation function has to be modified to

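$$h_{j}(y_{1},\dots,y_{j})=\sum_{\jmath=1}^{j-1}\lambda_{j\jmath}\,\Phi_{0,\sigma_{\jmath}^{2}}^{-1}\left[F_{Z}\left\{\tilde{h}_{\jmath}(y_{\jmath})\right\}\right]+\Phi_{0,\sigma_{j}^{2}}^{-1}\left[F_{Z}\left\{\tilde{h}_{j}(y_{j})\right\}\right].$$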

We therefore obtain $J$ independent random variables $\Phi^{-1}_{0,\sigma_{j}^{2}}\left\{F_{Z}\left[\tilde{h}_{j}(Y_{j})\right]\right\}\sim\operatorname{N}(0,\sigma^{2}_{j})$ . The model then implies marginal distributions

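$$F_{j}(y_{j})=\Phi_{\mathbf{0},\mathbf{\Sigma}}\left(\infty,\dots,\infty,\Phi_{0,\sigma_{j}^{2}}^{-1}\left\{F_{Z}\left[\tilde{h}_{j}(y_{j})\right]\right\},\infty,\dots,\infty\right)=F_{Z}(\tilde{h}_{j}(y_{j})).$$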

Attractive alternative choices for the reference distribution are $F_{Z}^{-1}=\text{logit}$ and $F_{Z}^{-1}=\text{cloglog}$ , because regression coefficients can be interpreted as log-odds ratios and log-hazard ratios (in fact, in the latter case, the marginal model is then a Cox proportional hazards model), respectively.
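To illustrate the modified transformation for a logistic reference distribution, the following sketch maps a marginal transformation value to the normal scale. The helper names and the linear marginal transformation are hypothetical and serve only as an example.

```python
import math
from statistics import NormalDist

def logistic_cdf(v):
    """CDF of the standard logistic distribution, i.e. the inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))

def gaussianise(h_tilde_value, sigma, reference_cdf=logistic_cdf):
    """Map h~_j(y_j) to the normal scale: Phi^{-1}_{0,sigma^2}(F_Z(h~_j(y_j)))."""
    return NormalDist(0.0, sigma).inv_cdf(reference_cdf(h_tilde_value))

# Hypothetical marginal transformation h~(y) = -1 + 2 y on the logit scale.
h_tilde = lambda y: -1.0 + 2.0 * y
marginal_cdf_at_half = logistic_cdf(h_tilde(0.5))   # F_j(0.5) = F_Z(h~(0.5))
z_at_half = gaussianise(h_tilde(0.5), sigma=1.2)
```

In this sketch the marginal CDF is governed by the logistic reference alone, while the normal scale is only used to combine the components through the coefficients $\lambda_{j\jmath}$.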

Extensions beyond the Gaussian copula structure are also conceivable when the linear combination of marginal transformations is replaced by nonlinear specifications. However, those types of models easily lead to identification problems and do not provide direct links to existing parametric copula classes. Accordingly, we leave this topic for future research.

3 Transformation Analysis

This section defines the maximum likelihood estimator and establishes its consistency and asymptotic normality based on suitable parameterisations of the marginal transformation functions $\tilde{h}_{j}$ . It closes with an illustration on bivariate density estimation for highly non-Gaussian data.

3.1 Parameterisation of the Transformation Functions

Following Hothorn et al. (2018), the marginal transformation functions $\tilde{h}_{j}(y_{j})$ are parameterised as linear combinations of the basis-transformed argument $y_{j}$ , such that $\tilde{h}_{j}(y_{j})=\text{\boldmath$ a $}_{j}(y_{j})^{\top}\text{\boldmath$ \vartheta $}_{j}$ is monotonically increasing. The $P_{j}$ -dimensional basis functions $\text{\boldmath$ a $}_{j}:\text{$ \mathds{R} $}\rightarrow\text{$ \mathds{R} $}^{P_{j}}$ with basis coefficients $\text{\boldmath$ \vartheta $}_{j}$ and corresponding derivative $\tilde{h}^{\prime}_{j}(y_{j})=\text{\boldmath$ a $}^{\prime}_{j}(y_{j})^{\top}\text{\boldmath$ \vartheta $}_{j}>0$ are problem-specific, see Hothorn et al. (2018) for suitable choices in different applications. Because marginal transformation functions $\tilde{h}_{j}(y_{j})$ and therefore also the plug-in estimators $\hat{F}_{j}$ of the marginal CDF should be smooth with respect to $y_{j}$ , in principle any polynomial or spline-based basis is a suitable choice for $\text{\boldmath$ a $}_{j}$ . The empirical results of Sections 4 and 5 rely on Bernstein polynomials of order $M$ ; suitable choices of this parameter are discussed in Section 5.1. The basis functions $\text{\boldmath$ a $}_{j}(y_{j})$ are then densities of beta distributions, a choice that is computationally appealing because strict monotonicity can be formulated as a set of linear constraints on the components of the parameters $\text{\boldmath$ \vartheta $}_{j}$ , see Curtis and Ghosh (2011); Farouki (2012) for details. Furthermore, Bernstein polynomials of sufficiently large order $M$ can uniformly approximate any function over an interval as a result of the Weierstrass approximation theorem. Hothorn et al. (2018) investigate the choice of $M$ for univariate CTMs.
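The link between Bernstein polynomials, beta densities and linear monotonicity constraints can be illustrated in a few lines of base R. This is a sketch under our own assumptions: the helper `bernstein_basis`, the support $[0,10]$ and the coefficient values are hypothetical and do not reproduce the implementation used for the empirical results.

```r
# Bernstein basis of order M on [lower, upper], built from beta densities:
# b_{k,M}(u) = dbeta(u, k + 1, M - k + 1) / (M + 1), k = 0, ..., M
bernstein_basis <- function(y, M, lower, upper) {
  u <- (y - lower) / (upper - lower)          # rescale to [0, 1]
  sapply(0:M, function(k) dbeta(u, k + 1, M - k + 1) / (M + 1))
}

M <- 6
y <- seq(0, 10, length.out = 201)
A <- bernstein_basis(y, M, lower = 0, upper = 10)   # 201 x (M + 1) matrix

# Linear monotonicity constraint D %*% theta > 0 with a first-difference matrix
D     <- diff(diag(M + 1))
theta <- c(-3, -2, -1.5, 0, 0.5, 2, 3)       # increasing, hypothetical values
stopifnot(all(D %*% theta > 0))

h <- drop(A %*% theta)                       # transformation function on the grid
range(rowSums(A))                            # basis functions sum to one
all(diff(h) > 0)                             # h is strictly increasing
```

Because the derivative of a Bernstein polynomial is again a Bernstein polynomial whose coefficients are proportional to the successive differences of the original ones, positive differences are sufficient for strict monotonicity.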

3.2 Inference

In the following, we denote the set of parameters describing all marginal transformation functions $\tilde{h}_{j},j=1,\dots,J$ as $\text{\boldmath$ \vartheta $}=(\text{\boldmath$ \vartheta $}_{1}^{\top},\ldots,\text{\boldmath$ \vartheta $}_{J}^{\top})^{\top}\in\text{$ \mathds{R} $}^{\sum_{j=1}^{J}P_{j}}$ , while $\lambda$ contains all unknown elements of $\mathbf{\Lambda}$ , such that $\text{\boldmath$ \theta $}=(\text{\boldmath$ \vartheta $}^{\top},\text{\boldmath$ \lambda $}^{\top})^{\top}$ comprises all unknown model parameters. The parameter space is denoted as $\Theta=\{\text{\boldmath$ \theta $}|h\in\mathcal{H}\}$ , where

[TABLE]

is the space of all strictly monotonic triangular transformation functions. Consequently, the problem of estimating the unknown transformation function $h$ , and thus the unknown distribution function $F_{\text{\boldmath$ Y $}}$ , reduces to the problem of estimating the parameter vector $\theta$ . With the construction of multivariate transformation models, this is conveniently achieved using likelihood-based inference. For $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ , the log-likelihood contribution of a given datum $\text{\boldmath$ y $}_{i}=(y_{i1},\dots,y_{iJ})^{\top}\in\text{$ \mathds{R} $}^{J}$ , $i=1,\ldots,n$ is

[TABLE]

with corresponding score contributions

[TABLE]

for $k=1,\dots,J$ , $1\leq k<\tilde{k}\leq J$ (and zero otherwise) and with $\lambda_{jj}\equiv 1$ . We furthermore define $\mathcal{F}_{i}(\text{\boldmath$ \theta $})=-\frac{\partial^{2}\ell_{i}(\text{\boldmath$ \theta $})}{\partial\text{\boldmath$ \theta $}\partial\text{\boldmath$ \theta $}^{\top}}$ as the $i$ th contribution to the observed Fisher information. Explicit expressions for the entries are given in Appendix Part B. Despite the estimation of a fairly complex multivariate distribution with a Gaussian copula dependence structure and arbitrary marginals, the log-likelihood contributions have a very simple form. In addition, the log-concavity of $f_{Z}$ ensures the concavity of the log-likelihood and thus the existence and uniqueness of the estimated transformation function $\hat{h}$ .
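The simple form of the log-likelihood follows from the change-of-variables formula: since $\mathbf{\Lambda}$ is lower triangular with unit diagonal, its determinant is one, and for the normal reference distribution the contribution of one observation is $-\frac{J}{2}\log(2\pi)-\frac{1}{2}\|\mathbf{\Lambda}\tilde{h}(\text{\boldmath$ y $}_{i})\|^{2}+\sum_{j=1}^{J}\log\tilde{h}_{j}^{\prime}(y_{ij})$. The sketch below evaluates this expression in base R and checks it in a bivariate case with identity transformations. The function `loglik_i` and its arguments are hypothetical names, and the additive constant may be treated differently elsewhere.

```r
# Log-likelihood contribution of one observation y (length J) for a normal
# reference distribution; h and h_prime are lists of marginal transformation
# functions and their derivatives (hypothetical inputs for this sketch).
loglik_i <- function(y, Lambda, h, h_prime) {
  J  <- length(y)
  hy <- mapply(function(f, v) f(v), h, y)         # marginal transforms h_j(y_j)
  dh <- mapply(function(f, v) f(v), h_prime, y)   # derivatives h_j'(y_j)
  z  <- drop(Lambda %*% hy)                       # z = Lambda h(y) ~ N(0, I)
  -J / 2 * log(2 * pi) - 0.5 * sum(z^2) + sum(log(dh))
}

# Bivariate check with identity transformations, where the joint density
# factorises as Y1 ~ N(0, 1) and Y2 | Y1 = y1 ~ N(-lambda * y1, 1)
lambda <- -1.2
Lambda <- matrix(c(1, lambda, 0, 1), nrow = 2)    # lower triangular, unit diagonal
id     <- list(function(y) y, function(y) y)
one    <- list(function(y) 1, function(y) 1)
y      <- c(0.3, -0.7)

ll_model <- loglik_i(y, Lambda, id, one)
ll_check <- dnorm(y[1], log = TRUE) +
  dnorm(y[2], mean = -lambda * y[1], sd = 1, log = TRUE)
all.equal(ll_model, ll_check)
```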

Definition 1.

(Maximum likelihood estimator.) The maximum likelihood estimator (MLE) for the parameters of a multivariate transformation model is given by

[TABLE]

Based on the maximum likelihood estimator $\hat{\text{\boldmath$ \theta $}}_{n}$ , maximum likelihood estimators for the marginal and joint CDFs are also obtained, by plugging in $\hat{\text{\boldmath$ \theta $}}_{n}$ . Specifically, the estimated marginal CDFs are given by $\hat{F}_{j}(y_{j})=\Phi_{0,\hat{\sigma}_{j}^{2}}(\text{\boldmath$ a $}_{j}(y_{j})^{\top}\hat{\text{\boldmath$ \vartheta $}}_{j})$ , where $\hat{\sigma}_{j}^{2}$ is the $j$ th diagonal entry of $\hat{\mathbf{\Sigma}}$ . The estimated joint CDF reads

[TABLE]

3.3 Parametric Inference

In this section, we discuss likelihood-based inference and establish asymptotic results for multivariate transformation models based on the theoretical results derived in Hothorn et al. (2018) for univariate conditional transformation models. Assume $\text{\boldmath$ Y $}_{1},\ldots,\text{\boldmath$ Y $}_{n}\overset{\mbox{\scriptsize{i.i.d.}}}{\sim}F_{\text{\boldmath$ Y $},\text{\boldmath$ \theta $}_{0}}$ where $\text{\boldmath$ \theta $}_{0}$ denotes the true parameter vector, then the following assumptions are made:

(A1)

The parameter space $\mathbf{\Theta}$ is compact.

(A2)

$\text{$ \mathds{E} $}_{\text{\boldmath$ \theta $}_{0}}[\sup_{\text{\boldmath$ \theta $}\in\mathbf{\Theta}}[\log\{f_{\text{\boldmath$ Y $}}(\text{\boldmath$ Y $}|\text{\boldmath$ \theta $})\}]]<\infty$ , where

[TABLE]

(A3)

[TABLE]

(A4)

$\text{$ \mathds{E} $}_{\text{\boldmath$ \theta $}_{0}}(\mathcal{F}(\text{\boldmath$ \theta $}))$ is nonsingular.

(A5)

$0<f_{Z}<\infty$ , $\sup|f^{\prime}_{Z}|<\infty$ , $\sup|f_{Z}^{\prime\prime}|<\infty$ .

Remark 1.

Assumption (A1) is made for convenience, and relaxations of such a condition are given in Vaart (1998). The assumptions in (A2) are rather weak: the first one holds if the functions $a$ are not arbitrarily ill-posed, and the second one holds if the function $\text{$ \mathds{E} $}_{\text{\boldmath$ \theta $}_{0}}[\sup_{\text{\boldmath$ \theta $}\in\mathbf{\Theta}}[\log\{f_{\text{\boldmath$ Y $}}(\text{\boldmath$ Y $}|\text{\boldmath$ \theta $})\}]]$ is strictly convex in $\theta$ (if the assumption did not hold, we would still have convergence to the set $\text{$ \mathds{E} $}_{\text{\boldmath$ \theta $}_{0}}[\sup_{\text{\boldmath$ \theta $}\in\mathbf{\Theta}}[\log\{f_{\text{\boldmath$ Y $}}(\text{\boldmath$ Y $}|\text{\boldmath$ \theta $})\}]]$ ). Assumptions (A3)–(A5) are needed to derive the asymptotic distribution.

Corollary 1.

Assuming (A1)–(A2), the sequence of estimators $\hat{\text{\boldmath$ \theta $}}_{n}$ converges in probability $\hat{\text{\boldmath$ \theta $}}_{n}\overset{\text{$ \mathds{P} $}}{\to}\text{\boldmath$ \theta $}_{0}$ for $n\to\infty$ .

The proof of Corollary 1 follows from Theorem 5.8 of Vaart (1998).

Corollary 2.

Assuming (A1)–(A5), the sequence of estimators $\sqrt{n}(\hat{\text{\boldmath$ \theta $}}_{n}-\text{\boldmath$ \theta $}_{0})$ is asymptotically normally distributed with covariance matrix

[TABLE]

Proof.

By further assumption, $\sqrt{f_{\text{\boldmath$ Y $}}}$ is continuously differentiable in $\theta$ for all $y$ and

[TABLE]

is continuous in $\theta$ , due to (6),(7). Thus, $F_{Y,\text{\boldmath$ \theta $}_{0}}$ is differentiable in quadratic mean by Lemma 7.6 of Vaart (1998). Based on Assumptions (A3)–(A5) and Corollary 1, the claim hence follows from Theorem 5.39 of Vaart (1998). ∎

Remark 2.

As in the univariate case (Hothorn et al., 2018), Corollaries 1 and 2 also extend to the conditional regression models considered in Section 4.

3.4 Parametric Bootstrap

The asymptotic results allow, at least in principle, the derivation of confidence intervals also for transformed model parameters, by using the Delta-rule. However, many quantities of practical interest, such as the correlation matrix of the implied Gaussian copula or the densities of marginal distributions, are indeed highly nonlinear functions of the parameter vector $\theta$ . Accordingly, a parametric bootstrap is a more promising alternative.

Because the proposed multivariate transformation models allow a direct evaluation of estimated joint CDFs, drawing bootstrap samples from the joint distribution is straightforward. For $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ and with estimated marginal transformation functions $\hat{\tilde{h}}_{j}(y_{j})=\text{\boldmath$ a $}_{j}(y_{j})^{\top}\hat{\text{\boldmath$ \vartheta $}}_{j},j=1,\dots,J$ and estimated covariance matrix $\hat{\mathbf{\Sigma}}=\hat{\mathbf{\Lambda}}^{-1}\hat{\mathbf{\Lambda}}^{-\top}$ , the parametric bootstrap can be implemented by Algorithm 1.

The inverse $\hat{\tilde{h}}_{j}^{-1}$ exists because the estimated marginal distribution function is strictly monotonically increasing. For simple basis functions $\text{\boldmath$ a $}_{j}$ (for example, linear functions), the inverse can be computed analytically. For more complex basis functions, numerical inversion has to be applied.
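The sampling step of such a bootstrap can be sketched as follows: draw $\text{\boldmath$ Z $}\sim\operatorname{N}(\mathbf{0},\mathbf{I})$, solve $\hat{\mathbf{\Lambda}}\tilde{\text{\boldmath$ z $}}=\text{\boldmath$ z $}$, and invert each marginal transformation numerically. This is a plausible reading of the procedure rather than a transcription of Algorithm 1; the fitted quantities, the search intervals and all function names below are hypothetical, and the model refit on each bootstrap sample is omitted.

```r
set.seed(1)

# Hypothetical fitted quantities for J = 2 (illustrative values only)
Lambda_hat <- matrix(c(1, -1.5, 0, 1), nrow = 2)   # lower triangular, unit diagonal
h_hat <- list(
  function(y) 2 * log(y),                          # defined for y > 0
  function(y) y + y^3 / 10                         # defined on the real line
)
support <- list(c(1e-8, 1e8), c(-1e3, 1e3))        # search intervals for inversion

# Numerical inverse of a strictly increasing transformation function
h_inverse <- function(h, z, interval) {
  uniroot(function(y) h(y) - z, interval = interval, tol = 1e-10)$root
}

draw_one <- function() {
  z       <- rnorm(2)                              # Z ~ N(0, I)
  z_tilde <- forwardsolve(Lambda_hat, z)           # solves Lambda z_tilde = z
  y <- mapply(function(h, zt, int) h_inverse(h, zt, int),
              h_hat, z_tilde, support)
  list(y = y, z = z, z_tilde = z_tilde)
}

s <- draw_one()
# Round trip: transforming the sampled y recovers the latent normal draw
z_back <- drop(Lambda_hat %*% mapply(function(h, v) h(v), h_hat, s$y))
all.equal(z_back, s$z, tolerance = 1e-6)
```

Using `forwardsolve` exploits the triangular structure of $\hat{\mathbf{\Lambda}}$, so $\hat{\mathbf{\Sigma}}$ never has to be formed explicitly.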

3.5 Illustration: Bivariate Density Estimation

Unconditional MCTMs can be employed for multivariate density estimation. For the famous 1920s cars data (Ezekiel, 1930) consisting of speed and distance needed to stop for $50$ cars, the bivariate distribution was estimated from an unconditional bivariate transformation model with $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ , order $M=6$ of Bernstein polynomials for the two transformation functions, and a constant parameter $\lambda\in\mathbb{R}$ . The model is equivalent to a Gaussian copula; however, the marginal distributions are highly non-Gaussian and were estimated by maximum likelihood simultaneously with the correlation parameter $\lambda$ . The fit of the bivariate density contours and marginal densities is given in Figure 2, which shows that the dependence between speed and distance is clearly nonlinear. We obtained $\hat{\lambda}=-1.633$ (SE $0.273$ ), which corresponds to a Pearson correlation of $0.853$ (and thus a rank correlation $\rho^{S}=0.8415$ ) and hence a highly positive correlation between speed and distance after transformation to normality.
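The mapping from $\hat{\lambda}$ to the reported correlations can be retraced in base R. The sketch below assumes the bivariate parameterisation $\mathbf{\Lambda}=\bigl(\begin{smallmatrix}1&0\\ \lambda&1\end{smallmatrix}\bigr)$ with $\mathbf{\Sigma}=\mathbf{\Lambda}^{-1}\mathbf{\Lambda}^{-\top}$ and the standard relation $\rho^{S}=\frac{6}{\pi}\arcsin(\rho/2)$ for Gaussian copulas. Small differences in the last digit relative to the reported values can arise from the rounding of $\hat{\lambda}$.

```r
lambda_hat <- -1.633

# Sigma = Lambda^{-1} Lambda^{-T} for Lambda = [1 0; lambda 1]
Lambda <- matrix(c(1, lambda_hat, 0, 1), nrow = 2)
Sigma  <- solve(Lambda) %*% t(solve(Lambda))
rho    <- cov2cor(Sigma)[1, 2]                 # Pearson correlation of the copula

# Closed form for J = 2 and Spearman's rho of a Gaussian copula
rho_closed <- -lambda_hat / sqrt(1 + lambda_hat^2)
rho_S      <- 6 / pi * asin(rho / 2)

round(c(pearson = rho, closed_form = rho_closed, spearman = rho_S), 3)
```

A negative $\hat{\lambda}$ therefore corresponds to a positive correlation, which explains the sign change between the estimate and its interpretation.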

4 Extensions to Multivariate Regression

4.1 Multivariate Conditional Transformation Models

Multivariate regression models, i.e. models for conditional multivariate distributions given a specific configuration of covariates $\text{\boldmath$ X $}=\text{\boldmath$ x $}$ , can be derived from the unconditional multivariate transformation models introduced in Section 2. The transformation function $h$ has to be extended to include a potential dependency on covariates $X$ , and the corresponding joint CDF $F_{\text{\boldmath$ Y $}\mid\text{\boldmath$ X $}=\text{\boldmath$ x $}}$ is defined by a conditional transformation function $h(\text{\boldmath$ y $}\mid\text{\boldmath$ x $})=(h_{1}(\text{\boldmath$ y $}\mid\text{\boldmath$ x $}),\ldots,h_{J}(\text{\boldmath$ y $}\mid\text{\boldmath$ x $}))^{\top}$ . By extending the unconditional transformation function (3), we define the $J$ components of a multivariate conditional transformation function given covariates $x$ as

[TABLE]

where $\lambda_{j\jmath}(\text{\boldmath$ x $})$ and $\tilde{h}_{j}(y_{j}\mid\text{\boldmath$ x $})$ are again expressed in terms of basis function expansions.

For the marginal (with respect to the response $y_{j}$ ) conditional (given covariates $\text{\boldmath$ x $}$ ) transformation functions, this leads to a parameterisation

[TABLE]

where the basis functions $\text{\boldmath$ c $}_{j}(y_{j},\text{\boldmath$ x $})$ , in general, depend on both element $y_{j}$ of the response and the covariates $x$ . These can, for example, be constructed as a composition of the basis functions $\text{\boldmath$ a $}_{j}(y_{j})$ for only $y_{j}$ from the previous section combined with a basis $\text{\boldmath$ b $}_{j}(\text{\boldmath$ x $})$ depending exclusively on $x$ . Specifically, a purely additive model results from $\text{\boldmath$ c $}_{j}=(\text{\boldmath$ a $}_{j}^{\top},\text{\boldmath$ b $}_{j}^{\top})^{\top}$ , and a flexible interaction from the tensor product $\text{\boldmath$ c $}_{j}=(\text{\boldmath$ a $}_{j}^{\top}\otimes\text{\boldmath$ b $}_{j}^{\top})^{\top}$ . Response-varying coefficients in distributional regression, or time-varying effects in survival analysis, correspond to a basis $\text{\boldmath$ c $}_{j}=(\text{\boldmath$ a $}_{j}^{\top}\otimes(1,\text{\boldmath$ x $}^{\top})^{\top})^{\top}$ , see also Section 4.2. A simple linear transformation model for the marginal conditional distribution can be parameterised as

[TABLE]

with parameters $\text{\boldmath$ \vartheta $}_{j}=(\text{\boldmath$ \vartheta $}_{j,1}^{\top},\text{\boldmath$ \beta $}_{j}^{\top})^{\top}$ . The model restricts the impact of the covariates to a linear shift $\text{\boldmath$ x $}^{\top}\text{\boldmath$ \beta $}_{j}$ . For arbitrary choices of $\mathbb{P}_{Z}$ , the marginal distribution, given covariates $\text{\boldmath$ X $}=\text{\boldmath$ x $}$ , is then a marginal linear transformation model

[TABLE]

and, consequently, the regression coefficients $\text{\boldmath$ \beta $}_{j}$ can be directly interpreted as marginal log-odds ratios ( $F_{Z}^{-1}=\text{logit}$ ) or log-hazard ratios ( $F_{Z}^{-1}=\text{cloglog}$ ; this is a marginal Cox model). Details of the parameterisations $\text{\boldmath$ c $}_{j}(y_{j},\text{\boldmath$ x $})^{\top}\text{\boldmath$ \vartheta $}_{j}$ for the marginal transformation functions and a discussion of the practical aspects in different areas of application are provided in Hothorn et al. (2018).
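The three basis constructions described above differ only in how $\text{\boldmath$ a $}_{j}$ and the covariate information are combined. A minimal sketch in base R, using made-up basis evaluations for a single observation (all values are hypothetical):

```r
# Hypothetical basis evaluations for one observation (illustrative values)
a_y <- c(0.1, 0.6, 0.3)        # a_j(y_j): basis in the response, length 3
b_x <- c(1.0, -0.4)            # b_j(x):   basis in the covariates, length 2
x   <- c(0.5, 2.0)             # raw covariate vector

c_additive <- c(a_y, b_x)                 # (a^T, b^T)^T: additive model
c_tensor   <- kronecker(a_y, b_x)         # tensor product: flexible interaction
c_varying  <- kronecker(a_y, c(1, x))     # a combined with (1, x^T)^T: response-varying effects

# Number of coefficients required by each parameterisation: 5, 6 and 9
lengths(list(additive = c_additive, tensor = c_tensor, varying = c_varying))
```

The tensor product multiplies the basis dimensions, which is why interaction models need substantially more parameters than additive ones.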

For practical applications, an important and attractive feature of multivariate transformation models for multivariate regression is the possible dependency of $\mathbf{\Lambda}$ on covariates $x$ . Thus, the dependence structure of $Y$ potentially changes as a function of $x$ , if suggested by the data. This feature is implemented by covariate-dependent coefficients of $\mathbf{\Lambda}(\text{\boldmath$ x $})$ . A simple linear model of the form

[TABLE]

is one option. The case $\text{\boldmath$ \gamma $}_{j\jmath}=\mathbb{0}$ implies that the correlation between $Y_{j}$ and $Y_{\jmath}$ does not depend on $x$ . More complex forms of additive models would also be conceivable. Of course, the number of parameters grows quadratically in $J$ , such that models that are too complex may require additional penalisation terms in the likelihood.

4.2 Application: Trivariate Conditional Transformation Models for Undernutrition in India

To illustrate several practical aspects of the parameterisation and interpretation of MCTMs, we present a trivariate analysis of undernutrition in India in the following. Childhood undernutrition is among the most urgent problems in developing and transition countries. A rich database available from Demographic and Health Surveys (DHS, https://dhsprogram.com/) provides nationally representative information about the health and nutritional status of populations in many of those countries. Here we use data from India that were collected in 1998. Overall, the data set comprised 24,316 observations, after pre-processing of the data. For the latter, we use the same steps as in Fahrmeir and Kneib (2011), see the documentation available at http://www.smoothingbook.org together with further details on the pre-processing steps. We used three indicators, stunting, wasting and underweight, as the trivariate response vector, where stunting refers to stunted growth, measured as an insufficient height of a child with respect to age, while wasting and underweight refer to insufficient weight for height and insufficient weight for age, respectively. Hence stunting is an indicator of chronic undernutrition, wasting reflects acute undernutrition and underweight reflects both. Our aim was to model the joint distribution of stunting, wasting and underweight conditional upon the age of the child. To the best of our knowledge, there is no implementation available that could estimate the dependence structure and the marginal distributions nonparametrically and conditional on covariates beyond a trivariate normal distribution (which is implemented in the R add-on package VGAM (Yee, 2020)).

Model Specification.

The focus of our analysis was the variation in the trivariate undernutrition process with respect to the age of the child (in months). To be flexible in the marginal distributions and the dependence structure, we specify response-varying marginal models for $\tilde{h}_{j}(y_{j}\mid\text{age})$ of the form

[TABLE]

while the coefficients of $\Lambda$ are parameterised through

[TABLE]

We choose the normal reference distribution $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ and the basis functions $\text{\boldmath$ a $}_{j}$ and $\text{\boldmath$ b $}(\text{age})$ are Bernstein polynomials of order six (Section 5.1 gives a rationale for choosing this default). Furthermore, the parameters $\text{\boldmath$ \vartheta $}_{j,1}$ were estimated under the constraint $\text{\boldmath$ D $}\text{\boldmath$ \vartheta $}_{j,1}>\mathbf{0}$ , where $D$ is a difference matrix. This leads to monotonically increasing estimated marginal transformation functions $\tilde{h}_{j}$ (Hothorn et al., 2018). No such shape constraint was applied to functions of age, i.e. the parameters $\text{\boldmath$ \beta $}_{j}$ and $\text{\boldmath$ \gamma $}_{j\jmath}$ were estimated unconstrained.

Results for Marginal Distributions.

Figure 3 depicts the estimated marginal conditional CDFs $F_{j}(y_{j}\mid\text{age})$ (first row) and marginal densities $f_{j}(y_{j}\mid\text{age})$ (second row), with the different colours indicating the ages of the children. Clearly, the shapes of the marginals differ for the three indicators: the differences are mostly restricted to a simple shift effect for stunting, varying amounts of asymmetry are present for wasting, and even more complex changes in the shape of the distribution are identified for underweight.

Results for the Dependence Structure.

Figure 4 depicts the conditional rank correlations $\rho^{S}$ between stunting, wasting and underweight as functions of age along with the point estimates and 95% confidence intervals obtained from $B=1,000$ parametrically drawn bootstrap samples (see Algorithm 1 in Section 3.4). The rank correlation between stunting and wasting is initially negative around $-0.4$ for young children and then approaches zero with increasing age of the child. This finding is in line with the study of Klein, Kneib, Klasen and Lang (2015), who reported the results of a bivariate analysis based on normal and $t$ distributions. In our study, the remaining dependencies were positive. The rank correlation between stunting and underweight varied only between $0.6$ and $0.75$ over age, whereas the age-related variation in the remaining rank correlations, in particular between wasting and underweight, was more substantial.

The parametric bootstrap requires re-estimation of the model $B$ times which can be time consuming without parallelisation. A faster alternative in cases where $n$ is large is to draw samples $\hat{\text{\boldmath$ \theta $}}_{(b)}$ , $b=1,\ldots,B$ from the asymptotic normal distribution derived in Corollary 2 with mean vector equal to the MLE from (8) and covariance matrix (9). These samples can then be used to compute $\hat{\tilde{h}}_{(1)},\ldots,\hat{\tilde{h}}_{(B)}$ , $\hat{\mathbf{\Sigma}}_{(1)},\ldots,\hat{\mathbf{\Sigma}}_{(B)}$ . Because in our example $n$ is rather large we compared both versions of confidence intervals and found them to be very similar (compare Figure A in Appendix Part C).
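Drawing from the asymptotic normal distribution only requires a Cholesky factor of the estimated covariance matrix of $\hat{\text{\boldmath$ \theta $}}_{n}$ . The sketch below uses base R with a hypothetical three-dimensional parameter vector; `theta_hat` and `V_hat` are made-up values, and `V_hat` is assumed to be already scaled for the sample size (that is, the covariance of the estimator itself rather than of $\sqrt{n}(\hat{\text{\boldmath$ \theta $}}_{n}-\text{\boldmath$ \theta $}_{0})$ ).

```r
set.seed(42)

# Hypothetical MLE and estimated covariance matrix of the MLE
theta_hat <- c(0.8, -1.6, 0.3)
V_hat     <- matrix(c(0.040, 0.010, 0.000,
                      0.010, 0.075, 0.005,
                      0.000, 0.005, 0.020), nrow = 3)

B <- 1000
R <- chol(V_hat)                                   # V_hat = t(R) %*% R
Z <- matrix(rnorm(B * length(theta_hat)), nrow = B)
theta_b <- sweep(Z %*% R, 2, theta_hat, `+`)       # rows are draws from N(theta_hat, V_hat)

# Pointwise 95% interval for a nonlinear function of the parameters, here the
# correlation implied by the second element treated as lambda (illustrative)
rho_b <- -theta_b[, 2] / sqrt(1 + theta_b[, 2]^2)
quantile(rho_b, probs = c(0.025, 0.975))
```

No model refit is needed for the $B$ draws, which is the source of the speed advantage over the parametric bootstrap.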

5 Empirical Evaluation

In this section, we provide empirical evidence on the performance of our MCTMs. In Section 5.1 we demonstrate that the performance of our flexible transformation model was highly competitive relative to two parametric and correctly specified alternatives. Hence, our model is useful not only when parametric assumptions about the marginals are questionable. In Section 5.2, a trivariate example demonstrates that our model is also applicable to situations beyond the bivariate case, as underpinned by five- and ten-dimensional illustrations in Section 5.3. To the best of our knowledge, there are currently no directly competing models that allow for a similar flexibility.

5.1 Bivariate Simulation

Simulation Design.

We simulated $R=100$ data sets of size $n=1,000$ , following a method similar to that used in the parametric bootstrap procedure:

1. Covariate values $x$ were simulated as i.i.d. variables, where $x\sim\operatorname{U}[-0.9,0.9]$ .

2. The latent variables $\tilde{\text{\boldmath$ z $}}_{ir}\in\text{$ \mathds{R} $}^{2}$ were generated as

[TABLE]

with

[TABLE]

such that

[TABLE]

3. From the latent variables, the observed responses were computed as

[TABLE]

where $\sigma_{i2}^{2}=1+x_{i}^{4}$ and $F_{1}$ and $F_{2}$ are the CDFs of two Dagum distributions with parameters $a_{1}=\exp(2),b_{1}=\exp(1),p_{1}=\exp(1.3)$ and $a_{2}=\exp(1.8),b_{2}=\exp(0),p_{2}=\exp(0.9)$ , respectively. Note that the CDF of an unconditional Dagum distribution (Kleiber, 1996) reads

[TABLE]

This model specification is equivalent to a Gaussian copula model with Dagum marginals, but by its construction, the first marginal is independent of the covariate $x$ , while the scale parameter $b_{2}$ of the second marginal varies as a function of $x$ .
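One data replicate of this design can be sketched in base R. The sketch rests on several assumptions that should be checked against the displayed equations: the Dagum CDF is taken as $F(y)=\{1+(y/b)^{-a}\}^{-p}$ for $y>0$, the latent variables are generated as $\tilde{\text{\boldmath$ z $}}=\mathbf{\Lambda}(x)^{-1}\text{\boldmath$ z $}$ with $\lambda(x)=x^{2}$, and Step 3 is read as the probability integral transform $y_{ij}=F_{j}^{-1}\{\Phi_{0,\sigma_{ij}^{2}}(\tilde{z}_{ij})\}$ with $\sigma_{i1}^{2}=1$. The helper names `pdagum` and `qdagum` are our own.

```r
set.seed(123)

# Dagum CDF and quantile function, F(y) = (1 + (y / b)^(-a))^(-p)
pdagum <- function(y, a, b, p) (1 + (y / b)^(-a))^(-p)
qdagum <- function(u, a, b, p) b * (u^(-1 / p) - 1)^(-1 / a)

n <- 1000
x <- runif(n, -0.9, 0.9)                       # Step 1: covariate
lambda <- x^2                                  # assumed lambda(x) = x^2

# Step 2: latent variables, z_tilde = Lambda(x)^{-1} z with z ~ N(0, I)
z  <- matrix(rnorm(2 * n), ncol = 2)
z1 <- z[, 1]
z2 <- z[, 2] - lambda * z1                     # Var(z2 | x) = 1 + x^4

# Step 3 (assumed form): responses via the probability integral transform
u1 <- pnorm(z1, sd = 1)
u2 <- pnorm(z2, sd = sqrt(1 + x^4))
y1 <- qdagum(u1, a = exp(2),   b = exp(1), p = exp(1.3))
y2 <- qdagum(u2, a = exp(1.8), b = exp(0), p = exp(0.9))

summary(cbind(y1, y2))
```

The conditional variance $1+x^{4}$ of the second latent component matches the value of $\sigma_{i2}^{2}$ stated above, which supports this reading of Step 2.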

As competitors for MCTMs, we considered Bayesian structured additive distributional regression models (Klein, Kneib, Lang and Sohn, 2015), as implemented in the software package BayesX (Belitz et al., 2015), and vector generalised additive models (VGAM, Yee, 2015), as implemented in the corresponding R add-on package (Yee, 2020). For VGAM and BayesX, we employed the true specification, i.e. a Gaussian copula with correlation parameter $\rho(x_{i})=\nicefrac{{-\lambda(x_{i})}}{{\sqrt{1+\lambda(x_{i})^{2}}}}$ and Dagum marginals, in which the parameter $b_{2}$ of the second marginal depends on $x$ but the first marginal as well as the parameters $a_{2}$ and $p_{2}$ do not. For BayesX, both the predictor for $b_{2}$ and the correlation parameter $\rho$ of the Gaussian copula were specified using cubic B-splines with $20$ inner knots on an equidistant grid in the range of $x$ with a second-order random walk prior (following suggested default values by Lang and Brezger, 2004); the other parameters of the marginals were estimated as constants.

Because VGAM does not allow for simultaneous estimation of the marginals and the dependence structure, we first estimated the Dagum margins with constant parameters $a_{1},b_{1},p_{1},a_{2},p_{2}$ and covariate-dependent parameters $b_{2}$ . The copula predictor was then estimated with plugged-in estimates of the margins, using cubic B-splines according to the sm.ps function of the package.

For the multivariate transformation models (denoted as MCTM-6/6), we employed Bernstein polynomials of order six (as in Section 4.2) for both the transformation functions ( $\tilde{h}_{1}$ and $\tilde{h}_{2}$ ) and the parameter $\lambda$ . Because of the monotonicity constraints on $\tilde{h}_{1}$ and $\tilde{h}_{2}$ , the order of the corresponding Bernstein polynomials can be larger without decreasing model performance (Hothorn et al., 2018; Hothorn, 2020). In contrast, too large an order of the Bernstein polynomial for $\lambda$ will result in overly erratic estimates with a negative impact on model performance. We demonstrate this effect empirically by two additional MCTMs with order $M=3$ for $\lambda$ and with orders $M=6$ (MCTM-6/3) and $12$ (MCTM-12/3) for the transformation functions $\tilde{h}_{1}$ and $\tilde{h}_{2}$ .

Measures of Performance.

Let $\hat{\lambda}^{(r)}(x)$ be the estimate of the lower triangular element of $\mathbf{\Lambda}^{(r)}$ obtained from data replicate $r=1,\ldots,R$ . To evaluate the performance of the three competing methods, we investigated the function estimates $\hat{\lambda}^{(r)}(x)$ relative to $\lambda(x)$ as well as the root mean squared errors $\mbox{RMSE}(\lambda,\hat{\lambda}^{(r)})=\sqrt{\frac{1}{G}\sum_{g=1}^{G}(\lambda(x_{g})-\hat{\lambda}^{(r)}(x_{g}))^{2}}$ on a grid $x_{1},\ldots,x_{G}$ of length $G=100$ within the range of $x$ .
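A grid-based RMSE of this kind takes only a few lines of base R. The sketch assumes an equidistant grid, which the text does not specify, and uses a hypothetical estimate with a constant bias of $0.05$ so that the result is known in advance.

```r
# RMSE between a true and an estimated function on an equidistant grid
rmse_grid <- function(f_true, f_hat, lower, upper, G = 100) {
  x_g <- seq(lower, upper, length.out = G)
  sqrt(mean((f_true(x_g) - f_hat(x_g))^2))
}

lambda_true <- function(x) x^2
lambda_hat  <- function(x) x^2 + 0.05          # hypothetical estimate, constant bias
rmse_grid(lambda_true, lambda_hat, lower = -0.9, upper = 0.9)
```

With a constant bias the RMSE equals the absolute bias, here $0.05$ up to floating-point error.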

Results.

Figure 5 shows the estimates for $\lambda(x)=x^{2}$ of the $100$ simulated data sets for BayesX (first panel), VGAM (second panel), and MCTM (last three panels). All three models reproduced the general functional form correctly. However, BayesX yielded the most reasonable smoothing properties, while VGAM produced the wiggliest curves. Although BayesX and VGAM employed the correct model specification in terms of the parametric distribution assumption for the marginal distributions and the correlation parameter, the performance of MCTM is competitive in terms of the RMSE (Figure 6) without the requirement to either estimate the marginal distributions in a first step and plug the empirical copula data in to obtain the dependence structure (as for VGAM) or to specify predictors for parametric marginal distributions (as for BayesX). Both requirements are restrictive in practice because typically it is impossible to pick the ‘correct’ parametric distribution that exactly matches the marginal distributions of the underlying random variables.

A larger value of $M=12$ for the transformation functions did not lead to degraded performance; however, a less flexible parameterisation of $\lambda$ was better able to recover the quadratic function.

5.2 Trivariate Simulation

Simulation Design.

We employed a similar setting as in the previous section, with $R=100$ data sets of size $n=1,000$ , $x\overset{\mbox{\scriptsize{i.i.d.}}}{\sim}\operatorname{U}[-0.9,0.9]$ , but Steps 2 and 3 of the simulation design in Section 5.1 were extended to three dimensions. The latent variables $\tilde{\text{\boldmath$ z $}}_{ir}\in\text{$ \mathds{R} $}^{3}$ were generated as

[TABLE]

with

[TABLE]

Consequently,

[TABLE]

with

[TABLE]

To compute $\text{\boldmath$ y $}_{ir}$ in Step 3, we additionally chose $F_{3}$ to be the CDF of another Dagum distribution with parameters $a_{3}=\exp(1.5)$ , $b_{3}=\exp(-0.9)$ and $p_{3}=\exp(1)$ , such that

[TABLE]

Note that the marginals of $y_{2}$ and $y_{3}$ (or more precisely their marginal parameters $b_{2}$ and $b_{3}$ ) depend on the covariate $x$ .

Results.

Figure 7 shows the function estimates for the three parameters $\lambda_{21}(x_{i})$ , $\lambda_{31}(x_{i})$ and $\lambda_{32}(x_{i})$ . The grey lines indicate overall convincing results for all replicates, without any problematic outliers even though the estimation errors increased with the increasing complexity of the functional form. We omit the RMSE plot because it yielded qualitatively the same results.

5.3 Higher-Dimensional Responses

To investigate the performance of MCTMs in higher response dimensions, we conducted an experiment with $5$ - and $10$ -dimensional responses. Specifically, we employed the settings from Section 5.2 but added $2$ and $7$ marginally Dagum distributed components, respectively, assuming independence, i.e. $\lambda_{ij}(x)=0$ for $i>j,i>3$ . The function estimates for the three parameters $\lambda_{21}(x_{i})$ , $\lambda_{31}(x_{i})$ and $\lambda_{32}(x_{i})$ are qualitatively similar to the results in Figure 7 (see Figure 8), while the true zero components of $\Lambda$ are identified correctly (not shown in the figure). Furthermore, the scale of the RMSE does not increase: it remains below $0.2$ for all replicates and all $\lambda_{ij}(x)$ , as before, so we omit the additional plot. Overall, these results indicate very satisfactory performance even for high-dimensional responses.

6 Summary and Discussion

Renewed interest in transformation models (Manuguerra and Heller, 2010; McLain and Ghosh, 2013; Chernozhukov et al., 2013; Hothorn et al., 2014; Liu et al., 2017; Hothorn et al., 2018; Garcia et al., 2019) has been motivated by the combination of model flexibility, parameter interpretability, and broad applicability of this class of regression models. Rather than assuming a specific distribution of the response, transformation models rely on a suitable transformation of the response into an a priori defined reference distribution. The problem of directly estimating a distribution is replaced by the problem of estimating this transformation function. However, in many cases, conceptually and computationally simple solutions to this problem exist (Hothorn, 2020).

The MCTMs introduced herein apply this core principle to multivariate regression. Similar technical approaches have been used in discriminant analysis (Lin and Jeon, 2003, refer to transnormal models), quantile regression (Fan et al., 2016), receiver-operating characteristic curve analysis (Lyu et al., 2019), and are ubiquitous in neural networks as flows, but the generality of multivariate transformation models for regression purposes had yet to be fully developed. MCTMs enjoy the same flexibility, parameter interpretability and broad applicability as their univariate counterparts. The models are highly adaptive and, in our simulation experiments, performed akin to parametric models that exactly matched the data-generating process. While the parametric models are often restricted to bivariate responses, our MCTMs work well far beyond that as illustrated empirically for up to ten dimensions. Appropriate model parameterisations allow both the marginal distributions and the joint distribution to depend on covariates. An important application of this new model class is the estimation of conditional dependencies, while accounting for covariate effects in the marginal distributions.

Conceptually, our framework carries over to multivariate random vectors that are discrete or censored. In particular, both discrete and censored data can be interpreted as incomplete information $\underaccent{\bar}{\text{\boldmath$ y $}}_{i}<\text{\boldmath$ y $}_{i}\leq\bar{\text{\boldmath$ y $}}_{i}$ , where, instead of exact observations $\text{\boldmath$ y $}_{i}\in\text{$ \mathds{R} $}^{J}$ , only the lower and upper boundaries $\underaccent{\bar}{\text{\boldmath$ y $}}_{i}$ and $\bar{\text{\boldmath$ y $}}_{i}$ , respectively, are observed. For discrete data, the underlying rationale would be that discrete realisations are obtained by discretisation from an underlying continuous process. For censored data, the interval boundaries result from the censoring mechanism, and random right censoring, left censoring as well as interval censoring can be handled by appropriate choices of $\underaccent{\bar}{\text{\boldmath$ y $}}_{i}$ and $\bar{\text{\boldmath$ y $}}_{i}$ .

For $\mathbb{P}_{Z}=\operatorname{N}(0,1)$ , the log-likelihood contributions are then given by

[TABLE]

where $\tilde{h}(\text{\boldmath$ y $})=(\tilde{h}_{1}(y_{1}),\dots,\tilde{h}_{J}(y_{J}))$ and $\mathbf{\Sigma}=\mathbf{\Lambda}^{-1}\mathbf{\Lambda}^{-\top}$ . Numerical approximations need to be applied in evaluating these likelihood contributions. For $J>2$ , the quasi-Monte-Carlo algorithm by Genz (1992) seems especially appropriate because it relies on the Cholesky factor $\mathbf{\Lambda}$ of the precision matrix rather than the covariance matrix $\mathbf{\Sigma}$ . Nonetheless, the simplicity and explicit structure of the score contributions from the previous sections do not carry over to these more general cases, and the numerical evaluation of the log-likelihood becomes more demanding. We will investigate these challenges in some future work.
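For $J=2$ the required rectangle probability reduces to a one-dimensional integral, because $\tilde{Z}_{1}\sim\operatorname{N}(0,1)$ and $\tilde{Z}_{2}\mid\tilde{Z}_{1}=z_{1}\sim\operatorname{N}(-\lambda z_{1},1)$. The following base R sketch uses this factorisation instead of a dedicated multivariate normal routine; the function name and the transformed bounds are hypothetical, and for larger $J$ an algorithm such as the one by Genz (1992) would be needed.

```r
# Log-likelihood contribution of an interval-censored bivariate observation,
# given transformed bounds lo = h(y_lower) and up = h(y_upper); uses
# Z1 ~ N(0, 1) and Z2 | Z1 = z1 ~ N(-lambda * z1, 1)
loglik_interval <- function(lo, up, lambda) {
  integrand <- function(z1) {
    dnorm(z1) * (pnorm(up[2] + lambda * z1) - pnorm(lo[2] + lambda * z1))
  }
  log(integrate(integrand, lower = lo[1], upper = up[1])$value)
}

lo <- c(-0.5, 0.0)   # hypothetical transformed lower bounds
up <- c( 1.0, 1.5)   # hypothetical transformed upper bounds

# With lambda = 0 the components are independent and the probability factorises
ll_indep <- loglik_interval(lo, up, lambda = 0)
ll_check <- log((pnorm(up[1]) - pnorm(lo[1])) * (pnorm(up[2]) - pnorm(lo[2])))
all.equal(ll_indep, ll_check, tolerance = 1e-6)

loglik_interval(lo, up, lambda = -1.2)   # dependent case
```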

More complex models, for example, models featuring additive or spatial effects are conceptually easy to integrate if present in the data because they only require the addition of suitable penalty terms to the log-likelihood. In addition, the analytic expressions for score and Fisher information functions presented herein apply only to a standard normal reference distribution; adaptations to the general case beyond linear dependence structures are still needed.

Computational Details

A reference implementation of conditional and unconditional multivariate transformation models is available in package tram (Hothorn et al., 2020). Augmented Lagrangian Minimization implemented in the auglag() function of package alabama (Varadhan, 2015) was used for optimising the log-likelihood, with starting values obtained from marginal transformation models.

Source code for the reproduction of the empirical results presented in Sections 4 and 5 is distributed as part of this package; the two illustrations can be executed from within R:

install.packages("tram")
library("tram")
example(mmlt)
demo("undernutrition")

Empirical results were obtained using R (version 4.0.2, R Core Team, 2020), a developer version of BayesX (Belitz et al., 2015), tram (version 0.5-1, Hothorn et al., 2020), and VGAM (version 1.1-3, Yee, 2020). Source code for simulations is available from

system.file("simulations",package="tram")

Acknowledgements

The authors thank two referees and an associate editor for helpful comments that improved the manuscript. The authors gratefully acknowledge funding through the Emmy Noether grant KL 3037/1-1 (Nadja Klein) from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), and SNF grant 200021-184603 from the Swiss National Science Foundation (Torsten Hothorn).

Appendix Part A Some Properties of the MCTM

Part A.1 Dependence Structure

Let $\text{\boldmath$ P $}=\mathbf{\Sigma}^{-1}=\mathbf{\Lambda}^{\top}\mathbf{\Lambda}$ be the precision matrix of the distribution of $\tilde{Z}$ , and $\text{\boldmath$ Y $}_{-j\jmath}$ and $\text{\boldmath$ \tilde{Z} $}_{-j\jmath}$ denote the vectors of the observed and transformed responses, excluding elements $j$ and $\jmath$ . Let furthermore $\text{\boldmath$ R $}=\text{\boldmath$ S $}\mathbf{\Sigma}\text{\boldmath$ S $}$ , $\text{\boldmath$ S $}=\operatorname{diag}(\sigma_{1}^{-1},\ldots,\sigma_{J}^{-1})$ , be the corresponding correlation matrix and $C_{\scriptsize{Ga}}^{j\jmath}$ the bivariate sub-Gaussian copula of components $j$ and $\jmath$ and correlation matrix entry $R[j,\jmath]$ .

**Conditional Independence.** The entries in $\mathbf{\Lambda}$ determine the conditional independence structure between the transformed responses $\tilde{Z}_{j}$ (and therefore, implicitly, also the observed responses $Y_{j}$), i.e.

[TABLE]

Proof.

Since $\text{$ \mathds{P} $}(\text{\boldmath$ Y $}\leq\text{\boldmath$ y $})=\text{$ \mathds{P} $}(\text{\boldmath$ \tilde{Z} $}\leq\text{\boldmath$ \tilde{z} $})$ holds, the dependence structure of $\text{\boldmath$ Y $}$ is that of $\tilde{\text{\boldmath$ Z $}}$. Moreover, $\tilde{\text{\boldmath$ Z $}}\sim\operatorname{N}_{J}(\mathbf{0}_{J},\mathbf{\Sigma})$ with $\mathbf{\Sigma}=\mathbf{\Lambda}^{-1}\mathbf{\Lambda}^{-\top}$. Hence, the result is a direct consequence of Theorem 2.2 of Rue and Held (2005). ∎
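As a small numerical illustration of this property, the sketch below builds a precision matrix from a made-up $\mathbf{\Lambda}$ for $J=3$ and compares the partial correlation of components 1 and 3 given component 2 with their marginal correlation. The entries of $\mathbf{\Lambda}$ are hypothetical, and $\mathbf{\Lambda}$ is assumed to be unit lower triangular.

```r
# Hypothetical unit lower-triangular Lambda for J = 3 with a zero in position [3, 1]
Lambda <- rbind(c(1.0,  0.0, 0),
                c(0.5,  1.0, 0),
                c(0.0, -0.3, 1))
P <- crossprod(Lambda)     # precision matrix P = Lambda^T Lambda
Sigma <- solve(P)          # covariance Sigma = Lambda^{-1} Lambda^{-T}

# Partial correlation of components 1 and 3 given component 2 (from the precision matrix)
partial_13 <- -P[1, 3] / sqrt(P[1, 1] * P[3, 3])
# Marginal correlation of components 1 and 3 (from the covariance matrix)
marginal_13 <- cov2cor(Sigma)[1, 3]

c(partial = partial_13, marginal = marginal_13)
```

In this example the zero in $\mathbf{\Lambda}[3,1]$ carries over to $\text{\boldmath$ P $}[1,3]$, so the two components are conditionally independent given the second one although they remain marginally correlated.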

**Measures of Dependence.**

(i)

For $q\in(0,1)$ and $U_{j}=F_{j}(\tilde{h}_{j}^{-1}(\tilde{Z}_{j}))$, $U_{\jmath}=F_{\jmath}(\tilde{h}_{\jmath}^{-1}(\tilde{Z}_{\jmath}))$, the lower and upper quantile dependence are

[TABLE]

(ii)

The lower and upper extremal tail dependence are

[TABLE]

(iii)

Spearman’s rho is

[TABLE]

where $R[j,\jmath]$ is as defined above and is a function of $\mathbf{\Lambda}$ .

(iv)

Kendall’s tau is

[TABLE]

Proof.

The proofs for (i) and (ii) can be found in Coles et al. (1999), and for (iii) and (iv) in Fang et al. (2002). ∎
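For orientation, the rank correlations in (iii) and (iv) can be evaluated numerically from a correlation entry $R[j,\jmath]$. The sketch below uses the standard closed forms for a Gaussian copula, $\rho^{S}=\frac{6}{\pi}\arcsin(R[j,\jmath]/2)$ and $\tau=\frac{2}{\pi}\arcsin(R[j,\jmath])$; the function names are illustrative and the input values are made up.

```r
# Closed-form rank correlations implied by a Gaussian copula with correlation r
# (standard results; function names are illustrative only)
spearman_rho <- function(r) 6 / pi * asin(r / 2)
kendall_tau  <- function(r) 2 / pi * asin(r)

r <- c(-0.5, 0, 0.5)   # hypothetical values of R[j, j']
round(cbind(r, rho_S = spearman_rho(r), tau = kendall_tau(r)), 3)
```

Both measures are monotone in $R[j,\jmath]$ and invariant under the marginal transformations $\tilde{h}_{j}$, which is why they depend on the model only through $\mathbf{\Lambda}$.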

Part A.2 Details on Marginal and Conditional Distributions

Assume that the random vector $Y$ is partitioned into $\text{\boldmath$ Y $}_{\mathcal{I}}$ and $\text{\boldmath$ Y $}_{\mathcal{I}^{\mathcal{C}}}$ , where $\mathcal{I}\subset\{1,\ldots,J\}$ is a non-empty set of $I$ indices $j_{1},\ldots,j_{I}$ and $\mathcal{I}^{\mathcal{C}}=\{1,\ldots,J\}\setminus\mathcal{I}$ is its complement, consisting of all indices not contained in $\mathcal{I}$ . The vectors $\text{\boldmath$ \tilde{Z} $}_{\mathcal{I}}$ and $\text{\boldmath$ \tilde{Z} $}_{\mathcal{I}^{\mathcal{C}}}$ can be similarly defined. The marginal distribution of $\text{\boldmath$ \tilde{Z} $}_{\mathcal{I}}$ is then given by $\operatorname{N}_{I}(\mathbf{0}_{I},\mathbf{\Sigma}_{\mathcal{I}})$ , where $\mathbf{\Sigma}_{\mathcal{I}}$ is the submatrix of $\mathbf{\Sigma}$ containing the elements related to the subset $\mathcal{I}$ . We can therefore deduce both the marginal CDF and the density of $\text{\boldmath$ Y $}_{\mathcal{I}}$ as

[TABLE]

where $\tilde{z}_{j_{i}}=\tilde{h}_{j_{i}}(y_{j_{i}})$ and

[TABLE]

We use a similar process for the conditional distribution of $\text{\boldmath$ Y $}_{\mathcal{I}}$ , given $\text{\boldmath$ Y $}_{\mathcal{I}^{\mathcal{C}}}=\text{\boldmath$ y $}_{\mathcal{I}^{\mathcal{C}}}$ , and note that

[TABLE]

From the rules for multivariate normal distributions we furthermore obtain

[TABLE]

with

[TABLE]

where $\mathbf{\Sigma}_{\mathcal{I}^{\mathcal{C}}}$ and $\mathbf{\Sigma}_{\mathcal{I},\mathcal{I}^{\mathcal{C}}}$ denote the sub-blocks of $\mathbf{\Sigma}$ corresponding to the respective index sets. Therefore, the conditional density of $\text{\boldmath$ Y $}_{\mathcal{I}}$ given $\text{\boldmath$ Y $}_{\mathcal{I}^{\mathcal{C}}}$ is

[TABLE]
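The conditional moments on the scale of $\tilde{\text{\boldmath$ Z $}}$ can be computed in a few lines. The following sketch applies the usual multivariate normal rules, with conditional mean $\mathbf{\Sigma}_{\mathcal{I},\mathcal{I}^{\mathcal{C}}}\mathbf{\Sigma}_{\mathcal{I}^{\mathcal{C}}}^{-1}\tilde{\text{\boldmath$ z $}}_{\mathcal{I}^{\mathcal{C}}}$ and conditional covariance $\mathbf{\Sigma}_{\mathcal{I}}-\mathbf{\Sigma}_{\mathcal{I},\mathcal{I}^{\mathcal{C}}}\mathbf{\Sigma}_{\mathcal{I}^{\mathcal{C}}}^{-1}\mathbf{\Sigma}_{\mathcal{I}^{\mathcal{C}},\mathcal{I}}$. The helper cond_normal() and the numerical values are hypothetical and not taken from the paper.

```r
# Conditional distribution of Z~_I given Z~_{I^C} = z_rest for Z~ ~ N_J(0, Sigma).
# cond_normal() is a hypothetical helper; idx holds the indices in I.
cond_normal <- function(Sigma, idx, z_rest) {
  S_II <- Sigma[idx, idx, drop = FALSE]
  S_IC <- Sigma[idx, -idx, drop = FALSE]
  S_CC <- Sigma[-idx, -idx, drop = FALSE]
  W <- S_IC %*% solve(S_CC)          # weights Sigma_{I,I^C} Sigma_{I^C}^{-1}
  list(mean = drop(W %*% z_rest),    # conditional mean
       cov  = S_II - W %*% t(S_IC))  # conditional covariance
}

# Hypothetical lower-triangular Lambda for J = 3; condition on the third component
Lambda <- rbind(c(1.0,  0.0, 0),
                c(0.5,  1.0, 0),
                c(0.2, -0.3, 1))
Sigma <- solve(crossprod(Lambda))    # Sigma = Lambda^{-1} Lambda^{-T}
cn <- cond_normal(Sigma, idx = c(1, 2), z_rest = 0.8)
cn
```

The same quantities can be obtained from the precision matrix, since the conditional covariance equals the inverse of the corresponding diagonal block of $\mathbf{\Lambda}^{\top}\mathbf{\Lambda}$; this offers a convenient consistency check.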

Appendix Part B Observed Fisher Information

Let $\mathcal{F}_{i}(\text{\boldmath$ \theta $})=-\frac{\partial^{2}l_{i}(\text{\boldmath$ \theta $})}{\partial\text{\boldmath$ \theta $}\partial\text{\boldmath$ \theta $}^{\top}}=-\frac{\partial^{2}l_{i}(\text{\boldmath$ \theta $})}{\partial(\text{\boldmath$ \vartheta $}^{\top},\text{\boldmath$ \lambda $}^{\top})^{\top}\partial(\text{\boldmath$ \vartheta $}^{\top},\text{\boldmath$ \lambda $}^{\top})}$ . Then the elements of the observed Fisher information are given by

[TABLE]

and

[TABLE]

Appendix Part C Fast Alternative to Parametric Bootstrap

As discussed in Section 3.4, a fast alternative to the parametric bootstrap for cases in which $n$ is large is to draw samples $\hat{\text{\boldmath$ \theta $}}_{(b)}$, $b=1,\ldots,B$, from the asymptotic normal distribution of $\hat{\text{\boldmath$ \theta $}}$ given in Corollary 2. Figure A shows the $95\%$ confidence intervals of the conditional rank correlations $\rho^{S}$ between stunting, wasting and underweight as functions of age, obtained from the parametric bootstrap and from the fast alternative.
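The sampling idea can be sketched for the simplest case of $J=2$ and a single dependence parameter $\lambda$. The estimate, its standard error and the number of draws below are made up for illustration, $\mathbf{\Lambda}$ is assumed to be unit lower triangular, and the Spearman formula is the standard Gaussian-copula closed form; none of this reproduces the undernutrition analysis.

```r
# Sketch of the asymptotic-normal alternative for J = 2 and one parameter lambda.
# All numbers are made up for illustration.
set.seed(42)
lambda_hat <- -0.8   # hypothetical ML estimate of lambda
se_lambda  <- 0.05   # hypothetical standard error from the inverse Fisher information
B <- 1000            # number of draws

# Spearman's rho implied by lambda when Lambda = rbind(c(1, 0), c(lambda, 1)):
# Sigma = Lambda^{-1} Lambda^{-T} has correlation r = -lambda / sqrt(1 + lambda^2)
rho_S <- function(lambda) {
  r <- -lambda / sqrt(1 + lambda^2)
  6 / pi * asin(r / 2)
}

lambda_b <- rnorm(B, mean = lambda_hat, sd = se_lambda)   # draws of the parameter
ci <- quantile(rho_S(lambda_b), probs = c(0.025, 0.975))  # 95% percentile interval
ci
```

With a full parameter vector, the univariate draw would be replaced by a multivariate normal draw based on a Cholesky factor of the estimated covariance matrix, and the interval would be computed pointwise for each covariate value of interest.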

Bibliography

Acar, E. F., Craiu, R. V. and Yao, F. (2011). Dependence calibration in conditional copulas: A nonparametric approach, Biometrics 67: 445–453.
Belitz, C., Brezger, A., Klein, N., Kneib, T., Lang, S. and Umlauf, N. (2015). BayesX - Software for Bayesian inference in structured additive regression models. Version 3.0.2. Anonymous, read-only SVN access to the BayesX source code is available via https://svn.gwdg.de/svn/bayesx/ with user name "anonymous" and empty password. Full details at www.bayesx.org.
Carlier, G., Chernozhukov, V. and Galichon, A. (2016). Vector quantile regression: An optimal transport approach, The Annals of Statistics 44(3): 1165–1192.
Carlier, G., Chernozhukov, V. and Galichon, A. (2017). Vector quantile regression beyond the specified case, Journal of Multivariate Analysis 161: 96–102.
Chernozhukov, V., Fernández-Val, I. and Melly, B. (2013). Inference on counterfactual distributions, Econometrica 81(6): 2205–2268.
Chernozhukov, V., Galichon, A., Hallin, M. and Henry, M. (2017). Monge-Kantorovich depth, quantiles, ranks and signs, The Annals of Statistics 45(1): 223–256.
Coles, S., Heffernan, J. and Tawn, J. (1999). Dependence measures for extreme value analyses, Extremes 2: 339–365.
