TL;DR
This paper introduces a noise regularization technique for neural network-based conditional density estimation, improving model generalization and outperforming existing methods across multiple datasets.
Contribution
A novel, model-agnostic noise regularization method for CDE that enhances generalization and is proven to be asymptotically consistent.
Findings
Noise regularization outperforms other regularization methods.
Method is effective across seven datasets and three CDE models.
Neural CDE becomes preferable over non- and semi-parametric approaches.
Abstract
Modelling statistical relationships beyond the conditional mean is crucial in many settings. Conditional density estimation (CDE) aims to learn the full conditional probability density from data. Though highly expressive, neural network based CDE models can suffer from severe over-fitting when trained with the maximum likelihood objective. Due to the inherent structure of such models, classical regularization approaches in the parameter space are rendered ineffective. To address this issue, we develop a model-agnostic noise regularization method for CDE that adds random perturbations to the data during training. We demonstrate that the proposed approach corresponds to a smoothness regularization and prove its asymptotic consistency. In our experiments, noise regularization significantly and consistently outperforms other regularization methods across seven data sets and three CDE…

Table 1: Test log-likelihoods (mean ± standard deviation) of the regularization methods on the real-world data sets.

| Model | Regularization | Euro Stoxx | NYC Taxi | Boston | Concrete | Energy |
|---|---|---|---|---|---|---|
| MDN | noise (ours) | 3.94±0.03 | 5.25±0.04 | -2.49±0.11 | -2.92±0.08 | -1.04±0.09 |
| MDN | weight decay | 3.78±0.06 | 5.07±0.04 | -3.29±0.32 | -3.33±0.14 | -1.21±0.10 |
| MDN | l1 reg. | 3.19±0.19 | 5.00±0.05 | -4.01±0.36 | -3.87±0.29 | -1.44±0.22 |
| MDN | l2 reg. | 3.16±0.21 | 4.99±0.04 | -4.64±0.52 | -3.84±0.26 | -1.55±0.26 |
| MDN | Bayes | 3.26±0.43 | 5.08±0.03 | -3.46±0.47 | -3.19±0.21 | -1.25±0.23 |
| KMN | noise (ours) | 3.92±0.01 | 5.39±0.02 | -2.52±0.08 | -3.09±0.06 | -1.62±0.06 |
| KMN | weight decay | 3.85±0.03 | 5.31±0.02 | -2.69±0.15 | -3.15±0.06 | -1.79±0.12 |
| KMN | l1 reg. | 3.76±0.04 | 5.39±0.02 | -2.75±0.13 | -3.25±0.07 | -1.82±0.10 |
| KMN | l2 reg. | 3.71±0.05 | 5.37±0.02 | -2.66±0.13 | -3.18±0.07 | -1.79±0.13 |
| KMN | Bayes | 3.33±0.02 | 4.47±0.02 | -3.40±0.11 | -4.08±0.05 | -3.65±0.07 |
| NFN | noise (ours) | 3.90±0.01 | 5.20±0.03 | -2.48±0.11 | -3.03±0.13 | -1.21±0.08 |
| NFN | weight decay | 3.82±0.06 | 5.19±0.03 | -3.12±0.39 | -3.12±0.14 | -1.22±0.16 |
| NFN | l1 reg. | 3.50±0.10 | 5.12±0.05 | -12.6±12.8 | -3.91±0.52 | -1.29±0.16 |
| NFN | l2 reg. | 3.50±0.09 | 5.13±0.05 | -14.2±9.60 | -3.99±0.66 | -1.34±0.19 |
| NFN | Bayes | 3.34±0.33 | 5.10±0.03 | -5.99±2.45 | -3.55±0.46 | -1.11±0.22 |

Table 2: Test log-likelihoods (mean ± standard deviation) of the neural network based CDE models and the non- and semi-parametric baselines.

| | Euro Stoxx | NYC Taxi | Boston | Concrete | Energy |
|---|---|---|---|---|---|
| num. train obs. | 2536 | 8000 | 405 | 824 | 615 |
| MDN (ours) | 4.00±0.03 | 5.41±0.02 | -2.39±0.02 | -2.89±0.03 | -1.04±0.05 |
| KMN (ours) | 3.98±0.03 | 5.42±0.02 | -2.44±0.02 | -3.06±0.03 | -1.59±0.09 |
| NFN (ours) | 4.00±0.03 | 5.12±0.03 | -2.40±0.04 | -2.93±0.02 | -1.23±0.06 |
| LSCDE | 3.44±0.10 | 4.85±0.02 | -2.78±0.00 | -3.63±0.00 | -2.16±0.02 |
| CKDE R.O.T. | 3.36±0.01 | 4.87±0.02 | -3.12±0.03 | -3.78±0.02 | -2.90±0.01 |
| CKDE CV-ML | 3.87±0.01 | 5.27±0.06 | -2.76±0.26 | -3.35±0.13 | -1.14±0.02 |
| NKDE R.O.T. | 3.16±0.02 | 4.34±0.04 | -3.52±0.05 | -4.08±0.02 | -3.35±0.03 |
| NKDE CV-ML | 3.41±0.02 | 4.93±0.08 | -3.34±0.13 | -3.93±0.05 | -2.21±0.12 |

Noise Regularization for Conditional Density Estimation
Jonas Rothfuss
Fabio Ferreira
Simon Boehm
Simon Walther
Maxim Ulrich
Tamim Asfour
Andreas Krause
Abstract
Capturing statistical relationships beyond the conditional mean is crucial in many applications. To this end, conditional density estimation (CDE) aims to learn the full conditional probability density from data. Though expressive, neural network based CDE models can suffer from severe over-fitting when trained with the maximum likelihood objective. Their particular structure renders classical regularization in the parameter space ineffective. To address this challenge, we propose a model-agnostic noise regularization method for CDE that adds carefully controlled random perturbations to the data during training. We prove that the proposed approach corresponds to a smoothness regularization and establish its asymptotic consistency. Our extensive experiments show that noise regularization consistently outperforms other regularization methods across a range of neural CDE models. Furthermore, we demonstrate the effectiveness of noise regularized neural CDE over classical non- and semi-parametric methods, even when training data is scarce.
Machine Learning, ICML
1 Introduction
While regression analysis aims to describe the conditional mean $\mathbb{E}[y|x]$ of a response $y$ given inputs $x$, many problems such as risk management and planning under uncertainty require gaining insight about deviations from the mean and their associated likelihood. The stochastic dependency of $y$ on $x$ can be captured by modeling the conditional probability density $p(y|x)$. Inferring such a density function from a set of observations is typically referred to as conditional density estimation (CDE) and is the focus of this paper.
In the recent machine learning literature, there has been a resurgence of interest in flexible density models based on neural networks (Dinh et al., 2017; Ambrogioni et al., 2017; Kingma & Dhariwal, 2018). Since this line of work mainly focuses on the modelling of images based on large scale data sets, over-fitting and noisy observations are of minor concern in this context. In contrast, we are interested in CDE in settings where data may be scarce and noisy. When combined with maximum likelihood estimation, the flexibility of such high-capacity models results in over-fitting and poor generalization. While regression typically assumes a Gaussian noise model, CDE uses expressive distribution families to model deviations from the conditional mean. Hence, the over-fitting problem tends to be even more severe in CDE than in regression. Standard regularization of the neural network weights such as weight decay (Pratt & Hanson, 1989) has been shown effective for regression and classification. However, in the context of CDE, the output of the neural network merely controls the parameters of a density model such as a Gaussian Mixture or Normalizing Flow. This makes the standard regularization methods in the parameter space less effective and hard to analyze.
The lack of an effective regularization scheme renders neural network based CDE impractical in most scenarios where data is scarce. As a result, classical non- and semi-parametric CDE tends to be the primary method in application areas such as econometrics (Zambom & Dias, 2013).
To address this issue, we propose and analyze the use of noise regularization, an approach well-studied in the context of regression and classification, for the purpose of CDE. By adding small, carefully controlled, random perturbations to the data during training, the conditional density estimate is smoothed and tends to generalize better. In fact, we show that adding noise during maximum likelihood estimation is equivalent to a penalty on large second derivatives at the training point locations, which results in an inductive bias towards smoother density estimates. Moreover, under mild regularity conditions, we show that the proposed regularization scheme is consistent, converging to the unbiased maximum likelihood estimator. This not only supports the soundness of the proposed method but also provides insight into how to set the regularization intensity relative to the data dimensionality and training set size.
Overall, the proposed noise regularization scheme is easy to implement and agnostic to the parameterization of the CDE model. We empirically demonstrate its effectiveness on three different neural network based models. The experimental results show that noise regularization outperforms other regularization methods consistently across various data sets. In a comprehensive benchmark study, we demonstrate that, with noise regularization, neural network based CDE is able to significantly improve upon state-of-the-art non-parametric estimators, even when only 400 training observations are available. This is a relevant, and perhaps surprising, finding since non-parametric CDE is considered one of the primary approaches for settings with scarce and noisy data (Zambom & Dias, 2013). By using non-parametric regularization for training parametric high-capacity models, we are able to combine inductive biases from both worlds, making neural networks the preferable CDE method, even for low-dimensional and small-scale tasks.
2 Background
Density Estimation.
Let $X$ be a random variable with probability density function (PDF) $p(x)$ defined over the domain $\mathcal{X} \subseteq \mathbb{R}^{d_x}$. Given a collection $\mathcal{D} = \{x_1, \ldots, x_n\}$ of observations sampled from $p(x)$, the goal is to find a good estimate $\hat{f}(x)$ of the true density function $p$. In parametric estimation, the PDF $\hat{f}$ is assumed to belong to a parametric family $\mathcal{F} = \{\hat{f}_\theta(\cdot) \mid \theta \in \Theta\}$ where the density function is described by a finite dimensional parameter $\theta \in \Theta$. The standard method for estimating $\theta$ is maximum likelihood estimation (MLE), wherein $\theta$ is chosen so that the likelihood of the data $\mathcal{D}$ is maximized. This is equivalent to minimizing the Kullback-Leibler divergence between the empirical data distribution $p_{\mathcal{D}}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(\|x - x_i\|)$ (i.e., mixture of point masses in the observations $x_i$) and the parametric distribution $\hat{f}_\theta$:
$$\theta^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log \hat{f}_\theta(x_i) = \arg\min_{\theta \in \Theta} D_{KL}(p_{\mathcal{D}} \,\|\, \hat{f}_\theta) \tag{1}$$
From a geometric perspective, (1) can be viewed as an orthogonal projection of $p_{\mathcal{D}}(x)$ onto $\mathcal{F}$ w.r.t. the KL-divergence. Hence, (1) is also commonly referred to as an M-projection (Murphy, 2012; Nielsen, 2018). In contrast, non-parametric density estimators make implicit smoothness assumptions through a kernel function. The most popular non-parametric method, kernel density estimation (KDE), places a symmetric density function $K(z)$, the so-called kernel, on each training data point $x_i$ (Rosenblatt, 1956; Parzen, 1962). The resulting density estimate reads as $\hat{q}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$. One popular choice of $K(\cdot)$ is a Gaussian $K(z) = (2\pi)^{-\frac{d}{2}} \exp\left(-\frac{1}{2} \|z\|^2\right)$. Beyond the appropriate choice of $K(\cdot)$, a central challenge is the selection of the bandwidth parameter $h$ which controls the smoothness of the estimated PDF (Li & Racine, 2007).
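To make the role of the bandwidth concrete, the following is a minimal NumPy sketch of a Gaussian kernel density estimate. The function name `gaussian_kde_pdf` and its argument layout are illustrative assumptions and not part of the paper's code.

```python
import numpy as np

def gaussian_kde_pdf(query, data, h):
    """Evaluate a Gaussian KDE with bandwidth h at the query points.

    query: array of shape (m, d); data: array of shape (n, d).
    Both names and the 2-D layout are assumptions made for this sketch.
    """
    query = np.asarray(query, dtype=float)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    # Pairwise scaled differences (x - x_i) / h, shape (m, n, d).
    diff = (query[:, None, :] - data[None, :, :]) / h
    # Standard Gaussian kernel K(z) = (2*pi)^(-d/2) * exp(-||z||^2 / 2).
    kernel_vals = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    # Average over the data points and rescale by h^d so the estimate integrates to one.
    return kernel_vals.sum(axis=1) / (n * h ** d)
```

A smaller `h` yields a spikier estimate concentrated on the training points, while a larger `h` yields a smoother one.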
Conditional Density Estimation (CDE).
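Let $(X, Y)$ be a pair of random variables with respective domains $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ and realizations $x$ and $y$. Let $p(y|x) = p(x, y) / p(x)$ denote the conditional probability density of $y$ given $x$. Typically, $Y$ is referred to as a dependent variable (explained variable) and $X$ as conditional (explanatory) variable. Given a dataset of observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ drawn from the joint distribution $(x_i, y_i) \sim p(x, y)$, the aim of conditional density estimation (CDE) is to find an estimate $\hat{f}(y|x)$ of the true conditional density $p(y|x)$.
In the context of CDE, the KL-divergence objective is expressed as expectation over $p(x)$:
$$\theta^* = \arg\min_{\theta \in \Theta} \mathbb{E}_{x \sim p(x)} \left[ D_{KL}(p(y|x) \,\|\, \hat{f}_\theta(y|x)) \right] \tag{2}$$
Corresponding to (1), we refer to the minimization of (2) w.r.t. $\theta$ as conditional M-projection. Given a dataset $\mathcal{D}$ drawn i.i.d. from $p(x, y)$, the conditional MLE following from (2) can be stated as
$$\theta^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log \hat{f}_\theta(y_i | x_i) \tag{3}$$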
3 Related work
The first part of this section discusses work in the field of CDE, focusing on high-capacity models that make few prior assumptions. The second part relates our approach to previous regularization and data augmentation methods.
Non-parametric CDE.
A vast body of literature in statistics studies non-parametric kernel density estimators (KDEs) (Rosenblatt, 1956; Parzen, 1962) and the associated bandwidth selection problem, which concerns choosing the appropriate amount of smoothing (Silverman, 1982; Hall et al., 1992; Cao et al., 1994). To estimate conditional probabilities, previous work proposes to estimate the joint and the marginal probability separately with KDE and then to compute the conditional probability as their ratio (Hyndman et al., 1996; Li & Racine, 2007). Other approaches combine non-parametric elements with parametric elements (Tresp, 2001; Sugiyama & Takeuchi, 2010; Dutordoir et al., 2018). Despite their theoretical appeal, non-parametric density estimators suffer from poor generalization in regions where data is sparse (e.g., tail regions) (Scott & Wand, 1991).
CDE based on neural networks.
Most work in machine learning focuses on flexible parametric function approximators for CDE. In our experiments, we use the work of Bishop (1994) and Ambrogioni et al. (2017), who propose to use a neural network to control the parameters of a mixture density model. A recent trend in machine learning is the use of latent density models such as cGANs (Mirza & Osindero, 2014) and cVAEs (Sohn et al., 2015). Although such methods have been shown to be successful for estimating distributions of images, the PDF of such models is intractable. More promising in this sense are normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2017; Trippe & Turner, 2018), since they provide the PDF in tractable form. We employ a neural network controlling the parameters of a normalizing flow as our third CDE model to showcase the empirical efficacy of our regularization approach.
Regularization.
Since neural network based CDE models suffer from severe over-fitting when trained with the MLE objective, they require proper regularization. Classical regularization of the parameters such as weight decay (Pratt & Hanson, 1989; Krogh & Hertz, 1992; Nowlan & Hinton, 1992), $l_1$/$l_2$-penalties (Mackay, 1992; Ng, 2004) and Bayesian priors (Murray & Edwards, 1993; Hinton & Van Camp, 1993) have been shown to work well in the regression and classification setting. However, in the context of CDE, it is less clear what kind of inductive bias such a regularization imposes on the density estimate. In contrast, our regularization approach is agnostic w.r.t. parametrization and is shown to penalize strong variations of the log-density function. Regularization methods such as dropout are closely related to ensemble methods (Srivastava et al., 2014). Thus, they are orthogonal to our work and can be freely combined with noise regularization.
Adding noise during training.
Adding noise during training is a common scheme that has been proposed in various forms. This includes noise on the neural network weights or activations (Wan et al., 2013; Srivastava et al., 2014; Gal & Uk, 2016) and additive noise on the gradients for scalable approximate inference (Welling & Teh, 2011; Chen et al., 2014). While this line of work corresponds to noise in the parameter space, other research suggests augmenting the training data through random and/or adversarial transformations (Sietsma & Dow, 1991; Burges & Schölkopf, 1996; Goodfellow et al., 2015; Yuan et al., 2017). Our approach transforms the observations by adding small random perturbations. While this form of regularization has been studied in the context of regression and classification (Holmstrom & Koistinen, 1992a; Webb, 1994; Bishop, 1995; Natarajan et al., 2013; Maaten et al., 2013), this paper focuses on the regularization of CDE. In particular, we build on the results of Webb (1994), showing that training with noise corresponds to a penalty on strong variations of the log-density, and extend the previous consistency results for regression of Holmstrom & Koistinen (1992a) to the more general setting of CDE. To the best of our knowledge, this is also the first paper to evaluate the empirical efficacy of noise regularization for density estimation.
4 Noise Regularization
When considering expressive families of conditional densities, standard maximum likelihood estimation of the model parameters is ill-suited. As can be observed in Figure 1, simply minimizing the negative log-likelihood of the data leads to severe over-fitting. Hence, it is necessary to impose additional inductive bias, for instance, in the form of regularization. Unlike in regression or classification, the form of inductive bias imposed by popular regularization techniques such as weight decay (Krogh & Hertz, 1991) is less clear in the CDE setting, where the neural network weights often only indirectly control the probability density through an unconditional density model, e.g., a Gaussian Mixture.
4.1 The algorithm
We propose to add noise perturbations to the data points during the optimization of the log-likelihood objective. This can be understood as replacing the original data points $(x_i, y_i)$ by random variables $\tilde{x}_i = x_i + \xi_x$ and $\tilde{y}_i = y_i + \xi_y$ where the perturbation vectors are sampled from noise distributions $K_x(\xi_x)$ and $K_y(\xi_y)$. Further, we choose the noise to be zero centered as well as i.i.d. among the data dimensions, with standard deviation $h$:
$$\mathbb{E}_{\xi \sim K(\xi)}[\xi] = 0 \quad \text{and} \quad \mathbb{E}_{\xi \sim K(\xi)}[\xi \xi^\top] = h^2 I \tag{4}$$
This can be seen as a form of data augmentation, where “synthetic” data is generated by randomly perturbing the original data. Since the supply of noise vectors is technically unlimited, an arbitrarily large augmented data set can be generated by repeatedly sampling data points from $\mathcal{D}$, and adding a random perturbation vector to the respective data point. This procedure is formalized in Algorithm 1.
For notational brevity, we set $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, $z := (x^\top, y^\top)^\top$ and denote $\hat{f}_\theta(z) := \hat{f}_\theta(y|x)$. The presented noise regularization approach is agnostic to whether we are concerned with unconditional or conditional MLE. Thus, the generic notation also allows us to generalize the results to both settings (derived in the remainder of the paper).
When considering highly flexible parametric families such as Mixture Density Networks (MDNs) (Bishop, 1994), the maximum likelihood solution in line 6 of Algorithm 1 is no longer tractable. In such case, one typically resorts to stochastic optimization techniques such as mini-batch gradient descent and variations thereof. The generic procedure in Algorithm 1 can be transformed into a simple extension of mini-batch gradient descent on the MLE objective (see Algorithm 2). Specifically, each mini-batch is perturbed with i.i.d. noise before computing the MLE objective function (forward pass) and the respective gradients (backward pass).
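The mini-batch variant described above can be sketched as follows. This is an illustrative NumPy sketch under strong simplifying assumptions, not the paper's Algorithm 2 verbatim: the CDE model is a toy conditional Gaussian $y \mid x \sim \mathcal{N}(ax + b, e^{2s})$ with scalar $x$ and $y$ instead of an MDN, the optimizer is plain SGD, and the names `perturb`, `nll_and_grad` and `train_with_noise` are hypothetical.

```python
import numpy as np

def perturb(batch_x, batch_y, h_x, h_y, rng):
    """Add zero-mean i.i.d. Gaussian noise with std h_x / h_y to a mini-batch."""
    noisy_x = batch_x + h_x * rng.standard_normal(batch_x.shape)
    noisy_y = batch_y + h_y * rng.standard_normal(batch_y.shape)
    return noisy_x, noisy_y

def nll_and_grad(params, x, y):
    """Mean negative log-likelihood and gradient for y | x ~ N(a*x + b, exp(s)^2)."""
    a, b, s = params
    resid = y - (a * x + b)
    inv_var = np.exp(-2.0 * s)
    nll = np.mean(0.5 * np.log(2 * np.pi) + s + 0.5 * resid ** 2 * inv_var)
    grad = np.array([
        np.mean(-resid * x * inv_var),        # d nll / d a
        np.mean(-resid * inv_var),            # d nll / d b
        np.mean(1.0 - resid ** 2 * inv_var),  # d nll / d s
    ])
    return nll, grad

def train_with_noise(x, y, h_x, h_y, steps=2000, batch_size=32, lr=0.05, seed=0):
    """SGD on the MLE objective where every mini-batch is perturbed with fresh noise."""
    rng = np.random.default_rng(seed)
    params = np.zeros(3)
    for _ in range(steps):
        idx = rng.integers(0, len(x), size=batch_size)
        # Noise is re-drawn for each mini-batch before the forward and backward pass.
        noisy_x, noisy_y = perturb(x[idx], y[idx], h_x, h_y, rng)
        _, grad = nll_and_grad(params, noisy_x, noisy_y)
        params -= lr * grad
    return params
```

Setting `h_x = h_y = 0` recovers ordinary mini-batch maximum likelihood training, so the regularizer adds only the noise sampling step to an existing training loop.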
4.2 Variable Noise as Smoothness Regularization
Intuitively, the previously presented variable noise can be interpreted as “smearing” the data points during the maximum likelihood estimation. This alleviates the jaggedness of the density estimate arising from an un-regularized maximum likelihood objective in flexible density classes. We will now give this intuition a formal foundation, by mathematically analyzing the effect of the noise perturbations.
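Before discussing the particular effects of randomly perturbing the data during conditional maximum likelihood estimation, we first analyze noise regularization in a more general case. Let $\mathcal{L}(\mathcal{D})$ be a loss function over a set of data points $\mathcal{D} = \{z_1, \ldots, z_n\}$, which can be partitioned into a sum of losses $\mathcal{L}(\mathcal{D}) = \sum_{i=1}^{n} l(z_i)$, corresponding to each data point $z_i$. The expected loss $\mathbb{E}_{\xi \sim K(\xi)}[l(z_i + \xi)]$, resulting from adding random perturbations, can be approximated by a second order Taylor expansion around $z_i$. Using the assumption about $\xi$ in (4), the expected loss can be written as
$$\mathbb{E}_{\xi \sim K(\xi)}[l(z_i + \xi)] = l(z_i) + \frac{1}{2} \mathbb{E}_{\xi \sim K(\xi)}\left[\xi^\top \mathbf{H}^{(i)} \xi\right] + O(\xi^3) \approx l(z_i) + \frac{h^2}{2} \operatorname{tr}(\mathbf{H}^{(i)}) \tag{5}$$
where $l(z_i)$ is the loss without noise and $\mathbf{H}^{(i)} = \frac{\partial^{2} l}{\partial z^{2}}(z)\big\rvert_{z_i}$ the Hessian of $l$ w.r.t. $z$, evaluated at $z_i$. The first-order term vanishes because the noise is zero centered. Assuming that the noise $\xi$ is small in its magnitude, $O(\xi^3)$ is negligible. This effect has been observed earlier by Webb (1994) and Bishop (1994). See Appendix A for derivations.
When concerned with maximum likelihood estimation of a conditional density $\hat{f}_\theta(y|x)$, the loss function coincides with the negative conditional log-likelihood $l(x_i, y_i) = -\log \hat{f}_\theta(y_i|x_i)$. Let the standard deviation of the additive data noise $\xi_x$, $\xi_y$ be $h_x$ and $h_y$ respectively. Maximum likelihood estimation (MLE) with data noise is equivalent to minimizing the loss
$$\mathcal{L}(\mathcal{D}) \approx -\sum_{i=1}^{n} \log \hat{f}_\theta(y_i|x_i) - \frac{h_y^2}{2} \sum_{i=1}^{n} \sum_{j=1}^{d_y} \frac{\partial^2 \log \hat{f}_\theta(y|x)}{\partial y^{(j)} \partial y^{(j)}} \bigg\rvert_{y = y_i,\, x = x_i} - \frac{h_x^2}{2} \sum_{i=1}^{n} \sum_{j=1}^{d_x} \frac{\partial^2 \log \hat{f}_\theta(y|x)}{\partial x^{(j)} \partial x^{(j)}} \bigg\rvert_{y = y_i,\, x = x_i}$$
Hereby, the first term corresponds to the standard MLE objective, while the other two terms constitute a form of smoothness regularization. The second term of $\mathcal{L}(\mathcal{D})$ penalizes concavity of the conditional log density estimate $\log \hat{f}_\theta(y|x)$ w.r.t. $y$. As the MLE objective pushes the density estimate towards high densities and strong concavity in the data points $y_i$, the regularization term counteracts this tendency to over-fit, thus smoothing the fitted distribution. The third term penalizes large negative second derivatives w.r.t. the conditional variable $x$, thereby regularizing the sensitivity of the density estimate to changes in the conditional variable. The intensity of the noise regularization can be controlled through the variance ($h_x^2$ and $h_y^2$) of the random perturbations.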
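The second-order approximation can be checked numerically. The sketch below assumes, purely for illustration, that the model is a fixed zero-mean Gaussian with a given precision matrix, so that the Hessian of the negative log-density w.r.t. $z$ is the precision matrix and the expansion is exact up to Monte Carlo error. All function names are hypothetical.

```python
import numpy as np

def gaussian_nll(z, precision):
    """Negative log-density of N(0, precision^{-1}) at z (a vector or rows of a matrix)."""
    d = precision.shape[0]
    _, logdet = np.linalg.slogdet(precision)
    quad = np.einsum("...i,ij,...j->...", z, precision, z)
    return 0.5 * (d * np.log(2 * np.pi) - logdet + quad)

def noisy_loss_mc(z, precision, h, num_samples, rng):
    """Monte Carlo estimate of E[l(z + xi)] with xi ~ N(0, h^2 I)."""
    xi = h * rng.standard_normal((num_samples, z.shape[0]))
    return gaussian_nll(z + xi, precision).mean()

def noisy_loss_taylor(z, precision, h):
    """l(z) + h^2 / 2 * tr(H); here the Hessian H of the NLL is the precision matrix."""
    return gaussian_nll(z, precision) + 0.5 * h ** 2 * np.trace(precision)
```

For a sharply peaked density (large precision), the trace term is large, which is how the noise penalizes spiky fits around the training points.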
Figure 1 illustrates the effect of the introduced noise regularization scheme on MDN estimates. Plain maximum likelihood estimation (left) leads to strong over-fitting, resulting in a spiky distribution that generalizes poorly beyond the training data. In contrast, training with noise regularization (center and right) results in smoother density estimates that are closer to the true conditional density.
4.3 Consistency of Noise Regularization
We now establish asymptotic consistency results for the proposed noise regularization. In particular, we show that, under some regularity conditions, concerning integrability and decay of the noise regularization, the solution of Algorithm 1 converges to the asymptotic MLE solution.
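Let $\hat{f}_\theta(z): \mathbb{R}^{d_z} \times \Theta \to (0, \infty)$ be a continuous function of $z$ and $\theta$. Moreover, we assume that the parameter space $\Theta$ is compact. In the classical MLE setting, the idealized loss, corresponding to a (conditional) M-projection of the true data distribution onto the parametric family, reads as
$$l(\theta) = -\mathbb{E}_{z \sim p(z)}\left[\log \hat{f}_\theta(z)\right]$$
As we typically just have a finite number of samples from $p(z)$, the respective empirical estimate $\hat{l}_n(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{f}_\theta(z_i)$, with $z_i$ drawn i.i.d. from $p(z)$, is used as training objective. Note that we now define the loss as function of $\theta$, and, for fixed $\theta$, treat $\hat{l}_n(\theta)$ as a random variable. Under some regularity conditions, one can invoke the uniform law of large numbers to show consistency of the empirical ML objective in the sense that $\sup_{\theta \in \Theta} |\hat{l}_n(\theta) - l(\theta)| \to 0$ almost surely (see Appendix B for details).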
In case of the presented noise regularization scheme, the maximum likelihood estimation is performed on the augmented data $\tilde{z}_j$ rather than the original data $z_i$. For our analysis, we view Algorithm 1 from a slightly different angle. In fact, the data augmentation procedure of uniformly selecting a data point from $\{z_1, \ldots, z_n\}$ and perturbing it with a noise vector drawn from $K$ can be viewed as drawing i.i.d. samples from a kernel density estimate $\hat{q}_n^{(h)}(z) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{z - z_i}{h}\right)$. Hence, MLE with variable noise can be understood as
1. forming a kernel density estimate $\hat{q}_n^{(h)}$ of the data,
2. followed by a (conditional) M-projection of $\hat{q}_n^{(h)}$ onto the parametric family.
Hereby, step 2 aims to find the $\theta^*$ that minimizes the following objective:
$$l_n^{(h)}(\theta) = -\mathbb{E}_{z \sim \hat{q}_n^{(h)}}\left[\log \hat{f}_\theta(z)\right] \tag{6}$$
Since (6) is generally intractable, $r$ samples are drawn from the kernel density estimate, forming the following Monte Carlo approximation of (6) which corresponds to the loss in line 6 of Algorithm 1:
$$\hat{l}_{n,r}^{(h)}(\theta) = -\frac{1}{r} \sum_{j=1}^{r} \log \hat{f}_\theta(\tilde{z}_j), \quad \tilde{z}_j \sim \hat{q}_n^{(h)} \tag{7}$$
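The equivalence between "pick a data point uniformly and perturb it" and "sample from the kernel density estimate" can be illustrated with a short NumPy sketch. It assumes a Gaussian kernel, and the names `sample_from_gaussian_kde` and `monte_carlo_objective` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def sample_from_gaussian_kde(data, h, num_samples, rng):
    """Draw samples from a Gaussian KDE with bandwidth h placed on data of shape (n, d)."""
    # Step 1: select data points uniformly at random (with replacement).
    idx = rng.integers(0, data.shape[0], size=num_samples)
    # Step 2: perturb each selected point with zero-mean noise of standard deviation h.
    return data[idx] + h * rng.standard_normal((num_samples, data.shape[1]))

def monte_carlo_objective(log_density, samples):
    """Monte Carlo estimate of the negative expected log-density under the KDE."""
    return -np.mean(log_density(samples))
```

Because fresh samples are cheap to draw, the number of samples can be made as large as needed, so the Monte Carlo error of the objective can be driven down independently of the size of the original data set.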
We are concerned with the consistency of the training procedure in Algorithm 1, similar to the classical MLE consistency result discussed above. Hence, we need to show that $\sup_{\theta \in \Theta} |\hat{l}_{n,r}^{(h)}(\theta) - l(\theta)| \to 0$ almost surely as $n, r \to \infty$. We begin our argument by decomposing the problem into easier sub-problems. In particular, the triangle inequality is used to obtain the following upper bound:
$$\sup_{\theta \in \Theta} \left|\hat{l}_{n,r}^{(h)}(\theta) - l(\theta)\right| \leq \sup_{\theta \in \Theta} \left|\hat{l}_{n,r}^{(h)}(\theta) - l_n^{(h)}(\theta)\right| + \sup_{\theta \in \Theta} \left|l_n^{(h)}(\theta) - l(\theta)\right| \tag{8}$$
Note that $\hat{l}_{n,r}^{(h)}(\theta)$ is based on samples from the KDE, which are obtained by adding random noise vectors to our original training data. Since we can sample an unlimited amount of such random noise vectors, $r$ can be chosen arbitrarily high. This allows us to make $\sup_{\theta \in \Theta} |\hat{l}_{n,r}^{(h)}(\theta) - l_n^{(h)}(\theta)|$ arbitrarily small by the uniform law of large numbers. In order to make $\sup_{\theta \in \Theta} |l_n^{(h)}(\theta) - l(\theta)|$ small in the limit $n \to \infty$, the sequence of bandwidth parameters $h_n$ needs to be chosen appropriately. Such results can then be combined using a union bound argument. In the following we outline the steps leading us to the desired results. In that, the proof methodology is similar to Holmstrom & Koistinen (1992b). While they show consistency results for regression with a quadratic loss function, our proof deals with generic and inherently unbounded log-likelihood objectives and thus holds for a much more general class of learning problems. The full proofs can be found in the Appendix.
Initially, we have to make asymptotic integrability assumptions that ensure that the expectations in $l_n^{(h)}(\theta)$ and $l(\theta)$ are well-behaved in the limit (see Appendix C for details). Given respective integrability, we are able to obtain the following proposition.
Proposition 1
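Suppose the regularity conditions (26) and (27) in Appendix C are satisfied, and that
$$\lim_{n \to \infty} h_n = 0, \qquad \lim_{n \to \infty} n (h_n)^d = \infty \tag{9}$$
Then,
$$\lim_{n \to \infty} \sup_{\theta \in \Theta} \left|l_n^{(h)}(\theta) - l(\theta)\right| = 0 \tag{10}$$
almost surely.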
In (9) we find conditions on the asymptotic behavior of the smoothing sequence $(h_n)$. These conditions also give us valuable guidance on how to properly choose the noise intensity in line 4 of Algorithm 1 (see Section 4.4 for discussion). The result in (10) demonstrates that, under the discussed conditions, replacing the empirical data distribution with a kernel density estimate still results in an asymptotically consistent maximum likelihood objective. However, as previously discussed, $l_n^{(h)}(\theta)$ is intractable and, thus, replaced by its sample estimate $\hat{l}_{n,r}^{(h)}(\theta)$. Since we can draw an arbitrary number of samples from $\hat{q}_n^{(h)}$, we can approximate $l_n^{(h)}(\theta)$ with arbitrary precision. Given a fixed data set $\mathcal{D}$ of size $n$, this means that $\lim_{r \to \infty} \sup_{\theta \in \Theta} |\hat{l}_{n,r}^{(h)}(\theta) - l_n^{(h)}(\theta)| = 0$ almost surely, by (27) and the uniform law of large numbers. Since our original goal was to also show consistency for $n \to \infty$, this result is combined with Proposition 1, obtaining the following consistency theorem.
Theorem 1
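Suppose the regularity conditions (26) and (27) are satisfied, $h_n$ fulfills (9) and $\Theta$ is compact. Then,
$$\lim_{n \to \infty} \overline{\lim_{r \to \infty}} \sup_{\theta \in \Theta} \left|\hat{l}_{n,r}^{(h)}(\theta) - l(\theta)\right| = 0 \tag{11}$$
almost surely.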
In that, $\overline{\lim}$ is used to denote the limit superior (“lim sup”) of a sequence. Training a (conditional) density model with noise regularization means minimizing $\hat{l}_{n,r}^{(h)}(\theta)$ w.r.t. $\theta$. As a result of this optimization, one obtains a parameter vector $\hat{\theta}_{n,r}^{(h)}$, which we hope is close to the minimizing parameter $\theta^*$ of the ideal objective function $l(\theta)$. In the following, we establish consistency results, similar to Theorem 1, in the parameter space. For that, we first have to formalize the concept of closeness and optimality in the parameter space. Since a minimizing parameter $\theta^*$ of $l(\theta)$ may not be unique, we define $\Theta^* = \{\theta^* \mid l(\theta^*) \leq l(\theta) \ \forall \theta \in \Theta\}$ as the set of global minimizers of $l(\theta)$, and $d(\theta, \Theta^*) = \min_{\theta^* \in \Theta^*} \|\theta - \theta^*\|_2$ as the distance of an arbitrary parameter $\theta$ to $\Theta^*$. Based on these definitions, it can be shown that Algorithm 1 is consistent in the sense that the minimizer of $\hat{l}_{n,r}^{(h)}$ converges almost surely to the set of optimal parameters $\Theta^*$.
Theorem 2
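Suppose the regularity conditions (26) and (27) are satisfied, $h_n$ fulfills (9) and $\Theta$ is compact. For $r > 0$ and $n > 0$, let $\hat{\theta}_{n,r}^{(h)}$ be a global minimizer of the empirical objective $\hat{l}_{n,r}^{(h)}$. Then
$$\lim_{n \to \infty} \overline{\lim_{r \to \infty}} \, d(\hat{\theta}_{n,r}^{(h)}, \Theta^*) = 0 \tag{12}$$
almost surely.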
Note that Theorem 2 considers global optimizers, but equivalently holds for compact neighborhoods of a local minimum (see discussion in Appendix C).
4.4 Choosing the noise intensity
After discussing the properties of noise regularization, we are interested in how to properly choose the noise intensity $h$ for different training data sets. Ideally, we would like to choose $h$ so that $|l_n^{(h)}(\theta) - l(\theta)|$ is minimized, which is practically not feasible since $l(\theta)$ is intractable. Inequality (28) gives us an upper bound on this quantity, suggesting to minimize the distance between the kernel density estimate $\hat{q}_n^{(h)}$ and the data distribution $p(z)$. This is in turn a well-studied problem in the kernel density estimation literature (see e.g., Devroye & Luc (1987)). Unfortunately, general solutions of this problem require knowing $p(z)$, which is not the case in practice. Under the assumption that $p(z)$ and the kernel function $K$ are Gaussian, the optimal bandwidth can be derived as $h = 1.06 \, \hat{\sigma} \, n^{-\frac{1}{4+d}}$ (Silverman, 1986). In that, $\hat{\sigma}$ denotes the estimated standard deviation of the data, $n$ the number of data points and $d$ the dimensionality of $\mathcal{Z}$. This formula is widely known as the rule of thumb and often used as a heuristic for choosing $h$.
In addition, the conditions in (9) give us further intuition. The first condition tells us that $h_n$ needs to decay towards zero as $n$ becomes large. This reflects the general theme in machine learning that the more data is available, the less inductive bias / regularization should be imposed. The second condition suggests that the bandwidth decay must happen at a rate slower than $n^{-\frac{1}{d}}$. For instance, the rule of thumb fulfills these two criteria and thus constitutes a useful guideline for selecting $h$. However, for highly non-Gaussian data distributions, the respective $h_n$ may decay too slowly and a faster decay rate such as $n^{-\frac{1}{1+d}}$ may be appropriate.
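As a rough illustration of the two bandwidth schedules, the sketch below computes the rule-of-thumb bandwidth and a faster-decaying alternative. Computing the standard deviation per dimension and the scale factor `h_0` are assumptions made for this sketch; the function names are hypothetical.

```python
import numpy as np

def rule_of_thumb_bandwidth(data):
    """Rule of thumb h = 1.06 * sigma_hat * n^(-1 / (4 + d)) for data of shape (n, d)."""
    n, d = data.shape
    # Assumption of this sketch: one standard deviation estimate per data dimension.
    sigma_hat = np.std(data, axis=0, ddof=1)
    return 1.06 * sigma_hat * n ** (-1.0 / (4 + d))

def sqrt_decay_bandwidth(n, d, h_0=1.0):
    """Faster schedule h_n = h_0 * n^(-1 / (1 + d)); h_0 is a hypothetical scale factor."""
    return h_0 * n ** (-1.0 / (1 + d))
```

Both schedules shrink towards zero as the number of data points grows while the product of the number of data points and the bandwidth raised to the data dimensionality still diverges, so both are compatible with the two decay conditions.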
5 Experiments
We now perform a detailed experimental analysis of the proposed method, aiming to empirically validate the theoretical arguments outlined previously and investigating the practical efficacy of our regularization approach. In all experiments we use Gaussian perturbations, i.e., $\xi \sim \mathcal{N}(0, h^2 I)$. Since one of the key features of our noise regularization scheme is that it is agnostic to the choice of model, we evaluate its performance on three different neural network based CDE models: Mixture Density Networks (MDN) (Bishop, 1994), Kernel Mixture Networks (KMN) (Ambrogioni et al., 2017) and Normalizing Flows Networks (NFN) (Rezende & Mohamed, 2015; Trippe & Turner, 2018).
In our experiments, we consider both simulated as well as real-world data sets. In particular, we simulate data from a 4-dimensional Gaussian Mixture and a Skew-Normal distribution whose parameters are functionally dependent on $x$. In terms of real-world data, we use the following three data sources. EuroStoxx: Daily returns of the Euro Stoxx 50 index conditioned on various stock return factors. NYC Taxi: Drop-off locations of Manhattan taxi trips conditioned on the pickup location, weekday and time. UCI datasets: Boston Housing, Concrete and Energy datasets from the UCI machine learning repository (Dua & Graff, 2017). The reported scores are test log-likelihoods, averaged over at least 5 random seeds alongside the respective standard deviation. For further details regarding the data sets and simulated data, we refer to Appendix E. The experiment data and code are available on our supplementary website (https://sites.google.com/view/noisereg/).
5.1 Noise intensity schedules
We complement the discussion in Section 4.4 with an empirical investigation of different schedules of $h_n$. In particular, we compare a) the rule of thumb $h_n \propto n^{-\frac{1}{4+d}}$, b) a square root decay schedule $h_n \propto n^{-\frac{1}{1+d}}$, c) a constant bandwidth $h_n = \text{const.}$ and d) no noise regularization, i.e., $h_n = 0$. Figure 2 plots the respective test log-likelihoods against an increasing training set size $n$ for the two simulated densities Gaussian Mixture and Skew Normal.
First, we observe that bandwidth rates that conform with the decay conditions seem to converge in performance to the non-regularized maximum likelihood estimator (red) as $n$ becomes large. This validates the theoretical result of Theorem 1. Second, a fixed bandwidth across $n$ (green), violating (9), imposes asymptotic bias and thus saturates in performance well before its counterparts. Third, as hypothesized, the relatively slow decay of $h_n$ through the rule of thumb works better for data distributions that are more similar to a Gaussian, i.e., in our case the Skew Normal distribution. In contrast, the highly non-Gaussian data from the Gaussian Mixture requires faster decay rates like the square root decay schedule. Most importantly, noise regularization substantially improves the estimator’s performance when only little training data is available.
5.2 Regularization Comparison
We now investigate how the proposed noise regularization scheme compares to classical regularization techniques. In particular, we consider an $l_1$- and $l_2$-penalty on the neural network weights as regularization term, the weight decay technique of Loshchilov & Hutter (2019), as well as a Bayesian neural network (Neal, 2012) trained with variational inference using a Gaussian prior and posterior (Blei et al., 2017). Note that an $l_2$ regularizer and weight decay are not equivalent since we use the adaptive learning rate technique Adam; see Loshchilov & Hutter (2019) for details. First, we study the performance of the regularization techniques on our two simulation benchmarks. Figure 3 depicts the respective test log-likelihood across different training set sizes. For each regularization method, the regularization hyper-parameter has been optimized via grid search.
As one would expect, the importance of regularization, i.e., performance difference to un-regularized model, decreases as the amount of training data becomes larger. The noise regularization scheme yields similar performance across the CDE models while the other regularizers vary greatly in their performance depending on the different models. This reflects the fact that noise regularization is agnostic to the parameterization of the CDE model while regularizers in the parameter space are dependent on the internal structure of the model. Most importantly, noise regularization performs well across all models and sample sizes. In the great majority of configurations it outperforms the other methods. Especially, when little training data is available, noise regularization ensures a moderate test error while the other methods mostly fail to do so.
Next, we consider real-world data. We use 5-fold cross-validation on the training set to select the hyper-parameters for each regularization method. The test log-likelihoods, reported in Table 1, are averages over 3 different train/test splits and 5 seeds each for initializing the neural networks. The held-out test set amounts to 20% of the respective data set. Consistent with the results of the simulation study, noise regularization outperforms the other methods across the great majority of data sets and CDE models.
5.3 Conditional Density Estimator Benchmark Study
We benchmark neural network based density estimators against state-of-the-art CDE approaches. While neural networks are the obvious choice when a large amount of training data is available, we ask how such estimators compete against popular non-parametric methods in small data regimes. In particular, we compare to Conditional Kernel Density Estimation (CKDE) (Li & Racine, 2007), ε-Neighborhood Kernel Density Estimation (NKDE), and Least-Squares Conditional Density Estimation (LSCDE) (Sugiyama & Takeuchi, 2010). For the kernel density estimation based methods CKDE and NKDE, we perform bandwidth selection via the rule of thumb (R.O.T.) (Silverman, 1982; Sheather & Jones, 1991) and via maximum likelihood leave-one-out cross-validation (CV-ML) (Rudemo, 1982; Hall et al., 1992). In case of LSCDE, MDN, KMN and NFN, the respective hyper-parameters are selected via 5-fold cross-validation grid search on the training set. Note that, in contrast to Section 5.2, which focuses on regularization parameters, the grid search here extends to more hyper-parameters.
The respective test log-likelihood scores are listed in Table 2. For the majority of data sets, the three neural network based methods outperform all of the non- and semi-parametric methods. In summary, neural network based CDE models, trained with noise regularization, work surprisingly well, even when training data is scarce and fairly low-dimensional, as in the case of the Boston Housing data set. We hypothesize that this is because, when using non-parametric regularization for training parametric high-capacity models, we combine favorable inductive biases of both KDE and neural networks.
6 Conclusion
The paper proposes a regularization technique for neural CDE that adds random perturbations to the data during training. It can be seamlessly integrated into standard stochastic optimization and is model-agnostic. We show that the proposed noise regularization inherits an inductive bias from non-parametric KDE, effectively smoothing the estimated conditional density. Our experiments demonstrate that it consistently outperforms other regularization methods across models and datasets. Moreover, empirical results suggest that, when trained with noise regularization, neural network based models are able to significantly improve upon state-of-the-art non-parametric estimators. This makes neural CDE the preferable method, even in settings where training data is low-dimensional and scarce.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program grant agreement No 815943. Simon Walter has been supported by the Konrad-Adenauer-Stiftung. Finally, we thank Sara van der Geer for her advice regarding the consistency proofs.
Appendix A Derivation Smoothness Regularization
Let $l$ be a loss function over a set of data points $\{z_i\}$, which can be partitioned into a sum of losses corresponding to each data point $z_i$:
[TABLE]
Also, let each $z_i$ be perturbed by a random noise vector $\xi$ with zero mean and i.i.d. elements, i.e.
[TABLE]
The resulting loss can be approximated by a second order Taylor expansion around $z_i$:
[TABLE]
Assuming that the noise is small in magnitude, the higher-order remainder term may be neglected. The expected loss under the noise distribution follows directly from (15):
[TABLE]
Using the assumptions about $\xi$ in (14), we can simplify (16) as follows:
[TABLE]
In that, $l(z_i)$ is the loss without noise and $\mathbf{H}^{(i)} = \nabla_z^2 l(z)\big\rvert_{z_i}$ the Hessian of $l$ at $z_i$. With $\xi_j$ we denote the elements of the column vector $\xi$.
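The end result of this derivation can be checked numerically. The sketch below assumes isotropic Gaussian noise with variance $h^2$ per element, for which the second-order argument gives an expected noisy loss of approximately $l(z) + \frac{h^2}{2}\,\mathrm{tr}(\mathbf{H})$. The toy loss (a sum of cosines), the point `z`, and all names in the code are illustrative choices that do not come from the paper.

```python
import numpy as np


def loss(z):
    """Toy smooth loss; the last axis indexes the elements of z."""
    return np.sum(np.cos(z), axis=-1)


def hessian_trace(z):
    """Trace of the Hessian of the toy loss: d^2/dz_j^2 cos(z_j) = -cos(z_j)."""
    return -np.sum(np.cos(z), axis=-1)


rng = np.random.default_rng(0)
z = np.array([0.3, -1.2, 2.0])
h = 0.2  # noise standard deviation (assumed isotropic Gaussian noise)

# Monte Carlo estimate of the expected loss under perturbed inputs.
xi = rng.normal(0.0, h, size=(400_000, z.size))
mc_expected = loss(z + xi).mean()

# Second-order approximation: clean loss plus a Hessian-trace penalty.
approx = loss(z) + 0.5 * h**2 * hessian_trace(z)

print(f"clean loss      : {loss(z):.4f}")
print(f"Monte Carlo     : {mc_expected:.4f}")
print(f"2nd-order approx: {approx:.4f}")
```

The Monte Carlo average and the second-order expression agree closely, while both differ visibly from the clean loss. That gap is the curvature penalty induced by the noise.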
Appendix B Vanilla conditional MLE objective is uniformly consistent
The objective function corresponding to a conditional M-projection.
[TABLE]
The sample equivalent:
[TABLE]
Corollary 1
Let be a compact set and and continuous in for all such that . Then, as , we have
[TABLE]
Proof. The corollary follows directly from the uniform law of large numbers.
Appendix C Consistency Proofs
Lemma 1
Suppose for some there exists a constant such that
[TABLE]
and there exists an such that for all there exists a constant such that
[TABLE]
almost surely. Then, the inequality
[TABLE]
where is a constant holds with probability 1 for all .
Proof of Lemma 1. Using Hölder’s inequality and the nonnegativity of and , we obtain
[TABLE]
Employing the regularity conditions (26) and (27) and writing , it follows that such that
[TABLE]
with probability 1.
Lemma 1 states regularity conditions ensuring that the expectations in and are well-behaved in the limit. In particular, (26) and (27) imply uniform and absolute integrability of the log-likelihoods under the respective probability measures induced by and . Since we are interested in the asymptotic behavior, it is sufficient for (27) to hold for large enough with probability 1.
Inequality (28) shows that we can make small by reducing the $L_1$-distance between the true density and the kernel density estimate . There already exists a vast body of literature discussing how to properly choose the kernel and the bandwidth sequence so that . We employ the results in Devroye (1983) for our purposes, leading us to Proposition 1.
Proof of Proposition 1. Let denote the event that inequality (28) holds for some constant . From our regularity assumptions it follows that . Given that holds, we just have to show that . Then, the upper bound in (28) tends to zero and we can conclude our proposition.
For any let denote the event
[TABLE]
wherein is a kernel density estimate obtained based on samples from . Under the conditions in (9) we can apply Theorem 1 of Devroye (1983), obtaining an upper bound on the probability that (29) does not hold, i.e. such that for all .
Since we need both and for to hold, we consider the intersection of the events . Using a union bound argument it follows that such that . Note that we can simply choose for this to hold. Hence, and by the Borel-Cantelli lemma we can conclude that
[TABLE]
holds with probability 1.
Proof of Theorem 1. The inequality in (8) implies that for any ,
[TABLE]
Let be fixed but arbitrary and denote
[TABLE]
It is important to note that is a random variable that depends on the samples as well as on the randomness inherent in Algorithm 1. We define as the indices sampled uniformly from and as the sequence of perturbation vectors sampled from . Let , and be probability measures of the respective random sequences.
If we fix to be equal to an arbitrary sequence , then is fixed and we can treat as the regular difference between a sample estimate and expectation under . By the regularity condition , the compactness of and the continuity of in , we can invoke the uniform law of large numbers to show that
[TABLE]
with probability 1.
Now we want to show that (33) also holds with probability 1 for random training samples . First, we write as a deterministic function of random variables:
[TABLE]
This allows us to restate the result in (33) as follows:
[TABLE]
In that, $\mathbf{1}(A)$ denotes an indicator function which returns 1 if $A$ is true and 0 otherwise. Next we consider the probability that the convergence in (33) holds for random training samples:
[TABLE]
Note that we can move outside of the inner integrals, since is independent from and . Hence, we can conclude that (33) also holds, which we denote as event , with probability 1 for random training data.
From Proposition 1 we know that
[TABLE]
with probability 1. We denote the event that (36) holds as . Since , we can use a union bound argument to show that . From (33) and (31) it follows that for any ,
[TABLE]
with probability 1. Finally, we combine this result with (36), obtaining that
[TABLE]
almost surely, which concludes the proof.
Proof of Theorem 2. The proof follows the argument used in Theorem 1 of White (1989). In the following, we assume that (11) holds. From Theorem 1 we know that this is the case with probability 1. Respectively, we only consider realizations of our training data and noise samples , for which the convergence in (11) holds (see proof of Theorem 1 for details on this notation).
For such a realization, let be minimizers of . Also let and for any , be increasing sequences of positive integers. Define and . Due to the compactness of and the Bolzano-Weierstrass property thereof, there exists a limit point and increasing subsequences so that as .
From the triangle inequality, it follows that for any there exists so that
[TABLE]
given the convergence established in Theorem 1 and the continuity of in . Next, the result above is extended to
[TABLE]
which again holds for large enough. This is due to (39), since is the minimizer of , and by Theorem 1. Because can be made arbitrarily small, as . Because is arbitrary, must be in . In turn, since , and were chosen arbitrarily, every limit point of a sequence must be in .
In the final step, we prove the theorem by contradiction. Suppose that (12) does not hold. In this case, there must exist an and sequences , and such that for all and . However, by the previous argument the limit point of any sequence must be in . That is a contradiction to . Since the random sequences , , were chosen from a set with probability mass of 1, we can conclude our proposition that
[TABLE]
almost surely.
Discussion of Theorem 2. Note that, similar to , does not have to be unique. In case there are multiple minimizers of , we can choose one of them arbitrarily and the proof of the theorem still holds. Theorem 2 considers global optimizers over a set of parameters , which may not be attainable in practical settings. However, the application of the theorem to the context of local optimization is straightforward when is chosen as a compact neighborhood of a local minimum of (Holmstrom & Koistinen, 1992b). If we set and restrict minimization over to the local region, then converges to as in the sense of Theorem 2.
Appendix D Conditional density estimation models
D.1 Mixture Density Network
Mixture Density Networks (MDNs) combine conventional neural networks with a mixture density model for the purpose of estimating conditional distributions (Bishop, 1994). In particular, the parameters of the unconditional mixture distribution are outputted by the neural network, which takes the conditional variable as input.
For our purpose, we employ a Gaussian Mixture Model (GMM) with diagonal covariance matrices as density model. The conditional density estimate follows as weighted sum of Gaussians
[TABLE]
wherein denote the weight, the mean and the variance of the k-th Gaussian component. All the GMM parameters are governed by the neural network with parameters and input .
The mixing weights must form a categorical distribution, i.e. they must be non-negative and sum to one. To satisfy these conditions, the softmax non-linearity is used for the output neurons corresponding to the mixing weights. Similarly, the standard deviations must be positive, which is ensured by a softplus non-linearity. Since the component means are not subject to such restrictions, we use a linear output layer without non-linearity for the respective output neurons.
For the experiments in 5.2 and 5.1, we set and use a neural network with two hidden layers of size 32.
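The output-layer constraints described above can be illustrated with a small NumPy sketch. It takes raw, unconstrained network outputs for a one-dimensional target, maps them to valid mixture parameters (softmax for the weights, softplus for the standard deviations, identity for the means), and evaluates the mixture log-density. The function names, the raw output values, and the restriction to a scalar target are assumptions of this sketch. The neural network itself is omitted.

```python
import numpy as np


def softmax(a):
    a = a - np.max(a)  # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()


def softplus(a):
    return np.logaddexp(0.0, a)  # log(1 + exp(a)), strictly positive


def mdn_log_density(raw_w, raw_mu, raw_sigma, y):
    """Log-density of a 1-D Gaussian mixture built from raw network outputs."""
    w = softmax(raw_w)           # mixing weights: non-negative, sum to one
    mu = raw_mu                  # means: unconstrained (linear output)
    sigma = softplus(raw_sigma)  # standard deviations: positive
    y = np.atleast_1d(y)[:, None]  # shape (N, 1), broadcasts against (K,)
    log_comp = (-0.5 * ((y - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    # log sum_k w_k N(y | mu_k, sigma_k^2), computed stably in log-space
    return np.logaddexp.reduce(np.log(w) + log_comp, axis=1)


# Hypothetical raw outputs for K = 3 components.
raw_w = np.array([0.2, -1.0, 0.5])
raw_mu = np.array([-2.0, 0.0, 3.0])
raw_sigma = np.array([0.0, -1.0, 1.0])

grid = np.linspace(-15.0, 15.0, 6001)
density = np.exp(mdn_log_density(raw_w, raw_mu, raw_sigma, grid))
mass = density.sum() * (grid[1] - grid[0])
print(f"total probability mass on the grid: {mass:.4f}")
```

Because the constraints are enforced by the output non-linearities, any raw output vector yields a proper density. The final check confirms that the density integrates to approximately one.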
D.1.1 Kernel Mixture Network
While MDNs resemble a purely parametric conditional density model, a closely related approach, the Kernel Mixture Network (KMN), combines both non-parametric and parametric elements (Ambrogioni et al., 2017). Similar to MDNs, a mixture density model of the target variable is combined with a neural network which takes the conditional variable as an input. However, the neural network only controls the weights of the mixture components while the component centers and scales are fixed with respect to the conditional variable. For each of the kernel centers, different scale/bandwidth parameters are chosen. As for MDNs, we employ Gaussians as mixture components, wherein the scale parameter directly coincides with the standard deviation.
Let be the number of kernel centers and the number of different kernel scales . The KMN conditional density estimate reads as follows:
[TABLE]
As previously, the weights correspond to a softmax function. The scale parameters are learned jointly with the neural network parameters . The centers are initially chosen by k-means clustering on the target values in the training data set. Overall, the KMN model is more restrictive than MDN as the locations and scales of the mixture components are fixed during inference and cannot be controlled by the neural network. However, due to the reduced flexibility of KMNs, they are less prone to over-fit than MDNs.
For the experiments in 5.2 and 5.1, we set and . The respective neural network has two hidden layers of size 32.
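To make the difference to the MDN concrete, the following sketch evaluates a KMN-style density for a one-dimensional target. The kernel centers and scales are plain arrays that do not depend on the network input, and only the logits (one per center/scale pair) would come from the neural network. The function name, array shapes, and numeric values are assumptions for illustration. The k-means initialization and the joint training of the scales are not shown.

```python
import numpy as np


def kmn_log_density(logits, centers, scales, y):
    """Log-density of a kernel mixture with fixed centers (K,) and scales (M,).

    logits has shape (K, M): one network output per center/scale pair.
    """
    flat = logits.ravel() - logits.max()
    w = (np.exp(flat) / np.exp(flat).sum()).reshape(logits.shape)  # softmax over all pairs
    y = np.atleast_1d(y)[:, None, None]  # (N, 1, 1)
    mu = centers[None, :, None]          # (1, K, 1)
    s = scales[None, None, :]            # (1, 1, M)
    log_comp = -0.5 * ((y - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2.0 * np.pi)
    joint = np.log(w)[None] + log_comp   # (N, K, M)
    return np.logaddexp.reduce(joint.reshape(len(joint), -1), axis=1)


# Hypothetical setup: K = 4 centers, M = 2 scales.
centers = np.array([-3.0, -1.0, 1.0, 3.0])
scales = np.array([0.3, 1.0])
logits = np.array([[0.1, -0.4], [1.2, 0.0], [-0.7, 0.3], [0.5, -1.0]])

grid = np.linspace(-15.0, 15.0, 6001)
density = np.exp(kmn_log_density(logits, centers, scales, grid))
mass = density.sum() * (grid[1] - grid[0])
print(f"total probability mass on the grid: {mass:.4f}")
```

Changing `logits` reshapes the density only by re-weighting the fixed kernels, which is the restriction relative to the MDN discussed above.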
D.2 Normalizing Flow Network
The Normalizing Flow Network (NFN) is similar to the MDN and KMN in that a neural network takes the conditional variable as its input and outputs parameters for the distribution over the target variable. For the NFN, the distribution is given by a Normalizing Flow (Rezende & Mohamed, 2015). It works by transforming a simple base distribution and an accordingly distributed random variable through a series of invertible, parametrized mappings into a successively more complex distribution . The PDF of samples can be evaluated using the change-of-variable formula:
[TABLE]
The Normalizing Flows from Rezende & Mohamed (2015) were introduced in the context of posterior estimation in variational inference. They are optimized for fast sampling while the likelihood evaluation for externally provided data is comparatively slow. To make them useful for CDE, we invert the direction of the flows, defining a mapping from the transformed distribution to the base distribution by setting .
We experimented with three types of flows: planar flows, radial flows as parametrized by Trippe & Turner (2018) and affine flows . We have found that one affine flow combined with multiple radial flows performs favourably in most settings.
For the experiments in 5.2 and 5.1, we used a standard Gaussian as the base distribution that is transformed through one affine flow and ten radial flows. The respective neural network has two hidden layers of size 32.
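The change-of-variable computation in the inverted direction can be illustrated with the simplest case, a single affine flow on a one-dimensional target. The sketch maps a data point to the base space and adds the log-determinant of the Jacobian to the base log-density. The parametrization `z = exp(a) * y + b` and the parameter values are assumptions of this sketch and need not match the parametrization used in the experiments. Radial flows are not covered.

```python
import numpy as np


def affine_flow_log_density(a, b, y):
    """Log-density of y under an inverted affine flow with a standard normal base."""
    z = np.exp(a) * y + b                             # map data to the base space
    log_base = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)
    log_det = a                                       # log |dz/dy| = log exp(a) = a
    return log_base + log_det


a, b = 0.5, -1.0  # hypothetical flow parameters (would be network outputs)
grid = np.linspace(-12.0, 12.0, 4801)
log_p = affine_flow_log_density(a, b, grid)
mass = np.exp(log_p).sum() * (grid[1] - grid[0])

# The same density written directly as a Gaussian with matching mean and scale.
mean, std = -b * np.exp(-a), np.exp(-a)
log_gauss = -0.5 * ((grid - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

print(f"mass: {mass:.4f}, max abs difference: {np.abs(log_p - log_gauss).max():.2e}")
```

Evaluating the likelihood only requires one forward pass through the mapping from data to base space, which is why the inverted direction is convenient for maximum likelihood training.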
Appendix E Simulated densities and datasets
E.1 SkewNormal
The data generating process resembles a bivariate joint distribution, wherein $x$ follows a normal distribution and $y$ a conditional skew-normal distribution (Anděl et al., 1984). The parameters of the skew normal distribution are functionally dependent on $x$. Specifically, the functional dependencies are the following:
[TABLE]
Accordingly, the conditional probability density corresponds to the skew normal density function:
[TABLE]
In that, denotes the density, and the cumulative distribution function of the standard normal distribution. The shape parameter controls the skewness and kurtosis of the distribution. We set and , giving a negative skewness that decreases as increases. This distribution will allow us to evaluate the performance of the density estimators in presence of skewness, a phenomenon that we often observe in financial market variables. Figure 4(a) illustrates the conditional skew normal distribution.
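For readers who want to experiment with this density, the standard skew normal form $\frac{2}{\sigma}\,\phi(u)\,\Phi(\alpha u)$ with $u = (y-\mu)/\sigma$ can be written as follows. The symbols $\mu$, $\sigma$, $\alpha$ and the parameter values below are illustrative assumptions. The functional dependence of the parameters on the conditional variable is not reproduced here.

```python
from math import erf

import numpy as np

_erf = np.vectorize(erf)  # element-wise error function without extra dependencies


def std_normal_pdf(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def std_normal_cdf(u):
    return 0.5 * (1.0 + _erf(u / np.sqrt(2.0)))


def skew_normal_pdf(y, mu, sigma, alpha):
    """Skew normal density with location mu, scale sigma and shape alpha."""
    u = (np.asarray(y, dtype=float) - mu) / sigma
    return 2.0 / sigma * std_normal_pdf(u) * std_normal_cdf(alpha * u)


mu, sigma, alpha = 0.5, 1.2, -4.0  # hypothetical parameter values
grid = np.linspace(-12.0, 12.0, 4801)
dy = grid[1] - grid[0]
pdf = skew_normal_pdf(grid, mu, sigma, alpha)
mass = pdf.sum() * dy
mean = (grid * pdf).sum() * dy
print(f"mass: {mass:.4f}, mean: {mean:.4f} (location mu = {mu})")
```

With a negative shape parameter the probability mass shifts to the left of the location parameter, and setting the shape parameter to zero recovers the ordinary normal density.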
E.2 Gaussian Mixture
The joint distribution follows a Gaussian Mixture Model in with 5 Gaussian components, i.e. . We assume that and can be factorized, i.e.
[TABLE]
When and can be factorized as in (50), the conditional density can be derived in closed form:
[TABLE]
wherein the mixture weights are a function of :
[TABLE]
For details and derivations we refer the interested reader to Guang Sung (2004) and Gilardi et al. (2002). The weights are sampled from a uniform distribution and then normalized to sum to one. The component means are sampled from a spherical Gaussian with zero mean and standard deviation of . The covariance matrices and are sampled from a Gaussian with mean 1 and standard deviation 0.5, and then projected onto the cone of positive definite matrices.
Since we can hardly visualize a 4-dimensional GMM, Figure 4(b) depicts a 2-dimensional equivalent, generated with the procedure explained above.
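The closed-form conditional density of a factorized Gaussian mixture is a mixture over the components' densities of the target, re-weighted by how well each component explains the conditional variable. The sketch below implements this standard result. The function names and the toy parameters (two components, one-dimensional conditional and target variables) are assumptions for illustration and differ from the 5-component setup used in the simulation.

```python
import numpy as np


def mvn_pdf(v, mean, cov):
    """Multivariate normal density at a single point v."""
    diff = v - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * (quad + logdet + mean.size * np.log(2.0 * np.pi)))


def conditional_weights(x, w, mu_x, cov_x):
    """Mixture weights given x: proportional to w_k * N(x | mu_x_k, cov_x_k)."""
    unnorm = np.array([w[k] * mvn_pdf(x, mu_x[k], cov_x[k]) for k in range(len(w))])
    return unnorm / unnorm.sum()


def gmm_conditional_pdf(y, x, w, mu_x, cov_x, mu_y, cov_y):
    """Conditional density p(y | x) of a GMM whose components factorize in x and y."""
    cw = conditional_weights(x, w, mu_x, cov_x)
    return sum(cw[k] * mvn_pdf(y, mu_y[k], cov_y[k]) for k in range(len(w)))


# Toy parameters (hypothetical).
w = np.array([0.3, 0.7])
mu_x = np.array([[-1.0], [2.0]])
cov_x = np.array([[[0.5]], [[1.0]]])
mu_y = np.array([[0.0], [3.0]])
cov_y = np.array([[[1.0]], [[0.25]]])

x = np.array([0.5])
cw = conditional_weights(x, w, mu_x, cov_x)
grid = np.linspace(-10.0, 10.0, 4001)
mass = sum(gmm_conditional_pdf(np.array([y]), x, w, mu_x, cov_x, mu_y, cov_y)
           for y in grid) * (grid[1] - grid[0])
print("conditional weights:", np.round(cw, 4), "mass:", round(float(mass), 4))
```

Moving `x` towards the center of one component in the conditional variable's space pushes that component's conditional weight towards one, which is what makes the resulting conditional density non-trivially dependent on `x`.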
E.3 Euro Stoxx 50 data
The Euro Stoxx 50 data comprises 3169 trading days, dated from January 2003 until June 2015. The goal is to predict the conditional probability density of 1-day log-returns, conditioned on 14 explanatory variables. These conditional variables comprise classical return factors from finance as well as option implied moments. For details, we refer to Rothfuss et al. (2019). Overall, the target variable is one-dimensional, i.e. $y \in \mathbb{R}$, whereas the conditional variable constitutes a 14-dimensional vector, i.e. $x \in \mathbb{R}^{14}$.
E.4 NYC Taxi data
We follow the setup in Dutordoir et al. (2018). The dataset contains records of taxi trips in the Manhattan area operated in January 2016. The objective is to predict spatial distributions of the drop-off location, based on the pick-up location, the day of the week, and the time of day. In that, the two temporal features are represented as sine and cosine with natural periods. Accordingly, the target variable is 2-dimensional (longitude and latitude of the drop-off location) whereas the conditional variable is 6-dimensional. From the ca. 1 million trips, we randomly sample 10,000 trips to serve as training data.
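The sine/cosine representation of the temporal features can be sketched as follows. The periods of 7 days and 24 hours, the column order, and the function names are assumptions of this sketch. The coordinates in the usage example are made-up values, not records from the dataset.

```python
import numpy as np


def cyclical(value, period):
    """Encode a periodic quantity as a (sine, cosine) pair."""
    angle = 2.0 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)


def taxi_features(pickup_lon, pickup_lat, day_of_week, hour_of_day):
    """Stack pick-up location and encoded time into a 6-dimensional input."""
    day_sin, day_cos = cyclical(day_of_week, 7.0)     # assumed period: 7 days
    hour_sin, hour_cos = cyclical(hour_of_day, 24.0)  # assumed period: 24 hours
    return np.column_stack([pickup_lon, pickup_lat,
                            day_sin, day_cos, hour_sin, hour_cos])


# Two made-up trips: Monday 23:30 and Sunday 00:30.
features = taxi_features([-73.98, -73.95], [40.75, 40.78], [0, 6], [23.5, 0.5])
print(features.shape)
print(np.round(features, 3))
```

The encoding places times that are close on the clock, such as 23:30 and 00:30, close together in feature space, which a raw hour value would not do.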
E.5 UCI
Boston Housing
Concerns the value of houses in the suburban area of Boston. Conditional variables are mostly socio-economic as well as geographical factors. For more details see https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Concrete
The task is to predict the compressive strength of concrete given variables describing the concrete composition. For more details see https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/
Energy
Concerns the energy efficiency of homes. The task is to predict the cooling load based on features describing the build of the respective house. For more details see https://archive.ics.uci.edu/ml/datasets/energy+efficiency
References
- Ambrogioni, L., Güçlü, U., van Gerven, M. A. J., and Maris, E. The Kernel Mixture Network: A Nonparametric Method for Conditional Density Estimation of Continuous Random Variables. 2017. URL http://arxiv.org/abs/1705.07111.
- Anděl, J., Netuka, I., and Zvára, K. On Threshold Autoregressive Processes. Kybernetika, 20(2):89–106, 1984. URL https://dml.cz/bitstream/handle/10338.dmlcz/124493/Kybernetika_20-1984-2_1.pdf.
- Bishop, C. M. Mixture Density Networks. 1994.
- Bishop, C. M. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 7(1):108–116, 1995. ISSN 0899-7667. doi: 10.1162/neco.1995.7.1.108. URL http://www.mitpressjournals.org/doi/10.1162/neco.1995.7.1.108.
- Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. doi: 10.1080/01621459.2017.1285773. URL https://doi.org/10.1080/01621459.2017.1285773.
- Burges, C. J. C. and Schölkopf, B. Improving the accuracy and speed of support vector machines. In NIPS, pp. 375–381. MIT Press, 1996. URL https://dl.acm.org/citation.cfm?id=2999034.
- Cao, R., Cuevas, A., and González Manteiga, W. A comparative study of several smoothing methods in density estimation. Computational Statistics & Data Analysis, 17(2):153–176, 1994. ISSN 0167-9473. doi: 10.1016/0167-9473(92)00066-Z. URL https://www.sciencedirect.com/science/article/pii/016794739200066Z?via%3Dihub.
- Chen, T., Fox, E. B., and Guestrin, C. Stochastic Gradient Hamiltonian Monte Carlo. In ICML, 2014. URL https://arxiv.org/pdf/1402.4102.pdf.
