Evaluating Bregman Divergences for Probability Learning from Crowd

F. A. Mena (Universidad Técnica Federico Santa María, Chile); R. Ñanculef (Universidad Técnica Federico Santa María, Chile)

arXiv:1901.10653·cs.LG·January 31, 2019


TL;DR

This paper explores the use of Bregman divergences as objective functions for training probabilistic models from crowd-sourced data, emphasizing the importance of careful optimization in neural networks.

Contribution

It introduces models that utilize Bregman divergences for probability distribution learning from crowdsourcing data, highlighting optimization considerations.

Findings

01

Proper objective function selection is crucial for effective learning.

02

Optimization strategies significantly impact model performance.

03

Bregman divergences can effectively model crowd-derived probability distributions.

Abstract

Crowdsourcing scenarios are a good example of settings in which a probability distribution over categories shows what people think from a global perspective. Learning a predictive model of this probability distribution can be much more valuable than learning only a discriminative model that gives the most likely category for the data. Here we present different models that adapt to having a probability distribution as the target when training a machine learning model. We focus on the Bregman divergence framework as the objective function to minimize. The results show that special care must be taken when building an objective function and when assuming that objectives optimize equally on neural networks in the Keras framework.

Tables

Table 1: Macro F1 (%) on the training and test sets. A ✓ in column B marks a Bregman divergence.

B	Objective function	Train (%)	Test (%)
	MSE	83.910	47.747
	RMSE	75.773	40.804
	Reverse KL	38.107	29.519
	Cross Entropy	80.075	45.352
	Jensen-Shannon	75.486	41.543
✓	Forward KL	79.144	46.005
✓	Itakura-Saito	16.584	17.067
✓	Generalized I	78.944	48.934
✓	Squared Euclidean	80.333	39.156

Table 2: NDCG (%) on the training and test sets. A ✓ in column B marks a Bregman divergence. Bold represents the four best results in each set.

B	Objective function	Train (%)	Test (%)
	MSE	96.847	94.708
	RMSE	96.823	94.559
	Reverse KL	93.914	93.080
	Cross Entropy	97.476	94.711
	Jensen-Shannon	97.276	94.955
✓	Forward KL	97.403	94.580
✓	Itakura-Saito	91.867	91.805
✓	Generalized I	97.364	94.779
✓	Squared Euclidean	96.824	94.462

Table 3: Accuracy on Ranking Decrease (%) on the training and test sets. A ✓ in column B marks a Bregman divergence. Bold represents the four best results in each set.

B	Objective function	Train (%)	Test (%)
	MSE	28.835	16.851
	RMSE	28.252	17.214
	Reverse KL	20.175	15.862
	Cross Entropy	30.652	17.267
	Jensen-Shannon	30.625	18.657
✓	Forward KL	31.007	16.788
✓	Itakura-Saito	11.262	11.256
✓	Generalized I	30.659	17.276
✓	Squared Euclidean	28.350	16.700

Equations

p_{ij} = \frac{r_{ij}}{\sum_{l} r_{il}} \tag{1}

MSE(p_{i}, \hat{p}_{i}) = \frac{1}{K} \sum_{j}^{K} (p_{ij} - \hat{p}_{ij})^{2} \tag{2}

RMSE(p_{i}, \hat{p}_{i}) = \sqrt{\frac{1}{K} \sum_{j}^{K} (p_{ij} - \hat{p}_{ij})^{2}} \tag{3}

H(p_{i}, \hat{p}_{i}) = -\sum_{j}^{K} p_{ij} \log \hat{p}_{ij} \tag{4}

KL(\hat{p}_{i} \,||\, p_{i}) = \sum_{j}^{K} \hat{p}_{ij} \log \frac{\hat{p}_{ij}}{p_{ij}} \tag{5}

JS(p_{i}, \hat{p}_{i}) = \frac{1}{2} \sum_{j}^{K} \left[ KL(p_{ij} \,||\, m_{ij}) + KL(\hat{p}_{ij} \,||\, m_{ij}) \right] \tag{6}

d_{\Phi}(x, y) = \Phi(x) - \Phi(y) - \langle x - y, \nabla\Phi(y) \rangle \tag{7}

SSE(p_{i}, \hat{p}_{i}) = \sum_{j}^{K} (p_{ij} - \hat{p}_{ij})^{2} \tag{8}

KL(p_{i} \,||\, \hat{p}_{i}) = \sum_{j}^{K} p_{ij} \log \frac{p_{ij}}{\hat{p}_{ij}} \tag{9}

GenI(p_{i} \,||\, \hat{p}_{i}) = \sum_{j}^{K} \left[ p_{ij} \log \frac{p_{ij}}{\hat{p}_{ij}} - (p_{ij} - \hat{p}_{ij}) \right] \tag{10}

IS(p_{i}, \hat{p}_{i}) = \sum_{j}^{K} \left[ \frac{p_{ij}}{\hat{p}_{ij}} - \log \frac{p_{ij}}{\hat{p}_{ij}} - 1 \right] \tag{11}

\Delta(t) = \frac{| loss^{(t)} - loss^{(t+1)} |}{loss^{(t)}} \tag{12}

F_{1}^{M} = \frac{1}{K} \sum_{j}^{K} 2 \frac{P_{j} \cdot R_{j}}{P_{j} + R_{j}} \tag{13}

acc_{rank} = \frac{1}{N} \sum_{i}^{N} \frac{1}{K} \sum_{k}^{K} \frac{I( y_{i}^{(k)} = \hat{y}_{i}^{(k)} )}{k} \tag{14}

H(p, q) = H(p) + KL(p \,||\, q) \tag{15}


Taxonomy

Topics: Statistical Mechanics and Entropy · Gaussian Processes and Bayesian Inference · Forecasting Techniques and Applications

Full text


Francisco Mena

Univ. Técnica Federico Santa María

Depto de Informática

Santiago, Chile

Ricardo Ñanculef

Univ. Técnica Federico Santa María

Depto de Informática

Santiago, Chile

Contact to [email protected]


1 Introduction

Knowing the probability distribution over different categories (discrete variables) in some domain can be very valuable when one wants a predictive model that gives a prediction together with its uncertainty. Classic machine learning models try to learn a discriminative model over only the true category of the data.

When the problem is to learn the whole probability distribution over the categories of some data, classic machine learning methods may not fit well. This is a different objective, because the model needs to learn to give a prediction even for the least likely categories and thereby express its uncertainty, i.e. avoid assigning a priori zero probability to the categories to which the data does not belong.

Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT, http://www.mturk.com), allow one to obtain multiple, possibly different, annotations for each item of a dataset. One can then group all the annotations of an item into one vector of repeats and normalize it to get the probability of each category, representing what a common/regular annotator thinks or how she behaves. In this scenario, for a problem with two categories $(dog,cat)$, an image of a big cat can have the probability distribution $(0.8,0.2)$, which shows how an annotator behaves and also how she can get confused and give a wrong annotation for the image. The experimental work of [Snow et al., 2008] over different text datasets shows that multiple non-expert annotators can perform similarly to expert annotators, with a strong correlation between them. So we can assume that multiple annotators, taken together, behave reasonably well.

In this application we need a function between two probability distributions to use as the cost function to optimize. The one commonly used in the state of the art is the KL divergence [Thomas, 1991], which measures the difference between two probability distributions. Here we explore different dissimilarity measures between vectors (probabilities over different categories). We focus on the Bregman divergences [Bregman, 1967], which measure dissimilarity between objects and are not, by themselves, metrics.

We evaluate different functions, trying to understand which kind of objective function works better. The results show that Bregman divergences and similar objective functions do not all behave in the same way. We suspect that this is due to the approximate optimization performed by the neural network framework, such as the stochastic optimization or the approximate functional derivatives.

The paper is structured as follows. In Section 2 we formally define the problem we are facing. In Section 3 we present the models proposed to solve the problem and to be compared, while the evaluation metrics used for the comparison are given in Section 4. Section 5 presents the related work and Section 6 shows the results of all the experiments. Finally, the conclusions are given in Section 7.

2 Problem

Given a dataset of $N$ pairs $\{(x_{i},p_{i})\}_{i=1}^{N}$, with $x_{i}\in\mathbb{R}^{d}$ the data and $p_{i}\in\mathbb{R}^{K}$ the normalized vector of repeats, i.e. the vector of probabilities of each of the $K$ categories for that data, the task is to learn a model that maps the data to the probabilities, $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{K},f(x_{i})=\hat{p}_{i}$.

This objective differs from that of classic machine learning, which tries to learn only the most probable category ($argmax$). Here the loss function $\ell(p_{i},f(x_{i}))$ uses the whole probability distribution, so the model jointly learns the uncertainty of its predictions.

To build the data probabilities, which in some cases come with the data, we assume that the data and the annotators correctly model the probability of each category, i.e. the ground truth. With $r_{ij}$ the entry of the vector of repeats that stores the number of times that data $i$ was annotated with category $j$, the probabilities are:

$$p_{ij}=\frac{r_{ij}}{\sum_{l}r_{il}} \tag{1}$$
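The normalization of the vector of repeats into probabilities can be sketched in a few lines of NumPy. This is only an illustration: the function name `repeats_to_probabilities` and the example counts are hypothetical and do not come from the paper.

```python
import numpy as np

def repeats_to_probabilities(r):
    """Normalize each row of annotation counts r (shape N x K) so it sums to 1."""
    r = np.asarray(r, dtype=float)
    totals = r.sum(axis=1, keepdims=True)  # number of annotations per item
    if np.any(totals == 0):
        raise ValueError("every item needs at least one annotation")
    return r / totals

# Hypothetical counts for 3 items and K = 2 categories (dog, cat).
r = np.array([[8, 2], [1, 9], [5, 5]])
p = repeats_to_probabilities(r)
print(p)
```

Each row of `p` is then a target distribution $p_{i}$ for the corresponding item.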

3 Models

The models proposed to solve the problem are the well-studied deep neural networks [LeCun et al., 2015], used to model $f$ with different objective functions adapted to probabilities. In particular, we work with the Bregman divergences [Bregman, 1967]. Below we define the objective functions to compare. Each is evaluated on every pair of target and predicted distributions in a batch, and the values are then combined with an arithmetic average, as in a standard neural network.

3.1 Based on Keras

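First we define some objective functions commonly used in Keras [Chollet et al., 2015] to evaluate deep learning models.

• Mean Squared Error (MSE):

$$MSE(p_{i},\hat{p}_{i})=\frac{1}{K}\sum_{j}^{K}(p_{ij}-\hat{p}_{ij})^{2} \tag{2}$$

• Root Mean Squared Error (RMSE):

$$RMSE(p_{i},\hat{p}_{i})=\sqrt{\frac{1}{K}\sum_{j}^{K}(p_{ij}-\hat{p}_{ij})^{2}} \tag{3}$$

• Cross-entropy:

$$H(p_{i},\hat{p}_{i})=-\sum_{j}^{K}p_{ij}\log\hat{p}_{ij} \tag{4}$$

• Reverse KL:

$$KL(\hat{p}_{i}\,||\,p_{i})=\sum_{j}^{K}\hat{p}_{ij}\log\frac{\hat{p}_{ij}}{p_{ij}} \tag{5}$$

• Jensen-Shannon divergence [Lin, 1991] (a symmetrized KL):

$$JS(p_{i},\hat{p}_{i})=\frac{1}{2}\sum_{j}^{K}\left[KL(p_{ij}\,||\,m_{ij})+KL(\hat{p}_{ij}\,||\,m_{ij})\right] \tag{6}$$

With $m_{ij}=\frac{p_{ij}+\hat{p}_{ij}}{2}$.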
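The objectives listed above can be written for a single pair of distributions as a minimal NumPy sketch. It is not the authors' Keras code: the clipping constant `EPS` and all function names are assumptions introduced here to keep the logarithms finite and the example self-contained.

```python
import numpy as np

EPS = 1e-12  # assumed clipping constant, not taken from the paper

def mse(p, q):
    return np.mean((p - q) ** 2)

def rmse(p, q):
    return np.sqrt(mse(p, q))

def cross_entropy(p, q):
    return -np.sum(p * np.log(np.clip(q, EPS, 1.0)))

def kl(a, b):
    """KL(a || b) for two discrete distributions."""
    a = np.clip(a, EPS, 1.0)
    b = np.clip(b, EPS, 1.0)
    return np.sum(a * np.log(a / b))

def reverse_kl(p, q):
    # The prediction q comes first, the target p second.
    return kl(q, p)

def jensen_shannon(p, q):
    m = 0.5 * (p + q)  # mixture of target and prediction
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.8, 0.2])  # target distribution
q = np.array([0.6, 0.4])  # predicted distribution
for name, fn in [("MSE", mse), ("RMSE", rmse), ("CE", cross_entropy),
                 ("Reverse KL", reverse_kl), ("JS", jensen_shannon)]:
    print(name, fn(p, q))
```

In training, each of these would be averaged over the examples of a batch, as described at the start of this section.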

3.2 Bregman divergences

Here we define the Bregman divergences (a divergence being the inverse of a similarity) [Chen et al., 2008, Banerjee et al., 2005] to measure the difference between two probability distributions. These functions form a family that shares some properties, because they are derived from a general framework.

Given $\Phi$, a strictly convex differentiable function, the Bregman divergence $d_{\Phi}$ is defined as:

$$d_{\Phi}(x,y)=\Phi(x)-\Phi(y)-\langle x-y,\nabla\Phi(y)\rangle \tag{7}$$

With $\langle a,b\rangle$ the inner product between $a$ and $b$. As can be seen, the order of the arguments of $d_{\Phi}$ matters. Since it satisfies neither symmetry nor the triangle inequality, it is not a metric. Nonetheless, it has other properties that are good for optimization purposes:

• Convex: in its first argument $x$.

• Non-negative: $d_{\Phi}(x,y)\geq 0$ for every $x,y$.

• Duality: if $\Phi$ has a convex conjugate, it can be used to obtain a dual form of the divergence.

• The mean as a minimizer in the random scenario: given a set of random vectors $x$, the expected divergence $d_{\Phi}(x,y)$ is minimized over $y$, for any function $\Phi$, by the mean of the vectors [Banerjee et al., 2005].
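The general definition can be checked numerically with a small sketch. The helper `bregman` and the two generator functions below are illustrative names chosen here, and the vectors are arbitrary strictly positive distributions.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# Generator 1: squared norm, which yields the squared Euclidean distance.
def sq_norm(v):
    return np.dot(v, v)

def sq_norm_grad(v):
    return 2.0 * v

# Generator 2: negative entropy (requires strictly positive entries).
def neg_entropy(v):
    return np.sum(v * np.log(v))

def neg_entropy_grad(v):
    return np.log(v) + 1.0

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

d_sq = bregman(sq_norm, sq_norm_grad, p, q)
d_ne = bregman(neg_entropy, neg_entropy_grad, p, q)
d_ne_swapped = bregman(neg_entropy, neg_entropy_grad, q, p)
print(d_sq, d_ne, d_ne_swapped)
```

With the squared norm the result equals $\sum_{j}(p_{j}-q_{j})^{2}$, and with the negative entropy it equals $KL(p\,||\,q)$ because both vectors sum to one. Swapping the arguments changes the value, which illustrates the lack of symmetry.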

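Then, the divergence $d_{\Phi}(p_{i},\hat{p}_{i})$ for different functions $\Phi$:

• Squared Euclidean distance (also known as sum of squared errors, SSE), for $\Phi(p_{i})=||p_{i}||^{2}$:

$$SSE(p_{i},\hat{p}_{i})=\sum_{j}^{K}(p_{ij}-\hat{p}_{ij})^{2} \tag{8}$$

• Forward KL, for the negative entropy function, $\Phi(p_{i})=\sum_{j}^{K}p_{ij}\log p_{ij}$:

$$KL(p_{i}\,||\,\hat{p}_{i})=\sum_{j}^{K}p_{ij}\log\frac{p_{ij}}{\hat{p}_{ij}} \tag{9}$$

• Generalized I divergence, similar to Forward KL but generalized to the positive reals (Forward KL is for a domain of discrete values):

$$GenI(p_{i}\,||\,\hat{p}_{i})=\sum_{j}^{K}\left[p_{ij}\log\frac{p_{ij}}{\hat{p}_{ij}}-(p_{ij}-\hat{p}_{ij})\right] \tag{10}$$

• Itakura-Saito distance, for $\Phi(p_{i})=-\sum_{j}^{K}\log p_{ij}$:

$$IS(p_{i},\hat{p}_{i})=\sum_{j}^{K}\left[\frac{p_{ij}}{\hat{p}_{ij}}-\log\frac{p_{ij}}{\hat{p}_{ij}}-1\right] \tag{11}$$

We expect an objective function based on probabilities to behave better than a standard objective function for continuous variables.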
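The closed forms of these divergences are short enough to write out directly. The sketch below assumes strictly positive vectors and uses function names chosen for this illustration. It also shows that Generalized I and Forward KL coincide whenever both arguments sum to one, and differ otherwise.

```python
import numpy as np

def forward_kl(p, q):
    return np.sum(p * np.log(p / q))

def generalized_i(p, q):
    # The extra term -(p - q) sums to zero when p and q both sum to one.
    return np.sum(p * np.log(p / q) - (p - q))

def itakura_saito(p, q):
    ratio = p / q
    return np.sum(ratio - np.log(ratio) - 1.0)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(forward_kl(p, q), generalized_i(p, q))          # equal on the simplex
print(forward_kl(2 * p, q), generalized_i(2 * p, q))  # differ off the simplex
print(itakura_saito(p, q), itakura_saito(3 * p, 3 * q))  # scale invariant
```

Because the Itakura-Saito distance depends only on the ratio $p_{ij}/\hat{p}_{ij}$, small predicted probabilities for likely categories produce very large values.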

4 Metrics

For a fair comparison of the effect of the different objective functions, we use metrics that are normalized across the different objective functions used in training:

• Convergence delta:

$$\Delta(t)=\frac{|loss^{(t)}-loss^{(t+1)}|}{loss^{(t)}} \tag{12}$$

With $t$ the instant during training, analogous to epochs.

• Macro F1 score over the category with the highest probability:

$$F_{1}^{M}=\frac{1}{K}\sum_{j}^{K}2\frac{P_{j}\cdot R_{j}}{P_{j}+R_{j}} \tag{13}$$

With $P_{j}$ and $R_{j}$ the precision and recall over category $j$, respectively.

• Normalized discounted cumulative gain (NDCG), a metric from learning to rank [Cao et al., 2007], used to measure the ordering of the predicted probabilities.

• Accuracy on Ranking Decrease:

$$acc_{rank}=\frac{1}{N}\sum_{i}^{N}\frac{1}{K}\sum_{k}^{K}\frac{I(y_{i}^{(k)}=\hat{y}_{i}^{(k)})}{k} \tag{14}$$

With $y_{i}^{(k)}$ the real category of data $i$ in position $k$, $\hat{y}_{i}^{(k)}$ the predicted category in that position, and $I$ the indicator function.
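Three of these metrics can be sketched in NumPy as follows. The function names and the toy arrays are hypothetical, and two details are assumptions made here rather than statements from the paper: positions are obtained by sorting probabilities in decreasing order with a stable sort, and an undefined precision or recall is counted as zero.

```python
import numpy as np

def convergence_delta(losses):
    """Relative change of the loss between consecutive epochs."""
    losses = np.asarray(losses, dtype=float)
    return np.abs(losses[:-1] - losses[1:]) / losses[:-1]

def macro_f1(p_true, p_pred):
    """Macro F1 over the most probable category of each example."""
    y_true = p_true.argmax(axis=1)
    y_pred = p_pred.argmax(axis=1)
    scores = []
    for j in range(p_true.shape[1]):
        tp = np.sum((y_pred == j) & (y_true == j))
        fp = np.sum((y_pred == j) & (y_true != j))
        fn = np.sum((y_pred != j) & (y_true == j))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return float(np.mean(scores))

def acc_rank_decrease(p_true, p_pred):
    """Position-wise ranking agreement, with position k weighted by 1/k."""
    rank_true = np.argsort(-p_true, axis=1, kind="stable")
    rank_pred = np.argsort(-p_pred, axis=1, kind="stable")
    K = p_true.shape[1]
    weights = 1.0 / np.arange(1, K + 1)
    hits = rank_true == rank_pred
    return float(np.mean(np.sum(hits * weights, axis=1) / K))

p_true = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p_pred = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
print(convergence_delta([1.0, 0.5, 0.45]))
print(macro_f1(p_true, p_pred))
print(acc_rank_decrease(p_true, p_pred))
```

Note that, as defined in Equation 14, a perfect ranking scores $\frac{1}{K}\sum_{k}^{K}\frac{1}{k}$ rather than 1, which is consistent with the comparatively low percentages reported for this metric.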

5 Related Work

Since the Bregman divergences were proposed [Bregman, 1967], several works have studied the benefits of these divergences. For instance, [Vemuri et al., 2011] propose a different way to find the optimum, or t-center, of the objective, analogous to finding the representative of a cluster in a clustering algorithm. For MSE, for example, the center is the mean. They present a robust and efficient formulation for finding the center under a different formulation of the Bregman divergences.

The work of [Banerjee et al., 2005] uses Bregman divergences to measure distances between data points and to cluster data. This work is closely related to ours because it uses the Bregman divergences as objective functions, as we do, but in an unsupervised scenario. Its good results suggest that Bregman divergences can be powerful for recognizing patterns and are a good dissimilarity measure for clustering data. Some Bregman divergences have also been applied to train a GAN (Generative Adversarial Network) [Nowozin et al., 2016], another unsupervised scenario, where it is shown that training with divergences is feasible and brings some advantages.

Another application of the Bregman divergence is that of [Sugiyama et al., 2012], who propose a new efficient way to estimate the ratio of probability densities through the Bregman framework.

Since the Bregman divergences have proven to be a strong framework for dissimilarity measures, we base our different objective functions on this framework.

6 Experiments

The experiments were carried out with deep neural networks trained on a GPU, a GeForce GTX 1060 (6 GB), and we repeated each experiment 4 times to average out the randomness of the initialization and optimization of the neural network. We used the Adam optimizer [Kingma and Ba, 2014] with the Glorot weight initialization [Glorot and Bengio, 2010]. The batch size is set to 128, with a limit of 20 epochs. We used a standard training/test split of 70%/30%.

6.1 Data

The first dataset used is an image dataset known as GalaxyZoo (www.galaxyzoo.org). This project was started by astronomers of Oxford University in 2007 [Lintott et al., 2008], who asked members of the public to classify their dataset of millions of galaxies (thanks to SDSS, www.sdss.org). The dataset we work with is a small subset of this, with the probabilities of the different galaxy morphologies given by thousands of volunteers (annotators).

We work with the Kaggle (www.kaggle.com) version of this dataset, which has 60 thousand RGB images that we resize to 100x100 pixels. The categories are also a subset of all the answers/annotations, corresponding to 7 answers to questions about the galaxy morphology.

1. How round is the smooth galaxy?

(a) Completely round

(b) Between

(c) Cigar shaped

2. What type of disk is the galaxy?

(a) Edge-on disk view

(b) Spiral tight

(c) Spiral medium

(d) Spiral loose

(e) Normal disk

3. Is it a galaxy?

(a) It is a star or artifact

The categories are mutually exclusive, so only one can be given. The ones with the highest probability over the whole dataset are Normal disk and Smooth between.

The second dataset we used is a text dataset also provided by the Kaggle platform, Stock tweets emotions (www.kaggle.com/fernandojvdasilva/stock-tweets-ptbr-emotions). Here we also have multiple annotations for every tweet (written in Portuguese) about the emotion expressed in it.

The 9 categories represent the emotion of the tweet: joy, sadness, trust, disgust, surprise, anticipation, anger, fear and neutral, the last being the category with the highest probability over the dataset.

6.1.1 Architectures

The model used to process the images is similar to the one presented by the winner of the competition, shown in Figure 4. It is a convolutional model [LeCun et al., 1995] with 3 convolutional blocks, $C\rightarrow P$, with $P$ a max pooling layer of pool size 2 and $C$ a convolution with kernel size 3 and 32, 64 and 128 filters respectively. This is followed by two dense layers with 512 units. The ReLU activation function is used in all of these layers.

The model that processes the text data is a standard recurrent neural network with two layers of Gated Recurrent Units (GRU) [Chung et al., 2014].

In both models there is a final dense layer with softmax activation that gives the predicted probabilities over the categories.
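To get a feel for the size of the convolutional model, the spatial size after each $C\rightarrow P$ block can be computed with plain Python. The paper does not state the padding or strides, so this sketch assumes stride-1 convolutions, non-overlapping pooling with floor rounding, and compares "valid" and "same" padding. The function name `conv_block_sizes` is hypothetical.

```python
def conv_block_sizes(size, n_blocks=3, kernel=3, pool=2, padding="valid"):
    """Spatial size after each convolution + max pooling block."""
    sizes = []
    for _ in range(n_blocks):
        if padding == "valid":
            size = size - kernel + 1  # a stride-1 convolution shrinks the map
        elif padding != "same":
            raise ValueError("padding must be 'valid' or 'same'")
        size = size // pool  # pooling with stride equal to the pool size
        sizes.append(size)
    return sizes

valid_sizes = conv_block_sizes(100, padding="valid")
same_sizes = conv_block_sizes(100, padding="same")

# Number of features fed to the first dense layer (128 filters in block 3).
flat_valid = valid_sizes[-1] ** 2 * 128
flat_same = same_sizes[-1] ** 2 * 128
print(valid_sizes, flat_valid)
print(same_sizes, flat_same)
```

Under either assumption the first dense layer of 512 units receives more than ten thousand inputs, so most of the parameters of such a model sit in that layer.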

6.2 Results and Discussion

As a preliminary analysis, we show the results and compare the effect of the different objective functions on the GalaxyZoo dataset. The general behavior of the loss functions is presented in Figure 2. The scale of the Itakura-Saito distance is much higher than the rest (magnitude close to 50), while the numeric objectives from regression, such as MSE, RMSE and Euclidean, have a lower range, below $0.3$. Another observation is that the behavior of Forward KL is practically the same as that of Generalized I, and Cross Entropy and Jensen-Shannon also have similar curvature as the objective function progresses. Likewise, the curvature of the MSE loss is similar to that of RMSE. This is due to the similarity of the objective functions, whose gradients differ only by a multiplicative factor.

Figure 3 gives a fair comparison across the different scales of the objective functions through the convergence delta (Equation 12). It can be seen that each objective function converges in its own way. For example, the Itakura-Saito distance is the first to stop varying (fast convergence). In second place is Cross Entropy, converging at epoch 3, followed by RMSE at epoch 7. The last to converge, with the variation limit set to $0.05$, are Generalized I, MSE and Euclidean, after epoch 15. No pattern can be found that separates probability-based loss functions from numeric continuous loss functions, except among the last to converge, whose functions share the difference between the real and the predicted values, $(p-\hat{p})$. In these cases the derivative becomes directly proportional to the model, which may cause the model to keep learning during those epochs.

Table 1 shows the results of the macro F1 metric on both sets. It can be seen that the worst result corresponds to the Itakura-Saito distance, followed by Reverse KL. The objective function that achieves the best generalization with regard to this metric is Generalized I, followed by MSE, Forward KL and Cross Entropy, in that order. It is worth pointing out that MSE and Cross Entropy achieved very good results on the training data, showing that these models have a very good generalization rate (difference between test and train). The objective function with the highest overfitting is the Euclidean, which surprisingly differs from MSE only in a constant normalization factor.

The results on the ranking metric, NDCG, are shown in Table 2. Similarly to the previously reported results, the worst behavior is obtained with the Itakura-Saito distance and Reverse KL, and the best results (highlighted in bold) are also repeated. Generalized I, Cross Entropy and Jensen-Shannon show the best generalization and performance on the test set, as does MSE. These results show that the objective functions based on probabilities are the best at sorting the categories (based on the probabilities), so if this were the goal, these functions would be the best choice.

As an alternative metric to evaluate how the model assigns the probabilities of each category, Table 3 reports the Accuracy on Ranking Decrease (Equation 14). Again some results are repeated, as the worst behavior is with the Itakura-Saito distance and Reverse KL. The objective function with the highest score on the training set is Forward KL, and Jensen-Shannon is the one that generalizes best. Other functions that still have good results are again those based on probabilities: Cross Entropy and Generalized I.

The architecture of Figure 4 was also tested on the data and the results hold. The change is that the Jensen-Shannon divergence stands out on both the training and test sets according to the macro F1 metric.

On the text dataset (Stock tweets emotions) we obtain similar results, so we do not show them here. There is one exception with respect to Reverse KL, which surpasses all the other objective functions. This may be because this is a highly unbalanced dataset and this objective function acts as a regularizer by itself, since it maximizes the entropy of the prediction and minimizes the Cross Entropy between the prediction and the real distribution, as can be seen by decomposing Equation 5. On GalaxyZoo, Reverse KL does not have the expected result of acting as a regularizer.
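The decomposition referred to here can be written out explicitly. Starting from the Reverse KL of Equation 5 and splitting the logarithm, with $H(\hat{p}_{i})=-\sum_{j}^{K}\hat{p}_{ij}\log\hat{p}_{ij}$ the entropy of the prediction and $H(\hat{p}_{i},p_{i})$ the Cross Entropy with the prediction in the first argument:

$$\begin{aligned}KL(\hat{p}_{i}\,||\,p_{i})&=\sum_{j}^{K}\hat{p}_{ij}\log\hat{p}_{ij}-\sum_{j}^{K}\hat{p}_{ij}\log p_{ij}\\&=-H(\hat{p}_{i})+H(\hat{p}_{i},p_{i})\end{aligned}$$

Minimizing Reverse KL therefore pushes the entropy of the prediction up while pushing this Cross Entropy term down, which is the regularizing effect described above.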

Summarizing all the experiments, the Bregman divergences as a family, sharing the same properties, do not necessarily behave well. This could be because they are not metrics, and the missing properties of symmetry and the triangle inequality might improve the behavior. For example, making KL symmetric (Jensen-Shannon divergence) improves some results on the test set.

The objective function commonly used for classification with probabilities, Cross Entropy, turned out to hold up well, with good results on the different metrics and on convergence. Generalized I, despite not being widely studied or chosen in other works, shows very good behavior on the metrics and the best generalization.

Cross Entropy is the same as Forward KL for optimization purposes, except for the entropy of the real probability, $H(p)$, which does not depend on the parameters of the model:

$$H(p,q)=H(p)+KL(p\,||\,q) \tag{15}$$

$H(p)$ vanishes in the partial derivative. However, these two functions reach different values in the optimization: Forward KL stays below Cross Entropy in the results. A similar case is that of MSE and the Squared Euclidean distance (SSE), which differ only by a multiplicative constant, $\frac{1}{K}$, and yet achieve different results. This could be due to the stochastic optimization of the algorithms, as they have the same global minimum, or because the functional derivative in Keras does not have good precision.
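The analytical equivalences discussed here can be checked with finite differences. The sketch below is an illustration under stated assumptions: gradients are taken directly with respect to the predicted vector $q$ (not through a softmax layer or network weights), and the helper `numerical_grad` is a hypothetical name.

```python
import numpy as np

def numerical_grad(f, q, h=1e-6):
    """Central finite-difference gradient of f at q."""
    grad = np.zeros_like(q)
    for j in range(q.size):
        step = np.zeros_like(q)
        step[j] = h
        grad[j] = (f(q + step) - f(q - step)) / (2 * h)
    return grad

p = np.array([0.7, 0.2, 0.1])  # target distribution
q = np.array([0.5, 0.3, 0.2])  # predicted distribution
K = p.size

def cross_entropy(q):
    return -np.sum(p * np.log(q))

def forward_kl(q):
    return np.sum(p * np.log(p / q))

def mse(q):
    return np.mean((p - q) ** 2)

def sse(q):
    return np.sum((p - q) ** 2)

g_ce = numerical_grad(cross_entropy, q)
g_kl = numerical_grad(forward_kl, q)
g_mse = numerical_grad(mse, q)
g_sse = numerical_grad(sse, q)

entropy_p = -np.sum(p * np.log(p))
print(g_ce, g_kl)                              # same gradient
print(cross_entropy(q) - forward_kl(q), entropy_p)  # values differ by H(p)
print(g_sse / g_mse)                           # constant factor K
```

The gradients of Cross Entropy and Forward KL agree, and those of SSE and MSE differ exactly by the factor $K$, so any difference in the trained models has to come from the optimization procedure rather than from the objectives themselves.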

The results show that a slight change in the objective function of the model can drastically change the results. Although the objective is the same, the curvature of the first derivative is different, being amplified or reduced.

7 Conclusion

In this work we studied different objective functions for the problem of estimating the probabilities of the data, and measured various metrics to evaluate quality.

Some results reflect a correlation among the functions based on probabilities. Cross Entropy, Generalized I, Jensen-Shannon and KL show good results on the ranking metrics, which indicates that the models managed to imitate the order of the categories on the data, as well as the probabilities.

The unexpected result that objective functions that are analytically equivalent for optimization achieve different results on all the metrics may be caused by different factors: the optimization framework, the stochasticity of the optimization algorithm, or a contribution from the terms that are ignored in the optimization.

Bibliography

[Banerjee et al., 2005] Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749.
[Bregman, 1967] Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217.
[Cao et al., 2007] Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM.
[Chen et al., 2008] Chen, P., Chen, Y., Rao, M., et al. (2008). Metrics defined by Bregman divergences: Part 2. Communications in Mathematical Sciences, 6(4):927–948.
[Chollet et al., 2015] Chollet, F. et al. (2015). Keras.
[Chung et al., 2014] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.