FRAME Revisited: An Interpretation View Based on Particle Evolution
Xu Cai, Yang Wu, Guanbin Li, Ziliang Chen, Liang Lin

TL;DR
This paper offers a new theoretical perspective on the FRAME model, identifying KL-vanishing as a cause of training instability, and proposes a Wasserstein distance-based approach to improve stability and consistency.
Contribution
It introduces a Wasserstein distance approach based on JKO flow to stabilize FRAME training and explains the instability through particle physics insights.
Findings
Enhanced training stability demonstrated in experiments
Superior visual realism in generated images
Theoretical validation of the proposed method's consistency
Table 1: Inception scores on CIFAR-10.
| Model Type | Name | Inception Score |
| --- | --- | --- |
| Real Images | | 11.24 ± 0.11 |
| Implicit Models | DCGAN | 6.16 ± 0.07 |
| | Improved GAN | 4.36 ± 0.05 |
| | ALI | 5.34 ± 0.05 |
| Descriptive Models | WINN-5CNNs | 5.58 ± 0.05 |
| | FRAME (wl) | 4.95 ± 0.05 |
| | FRAME | 4.28 ± 0.05 |
| | wFRAME (ours, wl) | 6.05 ± 0.13 |
| | wFRAME (ours) | 5.52 ± 0.13 |
Xu Cai¹†, Yang Wu¹†, Guanbin Li¹, Ziliang Chen¹, Liang Lin¹,²
¹School of Data and Computer Science, Sun Yat-Sen University, China
²Dark Matter AI Inc.
†Xu Cai and Yang Wu contributed equally to this work and share first authorship. The corresponding author is Liang Lin. This work was supported in part by the National Key Research and Development Program of China under Grant No. 2018YFC0830103, in part by the NSFC-Shenzhen Robotics Projects (U1613211), in part by the National Natural Science Foundation of China under Grants No. 61702565, No. 61622214 and No. 61836012, and in part by the National High Level Talents Special Support Plan (Ten Thousand Talents Program).
Abstract
FRAME (Filters, Random fields, And Maximum Entropy) is an energy-based descriptive model that synthesizes visual realism by capturing mutual patterns from structural input signals. Maximum likelihood estimation (MLE) is applied by default, yet it commonly leads to unstable training energy that wrecks the generated structures, a phenomenon that has remained unexplained. In this paper, we provide a new theoretical insight for analyzing FRAME from the perspective of particle physics, ascribing this phenomenon to the KL-vanishing issue. To stabilize the energy dissipation, we propose an alternative Wasserstein distance in discrete time, based on the conclusion that the Jordan-Kinderlehrer-Otto (JKO) discrete flow approximates the KL discrete flow when the time step size tends to 0. Moreover, this metric still maintains the model’s statistical consistency. Quantitative and qualitative experiments have been conducted on several widely used datasets, and the empirical studies demonstrate the effectiveness and superiority of our method.
Introduction
FRAME (Filters, Random fields, And Maximum Entropy) (?) is a model built on Markov random fields that can approximate various types of data distributions, such as images, videos, audio and 3D shapes (?; ?; ?). It is an energy-based descriptive model in the sense that, besides estimating its parameters, samples can be synthesized from the probability distribution the model specifies. This distribution is derived from the maximum entropy principle (MEP) and is consistent with the statistical properties of the observed filter responses. FRAME can be trained by minimizing an information-theoretic divergence between the real data distribution and the model distribution. Early efforts use the KL divergence by default, which leads to the same results as MLE.
A large number of experimental results reveal that FRAME tends to generate inferior synthesized images and often struggles to converge during training. For instance, as displayed in Fig. 1, the images synthesized by FRAME deteriorate seriously along with the model energy. This phenomenon is caused by KL-vanishing in the stepwise parameter estimation of the model, which arises from the large disparity in filter responses between the real data distribution and the model distribution. Specifically, the MLE-based learning algorithm attempts to optimize a transformation from the high-dimensional support of one distribution to the non-existing support of the other: it starts from a Gaussian noise initialization covering the whole support of both distributions and then gradually updates the parameters by computing the KL discrete flow step by step. Therefore, in the discrete-time setting of the actual iterative training process, the dissipation of the model energy may become considerably unstable, and the stepwise minimization scheme may suffer from a serious KL-vanishing issue during the iterative parameter estimation.
To tackle the above shortcomings, we first investigate the model from a particle perspective by regarding all the observed signals as Brownian particles (a pre-condition of the KL discrete flow), which helps explain why the FRAME model collapses. This is inspired by the fact that the empirical measure of a set of Brownian particles satisfies a Large Deviation Principle (LDP) whose rate functional coincides exactly with the KL discrete flow (see Lemma 1). We then study the model in the discrete-time setting and translate its learning mechanism from the KL discrete flow into the Jordan-Kinderlehrer-Otto (JKO) (?) discrete flow, a procedure for finding time-discrete approximations to solutions of diffusion equations in Wasserstein space. By resorting to the geometric distance between the two distributions through optimal transport (OT) (?) and replacing the KL divergence with the Wasserstein distance (a.k.a. the earth mover’s distance (?)), this method stabilizes the energy dissipation scheme in FRAME and maintains its statistical consistency. The theoretical contribution can be summarized as the following deduction process:
- We deduce the learning process of the data density in the FRAME model from the viewpoint of particle evolution and confirm that it can be approximated by a discrete flow model with gradually decreasing energy, driven by the minimization of the KL divergence.
- We further propose a Wasserstein perspective of FRAME (wFRAME) by reformulating FRAME’s learning mechanism from the KL discrete flow into the JKO discrete flow. The former theoretically explains the cause of the vanishing problem, while the latter overcomes its drawbacks, including the instability of sample generation and the failure of model convergence during training.
Qualitative and quantitative experiments demonstrate that the proposed wFRAME greatly ameliorates the vanishing issue of FRAME and generates more visually promising results, especially for structurally complex training data. Moreover, to our knowledge, this method can be applied to most sampling processes that aim to reduce the KL divergence between the real data distribution and the generated data distribution over a time sequence.
Related Work
Descriptive Model for Generation.
Descriptive models, which originate from statistical physics, specify an explicit probability distribution of the signal, ordinarily called the Gibbs distribution (?). With the rapid development of Convolutional Neural Networks (CNNs) (?), which have proven to be powerful discriminators, the generative perspective of this model has recently attracted increasing attention. (?) first introduces a generative gradient for pre-training a discriminative ConvNet by a non-parametric importance sampling scheme, and (?) proposes to learn FRAME using pre-learned filters of a modern CNN. (?) further studies the theory of the generative ConvNet and shows that the model has a representational structure that can be viewed as a hierarchical version of the FRAME model.
Implicit Model for Generation.
Apart from the descriptive models, another popular branch of deep generative models is black-box models, which map latent variables to signals via a top-down CNN, such as the Generative Adversarial Network (GAN) (?) and its variants. These models learn the generator network with the help of an auxiliary discriminator network and have achieved remarkable success in generating realistic images.
Relationship.
Unlike the majority of implicit generative models, which use an auxiliary network to guide the training of the generator, descriptive models maintain a single model that simultaneously serves as descriptor and generator, although FRAME can also serve as an auxiliary model and be combined with a GAN so that each facilitates the other (?). Descriptive models generate samples directly from the input set rather than from a latent space, which to a certain extent ensures that the model can be trained efficiently and produces stable synthesized results with relatively low structural complexity. In this paper, FRAME and its variants as described above share the same MLE-based learning mechanism, which follows an analysis-by-synthesis scheme: it first generates synthesized samples from the current model using Langevin dynamics and then learns the parameters from the distance between observed and synthesized samples.
Preliminaries
Let denote the space of Borel probability measures on any given subset of space , where , . Given some sufficient statistics , scalar and base measure , the space of distributions satisfying linear constraint is defined as . The Wasserstein space of order is defined as , where denotes the -norm on . is the number of elements in domain . denotes gradient and denotes the divergence operator.
Markov Random Fields (MRF).
MRF belongs to the family of undirected graphical models, which can be written in the Gibbs form as
[TABLE]
where stands for the number of features and is the partition function (?). Its MLE learning process follows the iteration of the following two steps:
I. Update model parameter by ascending the gradient of the log likelihood
[TABLE]
where and is respectively the feature response over real data distribution and current model distribution .
II. Sample from the current model by parallel MCMC chains. According to (?), the sampling process does not need to converge at each parameter update, so we maintain only one persistent sampler that converges globally in order to reduce computation.
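The two-step iteration above can be sketched on a toy problem. The following is a minimal illustration, not the paper's implementation: it assumes a one-dimensional model p(x; θ) ∝ exp(θ·x)·q(x) with a standard Gaussian reference q and a single feature f(x) = x, so that the log-likelihood gradient is the difference between the data mean and the model mean of the feature. The function names, learning rate, step size and number of chains are hypothetical choices made for the example.

```python
import random


def langevin_step(x, theta, step, rng):
    """One unadjusted Langevin update for log p(x) = theta*x - x**2/2 + const."""
    grad_log_p = theta - x  # derivative of theta*x - x**2/2
    return x + 0.5 * step ** 2 * grad_log_p + step * rng.gauss(0.0, 1.0)


def fit_theta(data, n_iters=300, n_chains=200, lr=0.1, step=0.3,
              langevin_steps=10, seed=0):
    """Alternate step I (gradient ascent on theta) and step II (sampling).

    The chains are persistent: they are never re-initialised between
    parameter updates, so they only need to track the model over time.
    """
    rng = random.Random(seed)
    theta = 0.0
    chains = [rng.gauss(0.0, 1.0) for _ in range(n_chains)]  # start from q
    data_mean = sum(data) / len(data)  # feature response on real data
    for _ in range(n_iters):
        # Step II: a few Langevin updates of the persistent chains.
        for _ in range(langevin_steps):
            chains = [langevin_step(x, theta, step, rng) for x in chains]
        model_mean = sum(chains) / len(chains)  # feature response of model
        # Step I: ascend the log-likelihood gradient.
        theta += lr * (data_mean - model_mean)
    return theta
```

In this toy setting the model is a unit-variance Gaussian with mean θ, so the learned θ should end up near the sample mean of the data.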
FRAME Model.
Based on an energy function, FRAME is defined on the exponential tilting of a reference distribution , which is a reformulation of MRF and can be written as (?):
[TABLE]
where is the nonlinear activation function, is the filtered image or feature map and denotes the Gaussian white noise model with mean [math] and variance .
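For orientation, one common way of writing such an exponential tilting in the CNN-filter formulation of FRAME is given below. This is a hedged reconstruction rather than the paper's Eq. 3, which is not preserved in this text: the filter weights $w_{k,u}$, biases $b_k$, activation $h$, position index $u$, dimension $|D|$ and the zero mean of the reference $q$ are assumed notation.

```latex
p(x;\theta) = \frac{1}{Z(\theta)}
  \exp\Big\{ \sum_{k=1}^{K} \sum_{u} h\big(\langle x, w_{k,u} \rangle + b_k\big) \Big\}\, q(x),
\qquad
q(x) = \frac{1}{(2\pi\sigma^{2})^{|D|/2}}
  \exp\Big\{ -\frac{\lVert x \rVert^{2}}{2\sigma^{2}} \Big\}
```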
KL Discrete Flow.
This flow is related to discrete probability distributions (evolutions discretized in time) with finite dimensional problems. More precisely, it indicates the system of independent Brownian particles whose position in is given by a Wiener process satisfies the following stochastic differential equation (SDE)
[TABLE]
is the drift term, stands for the diffusion term, denotes the Wiener process and subscript denotes time point . This empirical measure of those particles is proved to approximate Eq. 3 by an implicit descent step , where is the so called KL discrete flow consists of KL divergence and energy function .
[TABLE]
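As an illustration of a system of independent Brownian particles driven by an SDE with a drift term and a diffusion term, the sketch below simulates particles with the Euler-Maruyama scheme. It is only a sketch under stated assumptions: the drift is taken as the negative gradient of a hypothetical quadratic potential Ψ(x) = x²/2 and the diffusion coefficient as sqrt(2/β), which need not match the paper's SDE.

```python
import math
import random


def simulate_particles(n_particles=2000, n_steps=400, dt=0.01, beta=1.0, seed=0):
    """Euler-Maruyama simulation of dX = -X dt + sqrt(2/beta) dW.

    Each particle evolves independently; the empirical measure of the
    particles approximates the density of the diffusion at the final time.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 3.0) for _ in range(n_particles)]  # broad initial measure
    noise_scale = math.sqrt(2.0 * dt / beta)
    for _ in range(n_steps):
        # drift: -grad Psi(x) = -x for the assumed potential Psi(x) = x**2 / 2
        xs = [x - x * dt + noise_scale * rng.gauss(0.0, 1.0) for x in xs]
    return xs
```

For this assumed potential the stationary density is Gaussian with variance 1/β, so the empirical variance of the returned particles should be close to that value once enough time has elapsed.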
Particle Perspective of FRAME Model
Although there is a traditional statistical perspective for interpreting the FRAME theory (?), we still need a more stable sampling process to avoid the frequent generation failures. We revisit the FRAME model from a new particle perspective and prove that its parameter update mechanism is equivalent to a reformulation of the KL discrete flow. We then transform it into a mechanism in the JKO discrete flow manner, whose equivalence we prove under the condition of sufficiently many sampling time steps, and which ameliorates the unpredictable vanishing phenomenon. All detailed proofs are given in Appendix A.
Discrete Flow Driven by KL-divergence
We first introduce FRAME in the discrete flow manner. If we regard the observed signals, whose generating function has the Markov property, as Brownian particles, then Theorem 1 states that Langevin dynamics can be deduced from the KL discrete flow, sufficiently and necessarily, through Lemma 1.
Lemma 1.
For i.i.d. particles with common generating function which has Markov property, the empirical measure satisfies Large Deviation Principle (LDP) with rate functional in the form of .
Theorem 1.
Given a base measure , a clique potential , the density of FRAME in Eq. 3 can be obtained sufficiently and necessarily by solving the following constrained optimization.
[TABLE]
Let be the Lagrange multiplier integrated in and ensure , then the optimizing objective can be reformulated as
[TABLE]
Since , then the SDE iteration of in Eq. 4 can be expressed in the Langevin form as
[TABLE]
By Lemma 1, if we fix , the sampling scheme in Eq. 8 approaches the KL discrete flow , the flow will fluctuate in case varies. is updated by calculating , which implies can dynamically transform the transition map into desired. The sampling process of FRAME can be summed up as
[TABLE]
where is the derivative of initial Gaussian noise . If we take a close look at the objective function, there is an adversarial mechanism while updating and . Regardless of fixing updating , or fixing updating , the correct direction cannot be insured to the optimal of minimizing .
Discrete Flow Driven by Wasserstein Metric
Although the KL approach is relatively rational in the methodology of FRAME, it carries the risk of the KL-vanishing problem discussed above, since the parameter updating mechanism of MLE may fail to converge. To avoid this problem, we introduce the Wasserstein metric into the discrete flow, following the statement of (?) that a distribution can be close to a given empirical measure in the KL sense yet far from the same measure in the Wasserstein distance. (?) also claims that better convergence and approximation results can be obtained since the Wasserstein metric defines a weaker topology. The conclusion that the JKO discrete flow approximates the KL discrete flow when the time step size tends to 0 justifies the proposed method. This conclusion has been proved for the one-dimensional case in (?) and for higher dimensions by (?; ?). Here we first provide some background knowledge about the transformation and then briefly show the derivation process.
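The remark that the Wasserstein metric defines a weaker topology can be illustrated with a standard toy case, which is not taken from the paper: two histograms on a one-dimensional grid with disjoint supports. The KL divergence between them is infinite regardless of how far apart they are, whereas the order-1 Wasserstein distance grows smoothly with the shift. The helper names below are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two histograms on the same grid (math.inf if undefined)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p has mass where q has none
            total += pi * math.log(pi / qi)
    return total


def wasserstein_1d(p, q, bin_width=1.0):
    """Order-1 Wasserstein distance on a 1D grid via the CDF difference."""
    cdf_gap = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi              # F_p - F_q at this grid point
        total += abs(cdf_gap) * bin_width
    return total


# A unit mass at grid index 2 versus a unit mass at grid index 5.
p = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
q = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
kl_value = kl_divergence(p, q)
w1_value = wasserstein_1d(p, q)
```

Here the KL value carries no information about the size of the shift, while the Wasserstein value equals the number of grid cells the mass has to travel.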
Fokker-Planck Equation.
Under the influence of drifts and random diffusions, this equation describes the evolution for the probability density function of the particle velocity. Let be an integral function and denote its Euler-Lagrange first variation, the equations are
[TABLE]
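For reference, the standard way of writing a Fokker-Planck evolution as a continuity equation driven by the first variation of a functional is sketched below. The symbols $\rho_t$, $v_t$, $\mathcal{F}$, $\Phi$ and $\beta$ are assumed notation and may differ from the paper's Eq. 10; the second line is the special case of a potential energy plus an entropy term.

```latex
\frac{\partial \rho_t}{\partial t} + \nabla \cdot (\rho_t v_t) = 0,
\qquad
v_t = -\nabla \frac{\delta \mathcal{F}}{\delta \rho}(\rho_t)

% Special case: F(rho) = \int \Phi \rho \, dx + \beta^{-1} \int \rho \log \rho \, dx
\frac{\partial \rho_t}{\partial t}
  = \nabla \cdot (\rho_t \nabla \Phi) + \beta^{-1} \Delta \rho_t
```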
Wasserstein Metric.
The Benamou-Brenier form of this metric (?) of order involves solving a smoothy OT problem over any probabilities and in using the continuity equation showed in Eq. 10 as follows, where belongs to the tangent space of the manifold governed by some potential and associated with curve .
[TABLE]
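For reference, the Benamou-Brenier dynamic formulation of the order-2 Wasserstein distance is usually written as follows. This is the textbook form with assumed notation ($\mu$, $\nu$ for the two probabilities, $\rho_t$ for the connecting curve and $v_t$ for its velocity field), not a transcription of the paper's Eq. 11.

```latex
W_2^2(\mu, \nu) = \min_{(\rho_t, v_t)}
  \Big\{ \int_0^1 \!\! \int \lVert v_t(x) \rVert^2 \, \rho_t(x) \, dx \, dt
  \;:\;
  \frac{\partial \rho_t}{\partial t} + \nabla \cdot (\rho_t v_t) = 0,\;
  \rho_0 = \mu,\; \rho_1 = \nu \Big\}
```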
JKO Discrete Flow.
Following the initial work (?), which shows how to recover Fokker-Planck diffusions of distributions in Eq. 10 when minimizing entropy functionals according to Wasserstein metric , the JKO discrete flow is applied by our method to replace the initial KL divergence with the entropic Wasserstein distance . The function of the flow is
[TABLE]
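To give a feel for how a JKO scheme behaves as the time step shrinks, the sketch below treats a deliberately simplified case that is not the paper's functional: only a potential energy with Φ(y) = y²/2 is kept and the entropy term is dropped. Under that assumption, one step of the generic JKO update ρ_{k+1} = argmin_ρ { W₂²(ρ, ρ_k)/(2τ) + F(ρ) } moves each particle x to the minimizer of (y − x)²/(2τ) + y²/2, which is x/(1 + τ). The function names are hypothetical.

```python
import math


def jko_step_quadratic(xs, tau):
    """One JKO step for F(rho) = integral of y**2/2 d rho, applied to particles.

    Each particle moves to argmin_y (y - x)**2 / (2*tau) + y**2 / 2 = x / (1 + tau).
    """
    return [x / (1.0 + tau) for x in xs]


def run_jko(xs, total_time, n_steps):
    """Iterate the JKO step with time step tau = total_time / n_steps."""
    tau = total_time / n_steps
    for _ in range(n_steps):
        xs = jko_step_quadratic(xs, tau)
    return xs


x0 = [1.0, -2.0]
exact = [x * math.exp(-1.0) for x in x0]  # continuous-time flow x' = -x at t = 1
coarse = run_jko(x0, 1.0, 10)
fine = run_jko(x0, 1.0, 1000)
err_coarse = max(abs(a - b) for a, b in zip(coarse, exact))
err_fine = max(abs(a - b) for a, b in zip(fine, exact))
```

The discrete scheme approaches the continuous-time flow as the time step tends to 0, which is the qualitative behaviour the text relies on.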
Remark 1.
The initial Gaussian term is left out for convenience to facilitate the derivation, otherwise, the entropy in Eq. 12 should be written as the relative entropy .
By Theorem 1, instead of can be calculated in approximation and a steady state will approach Eq. 3. Applying in the manner of dissipation mechanism as a substitute of allows regarding the diffusion Eq. 4 as the steepest descent of clique energy and entropy w.r.t. Wasserstein metric. Solving such optimization problem using is identical to solve the Monge-Kantorovich mass transference problem.
With Second Mean Value theorem for definite integrals, we can approximately recover the integral by two randomly interpolated rectangles
[TABLE]
where parameterizes the time piece and represents random interpolated parameter since is random. With Eq. 13, the functional derivative of w.r.t. is then proportional to
[TABLE]
which is exactly the result of Proposition 8.5.6 in (?). Assume be at least twice differentiable and treat Eq. 14 as the variational condition in Eq. 10, then plug Eq. 14 into the continuity equation of Eq. 10, which turns into a modified Wasserstein gradient flow in Fokker-Planck form as follows
[TABLE]
Then the corresponding SDE can be written in Euler-Maruyama form as
[TABLE]
By Remark 1, if we reconsider the initial Gaussian term, the discrete flow of in Eq. 16 should be added with .
Remark 2.
If is the energy function defined in Eq. 3, then .
This is a direct result: since the energy function defined in FRAME only involves inner products, ReLU (piecewise linear) and other linear operations, its second derivative is 0. Therefore, the time evolution of the density in Eq. 15 and of the sample in Eq. 16 degenerate to Eq. 10 and Eq. 8, respectively. Thus the SDE remains in its default Langevin form, while the gradient of the model parameter does not degenerate.
Analogous to the parameterized KL flow defined above, we propose a corresponding form in the JKO manner. With Eq. 13 and Eq. 14, the final optimization objective can be formulated as
[TABLE]
With all discussed above, the learning progress of wFRAME can be constructed by ascending the gradient of , i.e. . The calculating steps in formulation are summarized in Eq. 18.
[TABLE]
The equation above indicates that the parameter gradient in the Wasserstein manner is augmented with soft gradient-norm constraints between the last two iterations. Such a gradient norm has the following advantages compared with the original iteration process (Eq. 9).
First, the norm serves as the constant-speed geodesic connecting with in the manifold spanned by and , which may speed up convergence. Next, it can be interpreted as a soft anti-force against the original gradient that prevents the whole learning process from vanishing. Moreover, in experiments, we find that it preserves the inner structural information of the data. The new learning and generation process of wFRAME is summarized in detail in Algorithm 1.
Experiments
In this section, we compare our proposed method with FRAME from two aspects: a confirmatory experiment on model collapse under varied settings with respect to the baseline, and a quantitative and qualitative comparison of generated results on widely used datasets. In the first stage, as expected, the proposed wFRAME is verified to be more robust in training, and its synthesized images are of higher quality and fidelity in most circumstances. In the second stage, we evaluate both models on the whole datasets. We also propose a new metric, response distance, which measures the gap between the generated data distribution and the real data distribution.
Confirmation of Model Collapse
We recognize that under some circumstances FRAME suffers serious model collapse. Due to MEP, a well-learned FRAME model is expected to achieve minimum , i.e. the minimum amount of transformation to the reference measure. But such minimization of the KL divergence might unpredictably drive the energy to 0, in which case the learned model degenerates to producing the initial noise instead of the desired minimum modification. Furthermore, in case , the learned model tends to degenerate. In other words, the images synthesized from FRAME driven by KL divergence will collapse immediately and the quality may barely recover. Consequently, the best curve of is slowly asymptotic to and slightly above 0.
To demonstrate the superiority of our method over FRAME under the baseline settings, we conduct validation experiments on a subset of the SUN dataset (?) under different circumstances. Intuitively, a simple trick against the model collapse issue is to restrict the model parameters to a safe range, a.k.a. weight clipping. The experimental settings include respectively altering and to an insecure range, turning weight clipping on or off, and varying the input dimensions. The results are presented in Fig. 3, which shows that our method generates more robustly than either the original strategy or FRAME with the weight clipping trick.
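The weight clipping trick mentioned above amounts to projecting every parameter back into a fixed interval after each update. A minimal sketch follows; the function name and the clipping bound are hypothetical, since the safe range used in the experiments is not stated in this text.

```python
def clip_weights(theta, limit):
    """Clamp every parameter to the interval [-limit, limit]."""
    return [max(-limit, min(limit, t)) for t in theta]


# Example: parameters after a gradient step, clipped with an assumed bound of 1.0.
updated = [0.3, -2.5, 1.7, -0.9]
clipped = clip_weights(updated, 1.0)
```

Clipping bounds the parameters but does not change the direction of the underlying update, which is why the text treats it only as a baseline remedy.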
Empirical Setup on Common Datasets
We apply wFRAME on several widely used datasets in the field of generative modeling. As for default experimental settings, , , the number of learning iterations is set to , the step number of Langevin sampling within each learning iteration is and the batch size is . The implementation of in our method is the first 4 convolutional layers of a pre-learned VGG-16 (?). Input shape varies by datasets and is specified following. The hyper-parameters appear in Algorithm 1 differs on each dataset in order to achieve the best results. As for FRAME we use default settings in (?).
CelebA (?) and LSUN-Bedroom (?) images are cropped and resized to . we set in both datasets, in CelebA and in LSUN-Bedroom. The visualizations of two methods are exhibited in Fig. 2.
CIFAR-10 (?) includes various categories and we learn both algorithms conditioned on the class label. In this experiment, we set , and images’ size are of . Numerically and visually in Fig. 4, 5 and Table 1, the results show great improvement.
For a fair comparison, two metrics are used to evaluate FRAME and wFRAME. We offer a new metric, response distance, to measure the disparity between two distributions based on the sampled results, while the Inception score is a widely used standard for measuring sample diversity.
Response distance is defined as
[TABLE]
where denotes the th filter. The smaller the is, the better the generated results will be, since , which implies that provides an approximation of the divergence between the target data distribution and the generated data distribution. Furthermore, by Eq. 2, the faster falls the better converges.
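The exact formula for the response distance is not preserved in this text. As a rough sketch of the idea, the code below assumes it is the average, over filters, of the absolute difference between the mean filter response on real samples and on synthesized samples, which mirrors the difference of feature responses in the log-likelihood gradient of Eq. 2. Both that definition and the function names are assumptions made for illustration.

```python
def mean_responses(responses):
    """Per-filter mean over samples; responses[i][k] is filter k on sample i."""
    n_samples = len(responses)
    n_filters = len(responses[0])
    return [sum(r[k] for r in responses) / n_samples for k in range(n_filters)]


def response_distance(real_responses, synth_responses):
    """Assumed form: mean absolute gap between per-filter mean responses."""
    real_mean = mean_responses(real_responses)
    synth_mean = mean_responses(synth_responses)
    gaps = [abs(a - b) for a, b in zip(real_mean, synth_mean)]
    return sum(gaps) / len(gaps)


# Two samples and two filters per set (toy numbers).
real = [[1.0, 2.0], [3.0, 4.0]]
synth = [[2.0, 2.0], [2.0, 6.0]]
distance = response_distance(real, synth)
```

Under this assumed definition the distance is zero when the synthesized samples reproduce the real filter statistics exactly and grows as the two sets of statistics drift apart.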
Inception score (IS) is the most widely adopted metric for generative models and estimates the diversity of the generated samples. It uses an Inception v2 network (?) pre-trained on ImageNet (?) to capture the classifiable properties of samples. This metric has the drawbacks of neglecting the visual quality of the generated results and favoring models that generate objects rather than realistic scene images, but it still provides essential diversity information about synthesized samples when evaluating generative models.
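For concreteness, the Inception score is usually computed as the exponential of the average KL divergence between each sample's predicted class distribution p(y|x) and the marginal class distribution p(y). The sketch below takes the class probabilities as given; in practice they would come from the pre-trained classifier, which is outside the scope of this example. The function name is hypothetical.

```python
import math


def inception_score(probs):
    """exp(mean_x KL(p(y|x) || p(y))) for probs[i][j] = p(y = j | x_i)."""
    n_samples = len(probs)
    n_classes = len(probs[0])
    marginal = [sum(p[j] for p in probs) / n_samples for j in range(n_classes)]
    total_kl = 0.0
    for p in probs:
        for pj, mj in zip(p, marginal):
            if pj > 0.0:  # terms with zero probability contribute nothing
                total_kl += pj * math.log(pj / mj)
    return math.exp(total_kl / n_samples)


# Confident predictions spread evenly over 4 classes versus uninformative ones.
confident = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
score_confident = inception_score(confident)
score_uniform = inception_score(uniform)
```

The score ranges from 1, when every prediction equals the marginal, up to the number of classes, when predictions are confident and evenly spread across classes.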
Comparison with GANs
We compare FRAME and wFRAME with GAN models on CIFAR-10 via the Inception score in Table 1. Most GAN-family models score quite high on this metric; however, our method is a descriptive model rather than an implicit model. GANs with high scores perform badly in descriptive situations, for example, the image reconstruction task or training on a small amount of data, whereas FRAME can handle most of these situations properly. The performance of DCGAN in modeling only a few images is presented in Fig. 6, where, for a fair comparison, we duplicate the input images several times to a total of 10000 to suit the training environment of DCGAN. The compared wFRAME is trained with our own method. DCGAN training is stopped once it converges, yet its results remain collapsed.
Comparison of FRAME and wFRAME
We analyze FRAME and wFRAME from two aspects as a summary of the experiments conducted above. As expected, our algorithm is more suitable for synthesizing complex and varied scene images, and the resulting images are noticeably more authentic than those of FRAME.
Quality of Generation Improvement.
According to the response distance, the quality of image synthesis is improved. This measurement tracks the iterative learning process of both FRAME and wFRAME. The learning curves presented in Fig. 4 are observations of the synthesis over the whole datasets. From the curves we can conclude that wFRAME converges better than FRAME. The generation results on CelebA, LSUN-Bedroom and CIFAR-10 in Fig. 2 and 5 show that even if the training images are relatively aligned with conspicuous structural information, or carry only simple categorical context information, the images produced by FRAME are still full of motley noise and twisted textures, while ours are more reasonably mixed, more sensibly structured and brightly colored, with less distortion.
Training Steadiness Improvement.
Compared with FRAME, as shown in Fig. 1, which illustrates the typical evolution of generated samples, we find an improvement in training steadiness. The generated images are almost identical at the beginning; however, images produced by our algorithm get back on track after 30 iterations, while those of FRAME deteriorate. Quantitatively, in Fig. 4, the curves are calculated by averaging across the whole dataset. wFRAME reaches a lower response distance, i.e., the direct critic of the filter banks between synthesized samples and target samples is smaller and decreases more steadily. More specifically, our algorithm largely solves the model collapse problem of FRAME, since it not only ensures the closeness between the generated samples and the “ground-truth” samples but also stabilizes the learning phase of the model parameters. The three plots clearly show that the quantitative measures are well correlated with the qualitative visualizations of generated samples. In the absence of collapse, we attain results comparable to or even better than those of FRAME.
Conclusion
In this paper, we retrace the origin of FRAME from the viewpoint of particle evolution and identify the factors that may lead to the deterioration of sample generation and the instability of model training, i.e., the inherent vanishing problem in the minimization of the KL divergence. Based on this finding, we propose wFRAME by reformulating the KL discrete flow in FRAME into the JKO scheme, and show through empirical examination that it can overcome the above-mentioned deficiencies. The experiments demonstrate the superiority of the proposed wFRAME model, and the comparative results show that it greatly ameliorates the vanishing issue of FRAME and produces more visually promising results.
Appendix A Proofs
Proof of Lemma 1
From another perspective, under the Gaussian reference measure, FRAME built on a modern ConvNet has the piecewise Gaussian property, which is summarized in Proposition 1.
Proposition 1.
(Reformulation of Theorem 1.) Equation 3 is piecewise Gaussian, on each piece the probability density can be written as:
[TABLE]
where is an approximated reconstruction of in one piece of data space by a linear transformation involving inner-products with model parameter and piecewise linear activation function(ReLu). This proposition implies that different pieces can be regarded as different generating samples acting as Brownian particles.
By Proposition 1, each particle (image piece) in FRAME has a transition kernel in Gaussian form (Eq. 19), which describes the probability of a particle moving from to in time . Let a fixed measure such as a Gaussian be the initial measure of the Brownian particles at time 0. Sanov’s theorem shows that the empirical measure of such transition particles satisfies an LDP with rate functional , i.e.,
[TABLE]
Specially, each Brownian particle has internal sub-particles which are independent in different cliques. Let denote the number of cliques, Cramer’s theorem tells us that for i.i.d. RVs with common generating function , the empirical mean satisfies LDP with rate functional in the Legendre transformation of ,
[TABLE]
Since the empirical measure of is simply the empirical mean of the Dirac measure, i.e., , then the empirical measure over all particles achieves to
[TABLE]
where the exponent is exactly the KL discrete flow . Thus the empirical measure of the activation patterns of all those particles satisfies LDP with rate functional in discrete time. ∎
Proof of Theorem 1
Necessity can be constructed using MEP via calculating iteratively:
[TABLE]
Sufficiency: Recall the Markov property in Eq. 1, we can write the inner product as sum of feature responses w.r.t. different clique, then shared pattern activation can be approximated by the Dirac measure as
[TABLE]
The result coincides with the empirical measure of , so the proof of sufficiency reduces to the proof of Lemma 1, which has been given above. ∎
Appendix B More Visual Results
References
- [Adams et al. 2011] Adams, S.; Dirr, N.; Peletier, M. A.; and Zimmer, J. 2011. From a large-deviations principle to the Wasserstein gradient flow: a new micro-macro passage. Communications in Mathematical Physics 307(3):791–815.
- [Ambrosio, Gigli, and Savaré 2008] Ambrosio, L.; Gigli, N.; and Savaré, G. 2008. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media.
- [Arjovsky, Chintala, and Bottou 2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223.
- [Benamou and Brenier 2000] Benamou, J.-D., and Brenier, Y. 2000. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik 84(3):375–393.
- [Dai, Lu, and Wu 2014] Dai, J.; Lu, Y.; and Wu, Y.-N. 2014. Generative modeling of convolutional neural networks. arXiv preprint arXiv:1412.6296.
- [Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009. IEEE Conference on, 248–255. IEEE.
- [Duong, Laschos, and Renger 2013] Duong, M. H.; Laschos, V.; and Renger, M. 2013. Wasserstein gradient flows from large deviations of many-particle limits. ESAIM: Control, Optimisation and Calculus of Variations 19(4):1166–1188.
- [Erbar et al. 2015] Erbar, M.; Maas, J.; Renger, M.; et al. 2015. From large deviations to Wasserstein gradient flows in multiple dimensions. Electronic Communications in Probability 20.
