FRAME Revisited: An Interpretation View Based on Particle Evolution

Xu Cai; Yang Wu; Guanbin Li; Ziliang Chen; Liang Lin

arXiv:1812.01186·cs.LG·January 17, 2019


TL;DR

This paper offers a new theoretical perspective on the FRAME model, identifying KL-vanishing as a cause of training instability, and proposes a Wasserstein distance-based approach to improve stability and consistency.

Contribution

It introduces a Wasserstein distance approach based on JKO flow to stabilize FRAME training and explains the instability through particle physics insights.

Findings

01. Enhanced training stability demonstrated in experiments

02. Superior visual realism in generated images

03. Theoretical validation of the proposed method's consistency

Abstract

FRAME (Filters, Random fields, And Maximum Entropy) is an energy-based descriptive model that synthesizes visual realism by capturing mutual patterns from structural input signals. The maximum likelihood estimation (MLE) is applied by default, yet conventionally causes the unstable training energy that wrecks the generated structures, which remains unexplained. In this paper, we provide a new theoretical insight to analyze FRAME, from a perspective of particle physics ascribing the weird phenomenon to KL-vanishing issue. In order to stabilize the energy dissipation, we propose an alternative Wasserstein distance in discrete time based on the conclusion that the Jordan-Kinderlehrer-Otto (JKO) discrete flow approximates KL discrete flow when the time step size tends to 0. Besides, this metric can still maintain the model's statistical consistency. Quantitative and qualitative experiments…

Tables

Table 1: Inception scores (IS) on CIFAR-10, where 'wl' means training with labels. The IS of ALI is reported in (?), the IS of DCGAN in (?), the result of Improved GAN (wl) in (?), and that of WINN in (?). Among the descriptive models, wFRAME outperforms most methods.

Model Type	Name	Inception Score
-	Real Images	11.24 $\pm$ 0.11
Implicit Models	DCGAN	6.16 $\pm$ 0.07
Implicit Models	Improved GAN	4.36 $\pm$ 0.05
Implicit Models	ALI	5.34 $\pm$ 0.05
Descriptive Models	WINN-5CNNs	5.58 $\pm$ 0.05
Descriptive Models	FRAME (wl)	4.95 $\pm$ 0.05
Descriptive Models	FRAME	4.28 $\pm$ 0.05
Descriptive Models	wFRAME (ours, wl)	6.05 $\pm$ 0.13
Descriptive Models	wFRAME (ours)	5.52 $\pm$ 0.13

Equations

$$P\left(\boldsymbol{x};\boldsymbol{\theta}\right)=\frac{1}{Z(\boldsymbol{\theta})}\exp\Big\{\sum_{k=1}^{K}\theta_{k}f_{k}(\boldsymbol{x})\Big\}, \tag{1}$$

$$\frac{\partial}{\partial\theta_{k}}\frac{1}{N}\log P(\boldsymbol{x};\boldsymbol{\theta})=\mathbb{E}_{\mathbb{P}_{r}}\left[f_{k}(\boldsymbol{x})\right]-\mathbb{E}_{P(\boldsymbol{x};\boldsymbol{\theta})}\left[f_{k}(\boldsymbol{x})\right], \tag{2}$$

$$P\left(\boldsymbol{x};\boldsymbol{\theta}\right)=\frac{1}{Z(\boldsymbol{\theta})}\exp\Big\{\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}\theta_{k}h\left(\langle\boldsymbol{x},\boldsymbol{w}\rangle+\boldsymbol{b}\right)_{k}\Big\}q(\boldsymbol{x}), \tag{3}$$

$$d\boldsymbol{x}_{t}=\mu(\boldsymbol{x}_{t})\,dt+\varepsilon(\boldsymbol{x}_{t})\,d\boldsymbol{B}_{t}. \tag{4}$$

$$\mathcal{I}_{t}=\mathcal{K}(\rho\mid\rho_{t})+\int\Phi\,d\rho. \tag{5}$$

$$\begin{aligned}\rho_{t+1}=\ &\operatorname*{argmin}_{\rho}\ \mathcal{K}(\rho\mid\rho_{t}),\\ \text{s.t.}\ &\int\Phi\,d\rho=\int\Phi\,d\mathbb{P}_{r},\quad\rho_{0}=q,\quad\forall\rho\in\mathcal{P}_{\alpha}^{lin}.\end{aligned} \tag{6}$$

$$\mathcal{I}_{t}^{l}=\min_{\rho}\max_{\boldsymbol{\theta}}\Big\{\mathcal{K}(\rho\mid\rho_{t})+\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\rho-\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\mathbb{P}_{r}\Big\}. \tag{7}$$

$$\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}+\nabla_{\boldsymbol{x}}\log P(\boldsymbol{x}_{t};\boldsymbol{\theta})+\sqrt{2}\,\xi_{t}. \tag{8}$$

$$\begin{cases}\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\left(\frac{\boldsymbol{x}_{t}}{\sigma^{2}}-\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x}_{t};\boldsymbol{\theta})\right)+\sqrt{2}\,\xi_{t}\\ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})]-\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\mathbb{P}_{r}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})],\end{cases} \tag{9}$$

$$\begin{cases}\partial_{t}\rho+\operatorname{div}(\rho\nu)=0&\text{(Continuity equation)}\\ \nu=-\nabla\frac{\delta F}{\delta\rho}&\text{(Variational condition)}\\ \rho(\cdot,0)=\rho_{0}&\rho_{0}\in L^{1}(\mathbb{R}^{d}),\ \rho_{0}\geq 0.\end{cases} \tag{10}$$

$$\mathcal{W}^{r}(\mu_{1},\mu_{2}):=\min_{\rho_{t}\in\mathcal{P}_{r}}\left\{\int_{0}^{1}\int_{\mathbb{R}^{d}}|\nu_{t}|^{r}\,d\rho_{t}\,dt\ :\ \partial_{t}\rho_{t}+\operatorname{div}(\rho_{t}\cdot\nu_{t})=0,\ \rho_{0}=\mu_{1},\ \rho_{1}=\mu_{2}\right\}. \tag{11}$$

$$\mathcal{J}_{t}=\frac{1}{2}\mathcal{W}^{2}(\rho,\rho_{t})+\int\log\rho\,d\rho+\int\Phi\,d\rho. \tag{12}$$

$$\begin{aligned}\mathcal{W}^{2}(\rho_{t_{0}},\rho_{t_{1}})&:=\inf_{\rho_{t}}\int_{t_{0}}^{t_{1}}\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t}\,dt\\ &\approx(\zeta-t_{0})\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{0}}+(t_{1}-\zeta)\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{1}}\\ &=-\beta\Big((1-\gamma)\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{0}}+\gamma\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{1}}\Big).\end{aligned} \tag{13}$$

$$\frac{\delta\mathcal{W}^{2}(\rho_{t_{0}},\rho_{t_{1}})}{\delta\rho_{t_{1}}}\propto|\nabla\Phi|^{2}, \tag{14}$$

$$\partial_{t}\rho=\Delta\rho-\operatorname{div}\left(\rho\left(\nabla\Phi-\nabla|\nabla\Phi(\boldsymbol{x})|^{2}\right)\right). \tag{15}$$

$$\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}+\nabla\Phi(\boldsymbol{x}_{t})-\nabla|\nabla\Phi(\boldsymbol{x}_{t})|^{2}+\sqrt{2}\,\xi_{t}. \tag{16}$$

$$\begin{aligned}\mathcal{J}_{t}^{l}=\min_{\rho}\max_{\boldsymbol{\theta}}\Big\{&-\frac{\beta}{2}\left(1-\gamma\right)\int_{\mathbb{R}^{d}}|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\,d\rho_{t}-\frac{\beta}{2}\gamma\int_{\mathbb{R}^{d}}|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\,d\rho+\int\log\rho\,d\rho\\ &+\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\rho-\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\mathbb{P}_{r}\Big\}.\end{aligned} \tag{17}$$

$$\begin{cases}\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\left(\frac{\boldsymbol{x}_{t}}{\sigma^{2}}-\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x}_{t};\boldsymbol{\theta})\right)+\sqrt{2}\,\xi_{t}\\ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})]-\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\mathbb{P}_{r}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})]-\frac{\beta}{2}(1-\gamma)\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t-1}}\left[|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\right]-\frac{\beta}{2}\gamma\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t}}\left[|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\right].\end{cases} \tag{18}$$

$$R=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\sum_{i=1}^{N}F_{k}(\boldsymbol{x}^{i})-\frac{1}{M}\sum_{i=1}^{M}F_{k}(\boldsymbol{y}^{i})$$

$$P(\boldsymbol{x};\boldsymbol{\theta})\propto\exp\left[-\frac{1}{4}\|\boldsymbol{x}-\boldsymbol{y}\|^{2}\right]$$

$$\frac{1}{n}\sum_{i=1}^{n}\delta_{\boldsymbol{x}_{\tau}^{i}}\sim\exp\left[-\mathcal{K}(\rho_{\tau}\mid\rho_{0})\right].$$

$$\frac{1}{C}\sum_{j=1}^{C}\boldsymbol{x}^{i}(j)\sim\exp\Big[\boldsymbol{\theta}-\mathbb{E}[\Phi(\boldsymbol{\theta})]\Big].$$

$$\begin{aligned}\frac{1}{n}\sum_{i=1}^{n}\delta_{\boldsymbol{x}_{\tau}^{i}}\,\frac{1}{C}\sum_{j=1}^{d}\boldsymbol{x}_{\tau}^{i}(j)&\overset{\text{Sanov, Cram\'er}}{\sim}\exp\left[-\mathcal{K}(\rho_{\tau}\mid\rho_{0})\right]\cdot\exp\Big[\theta-\mathbb{E}[\Phi(\boldsymbol{\theta})]\Big]\\ &\propto\exp\Big[\mathcal{K}(\rho_{\tau}\mid\rho_{0})+\mathbb{E}[\Phi(\boldsymbol{\theta})]\Big].\end{aligned}$$

$$\lim_{t\to\infty}\rho_{t+1}=\frac{1}{Z}e^{(\omega_{t}+\omega_{t-1})\cdot\Phi}\rho_{t-1}=\cdots=\frac{1}{Z}e^{\Phi(\boldsymbol{x};\boldsymbol{\theta})}q.$$

$$\sum_{k=1}^{K}h\left(\langle\boldsymbol{x}_{i}^{\tau},\boldsymbol{w}_{k}\rangle+b_{k}\right)\propto\delta_{\boldsymbol{x}_{i}^{\tau}}\sum_{j=1}^{C}\boldsymbol{x}_{i}^{\tau}(j).$$


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Anomaly Detection Techniques and Applications

Full text

FRAME Revisited: An Interpretation View Based on Particle Evolution

Xu Cai1†, Yang Wu1†, Guanbin Li1, Ziliang Chen1, Liang Lin1,2

1School of Data and Computer Science, Sun Yat-Sen University, China

2Dark Matter AI Inc.

† Xu Cai and Yang Wu contributed equally to this work and share first authorship. Corresponding author: Liang Lin. This work was supported in part by the National Key Research and Development Program of China under Grant No. 2018YFC0830103, in part by the NSFC-Shenzhen Robotics Projects (U1613211), in part by the National Natural Science Foundation of China under Grants No. 61702565, No. 61622214 and No. 61836012, and in part by the National High Level Talents Special Support Plan (Ten Thousand Talents Program).

Abstract

FRAME (Filters, Random fields, And Maximum Entropy) is an energy-based descriptive model that synthesizes visual realism by capturing mutual patterns from structural input signals. The maximum likelihood estimation (MLE) is applied by default, yet conventionally causes the unstable training energy that wrecks the generated structures, which remains unexplained. In this paper, we provide a new theoretical insight to analyze FRAME, from a perspective of particle physics ascribing the weird phenomenon to KL-vanishing issue. In order to stabilize the energy dissipation, we propose an alternative Wasserstein distance in discrete time based on the conclusion that the Jordan-Kinderlehrer-Otto (JKO) discrete flow approximates KL discrete flow when the time step size tends to $0$. Besides, this metric can still maintain the model’s statistical consistency. Quantitative and qualitative experiments have been respectively conducted on several widely used datasets. The empirical studies have evidenced the effectiveness and superiority of our method.

Introduction

FRAME (Filters, Random fields, And Maximum Entropy) (?) is a model built on Markov random fields that can approximate various types of data distributions, such as images, videos, audio and 3D shapes (?; ?; ?). It is an energy-based descriptive model in the sense that, besides estimating its parameters, one can synthesize samples from the probability distribution the model specifies. This distribution is derived from the maximum entropy principle (MEP) and is consistent with the statistical properties of the observed filter responses. FRAME can be trained by minimizing an information-theoretic divergence between the real data distribution $\mathbb{P}_{r}$ and the model distribution $P_{\theta}$. Early work uses the KL divergence by default, which yields the same result as MLE.

Extensive experimental results reveal that FRAME tends to generate inferior synthesized images and often struggles to converge during training. For instance, as displayed in Fig. 1, the images synthesized by FRAME deteriorate severely along with the model energy. This phenomenon is caused by KL-vanishing in the stepwise parameter estimation of the model, due to the large disparity in filter responses between $P_{\theta}$ and $\mathbb{P}_{r}$. Specifically, the MLE-based learning algorithm attempts to optimize a transformation from the high-dimensional support of $P_{\theta}$ to the non-existing support of $\mathbb{P}_{r}$, i.e., it starts from Gaussian noise covering the whole support of $P_{\theta}$ and $\mathbb{P}_{r}$, then gradually updates $\theta$ by computing the KL discrete flow step by step. Therefore, in the discrete-time setting of the actual iterative training process, the dissipation of the model energy may become considerably unstable, and the stepwise minimization scheme may suffer from a serious KL-vanishing issue during the communicative parameter estimation.

To tackle the above shortcomings, we first investigate the model from a particle perspective by regarding all the observed signals as Brownian particles (a precondition of the KL discrete flow), which helps explain why the FRAME model collapses. This is inspired by the fact that the empirical measure of a set of Brownian particles generated by $P_{\theta}$ satisfies a Large Deviation Principle (LDP) whose rate functional coincides exactly with the KL discrete flow (see Lemma 1). We then study the model in discrete time and translate its learning mechanism from the KL discrete flow into the Jordan-Kinderlehrer-Otto (JKO) (?) discrete flow, a procedure for finding time-discrete approximations to solutions of diffusion equations in Wasserstein space. By resorting to the geometric distance between $P_{\theta}$ and $\mathbb{P}_{r}$ through optimal transport (OT) (?) and replacing the KL divergence with the Wasserstein distance (a.k.a. the earth mover’s distance (?)), this method stabilizes the energy dissipation scheme in FRAME while maintaining its statistical consistency. The theoretical contribution can be summed up as the following deduction process:

• We deduce the learning process of the data density in the FRAME model from the viewpoint of particle evolution and confirm that it can be approximated by a discrete flow model with gradually decreasing energy, driven by the minimization of the KL divergence.

• We further propose a Wasserstein perspective of FRAME (wFRAME) by reformulating FRAME’s learning mechanism from the KL discrete flow into the JKO discrete flow. The former theoretically explains the cause of the vanishing problem, while the latter overcomes its drawbacks, including the instability of sample generation and the failure of model convergence during training.

Qualitative and quantitative experiments demonstrate that the proposed wFRAME greatly ameliorates the vanishing issue of FRAME and generates more visually promising results, especially for structurally complex training data. Moreover, to our knowledge, this method can be applied to most sampling processes that aim to reduce the KL divergence between the real data distribution and the generated data distribution over a sequence of time steps.

Related Work

Descriptive Model for Generation.

Descriptive models, which originate from statistical physics, have an explicit probability distribution over the signal, ordinarily called the Gibbs distribution (?). With the rapid development of Convolutional Neural Networks (CNNs) (?), which have proven to be powerful discriminators, research on the generative perspective of this model has recently drawn increasing attention. (?) first introduces a generative gradient for pre-training a discriminative ConvNet with a non-parametric importance sampling scheme, and (?) proposes to learn FRAME using the pre-learned filters of a modern CNN. (?) further studies the theory of the generative ConvNet in depth and shows that the model has a representational structure that can be viewed as a hierarchical version of the FRAME model.

Implicit Model for Generation.

Apart from the descriptive models, another popular branch of deep generative models is black-box models which map the latent variables to signals via a top-down CNN, such as the Generative Adversarial Network (GAN) (?) and its variants. These models have gained remarkable success in generating realistic images and learn the generator network with an assistant discriminator network.

Relationship.

Unlike the majority of implicit generative models, which use an auxiliary network to guide the training of the generator, descriptive models maintain a single model that simultaneously serves as descriptor and generator, though FRAME can also serve as an auxiliary model and be combined with a GAN so that the two facilitate each other (?). Descriptive models in fact generate samples directly from the input set rather than from a latent space, which to a certain extent ensures that the model can be trained efficiently and produce stable synthesized results with relatively low structural complexity. FRAME and its variants described above share the same MLE-based learning mechanism, which follows an analysis-by-synthesis scheme: it first generates synthesized samples from the current model using Langevin dynamics and then learns the parameters from the distance between observed and synthesized samples.

Preliminaries

Let $\mathcal{P}$ denote the space of Borel probability measures on any given subset of the space $\mathcal{X}$, where $\boldsymbol{x}\in\mathbb{R}^{d}$ for all $\boldsymbol{x}\in\mathcal{X}$. Given a sufficient statistic $\phi:\mathcal{X}\to\mathbb{R}$, a scalar $\alpha\in\mathbb{R}$ and a base measure $q$, the space of distributions satisfying the linear constraint is defined as $\mathcal{P}_{\alpha}^{lin}=\left\{p,f\in\mathcal{P}:p=fq,f\geq 0,\int pdx=1,E_{p}[\phi(x)]=\alpha\right\}$. The Wasserstein space of order $r\in[1,\infty]$ is defined as $\mathcal{P}_{r}=\left\{p\in\mathcal{P}:\int|x|^{r}dp<\infty\right\}$, where $|\cdot|^{r}$ denotes the $r$-norm on $\mathcal{X}$. $|\mathcal{X}|$ is the number of elements in the domain $\mathcal{X}$. $\nabla$ denotes the gradient and $\operatorname{div}$ denotes the divergence operator.

Markov Random Fields (MRF).

MRF belongs to the family of undirected graphical models, which can be written in the Gibbs form as

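$$P\left(\boldsymbol{x};\boldsymbol{\theta}\right)=\frac{1}{Z(\boldsymbol{\theta})}\exp\Big\{\sum_{k=1}^{K}\theta_{k}f_{k}(\boldsymbol{x})\Big\}, \tag{1}$$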

where $K$ stands for the number of features $\left\{f_{k}\right\}_{k=1}^{K}$ and $Z(\cdot)$ is the partition function (?). Its MLE learning process follows the iteration of the following two steps:

I. Update model parameter $\boldsymbol{\theta}$ by ascending the gradient of the log likelihood

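$$\frac{\partial}{\partial\theta_{k}}\frac{1}{N}\log P(\boldsymbol{x};\boldsymbol{\theta})=\mathbb{E}_{\mathbb{P}_{r}}\left[f_{k}(\boldsymbol{x})\right]-\mathbb{E}_{P(\boldsymbol{x};\boldsymbol{\theta})}\left[f_{k}(\boldsymbol{x})\right], \tag{2}$$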

where $\mathbb{E}_{\mathbb{P}_{r}}[f_{k}\left(\boldsymbol{x}\right)]$ and $\mathbb{E}_{P\left(\boldsymbol{x};\boldsymbol{\theta}\right)}\left[f_{k}\left(\boldsymbol{x}\right)\right]$ are, respectively, the expected feature responses under the real data distribution $\mathbb{P}_{r}$ and the current model distribution $P\left(\boldsymbol{x};\boldsymbol{\theta}\right)$.

II. Sample from the current model with parallel MCMC chains. According to (?), the sampling process does not necessarily converge at each $\boldsymbol{\theta}_{t}$, so we maintain only one persistent sampler that converges globally in order to reduce computation.
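Step I can be sketched numerically as a moment-matching update. The snippet below is a minimal illustration, not the authors' implementation: it assumes the features $f_{k}$ have already been evaluated on observed and synthesized samples, and the function names and the step size `lr` are hypothetical.

```python
import numpy as np

def mle_gradient(features_real, features_model):
    """Monte Carlo estimate of the log-likelihood gradient w.r.t. theta.

    features_real:  array (N, K), f_k evaluated on N observed samples.
    features_model: array (M, K), f_k evaluated on M samples drawn from
                    the current model (e.g. by a persistent MCMC chain).
    Returns an array (K,): E_{P_r}[f_k] - E_{P(x;theta)}[f_k].
    """
    return features_real.mean(axis=0) - features_model.mean(axis=0)

def ascend(theta, features_real, features_model, lr=0.1):
    """One gradient-ascent step on theta (lr is a hypothetical step size)."""
    return theta + lr * mle_gradient(features_real, features_model)

# Toy usage with random stand-ins for the feature responses.
rng = np.random.default_rng(0)
real = rng.normal(1.0, 1.0, size=(200, 3))
model = rng.normal(0.0, 1.0, size=(200, 3))
theta = ascend(np.zeros(3), real, model)
```

The gradient vanishes exactly when the model reproduces the observed mean feature responses, which is the moment-matching condition behind the maximum entropy construction.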

FRAME Model.

Based on an energy function, FRAME is defined on the exponential tilting of a reference distribution $q$ , which is a reformulation of MRF and can be written as (?):

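$$P\left(\boldsymbol{x};\boldsymbol{\theta}\right)=\frac{1}{Z(\boldsymbol{\theta})}\exp\Big\{\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}\theta_{k}h\left(\langle\boldsymbol{x},\boldsymbol{w}\rangle+\boldsymbol{b}\right)_{k}\Big\}q(\boldsymbol{x}), \tag{3}$$

where $h(\boldsymbol{x})=\max(0,\boldsymbol{x})$ is the nonlinear activation function, $\langle\boldsymbol{x},\boldsymbol{w}\rangle$ is the filtered image or feature map and $q\left(\boldsymbol{x}\right)=\frac{1}{(2\pi\sigma^{2})^{|\mathcal{X}|/2}}\exp\left[-\frac{1}{2\sigma^{2}}\|\boldsymbol{x}\|^{2}\right]$ denotes the Gaussian white noise model with mean $0$ and variance $\sigma^{2}$.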
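To make the exponential tilting concrete, the following sketch evaluates the unnormalized log-density of Eq. 3 for a toy model. It is an illustration under simplifying assumptions: each filter $k$ is a single dense vector `W[k]` applied to a flattened signal (no convolution over positions), and all function names are hypothetical.

```python
import numpy as np

def frame_energy(x, W, b, theta):
    """Phi(x; theta) = sum_k theta_k * relu(<x, w_k> + b_k).

    x: (d,), W: (K, d), b: (K,), theta: (K,).
    """
    responses = np.maximum(0.0, W @ x + b)  # h(<x, w_k> + b_k)
    return float(theta @ responses)

def frame_log_density_unnormalized(x, W, b, theta, sigma):
    """log of exp{Phi(x; theta)} q(x), dropping log Z(theta) and the
    Gaussian normalizing constant, neither of which depends on x."""
    return frame_energy(x, W, b, theta) - float(x @ x) / (2.0 * sigma ** 2)

x = np.array([1.0, -2.0])
W = np.eye(2)
b = np.zeros(2)
theta = np.array([2.0, 3.0])
value = frame_log_density_unnormalized(x, W, b, theta, sigma=1.0)
```

Only the first filter is active for this `x`, so the energy is `2.0` and the Gaussian reference contributes `-2.5`.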

KL Discrete Flow.

This flow concerns discrete probability distributions (evolutions discretized in time) in finite-dimensional problems. More precisely, it describes a system of $n$ independent Brownian particles $\{\boldsymbol{x}^{i}\}_{i=1}^{n}$ whose positions in $\mathbb{R}^{d}$ are driven by a Wiener process and satisfy the following stochastic differential equation (SDE)

$$d\boldsymbol{x}_{t}=\mu(\boldsymbol{x}_{t})\,dt+\varepsilon(\boldsymbol{x}_{t})\,d\boldsymbol{B}_{t}. \tag{4}$$

Here $\mu$ is the drift term, $\varepsilon$ is the diffusion term, $\boldsymbol{B}$ denotes the Wiener process and the subscript $t$ denotes the time point $t$. The empirical measure of these particles is proved to approximate Eq. 3 through an implicit descent step $\rho^{*}=\operatorname*{argmin}_{\rho}\mathcal{I}_{t}$, where $\mathcal{I}_{t}$ is the so-called KL discrete flow, which consists of a KL divergence and an energy function $\Phi:\mathbb{R}^{d}\to\mathbb{R}$.

$$\mathcal{I}_{t}=\mathcal{K}(\rho\mid\rho_{t})+\int\Phi\,d\rho. \tag{5}$$
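The particle system behind Eq. 4 can be simulated with the standard Euler-Maruyama scheme. The sketch below is illustrative only: the Ornstein-Uhlenbeck drift, the constant diffusion, the step size and the function name are assumptions chosen for a quick sanity check, not settings from the paper.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, dt, n_steps, rng):
    """Simulate dx_t = mu(x_t) dt + eps(x_t) dB_t for a batch of particles.

    x0: (n, d) initial positions; drift and diffusion are callables that
    take the (n, d) state and return something broadcastable to (n, d).
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt), size=x.shape)  # Wiener increments
        x = x + drift(x) * dt + diffusion(x) * dB
    return x

rng = np.random.default_rng(0)
particles = euler_maruyama(
    x0=np.zeros((1000, 2)),
    drift=lambda x: -x,                # hypothetical drift (Ornstein-Uhlenbeck)
    diffusion=lambda x: np.sqrt(2.0),  # constant diffusion coefficient
    dt=0.01,
    n_steps=500,
    rng=rng,
)
```

With this choice of drift and diffusion the stationary law is a standard Gaussian, so the empirical variance of `particles` should be close to 1.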

Particle Perspective of FRAME Model

Although the FRAME theory has a traditional statistical interpretation (?), a more stable sampling process is still needed to avoid the frequent generation failures. We revisit the FRAME model from a new particle perspective and prove that its parameter update mechanism is equivalent to a reformulation of the KL discrete flow. A further transformation into a JKO discrete flow, whose equivalence we prove next under the condition of sufficiently many sampling time steps, ameliorates this unpredictable vanishing phenomenon. Detailed proofs are given in Appendix A.

Discrete Flow Driven by KL-divergence

Herein we first introduce FRAME in the discrete flow manner. If we regard the observed signals $\{\boldsymbol{x}^{i}_{t}\}_{i=1}^{n}$, whose generating function has the Markov property, as Brownian particles, then Theorem 1 states that Langevin dynamics can be deduced from the KL discrete flow, sufficiently and necessarily, through Lemma 1.

Lemma 1.

For i.i.d. particles $\{\boldsymbol{x}^{i}_{t}\}_{i=1}^{n}$ with a common generating function $\mathbb{E}[e^{\Phi(\boldsymbol{x};\boldsymbol{\theta})}]$ that has the Markov property, the empirical measure $\rho_{t}=\frac{1}{n}\sum_{i=1}^{n}\delta_{\boldsymbol{x}^{i}_{t}}$ satisfies a Large Deviation Principle (LDP) with rate functional in the form of $\mathcal{I}_{t}$.

Theorem 1.

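Given a base measure $q$ and a clique potential $\Phi$, the density of FRAME in Eq. 3 can be obtained sufficiently and necessarily by solving the following constrained optimization.

$$\begin{aligned}\rho_{t+1}=\ &\operatorname*{argmin}_{\rho}\ \mathcal{K}(\rho\mid\rho_{t}),\\ \text{s.t.}\ &\int\Phi\,d\rho=\int\Phi\,d\mathbb{P}_{r},\quad\rho_{0}=q,\quad\forall\rho\in\mathcal{P}_{\alpha}^{lin}.\end{aligned} \tag{6}$$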

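Let $\boldsymbol{\theta}$ be the Lagrange multiplier integrated in $\Phi(\boldsymbol{x};\boldsymbol{\theta})$ and ensure $\mathbb{E}[e^{\Phi(\boldsymbol{x};\boldsymbol{\theta})}]<\infty$, then the optimizing objective can be reformulated as

$$\mathcal{I}_{t}^{l}=\min_{\rho}\max_{\boldsymbol{\theta}}\Big\{\mathcal{K}(\rho\mid\rho_{t})+\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\rho-\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\mathbb{P}_{r}\Big\}. \tag{7}$$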

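Since $\nabla_{\boldsymbol{x}}\log P(\boldsymbol{x};\boldsymbol{\theta})=\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})$, the SDE iteration of $\boldsymbol{x}_{t}$ in Eq. 4 can be expressed in the Langevin form as

$$\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}+\nabla_{\boldsymbol{x}}\log P(\boldsymbol{x}_{t};\boldsymbol{\theta})+\sqrt{2}\,\xi_{t}. \tag{8}$$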

By Lemma 1, if we fix $\boldsymbol{\theta}$, the sampling scheme in Eq. 8 approaches the KL discrete flow $\mathcal{I}_{t}^{l}$, whereas the flow fluctuates when $\boldsymbol{\theta}$ varies. $\boldsymbol{\theta}$ is updated by calculating $\nabla_{\theta}\mathcal{I}_{t}^{l}$, which implies that $\boldsymbol{\theta}$ can dynamically transform the transition map into the desired one. The sampling process of FRAME can be summed up as

$$\begin{cases}\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\left(\frac{\boldsymbol{x}_{t}}{\sigma^{2}}-\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x}_{t};\boldsymbol{\theta})\right)+\sqrt{2}\,\xi_{t}\\ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})]-\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\mathbb{P}_{r}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})],\end{cases} \tag{9}$$

where $-\boldsymbol{x}_{t}/\sigma^{2}$ is the gradient of the log-density of the initial Gaussian noise $q$. A close look at the objective function reveals an adversarial mechanism in the updates of $\boldsymbol{x}_{t}$ and $\boldsymbol{\theta}_{t}$. Whether we fix $\boldsymbol{\theta}$ and update $\boldsymbol{x}$, or fix $\boldsymbol{x}$ and update $\boldsymbol{\theta}$, the update direction is not guaranteed to lead to the minimizer of $\mathcal{K}(P(\boldsymbol{x};\theta)\mid\mathbb{P}_{r})$.
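The alternating updates of Eq. 9 can be sketched for a toy energy $\Phi(\boldsymbol{x};\boldsymbol{\theta})=\sum_{k}\theta_{k}h(\langle\boldsymbol{x},\boldsymbol{w}_{k}\rangle+b_{k})$ with dense filters. This is an illustrative sketch rather than the authors' code: the step sizes `step_x` and `step_theta` are hypothetical additions (Eq. 9 corresponds to unit steps), the noise is assumed standard Gaussian, and the sign of the $\boldsymbol{\theta}$ update follows Eq. 9 as written.

```python
import numpy as np

def relu_responses(X, W, b):
    """h(<x, w_k> + b_k) for a batch X of shape (n, d); returns (n, K)."""
    return np.maximum(0.0, X @ W.T + b)

def grad_x_phi(X, W, b, theta):
    """Gradient of Phi(x; theta) w.r.t. x for each row of X; returns (n, d)."""
    active = (X @ W.T + b > 0).astype(float)  # (n, K) ReLU mask
    return (active * theta) @ W

def frame_step(X, theta, X_real, W, b, sigma, step_x, step_theta, rng):
    """One alternating update in the spirit of Eq. 9 (hypothetical step sizes)."""
    noise = rng.normal(size=X.shape)
    drift = -(X / sigma ** 2 - grad_x_phi(X, W, b, theta))
    X_new = X + step_x * drift + np.sqrt(2.0 * step_x) * noise
    # Phi is linear in theta, so grad_theta E[Phi] is the mean filter response.
    grad_theta = (relu_responses(X, W, b).mean(axis=0)
                  - relu_responses(X_real, W, b).mean(axis=0))
    return X_new, theta + step_theta * grad_theta

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), np.zeros(4)
X_real = rng.normal(size=(9, 3))
X, theta = rng.normal(size=(9, 3)), np.zeros(4)
for _ in range(10):
    X, theta = frame_step(X, theta, X_real, W, b, sigma=1.0,
                          step_x=0.01, step_theta=0.1, rng=rng)
```

Setting `step_x = 1` recovers the unit-step form of the first line of Eq. 9; a smaller step is used here only to keep the toy run numerically tame.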

Discrete Flow Driven by Wasserstein Metric

Although the KL approach is relatively rational in the methodology of FRAME, it carries the risk of the KL-vanishing problem discussed above, since the parameter updating mechanism of MLE may fail to converge. To avoid this problem, we introduce the Wasserstein metric into the discrete flow, following the statement of (?) that, given the empirical measure $\rho_{t}$, $P_{\theta}$ can be close to it in the KL sense yet far from it in the Wasserstein distance. (?) also claims that better convergence and approximation results can be obtained because the Wasserstein metric defines a weaker topology. The conclusion that $\mathcal{I}_{t}\approx\mathcal{J}_{t}$ when the time step size $\tau\to 0$ justifies the proposed method. This conclusion was proved in the one-dimensional case in (?) and in higher dimensions by (?; ?). Below we first provide some background on the transformation and then briefly show the derivation.
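The remark that the Wasserstein metric defines a weaker topology can be illustrated with a toy case that is not taken from the paper: two one-dimensional Gaussians with equal standard deviation $s$ and means one unit apart. The closed forms below are the standard ones for Gaussians; the function names are hypothetical.

```python
import numpy as np

def kl_gaussians(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form (1-D)."""
    return np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

def w2_gaussians(m1, s1, m2, s2):
    """2-Wasserstein distance between the same two 1-D Gaussians."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# As the two densities become more concentrated, their overlap shrinks:
# the KL divergence grows like 1 / (2 s^2) while W2 stays equal to 1.
table = [(s, kl_gaussians(0.0, s, 1.0, s), w2_gaussians(0.0, s, 1.0, s))
         for s in (1.0, 0.1, 0.01)]
```

In this example the KL divergence becomes arbitrarily large as the supports separate, whereas the Wasserstein distance keeps reporting the geometric gap between the two means.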

Fokker-Planck Equation.

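Under the influence of drift and random diffusion, this equation describes the evolution of the probability density function of the particle velocity. Let $F$ be an integral functional and let $\delta F/\delta\rho$ denote its Euler-Lagrange first variation; the equations are

$$\begin{cases}\partial_{t}\rho+\operatorname{div}(\rho\nu)=0&\text{(Continuity equation)}\\ \nu=-\nabla\frac{\delta F}{\delta\rho}&\text{(Variational condition)}\\ \rho(\cdot,0)=\rho_{0}&\rho_{0}\in L^{1}(\mathbb{R}^{d}),\ \rho_{0}\geq 0.\end{cases} \tag{10}$$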

Wasserstein Metric.

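The Benamou-Brenier form of this metric (?) of order $r$ involves solving a smooth OT problem over any probabilities $\mu_{1}$ and $\mu_{2}$ in $\mathcal{P}_{r}$ using the continuity equation shown in Eq. 10, as follows, where $\nu$ belongs to the tangent space of the manifold governed by some potential and associated with the curve $\rho_{t}$.

$$\mathcal{W}^{r}(\mu_{1},\mu_{2}):=\min_{\rho_{t}\in\mathcal{P}_{r}}\left\{\int_{0}^{1}\int_{\mathbb{R}^{d}}|\nu_{t}|^{r}\,d\rho_{t}\,dt\ :\ \partial_{t}\rho_{t}+\operatorname{div}(\rho_{t}\cdot\nu_{t})=0,\ \rho_{0}=\mu_{1},\ \rho_{1}=\mu_{2}\right\}. \tag{11}$$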

JKO Discrete Flow.

Following the initial work (?), which shows how to recover the Fokker-Planck diffusion of distributions in Eq. 10 by minimizing entropy functionals with respect to the Wasserstein metric $\mathcal{W}^{2}$, our method applies the JKO discrete flow to replace the initial KL divergence with the entropic Wasserstein distance $\mathcal{W}^{2}-H(\rho)$. The flow functional is

$$\mathcal{J}_{t}=\frac{1}{2}\mathcal{W}^{2}(\rho,\rho_{t})+\int\log\rho\,d\rho+\int\Phi\,d\rho. \tag{12}$$

Remark 1.

The initial Gaussian term $q$ is left out to simplify the derivation; otherwise, the entropy $-H(\rho)=\int\log\rho d\rho$ in Eq. 12 should be written as the relative entropy $\mathcal{K}(\rho\mid q)$.

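By Theorem 1, $\mathcal{J}_{t}$ can be calculated in place of $\mathcal{I}_{t}$ as an approximation, and its steady state approaches Eq. 3. Using $\mathcal{J}_{t}$ as the dissipation mechanism in place of $\mathcal{I}_{t}$ allows us to regard the diffusion in Eq. 4 as the steepest descent of the clique energy $\Phi$ and the entropy $-H(P)$ w.r.t. the Wasserstein metric. Solving this optimization problem using $\mathcal{W}$ is equivalent to solving the Monge-Kantorovich mass transference problem.

By the second mean value theorem for definite integrals, we can approximately recover the integral $\mathcal{W}^{2}$ with two randomly interpolated rectangles

$$\begin{aligned}\mathcal{W}^{2}(\rho_{t_{0}},\rho_{t_{1}})&:=\inf_{\rho_{t}}\int_{t_{0}}^{t_{1}}\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t}\,dt\\ &\approx(\zeta-t_{0})\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{0}}+(t_{1}-\zeta)\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{1}}\\ &=-\beta\Big((1-\gamma)\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{0}}+\gamma\int_{\mathbb{R}^{d}}|\nabla\Phi|^{2}\,d\rho_{t_{1}}\Big).\end{aligned} \tag{13}$$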

where $\beta=t_{1}-t_{0}$ parameterizes the time piece and $\gamma=\zeta/\beta\ (0\leq\gamma\leq 1)$ is a random interpolation parameter, since $\zeta$ is random. With Eq. 13, the functional derivative of $\mathcal{W}^{2}(\rho_{t_{0}},\rho_{t_{1}})$ w.r.t. $\rho_{t_{1}}$ is then proportional to

$$\frac{\delta\mathcal{W}^{2}(\rho_{t_{0}},\rho_{t_{1}})}{\delta\rho_{t_{1}}}\propto|\nabla\Phi|^{2}, \tag{14}$$

which is exactly the result of Proposition 8.5.6 in (?). Assume that $\Phi$ is at least twice differentiable and treat Eq. 14 as the variational condition in Eq. 10. Plugging Eq. 14 into the continuity equation of Eq. 10 then yields a modified Wasserstein gradient flow in Fokker-Planck form:

$$\partial_{t}\rho=\Delta\rho-\operatorname{div}\left(\rho\left(\nabla\Phi-\nabla|\nabla\Phi(\boldsymbol{x})|^{2}\right)\right). \tag{15}$$

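Then the corresponding SDE can be written in Euler-Maruyama form as

$$\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}+\nabla\Phi(\boldsymbol{x}_{t})-\nabla|\nabla\Phi(\boldsymbol{x}_{t})|^{2}+\sqrt{2}\,\xi_{t}. \tag{16}$$

By Remark 1, if we reconsider the initial Gaussian term, the term $-\boldsymbol{x}_{t}/\sigma^{2}$ should be added to the discrete flow of $\boldsymbol{x}_{t+1}$ in Eq. 16.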

Remark 2.

If $\Phi$ is the energy function defined in Eq. 3, then $\nabla|\nabla\Phi(\boldsymbol{x})|^{2}=0$ .

This is a direct result: since $\Phi(\boldsymbol{x},\boldsymbol{\theta})$ as defined in FRAME involves only inner products, ReLU (piecewise linear) and other linear operations, its second derivative is $0$ almost everywhere. Therefore, the time evolution of the density $\rho_{t}$ in Eq. 15 and of the sample $\boldsymbol{x}_{t}$ in Eq. 16 degenerate to Eq. 10 and Eq. 8, respectively. Thus the SDE of $\boldsymbol{x}_{t}$ keeps its default Langevin form, while the gradient of the model parameter $\boldsymbol{\theta}_{t}$ does not degenerate.
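Remark 2 can be checked numerically on a toy ReLU energy. The sketch below is illustrative and assumes dense filters with hand-picked values (all names and numbers are hypothetical): it differentiates $|\nabla_{\boldsymbol{x}}\Phi|^{2}$ by central finite differences at a point away from any ReLU kink.

```python
import numpy as np

def grad_x_phi(x, W, b, theta):
    """Gradient of Phi(x) = sum_k theta_k relu(<x, w_k> + b_k) w.r.t. x."""
    active = (W @ x + b > 0).astype(float)  # ReLU mask, constant within a region
    return (active * theta) @ W

def sq_grad_norm(x, W, b, theta):
    """|grad_x Phi(x)|^2."""
    g = grad_x_phi(x, W, b, theta)
    return float(g @ g)

def finite_diff_grad(f, x, h=1e-4):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

W = np.eye(2)
b = np.array([0.5, -0.5])
theta = np.array([1.0, 2.0])
x = np.array([1.0, 1.0])  # pre-activations 1.5 and 0.5, both away from 0
print(finite_diff_grad(lambda z: sq_grad_norm(z, W, b, theta), x))  # prints [0. 0.]
```

Inside one activation region the mask is fixed, so $\nabla_{\boldsymbol{x}}\Phi$ is constant in $\boldsymbol{x}$ and its squared norm has zero gradient; the same quantity still depends on $\boldsymbol{\theta}$, which is why the parameter update does not degenerate.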

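Similar to the parameterized KL flow $\mathcal{I}_{t}^{l}$ defined in Eq. 7, we propose an analogous form in the JKO manner. With Eq. 13 and Eq. 14, the final optimization objective $\mathcal{J}_{t}^{l}$ can be formulated as

$$\begin{aligned}\mathcal{J}_{t}^{l}=\min_{\rho}\max_{\boldsymbol{\theta}}\Big\{&-\frac{\beta}{2}\left(1-\gamma\right)\int_{\mathbb{R}^{d}}|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\,d\rho_{t}-\frac{\beta}{2}\gamma\int_{\mathbb{R}^{d}}|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\,d\rho+\int\log\rho\,d\rho\\ &+\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\rho-\int\Phi(\boldsymbol{x};\boldsymbol{\theta})\,d\mathbb{P}_{r}\Big\}.\end{aligned} \tag{17}$$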

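With all of the above, the learning process of wFRAME can be constructed by ascending the gradient w.r.t. $\boldsymbol{\theta}$, i.e. $\nabla_{\boldsymbol{\theta}}\mathcal{J}_{t}^{l}$. The update steps are summarized in Eq. 18.

$$\begin{cases}\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\left(\frac{\boldsymbol{x}_{t}}{\sigma^{2}}-\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x}_{t};\boldsymbol{\theta})\right)+\sqrt{2}\,\xi_{t}\\ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})]-\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\mathbb{P}_{r}}[\Phi(\boldsymbol{x};\boldsymbol{\theta})]-\frac{\beta}{2}(1-\gamma)\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t-1}}\left[|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\right]-\frac{\beta}{2}\gamma\nabla_{\boldsymbol{\theta}}\mathbb{E}_{\rho_{t}}\left[|\nabla_{\boldsymbol{x}}\Phi(\boldsymbol{x};\boldsymbol{\theta})|^{2}\right].\end{cases} \tag{18}$$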

The equation above indicates that, in the Wasserstein manner, the gradient of $\boldsymbol{\theta}$ is augmented with soft gradient-norm constraints from the last two iterations. This gradient norm has the following advantages over the original iteration process (Eq. 9).

First, the norm serves as the constant-speed geodesic connecting $\rho_{t}$ with $\rho_{t+1}$ in the manifold spanned by $P_{\boldsymbol{\theta}}$ and $\mathbb{P}_{r}$, which may speed up convergence. Next, it can be interpreted as a soft counter-force against the original gradient that prevents the learning process from vanishing. Moreover, in experiments we find that it preserves the inner structural information of the data. The new learning and generation process of wFRAME is summarized in detail in Algorithm 1.
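The $\boldsymbol{\theta}$ update of Eq. 18 can be sketched for the toy dense-filter energy $\Phi(\boldsymbol{x};\boldsymbol{\theta})=\sum_{k}\theta_{k}h(\langle\boldsymbol{x},\boldsymbol{w}_{k}\rangle+b_{k})$, for which the gradient-norm penalty has a closed form. This is an illustration under assumptions, not Algorithm 1: the function names and the step size `step_theta` are hypothetical, and `gamma` is passed in rather than sampled.

```python
import numpy as np

def relu_responses(X, W, b):
    """h(<x, w_k> + b_k) for a batch X of shape (n, d); returns (n, K)."""
    return np.maximum(0.0, X @ W.T + b)

def grad_theta_sq_grad_norm(X, W, b, theta):
    """grad_theta of E_X[ |grad_x Phi(x; theta)|^2 ] for the toy energy.

    With mask m(x) in {0,1}^K, grad_x Phi = sum_k theta_k m_k w_k, hence
    |grad_x Phi|^2 = theta^T ((m m^T) * (W W^T)) theta, whose gradient in
    theta is 2 ((m m^T) * (W W^T)) theta; the batch mean is taken over m m^T.
    """
    active = (X @ W.T + b > 0).astype(float)     # (n, K)
    co_active = active.T @ active / X.shape[0]   # (K, K) mean of m m^T
    return 2.0 * (co_active * (W @ W.T)) @ theta

def wframe_theta_step(theta, X_prev, X_curr, X_real, W, b, beta, gamma, step_theta):
    """theta update following Eq. 18: moment gap plus two gradient-norm terms."""
    grad = (relu_responses(X_curr, W, b).mean(axis=0)
            - relu_responses(X_real, W, b).mean(axis=0)
            - 0.5 * beta * (1.0 - gamma) * grad_theta_sq_grad_norm(X_prev, W, b, theta)
            - 0.5 * beta * gamma * grad_theta_sq_grad_norm(X_curr, W, b, theta))
    return theta + step_theta * grad

W, b = np.eye(2), np.zeros(2)
theta = np.array([1.0, 2.0])
X = np.array([[1.0, 1.0]])
new_theta = wframe_theta_step(theta, X, X, X, W, b, beta=1.0, gamma=0.5, step_theta=1.0)
```

In this toy call the moment gap is zero, so only the penalty acts and shrinks $\boldsymbol{\theta}$ toward a smaller input-gradient norm, which matches the reading of the extra terms as a soft counter-force.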

Experiments

In this section, we compare the proposed method with FRAME in depth from two aspects: a confirmatory experiment on model collapse under varied settings with respect to the baseline, and a quantitative and qualitative comparison of generated results on widely used datasets. In the first stage, as expected, the proposed wFRAME proves more robust in training, and the synthesized images are of higher quality and fidelity in most circumstances. In the second stage, we evaluate both models on the whole datasets. We propose a new metric, the response distance, which measures the gap between the generated data distribution and the real data distribution.
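A response distance of this kind can be sketched as the gap between mean filter responses of two sample sets, averaged over the $K$ filters. The snippet below is a hedged illustration, not the paper's evaluation code: the function name is hypothetical, the responses $F_{k}$ are assumed to be precomputed, and taking the absolute value per filter is an assumption made here so that the result behaves like a distance.

```python
import numpy as np

def response_distance(F_x, F_y, absolute=True):
    """Average over K filters of the gap between mean filter responses.

    F_x: (N, K) responses F_k(x^i) for one set of images.
    F_y: (M, K) responses F_k(y^i) for the other set.
    absolute=True takes |gap| per filter before averaging; this is an
    assumption of the sketch, not something stated in the text.
    """
    gap = F_x.mean(axis=0) - F_y.mean(axis=0)
    if absolute:
        gap = np.abs(gap)
    return float(gap.mean())

F_x = np.array([[1.0, 2.0], [3.0, 4.0]])  # per-filter means: [2, 3]
F_y = np.array([[0.0, 5.0]])              # per-filter means: [0, 5]
r_abs = response_distance(F_x, F_y)
r_signed = response_distance(F_x, F_y, absolute=False)
```

The toy values show why the choice matters: the signed gaps cancel to zero across filters, while the absolute version reports a nonzero mismatch.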

Confirmation of Model Collapse

We observe that under some circumstances FRAME suffers serious model collapse. Due to MEP, the expected well-learned FRAME model $P_{\boldsymbol{\theta}}^{*}$ should achieve the minimum $\mathcal{K}(P_{\boldsymbol{\theta}}^{*}\mid q)$ , i.e. the minimum amount of transformation of the reference measure. But such minimization of the KL divergence may unpredictably drive the energy to $0$ , namely the learned model degenerates to producing the initial noise instead of the desired minimum modification. Furthermore, when $\Phi(\boldsymbol{x},\boldsymbol{\theta})\leq 0$ , the learned model tends to degenerate. In other words, the images synthesized from FRAME driven by KL divergence collapse immediately and their quality barely recovers. Consequently, the best curve of $\Phi$ is slowly asymptotic to, and slightly above, $0$ .

To demonstrate the advantage of our method over FRAME under the baseline settings, we conduct validation experiments on a subset of the SUN dataset (?) under different circumstances. Intuitively, a simple remedy for the model collapse issue is to restrict $\boldsymbol{\theta}$ to a safe range, a.k.a. weight clipping. The experimental settings include altering $\lambda$ and $\delta$ to an unsafe range, turning the weight clipping on or off, and varying the input dimensions. The results are presented in Fig. 3, which shows that our method generates more robustly than either the original strategy or FRAME with the weight clipping trick.
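The weight clipping trick mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the parameters are held in a NumPy array; the function name `clip_weights` and the bound `0.01` are hypothetical and are not taken from the paper.

```python
import numpy as np

def clip_weights(theta, bound=0.01):
    """Restrict every entry of theta to [-bound, bound].

    `bound` is a hypothetical value; the safe range used in the
    experiments is not stated in this section.
    """
    return np.clip(theta, -bound, bound)

# Example: entries outside the range are moved to its boundary.
theta = np.array([-0.5, -0.005, 0.0, 0.003, 0.2])
theta_clipped = clip_weights(theta)
```

Clipping would be applied after each parameter update, so it bounds the parameters directly, whereas the gradient-norm term in wFRAME only discourages large changes between iterations.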

Empirical Setup on Common Datasets

We apply wFRAME to several widely used datasets in the field of generative modeling. As default experimental settings, $\sigma=0.01$ , $\beta=60$ , the number of learning iterations is set to $T=100$ , the number of Langevin sampling steps $L$ within each learning iteration is $50$ , and the batch size is $N=M=9$ . The implementation of $\Phi(x)$ in our method is the first 4 convolutional layers of a pre-trained VGG-16 (?). The input shape varies by dataset and is specified below. The hyper-parameters appearing in Algorithm 1 differ on each dataset in order to achieve the best results. For FRAME we use the default settings in (?).

CelebA (?) and LSUN-Bedroom (?) images are cropped and resized to $64\times 64$ . We set $\lambda=10^{-3}$ for both datasets, $\delta=0.2$ for CelebA and $\delta=0.15$ for LSUN-Bedroom. Visualizations of the two methods are shown in Fig. 2.

CIFAR-10 (?) includes various categories, and we train both algorithms conditioned on the class label. In this experiment, we set $\delta=0.15$ and $\lambda=2\times 10^{-3}$ , and the image size is $32\times 32$ . The results show a clear improvement both numerically and visually (Fig. 4, Fig. 5 and Table 1).
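To make the inner sampling loop concrete, the following is a minimal sketch of generic Langevin dynamics with 50 steps and 9 parallel chains, matching the step count and batch size quoted above. Everything else is an assumption for illustration: the target is a toy standard Gaussian rather than the VGG-based energy, and the names `langevin_sample`, `toy_grad_log_p` and `step_size` are hypothetical.

```python
import numpy as np

def langevin_sample(x0, grad_log_p, step_size, n_steps, rng):
    """Run n_steps of Langevin dynamics from x0.

    Update rule: x <- x + (step_size^2 / 2) * grad_log_p(x) + step_size * noise.
    """
    x = np.array(x0, dtype=np.float64)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size**2 * grad_log_p(x) + step_size * noise
    return x

def toy_grad_log_p(x):
    # Score of a standard Gaussian; a stand-in for the model's energy gradient.
    return -x

rng = np.random.default_rng(0)
x_init = rng.standard_normal((9, 2))  # 9 chains, toy 2-dimensional state
samples = langevin_sample(x_init, toy_grad_log_p, step_size=0.2, n_steps=50, rng=rng)
```

In a full learning loop this sampler would be called once per learning iteration, with the gradient of the current model's log-density in place of `toy_grad_log_p`.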

For a fair comparison, two metrics are used to evaluate FRAME and wFRAME. We offer a new metric, response distance, to measure the disparity between two distributions based on the drawn samples, while the Inception score is a widely used standard for measuring sample diversity.

Response distance $R$ is defined as

[TABLE]

where $F_{k}$ denotes the $k$-th filter. The smaller $R$ is, the better the generated results are, since $R\propto\max_{\boldsymbol{\theta}}\mathbb{E}_{r}[F(\boldsymbol{y}^{i})]-\mathbb{E}_{P_{\boldsymbol{\theta}}}[F(\boldsymbol{x}^{i})]$ , which implies that $R$ provides an approximation of the divergence between the target data distribution and the generated data distribution. Furthermore, by Eq. 2, the faster $R$ falls, the better $\boldsymbol{\theta}$ converges.
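A filter-response gap of this kind can be sketched as follows. This is not the paper's exact definition of $R$: it assumes a mean absolute difference between the average filter responses of real and generated samples, in the spirit of the $L_{1}$ comparison of filter banks described in this section. The function name `response_distance` and the array layout are hypothetical.

```python
import numpy as np

def response_distance(real_responses, fake_responses):
    """Mean absolute gap between average filter responses.

    real_responses: array of shape (N, K), responses of K filters on N real samples.
    fake_responses: array of shape (M, K), responses on M generated samples.
    Assumed form only; the normalization in the paper may differ.
    """
    real_mean = np.asarray(real_responses, dtype=np.float64).mean(axis=0)
    fake_mean = np.asarray(fake_responses, dtype=np.float64).mean(axis=0)
    return float(np.abs(real_mean - fake_mean).mean())
```

Under this assumed form the value is zero when the two sample sets have identical mean responses, and it grows as the generated statistics drift away from the real ones.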

The Inception score (IS) is the most widely adopted metric for generative models; it estimates the diversity of the generated samples. It uses an Inception v2 network (?) pre-trained on ImageNet (?) to capture the classifiable properties of samples. The metric has the drawbacks of neglecting the visual quality of the generated results and of favoring models that generate objects rather than realistic scene images, but it still provides essential diversity information about synthesized samples when evaluating generative models.
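For reference, the Inception score is commonly computed as $\exp\big(\mathbb{E}_{\boldsymbol{x}}[\mathrm{KL}(p(y\mid\boldsymbol{x})\,\|\,p(y))]\big)$ from the classifier's predicted class probabilities. The sketch below implements that formula on an array of probabilities; it assumes the predictions are already available and does not run an Inception network. The name `inception_score` and the `eps` smoothing constant are illustrative choices.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from predicted class probabilities.

    probs: array of shape (n_samples, n_classes), each row summing to 1.
    eps guards the logarithm against zero probabilities.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution over all samples.
    marginal = probs.mean(axis=0, keepdims=True)
    # Per-sample KL divergence between p(y|x) and the marginal p(y).
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The score ranges from 1, when every sample receives the same prediction, up to the number of classes, when predictions are confident and evenly spread over the classes.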

Comparison with GANs

We compare FRAME and wFRAME with GAN models on CIFAR-10 via the Inception score in Table 1. Most GAN-family models achieve high scores; however, our method is a descriptive model rather than an implicit model. GANs with high scores perform poorly in descriptive situations, for example the image reconstruction task or training on a small amount of data, whereas FRAME can handle most of these situations properly. The performance of DCGAN when modeling only a few images is presented in Fig. 6, where, for a fair comparison, we duplicate the input images to a total of 10000 to fit the training environment of DCGAN. The compared wFRAME is trained with our own method. DCGAN's training is stopped once it converges, yet its results remain collapsed.

Comparison of FRAME and wFRAME

We analyze FRAME and wFRAME from two aspects as a summary of the experiments conducted above. As expected, our algorithm is more suitable for synthesizing complex and varied scene images, and the resulting images are visibly more authentic than those of FRAME.

Quality of Generation Improvement.

According to the response distance $R$ , the quality of image synthesis is improved. This measurement tracks the iterative learning process of both FRAME and wFRAME. The learning curves presented in Fig. 4 are observations of synthesis over the whole datasets. From the curves we can conclude that wFRAME converges better than FRAME. The generation results on CelebA, LSUN-Bedroom and CIFAR-10 in Fig. 2 and Fig. 5 show that even if the training images are relatively aligned with conspicuous structural information, or carry only simple categorical context information, the images produced by FRAME are still full of motley noise and twisted texture, while ours are more reasonably mixed, more sensibly structured and bright-colored, with less distortion.

Training Steadiness Improvement.

Compared with FRAME, as shown in Fig. 1, which illustrates the typical evolution of generated samples, we find an improvement in training steadiness. The generated images are almost identical at the beginning; however, images produced by our algorithm get back on track after 30 iterations while FRAME’s deteriorate. Quantitatively, the curves in Fig. 4 are calculated by averaging across the whole dataset. wFRAME reaches a lower response distance, namely the direct $L_{1}$ critic of filter banks between synthesized samples and target samples is smaller and decreases more steadily. More specifically, our algorithm largely solves the model collapse problem of FRAME, because it not only ensures the closeness between the generated samples and the “ground-truth” samples but also stabilizes the learning phase of the model parameter $\boldsymbol{\theta}$ . The three plots clearly show that the quantitative measures are well correlated with the qualitative visualizations of generated samples. In the absence of collapse, we attain comparable or even better results than FRAME.

Conclusion

In this paper, we re-derive and track the origin of FRAME from the viewpoint of particle evolution and discover the potential factor that may lead to the deterioration of sample generation and the instability of model training, i.e., the inherent vanishing problem in the minimization of KL divergence. Based on this discovery, we propose wFRAME by reformulating the KL discrete flow in FRAME into the JKO scheme, and show through empirical examination that it can overcome the above-mentioned deficiencies. Experiments demonstrate the superiority of the proposed wFRAME model, and the comparative results show that it can greatly ameliorate the vanishing issue of FRAME and produce more visually promising results.

Appendix A Proofs

Proof of Lemma 1

From another perspective, under the Gaussian reference measure, FRAME built on a modern ConvNet has the piecewise Gaussian property, which is summarized in Proposition 1.

Proposition 1.

(Reformulation of Theorem 1.) Equation 3 is piecewise Gaussian, on each piece the probability density can be written as:

[TABLE]

where $\boldsymbol{y}=\mathbf{B}_{\boldsymbol{\theta},\delta}=\sum_{k=1}^{K}\delta_{k}\boldsymbol{\theta}_{k}$ is an approximate reconstruction of $\boldsymbol{x}$ in one piece of data space by a linear transformation involving inner products with the model parameters and a piecewise linear activation function (ReLU). This proposition implies that different pieces can be regarded as different generating samples acting as Brownian particles.

By Proposition 1, each particle (image piece) in FRAME has a transition kernel in Gaussian form (Eq. 19). It describes the probability of a particle moving from $\boldsymbol{x}\in\mathbb{R}^{d}$ to $\boldsymbol{y}\in\mathbb{R}^{d}$ in time $\tau>0$ . Let a fixed measure $\rho_{0}$ , such as a Gaussian, be the initial measure of $n$ Brownian particles $\{\boldsymbol{x}_{0}^{i}\}_{i=1}^{n}$ at time $0$ . Sanov's theorem shows that the empirical measure $\rho_{\tau}=\frac{1}{n}\sum_{i=1}^{n}\delta_{\boldsymbol{x}^{i}_{\tau}}$ of such transition particles satisfies an LDP with rate functional $\mathcal{K}(\rho_{\tau}\mid\rho_{0})$ , i.e.,

[TABLE]

Specifically, each Brownian particle has $|\mathcal{X}|$ internal sub-particles, which are independent in different cliques. Let $C$ denote the number of cliques. Cramér’s theorem tells us that for i.i.d. random variables $\boldsymbol{x}^{i}(j)$ with common generating function $\mathbb{E}[e^{\Phi(\boldsymbol{\theta})}]$ , the empirical mean $\frac{1}{C}\sum_{j=1}^{C}\boldsymbol{x}^{i}(j)$ satisfies an LDP with rate functional given by the Legendre transformation of $\mathbb{E}[e^{\Phi(\boldsymbol{\theta})}]$ ,

[TABLE]
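As a small numerical illustration of the Legendre transformation in Cramér's theorem, the sketch below evaluates a rate function on a grid. It uses the standard textbook form, in which the rate function is the Legendre transform of the logarithm of the moment generating function, and a toy standard Gaussian variable whose rate function is known to be $x^{2}/2$ . The names `legendre_transform` and `gaussian_log_mgf` and the grid are assumptions for illustration and are unrelated to the FRAME energy.

```python
import numpy as np

def legendre_transform(log_mgf, x, lambdas):
    """Approximate sup over lambda of (lambda * x - log_mgf(lambda)) on a grid."""
    return float(np.max(lambdas * x - log_mgf(lambdas)))

def gaussian_log_mgf(lam):
    # log E[exp(lam * X)] for X ~ N(0, 1).
    return 0.5 * lam**2

lambdas = np.linspace(-10.0, 10.0, 20001)  # grid spacing 0.001
rate_at_1p5 = legendre_transform(gaussian_log_mgf, 1.5, lambdas)
# For the standard Gaussian the exact rate function is x^2 / 2,
# so the value at x = 1.5 should be close to 1.125.
```

The same routine applies to any log-moment generating function that can be evaluated on the grid, provided the grid covers the maximizing value of `lambda`.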

Since the empirical measure of $\boldsymbol{x}_{\tau}^{i}$ is simply the empirical mean of the Dirac measure, i.e., $\delta_{\boldsymbol{x}^{i}_{\tau}}\frac{1}{C}\sum_{j=1}^{C}\boldsymbol{x}_{\tau}^{i}(j)$ , the empirical measure over all particles becomes

[TABLE]

where the exponent is exactly the KL discrete flow $\mathcal{I}$ . Thus the empirical measure of the activation patterns of all those particles satisfies LDP with rate functional $\mathcal{I}_{t}$ in discrete time. ∎

Proof of Theorem 1

Necessity can be established using MEP by calculating $\partial_{\rho}\mathcal{I}_{t}^{l}$ iteratively:

[TABLE]

Sufficiency: Recalling the Markov property in Eq. 1, we can write the inner product as a sum of feature responses w.r.t. different cliques; the shared pattern activation $h=\max(0,\boldsymbol{x}_{i}^{\tau})$ can then be approximated by the Dirac measure as

[TABLE]

The result coincides with the empirical measure of $\boldsymbol{x}_{i}^{\tau}$ , so the proof of sufficiency reduces to that of Lemma 1, which was given above. ∎

Appendix B More Visual Results

Bibliography

[Adams et al. 2011] Adams, S.; Dirr, N.; Peletier, M. A.; and Zimmer, J. 2011. From a large-deviations principle to the Wasserstein gradient flow: a new micro-macro passage. Communications in Mathematical Physics 307(3):791–815.
[Ambrosio, Gigli, and Savaré 2008] Ambrosio, L.; Gigli, N.; and Savaré, G. 2008. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media.
[Arjovsky, Chintala, and Bottou 2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223.
[Benamou and Brenier 2000] Benamou, J.-D., and Brenier, Y. 2000. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik 84(3):375–393.
[Dai, Lu, and Wu 2014] Dai, J.; Lu, Y.; and Wu, Y.-N. 2014. Generative modeling of convolutional neural networks. arXiv preprint arXiv:1412.6296.
[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009. IEEE Conference on, 248–255. IEEE.
[Duong, Laschos, and Renger 2013] Duong, M. H.; Laschos, V.; and Renger, M. 2013. Wasserstein gradient flows from large deviations of many-particle limits. ESAIM: Control, Optimisation and Calculus of Variations 19(4):1166–1188.
[Erbar et al. 2015] Erbar, M.; Maas, J.; Renger, M.; et al. 2015. From large deviations to Wasserstein gradient flows in multiple dimensions. Electronic Communications in Probability 20.
