Improved Training of Mixture-of-Experts Language GANs

Yekun Chai; Qiyue Yin; Junge Zhang

arXiv:2302.11875·cs.CL·February 24, 2023

TL;DR

This paper introduces an improved training method for language GANs using a mixture-of-experts model and feature statistics alignment, leading to better text generation quality.

Contribution

It empirically demonstrates that mixture-of-experts enhances generator capacity and employs feature statistics alignment for more effective training in language GANs.

Findings

1. Enhanced representation capacity with mixture-of-experts

2. FSA improves training signals for the generator

3. Superior performance on benchmarks

Tables

Table 1: Summary of experimental datasets.

dataset	vocabulary size	sequence length	training set	test set
synthetic data	5,000	20 / 40	10,000	10,000
MS COCO	4,657	37	10,000	10,000
EMNLP2017 WMT News	5,255	51	278,586	10,000

Table 2: The NLL$_{\textrm{oracle}}$ performance of different models on the synthetic dataset with sequence lengths of 20 and 40, where $\tau=1$ for all model settings and $N_{g}=2/3$ for models with sequence lengths of 20 / 40, respectively. For NLL, lower is better. All results of baseline models are obtained from the original papers.

Length	MLE	SeqGAN	RankGAN	LeakGAN	RelGAN	SAL	CCL	Ours	Real
20	9.038	8.736	8.247	7.038	6.680 $\pm$ 0.343	7.71 $\pm$ 0.17	6.77 $\pm$ 0.34	5.835 $\pm$ 0.353	5.750
40	10.411	10.310	9.958	7.191	6.765 $\pm$ 0.026	9.31 $\pm$ 0.03	6.65 $\pm$ 0.14	5.713 $\pm$ 0.289	4.071

Table 3: The BLEU and NLL$_{\textrm{gen}}$ performance on real data, in which we adopt $\tau=0.01$ and $N_{g}=2$. For all BLEU scores, higher is better. For NLL, lower is better.

Model	MS COCO Image Captions					EMNLP2017 WMT News
Model	BLEU-2	BLEU-3	BLEU-4	BLEU-5	NLL_gen	BLEU-2	BLEU-3	BLEU-4	BLEU-5	NLL_gen
MLE	0.731	0.497	0.305	0.189	0.718	0.768	0.473	0.240	0.126	2.382
SeqGAN	0.745	0.498	0.294	0.180	1.082	0.777	0.491	0.261	0.138	2.773
RankGAN	0.743	0.467	0.264	0.156	1.344	0.727	0.435	0.209	0.101	3.345
LeakGAN	0.746	0.528	0.355	0.230	0.679	0.826	0.645	0.437	0.272	2.356
RelGAN	0.849 $\pm$ 0.030	0.687 $\pm$ 0.047	0.502 $\pm$ 0.048	0.331 $\pm$ 0.044	0.756 $\pm$ 0.054	0.881 $\pm$ 0.013	0.705 $\pm$ 0.019	0.501 $\pm$ 0.023	0.319 $\pm$ 0.018	2.482 $\pm$ 0.031
SAL	0.785 $\pm$ 0.02	0.581 $\pm$ 0.03	0.362 $\pm$ 0.02	0.227 $\pm$ 0.02	0.873 $\pm$ 0.02	0.788 $\pm$ 0.02	0.523 $\pm$ 0.02	0.281 $\pm$ 0.02	0.149 $\pm$ 0.02	2.578 $\pm$ 0.04
CCL	0.871 $\pm$ 0.032	0.715 $\pm$ 0.050	0.538 $\pm$ 0.068	0.399 $\pm$ 0.082	0.630 $\pm$ 0.103	0.903 $\pm$ 0.016	0.749 $\pm$ 0.022	0.525 $\pm$ 0.017	0.324 $\pm$ 0.008	2.818 $\pm$ 0.499
Ours	0.963 $\pm$ 0.020	0.902 $\pm$ 0.059	0.814 $\pm$ 0.072	0.695 $\pm$ 0.076	0.639 $\pm$ 0.027	0.910 $\pm$ 0.015	0.769 $\pm$ 0.021	0.568 $\pm$ 0.026	0.374 $\pm$ 0.023	2.480 $\pm$ 0.025

Table 4: Mean and standard deviation of human evaluation scores for different models on the MS COCO Image Caption dataset.

Model	MLE	SeqGAN	RankGAN	LeakGAN
Human score	3.126 $\pm$ 0.388	3.056 $\pm$ 0.349	3.017 $\pm$ 0.386	3.005 $\pm$ 0.353
Model	MaliGAN	TextGAN	RelGAN	Ours
Human score	3.014 $\pm$ 0.385	1.967 $\pm$ 0.421	3.709 $\pm$ 0.441	4.077 $\pm$ 0.348

Table 5: Performance of different models on the synthetic dataset with sequence lengths of 20 and 40, respectively. For NLL scores, lower is better.

Model	NLL_oracle (20/40)	NLL_gen (20/40)	NLL_oracle + NLL_gen (20/40)
MLE	9.05 $\pm$ 0.03 / 9.84 $\pm$ 0.02	5.96 $\pm$ 0.02 / 6.55 $\pm$ 0.02	15.02 $\pm$ 0.03 / 16.39 $\pm$ 0.01
SeqGAN	8.63 $\pm$ 0.19 / 9.63 $\pm$ 0.04	6.61 $\pm$ 0.22 / 6.98 $\pm$ 0.08	15.00 $\pm$ 0.03 / 16.35 $\pm$ 0.02
RankGAN	8.42 $\pm$ 0.31 / 9.52 $\pm$ 0.11	7.14 $\pm$ 0.34 / 7.05 $\pm$ 0.12	15.01 $\pm$ 0.02 / 16.37 $\pm$ 0.02
MaliGAN	8.74 $\pm$ 0.16 / 9.67 $\pm$ 0.03	6.62 $\pm$ 0.25 / 7.14 $\pm$ 0.09	15.03 $\pm$ 0.03 / 16.39 $\pm$ 0.03
RelGAN	6.73 $\pm$ 0.54 / 6.68 $\pm$ 0.20	6.38 $\pm$ 0.70 / 7.17 $\pm$ 0.69	13.11 $\pm$ 0.55 / 13.85 $\pm$ 0.54
SAL	7.71 $\pm$ 0.17 / 9.31 $\pm$ 0.03	6.58 $\pm$ 0.15 / 6.97 $\pm$ 0.05	14.29 $\pm$ 0.11 / 16.24 $\pm$ 0.03
CCL	6.77 $\pm$ 0.34 / 6.65 $\pm$ 0.14	6.91 $\pm$ 0.62 / 7.68 $\pm$ 0.79	13.69 $\pm$ 0.36 / 14.33 $\pm$ 0.76
Ours	5.84 $\pm$ 0.35 / 5.71 $\pm$ 0.29	5.07 $\pm$ 0.78 / 7.46 $\pm$ 0.66	10.90 $\pm$ 0.59 / 10.17 $\pm$ 0.43

Table 6: The human evaluation scale from 1 to 5 with corresponding criteria and example sentences.

Scale	Criterion & Example
5 - Excellent	Grammatical, acceptable, and meaningful. For example, “a man is carving under yellow planes .”
4 - Good	Include 1 to 2 tiny grammatical errors, and the whole sentence is mostly acceptable and meaningful. For example, “two giraffe standing in front of them .”
3 - Fair	Include major grammatical errors, but the whole sentence is still acceptable and making sense. For example, “a kitchen with a grill roll from him .”
2 - Poor	Include severe grammatical errors. The whole sentence does not make sense, but some parts are still acceptable. For example, “a motorcycle on a paved road on the freeway .”
1 - Unacceptable	It is basically a string of words with random order and totally ungrammatical. The entire sentence does not make any sense. For example, “a city .”

Table 7: Samples of baseline models and real data on the MS COCO Image Caption dataset.

Samples

Real

a man flies a kite in a park .

a man riding a bike with a wooden trainer attached and a dog riding in it .

MLE

a man watches on his bike , in a lake on a field .

a women is standing behind an orange table in helmet on a child in the background .

SeqGAN

some people sitting on top of luggage near a truck .

a man sitting in a bath tub on tops .

TextGAN

a man riding a motorcycle .

a bathroom with a sink , and a table .

LeakGAN

a man standing next to her cell phone on a street sign .

a woman is holding a child in the air .

MaliGAN

a woman is standing and another oak cake on a drain .

a man standing in a kitchen with her laptop and two tables

RankGAN

a colorful bike is is down next to a large mirror .

a man is riding a bike down a track .

RelGAN

a woman walking with a dog in the city in front of a city bus .

a man sitting on a bed in a room with a chair on the couch .

SAL

a man on a motorcycle is flying on a grassy field .

a man stands in a green field .

Ours

a person is flying a kite on a beach next to the ocean .

a man is cooking in a kitchen with a white toilet in the background .

Table 8: Ten randomly sampled sentences from our model trained on MS COCO Image Captions.

a person is flying a kite on a beach next to the ocean .

a large jet sitting on a runway next to a landing strip .

a bathroom with a toilet , a sink , and towels hanging on a rack .

a man is sitting on a motorcycle with a woman on the back .

a man is cooking in a kitchen with a white toilet in the background .

a man is sitting a motorcycle on a dirt road .

an airplane is flying high in the sky .

a man and a woman on a motorcycle in front of a building .

a herd of sheep grazing in a pasture .

a cluttered room has large bed and a large clock .

Table 9: Ten randomly sampled sentences from our model trained on EMNLP2017 WMT News.

when it was announced on february 10 , it was delivered to ensure that everyone involved will cut operating costs .

but it was always clear that we could see what it is about and what ’ s going on in the next round .

even though economists have not yet been able to speak out after the election result , she insists .

“but it ’ s a natural question for ordinary people in the uk and across the uk . ” he said .

but it is still only six months until now , according to a new survey .

and it was a real surprise for us , and now i ’ m optimistic , happy in the next couple of weeks .

“but it is a signal that we are carrying out our own power in the wake of this period , ” he said .

russia is its only one in the top of the list since the arrest warrant launched an operation on its own behalf .

it is therefore no evidence that the person who committed to office is involved in the operation , but no one has to say any seriously wounded .

“but it was a very poor start for me and i ’ d imagine that ’ s what we ’ re doing and we ’ re looking at focusing on any given contract , ” he said .

Equations

$$\min_{\theta^{(G)}}\max_{\phi^{(D)}}\;\mathbb{E}_{x\sim p_{\textrm{data}}}\big[\log D(x;\phi^{(D)})\big]+\mathbb{E}_{z\sim p_{z}}\big[\log\big(1-D(G(z;\theta^{(G)});\phi^{(D)})\big)\big] \tag{1}$$

$$p(r^{t}\mid x^{<t};\theta_{t})=\frac{1}{N_{g}}\sum_{i=1}^{N_{g}}q(r_{i}^{t}\mid x_{i}^{<t};\theta_{i}) \tag{2}$$

$$\pi^{t}=\textrm{softmax}(W_{G}Y^{t}) \tag{3}$$

$$\mathbf{y^{t}}=\textrm{one\_hot}\big(\arg\max_{i}[g_{i}+\log\pi^{t}_{i}]\big) \tag{4}$$

$$\hat{y}^{t}_{i}=\frac{\exp\big((\log(\pi^{t}_{i})+g_{i})/\tau\big)}{\sum_{j=1}^{|V|}\exp\big((\log(\pi^{t}_{j})+g_{j})/\tau\big)} \tag{5}$$

$$\Delta\big(F(x_{r}),F(x_{f})\big)=\big\|\mathbb{E}_{x_{r}\sim p_{\textrm{data}}}\big[F(x_{r})\big]-\mathbb{E}_{z\sim p_{z}}\big[F(G(z;\theta^{(G)}))\big]\big\|_{2} \tag{6}$$

$$\Delta\big(H(x_{r}),H(x_{f})\big)=H(x_{r})-H(G(z;\theta^{(G)})) \tag{7}$$

$$\mathcal{L}_{\textrm{D}}=-\mathbb{E}_{x\sim p_{\textrm{data}},z\sim p_{z}}\log a\big[\Delta\big(H(x_{r}),H(x_{f})\big)\big] \tag{8}$$

$$\mathcal{L}_{\textrm{G}}=-\mathcal{L}_{\textrm{D}}+\Delta\big(F(x_{r}),F(x_{f})\big) \tag{9}$$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Full text

Improved Training of Mixture-of-Experts Language GANs

Yekun Chai

Baidu

[email protected]

Qiyue Yin

Junge Zhang

Institute of Automation, CAS

{qyyin,jgzhang}@nlpr.ia.ac.cn Corresponding author.

Abstract

Despite the dramatic success in image generation, Generative Adversarial Networks (GANs) still face great challenges in synthesizing sequences of discrete elements, in particular human language. The difficulty in generator training arises from the limited representation capacity and uninformative learning signals obtained from the discriminator. In this work, we (1) first empirically show that the mixture-of-experts approach is able to enhance the representation capacity of the generator for language GANs and (2) harness the Feature Statistics Alignment (FSA) paradigm to render fine-grained learning signals to advance the generator training. Specifically, FSA forces the mean statistics of the distribution of fake data to approach those of real samples as closely as possible in the finite-dimensional feature space. An empirical study on synthetic and real benchmarks shows superior performance in quantitative evaluation and demonstrates the effectiveness of our approach to adversarial text generation.

1 Introduction

Unsupervised sequence generation is the cornerstone of a plethora of applications, such as dialogue generation Li et al. (2017). The most common approach to autoregressive sequence modeling is to maximize the likelihood of each token in the sequence given the previous partial observation. However, maximum likelihood estimation (MLE) is inherently prone to the exposure bias problem Bengio et al. (2015), which results from a discrepancy between the training and inference stages: during inference the generator predicts the next token conditioned on its own previously generated tokens, whereas during training it is conditioned on the ground-truth prefix, so the mismatch accumulates as the generated sequence grows longer.

Generative Adversarial Networks (GANs) Goodfellow et al. (2014) can serve as an alternative to models trained with MLE and have achieved promising results in generating sequences of discrete elements, in particular language sequences Kusner and Hernández-Lobato (2016); Yu et al. (2017); Lin et al. (2017); Guo et al. (2017); Fedus et al. (2018); Nie et al. (2019); de Masson d’Autume et al. (2019); Zhou et al. (2020); Scialom et al. (2020). GANs consist of two competing networks: a discriminator that is trained to distinguish the generated samples from real data, and a generator that aims to generate high-quality samples to fool the discriminator.

Although language GANs have succeeded in avoiding exposure bias issues, the limited representation capacity of the generator still precludes them from covering complex patterns of the real data distribution and thus deteriorates the quality and diversity of generated samples Nie et al. (2019). There has been a recent move towards enhancing the expressive power of generators in language GANs Nie et al. (2019); Liu et al. (2020); Scialom et al. (2020) by leveraging advanced building blocks in the generator architecture, such as Relational Recurrent Neural Networks Santoro et al. (2018) or pretrained Transformers Scialom et al. (2020).

Inspired by the preeminence of mixture-of-experts approaches in image GANs Hoang et al. (2018); Ghosh et al. (2018) and language generation tasks, such as machine translation Bi et al. (2019), we propose to adopt a mixture-of-experts generator to enhance the representation capacity of the proposed model. Generally, the generator consists of multiple cooperative experts who attempt to jointly generate high-quality sentences through their interaction to compete with the discriminator. This is orthogonal to the aforementioned improvements on generators in language GANs.

Meanwhile, the lack of sufficient learning signals is another factor that hampers the performance of language GANs Lin et al. (2017); Chai et al. (2021). Lin et al. (2017) claimed that the binary classification in the discriminator network limits the learning capacity because the diversity and richness of generated samples are plagued by the degenerated distribution. RankGAN Lin et al. (2017) replaced the binary classifier with a pairwise feature ranker that compares the similarities between sample features in the latent space. SAL Zhou et al. (2020) classified the encoded features of constructed pairwise training examples into three categories, i.e., better / worse / indistinguishable.

To further promote the generator’s training by enriching the learning signals aside from the discriminator, we use an auxiliary encoder to extract the latent embeddings from real and fake data distributions and endow the generator with additional updating feedback apart from that yielded by the discriminator. Specifically, we leverage the Feature Statistics Alignment (FSA) paradigm to embed such latent feature representations in the finite-dimensional feature space and force the distribution of generated samples to approach the real data distribution by minimizing the distance between their respective feature representation centroids. Intuitively, matching the mean feature representations of fake and real instances could make the two data distributions closer. These intuitive advantages are also borne out in our ablation analysis.

Overall, we introduce a GAN architecture with a mixture-of-experts generator and the Feature Statistics Alignment method, termed MoEGAN, for sequence generation. Our experimental study demonstrates that the mixture-of-experts and FSA techniques stabilize the generator’s training process and improve the quality of generated samples. Moreover, our models generate sentences of high quality in terms of semantic coherence and grammatical correctness, as per human evaluation. Furthermore, we empirically demonstrate that the proposed architecture outperforms most existing models in both quantitative and qualitative evaluation.

To summarize, our main contributions are as follows:

• We design a mixture-of-experts language GAN framework that integrates the multi-agent structure to enhance the expressive capacity of the generator network.

• We utilize an auxiliary encoder to extract the latent embeddings and propose the Feature Statistics Alignment paradigm to endow the generator with fine-grained learning signals aside from the discriminator’s feedback. We empirically demonstrate its effectiveness in promoting the quality of generated samples.

• Our method achieves new state-of-the-art results on three different benchmarks of language GANs, including synthetic data, MS COCO Image Captions dataset, and EMNLP2017 WMT News dataset.

2 Language GANs

Adversarial sequence generation has attracted broad attention for its ability to address the exposure bias issue that maximum likelihood estimation (MLE) suffers from when generating language sequences. Based on game theory, its goal is to train a generator network $G(z;\theta^{(G)})$ that produces samples from the data distribution $p_{\textrm{data}}(x)$ by decoding the randomly initialized starting token $z$ into the sequence $x=G(z;\theta^{(G)})$. The training signal is provided by the discriminator network $D(x;\phi^{(D)})$, which is trained to distinguish between samples drawn from the real data distribution $p_{\textrm{data}}$ and those produced by the generator. The minimax objective of adversarial training is formulated as:

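$$\min_{\theta^{(G)}}\max_{\phi^{(D)}}\;\mathbb{E}_{x\sim p_{\textrm{data}}}\big[\log D(x;\phi^{(D)})\big]+\mathbb{E}_{z\sim p_{z}}\big[\log\big(1-D(G(z;\theta^{(G)});\phi^{(D)})\big)\big] \tag{1}$$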

Despite the impressive results of GANs in sequence generation Yu et al. (2017); Gulrajani et al. (2017); Scialom et al. (2020), there are still several fundamental issues in GAN training: (a) training instability, which arises from the intrinsic nature of minimax games in GANs; (b) mode dropping, whereby GANs generate samples covering only a limited set of patterns in the real data distribution instead of attending to diverse patterns Chen et al. (2018); (c) reward sparsity, which arises because the discriminator is easier to train than the generator, making it difficult to acquire instructive feedback Zhou et al. (2020).

Because the sampling operations between the generator and discriminator are non-differentiable in sequence generation, the majority of previous works have resorted to reinforcement learning (RL) heuristics with Monte Carlo search to collect credits from the discriminator. The use of RL may further aggravate the instability of model training and exacerbate the reward sparsity problem. Gumbel-Softmax relaxation has proven to be an alternative to RL techniques Kusner and Hernández-Lobato (2016); Nie et al. (2019). Therefore, we utilize the Gumbel-Softmax reparameterization instead of policy gradients to circumvent the unstable training of RL in our framework.

3 Methodology

Unlike conventional language GANs that update the generator with only one type of learning signal, the proposed model collects extra update feedback by judging the distortion between real and generated data distributions, in addition to true-or-false comparative rewards. In this work, we design a language GAN framework in which the generator is guided by two learning signals: comparative credits from the classifier, and distortion credits propagated from an auxiliary encoder. The former gauges the relativistic confidence of real samples compared with generated ones, while the latter measures the difference between latent feature statistics of real and generated samples.

As illustrated in Figure 1, MoEGAN consists of three components: (a) A generator that leverages multiple agents to collaboratively produce sequences to fool the discriminative classifier; (b) An auxiliary encoder that extracts latent embeddings from real and fake samples and thus renders fine-grained learning signals for generator updates; (c) A comparative classifier that measures the relative likelihood that given real samples are more authentic than fake data.

3.1 Mixture-of-Experts Generator

We utilize multiple experts $\{G_{i}|i\in(0,N_{g})\}$ as generators, in which each expert receives the previously generated token $x_{t-1}$ as input and produces the representation $r_{i}^{t}=G_{i}^{t}(x_{t-1})$ at the $t$-th time step. At the beginning, i.e., $t=1$, the input $x_{0}$ is the starting token $z$, which is a randomly initialized word embedding. The expert $G_{i}$ can be a recurrent neural network (RNN), such as a Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) network, which autoregressively produces the representation of the current token.

Formally, the model distribution of the $i$-th expert $G_{i}$ parameterized by $\theta_{i}$ can be defined as $q(r_{i}^{t}|x_{i}^{<t};\theta_{i})$. The overall distribution of the mixture-of-experts generator can be obtained in various ways, such as concatenation or gating functions. Different experts collaboratively model the target hidden representations to generate realistic tokens. Since our work aims to identify the impact of the mixture-of-experts mechanism in sequence GANs, we simply take the expectation over the experts for aggregation and leave the exploration of other interaction schemes to future work.

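The model distribution of multiple experts is formulated as:

$$p(r^{t}\mid x^{<t};\theta_{t})=\frac{1}{N_{g}}\sum_{i=1}^{N_{g}}q(r_{i}^{t}\mid x_{i}^{<t};\theta_{i}) \tag{2}$$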

where $\theta_{t}$ denotes the trainable parameters of the generator. This can also be treated as an ensemble of multiple agents, which is similar to Liu et al. (2018). In practice, each agent equivalently serves as a constituent to vote for candidate choices at every time step.

Denoting the aggregated output of multiple agents as $Y^{t}\in\mathbb{R}^{D_{g}}$, the output probabilities over the $|V|$-dimensional output vocabulary at the $t$-th time step are:

$$\pi^{t}=\textrm{softmax}(W_{G}Y^{t}) \tag{3}$$

where $\pi^{t}=\{\pi^{t}_{1},\pi^{t}_{2},\cdots,\pi^{t}_{|V|}\}$ represents the output probabilities of the vocabulary tokens, and $W_{G}\in\mathbb{R}^{|V|\times D_{g}}$ is a trainable projection matrix.
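As a rough illustration of the aggregation and projection steps, the sketch below averages the expert outputs and applies a softmax over the projected result. It assumes that the aggregation is an element-wise mean of the expert representation vectors; the function names `average_experts` and `token_probabilities`, and the toy values for the experts and for `w_g`, are hypothetical and not taken from the paper.

```python
import math


def average_experts(expert_outputs):
    """Element-wise mean of N_g expert vectors (assumed reading of the aggregation step)."""
    n_experts = len(expert_outputs)
    dim = len(expert_outputs[0])
    return [sum(vec[d] for vec in expert_outputs) / n_experts for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_probabilities(expert_outputs, w_g):
    """pi^t = softmax(W_G Y^t), with W_G given as |V| rows of length D_g."""
    y = average_experts(expert_outputs)
    logits = [sum(w * v for w, v in zip(row, y)) for row in w_g]
    return softmax(logits)


# Toy setup (hypothetical values): N_g = 2 experts, D_g = 2, |V| = 3.
experts = [[1.0, 0.0], [0.0, 1.0]]
w_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pi = token_probabilities(experts, w_g)
print(pi)
```

With these toy values the aggregated vector is `[0.5, 0.5]`, so the third token, whose row of `w_g` responds to both dimensions, receives the largest probability.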

GANs have fallen short in discrete data generation primarily because gradients cannot propagate from the discriminator to the generator through the non-differentiable sampling or argmax operations in between. Following Kusner and Hernández-Lobato (2016); Nie et al. (2019), we leverage the Gumbel-Softmax distribution Jang et al. (2016), with smooth annealing, to approximate the categorical distribution. The Gumbel-Max trick Maddison et al. (2014) can be parameterized as:

$$\mathbf{y^{t}}=\textrm{one\_hot}\big(\arg\max_{i}[g_{i}+\log\pi^{t}_{i}]\big) \tag{4}$$

where $\{g_{i}|i=1,\cdots,|V|\}$ are drawn i.i.d. from the Gumbel(0,1) distribution, that is, $g_{i}=-\log(-\log u_{i})$ where $u_{i}$ is drawn from a standard uniform distribution Uniform(0,1). $\mathbf{y^{t}}$ represents the $|V|$-dimensional one-hot encoding.
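The Gumbel-Max trick can be checked empirically: adding Gumbel(0,1) noise to the log-probabilities and taking the argmax should reproduce the categorical distribution $\pi^{t}$. The following is a minimal standard-library sketch; the helper names, the clamping constant, and the example probabilities are illustrative assumptions.

```python
import math
import random


def sample_gumbel(rng):
    """Draw g = -log(-log(u)) with u ~ Uniform(0, 1), clamped away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(pi, rng):
    """Return a one-hot list selecting argmax_i [g_i + log pi_i]; pi must be strictly positive."""
    scores = [math.log(p) + sample_gumbel(rng) for p in pi]
    winner = max(range(len(pi)), key=scores.__getitem__)
    return [1 if i == winner else 0 for i in range(len(pi))]


# Empirical check on a hypothetical 3-token vocabulary.
rng = random.Random(0)
pi = [0.7, 0.2, 0.1]
n_draws = 20000
counts = [0, 0, 0]
for _ in range(n_draws):
    counts[gumbel_max_sample(pi, rng).index(1)] += 1
freqs = [c / n_draws for c in counts]
print(freqs)  # empirical frequencies, expected to be close to pi
```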

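The Gumbel-Softmax trick approximates the non-differentiable $\arg\max$ operation with the softmax function:

$$\hat{y}^{t}_{i}=\frac{\exp\big((\log(\pi^{t}_{i})+g_{i})/\tau\big)}{\sum_{j=1}^{|V|}\exp\big((\log(\pi^{t}_{j})+g_{j})/\tau\big)} \tag{5}$$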

where $\tau$ denotes the softmax temperature, which modulates the trade-off between exploitation and exploration during training. When $\tau$ is high, the approximation is nearly equiprobable, encouraging the generator to explore different options. In contrast, a lower $\tau$ discourages exploration and favors exploitation during training. In particular, when $\tau\rightarrow 0$, $\hat{y}^{t}$ approaches the result of the one-hot operator in Eq. 4, whereas $\hat{y}^{t}$ degenerates into a uniform distribution when $\tau\rightarrow\infty$.
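The two temperature limits can be seen numerically. The sketch below evaluates the Gumbel-Softmax relaxation for a fixed noise vector at a low and a high temperature; the noise values, probabilities, and function name are hypothetical and chosen only for illustration.

```python
import math


def gumbel_softmax(pi, tau, gumbels):
    """Relaxed sample: softmax((log pi_i + g_i) / tau), computed stably."""
    logits = [(math.log(p) + g) / tau for p, g in zip(pi, gumbels)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


pi = [0.7, 0.2, 0.1]
gumbels = [0.3, -0.1, 1.2]  # fixed illustrative noise instead of random draws

y_cold = gumbel_softmax(pi, 0.01, gumbels)   # tau -> 0: close to one-hot
y_hot = gumbel_softmax(pi, 100.0, gumbels)   # tau large: close to uniform
print(y_cold)
print(y_hot)
```

Here the perturbed score of the first token is the largest, so `y_cold` concentrates almost all mass on index 0, while `y_hot` stays near 1/3 for every token.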

3.2 Auxiliary Encoder

Given the input sequence of real samples (denoted as $x_{r}$ ) and that of generated ones (denoted as $x_{f}$ ), we use an auxiliary encoder $F$ to extract their latent feature embeddings and measure the difference of their respective latent distributions, i.e., $\Delta\big{(}F(x_{r}),F(x_{f})\big{)}$ . In practice, we adopt Convolutional Neural Networks (CNNs) as the auxiliary encoder.

Intuitively, aligning the statistics of embedded feature representations increases the model capacity to capture various modes of the data distribution. For brevity, we utilize the first-order mean statistics in our framework and leave the higher-order statistics for future work.

We propose Feature Statistics Alignment (FSA), which measures the Euclidean distance between the minibatch centroids of real and fake feature representations. Given mini-batches of real data samples with the batch size of $N_{f}$ , we formulate the FSA distance as:

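$$\Delta\big(F(x_{r}),F(x_{f})\big)=\big\|\mathbb{E}_{x_{r}\sim p_{\textrm{data}}}\big[F(x_{r})\big]-\mathbb{E}_{z\sim p_{z}}\big[F(G(z;\theta^{(G)}))\big]\big\|_{2} \tag{6}$$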

By forcing the mean statistics of fake data to be close to the real samples, the generator could receive more informative signals during the training process.
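A minibatch estimate of the FSA distance replaces each expectation with the mean of the encoded features in the batch. The sketch below assumes the encoder outputs are already available as lists of feature vectors; the function names and toy features are hypothetical.

```python
import math


def centroid(features):
    """Mean feature vector of a minibatch given as a list of equal-length lists."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def fsa_distance(real_features, fake_features):
    """Euclidean distance between the real and fake minibatch centroids."""
    mu_real = centroid(real_features)
    mu_fake = centroid(fake_features)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_real, mu_fake)))


# Toy encoded features (hypothetical): centroids are (2, 3) and (-1, -1).
real = [[1.0, 2.0], [3.0, 4.0]]
fake = [[0.0, 0.0], [-2.0, -2.0]]
print(fsa_distance(real, fake))  # centroid difference is (3, 4), so the distance is 5.0
```

Because only first-order statistics are compared, two batches with very different spreads but equal means give a distance of zero, which is consistent with the paper leaving higher-order statistics to future work.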

Introducing an extra auxiliary encoder could cost additional computing resources and may slow down the overall training process. To mitigate this issue, we do not update the weight of the proposed encoder during training. Instead, we copy the parameters from the comparative classifier after each training iteration and keep them fixed to extract the latent embeddings (This requires identical model structures between the auxiliary encoder and the comparative classifier).

3.3 Comparative Classifier

Following the relativistic discriminator Jolicoeur-Martineau (2018), we take into account the relativistic confidence of given real samples using a comparative classifier.

Given the discriminator $H$, the mismatch between real and fake inputs measures the relative confidence that given real data are more realistic than randomly sampled fake data, which can be formulated as:

$$\Delta\big(H(x_{r}),H(x_{f})\big)=H(x_{r})-H(G(z;\theta^{(G)})) \tag{7}$$

The comparative discriminator aims to maximize the likelihood that given real instances are more authentic than generated ones, whereas the generator runs counter to it. The loss function of the comparative discriminator is defined as:

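$$\mathcal{L}_{\textrm{D}}=-\mathbb{E}_{x\sim p_{\textrm{data}},z\sim p_{z}}\log a\big[\Delta\big(H(x_{r}),H(x_{f})\big)\big] \tag{8}$$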

where $a$ represents the activation function to be relativistic (we use sigmoid function in our experiments).
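With a sigmoid activation, the comparative loss is the negative log-sigmoid of the score gap. The sketch below estimates it on a minibatch, under the assumption that real and fake scores are paired by position; the pairing scheme and function names are illustrative choices, not details confirmed by the paper.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def comparative_discriminator_loss(real_scores, fake_scores):
    """Minibatch estimate of L_D = -E[log sigmoid(H(x_r) - H(x_f))]."""
    gaps = [r - f for r, f in zip(real_scores, fake_scores)]
    return -sum(log_sigmoid(g) for g in gaps) / len(gaps)


# Hypothetical classifier scores H(x) for three situations.
confident = comparative_discriminator_loss([3.0, 2.5], [-1.0, 0.0])
undecided = comparative_discriminator_loss([0.0, 0.0], [0.0, 0.0])
fooled = comparative_discriminator_loss([-1.0, 0.0], [3.0, 2.5])
print(confident, undecided, fooled)
```

The loss shrinks as real samples are scored above fake ones, equals $\log 2$ when the classifier cannot separate them, and grows when fake samples are scored higher.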

3.4 Training

Generator’s Training Objectives.

In language GANs, the generator is more difficult to train than the discriminator, resulting in training instability and reward sparsity. To relieve these issues, we endow the optimization of the generator with signals from the auxiliary encoder. The loss function is thus:

$$\mathcal{L}_{\textrm{G}}=-\mathcal{L}_{\textrm{D}}+\Delta\big(F(x_{r}),F(x_{f})\big) \tag{9}$$

The goal of the discriminator is to maximize the gap between the generated and real data (Eq. 8), whereas the generator considers two aspects simultaneously (Eq. 9): it competes with the discriminator by shrinking that gap in terms of relativistic signals, and it also takes feature distortion signals into account. This passes more instructive feedback to the generator alone and also helps prevent the discriminator from being overtrained.
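To make the two terms of the generator objective concrete, the sketch below combines a minibatch estimate of $-\mathcal{L}_{\textrm{D}}$ with the centroid distance between encoded features. It assumes a sigmoid activation, positional pairing of real and fake scores, and precomputed feature vectors; all names and numbers are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def centroid(rows):
    """Mean vector of a list of equal-length feature lists."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def generator_loss(real_scores, fake_scores, real_feats, fake_feats):
    """L_G = -L_D + FSA distance, both estimated on one minibatch."""
    l_d = -sum(log_sigmoid(r - f) for r, f in zip(real_scores, fake_scores)) / len(real_scores)
    mu_r, mu_f = centroid(real_feats), centroid(fake_feats)
    fsa = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_r, mu_f)))
    return -l_d + fsa


real_feats = [[1.0, 2.0], [3.0, 4.0]]
far_fake_feats = [[0.0, 0.0], [-2.0, -2.0]]

# Fake samples scored well below real ones, with distant feature centroids.
loss_bad = generator_loss([2.0, 2.0], [0.0, 0.0], real_feats, far_fake_feats)
# Fake samples scored on par with real ones, with matching feature centroids.
loss_good = generator_loss([2.0, 2.0], [2.0, 2.0], real_feats, real_feats)
print(loss_bad, loss_good)
```

The generator loss falls both when the classifier's score gap closes and when the feature centroids align, so the FSA term still provides a gradient signal when the relativistic term saturates.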

Adversarial Training Algorithm.

Algorithm 1 illustrates the overall training process of the proposed framework. The relativistic discriminator and the generator reach a Nash equilibrium when the generator can fool the discriminator into accepting its output as real. Since the discriminator network is easily overtrained, we do not pretrain it and only pretrain the generator using MLE for a few epochs.

4 Experiments

4.1 Experimental Settings

Dataset.

Following Lin et al. (2017); Guo et al. (2017); Nie et al. (2019); Zhou et al. (2020), we evaluate the proposed framework based on the Texygen benchmark platform Zhu et al. (2018) for adversarial text generation. Experiments were conducted on synthetic and real datasets: (a) synthetic data, which is generated by an oracle single-layer LSTM as in Yu et al. (2017); (b) MS COCO Image Caption dataset Chen et al. (2015); (c) EMNLP2017 WMT News dataset Guo et al. (2017). Table 1 summarizes the statistics of benchmark datasets for evaluation.

Evaluation Metrics.

(1) For synthetic data experiments, we utilize a single-layer LSTM initialized from a standard normal distribution as the oracle model, which is used to generate 10,000 samples of length 20 and 40, respectively, as real data. We use the negative log-likelihood (NLL) under the oracle data distribution for evaluation, termed NLL$_{\textrm{oracle}}$. (2) For real data, BLEU scores Papineni et al. (2002) are used to evaluate the n-gram overlap with the whole dataset. (3) To measure the diversity of generated samples, we use the NLL of the generator (denoted as NLL$_{\textrm{gen}}$), computed as the NLL that the generator assigns to reference samples in the test set. (4) Considering that BLEU scores focus on local text statistics and may be insufficient for evaluating the overall quality of texts, we conducted an additional human evaluation of the comparison models via crowdsourcing.
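Both NLL metrics score a set of sequences under a model: NLL$_{\textrm{oracle}}$ scores generated samples under the oracle, and NLL$_{\textrm{gen}}$ scores test-set references under the generator. The sketch below shows the shared computation with a hypothetical `next_token_prob(prefix, token)` interface; the per-token averaging is an assumption about the normalization, and the uniform model is a stand-in for a real oracle or generator.

```python
import math


def sequence_nll(tokens, next_token_prob):
    """Mean per-token negative log-likelihood of one sequence under a model."""
    total = 0.0
    for t, token in enumerate(tokens):
        total -= math.log(next_token_prob(tokens[:t], token))
    return total / len(tokens)


def corpus_nll(sequences, next_token_prob):
    """Average of the per-sequence NLL values over a set of sequences."""
    return sum(sequence_nll(seq, next_token_prob) for seq in sequences) / len(sequences)


def uniform_model(prefix, token, vocab_size=4):
    """Stand-in model (hypothetical): every token is equally likely regardless of prefix."""
    return 1.0 / vocab_size


references = [[0, 1], [2, 3, 1]]
print(corpus_nll(references, uniform_model))  # equals log(4) for the uniform stand-in
```

Passing the oracle's probabilities with generated sequences gives an NLL$_{\textrm{oracle}}$-style score, while passing the generator's probabilities with held-out references gives an NLL$_{\textrm{gen}}$-style score.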

Baselines.

We adopt MLE and other state-of-the-art models as baselines, involving SeqGAN Yu et al. (2017), RankGAN Lin et al. (2017), LeakGAN Guo et al. (2017), RelGAN Nie et al. (2019), Self-Adversarial Learning (SAL) Zhou et al. (2020), and Counter-Contrastive Learning GAN (CCL) Chai et al. (2021).

Implementation Details.

We adopt the Relational Memory Core (RMC) Santoro et al. (2018) as the agent architecture of the generator. As for the discriminator and auxiliary encoder, we use CNN architecture Kim (2014) for input sequences. See Appendix A for more details.

4.2 Experimental Results

Synthetic Data.

Table 2 reports the performance of different models on NLL$_{\textrm{oracle}}$. Our models outperform the baseline models in terms of generated sample quality, demonstrating the effectiveness of the proposed method. We empirically found that our models achieve their best performance at sequence lengths of 20 and 40 when the number of agents $N_{g}$ is 2 and 3, respectively. As for diversity, our method outperforms or is competitive with the baselines on the NLL$_{\textrm{gen}}$ score (see Appendix C.1 for details).

Real Data.

To further verify the performance on real data, we run and evaluate our model on the MS COCO Image Captions and EMNLP2017 WMT News datasets. The data preprocessing remains the same as in Texygen Zhu et al. (2018). Table 3 compares the results of our model with baselines in terms of both the quality (BLEU) and diversity (NLL$_{\textrm{gen}}$) metrics. Overall, our model exceeds or matches the comparison models on automatic evaluation metrics.

For quality metrics, SeqGAN and RankGAN outperform MLE on BLEU scores with short n-gram spans, such as BLEU-2, but are inferior to MLE on longer spans such as BLEU-4/5, which may result from the lack of informative signals for the generator update. In contrast, LeakGAN, RelGAN, SAL, and CCL surpass the MLE method on all BLEU scores. This implies that both leaked information and comparative feedback can enhance adversarial training. The improvements of our model over MLE are even larger than those of the comparison models, demonstrating the benefits of exploiting the mixture-of-experts and FSA paradigms in language GANs.

In terms of the diversity metric, most baselines do not perform as well as MLE, except LeakGAN. This is because LeakGAN leverages internal information from the discriminator as a guiding signal, supplementing the original learning credits collected from the discriminator. On this metric, the proposed model slightly outperforms LeakGAN and MLE on the MS COCO Image Caption dataset (maximum length 37) but achieves similar results on the EMNLP2017 WMT News dataset (maximum length 51), since informative signals are even harder to produce for long sequences. By leveraging the proposed methods, our model greatly outperforms these models in quality and yields long sequences of promising quality while generating samples with similar diversity. This implies that our model is adept at generating high-quality sentences under the guidance of the introduced latent-feature-enhanced signals.

Human Evaluation.

Apart from the automatic evaluation, we also conducted a human evaluation on the MS COCO dataset. We randomly sampled 100 sentences from each model, then asked ten different people to score them in terms of grammatical correctness and meaningfulness on a scale of 1-5 after anonymizing the models' identities. Our model received the highest score among all compared models, as shown in Table 4. Please see Appendix C.2 for details.

Ablation Study.

To examine the benefits of the proposed components in adversarial sentence generation, we conducted ablation tests by removing particular modules. Figure 2 shows the relative importance of each component of our models. The mixture-of-experts generator yields the most significant performance gain, followed by the FSA. We empirically found that the proposed approaches boost performance due to accurate estimates of feature statistics and stabilizing effects during adversarial training. Note that the following experiments are conducted on the MS COCO dataset unless otherwise specified.

Figure 3 illustrates the ablation test of the mixture-of-experts generator and FSA in terms of the BLEU-4 score (see Appendix D for all results) and NLLgen. It can be observed that ablating either the mixture-of-experts or the FSA approach deteriorates model performance: the full model achieves a higher BLEU score and a lower NLLgen. Note that without the mixture-of-experts generator, BLEU scores increase slowly, whereas FSA may contribute more to sample diversity than the mixture-of-experts paradigm. This is because the multiple experts enrich the representation capacity of the generator, while the FSA paradigm provides continuous, fine-grained, and smoother learning signals to update the generator.

Auxiliary Feature Visualization.

To visually verify the impact of FSA, we plot sampled latent features of real and fake data using t-SNE. Figures 4(c) and 4(d) compare latent representations of real and generated data using our models without FSA, showing a relative mismatch between the visualized feature embeddings, especially the missing area in the top left corner of Figure 4(d). Meanwhile, there are several surplus embedding segments in the fake data (Figure 4(d)) that do not appear in the real data (Figure 4(c)). In contrast, Figures 4(a) and 4(b) reveal a larger overlapping area between real and fake embeddings generated by our model with FSA, indicating the benefit of FSA for matching the distributions of real and generated data. This further supports the previous automatic evaluation results.
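As a rough illustration of what aligning mean feature statistics amounts to, the sketch below computes the distance between the centroids of real and generated latent features. The function names `centroid` and `mean_alignment_distance`, the choice of a squared Euclidean distance, and the toy two-dimensional features are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
from typing import List, Sequence


def centroid(features: Sequence[Sequence[float]]) -> List[float]:
    """Return the per-dimension mean of a batch of feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(vec[d] for vec in features) / n for d in range(dim)]


def mean_alignment_distance(real: Sequence[Sequence[float]],
                            fake: Sequence[Sequence[float]]) -> float:
    """Squared Euclidean distance between the real and fake centroids.

    The squared Euclidean form is an assumption for this sketch.
    """
    mu_real = centroid(real)
    mu_fake = centroid(fake)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))


# Toy latent features (hypothetical values).
real_feats = [[1.0, 0.0], [3.0, 2.0]]  # centroid (2.0, 1.0)
fake_feats = [[0.0, 1.0], [2.0, 1.0]]  # centroid (1.0, 1.0)
fsa_gap = mean_alignment_distance(real_feats, fake_feats)
```

A generator trained with such a term would be pushed to shrink `fsa_gap` toward zero, which corresponds to the larger overlap seen between real and fake embeddings in the visualization.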

Impact of Hyperparameters.

Figure 5(a) displays the impact of the agent number $N_{g}$ in the generator network on the MS COCO Image Caption dataset. All BLEU scores peak when $N_{g}=2$. We conjecture that this is because we adopt an identical model structure for each generation agent, which may saturate as the number of agents increases. Figure 5(b) shows the influence of the Gumbel-Softmax temperature $\tau$ on model performance. Our models attain the best BLEU scores when $\tau=0.01$ but a lower NLLgen when $\tau=0.001$. To balance quality and diversity, we adopt $\tau=0.01$ in our optimal hyperparameter settings.
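To make the role of the temperature concrete, here is a minimal sketch of the standard Gumbel-Softmax relaxation in plain Python. The function name `gumbel_softmax`, the toy logits, and the clamping constant are assumptions for illustration; the paper's actual generator code may differ.

```python
import math
import random
from typing import List, Optional


def gumbel_softmax(logits: List[float], tau: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical token scores
# The same seed gives the same noise, so only the temperature differs.
soft = gumbel_softmax(toy_logits, tau=1.0, rng=random.Random(0))
sharp = gumbel_softmax(toy_logits, tau=0.01, rng=random.Random(0))
```

With identical noise, the lower temperature yields a distribution that is closer to one-hot (here `sharp` versus `soft`), which is the knob that Figure 5(b) varies.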

5 Related Work

A large category of GANs for sequence generation relies heavily on the RL paradigm. SeqGAN Yu et al. (2017) regards sequence generation as a Markov decision process, estimates rewards via Monte Carlo search, and trains the generator with policy gradient. RankGAN Lin et al. (2017) and SAL Zhou et al. (2020) replace the binary classifier in the discriminator with comparative discriminators to focus on relations between constructed pairs. MaliGAN Che et al. (2017) utilizes the information in the discriminator as an additional source of training signals in the MLE objective to reduce the variance of gradients. LeakGAN Guo et al. (2017), inspired by hierarchical RL, leaks intermediate information via a manager to guide the generator. ColdGAN Scialom et al. (2020) integrates advances in importance sampling and RL algorithms to finetune pretrained models.

Another approach uses non-RL methods for adversarial sequence generation by either approximating the categorical sampling or directly using the continuous latent representation. TextGAN Zhang et al. (2017) uses feature matching via a kernelized discrepancy in the Reproducing Kernel Hilbert Space. FMGAN Chen et al. (2018) proposes to match feature distributions using a Feature-Mover’s Distance. Both apply an annealed soft-argmax for approximation. ARAML Ke et al. (2019) utilizes Reward Augmented Maximum Likelihood by sampling from the stationary distribution to acquire rewards. Gumbel-Softmax (GS) GAN Kusner and Hernández-Lobato (2016) and RelGAN Nie et al. (2019) demonstrate the usefulness of Gumbel-Softmax in language GANs. Chai et al. (2021) propose the counter-contrastive learning objective to learn contrastive signals by explicitly comparing real and fake samples. However, improving the training of language GANs remains an open problem. Our model aims to improve GS-GAN with the proposed techniques to boost the training of language GANs. We also report in Appendix B a list of techniques that we tried but that proved unsuccessful or unnecessary. To the best of our knowledge, the proposed model is the first to employ mixture-of-experts techniques in language GANs.

6 Conclusion

We propose an adversarial training framework for discrete sequence generation that leverages a mixture-of-experts generator and Feature Statistics Alignment. Our model empirically shows superior performance in both quantitative and human evaluation. In the future, it would be promising to extend our method to large language models.

Appendix A Implementation Details

Mixture-of-Experts Generator.

For each agent in the generator, we adopt the Relational Memory Core (RMC) Santoro et al. (2018), setting the memory size as 256, the memory slot number as 1, the attention head number as 2. The input embedding dimension is set to 32. Similar to Hoang et al. (2018), we utilize parameter sharing among different agents to reduce the computing budget.

Comparative Classifier & Auxiliary Encoder.

For the Comparative Classifier and Auxiliary Encoder, we employ multi-channel convolution using multiple filters with various window sizes to extract distinct n-gram features, followed by a max-over-time pooling operation to gather the most salient features, i.e., the features with the highest value for each feature map. The input embedding dimension for the discriminator is set to 64. We adopt filter sizes of $\{2,3,4,5\}$ with 300 channels each. Afterward, a highway layer identical to that of SeqGAN Yu et al. (2017) is used, followed by a linear transformation with a dimension of 100. Finally, we apply a linear transformation to get the final logits. The auxiliary encoder shares the identical architecture and weight matrix with the comparative classifier (CNN), and the latent dimension is set to 100.
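The convolution and pooling steps described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's network: it uses a single filter per window size instead of 300 channels, omits the nonlinearity, highway layer, and linear layers, and the names `conv_max_over_time` and `extract_features` as well as the toy embeddings are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv_max_over_time(embeddings: Matrix, kernel: Matrix) -> float:
    """Slide one n-gram filter over the sequence and keep its largest response."""
    window = len(kernel)  # number of tokens the filter spans
    responses = []
    for start in range(len(embeddings) - window + 1):
        score = 0.0
        for offset in range(window):
            for w, x in zip(kernel[offset], embeddings[start + offset]):
                score += w * x
        responses.append(score)
    return max(responses)  # max-over-time pooling


def extract_features(embeddings: Matrix, kernels: List[Matrix]) -> List[float]:
    """One pooled feature per filter; filters may have different window sizes."""
    return [conv_max_over_time(embeddings, kernel) for kernel in kernels]


# Toy sequence of 6 tokens with 2-dimensional embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
# One all-ones filter for each window size in {2, 3, 4, 5}.
kernels = [[[1.0, 1.0] for _ in range(size)] for size in (2, 3, 4, 5)]
features = extract_features(tokens, kernels)
```

Because each filter is reduced to a single maximum, the feature vector has a fixed length regardless of the sentence length, which is what allows the later layers to operate on a fixed dimension.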

Optimization.

We use the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . The initial learning rate for the generator is set to 1e-2 for pretraining and 1e-4 for adversarial training. We set the initial learning rate to 1e-4 for the discriminator network during adversarial training. To prevent overfitting, we clip the gradients of trainable parameters whose L2 norm exceeds 5.
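A minimal sketch of clipping by L2 norm follows. It treats the gradient as one flat vector and rescales it whenever its norm exceeds the threshold of 5; whether the paper clips per parameter tensor or over all parameters jointly is not stated here, so that choice and the name `clip_by_l2_norm` are assumptions.

```python
import math
from typing import List


def clip_by_l2_norm(grad: List[float], max_norm: float = 5.0) -> List[float]:
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grad]


clipped = clip_by_l2_norm([6.0, 8.0])    # norm 10 is rescaled to norm 5
unchanged = clip_by_l2_norm([3.0, 4.0])  # norm 5 is left as it is
```

Rescaling keeps the direction of the update and only limits its magnitude.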

Training Settings.

We conduct experiments to tune the following hyperparameters: agent number $N_{g}\in\{1,2,3,4,5\}$ , batch size in $\{32,64,128\}$ , and Gumbel-Softmax temperature $\tau\in\{1,0.5,0.1,0.01,0.001\}$ . The numbers of training steps for the generator and discriminator networks are set to $g=1$ and $d=5$ , respectively. The generator is pretrained for 150 epochs before the adversarial training. Finally, the optimal batch size is set to 128 for both synthetic and real datasets. It is worth noting that we also tested a batch size of 256, which requires considerably more GPU resources but does not show an obvious improvement. All experiments are run with five different random seeds on a single Nvidia Titan RTX GPU.
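The search space and the update ratio above can be written down as a small configuration sketch. It assumes an exhaustive grid over the listed values and assumes that each round runs the generator steps before the discriminator steps; neither detail is stated in the text, and the names `grid` and `update_schedule` are hypothetical.

```python
from itertools import product
from typing import Dict, List

AGENT_NUMBERS = [1, 2, 3, 4, 5]
BATCH_SIZES = [32, 64, 128]
TEMPERATURES = [1, 0.5, 0.1, 0.01, 0.001]
G_STEPS, D_STEPS = 1, 5  # generator / discriminator steps per round

# Every combination of the listed values (exhaustive search is an assumption).
grid: List[Dict[str, float]] = [
    {"n_agents": n, "batch_size": b, "tau": t}
    for n, b, t in product(AGENT_NUMBERS, BATCH_SIZES, TEMPERATURES)
]


def update_schedule(num_rounds: int) -> List[str]:
    """List which network is updated at each step (ordering is an assumption)."""
    order: List[str] = []
    for _ in range(num_rounds):
        order.extend(["generator"] * G_STEPS)
        order.extend(["discriminator"] * D_STEPS)
    return order


schedule = update_schedule(2)
```

Under these assumptions the grid contains 75 configurations, and two adversarial rounds consist of 2 generator updates and 10 discriminator updates.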

Appendix B Negative Results

We list some approaches that we tried but that proved unsuccessful:

• Replacing the FSA distance with the Wasserstein distance to verify the effect of different distances for Feature Statistics Alignment. We found that the Wasserstein distance is not as effective as our method, which measures the distance between the centroids of the real and fake distributions (see Figure 6 for a performance comparison).

• Using a Mogrifier LSTM as the generator, which achieves results similar to vanilla LSTMs on the synthetic data.

• Using a Wasserstein loss instead of the current Relativistic Discriminator. This was not as stable as the current solution.

• Using the Transformer model as the discriminator. It achieves unsatisfactory results with the current experimental settings.

• Using interleaved training instead of two-stage training, i.e., adversarial training after pretraining. Training the generator for 15 iterations after one iteration of MLE was unsuccessful.

• Using top-k sampling and nucleus sampling instead of the argmax in the Gumbel-Max trick. This does not always boost the final performance.

• Using a hinge loss on the discriminator. This did not improve over the current relativistic loss.
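For readers unfamiliar with the two sampling variants mentioned in the list above, the following sketch shows how top-k and nucleus (top-p) filtering restrict a token distribution before sampling. The function names and the toy distribution are hypothetical, and the sketch does not show how these filters were combined with the Gumbel-Max trick in the experiments.

```python
from typing import List


def top_k_filter(probs: List[float], k: int) -> List[float]:
    """Keep the k most probable tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def nucleus_filter(probs: List[float], p: float) -> List[float]:
    """Keep the smallest set of top tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


toy_probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical next-token distribution
top2 = top_k_filter(toy_probs, k=2)
nucleus = nucleus_filter(toy_probs, p=0.7)
```

On this toy distribution both filters keep the first two tokens, so `top2` and `nucleus` coincide; in general, the nucleus set grows or shrinks with the shape of the distribution while the top-k set has a fixed size.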

Appendix C Evaluation Details

C.1 Synthetic Data

For synthetic data, we evaluate the generated sequences w.r.t. both quality and diversity. To measure quality, we use the oracle LSTM to evaluate the negative log-likelihood of our generated samples (denoted as NLLoracle). To measure diversity, we use the negative log-likelihood of the synthetic dataset (denoted as NLLgen), measured by the generator during training. We also report the best NLLoracle+NLLgen to evaluate the trade-off between quality and diversity. We observe that our model outperforms the baseline models in terms of quality (measured by NLLoracle) and the quality-diversity trade-off (measured by NLLoracle+NLLgen), and matches the competitive results of the baselines w.r.t. diversity (indicated by NLLgen).
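The two metrics differ only in which model scores which data, as the sketch below illustrates. It assumes that NLL is the summed negative log-probability of the tokens averaged over sequences, and it replaces the oracle LSTM and the trained generator with two hypothetical stand-in models (`uniform_model` and `skewed_model`); the exact normalization used in the paper may differ.

```python
import math
from typing import Callable, List, Sequence

# A "model" here is any function returning P(next token | prefix).
Model = Callable[[Sequence[int], int], float]


def average_nll(model: Model, sequences: List[List[int]]) -> float:
    """Negative log-likelihood per sequence, averaged over the dataset."""
    total = 0.0
    for seq in sequences:
        for t, token in enumerate(seq):
            total -= math.log(model(seq[:t], token))
    return total / len(sequences)


def uniform_model(prefix: Sequence[int], token: int) -> float:
    """Stand-in oracle: uniform over a vocabulary of 4 tokens."""
    return 0.25


def skewed_model(prefix: Sequence[int], token: int) -> float:
    """Stand-in generator: prefers token 0 (probabilities sum to 1)."""
    return 0.7 if token == 0 else 0.1


generated_samples = [[0, 0, 1], [0, 2, 0]]  # hypothetical generator outputs
real_data = [[1, 2, 3], [0, 1, 2]]          # hypothetical oracle data

nll_oracle = average_nll(uniform_model, generated_samples)  # quality
nll_gen = average_nll(skewed_model, real_data)              # diversity
```

A generator that collapses onto a few sequences can obtain a low NLLoracle while assigning little probability to the real data, which shows up as a high NLLgen; reporting both, and their sum, guards against that failure mode.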

C.2 Human Evaluation

Acceptability (i.e., whether a sentence is acceptable to human readers), grammaticality (i.e., whether a sentence is grammatically correct), and meaningfulness (i.e., whether a sentence makes sense) are the three main standards for text quality evaluation. Please note that any minor text formatting issues that do not negatively influence the understanding and correctness of the sentences (e.g., punctuation, capitalization, spelling errors, extra spaces) can be ignored. Please also note that a sentence consisting of fewer than 10 words should have one point deducted. Table 6 gives more detailed criteria.

It is worth mentioning that human evaluation is used to measure the quality of generated sentences rather than their diversity. For a comprehensive comparison, we also compare with samples from MaliGAN Che et al. (2017) and TextGAN Zhang et al. (2017) in addition to the aforementioned baselines.

C.3 Linguistic Analysis

From the syntactic perspective, our model performs well, generating grammatically correct and meaningful sentences in most instances. A great many of the generated sentences follow the basic sentence pattern in English, which is SVO (Subject Verb Object), with few exceptions. One of the most common ungrammatical forms is the omission of the verb. For example, in the sentence “a man standing next to a small airplane with two dogs” the primary verb “is” does not appear. Hence, samples like this cannot be counted as grammatically acceptable sentences but only as meaningful phrases. However, even though there is a grammatical error, such sentences mostly make sense and can be given a score of 4. This is because there are plenty of sentences omitting verbs in the real training data, such as “a sink next to a large white door” and “city street with parked cars and a bench”. It thus makes sense that models generate meaningful sentences with predicate omission.

Another point worth noting is syntactic ambiguity resulting from the placement of prepositional phrases. For instance, the sentence “a man is sitting on a motorcycle with a woman on the bike” has at least two interpretations. One reading is that a man is sitting on a motorcycle while a woman is sitting on a bike. Another, absurd reading is that a man and a woman are riding a motorcycle while sitting on the bike. The example shows that even though the model can generate grammatically correct sentences, not all of their readings make sense. Therefore, generating sentences with clearer syntactic meaning remains a challenge.

C.4 Case Study

Table 7 displays the generated samples from all baseline models and references in the MS COCO Image Caption dataset. From the presented sentences, we can observe that samples generated by MLE are less meaningful than those of other models, which is consistent with the results in Zhou et al. (2020). Besides, models such as RankGAN, RelGAN, and ours tend to produce realistic sentences. RelGAN tends to generate long sentences with more prepositional phrases but lacks consistency of language context, as in “with a chair on the couch”. SAL also generates confusing phrases such as “a man on a motorcycle is flying”. In contrast, our method generates prepositional phrases more appropriately, such as “on a beach next to the ocean”. We can observe that our method generates more human-like samples than other GANs and the MLE baseline.

Appendix D Results of Ablation Study

Figure 7 presents all BLEU scores and NLLgen for the ablation study. Our model benefits from both the mixture-of-experts and FSA paradigms, achieving superior performance on both the quality and diversity metrics. This demonstrates the effectiveness of the proposed methods in improving the quality of generated samples. We observe that our full model reaches its peak more quickly than models with either the mixture-of-experts generator or the FSA method ablated. Removing either of them deteriorates the overall performance on all BLEU scores and the NLLgen metric. It can be inferred that the FSA method has positive effects on sample diversity, while the mixture-of-experts generator enriches the representation capacity of the generator and thus promotes the training efficiency of the generator network.

Appendix E Generated Samples on Real Data

Tables 8 and 9 show randomly sampled sentences generated by the proposed models on the MS COCO Image Captions and EMNLP2017 WMT News datasets, respectively. We can see that the proposed method can generate meaningful and grammatically correct sentences on real data.

Bibliography

Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.
Bi et al. (2019) Tianchi Bi, Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Multi-agent learning for neural machine translation. In EMNLP-IJCNLP.
Chai et al. (2021) Yekun Chai, Haidong Zhang, Qiyue Yin, and Junge Zhang. 2021. Counter-contrastive learning for language GANs. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4834–4839, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Che et al. (2017) Tong Che, Yanran Li, R. Zhang, R. Devon Hjelm, W. Li, Y. Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. ArXiv, abs/1702.07983.
Chen et al. (2018) Liqun Chen, Shuyang Dai, Chenyang Tao, Dinghan Shen, Zhe Gan, H. Zhang, Yizhe Zhang, and L. Carin. 2018. Adversarial text generation via feature-mover’s distance. In NeurIPS.
Chen et al. (2015) Xinlei Chen, H. Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. L. Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. ArXiv, abs/1504.00325.
de Masson d’Autume et al. (2019) Cyprien de Masson d’Autume, Mihaela Rosca, Jack W. Rae, and Shakir Mohamed. 2019. Training language GANs from scratch. In NeurIPS.
Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. MaskGAN: Better text generation via filling in the_. In ICLR.