Leveraging Sentence Similarity in Natural Language Generation: Improving Beam Search using Range Voting

Sebastian Borgeaud; Guy Emerson

arXiv:1908.06288·cs.CL·May 27, 2020

TL;DR

This paper introduces a voting-based approach to natural language generation that enhances diversity and informativeness of outputs by selecting the most representative sentence using range voting and similarity measures, improving BLEU scores and human ratings.

Contribution

It presents a novel voting-based method for natural language generation that improves diversity and quality of outputs across different models and tasks.

Findings

1. Generated sentences are longer and more diverse.

2. BLEU scores are higher, particularly with larger beam sizes.

3. Human evaluations favor the proposed method.

Tables

Table 1: BLEU-1 and BLEU-4 scores obtained on the MSCOCO validation images.

	BLEU-1				BLEU-4
Beam size $k$	1	2	10	100	1	2	10	100
Beam search	66.66	67.97	67.22	66.18	25.39	26.83	27.16	26.31
Length normalisation	66.66	68.47	64.72	63.10	25.39	26.72	25.76	24.72
Diverse decoding	66.66	67.90	67.24	66.43	25.39	26.68	26.93	26.37
$\operatorname{overlap}_{1}$	66.66	68.55	66.26	66.36	25.39	26.47	25.61	24.60
$\operatorname{precision}_{1}$	66.66	68.54	66.31	66.46	25.39	26.47	25.62	24.58
$\operatorname{overlap}_{2}$	66.66	68.20	67.36	67.19	25.39	26.82	27.22	27.13
$\operatorname{precision}_{2}$	66.66	68.20	67.63	67.21	25.39	26.82	27.23	27.13
$\operatorname{lstm\_states}$	66.66	67.97	68.42	69.10	25.39	26.83	27.96	28.23
$\operatorname{bleu_{4}}$ (MBR)	66.66	67.93	67.66	68.56	25.39	26.83	27.80	28.71
$\operatorname{smoothed\_bleu_{4}}$ (MBR)	66.66	67.98	67.87	69.45	25.39	26.84	27.89	29.19

Table 2: Number of distinct captions, unigrams and bigrams in the generated captions.

	Distinct captions			Distinct unigrams			Distinct bigrams
Beam size $k$	2	10	100	2	10	100	2	10	100
Beam search	9208	5488	4150	668	621	605	3395	2778	2479
Length normalisation	9978	6418	5039	681	627	587	3502	2863	2471
Diverse decoding	9942	6424	4403	672	646	612	3402	3023	2561
$\operatorname{overlap}_{1}$	10727	8916	10808	687	646	628	3576	3232	3596
$\operatorname{precision}_{1}$	10727	8902	10768	687	645	638	3572	3238	3607
$\operatorname{overlap}_{2}$	9519	7598	9221	673	620	580	3446	2854	2887
$\operatorname{precision}_{2}$	9522	7590	9248	673	620	581	3444	2848	2892
$\operatorname{lstm\_states}$	9208	7613	10133	668	629	655	3395	2891	3331
$\operatorname{bleu_{4}}$ (MBR)	9159	6512	6763	667	612	570	3392	2666	2446
$\operatorname{smoothed\_bleu_{4}}$ (MBR)	9206	6522	7019	667	613	560	3396	2675	2415

Table 3: Average length of the generated captions. The reference captions contain on average 10.59 words.

	Average caption length
Beam size $k$	1	2	10	100
Beam search	8.41	8.79	9.18	9.11
Length normalisation	8.41	9.19	10.24	10.43
Diverse decoding	8.41	8.71	9.12	9.15
$\operatorname{overlap}_{1}$	8.41	9.22	10.40	11.20
$\operatorname{precision}_{1}$	8.41	9.21	10.38	11.15
$\operatorname{overlap}_{2}$	8.41	8.96	9.86	10.55
$\operatorname{precision}_{2}$	8.41	8.96	9.86	10.55
$\operatorname{lstm\_states}$	8.41	8.79	9.17	8.82
$\operatorname{bleu_{4}}$ (MBR)	8.41	8.77	9.27	9.32
$\operatorname{smoothed\_bleu_{4}}$ (MBR)	8.41	8.79	9.24	9.13

Table 4: BLEU scores on newstest2014, with range voting applied to the beams obtained with no-copy filtering.

Beam size $k$	1	2	4	10	30	100
Beam search	24.04	25.10	25.36	24.91	23.46	20.56
Length normalisation	24.04	25.19	25.59	25.55	24.40	21.78
Diverse decoding	24.04	24.88	25.17	24.71	23.49	20.82
Diverse beam search	24.04	24.55	24.70	23.93	22.14	18.38
Beam search (no copy)	23.96	25.10	25.43	25.23	24.38	22.59
$\operatorname{overlap}_{1}$	23.96	25.17	25.48	25.55	24.97	24.20
$\operatorname{precision}_{1}$	23.96	25.17	25.47	25.54	24.95	24.21
$\operatorname{overlap}_{2}$	23.96	25.14	25.49	25.70	25.08	24.62
$\operatorname{precision}_{2}$	23.96	25.20	25.53	25.39	24.69	23.96
$\operatorname{transformer\_states}$	23.96	25.10	25.44	25.51	24.67	23.36
$\operatorname{bleu_{4}}$ (MBR)	23.96	25.09	25.42	25.51	24.79	23.53
$\operatorname{smoothed\_bleu_{4}}$ (MBR)	23.96	25.10	25.42	25.51	24.81	23.65

Equations

$\operatorname{score}(\mathbf{c})=\sum_{\mathbf{v}\in\mathcal{V}}P(\mathbf{v})\cdot\operatorname{sim}(\mathbf{v},\mathbf{c})$

$\operatorname{precision}_{n}(\mathbf{v},\mathbf{c})$

$\operatorname{overlap}_{n}(\mathbf{v},\mathbf{c})$

Full text

Sebastian Borgeaud

DeepMind & University of Cambridge

Guy Emerson

University of Cambridge

Abstract

We propose a method for natural language generation, choosing the most representative output rather than the most likely output. By viewing the language generation process from the voting theory perspective, we define representativeness using range voting and a similarity measure. The proposed method can be applied when generating from any probabilistic language model, including n-gram models and neural network models. We evaluate different similarity measures on an image captioning task and a machine translation task, and show that our method generates longer and more diverse sentences, providing a solution to the common problem of short outputs being preferred over longer and more informative ones. The generated sentences obtain higher BLEU scores, particularly when the beam size is large. We also perform a human evaluation on both tasks and find that the outputs generated using our method are rated higher.

1 Introduction

A language model specifies a probability distribution over sequences of words: given a sequence ${s=x_{1}x_{2}\cdots x_{n}}$ of length $n$ , the model assigns a probability $P(s)$ to the entire sequence. The probability distribution may be conditioned: for example in machine translation the distribution is conditioned on the source language sentence.

In many applications, it is desirable to output a single sequence, rather than a distribution. A common approach is to choose the most likely sequence. However, this is problematic when the most likely sequence is not representative of the whole distribution.

For example, in dialogue generation tasks, the most likely output can be “I don’t know”, even when most of the probability mass is assigned to long informative sequences. Cao and Clark (2017) call this the “boring output problem”.

For a real-valued distribution, we can choose a representative output by taking the mean. However, for a discrete distribution (such as over sequences), the mean is not defined. In this paper, we choose a representative output using tools from voting theory, allowing us to avoid the boring output problem. The general idea is that, if the distribution assigns most of the probability mass to a group of similar sequences, we would like to generate one of them – even if they have low probability as individual sequences, they have high probability as a group. We can formulate this process as a range voting election, where the sentences vote for each other, with the strength of a vote being proportional to the similarity between the voter sequence and the candidate sequence.

Our approach can be used to mitigate problems commonly associated with language models. For example, a long-recognised problem is that shorter sequences are assigned higher probabilities and thus choosing the most likely sequence favours short sequences (Brown et al., 1995). Indeed, Stahlberg and Byrne (2019) show that the most likely output in machine translation is often the empty string. By designing the similarity function to be asymmetric such that more informative candidate sequences receive stronger votes, we can generate longer and more diverse outputs (see Fig. 1 for an example).

We focus on simple similarity metrics based on n-grams and generate the candidates and voters using beam search. We evaluate on two tasks: image captioning and machine translation. For both tasks, we find that our approach achieves higher BLEU scores, and performs better in a human evaluation. Our approach also generates longer and more diverse outputs, with the generated length and diversity more closely matching the length and diversity of the reference captions and reference translations.

2 Related work

Much work has gone into analysing sources of errors in language generation, often focused on machine translation. Koehn and Knowles (2017) raise 6 challenges for machine translation, including degrading performance for longer sentences, and degrading performance for larger beam sizes. Stahlberg and Byrne (2019) distinguish model errors (high probabilities of bad sequences) and search errors (failing to find sequences preferred by the model). They show that the global optimal translations (according to likelihood) are considerably worse than translations found by beam search. This points to both serious model errors and serious search errors, which cancel out to some degree. This suggests there is much work to be done in improving both our models and our search objectives – the latter is the aim of this paper.

Ott et al. (2018) find that beam search typically covers only a small proportion of the model’s probability mass,¹ and they show that the degradation for large beams is at least partly due to the training data containing target sentences that are exact copies of source sentences. They also suggest that beam search is an effective search strategy for the maximum-likelihood search objective, finding hypotheses with higher model probabilities than the reference translations.

¹ Since our paper was submitted, this finding was replicated by Eikema and Aziz (2020), who further argue that the maximum-likelihood decoding objective is hard to justify when the maximum likelihood is so low.

Cohen and Beck (2019) also find a performance degradation with larger beam sizes across different tasks (translation, image captioning and summarisation) and propose to add a search discrepancy heuristic to beam search. For image captioning, Vinyals et al. (2017) show that larger beams not only decrease performance but also reduce the diversity of the captions. They claim this is an overfitting effect and propose the use of small beam sizes as further regularization.

In unconditional, open-ended language generation, Holtzman et al. (2020) find that using likelihood as the decoding objective leads to bland and repetitive text with unnaturally high probability and too little variance. They claim this is not due to a search error, but due to the maximum-likelihood decoding objective. They propose nucleus sampling: sampling from the distribution after truncating it to the most probable tokens that together account for the top $p$ of the probability mass.
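As a rough illustration of truncated sampling, the sketch below truncates a next-token distribution in two ways before sampling: by keeping a fixed number of tokens (top-k) and by keeping a fixed share of the probability mass (nucleus). The function names, the dictionary representation and the toy distribution are assumptions made for this example; they are not taken from Holtzman et al. (2020) or from any library.

```python
import random
from typing import Dict

def truncate_top_k(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalise."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def truncate_nucleus(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}

def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from a (renormalised) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Toy next-token distribution (made up for this example).
next_token = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
nucleus = truncate_nucleus(next_token, top_p=0.9)  # only "zebra" falls outside
token = sample(nucleus, random.Random(0))
```

In a full decoder the truncation would be applied at every time step to the model's conditional distribution; here a single step is shown.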

2.1 Generation length and diversity

To increase the length and diversity of a model’s outputs, some authors have proposed changes to the model architecture. In dialogue generation, Cao and Clark (2017) use a latent variable model to capture the possible “topics” of a response.

Others have proposed changing the objective function. In dialogue generation, Li et al. (2016a) optimise mutual information instead of probability. In machine translation, Tu et al. (2017) modify an encoder-decoder model by adding a “reconstructor” to predict the input based on the output.

However, modifying the model or the objective function depends on the particular task, and applying these techniques to an existing system requires retraining the model. In this paper, we focus on general methods which can be applied to any probabilistic model in any generation task. Length normalisation (Wu et al., 2016; Freitag and Al-Onaizan, 2017) explicitly penalises shorter sequences during the beam expansion phase by dividing the log-probability of a sequence by its length. Diverse decoding (Li et al., 2016b; Li and Jurafsky, 2016) penalises repeated expansions of the same beam node. Diverse beam search (Vijayakumar et al., 2018) penalises generation of similar beams using their Hamming diversity. These last two methods aim to increase the diversity within a beam, but not necessarily across the dataset.
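To make the length normalisation baseline concrete, here is a minimal sketch that divides a hypothesis's log-probability by its length, as described above. It re-scores complete hypotheses only, which is a simplification of applying the penalty during beam expansion; the `Hypothesis` type and the two example hypotheses are invented for illustration.

```python
from typing import List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, log-probability); hypothetical type

def length_normalised_score(hyp: Hypothesis) -> float:
    """Log-probability divided by length (guarding against empty sequences)."""
    tokens, log_prob = hyp
    return log_prob / max(len(tokens), 1)

# Made-up hypotheses: the short one has the higher raw log-probability.
short = (["i", "don't", "know"], -3.0)
long = (["a", "dog", "runs", "across", "the", "wet", "grass"], -4.9)

best_raw = max([short, long], key=lambda h: h[1])
best_normalised = max([short, long], key=length_normalised_score)
```

With these numbers the raw log-probability prefers the short hypothesis, while the per-token score prefers the long one.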

Kool et al. (2019) propose a stochastic beam search based on the Gumbel-Top- $k$ trick to sample without replacement. The proposed approach can trade off BLEU score against translation diversity.

Finally, it is important to make sure that improvements to a model can be properly evaluated. After our paper was submitted, Freitag et al. (2020) report that the references used in machine translation often exhibit poor diversity, which can unfairly penalise models which exhibit good diversity. They propose to use paraphrased reference translations instead. These paraphrases yield higher correlation with human judgement when evaluated using BLEU, and could be used in future work to improve the evaluation of translation systems which aim to generate appropriately diverse outputs.

2.2 Minimum Bayes Risk Decoding

Kumar and Byrne (2004) introduce the Minimum Bayes Risk (MBR) decoder for machine translation. Like our proposed approach, this aims to use the whole distribution, rather than picking the most likely sequence. They frame the problem in terms of Bayes Risk: given the true distribution over outputs, and given a loss function between the system output and the target output, the Bayes Risk is defined as the expected loss. The best output is the one which minimises the Bayes Risk.

However, the true distribution over outputs is not known, so Kumar and Byrne approximate it using the model’s distribution. The MBR decoder first runs beam search, and then re-ranks the resulting hypotheses according to the BLEU scores between sequences in the beam.

Tromble et al. (2008) apply MBR over translation lattices. Shimizu et al. (2012) use MBR with a smoothed BLEU loss function and propose to limit the possible translations to those that are similar to the most likely translation generated by beam search.

Blain et al. (2017) propose to re-rank the sentences generated by beam search using a similarity metric. Their approach is similar to ours but doesn’t include the probability of the sentences given by the decoder, and thus would degrade completely in the limit of very large beam sizes. They find that using BLEU as a similarity metric reduces the quality of generated translations, according to both BLEU and a human evaluation.

3 Method

3.1 Beam search

When working with a distribution over sequences, it is not feasible to consider all possible sequences. Finding the most likely sequence can be computationally expensive – in fact, for an RNN it is undecidable in the general case (Chen et al., 2018). A common solution is to use beam search, which generates the sequence one token at a time, maintaining a list of the $k$ most promising sequences at each time step (for example: Brown et al., 1995; Koehn, 2004a). Greedy search is the special case where $k=1$ .
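The beam search procedure described above can be sketched as follows. This is a generic illustration, not the decoder used in the experiments: the `next_token_probs` callback, the `</s>` end marker and the toy model are all assumptions, and finished hypotheses are simply set aside when they emit the end marker.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]
# Hypothetical model interface: maps a prefix to a next-token distribution.
NextTokenFn = Callable[[Prefix], Dict[str, float]]

def beam_search(next_token_probs: NextTokenFn, k: int, max_len: int,
                eos: str = "</s>") -> List[Tuple[Prefix, float]]:
    """Return up to k finished sequences with their log-probabilities."""
    beam: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        expansions = []
        for prefix, logp in beam:
            for token, p in next_token_probs(prefix).items():
                if p > 0.0:
                    expansions.append((prefix + (token,), logp + math.log(p)))
        # Keep only the k most probable one-token extensions.
        expansions.sort(key=lambda item: item[1], reverse=True)
        beam = []
        for seq, logp in expansions[:k]:
            (finished if seq[-1] == eos else beam).append((seq, logp))
        if not beam:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished[:k]

def toy_model(prefix: Prefix) -> Dict[str, float]:
    """Made-up model in which the best first token does not lead to the best sequence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 1.0},
    }
    return table.get(prefix, {"</s>": 1.0})

greedy = beam_search(toy_model, k=1, max_len=5)
wider = beam_search(toy_model, k=2, max_len=5)
```

On this toy model, greedy search commits to `a` and ends with a sequence of probability 0.3, while a beam of size 2 also keeps `b` and finds the sequence of probability 0.4.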

Beam search introduces an extra hyper-parameter, the beam size $k$ . Increasing $k$ covers more of the search space, but increases the computational cost. It is tempting to assume that increasing $k$ will produce better results, but empirically, the quality of the most likely sequence starts to decrease after $k$ exceeds a certain threshold (Koehn and Knowles, 2017; Cohen and Beck, 2019).

In the next section, we propose an alternative way to generate from a beam, which aims to avoid the drop in performance as beam size increases. Rather than choosing the most likely sequence, we choose the most representative sequence.

3.2 Range voting

To formalise representativeness, we propose to use a voting procedure. Although voting has been applied to ensembles of classifiers (for an overview, see: Kuncheva, 2004; Kuncheva and Rodríguez, 2014), we are not aware of work using voting to select from a distribution.

We can see each sequence as a candidate in an election, and the probability of a sequence as the proportion of votes for that candidate. From this perspective, the problem of probability mass being split across long sequences is the well-known problem of vote splitting. Suppose candidate $i$ wins an election. Now suppose we run the election again, but add an additional candidate $j$ , identical to $i$ . A voting system is robust against vote splitting (and called independent of clones) if the winner must be $i$ or $j$ (Tideman, 1987).

A well-studied system which is independent of clones is range voting (Heckscher, 1892; Smith, 2000; Tideman, 2006; Lagerspetz, 2016). Each voter scores each candidate in the range $[0,1]$ , and the candidate with the highest total score wins.

In our setting, probability mass can be seen as the proportion of votes placing a candidate as first choice (see Fig. 1 for an example). For range voting, we need to augment the votes with scores for all other candidates. We propose to do this using a similarity measure. The final score for a sequence $\mathbf{c}\in\mathcal{C}$ (the set of candidates) is given in (1), for a set of voter sequences $\mathcal{V}$ and a similarity measure $\operatorname{sim}$ .

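$\operatorname{score}(\mathbf{c})=\sum_{\mathbf{v}\in\mathcal{V}}P(\mathbf{v})\cdot\operatorname{sim}(\mathbf{v},\mathbf{c})$ (1)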

A sequence can act as both voter and candidate. Each voter sequence is weighted by its probability, and casts a vote for each candidate sequence, where the strength of the vote is the similarity between the voter and the candidate. The simplest way to apply this method is to use beam search to define both the set of candidates and the set of voters.

This can be seen as a generalisation of taking an average. In a Euclidean space, the mean is equivalent to voting with quadratic similarity ${1-k(x-y)^{2}}$ , and the median is equivalent to voting with linear similarity ${1-k|x-y|}$ , for some constant $k$ .
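A small numerical check of this claim, under assumed values: five equally weighted voters on the real line, candidates restricted to a grid, and constants chosen so that both similarities stay within $[0,1]$. None of these numbers come from the paper.

```python
from typing import Callable, Dict, List

def best_candidate(candidates: List[float], voters: Dict[float, float],
                   sim: Callable[[float, float], float]) -> float:
    """Candidate with the highest probability-weighted total similarity."""
    return max(candidates,
               key=lambda c: sum(p * sim(v, c) for v, p in voters.items()))

# Five equally weighted voters, with one outlier at 10 (made-up data).
voters = {0.0: 0.2, 1.0: 0.2, 2.0: 0.2, 3.0: 0.2, 10.0: 0.2}
candidates = [i / 10 for i in range(101)]  # grid over [0, 10]

def quadratic(v: float, c: float) -> float:
    return 1 - 0.01 * (v - c) ** 2  # stays in [0, 1] on this grid

def linear(v: float, c: float) -> float:
    return 1 - 0.1 * abs(v - c)     # stays in [0, 1] on this grid

winner_quadratic = best_candidate(candidates, voters, quadratic)
winner_linear = best_candidate(candidates, voters, linear)
```

Here the quadratic similarity selects the weighted mean of the voters (3.2), which is pulled towards the outlier, and the linear similarity selects their median (2.0).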

Although the vote splitting problem may appear abstract, it can happen in practice, even without considering similarity. When using subword vocabularies (Sennrich et al., 2016), there are multiple ways of encoding any given sentence. The model’s probability mass is split across sentences with identical surface form but different encodings.
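The scoring rule in (1), with the beam supplying both candidates and voters, can be sketched as below. The `word_overlap` similarity and the three-sentence beam are invented for illustration; in particular, `word_overlap` is a stand-in and not one of the measures evaluated in this paper.

```python
from typing import Callable, Dict, Iterable, Tuple

Seq = Tuple[str, ...]

def vote_scores(candidates: Iterable[Seq], voters: Dict[Seq, float],
                sim: Callable[[Seq, Seq], float]) -> Dict[Seq, float]:
    """score(c) = sum over voters v of P(v) * sim(v, c)."""
    return {c: sum(p * sim(v, c) for v, p in voters.items()) for c in candidates}

def range_vote(candidates: Iterable[Seq], voters: Dict[Seq, float],
               sim: Callable[[Seq, Seq], float]) -> Seq:
    """Return the candidate with the highest total score."""
    scores = vote_scores(candidates, voters, sim)
    return max(scores, key=scores.get)

def word_overlap(v: Seq, c: Seq) -> float:
    """Illustrative similarity: share of the voter's distinct words found in c."""
    return len(set(v) & set(c)) / len(set(v))

# Made-up beam: the two long captions split the vote between them.
beam = {
    ("i", "don't", "know"): 0.4,
    ("a", "dog", "runs", "on", "grass"): 0.3,
    ("a", "dog", "runs", "on", "the", "grass"): 0.3,
}
most_likely = max(beam, key=beam.get)
most_representative = range_vote(beam, beam, word_overlap)
```

The short sentence is the single most likely sequence, but the two similar long sentences vote for each other, so range voting selects the longer of them.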

Defining semantic similarity between sentences is recognised as a hard problem (Achananuparp et al., 2008; Cer et al., 2017; Pawar and Mago, 2019). In this work, we focus on simple, domain-agnostic similarity measures which do not require additional training.

First, we consider similarity based on n-grams. For a sequence $\mathbf{s}$ , we write $\operatorname{set}_{n}(\mathbf{s})$ for its set of n-grams, and $\operatorname{bag}_{n}(\mathbf{s})$ for its bag (or multiset) of n-grams. We define two measures in (2–3). Both are asymmetric, to encourage informative sequences: if $\mathbf{c}$ contains $\mathbf{v}$ plus more information, $\operatorname{sim}(\mathbf{v},\mathbf{c})$ should be high, but if $\mathbf{c}$ contains less information, then $\operatorname{sim}(\mathbf{v},\mathbf{c})$ should be lower. This allows an informative candidate sequence to gather more votes.

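$\operatorname{precision}_{n}(\mathbf{v},\mathbf{c})=\dfrac{|\operatorname{bag}_{n}(\mathbf{v})\cap\operatorname{bag}_{n}(\mathbf{c})|}{|\operatorname{bag}_{n}(\mathbf{v})|}$ (2)

$\operatorname{overlap}_{n}(\mathbf{v},\mathbf{c})=\dfrac{|\operatorname{set}_{n}(\mathbf{v})\cap\operatorname{set}_{n}(\mathbf{c})|}{|\operatorname{set}_{n}(\mathbf{v})|}$ (3)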
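A possible implementation of the two n-gram measures is sketched below. It assumes that both count the voter's n-grams that also occur in the candidate and normalise by the voter's n-gram count (bags for $\operatorname{precision}_{n}$, sets for $\operatorname{overlap}_{n}$), which is consistent with the asymmetry described above; the helper names and example sentences are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def ngrams(seq: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence, in order."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def precision_n(v: Sequence[str], c: Sequence[str], n: int) -> float:
    """Assumed definition: multiset overlap divided by the voter's n-gram count."""
    bag_v, bag_c = Counter(ngrams(v, n)), Counter(ngrams(c, n))
    total = sum(bag_v.values())
    if total == 0:
        return 0.0  # voter too short to contain any n-gram
    return sum((bag_v & bag_c).values()) / total

def overlap_n(v: Sequence[str], c: Sequence[str], n: int) -> float:
    """Assumed definition: set overlap divided by the voter's distinct n-grams."""
    set_v, set_c = set(ngrams(v, n)), set(ngrams(c, n))
    if not set_v:
        return 0.0
    return len(set_v & set_c) / len(set_v)

short = "a dog runs".split()
long = "a dog runs on the grass".split()
```

Under these definitions a short voter gives a full-strength vote to a longer candidate that contains it (`overlap_n(short, long, 1)` is 1.0), while the longer voter gives only a partial vote to the short candidate (`overlap_n(long, short, 1)` is 0.5).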

Second, inspired by Mueller and Thyagarajan (2016), we consider a similarity measure based on the hidden states of the decoders (LSTM and Transformer) during generation (see §4.1). For each sequence, we find the average of the hidden states, and then compute the cosine similarity. We refer to this measure as $\operatorname{lstm\_states}$ and $\operatorname{transformer\_states}$ .
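The hidden-state measure can be sketched in plain Python as below, assuming each sequence comes with a list of per-token hidden-state vectors of equal dimension. How the states are extracted from the decoder is not shown, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

def mean_state(states: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-token hidden states of one sequence."""
    dim = len(states[0])
    return [sum(s[i] for s in states) / len(states) for i in range(dim)]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def states_similarity(voter_states: Sequence[Sequence[float]],
                      candidate_states: Sequence[Sequence[float]]) -> float:
    """Cosine similarity between the averaged hidden states of two sequences."""
    return cosine(mean_state(voter_states), mean_state(candidate_states))
```

Unlike the n-gram measures, this similarity is symmetric in its two arguments. Cosine similarity can also be negative, and this sketch does not rescale it into $[0,1]$.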

3.3 Comparison with MBR Decoding

The formulation used for range voting is reminiscent of MBR decoding (see §2.2). In fact, if the similarity measure in (1) is $\operatorname{sim}(\mathbf{v},\mathbf{c})=\textrm{BLEU}(\mathbf{c},\mathbf{v})$ , range voting recovers MBR decoding. From a theoretical point of view, range voting provides an independent motivation for MBR decoding, and furthermore, one which does not require the assumption that we can approximate the true distribution by the model’s distribution. We know that the model’s distribution does not match the true distribution (or else we would have already solved the task), and so this is a strong assumption to make.
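The correspondence can be written out in a few lines, assuming the MBR loss is taken to be $1-\textrm{BLEU}$ and the expectation is over the same voter set $\mathcal{V}$ under the model's probabilities (both assumptions of this sketch):

```latex
\begin{align*}
\hat{\mathbf{c}}_{\mathrm{MBR}}
  &= \operatorname*{arg\,min}_{\mathbf{c}\in\mathcal{C}}
     \sum_{\mathbf{v}\in\mathcal{V}} P(\mathbf{v})\,
     \bigl(1-\textrm{BLEU}(\mathbf{c},\mathbf{v})\bigr) \\
  &= \operatorname*{arg\,min}_{\mathbf{c}\in\mathcal{C}}
     \Bigl[\sum_{\mathbf{v}\in\mathcal{V}} P(\mathbf{v})
     - \sum_{\mathbf{v}\in\mathcal{V}} P(\mathbf{v})\,
       \textrm{BLEU}(\mathbf{c},\mathbf{v})\Bigr] \\
  &= \operatorname*{arg\,max}_{\mathbf{c}\in\mathcal{C}}
     \sum_{\mathbf{v}\in\mathcal{V}} P(\mathbf{v})\cdot
     \operatorname{sim}(\mathbf{v},\mathbf{c}),
  \qquad \operatorname{sim}(\mathbf{v},\mathbf{c})=\textrm{BLEU}(\mathbf{c},\mathbf{v}).
\end{align*}
```

The first sum does not depend on $\mathbf{c}$, so minimising the expected loss and maximising the score in (1) select the same candidate.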

From a practical point of view, range voting suggests that any similarity measure could be used, and not necessarily the evaluation metric. Using BLEU has several disadvantages. Firstly, BLEU can be harsh: when there are no 3- or 4-gram matches, the score is 0. Secondly, BLEU is a corpus-level metric which does not decompose over sentences. Finally, BLEU is precision-based, penalising translations containing information that is not in the reference. In MBR, this means that candidate sequences are penalised for containing more information than voter sequences. Our proposed similarity measures are asymmetric in the opposite direction, to encourage generation of long and informative sequences.

Indeed, in our experiments, we have found that simple similarity measures produce longer and more diverse sentences than BLEU, and for translation better results, even though BLEU is used as the evaluation metric.

Furthermore, the voting theory perspective can yield analytical insights even when range voting is not used. For example, the performance degradation found by Cohen and Beck (2019) can be interpreted in terms of vote splitting. They argue for the need to filter out sequences which begin with a low-probability token that is followed by very-high-probability tokens, in favour of sequences where all tokens have fairly high probability. The sequences they want to filter out have not split the vote (later tokens have probability close to 1, so there are no similar sequences that have high probability), but the sequences they want to keep have split the vote (there are similar sequences with similar probability). Their method aims to remove these problematic sequences that don’t split the vote, while our method aims to be robust against vote splitting.

4 Experiments

We evaluate our method on two tasks: image captioning and machine translation. For MBR, we use BLEU ( $\operatorname{bleu_{4}}$ ) and a smoothed version of BLEU ( $\operatorname{smoothed\_bleu_{4}}$ ) which adds 1 to the $n$ -gram counts for $n{>}1$ to mitigate the harshness of the metric (Shimizu et al., 2012).
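To show why smoothing matters at the sentence level, the following sketch computes a single-reference sentence BLEU with an optional add-one smoothing for $n{>}1$. It assumes the smoothing adds 1 to both the matched and the total n-gram counts, which may differ in detail from Shimizu et al. (2012); the function names and example sentences are hypothetical, and this is not the NLTK or SacreBleu implementation.

```python
import math
from collections import Counter
from typing import Sequence

def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu4(candidate: Sequence[str], reference: Sequence[str],
                   smooth: bool = False) -> float:
    """Geometric mean of 1- to 4-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, 5):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        matches = sum((cand & ref).values())  # clipped n-gram matches
        total = sum(cand.values())
        if smooth and n > 1:
            # Assumed smoothing: add one to numerator and denominator.
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0  # the harsh case: one empty precision zeroes the score
        log_precisions.append(math.log(matches / total))
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / 4)

cand = "a dog runs on grass".split()
ref = "a dog sits on grass".split()
```

For this pair there are no matching 3-grams or 4-grams, so the unsmoothed score is 0 while the smoothed score is positive.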

We consider two baselines: length normalisation and diverse decoding, described in §2.1. For machine translation, we also consider diverse beam search as a further baseline. Other methods mentioned in §2 cannot be straightforwardly applied as they require modifying the model or the training objective.

4.1 Image captioning

We use the MSCOCO dataset (Lin et al., 2014), which consists of 82,783 training images and 40,504 validation images, each annotated with 5 captions from human annotators.

We use the “Show and Tell” encoder-decoder architecture of Vinyals et al. (2015). The encoder is a pretrained Inception V3 CNN (Szegedy et al., 2016) from which we extract a feature vector from the final pooling layer (Ioffe and Szegedy, 2015). The decoder is an LSTM (Hochreiter and Schmidhuber, 1997) with 512 hidden units, initialising the hidden state using the encoder. The vocabulary consists of the 5000 most common words in the training captions, for which embeddings of size 512 are learned from scratch.

4.1.1 BLEU scores

Table 1 shows BLEU scores (Papineni et al., 2002) on the MSCOCO validation set computed using NLTK (Bird et al., 2009). The bigram similarity measures and the $\operatorname{lstm\_states}$ measure improve BLEU scores for almost all beam sizes. In contrast, diverse decoding has almost no effect on BLEU, while length normalisation performs worse than standard beam search. The best result with our similarity metrics is achieved by $\operatorname{lstm\_states}$ at $k{=}100$ . This is significantly better than the best result for standard beam search ( $k{=}10$ ), with $p{<}0.001$ for a paired bootstrap test following Koehn (2004b). Using $\operatorname{smoothed\_bleu_{4}}$ and increasing the beam size to $k{=}100$ gives the overall best results.
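A paired bootstrap comparison in the spirit of Koehn (2004b) can be sketched as follows. This simplified version assumes per-sentence scores that can be summed over a resampled test set; since BLEU is a corpus-level metric, a faithful version would instead recompute corpus BLEU on each resample. The function name, sample count and seed are arbitrary choices for illustration.

```python
import random
from typing import Sequence

def paired_bootstrap(scores_a: Sequence[float], scores_b: Sequence[float],
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A does NOT beat system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement, using the same
        # indices for both systems so that the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1 - wins / n_samples
```

A small returned value indicates that system A outperforms system B on nearly all resampled test sets.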

Sampling methods proposed for open-ended generation perform poorly. Top-k sampling (Fan et al., 2018) achieves BLEU scores of 17.15 $(k{=}4)$ and 13.79 $(k{=}10)$, and nucleus sampling (Holtzman et al., 2020) achieves a score of 13.62 $(\textrm{top\_p}{=}0.9)$.

Consistent with Ott et al. (2018) and Koehn and Knowles (2017), increasing $k$ too far reduces BLEU for beam search. However, this drop does not occur for our voting method.

4.1.2 Caption length

To analyse differences between methods, we first look at caption length, shown in Table 3. Standard beam search produces slightly longer captions as $k$ increases up to 10. All n-gram measures generate longer captions than standard beam search, and length continues to increase as $k$ goes to 100. Length normalisation also increases caption length, but this is at the cost of BLEU score (see §4.1.1). Diverse decoding does not increase caption length. The $\operatorname{lstm\_states}$ measure produces slightly shorter captions – as it is symmetric, it does not favour long sequences as the asymmetric n-gram measures do (see §3.2). As predicted by our range voting interpretation, MBR, for which the asymmetry is in the opposite direction, produces shorter captions than the simple n-gram similarity metrics.

4.1.3 Caption diversity

Following the approach of Li et al. (2016a), Dhingra et al. (2017), and Xu et al. (2017, 2018), we investigate the diversity of the generated captions by counting the number of distinct captions, unigrams, and bigrams (see Table 2).
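The three diversity counts can be computed as in the sketch below, assuming captions are given as lists of tokens and that counts are taken over the whole set of generated captions. The function name and the example captions are invented.

```python
from typing import List, Sequence, Tuple

def diversity_counts(captions: Sequence[List[str]]) -> Tuple[int, int, int]:
    """Return (#distinct captions, #distinct unigrams, #distinct bigrams)."""
    distinct_captions = {tuple(c) for c in captions}
    unigrams = {tok for c in captions for tok in c}
    bigrams = {(c[i], c[i + 1]) for c in captions for i in range(len(c) - 1)}
    return len(distinct_captions), len(unigrams), len(bigrams)

# Made-up output of a captioning system on three images.
generated = [["a", "dog"], ["a", "dog"], ["a", "cat", "sits"]]
counts = diversity_counts(generated)
```

For the three example captions this gives 2 distinct captions, 4 distinct unigrams and 3 distinct bigrams.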

For standard beam search, the number of distinct captions drops as $k$ increases. Both baselines weaken this effect, but the drop is still present. In contrast, range voting maintains caption diversity as $k$ increases, for all similarity measures.

Similarly, standard beam search sees a drop in the number of distinct unigrams and bigrams as $k$ increases, and the baselines do not seem to mitigate this. In contrast, the unigram measures and the $\operatorname{lstm\_states}$ measure maintain both unigram diversity and bigram diversity as $k$ increases, while the bigram measures partially maintain bigram diversity. As expected from our range voting perspective, MBR generates less diverse captions.

4.1.4 Human evaluation

BLEU is known to be imperfect, and does not always match human judgements (Callison-Burch et al., 2006; Blain et al., 2017). While the n-gram similarity measures produce similar BLEU scores to standard beam search, they also produce longer captions, which are potentially more informative. To investigate whether they are more informative in a way that is not reflected by BLEU, we took 500 validation images for human evaluation, comparing the captions produced by standard beam search ( $k{=}10$ ) against our best-performing n-gram measure ( $\operatorname{precision}_{2}$ , $k{=}100$ ). Each pair of captions was presented in a random order, with the original image, and judged on a five-point scale (one caption much better, slightly better, or no difference).

The voted caption was rated better 106 times, and worse 73 times. This is statistically significant, with $p{=}0.0165$ for a two-tailed sign test, discarding ties (Emerson and Simon, 1979). However, for captions rated much better, the voted caption was better 27 times and worse 40 times. This is suggestive but not fully significant ( $p{=}0.142$ ).
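A two-tailed sign test that discards ties reduces to an exact binomial test on the better and worse counts. The sketch below is one way to compute it; the function name is hypothetical, and it assumes each non-tied judgement is equally likely to go either way under the null hypothesis.

```python
from math import comb

def sign_test_p(better: int, worse: int) -> float:
    """Two-tailed exact sign test p-value, with ties already discarded."""
    n = better + worse
    k = min(better, worse)
    # Probability of a split at least this lopsided in one direction,
    # under a fair coin, doubled for the two-tailed test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_all = sign_test_p(106, 73)   # all non-tied judgements
p_much = sign_test_p(27, 40)   # "much better" judgements only
```

Applied to the counts above, this should give p-values close to the reported 0.0165 and 0.142.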

These results support the claim that a voted caption represents more of the information present in a model’s distribution over captions – this often leads to a better caption, but where the model is wrong, adding wrong information can make the caption much worse. After all, our method is designed as a better way to select from a distribution, not as an improvement to the distribution itself.

4.2 Machine translation

For the translation task, we use the WMT’14 English-German dataset, consisting of 4.5M sentence pairs. We train a Transformer ‘big’ model (Vaswani et al., 2017), implemented in the Tensor2Tensor library (Vaswani et al., 2018). We use the joint source and target byte-pair encoding vocabulary (Sennrich et al., 2016) with 32,000 tokens available on Tensor2Tensor. All results reported are for the newstest2014 test set, containing 2737 sentence pairs (Bojar et al., 2014).²

² We are evaluating systems translating from English into German, but half of the newstest2014 sentences were originally in German and translated into English. Translation artifacts are known to have an impact on machine translation performance (for example: Kurokawa et al., 2009; Holmqvist et al., 2009; Lembersky et al., 2012). One reviewer asked whether there is a difference in performance for the two halves of the dataset, as found by Freitag et al. (2019). In terms of BLEU score, range voting appears more effective for forward-translation (original text in English), but in terms of manual evaluation, it appears more effective for backward-translation (original text in German). For reasons of space, we only report results for the whole dataset.

The BLEU scores were computed using SacreBleu (Post, 2018).

Ott et al. (2018) found that a common source of model error comes from outputting a copy of the input sentence, still in the source language. We also observe this phenomenon: with beam size 4, 0.4% of the outputs are exact copies of the input. This increases to 3.8% of the outputs for beam size 100. When counting the number of partial copies,³ the effect is even stronger: for beam sizes 4 and 100, respectively 1.3% and 12.4% of the generated translations are partial copies. Because of this, we add the method proposed by Ott et al. (2018), which filters out partial copies during beam search, as an extra baseline.

³ A partial copy is defined to be a generated sentence containing at least 50% of the unigrams in the input sentence.
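The partial-copy criterion can be checked with a few lines of code. This sketch assumes the 50% threshold is measured over distinct unigram types of the input sentence (rather than token occurrences) and that sentences are already tokenised; the function name and the example sentences are made up.

```python
from typing import Sequence

def is_partial_copy(source: Sequence[str], output: Sequence[str],
                    threshold: float = 0.5) -> bool:
    """True if the output contains at least `threshold` of the source's unigrams."""
    source_types = set(source)
    if not source_types:
        return False
    shared = source_types & set(output)
    return len(shared) / len(source_types) >= threshold

source = "the cat sat on the mat".split()
mostly_copied = "die Katze sat on the Matte".split()
translated = "die Katze sass auf der Matte".split()
```

Here `mostly_copied` shares three of the five distinct source words and would be filtered out, while `translated` shares none.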

4.2.1 BLEU scores

The BLEU scores obtained on the WMT’14 En-De newstest2014 test set are shown in Table 4.

For beam search and all considered baselines, the scores for the larger beam sizes drop considerably. Adding the copy pruning heuristic from Ott et al. (2018) does help mitigate this problem somewhat but does not solve it: there is almost a 3 BLEU point drop between $k{=}4$ and $k{=}100$ .

To decouple a trivial source of model errors (input copies) from search errors, we apply our range voting method on the beams obtained with the filtering heuristic (Table 4, bottom half). Regardless of which similarity metric is used, re-ranking using range voting improves the BLEU score, and with the $\operatorname{overlap}_{2}$ similarity, we achieve the best overall score of 25.70. Furthermore, range voting reduces the performance drop at large beam sizes: for $\operatorname{overlap}_{2}$ , the drop is about 1 BLEU point.

There are two possible reasons for lower performance at larger beams: (1) different candidates: the sentence selected for a small beam is not in the larger beam; or (2) different voter preferences: the sentence selected for a small beam size is still there, but range voting selects a different sentence. In fact, both phenomena occur. First, for beam search and all similarity metrics, about 10% and 5% of the sentences selected at $k{=}4$ and $k{=}10$ respectively are not in the beam of size 100. Second, 48% and 61% of the sentences chosen by standard beam search with $k{=}4$ and $k{=}10$ respectively are also chosen for $k{=}100$ , but this drops to 32% and 36% respectively when using range voting with $\operatorname{overlap}_{2}$ similarity. This suggests that generating candidates and voters independently could lead to further improvements, which we explore in §3.

Sampling methods also perform poorly on this task. Top-$k$ sampling achieves BLEU scores of 17.39 $(k{=}4)$ and 15.21 $(k{=}10)$ , and nucleus sampling achieves a score of 10.10 $(\textrm{top\_p}{=}0.9)$ .

4.2.2 Translation length

The average lengths of the generated translations are shown in Figure 2. All similarity metrics generate longer translations than standard beam search with and without filtering, but shorter than length normalisation. At beam size $k{=}100$ , length normalised beam search generates almost an extra word per translation compared to $k{=}30$ .

Just as for image captioning, the length of translations generated by standard beam search decreases as the beam size increases. We again note that the translations generated by range voting with asymmetric similarity metrics are on average longer, except for MBR where the asymmetry in the similarity metric penalises longer candidates. However, it is no longer the case that increasing the beam size also increases the length of the translations generated by range voting.

4.2.3 Translation diversity

The numbers of distinct bigrams generated are shown in Figure 2. Out of diverse decoding and diverse beam search, which aim to increase diversity within a beam, only diverse decoding increases the number of generated bigrams compared to beam search. Length normalisation generates the most unique bigrams, and this increases with beam size, also due to the translations being longer on average. On the other hand, the copy filtering heuristic decreases the number of distinct bigrams generated. Just as for image captioning, range voting increases the diversity of the generated translations. For all similarity metrics, more unique bigrams are generated than beam search with copy filtering (on top of which range voting was applied). Furthermore, the simple n-gram metrics generate more unique bigrams than standard beam search, recovering the drop occurring for the filtering heuristic.
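As an illustration of the diversity measure, the following sketch counts the distinct bigrams over a set of generated outputs. Whitespace tokenisation and the function name `count_distinct_bigrams` are assumptions made for this example; the tokenisation used for Figure 2 may differ.

```python
from typing import Iterable


def count_distinct_bigrams(sentences: Iterable[str]) -> int:
    """Count the distinct adjacent token pairs across all sentences.

    Assumption: tokens are whitespace-separated; bigrams do not cross
    sentence boundaries.
    """
    seen = set()
    for sentence in sentences:
        tokens = sentence.split()
        # zip pairs each token with its successor; one-token sentences add nothing.
        seen.update(zip(tokens, tokens[1:]))
    return len(seen)


if __name__ == "__main__":
    outputs = ["a b c", "b c d"]
    print(count_distinct_bigrams(outputs))
```

Because the count is taken over the whole output set, it grows both when individual translations get longer and when different translations use different wording.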

4.2.4 Human evaluation

We used a human evaluation to investigate differences not reflected by BLEU. For 500 sentences, we compared the strongest baseline (length normalisation, $k{=}4$ ) with range voting ( $\operatorname{precision}_{2}$ , $k{=}10$ , as this performed well on BLEU, length, and diversity), following the same procedure as in §4.1.4. The voted translation was rated better 69 times, and worse 44 times. This is statistically significant, with $p{=}0.0235$ for a two-tailed sign test. For translations rated much better, the difference is not significant (36 better, 28 worse).
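The significance test can be sketched as an exact two-tailed sign test, in which ties are discarded and the better/worse counts are compared against a fair-coin binomial. The function name `sign_test_two_tailed` is an assumption for this example, and the sketch is not necessarily the exact computation used for the reported value.

```python
from math import comb


def sign_test_two_tailed(wins: int, losses: int) -> float:
    """Exact two-tailed sign test p-value under H0: P(win) = 0.5.

    Ties are assumed to have been discarded before counting.
    """
    n = wins + losses
    k = min(wins, losses)
    # Probability of observing k or fewer of the rarer outcome.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail, capped at 1.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(sign_test_two_tailed(69, 44))  # better vs. worse
    print(sign_test_two_tailed(36, 28))  # much better vs. much worse
```

With the counts above, the first comparison falls below the 0.05 level and the second does not, matching the pattern reported in the text.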

4.2.5 Including more voters

The range voting formulation does not require the set of candidates $\mathcal{C}$ and the set of voters $\mathcal{V}$ to be the same (see Equation 1). We can capture more knowledge from the underlying distribution by using a larger and more diverse set of voters (which could be acquired more efficiently by repeated sampling), whilst constraining the set of candidates to avoid model errors. This was similarly done by Tromble et al. (2008), who refer to the sets of voters and candidates as the “evidence” and “hypothesis” spaces.
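A minimal sketch of voting with separate candidate and voter sets is given below. It assumes that each voter's ballot is weighted (for example by its model probability) and that the winner maximises the weighted sum of similarities. The similarity `bigram_jaccard` is a simple stand-in chosen for this example and is not the $\operatorname{overlap}_{2}$ measure defined earlier in the paper; all function names are hypothetical.

```python
from typing import Callable, Sequence, Set, Tuple


def bigrams(sentence: str) -> Set[Tuple[str, str]]:
    tokens = sentence.split()
    return set(zip(tokens, tokens[1:]))


def bigram_jaccard(voter: str, candidate: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the paper's overlap_2."""
    a, b = bigrams(voter), bigrams(candidate)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def range_vote(
    candidates: Sequence[str],
    voters: Sequence[str],
    voter_weights: Sequence[float],
    similarity: Callable[[str, str], float] = bigram_jaccard,
) -> str:
    """Return the candidate with the highest weighted sum of voter scores."""

    def total(candidate: str) -> float:
        return sum(
            weight * similarity(voter, candidate)
            for voter, weight in zip(voters, voter_weights)
        )

    return max(candidates, key=total)


if __name__ == "__main__":
    cands = ["a b c", "x y z"]            # small, filtered candidate set
    voters = ["a b c d", "a b e", "x y z"]  # larger voter set
    weights = [0.4, 0.4, 0.2]               # assumed voter probabilities
    print(range_vote(cands, voters, weights))
```

Keeping the candidate list short while enlarging the voter list changes only the sum inside `total`, so the cost grows linearly in the number of voters.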

For the voters, we increase $k$ from 4 to 1000 and apply 3 different search methods: sampling $k$ times, stochastic beam search (Kool et al., 2019), and beam search with copy filtering. For the candidates we use beam search with copy filtering and $k{=}4$ . We fix the similarity metric to $\operatorname{overlap}_{2}$ , which was the best performing metric for large $k{\geq}4$ (§4.2.1).

For all 3 generation methods, increasing the number of voters increases BLEU (Figure 3), suggesting that the previous drop in performance is due to worse candidates in larger beams, rather than worse voter preferences.

5 Conclusion

Instead of generating the most likely sequence, we propose a method to generate the most representative sequence, formalising representativeness using a similarity measure and range voting.

The evaluation on image captioning and machine translation shows that despite using simple similarity measures, we achieve an increase in BLEU score, an increase in caption length and diversity, and statistically significantly better human evaluation performance on both tasks.

For the image captioning task, performance of our method does not drop as beam size increases, removing the sensitivity of results to this hyperparameter. On the machine translation task, performance does drop for larger beam sizes, although by much less than with standard beam search or the baselines. Furthermore, performance increases as the number of voters increases, for a fixed set of candidates.

Using better similarity measures that capture semantics could further improve results and is a promising direction for further research.

Finally, our approach can be applied to any probabilistic language model, without any need for additional training. This opens up many other tasks, including summarisation, dialogue systems, and question answering. If multiple outputs can be used (e.g. offering options to a user), our method can be extended to use reweighted range voting (Smith, 2005), a procedure that elects multiple candidates.
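One way to adapt reweighted range voting to this setting is sketched below. It assumes similarity scores lie in $[0,1]$ and that, after each winner is elected, a voter's ballot weight is divided by one plus the total score that voter gave to the winners so far. This follows the usual reweighting rule with constant 1 and maximum score 1, applied on top of assumed voter probabilities. The function names and the stand-in similarity `unigram_jaccard` are hypothetical.

```python
from typing import Callable, List, Sequence


def unigram_jaccard(voter: str, candidate: str) -> float:
    """Stand-in similarity in [0, 1], used only for this illustration."""
    a, b = set(voter.split()), set(candidate.split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reweighted_range_vote(
    candidates: Sequence[str],
    voters: Sequence[str],
    voter_weights: Sequence[float],
    n_winners: int,
    similarity: Callable[[str, str], float] = unigram_jaccard,
) -> List[str]:
    """Elect up to n_winners candidates, down-weighting already-satisfied voters."""
    winners: List[str] = []
    remaining = list(candidates)
    while remaining and len(winners) < n_winners:

        def total(candidate: str) -> float:
            score = 0.0
            for voter, weight in zip(voters, voter_weights):
                # Voters who liked earlier winners get a smaller say.
                satisfied = sum(similarity(voter, w) for w in winners)
                score += weight / (1.0 + satisfied) * similarity(voter, candidate)
            return score

        best = max(remaining, key=total)
        winners.append(best)
        remaining.remove(best)
    return winners


if __name__ == "__main__":
    sentences = ["a b c", "a b d", "x y z"]
    weights = [0.45, 0.25, 0.3]  # assumed voter probabilities
    print(reweighted_range_vote(sentences, sentences, weights, n_winners=2))
```

In this toy example the second winner is the dissimilar sentence rather than the near-duplicate of the first, which is the behaviour that makes the procedure attractive when several options are shown to a user.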

Acknowledgements

We would like to thank Kris Cao for discussions about distributions over sequences, which prompted the initial idea for this project. We would like to thank Dr. Robert Harle and Prof. Ann Copestake for making this project possible, and for providing some early feedback. We would like to thank Andreas Vlachos, Guy Aglionby, James Thorne, Chris Davis, and the NLIP reading group in Cambridge, for feedback on earlier drafts of this paper. Finally, we would like to thank Chris Dyer for his insightful comments and suggestions.

Appendix A Translation length and diversity

[TABLE]

Bibliography


1. Achananuparp et al. (2008). Palakorn Achananuparp, Xiaohua Hu, and Xiajiong Shen. 2008. The evaluation of sentence similarity measures. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pages 305–316. Springer.
2. Bird et al. (2009). Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media Inc.
3. Blain et al. (2017). Frédéric Blain, Lucia Specia, and Pranava Madhyastha. 2017. Exploring hypotheses spaces in neural machine translation. In Proceedings of the 16th Machine Translation Summit (MT Summit XVI). Asia-Pacific Association for Machine Translation (AAMT).
4. Bojar et al. (2014). Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58.
5. Brown et al. (1995). Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Frederick Jelinek, Jennifer C Lai, and Robert L Mercer. 1995. Method and system for natural language translation. US Patent 5,477,451.
6. Callison-Burch et al. (2006). Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
7. Cao and Clark (2017). Kris Cao and Stephen Clark. 2017. Latent variable dialogue models and their diversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
8. Cer et al. (2017). Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.