Closing the Curious Case of Neural Text Degeneration

Matthew Finlayson; John Hewitt; Alexander Koller; Swabha Swayamdipta; Ashish Sabharwal

arXiv:2310.01693 · cs.CL · October 4, 2023 · 1 citation

TL;DR

This paper provides a theoretical explanation for why truncation sampling heuristics like nucleus sampling are effective in language generation, and proposes a new, more precise truncation strategy leveraging the softmax bottleneck, demonstrating improved results.

Contribution

The paper offers a theoretical analysis of truncation sampling effectiveness and introduces a novel truncation method that surpasses threshold-based approaches in language generation tasks.

Findings

01

A proof that truncation methods which discard tokens below a probability threshold can guarantee that every sampled token has nonzero true probability.

02

Development of a new truncation strategy leveraging the softmax bottleneck.

03

Experimental results showing improved performance over existing methods.
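
The softmax bottleneck named in the second finding refers to a general property of language models: the logits are a linear function of a hidden state whose dimension is much smaller than the vocabulary size, so the matrix of logits over many contexts has rank at most the hidden dimension. The following is a minimal NumPy sketch of that property only. The toy sizes, the random weights, and the helper name `logit_matrix` are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np


def logit_matrix(hidden_states, unembedding):
    """Map hidden states (n_contexts, d) to logits (n_contexts, vocab)."""
    return hidden_states @ unembedding.T


# Hypothetical toy sizes: a 50-token vocabulary and 4-dimensional hidden states.
rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_contexts = 50, 4, 200

W = rng.normal(size=(vocab_size, hidden_dim))   # stand-in output embedding matrix
H = rng.normal(size=(n_contexts, hidden_dim))   # stand-in hidden states

logits = logit_matrix(H, W)

# Although each row has 50 entries, the rows span at most a 4-dimensional subspace.
rank = np.linalg.matrix_rank(logits)
print(rank)  # never larger than hidden_dim
```

Because the rank is capped by `hidden_dim`, the model cannot represent every possible distribution over the vocabulary, which is the source of model error the paper builds on.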

Abstract

Despite their ubiquity in language generation, it remains unknown why truncation sampling heuristics like nucleus sampling are so effective. We provide a theoretical explanation for the effectiveness of truncation sampling by proving that truncation methods that discard tokens below some probability threshold (the most common type of truncation) can guarantee that all sampled tokens have nonzero true probability. However, thresholds are a coarse heuristic, and necessarily discard some tokens with nonzero true probability as well. In pursuit of a more precise sampling strategy, we show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability, without relying on a threshold. Based on our findings, we develop an experimental truncation strategy and present pilot studies demonstrating the promise of this…
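
The truncation heuristics the abstract refers to can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's proposed method: the function names, the example distribution, and the values of `threshold` and `top_p` are hypothetical. The first function discards tokens below a fixed probability threshold; the second is the usual nucleus (top-p) rule.

```python
import numpy as np


def threshold_truncate(probs, threshold):
    """Zero out tokens whose probability is below `threshold`, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = probs >= threshold
    keep[np.argmax(probs)] = True  # always keep the most likely token
    truncated = np.where(keep, probs, 0.0)
    return truncated / truncated.sum()


def nucleus_truncate(probs, top_p):
    """Keep the smallest set of most-likely tokens whose total mass reaches `top_p`."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)                  # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    # Count of tokens needed for the cumulative mass to reach top_p.
    cutoff = min(int(np.searchsorted(cumulative, top_p)) + 1, len(probs))
    keep = np.zeros(probs.shape, dtype=bool)
    keep[order[:cutoff]] = True
    truncated = np.where(keep, probs, 0.0)
    return truncated / truncated.sum()


# Hypothetical next-token distribution over a 5-token vocabulary.
probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])

eps_probs = threshold_truncate(probs, threshold=0.05)
top_p_probs = nucleus_truncate(probs, top_p=0.9)

# Sample one token id from the truncated distribution.
rng = np.random.default_rng(0)
token = rng.choice(len(probs), p=eps_probs)
```

Both rules remove the low-probability tail before sampling. As the abstract notes, a threshold is coarse: it may also discard tokens that do have nonzero true probability.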

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

* The work offers a theoretical explanation for why certain ad-hoc methods used during language generator decoding work well. This is a valuable insight to the NLG community.
* The work then develops a sampling algorithm based on this theoretical explanation.

Weaknesses

* The theoretical portion of the paper is at times difficult to understand due to notational choices and lack of specificity (for example, switching between individual token probabilities). This is particularly important since the theoretical portion is the main contribution of the work.
* The method does not appear to have empirical performance benefits and is computationally expensive, making it impractical.
* There lacks robust empirical justification of the hypothesis. Figure 7, which is int

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Nice idea and analysis
- Well written / clear
- Sheds new light on a well-studied problem and could lead to more promising sampling methods

Weaknesses

- Results are rather weak; the efficacy of the method still remains to be demonstrated (minor)
- A pseudo-code / algorithm box with the practical implementation of BA is needed in the main paper (minor)
- Unclear whether the method will help for larger models or for models where the approximation errors (under-estimation / over-estimation) are small (kinda major).

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 2

Strengths

This is a great analysis paper, providing an interesting explanation for why truncation sampling works so well in language model decoding. The paper's motivation is clear and well-written. The fact that BAT can determine that some tokens have nonzero true support, even though they are assigned less probability than others which are not in the support of the true distribution, is a surprising and compelling result. Leveraging the softmax bottleneck is a clever trick here and one that will be unex

Weaknesses

The primary weakness seems to be the performance of BAT compared to other methods. Despite its theoretical justification, it does not clearly outperform other sampling approaches (Figure 5). Although there is a preference for BAT to eta-sampling shown in Figure 6 and Table 1, this preference is very slight and the comparison is only between two sampling methods. However, I do not see this weakness as a legitimate reason to reject the paper, since its main contribution seems to be analysis and th

Code & Models

Repositories

mattf1n/basis-aware-threshold · jax · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Language and cultural evolution

Methods: Softmax