Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies

Sijin Chen; Omar Hagrass; Jason M. Klusowski

arXiv:2410.03968 · cs.LG · May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Decoding Game, a theoretical framework that models text generation as a game, explaining why heuristic methods like Top-k and Nucleus sampling work well despite lacking formal justification.

Contribution

It provides a formal game-theoretic analysis of decoding strategies, deriving optimal strategies and explaining the success of heuristic truncation-normalization methods.

Findings

01. Truncation-normalization methods are first-order approximations to optimal strategies.

02. Decoding strategies can be understood as regularized likelihood maximization.

03. The framework unifies various decoding methods under a common theoretical model.

Abstract

Decoding strategies play a pivotal role in text generation for modern language models, yet a puzzling gap divides theory and practice. Surprisingly, strategies that should intuitively be optimal, such as Maximum a Posteriori (MAP), often perform poorly in practice. Meanwhile, popular heuristic approaches like Top-$k$ and Nucleus sampling, which employ truncation and normalization of the conditional next-token probabilities, have achieved great empirical success but lack theoretical justifications. In this paper, we propose Decoding Game, a comprehensive theoretical framework which reimagines text generation as a two-player zero-sum game between Strategist, who seeks to produce text credible in the true distribution, and Nature, who distorts the true distribution adversarially. After discussing the decomposability of multi-step generation, we derive the optimal strategy in closed form…
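
The truncation and normalization step that the abstract attributes to Top-$k$ and Nucleus sampling can be sketched as follows. This is a minimal illustration in plain Python of the two heuristics only, not the optimal strategy the paper derives; the function names `top_k`, `nucleus`, and `_renormalize`, the parameter `top_p`, and the example distribution `p_hat` are assumptions introduced here.

```python
def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k(probs, k):
    """Top-k: keep the k most probable tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def nucleus(probs, top_p):
    """Nucleus: keep the smallest high-probability prefix with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return _renormalize(probs, keep)


# Hypothetical next-token distribution over a four-token vocabulary.
p_hat = [0.5, 0.25, 0.125, 0.125]
top2 = top_k(p_hat, 2)      # only the two most likely tokens keep mass
nuc = nucleus(p_hat, 0.8)   # three tokens are needed to reach 0.8 mass
```

In both cases the tail of the distribution is set to zero and the surviving probabilities are rescaled, which is the shared structure the paper sets out to explain.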

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 2

Strengths

* The problem is appropriately motivated. Considering decoding strategies from a theoretical standpoint is essential for the research community.
* The algorithm proposed to solve the suggested minimax optimization problem appears to be easily implementable in practice.

Weaknesses

I'm wondering to what extent this study provides new insights for existing decoding strategies that have achieved empirical success. The author states "To resolve this dichotomy, this paper aims to propose a comprehensive theoretical framework of text generation" in the introduction section, meaning that one of the main goals of this paper is to theoretically explain the success of these existing strategies. However, I'm not convinced that the proposed framework can replicate the widely recogniz

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The theoretical framework is well justified and rigorously supported by proofs. Statements are accurate and notation is consistent. Assumptions are clearly stated and natural. The problem that is addressed is practically significant. The framework generalizes previous decoding schemes, opening the door to new, theoretically grounded heuristics.

Weaknesses

In Prop 3.1, the authors could mention why we need $\epsilon < \max \hat{p}$ - that if we assign non-zero measure to $x_{<t}$, we’d get a cost of $-\infty$. Theorem 4.3 doesn’t say anything about what the $p$ that yields the optimal solution looks like, nor what role $\hat{w}$ plays (in the min problem; the max one is clear). Does that lack structure or could it be added to the Theorem statement? While the game approach to it is intuitive to some, it might be worth emphasizing the perspective

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- The analysis of the decoding process of LLMs is thoroughly considered.
- The proposed decoding is backed up theoretically.
- The experimental results show somewhat better performance than baselines.

Weaknesses

- What is the motivation for choosing the TV-sup norm to compute the difference between the true distribution and the one approximated by LLMs? Is there any theoretical justification behind this selection? If not, at least an ablation study showing the effectiveness of this norm is necessary.
- It is still ambiguous why the authors define the distance between the two mappings the way they do around Line 248.
- The result in Equation (1) is non-trivial. A detailed derivation of it is necessary.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multi-Agent Systems and Negotiation