Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback
Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

TL;DR
This paper proves the first unregularized linear convergence guarantee for Optimistic Multiplicative Weights Update in Nash learning from human feedback, improving understanding of preference-based alignment in large language models.
Contribution
It provides the first convergence analysis of OMWU in NLHF without requiring NE uniqueness, with instance-dependent rates and exponential convergence behavior.
Findings
OMWU achieves last-iterate linear convergence after a burn-in phase.
The probability of rarely played actions grows exponentially, which improves the bound's dependence on instance-dependent constants.
Experiments validate theoretical convergence in tabular and neural policy settings.
Abstract
Aligning large language models (LLMs) with human preferences has proven effective for enhancing model capabilities, yet standard preference modeling using the Bradley-Terry model assumes transitivity, overlooking the inherent complexity of human population preferences. Nash learning from human feedback (NLHF) addresses this by framing non-transitive preferences as a two-player zero-sum game, where alignment reduces to finding the Nash equilibrium (NE). However, existing algorithms typically rely on regularization, incurring unavoidable bias when computing the duality gap in the original game. In this work, we provide the first convergence guarantee for Optimistic Multiplicative Weights Update (OMWU) in NLHF, showing that it achieves last-iterate linear convergence after a burn-in phase whenever an NE with full support exists, with an instance-dependent linear convergence rate…
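To make the setting in the abstract concrete, here is a minimal sketch of unregularized OMWU in self-play on a small non-transitive preference game. Everything in it is an illustrative assumption rather than the paper's setup: the function name `omwu_self_play`, the 3-action rock-paper-scissors-style preference matrix `P` (whose NE is uniform and therefore has full support), the step size, the iteration count, and the starting policy. The update uses one common single-sequence form of OMWU, which may differ from the exact variant analyzed in the paper.

```python
import numpy as np


def omwu_self_play(P, x0, eta=0.3, steps=20000):
    """Run single-sequence OMWU in self-play on a preference game.

    P[i, j] is the probability that action i is preferred over action j,
    so P + P.T is the all-ones matrix. Returns the last iterate and the
    duality gap recorded at every step.
    """
    x = np.asarray(x0, dtype=float)
    g_prev = P @ x  # win probability of each action against the policy
    gaps = []
    for _ in range(steps):
        g = P @ x
        # For a symmetric preference game the duality gap of (x, x)
        # simplifies to 2 * max_i (P x)_i - 1.
        gaps.append(2.0 * g.max() - 1.0)
        # Optimistic step: use the extrapolated payoff 2 * g_t - g_{t-1}.
        logits = np.log(x) + eta * (2.0 * g - g_prev)
        logits -= logits.max()  # numerical stability before exponentiating
        x = np.exp(logits)
        x /= x.sum()
        g_prev = g
    return x, np.array(gaps)


# Hypothetical cyclic (non-transitive) preferences: 0 beats 1, 1 beats 2, 2 beats 0.
P = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])
x0 = np.array([0.6, 0.3, 0.1])  # arbitrary interior starting policy
x_final, gaps = omwu_self_play(P, x0)
```

Plotting `gaps` on a logarithmic axis is one way to inspect whether the last iterate's duality gap decays roughly linearly in log scale on this toy instance. No regularization term appears in the update, so the gap is measured in the original game.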
Peer Reviews
Decision: Submitted to ICLR 2026
The theoretical results are interesting, as prior analyses of OMWU typically require the Nash equilibrium to be unique, whereas this paper relaxes that assumption by only requiring the equilibrium to have full support.
The presentation and organization of the paper require significant improvement. From Section 3.2 to Section 3.4, the authors attempt to connect their problem to the NLHF setting in the context of LLM alignment. However, this connection is incorrect, as the current paper studies a pure matrix game, whereas the NLHF literature (e.g., Munos et al., 2023) considers games with KL regularization terms, and it is precisely the KL regularization that makes those games relevant to the LLM alignment setti
1. The analysis relaxes prior assumptions (from unique equilibrium to a more general full-support equilibrium) and introduces a novel framework for understanding how the algorithm "escapes" from suboptimal actions, leading to faster convergence. 2. The paper is well-structured and clearly motivates the need for last-iterate convergence. It effectively contrasts its contributions with prior work, making its advancements easy to understand. 3. Experiments on synthetic data (tabular and neural policie
1. The core theoretical guarantee relies on the assumption that a Nash Equilibrium exists where every action has a non-zero probability. This may not hold in many real-world scenarios where some actions are always suboptimal, thus narrowing the theory's applicability. 2. The paper claims relevance to LLM alignment but provides no experiments on actual language models or real preference data. This makes it unclear how the theoretical gains translate to practice, leaving a gap between theory and a
1. This paper is the first to provide a rigorous last-iterate linear convergence proof (converging to the original NE) for an unregularized algorithm (OMWU) in NLHF, thereby addressing the inherent bias introduced by regularization. 2. This paper relaxes the strict unique NE assumption required by Wei et al. (2020) to a more realistic existence of a full-support NE (Assumption 1). This relaxation enables the theory to be directly applicable to NLHF. 3. The convergence bound is improved from an
1. Although the title and introduction position the work as addressing the NLHF problem for LLM alignment, all experiments are conducted on synthetic tabular games ($n=10$) or small MLP-based policies ($n=100$). While the theoretical contribution is elegant, it remains unclear whether OMWU can retain its theoretical advantages in realistic, high-dimensional, and non-stationary LLM fine-tuning settings. 2. The entire theoretical foundation of the paper relies on the existence of a full-support
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Speech and dialogue systems
