LLM Active Alignment: A Nash Equilibrium Perspective
Tonghan Wang, Yuqi Pan, Xinyi Yang, Yanchen Jiang, Milind Tambe, David C. Parkes

TL;DR
This paper introduces a game-theoretic framework that uses Nash equilibrium analysis to predict and steer the behavior of populations of large language models, with explicit guidance for shifting alignment targets toward socially desirable outcomes.
Contribution
It develops a novel analytical approach for active alignment of LLMs via Nash equilibrium, enabling strategic control over multi-agent LLM populations.
Findings
Nash equilibrium characterizations for LLM populations
Identification of political exclusion phenomena in reasoning-based models
Active alignment can prevent social biases in LLM interactions
Abstract
We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open-ended text spaces, we model each agent's action as a mixture over human subpopulations. Agents choose actively and strategically which groups to align with, yielding an interpretable and behaviorally substantive policy class. We derive closed-form NE characterizations, adopting standard concave-utility assumptions to enable analytical system-level predictions and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social-media setting, we show that a population of LLMs, especially reasoning-based models,…
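To make the "mixture over human subpopulations" idea concrete, here is a minimal sketch of a toy game in that spirit. It is not the paper's model: the utility function, the group weights `s`, the crowding cost `c`, and the function names are all assumptions made for illustration. Each of N agents picks a mixture x_i on the simplex over K groups and receives u_i = sum_k x_ik * (s_k - c * T_k), where T_k is the total weight all agents place on group k. This utility is strictly concave in an agent's own mixture, and the game has a strictly concave potential, so it has a unique Nash equilibrium that projected gradient ascent can find. When every group keeps positive weight, the equilibrium also has the closed form x_k = 1/K + (s_k - mean(s)) / (c * (N + 1)).

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def find_equilibrium(s, n_agents, c=1.0, steps=500, seed=0):
    """Projected gradient play on the toy game (hypothetical utility, see lead-in).

    Returns an (n_agents, K) array whose rows are each agent's mixture.
    """
    s = np.asarray(s, dtype=float)
    rng = np.random.default_rng(seed)
    # Random starting mixtures, one row per agent.
    x = rng.dirichlet(np.ones(len(s)), size=n_agents)
    # Step size 1/L, where L = c * (N + 1) bounds the curvature of the potential.
    lr = 1.0 / (c * (n_agents + 1))
    for _ in range(steps):
        load = x.sum(axis=0)  # T_k: total weight on each group
        # d u_i / d x_ik = s_k - c * T_k - c * x_ik
        grad = s - c * load - c * x
        x = np.array([project_simplex(row) for row in x + lr * grad])
    return x


def interior_equilibrium(s, n_agents, c=1.0):
    """Closed-form symmetric equilibrium; valid only if every entry is positive."""
    s = np.asarray(s, dtype=float)
    return 1.0 / len(s) + (s - s.mean()) / (c * (n_agents + 1))


if __name__ == "__main__":
    weights = [1.0, 0.8, 0.6]  # assumed group weights, illustration only
    print(find_equilibrium(weights, n_agents=3))
    print(interior_equilibrium(weights, n_agents=3))
```

In this toy setting, raising a group's weight `s_k` shifts every agent's equilibrium mixture toward that group, which is one simple way to picture steering a population's alignment targets. If the weights are spread far enough apart, the closed form goes negative and no longer applies; the numerical routine then assigns zero weight to the weakest groups.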
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Text Readability and Simplification
