Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Barna Pásztor; Thomas Kleine Buening; Andreas Krause

arXiv:2512.16626 · cs.LG · December 19, 2025

TL;DR

This paper introduces SLHF, a sequential game framework for preference optimization in AI alignment, enabling iterative refinement and improved robustness over existing methods like RLHF and NLHF.

Contribution

SLHF models preference optimization as a sequential game, capturing richer preferences and enabling inference-time refinements, which improves alignment and robustness.

Findings

01. SLHF achieves strong alignment across diverse datasets.

02. SLHF scales from 0.5B to 8B parameters.

03. Inference-time refinements transfer across models.

Abstract

We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader's actions, and these refinements can be…
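The Leader/Follower decomposition described in the abstract can be illustrated with a toy example. The sketch below is not the authors' training procedure: it assumes a small, hypothetical preference table `P` over three fixed responses with an intransitive cycle, uses deterministic choices instead of learned stochastic policies, and omits any regularization. The function names `follower_best_response`, `leader_action`, and `refine` are introduced here for illustration only.

```python
# Toy illustration only. P[a][b] is an assumed probability that response a
# is preferred over response b (so P[a][b] + P[b][a] == 1).
# The table encodes a cycle: A beats B, B beats C, C beats A.
P = [
    [0.5, 0.9, 0.4],  # A
    [0.1, 0.5, 0.9],  # B
    [0.6, 0.1, 0.5],  # C
]


def follower_best_response(P, a):
    """Follower observes the Leader's action a and returns the response
    that is most preferred over a (the refinement problem)."""
    return max(range(len(P)), key=lambda b: P[b][a])


def leader_action(P):
    """Leader commits to the action whose preference against the Follower's
    best reply is highest (optimization against an adversary)."""
    values = [P[a][follower_best_response(P, a)] for a in range(len(P))]
    best = max(range(len(P)), key=lambda a: values[a])
    return best, values[best]


def refine(P, a, steps):
    """Apply the Follower repeatedly, starting from response a.
    Each entry is preferred over the one immediately before it."""
    chain = [a]
    for _ in range(steps):
        chain.append(follower_best_response(P, chain[-1]))
    return chain


leader, value = leader_action(P)
print("Leader commits to response", leader, "with worst-case preference", value)
print("Follower reply:", follower_best_response(P, leader))
print("Refinement chain from the Leader's action:", refine(P, leader, 3))
```

On this table the Leader commits to response A (index 0), because its worst case against the Follower's reply (0.4) is better than that of B or C (0.1 each). The refinement chain also shows that, under a preference cycle, each step improves only on the response directly before it, which is why a single scalar reward cannot rank the three responses consistently.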

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* The paper is generally easy to follow and can be understood. The overview of different approaches is nice too!
* The section demonstrating the different types of preference relationships that different models can address is very nice and useful!
* The approach is well justified theoretically.
* The experimental evaluation considers both preference dataset evaluation and general finetuning.
* To my knowledge, using a Stackelberg game for learning from feedback is novel.

Weaknesses

* The experiment section lacks any ablations on the choices made. For example, how does the two-timescale schedule affect performance?
* The method seems like it will be computationally more expensive.
* I am not sure why the Stackelberg formulation makes sense. I can see how the Nash formulation can resolve ambiguities in preferences vs BT, but realistically when would I want to have a leader and follower? Using the follower will double inference costs.
* the gains of the leader vs the nash m

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Unlike standard RLHF, SLHF optimizes directly over pairwise preferences without collapsing them into a single scalar reward, allowing it to handle complex and intransitive preference cycles.
- The Leader-Follower structure naturally supports improving model outputs at inference time, as the Follower is explicitly trained to refine a given response, allowing for iterative improvement with more computation.
- By decomposing the problem, the Follower solves a simpler refinement task against a fix

Weaknesses

- The method's success relies heavily on having a "well-specified and representative" pairwise preference function, which can be unavailable.
- The experiments suggest the method can be sensitive to biases in the preference judge (in this case, an "LLM-as-a-judge"). The authors attribute the gap between standard and length-controlled win rates to the judge model's "length bias," which the SLHF model may have learned to exploit.

Reviewer 03 · Rating 10 · Confidence 5

Strengths

This paper introduces an innovative game-theoretic Stackelberg structure for preference learning. The proposal is rooted in the existence of intransitivity in pairwise preferences. It proposes a rational computational solution that replicates the logic with additional transparency into the learning and inference process. Experiments showed that it outperforms or matches RLHF/NLHF baselines across multiple datasets. Some theoretical foundations are discussed, i.e., qualitative connections to

Weaknesses

The two-policy framework, i.e., Leader policy and Follower policy, increases computational and training costs.

Code & Models

Models

pasztorb/Llama-3.1-Tulu-3-8B-SLHF (Hugging Face model)

Taxonomy

Topics: Reinforcement Learning in Robotics · Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI)