Stackelberg Self-Annotation: A Robust Approach to Data-Efficient LLM Alignment
Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin

TL;DR
This paper introduces a robust, data-efficient framework for aligning large language models with human preferences using a Stackelberg game approach, significantly reducing the need for extensive human-labeled data.
Contribution
The paper proposes SGPO, a Stackelberg game-based alignment method, and SSAPO, a self-annotation technique that achieves strong performance with minimal human labels.
Findings
SSAPO uses only 2K seed preferences to achieve strong benchmark performance.
SSAPO maintains robustness against noisy self-labels.
The approach reduces human annotation costs by over 95%.
Abstract
Aligning large language models (LLMs) with human preferences typically demands vast amounts of meticulously curated data, which is both expensive and prone to labeling noise. We propose Stackelberg Game Preference Optimization (SGPO), a robust alignment framework that models alignment as a two-player Stackelberg game between a policy (leader) and a worst-case preference distribution (follower). The proposed SGPO guarantees O(ε)-bounded regret within an ε-Wasserstein ball, offering formal robustness to (self-)annotation noise. We instantiate SGPO with Stackelberg Self-Annotated Preference Optimization (SSAPO), which uses minimal human-labeled "seed" preferences and iteratively self-annotates new prompts. In each iteration, SSAPO applies a distributionally robust reweighting of synthetic annotations, ensuring that noisy or biased self-labels do not derail…
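The leader/follower structure described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract specifies a follower constrained to a Wasserstein ball, while the code below swaps in a simpler KL-regularized worst-case reweighting (exponential tilting) because it has a closed form. The function names, the `temperature` parameter, and the example margins are all hypothetical. Each margin stands for the policy's reward gap between the preferred and rejected response of one self-annotated pair.

```python
import math


def dpo_style_loss(margin):
    """Logistic preference loss -log(sigmoid(margin)) for one pair.

    Computed as softplus(-margin) in a numerically stable way.
    """
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def worst_case_weights(losses, temperature=1.0):
    """Follower step (stand-in, NOT the paper's Wasserstein follower).

    Returns the weights maximizing
        sum_i w_i * loss_i - temperature * KL(w || uniform),
    whose closed form is w_i proportional to exp(loss_i / temperature).
    """
    peak = max(losses)  # subtract the max for numerical stability
    scores = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(scores)
    return [score / total for score in scores]


def robust_objective(margins, temperature=1.0):
    """Leader's objective: preference loss under the follower's weights."""
    losses = [dpo_style_loss(m) for m in margins]
    weights = worst_case_weights(losses, temperature)
    value = sum(w * loss for w, loss in zip(weights, losses))
    return value, weights


# Hypothetical margins for four self-annotated pairs; the third pair
# disagrees with the current policy (negative margin).
margins = [2.0, 1.5, -0.5, 0.1]
robust, weights = robust_objective(margins, temperature=0.5)
average = sum(dpo_style_loss(m) for m in margins) / len(margins)
```

In this sketch the follower concentrates weight on the pairs the policy handles worst, so the reweighted loss is never below the plain average. A policy update that lowers `robust` is therefore hedged against that shift in the preference distribution. How SSAPO constructs and solves its Wasserstein-constrained version is not recoverable from the truncated abstract.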
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization
Methods: Attention Is All You Need · Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
