Bootstrapping LLMs via Preference-Based Policy Optimization

Chen Jia

arXiv:2511.12867 · cs.AI · December 25, 2025

TL;DR

This paper introduces preference-based policy optimization (PbPO), a framework for bootstrapping large language models that casts learning as a min-max game between the policy and a reward model. It comes with high-probability regret bounds and reports better results than existing preference optimization methods on five benchmarks.

Contribution

The paper proposes an iterative online algorithm for preference-based policy optimization that actively collects preference data as the policy evolves, supported by high-probability regret bounds and improved benchmark results.

Findings

01. Outperforms existing preference optimization methods.

02. Provides high-probability regret bounds for both sequence-level and token-level reward models.

03. Demonstrates effectiveness across five benchmark datasets.

Abstract

Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both settings with sequence-level RM and token-level RM, demonstrating its effectiveness in…
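
The min-max idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the candidate reward tables, the Bradley-Terry likelihood, the `slack` threshold, and the reference response are all assumptions chosen for illustration. It shows one plausible reading, in which reward models that explain the preference data poorly are dropped from a confidence set, and the policy then picks the response with the best worst-case reward over the models that remain.

```python
import math

# Hypothetical toy setup: three candidate responses to a single prompt.
RESPONSES = ["a", "b", "c"]

# Hypothetical candidate reward models (response -> scalar reward).
CANDIDATE_RMS = [
    {"a": 1.0, "b": 0.0, "c": 0.5},
    {"a": 0.8, "b": 0.1, "c": 0.9},
    {"a": 0.0, "b": 1.0, "c": 0.2},
]

# Hypothetical preference data as (winner, loser) pairs.
PREFERENCES = [("a", "b"), ("a", "b"), ("c", "b")]


def bt_log_likelihood(rm, prefs):
    """Log-likelihood of the preferences under a Bradley-Terry model (assumed)."""
    return sum(-math.log(1.0 + math.exp(rm[l] - rm[w])) for w, l in prefs)


def confidence_set(rms, prefs, slack):
    """Keep reward models whose log-likelihood is within `slack` of the best one."""
    scores = [bt_log_likelihood(rm, prefs) for rm in rms]
    best = max(scores)
    return [rm for rm, s in zip(rms, scores) if s >= best - slack]


def max_min_response(responses, rms, reference):
    """Pick the response with the largest worst-case reward gap over `reference`."""
    def worst_case(y):
        return min(rm[y] - rm[reference] for rm in rms)
    return max(responses, key=worst_case)


kept = confidence_set(CANDIDATE_RMS, PREFERENCES, slack=0.5)
choice = max_min_response(RESPONSES, kept, reference="b")
```

In this toy instance the third reward model contradicts the preference data and falls outside the confidence set. The second model alone would favor response "c", but the worst case over both remaining models favors "a", which is the effect the pessimistic inner minimization is meant to have.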

Taxonomy

Topics: Machine Learning and Data Classification · Recommender Systems and Techniques · Multimodal Machine Learning Applications