Semi-Supervised Preference Optimization with Limited Feedback

Seonggyun Lee; Sungjun Lim; Seojin Park; Soeun Cheon; Kyungwoo Song

arXiv:2511.00040 · cs.LG · February 20, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Semi-Supervised Preference Optimization (SSPO), a method that reduces resource costs by learning from limited labeled preferences and large unpaired datasets, while maintaining high alignment quality.

Contribution

The paper provides a theoretical proof for an optimal reward threshold enabling pseudo-labeling in semi-supervised preference learning, and demonstrates significant data efficiency improvements.
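As a hedged illustration of why a single reward threshold can act as a pseudo-labeling rule, consider a textbook special case (not the paper's theorem): the rewards of winning and losing responses are Gaussian with means $\mu_w > \mu_l$, a shared variance $\sigma^2$, and class priors $\pi_w$ and $\pi_l$. The shared variance and the priors are assumptions introduced here for illustration. Under these assumptions the Bayes-optimal rule labels a response as winning when its reward exceeds

$$
\delta = \frac{\mu_w + \mu_l}{2} + \frac{\sigma^2}{\mu_w - \mu_l}\,\ln\frac{\pi_l}{\pi_w}
$$

With equal priors the logarithm vanishes and the threshold is the midpoint of the two means. A larger prior on losing responses pushes the threshold upward.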

Findings

01 · SSPO trained on 1% of data surpasses baselines trained on 10%.

02 · Theoretical proof of an optimal reward threshold for pseudo-labeling.

03 · Effective distillation of preferences from large unpaired datasets.

Abstract

The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while…
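The pseudo-labeling idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the midpoint rule for choosing the threshold are hypothetical, and the rewards are made-up numbers.

```python
from statistics import mean


def midpoint_threshold(winner_rewards, loser_rewards):
    """Hypothetical threshold choice: the midpoint between the mean rewards
    of the labeled winners and the labeled losers."""
    return 0.5 * (mean(winner_rewards) + mean(loser_rewards))


def pseudo_label(unpaired_rewards, delta):
    """Assign 1 (pseudo-winner) when the reward exceeds delta, else 0 (pseudo-loser)."""
    return [1 if r > delta else 0 for r in unpaired_rewards]


# Rewards of the small labeled set (illustrative values only).
labeled_winners = [1.2, 0.9, 1.5]
labeled_losers = [-0.4, 0.1, -0.3]

# Estimate the threshold from labeled data, then label the unpaired pool.
delta = midpoint_threshold(labeled_winners, labeled_losers)
labels = pseudo_label([1.1, -0.2, 0.6, 0.3], delta)
```

In a training loop the pseudo-labeled responses would then be added to the preference objective alongside the labeled pairs. How they are weighted is left open in this sketch.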

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The novel framing of preference learning as a Bayes-optimal classification problem, which yields a principled reward threshold for pseudo-labeling unpaired data, is clear.
- Allowing SFT-style unlabeled data for preference learning is a promising approach for reducing annotation cost.
- Strong data efficiency: with 1 percent of the labels, SSPO often beats baselines trained on 10 percent.
- Extensive experiments, including label-noise tests and ablations, show the robustness of SSPO.

Weaknesses

- The computational overhead should be addressed in comparison to other baselines.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- Simple, practical recipe: thresholded pseudo-labeling on top of SimPO + a scheduler; easy to implement.
- Broad empirical sweep across backbones (Phi-2, Mistral-7B, Llama-3-8B) and two domains, with some ablations (prior sensitivity, scheduler).
- Shows consistent LC improvements in data-scarce regimes; engineering details (EMA, KDE bandwidth, configs) are documented.
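The review names EMA and a KDE bandwidth as engineering details but does not say how they are combined. The sketch below is one plausible arrangement under stated assumptions, not the paper's procedure: a Gaussian kernel density estimate per class, a grid search for the point where the prior-weighted densities are closest, and an exponential moving average over per-batch thresholds. All function names, default values, and sample rewards are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def kde_threshold(winner_rewards, loser_rewards, prior_winner=0.5,
                  bandwidth=0.3, grid_size=200):
    """Assumed rule: scan a grid between the two class means and return the
    point where the prior-weighted class densities are closest to equal."""
    f_w = gaussian_kde(winner_rewards, bandwidth)
    f_l = gaussian_kde(loser_rewards, bandwidth)
    lo = sum(loser_rewards) / len(loser_rewards)
    hi = sum(winner_rewards) / len(winner_rewards)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda x: abs(prior_winner * f_w(x)
                                       - (1.0 - prior_winner) * f_l(x)))


def ema_update(previous, current, decay=0.9):
    """Smooth the threshold across training steps with an exponential moving average."""
    return decay * previous + (1.0 - decay) * current


# Illustrative rewards, symmetric about zero, so the crossing sits near 0.
winners = [0.8, 1.0, 1.2]
losers = [-1.2, -1.0, -0.8]
delta_batch = kde_threshold(winners, losers)
delta_ema = ema_update(previous=0.0, current=delta_batch)
```

Smoothing the threshold is one way to damp batch-to-batch noise in the density estimates. The bandwidth and decay control a bias-variance trade-off and would need tuning in practice.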

Weaknesses

- Limited novelty: essentially classic self-training/pseudo-labeling using $r_\theta$ and $\delta$; close to SSRM/SPA and prior semi-supervised alignment.
- Theory misaligned with practice: Theorem relies on high-probability separation (max loser ≤ $\delta$ ≤ min winner) under sub-Gaussian assumptions, unrealistic with overlapping reward distributions; equality $\delta^*=\mu_l+t_1=\mu_w-t_2$ not generally guaranteed; no analysis of KDE/EMA estimation error or consistency.
- Objective inconsist

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper tackles an important and realistic bottleneck in preference optimization, as human-labeled preference pairs are generally scarce and expensive. Hence, the proposed approach of leveraging abundant unpaired data is nice. The motivation of the paper is clearly described.
2. The proposed pseudo-labeling approach appears to be easy to integrate with existing approaches (especially SimPO) without any notable architectural changes.
3. The authors conduct extensive experiments across va

Weaknesses

For now, my concerns are still in the form of questions that I hope the authors can clarify (see Questions below). To briefly highlight the most important ones here:
1. The theoretical setup doesn’t seem consistent with the changing reward distributions during training, which makes the “Bayesian” framing and the threshold stability unclear.
2. I’m worried the pseudo-labeling approach creates a self-reinforcing loop, where initial model biases are amplified over training.
3. The unpaired dat

Taxonomy

Topics: Machine Learning and Data Classification · Recommender Systems and Techniques · Constraint Satisfaction and Optimization