Semi-Supervised Preference Optimization with Limited Feedback
Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song

TL;DR
This paper introduces Semi-Supervised Preference Optimization (SSPO), a method that reduces resource costs by learning from limited labeled preferences and large unpaired datasets, while maintaining high alignment quality.
Contribution
The paper proves the existence of an optimal reward threshold that enables principled pseudo-labeling of unpaired data in semi-supervised preference learning, and demonstrates significant data-efficiency improvements.
Findings
SSPO trained on 1% of the labeled data surpasses baselines trained on 10%.
Theoretical proof of an optimal reward threshold for pseudo-labeling.
Effective distillation of preferences from large unpaired datasets.
Abstract
The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on large amounts of paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO), in which the idea is to learn simultaneously from a small number of pairwise preference labels and a large pool of unpaired samples. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while…
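The thresholded pseudo-labeling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_threshold` and `pseudo_label`, the toy reward values, and the choice of picking the threshold by minimizing empirical misclassification on the labeled rewards are all assumptions made for illustration. Rewards are taken as given scalars; how they are computed from the policy is outside this sketch.

```python
from typing import List, Sequence


def fit_threshold(winner_rewards: Sequence[float],
                  loser_rewards: Sequence[float]) -> float:
    """Pick a reward threshold from labeled data (illustrative rule only).

    Candidates are midpoints between consecutive distinct reward values;
    the one with the fewest misclassified labeled rewards is returned.
    """
    points = sorted(set(winner_rewards) | set(loser_rewards))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    if not candidates:  # all rewards identical: fall back to that value
        candidates = points

    def errors(delta: float) -> int:
        # Winners at or below the threshold and losers above it are errors.
        missed_winners = sum(r <= delta for r in winner_rewards)
        missed_losers = sum(r > delta for r in loser_rewards)
        return missed_winners + missed_losers

    return min(candidates, key=errors)


def pseudo_label(rewards: Sequence[float], delta: float) -> List[int]:
    """Label an unpaired response 1 (winner-like) if its reward exceeds delta."""
    return [1 if r > delta else 0 for r in rewards]


# Toy values, invented for this example.
labeled_winners = [0.9, 1.2, 0.7, 1.5]
labeled_losers = [-0.3, 0.1, 0.4, -0.8]
unpaired = [1.0, -0.5, 0.6, 0.2]

delta = fit_threshold(labeled_winners, labeled_losers)
labels = pseudo_label(unpaired, delta)
print(delta, labels)
```

In this toy case the labeled rewards are perfectly separable, so the chosen threshold falls in the gap between the largest loser reward and the smallest winner reward. With overlapping reward distributions the same rule would instead return the candidate with the fewest labeled errors.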
Peer Reviews
Decision·ICLR 2026 Oral
- Novel and clear framing of preference learning as a Bayes-optimal classification problem, which yields a principled reward threshold for pseudo-labeling unpaired data.
- Allowing SFT-style unlabeled data in preference learning is a promising approach for reducing annotation cost.
- Strong data efficiency: with 1 percent of labels, SSPO often beats baselines trained with 10 percent.
- Extensive experiments, including label-noise testing and ablations, show the robustness of SSPO.
- The computational overhead should be addressed in comparison to other baselines.
- Simple, practical recipe: thresholded pseudo-labeling on top of SimPO + a scheduler; easy to implement.
- Broad empirical sweep across backbones (Phi-2, Mistral-7B, Llama-3-8B) and two domains, with some ablations (prior sensitivity, scheduler).
- Shows consistent LC improvements in data-scarce regimes; engineering details (EMA, KDE bandwidth, configs) are documented.
- Limited novelty: essentially classic self-training/pseudo-labeling using $r_\theta$ and $\delta$; close to SSRM/SPA and prior semi-supervised alignment.
- Theory misaligned with practice: Theorem relies on high-probability separation (max loser ≤ $\delta$ ≤ min winner) under sub-Gaussian assumptions, unrealistic with overlapping reward distributions; equality $\delta^*=\mu_l+t_1=\mu_w-t_2$ not generally guaranteed; no analysis of KDE/EMA estimation error or consistency.
- Objective inconsist
1. The paper tackles an important and realistic bottleneck in preference optimization: human-labeled preference pairs are generally scarce and expensive. Leveraging abundant unpaired data is therefore a sensible approach, and the motivation of the paper is clearly described.
2. The proposed pseudo-labeling approach appears easy to integrate with existing approaches (especially SimPO) without any notable architectural changes.
3. The authors conduct extensive experiments across va
For now, my concerns are still in the form of questions that I hope the authors can clarify (see Questions below). To briefly highlight the most important ones here:
1. The theoretical setup doesn’t seem consistent with the changing reward distributions during training, which makes the “Bayesian” framing and the threshold stability unclear.
2. I’m worried the pseudo-labeling approach creates a self-reinforcing loop, where initial model biases are amplified over training.
3. The unpaired dat
Taxonomy
Topics: Machine Learning and Data Classification · Recommender Systems and Techniques · Constraint Satisfaction and Optimization
