When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Amirabbas Afzali; Myeongho Jeon; Maria Brbic

arXiv:2603.04968·cs.CL·March 6, 2026

Open Access · 3 Reviews

TL;DR

This paper shows that a weak LLM, combined with confidence-based sample selection and weighting, can drive preference alignment with far less human annotation, and can outperform models trained on full human annotations.

Contribution

The authors introduce Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives, reducing annotation cost while maintaining or improving alignment quality.

Findings

01. Selecting only a weak LLM's high-confidence samples improves alignment performance.

02. CW-PO with 20% of human annotations outperforms standard DPO trained with 100% of annotations.

03. Weak LLMs can effectively replace extensive human labeling.
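The first finding can be illustrated with a minimal sketch of confidence-based selection. It assumes the weak LLM produces a scalar score per response and that confidence is measured as |2p - 1|, where p is the Bradley-Terry probability that the first response is preferred. This confidence definition, the function names, and the `keep_fraction` parameter are illustrative assumptions, not details taken from the paper.

```python
import math


def preference_confidence(score_a, score_b):
    """Return (label, confidence) for one response pair.

    label is 0 if response A is preferred, 1 otherwise.
    Assumed definition: confidence = |2 * sigmoid(margin) - 1|,
    which equals |tanh(margin / 2)| and is numerically stable.
    """
    margin = score_a - score_b
    label = 0 if margin >= 0 else 1
    confidence = abs(math.tanh(margin / 2.0))
    return label, confidence


def select_top_confident(pairs, keep_fraction):
    """Keep the most confident fraction of weakly labeled pairs.

    pairs: list of (score_a, score_b) from the weak model (hypothetical input).
    Returns a list of (index, label, confidence), most confident first.
    """
    scored = []
    for index, (score_a, score_b) in enumerate(pairs):
        label, confidence = preference_confidence(score_a, score_b)
        scored.append((index, label, confidence))
    scored.sort(key=lambda item: item[2], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]


# Toy weak-model scores for four response pairs.
pairs = [(2.0, -1.0), (0.1, 0.0), (-3.0, 1.0), (0.5, 0.4)]
selected = select_top_confident(pairs, keep_fraction=0.5)
```

With these toy scores, the two pairs with the largest score margins are kept and the near-tie pairs are dropped. The selected pairs, with the weak model's labels, would then form the training set in place of human annotations.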

Abstract

Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence…
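The re-weighting idea in the abstract can be sketched on top of the standard DPO objective. The snippet below is an illustrative assumption of how a per-sample confidence weight might scale the DPO loss. The exact weighting function and normalization used by CW-PO are not given in this summary, and the names `dpo_loss` and `confidence_weighted_loss` are hypothetical.

```python
import math


def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z)), i.e. softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-sample DPO loss from sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return neg_log_sigmoid(beta * margin)


def confidence_weighted_loss(batch, weights, beta=0.1):
    """Batch mean of DPO losses scaled by weak-LLM confidence weights.

    Assumption: each weight multiplies its sample's loss and the result is
    averaged over the batch; the paper may normalize differently.
    """
    losses = [dpo_loss(*sample, beta=beta) for sample in batch]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Each sample: (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probs.
batch = [(-1.0, -2.0, -1.5, -1.5), (-2.0, -1.0, -1.5, -1.5)]
uniform = confidence_weighted_loss(batch, weights=[1.0, 1.0])
downweighted = confidence_weighted_loss(batch, weights=[1.0, 0.0])
```

With all weights equal to 1, this reduces to the ordinary DPO batch loss. Setting a weight to 0 removes that pair's contribution, so hard selection of confident samples is a special case of this weighting. Because the weight only multiplies a per-sample loss, the same wrapper could in principle be placed around other preference objectives.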

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The idea is useful in practical scenarios and straightforward to apply.
- The analysis presented in this paper aligns with that of prior art, whilst also adding further results.

Weaknesses

- Most of the analysis in this work uses models from the same family. It isn't entirely clear whether the proposed method will generalize across different model families.
- It's unclear how exactly the weak LLMs are trained. The equation makes it seem like a regression problem, but the footnote says classification. Could you clarify this part?

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is well-written and easy to follow, making it straightforward to understand the motivation, methodology, and results.
2. The authors conduct evaluations on multiple models, including OPT and Qwen, as well as on major datasets such as HH-RLHF and TL;DR, which strengthens the credibility and reliability of the reported results.
3. The proposed method is simple and easily applicable.

Weaknesses

1. In this paper, a small model is trained as a reward model using policy logits through BT modeling, similar to reference-free DPO approaches such as SimPO. However, this is not a novel idea. The subsequent incorporation of confidence-weighted DPO training is also conceptually similar to prior works such as WPO. Moreover, the notion that small models can effectively perform reward modeling has already been demonstrated in works like WS-DPO. From this perspective, the novelty of the paper appea

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper identifies that leveraging a small subset of highly confident samples can significantly enhance alignment performance.
- The proposed CW-PO method is well-motivated and closely aligned with the empirical observations discussed in Section 3.1.

Weaknesses

- The paper lacks sufficient discussion of related works on weak-to-strong supervision and generalization [1][2].
- Some evaluation details are unclear, such as sampling robustness and result variance.
- The presentation could be more concise; many key methodological details are deferred to the appendix, which slightly weakens the readability of the main text.
- The comparison in Table 5 needs clarification, since DPO and SFT+DPO optimize generative policies for text prediction, while BT is a d

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Topic Modeling