Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment
Tiejin Chen, Xiaoou Liu, Vishnu Nandam, Kuan-Ru Liou, Hua Wei

TL;DR
This paper introduces Conformal Feedback Alignment (CFA), a new framework that assesses answer-level reliability using conformal prediction to improve the robustness and efficiency of large language model alignment.
Contribution
According to the authors, CFA is the first method to incorporate answer-level reliability into preference-based alignment. It is grounded in conformal prediction guarantees and aims to improve robustness and data efficiency.
Findings
CFA improves alignment robustness across datasets.
CFA enhances data efficiency in training.
Answer-side uncertainty modeling complements preference weighting.
Abstract
Preference-based alignment methods such as Reinforcement Learning from Human Feedback (RLHF) learn from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences, but ignore a more fundamental factor: the reliability of the answers being compared. To address this, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage and aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across different datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling answer-side uncertainty complements preference-level weighting…
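The abstract does not give CFA's exact scoring or aggregation rules, so the following is only a minimal sketch of the general idea under stated assumptions: standard split conformal prediction is used to decide whether an answer falls inside a prediction set at coverage level 1 - alpha, and the resulting answer-level reliabilities are combined into a per-pair weight on a DPO-style loss. The function names, the binary reliability, the averaging rule, and the `floor` parameter are all hypothetical and are not taken from the paper.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold from calibration nonconformity scores.

    Uses the finite-sample corrected rank ceil((n + 1) * (1 - alpha)).
    If that rank exceeds n, the threshold is infinite (every answer is kept).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def answer_reliability(score, threshold):
    """Hypothetical binary reliability: 1.0 if the answer is in the set."""
    return 1.0 if score <= threshold else 0.0


def pair_weight(r_chosen, r_rejected, floor=0.1):
    """Hypothetical aggregation: mean reliability of the pair, floored so
    that no preference pair is discarded entirely."""
    return max(floor, 0.5 * (r_chosen + r_rejected))


def weighted_dpo_loss(logratio_chosen, logratio_rejected, weight, beta=0.1):
    """Weighted DPO-style loss for one pair.

    Each logratio is log pi_theta(y|x) - log pi_ref(y|x).
    -log(sigmoid(m)) is computed as log(1 + exp(-m)).
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    return weight * math.log1p(math.exp(-margin))


# Example: calibrate on seven scores, then weight one preference pair.
threshold = conformal_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], alpha=0.25)
w = pair_weight(answer_reliability(0.3, threshold),
                answer_reliability(0.65, threshold))
loss = weighted_dpo_loss(logratio_chosen=1.2, logratio_rejected=-0.4, weight=w)
```

In this sketch a pair whose answers both fall outside the prediction set still contributes through the floor weight; whether CFA down-weights, filters, or handles such pairs differently is not stated in the abstract.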
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Mobile Crowdsensing and Crowdsourcing · Topic Modeling
