TL;DR
This paper introduces a training-free, post-hoc method called Robust Preference Selection (RPS) that improves large language models' alignment with nuanced human preferences by sampling and selecting responses from a local preference neighborhood.
Contribution
The paper proposes RPS, a novel neighborhood consensus approach that enhances preference alignment robustness without retraining, supported by theoretical guarantees and extensive experiments.
Findings
RPS outperforms baseline methods with up to 69% win rate on challenging preferences.
RPS improves robustness across three different alignment paradigms.
Theoretical analysis shows neighborhood sampling is provably superior to single-sample approaches.
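The sample-then-select idea behind these findings can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes 2-D unit preference vectors, uniform angular sampling within `theta_max` of the target, and selection by the target-weighted reward $v_{\text{target}}^\top r(x, y)$. The names `sample_neighborhood`, `rps_select`, `toy_generate`, and `toy_reward` are hypothetical stand-ins introduced only for this example.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = Tuple[float, float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two 2-D vectors."""
    return a[0] * b[0] + a[1] * b[1]


def sample_neighborhood(
    v_target: Vector, k: int, theta_max: float, rng: random.Random
) -> List[Vector]:
    """Draw k unit vectors whose angle to v_target is at most theta_max (radians).

    Assumption: uniform sampling over the angular offset; the paper may differ.
    """
    base = math.atan2(v_target[1], v_target[0])
    angles = [base + rng.uniform(-theta_max, theta_max) for _ in range(k)]
    return [(math.cos(a), math.sin(a)) for a in angles]


def rps_select(
    prompt: str,
    v_target: Vector,
    generate: Callable[[str, Vector], str],
    reward: Callable[[str, str], Vector],
    k: int = 5,
    theta_max: float = math.radians(30),
    seed: int = 0,
) -> str:
    """Generate one response per neighboring preference, then keep the one
    that scores highest under the *target* preference."""
    rng = random.Random(seed)
    neighbors = sample_neighborhood(v_target, k, theta_max, rng)
    candidates = [generate(prompt, v) for v in neighbors]
    return max(candidates, key=lambda y: dot(v_target, reward(prompt, y)))


# Hypothetical stand-ins for a preference-conditioned LLM and a
# multi-attribute reward model; they only make the sketch executable.
def toy_generate(prompt: str, v: Vector) -> str:
    return f"{prompt}|{v[0]:.6f},{v[1]:.6f}"


def toy_reward(prompt: str, response: str) -> Vector:
    x, y = response.rsplit("|", 1)[1].split(",")
    return (float(x), float(y))


if __name__ == "__main__":
    best = rps_select("Explain RPS briefly.", (1.0, 0.0), toy_generate, toy_reward, k=8)
    print(best)
```

With the toy functions, the selected response is simply the one generated from the neighbor closest in angle to the target. In a real setting, `generate` and `reward` would wrap a preference-conditioned model and a reward model, and the quality of the selection depends on how reliable that reward model is at the target preference.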
Abstract
Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short on specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not generalize to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS),…
Peer Reviews
Decision·ICLR 2026 Poster
- The preference selection method is training-free, making it applicable to models trained under different schemes.
- Theoretical analysis shows that the expected score of the best response selected by RPS is greater than that of the best response selected by the baseline.
- The method’s generalization to higher-dimensional preference spaces is not empirically validated.
- RPS assumes that the reward model used in the Consensus Selection phase (Phase 3) is robust when evaluating responses against OOD targets ($v_{\text{target}}$), potentially shifting the “brittleness” problem from generation to evaluation.
- The theoretical foundation relies on two key assumptions: Assumption 1 and the local consistency assumption (L203). Assumption 1 requires the neighborhood ve
First, the paper proposes a conceptually novel method, RPS, with a clear motivation. Rather than attempting to force a model to directly generate a high-quality response for a difficult, out-of-distribution (OOD) preference, RPS reframes the problem. It hypothesizes that it is more effective to first sample from a neighborhood of related, but easier, preference vectors where the model is inherently more competent. This conceptual shift from direct, constrained generation to a "generate-then-selec
First, the paper's theoretical claim of being "provably superior" rests on a critical yet unformalized logical gap, which undermines its rigor. The entire argument of Theorem 1 hinges on the "local consistency" assumption, stated as $v_{\text{target}}^\top r(x, y_i) \approx v_i^\top r(x, y_i)$. This approximation is presented without any formal justification, error bounds, or discussion of the conditions under which it might hold. Consequently, the strong language of "guarantee" and "proof" is a mischaracterizatio
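To make the concern about the local consistency approximation concrete, one standard way to bound its error is via the Cauchy-Schwarz inequality. The following is an illustrative derivation and is not taken from the paper; it assumes that $v_{\text{target}}$ and $v_i$ are unit vectors separated by an angle $\theta_i \le \theta_{\max} \le \pi$, and that the reward vector $r(x, y_i)$ has finite norm.

$$
\left| v_{\text{target}}^\top r(x, y_i) - v_i^\top r(x, y_i) \right|
= \left| (v_{\text{target}} - v_i)^\top r(x, y_i) \right|
\le \lVert v_{\text{target}} - v_i \rVert_2 \, \lVert r(x, y_i) \rVert_2
= 2 \sin\!\left(\tfrac{\theta_i}{2}\right) \lVert r(x, y_i) \rVert_2
\le \theta_{\max} \, \lVert r(x, y_i) \rVert_2
$$

Under these assumptions the approximation error shrinks linearly with the neighborhood angle, which suggests the kind of explicit condition the reviewer asks for: the approximation is only as tight as $\theta_{\max}$ is small relative to the reward scale.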
- The paper is well-written, the motivation is clear, and the teaser figures are intuitive and easy to follow.
- The authors present theoretical evidence for their proposed method, showing the effectiveness of RPS under certain assumptions.
- Assumption 1 appears rather idealized, and the paper provides limited empirical evidence to support it. Although the authors mention that Figure 5 offers some justification, a deeper analysis or additional experiments would help validate it.
- The paper lacks ablation studies on the choices of $k$ and $\theta_{\max}$; without these, it is difficult to assess whether the method is sensitive to hyperparameters.
- The assumption of a well-calibrated reward model that generalizes to out-of-distrib
