Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
Gihoon Kim, Euntai Kim

TL;DR
This paper introduces Swap-guided Preference Learning (SPL), a novel method to enhance personalized reinforcement learning from human feedback by addressing the posterior collapse issue in variational preference learning.
Contribution
The paper proposes SPL, which uses fictitious swap annotators and new regularization techniques to improve personalization and prevent latent variable collapse in preference learning.
Findings
SPL mitigates posterior collapse in preference learning.
SPL enriches user-specific latent representations.
SPL improves preference prediction accuracy.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single-reward model. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the…
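The swap idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the helper names (`swap_annotator`, `toy_encoder`, `swap_mismatch`), the linear encoder, and the squared-error penalty are all assumptions made for illustration. The sketch only shows what it means for a fictitious swap annotator to mirror a real one: every preference label is flipped, the latent mean changes sign, and the variance is unchanged.

```python
import numpy as np

def swap_annotator(context):
    """Build a fictitious annotator by flipping every preference label.

    `context` is a list of (feat_a, feat_b, label) tuples, where label is 1
    if response A is preferred and 0 otherwise (a hypothetical data layout).
    """
    return [(a, b, 1 - y) for a, b, y in context]

def toy_encoder(context, W):
    """Hypothetical encoder, not the paper's architecture.

    The mean is an odd function of the signed preference direction and the
    log-variance is an even function, so this encoder mirrors by construction.
    """
    signed = np.stack([(2 * y - 1) * (a - b) for a, b, y in context])
    pooled = signed.mean(axis=0)
    mu = W @ pooled
    log_var = -np.abs(W @ pooled)
    return mu, log_var

def swap_mismatch(mu, log_var, mu_s, log_var_s):
    """Assumed penalty: zero when the mean flips sign and the variance matches."""
    return float(np.sum((mu + mu_s) ** 2) + np.sum((log_var - log_var_s) ** 2))

# Small synthetic example (random features, not real preference data).
rng = np.random.default_rng(0)
ctx = [(rng.normal(size=4), rng.normal(size=4), int(rng.integers(0, 2)))
       for _ in range(5)]
W = rng.normal(size=(2, 4))

mu, lv = toy_encoder(ctx, W)
mu_s, lv_s = toy_encoder(swap_annotator(ctx), W)
mismatch = swap_mismatch(mu, lv, mu_s, lv_s)
```

For a learned encoder the mismatch would generally be nonzero, and a term of this kind could serve as a regularizer; how SPL actually defines and weights its swap-guidance term is not specified in this summary.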
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper surfaces posterior collapse in preference learning (especially for VPL), provides detailed diagnostics of why the user latent is ignored, and introduces a swap-guided remedy centered on the mirrored “swap” property to keep the latent informative.
2. SPL is derived from a clear ELBO with a swap-guidance regularizer. Analyses show how base regularization and the proposed P-IAF reduce swap-mismatch and prevent cross-context leakage, which explains why the posterior should not collapse.
1. Some recent methods reported results on UF-P and are discussed by the authors but not included in experiments (e.g., Nam et al., 2025), making it hard to situate SPL’s gains against the newest alternatives. Adding these would strengthen claims.
2. The approach adds flow-based inference (P-IAF) and swap-guidance terms, which plausibly increase compute and memory, but the paper does not report training time, inference latency, or budget-constrained comparisons. Reporting these costs would clari
- The paper is well written, with the motivation and background work sufficiently laid out.
- The authors clearly expose the posterior collapse problem with experimental evidence. This strongly motivates both the explanation of the issues with prior work and the solutions introduced.
- The data augmentation technique for regularisation seems to be very effective and low-overhead, which potentially makes it very efficient.
- The P-IAF architecture introduced ensures that the regularis
- An issue with the method is that it assumes that the user context is provided via binary preference labels. This does not seem scalable, as more recent works [1] have focused on expanding user context to multi-turn dialogue. It would be interesting if the authors could discuss the applicability of the introduced regularisation to other forms of context.
- The swap-based data augmentation seems to be a very interesting contribution. If the authors could include a baseline that trains a VPL b
1. The paper clearly identifies posterior collapse in personalized preference learning and introduces the intuitive idea of swap-guided mirroring, where swapping preferences flips the latent mean but keeps variance invariant, offering a novel and insightful diagnostic lens.
2. SPL integrates swap-guided base regularization, P-IAF, and adaptive latent conditioning into a coherent framework, directly addressing collapse while preserving user-specific information. The design is principled, interpre
1. SPL introduces many additional hyperparameters, such as $\beta$, $\gamma$, and $\eta$, but the paper does not analyze how robust the method is to their values or how much tuning would cost.
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Emotion and Mood Recognition · Recommender Systems and Techniques
