ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization
YuXuan Zhang

TL;DR
This paper introduces ARF-RLHF, a novel method that leverages natural language feedback to create continuous preference signals, enhancing the alignment of large language models more effectively than traditional binary-label approaches.
Contribution
The paper proposes ARF, a new approach that converts free-form feedback into continuous preference trajectories and optimizes them with TraceBias, improving RLHF performance.
Findings
ARF outperforms PPO and DPO in diverse settings.
ARF improves alignment by up to 7.6% over these baselines.
ARF provides a scalable, personalized RLHF framework.
Abstract
Current RLHF methods such as PPO and DPO typically reduce human preferences to binary labels, which are costly to obtain and too coarse to reflect individual variation. We observe that expressions of satisfaction and dissatisfaction follow stable linguistic patterns across users, indicating that more informative supervisory signals can be extracted from free-form feedback. Building on this insight, we introduce Adaptive Reward-Following (ARF), which converts natural feedback into continuous preference trajectories and optimizes them using the novel TraceBias algorithm. Across diverse LLMs and preference domains, ARF consistently outperforms PPO and DPO, improving alignment by up to 7.6%. Our results demonstrate that continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.
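As a rough illustration of the difference between a binary label and a continuous preference trajectory, the toy sketch below scores free-form feedback turn by turn and smooths the scores over time. It is not the paper's method: the cue-word lexicon, the `score_feedback` and `preference_trajectory` helpers, and the `smoothing` parameter are all hypothetical stand-ins, since this summary does not describe the actual ARF scorer or the TraceBias update.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only);
# the scorer used by ARF is not described in this summary.
POSITIVE = {"great", "thanks", "perfect", "helpful", "good"}
NEGATIVE = {"wrong", "useless", "bad", "confusing", "no"}


def score_feedback(text: str) -> float:
    """Map free-form feedback to a score in [-1, 1] by counting cue words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def preference_trajectory(feedback_turns, smoothing=0.5):
    """Exponentially smooth per-turn scores into a continuous trajectory."""
    trajectory, running = [], 0.0
    for text in feedback_turns:
        running = smoothing * running + (1.0 - smoothing) * score_feedback(text)
        trajectory.append(running)
    return trajectory


turns = [
    "That is wrong and confusing.",
    "Better, thanks.",
    "Perfect, very helpful!",
]
trajectory = preference_trajectory(turns)

# Collapsing to binary labels discards the magnitude of (dis)satisfaction.
binary_labels = [1 if s > 0 else 0 for s in trajectory]

print(trajectory)     # [-0.5, 0.25, 0.625]
print(binary_labels)  # [0, 1, 1]
```

The second and third turns receive the same binary label even though the smoothed score more than doubles between them, which is the kind of information the abstract argues binary labels are too coarse to capture.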
Taxonomy
Topics: Mental Health Research Topics
