TL;DR
This paper introduces DFA, a reinforcement learning algorithm that integrates rewards and preferences directly into policy updates, improving stability and performance over traditional reward-only methods.
Contribution
DFA fuses rewards and preferences into a single update rule using policy log-probabilities, avoiding separate reward modeling and enabling better performance and stability.
Findings
DFA matches or exceeds SAC in control tasks.
DFA outperforms reward-modeling RLHF baselines in stochastic environments.
DFA approaches oracle performance with limited preference data.
Abstract
We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses both individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human annotators (at the state level or trajectory level) or be synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we prove that minimizing DFA's preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments and demonstrates a more stable training process. With only a semi-synthetic preference dataset under a Bradley-Terry model, our algorithm outperforms reward-modeling…
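To make the abstract's mechanism concrete, here is a minimal sketch of a Bradley-Terry preference loss in which the policy's log-probabilities play the role of the preference model and the labels are synthesized from Q-values. It is an illustration under assumptions, not the paper's implementation: the discrete action set, the temperature `alpha`, the averaging over all ordered action pairs, and the function names `synth_preference_prob` and `preference_loss` are all hypothetical choices made for this example.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-probabilities of a categorical policy."""
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))


def synth_preference_prob(q_a, q_b, alpha=1.0):
    """Bradley-Terry probability that action a is preferred to action b.

    Assumption for this sketch: the label is a sigmoid of the Q-value gap
    scaled by a temperature alpha.
    """
    return 1.0 / (1.0 + np.exp(-(q_a - q_b) / alpha))


def preference_loss(logits, q_values, alpha=1.0):
    """Cross-entropy between Q-derived preference labels and the preference
    probability implied by the policy, sigmoid(log pi(a) - log pi(b)),
    averaged over all ordered action pairs at a single state."""
    logp = log_softmax(np.asarray(logits, dtype=float))
    n = len(q_values)
    total = 0.0
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            target = synth_preference_prob(q_values[a], q_values[b], alpha)
            gap = logp[a] - logp[b]
            # log(sigmoid(gap)) and log(1 - sigmoid(gap)), computed stably.
            log_pred = -np.logaddexp(0.0, -gap)
            log_not_pred = -np.logaddexp(0.0, gap)
            total -= target * log_pred + (1.0 - target) * log_not_pred
    return total / (n * (n - 1))


if __name__ == "__main__":
    q = np.array([1.0, 0.5, -0.3])   # hypothetical Q-values at one state
    alpha = 0.5
    print("uniform policy loss:", preference_loss(np.zeros(3), q, alpha))
    print("softmax(Q/alpha) loss:", preference_loss(q / alpha, q, alpha))
```

In this toy setting the loss is smallest when the log-probability gaps equal the scaled Q-value gaps, that is, when the policy is proportional to exp(Q/alpha). This has the same form as the entropy-regularized SAC policy that the abstract says the preference loss recovers, though the paper's exact loss and assumptions may differ from this sketch.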
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper proves that minimizing the DFA preference loss (under a BT assumption on $Q^*$) is equivalent to finding the optimal policy for the entropy-regularized SAC objective.
- Section 6.2 provides a convincing experiment demonstrating DFA's capabilities in a stochastic MDP where only preference feedback is available.
- The claims of "matching or exceeding SAC" rest on an algorithm that is confounded by a heuristic: a nearest-neighbor state search in the replay buffer to find a comparison. This heuristic is computationally expensive (a $k$-NN search on the buffer per gradient step). What would the training curves look like when plotted against wall-clock time?
- The practical algorithm (Section 4.2) relies on synthesizing preferences from the current, noisy Q-estimate, $Q_k$. The theory, however, relies on th
Strengths:
- The method fuses scalar rewards and pairwise preferences into one loss and update rule without requiring a separate reward model. This problem is important for the related field.
- Theorem 5.2 establishes that minimizing DFA’s state-wise preference loss recovers the SAC policy under Bradley–Terry preferences on the optimal Q-function.
- Support for off-policy training with replay buffers.
Weakness:
- There are no experiments with real human feedback. I strongly suggest that the authors conduct studies with real human feedback and a substantial number of subjects. If the method aims to use human feedback to improve the system, but no experiment is conducted on real human feedback, or too few individuals are used to demonstrate generalizability, it is difficult to convince the audience that the method is an effective approach for leveraging human feedback without human or with
This paper claims dual compatibility with both reward signals and preference feedback, and shows that the method can be used in both on-policy and off-policy settings.
The motivation for using dual feedback is not clearly explained. In addition, the experiments are not solid. The paper lacks strong baselines: it should compare against standard preference-based RL (PbRL) methods as well as common RL algorithms to properly demonstrate effectiveness. More diverse environments are also needed to support the claims.
