Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Zhengran Ji; Boyuan Chen

arXiv:2508.07126 · cs.LG · October 8, 2025

TL;DR

Pref-GUIDE converts real-time scalar human feedback into preference data, improving reward learning and policy training in online reinforcement learning, especially when the feedback is noisy.

Contribution

The paper proposes a preference-based framework that filters and aggregates human feedback to make the learned reward model more accurate and robust, outperforming scalar-feedback baselines.

Findings

01. Pref-GUIDE outperforms scalar-feedback methods in three environments.

02. The Voting variant surpasses dense-reward performance.

03. The framework handles noisy and inconsistent feedback effectively.

Abstract

Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short…
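
As a rough illustration of the idea described in the abstract, the sketch below turns a stream of scalar feedback into pairwise preference labels within short windows, drops near-tied pairs as ambiguous, and scores a pair with a Bradley-Terry cross-entropy loss. The window size, the ambiguity margin, the non-overlapping windowing, and the function names `scalar_to_preferences` and `bradley_terry_loss` are all assumptions made for illustration; the paper's actual procedure may differ.

```python
import math
from itertools import combinations


def scalar_to_preferences(feedback, window=4, margin=0.1):
    """Convert time-ordered scalar feedback into pairwise preference labels.

    feedback: list of (step_id, scalar) tuples in time order.
    window, margin: hypothetical hyperparameters, not values from the paper.
    Returns (id_a, id_b, label) tuples, where label is 1.0 if a is preferred
    and 0.0 if b is preferred.
    """
    pairs = []
    # Assumption: non-overlapping windows, so only temporally close
    # feedback values are ever compared with each other.
    for start in range(0, len(feedback), window):
        chunk = feedback[start:start + window]
        for (id_a, f_a), (id_b, f_b) in combinations(chunk, 2):
            diff = f_a - f_b
            # Assumption: near-ties are treated as ambiguous and filtered out.
            if abs(diff) <= margin:
                continue
            pairs.append((id_a, id_b, 1.0 if diff > 0 else 0.0))
    return pairs


def bradley_terry_loss(r_a, r_b, label):
    """Cross-entropy under the Bradley-Terry model P(a > b) = sigmoid(r_a - r_b)."""
    p_a = 1.0 / (1.0 + math.exp(-(r_a - r_b)))
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_a + eps) + (1.0 - label) * math.log(1.0 - p_a + eps))


if __name__ == "__main__":
    stream = [(0, 0.9), (1, 0.2), (2, 0.25), (3, -0.5)]
    for pair in scalar_to_preferences(stream):
        print(pair)
```

In this sketch, `r_a` and `r_b` stand for a reward model's predicted rewards for the two compared behaviors; a training loop would average `bradley_terry_loss` over the generated pairs and minimize it with respect to the reward model's parameters.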

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Multimodal Machine Learning Applications