The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards
Sukai Huang, Shu-Wei Liu, Nir Lipovetzky, Trevor Cohn

TL;DR
This paper investigates the negative impact of false positive rewards in Vision-Language Model-based reward signals for embodied agents, introduces BiMI to reduce noise, and demonstrates improved learning efficiency.
Contribution
It identifies false positive rewards as particularly harmful, analyzes the limitations of cosine similarity, and proposes BiMI as a novel reward function to mitigate reward noise.
Findings
False positive rewards significantly hinder learning.
BiMI improves training efficiency in navigation tasks.
Cosine similarity is prone to false positive errors.
Abstract
While Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents to follow instructions, our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic (exploration-driven) rewards, contradicting expectations set by recent work. We hypothesize that false positive rewards -- instances where unintended trajectories are incorrectly rewarded -- are more detrimental than false negatives. Our analysis confirms this hypothesis, revealing that the widely used cosine similarity metric is prone to false positive reward estimates. To address this, we introduce BiMI (Binary Mutual Information), a novel reward function designed to mitigate noise. BiMI significantly enhances learning efficiency across diverse and challenging embodied navigation environments. Our findings offer a nuanced…
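To make the false positive issue concrete, the following is a minimal sketch of a cosine-similarity reward computed between an instruction embedding and a trajectory embedding. The toy vectors, the `binarize` helper, and the 0.9 threshold are all hypothetical and chosen only for illustration; this does not reproduce BiMI, whose definition is not given in this summary. It only shows how a partially matching trajectory can still receive a sizeable raw cosine score, and how a simple threshold would suppress it.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def binarize(score, threshold):
    """Hypothetical helper: emit a reward of 1.0 only above the threshold."""
    return 1.0 if score >= threshold else 0.0

# Toy embeddings (assumptions, not outputs of a real VLM).
instruction = np.array([1.0, 1.0, 0.0])  # instruction with two required parts
intended = np.array([1.0, 1.0, 0.0])     # trajectory that completes both parts
unintended = np.array([1.0, 0.0, 0.0])   # trajectory that completes only one

raw_intended = cosine_similarity(instruction, intended)
raw_unintended = cosine_similarity(instruction, unintended)

# The unintended trajectory still earns a large raw reward (about 0.71),
# which is the kind of false positive the abstract describes.
print(f"raw reward, intended:   {raw_intended:.3f}")
print(f"raw reward, unintended: {raw_unintended:.3f}")

# With an illustrative threshold, only the intended trajectory is rewarded.
print("thresholded, intended:  ", binarize(raw_intended, 0.9))
print("thresholded, unintended:", binarize(raw_unintended, 0.9))
```

In this toy setting the raw score rewards the incomplete trajectory at roughly 70% of the full reward, while the thresholded variant rewards it with zero. Choosing the threshold is itself a design decision that this sketch does not address.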
Taxonomy
Topics: Reinforcement Learning in Robotics · Elevator Systems and Control · Evolutionary Algorithms and Applications
Methods: Sparse Evolutionary Training
