Training-free Generation of Temporally Consistent Rewards from VLMs
Yinuo Zhao, Jiale Yuan, Zhiyuan Xu, Xiaoshuai Hao, Xinyi Zhang, Kun Wu, Zhengping Che, Chi Harold Liu, Jian Tang

TL;DR
This paper introduces $\text{T}^2$-VLM, a training-free framework that generates accurate, temporally consistent rewards from vision-language models for robotic manipulation, improving decision-making and failure recovery without fine-tuning.
Contribution
The paper presents a novel training-free method that tracks status changes in VLM-derived subgoals to produce reliable rewards, enhancing RL performance in robotic tasks.
Findings
Achieves state-of-the-art results on robot manipulation benchmarks.
Provides accurate rewards with reduced computational costs.
Enhances long-horizon decision-making and failure recovery.
Abstract
Recent advances in vision-language models (VLMs) have significantly improved performance in embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time applicability. To address this, we propose $\text{T}^2$-VLM, a novel training-free, temporally consistent framework that generates accurate rewards through tracking the status changes in VLM-derived subgoals. Specifically, our method first queries the VLM to establish spatially aware subgoals and an initial completion estimate before each round of interaction. We then employ a Bayesian tracking algorithm to update the goal completion status dynamically, using subgoal hidden states to generate…
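The idea of tracking subgoal completion with a Bayesian update can be illustrated with a small sketch. This is not the paper's algorithm, whose details are not given in the abstract. It assumes a binary hidden state per subgoal (done or not done), a hypothetical noisy per-step completion detector with assumed true-positive and false-positive rates (`tpr`, `fpr`), and a reward defined as the change in the expected number of completed subgoals. The names `bayes_update` and `SubgoalTracker` are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass


def bayes_update(prior: float, observed_done: bool,
                 tpr: float = 0.9, fpr: float = 0.1) -> float:
    """Posterior P(subgoal done) after one noisy binary observation.

    tpr and fpr are assumed detector rates, not values from the paper.
    """
    # Keep the belief away from 0 and 1 so later evidence can still move it.
    prior = min(max(prior, 1e-6), 1.0 - 1e-6)
    like_done = tpr if observed_done else 1.0 - tpr
    like_not_done = fpr if observed_done else 1.0 - fpr
    evidence = like_done * prior + like_not_done * (1.0 - prior)
    return like_done * prior / evidence


@dataclass
class SubgoalTracker:
    """Holds one completion belief per subgoal (hypothetical helper)."""
    beliefs: list[float]

    def step(self, observations: list[bool]) -> float:
        """Update all beliefs and return the reward for this step.

        Reward = change in the expected number of completed subgoals,
        so a subgoal that appears undone again yields a negative reward.
        """
        before = sum(self.beliefs)
        self.beliefs = [bayes_update(p, o)
                        for p, o in zip(self.beliefs, observations)]
        return sum(self.beliefs) - before


# Initial completion estimates, e.g. as returned by a VLM query (assumed values).
tracker = SubgoalTracker(beliefs=[0.5, 0.5])
reward_progress = tracker.step([True, True])    # both subgoals look done
reward_setback = tracker.step([True, False])    # second subgoal looks undone
```

In this sketch the first step returns a positive reward because both beliefs rise, and the second returns a negative one because the second belief falls. A dense signal of this kind is one way to flag a failure and encourage recovery.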
