MVR: Multi-view Video Reward Shaping for Reinforcement Learning
Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li

TL;DR
This paper introduces Multi-View Video Reward Shaping (MVR), a novel framework that uses multi-view videos and vision-language models to improve reward signals in reinforcement learning for complex, dynamic tasks.
Contribution
MVR models state relevance using videos from multiple viewpoints and integrates task rewards with VLM guidance, addressing limitations of static image-based reward augmentation.
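As a rough illustration of the multi-view idea, the sketch below scores one behavior by comparing a text embedding with one video embedding per camera view and averaging the similarities. Everything here is an assumption made for illustration: the embeddings are hand-written placeholder vectors rather than outputs of a real video-text model, `multi_view_score` is a hypothetical helper, and mean aggregation is only one plausible choice. The summary above does not state how MVR combines views.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def multi_view_score(view_embeddings: Sequence[Sequence[float]],
                     text_embedding: Sequence[float]) -> float:
    """Average the video-text similarity over camera views.

    Hypothetical helper: mean aggregation is an assumption, not the paper's rule.
    """
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    sims = [cosine_similarity(v, text_embedding) for v in view_embeddings]
    return sum(sims) / len(sims)


# Placeholder embeddings: one view matches the task text, the other is orthogonal
# to it (for example, because the relevant motion is occluded from that camera).
text_embedding = [1.0, 0.0]
view_embeddings = [[1.0, 0.0], [0.0, 1.0]]
score = multi_view_score(view_embeddings, text_embedding)
```

With these placeholder vectors, a single occluded view would score 0.0 on its own, while the two-view average still retains the signal from the unobstructed camera.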
Findings
MVR improves performance on humanoid locomotion and manipulation tasks.
Video-text similarity effectively guides complex dynamic behaviors.
Ablation studies confirm the importance of multi-view and video-based approaches.
Abstract
Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a…
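The abstract's point about adding VLM scores "without explicit shaping" can be made concrete with classic potential-based reward shaping, in which a term of the form γΦ(s') − Φ(s) is added to the task reward. The sketch below is not MVR's formulation; it only shows, under assumed placeholder potentials and rewards, that such shaping terms telescope along a trajectory, so the shaped return differs from the task return only by boundary terms and the ranking of policies is preserved. Linearly adding a raw score per step has no such guarantee.

```python
from typing import Sequence


def shaped_reward(task_reward: float, phi_s: float, phi_next: float, gamma: float) -> float:
    """Task reward plus a potential-based shaping term: r + gamma * Phi(s') - Phi(s)."""
    return task_reward + gamma * phi_next - phi_s


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Placeholder values (assumptions): potentials Phi(s_0..s_3), e.g. a learned
# state-relevance score, and the task rewards for the three transitions.
potentials = [0.1, 0.4, 0.3, 0.9]
task_rewards = [0.0, 0.0, 1.0]
gamma = 0.99

shaped = [
    shaped_reward(r, potentials[t], potentials[t + 1], gamma)
    for t, r in enumerate(task_rewards)
]

# Extra return contributed by shaping, and its closed form after telescoping:
# gamma^T * Phi(s_T) - Phi(s_0), which does not depend on the intermediate states.
shaping_bonus = discounted_return(shaped, gamma) - discounted_return(task_rewards, gamma)
telescoped = gamma ** len(task_rewards) * potentials[-1] - potentials[0]
```

Because the bonus depends only on the first and last states, any two trajectories with the same endpoints receive the same shaping bonus regardless of the path taken between them.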
Peer Reviews
Decision: ICLR 2026 Poster
- This paper improves prior methods that measure rewards solely through static image-text similarity by introducing video-text-based rewards.
- Since both HumanoidBench and MetaWorld rely on reward engineering for task performance, the proposed method does not fundamentally solve the problem of reward shaping.
- In Section 3, it is confusing and inappropriate to use the same function notation f for both the off-the-shelf VLM’s text-similarity function and the function defined within the MVR framework, as they are conceptually different.
- The writing could be improved for clarity. For instance, in Section 4.1, two challenges: (1) the
1. The proposed method utilizes multi-view reference videos for dense reward shaping, which can generalize to a variety of tasks without a huge amount of human effort for reward design.
2. The multi-view reference videos can improve spatial understanding, which enhances training efficiency.
3. This paper conducts extensive experiments across 19 tasks in two simulation benchmarks.
1. During the early stage of training, the video quality might be quite low, even when choosing the top-k trajectories. Such low-quality data might provide limited guidance for completing the task, especially for challenging tasks such as stick pull and hammer in MetaWorld.
2. Some previous works rely on generative models with reference trajectories and videos for reward shaping, such as using reference trajectories [1], generated robot videos [2, 3, 4], and generated object motions [5, 6] f
Strong motivation and sound strategy
- The paper clearly motivates its aim of guiding policies toward optimal motion patterns. The VLM guidance automatically decays during training, curbing early suboptimal actions and ensuring convergence to the true task reward (not the shaping signal), achieving the goal in a principled way.
Thorough empirical analysis
- The paper provides comprehensive analyses, baselines, and ablations of its design choices.
- The paper acknowledges cases where it trails
Loss of temporal information?
- Although the method trains state sequences to match task relevance, the final reward in Eq. (9) is computed from a single timestep. Does this discard temporal information, thereby weakening the benefits of a video-based approach? This concern may be due to my misunderstanding; please clarify how temporal cues are preserved.
Notation & Clarity
- L187: I believe the term “similarity fluctuation” isn’t defined and was confusing, especially in the opening problem para
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Reinforcement Learning in Robotics
