Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

Qi Wang; Mian Wu; Yuyang Zhang; Mingqi Yuan; Wenyao Zhang; Haoxiang You; Yunbo Wang; Xin Jin; Xiaokang Yang; Wenjun Zeng

arXiv:2512.00961 · cs.LG · April 6, 2026

TL;DR

This paper introduces a reward mechanism for reinforcement learning that uses pretrained video diffusion models to supply goal-driven reward signals, removing the need for manually designed reward functions and aiming for better generalization across tasks.

Contribution

The paper proposes using pretrained video diffusion models as reward functions, drawing on their latent representations and on CLIP-based frame relevance to guide RL agents toward goals.
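
To make the idea of a latent-alignment reward concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the use of cosine similarity as the alignment measure, the choice to pick the single most relevant frame, and the function names `video_level_reward` and `most_relevant_frame` are all assumptions made for illustration. The latent vectors and embeddings are assumed to come from a video encoder and a CLIP-style encoder that are not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def video_level_reward(trajectory_latent: Sequence[float],
                       goal_latent: Sequence[float]) -> float:
    """Hypothetical video-level reward: alignment of two latent vectors.

    Assumption: alignment is measured with cosine similarity; the source
    text does not specify the measure.
    """
    return cosine_similarity(trajectory_latent, goal_latent)


def most_relevant_frame(frame_embeddings: Sequence[Sequence[float]],
                        text_embedding: Sequence[float]) -> int:
    """Hypothetical frame-relevance step: index of the frame embedding
    most similar to a task-description embedding (CLIP-style scoring)."""
    scores = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, identical latents give a reward of 1.0 and orthogonal latents give 0.0, so an agent whose trajectory embedding moves toward the goal embedding receives a higher reward.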

Findings

01 Effective in Meta-World and Distracting Control Suite environments.

02 Improves goal achievement without manual reward engineering.

03 Utilizes domain-specific fine-tuning of video diffusion models.

Abstract

Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging, and the resulting rewards may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad hoc reward design. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of the agent's trajectories and the…
