Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning
Yifan Wang, Yanyu Li, Gordon Guocheng Qian, Sergey Tulyakov, Yun Fu, Anil Kag

TL;DR
Diffusion-DRF is a differentiable, information-rich reward framework for fine-tuning video diffusion models. It uses a frozen, off-the-shelf vision-language model as the critic, so it needs neither reward model training nor a preference dataset.
Contribution
It proposes a new reward framework that leverages multi-dimensional feedback from vision-language models for stable, dataset-free video diffusion fine-tuning.
Findings
Outperforms state-of-the-art Flow-GRPO by 4.74% on VBench-2.0
Provides more stable and informative reward signals
Eliminates the need for reward model training and preference datasets
Abstract
Video diffusion alignment has relied heavily on scalar rewards. These rewards are typically produced by reward models trained on human preference datasets, which requires additional training and extensive data collection. Moreover, scalar rewards provide only coarse, global supervision: they offer limited credit assignment for mismatches between the prompt and the generated video, and they leave models prone to reward exploitation and unstable optimization. We propose Diffusion-DRF, a free, rich, and differentiable reward framework for video diffusion fine-tuning. Diffusion-DRF employs a frozen, off-the-shelf Vision-Language Model (VLM) as the critic, eliminating the need for reward model training. Instead of relying on a single scalar reward, it decomposes each user prompt into multi-dimensional questions with freeform dense VQA explanation queries, yielding information-rich feedback. By direct differentiable optimization over this…
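The contrast between a single scalar reward and per-question feedback can be illustrated with a small sketch. This is not the authors' implementation: the `Question` type, the dimension names, the example questions, and the stub critic with hard-coded probabilities are all hypothetical. In the paper the critic is a frozen VLM and the reward is optimized differentiably; this pure-Python stub computes no gradients and only shows how answers to decomposed questions might be aggregated into per-dimension scores.

```python
from dataclasses import dataclass
from math import log
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Question:
    dimension: str  # hypothetical dimension label, e.g. "motion"
    text: str       # yes/no question derived from the user prompt


def rich_reward(
    questions: List[Question],
    yes_prob: Callable[[Question], float],
) -> Tuple[Dict[str, float], float]:
    """Aggregate critic answers into per-dimension scores and an overall mean.

    Each score is the mean log-probability that the critic answers "yes"
    to the questions in that dimension (higher is better, at most 0).
    """
    per_dim: Dict[str, List[float]] = {}
    for q in questions:
        # Clamp to avoid log(0); the floor value is an arbitrary choice.
        p = min(max(yes_prob(q), 1e-6), 1.0)
        per_dim.setdefault(q.dimension, []).append(log(p))
    dim_scores = {d: sum(v) / len(v) for d, v in per_dim.items()}
    overall = sum(dim_scores.values()) / len(dim_scores)
    return dim_scores, overall


# Hypothetical decomposition of the prompt
# "a dog running on a beach, watercolor style".
questions = [
    Question("subject", "Is there a dog in the video?"),
    Question("motion", "Is the dog running?"),
    Question("style", "Is the video rendered in watercolor style?"),
]

# Stub critic: made-up probabilities standing in for a frozen VLM.
stub_probs = {
    "Is there a dog in the video?": 0.9,
    "Is the dog running?": 0.2,
    "Is the video rendered in watercolor style?": 0.8,
}

dim_scores, overall = rich_reward(questions, lambda q: stub_probs[q.text])
weakest = min(dim_scores, key=dim_scores.get)
print(weakest, dim_scores, overall)
```

With these made-up numbers the overall mean alone would only say the video is imperfect, while the per-dimension scores point at the "motion" question as the mismatch. That localization is the kind of credit assignment the abstract says scalar rewards lack.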
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
