Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning

Yifan Wang; Yanyu Li; Gordon Guocheng Qian; Sergey Tulyakov; Yun Fu; Anil Kag

arXiv:2601.04153 · cs.CV · March 18, 2026


TL;DR

Diffusion-DRF is a free, rich, and differentiable reward framework for fine-tuning video diffusion models. It uses a frozen, off-the-shelf vision-language model as the critic, which removes the need for reward model training and preference datasets, and it outperforms Flow-GRPO by 4.74% on VBench-2.0.

Contribution

The paper proposes a reward framework that uses multi-dimensional feedback from a frozen vision-language model to fine-tune video diffusion models stably, without a preference dataset.

Findings

01 Outperforms state-of-the-art Flow-GRPO by 4.74% on VBench-2.0

02 Provides more stable and informative reward signals than a single scalar reward

03 Eliminates the need for reward model training and preference datasets

Abstract

Video diffusion alignment has relied heavily on scalar rewards. These rewards are typically derived from reward models learned on human preference datasets, which requires additional training and extensive data collection. Moreover, scalar rewards provide coarse, global supervision: they offer limited credit assignment for mismatches between prompt and generation, and they leave models prone to reward exploitation and unstable optimization. We propose Diffusion-DRF, a free, rich, and differentiable reward framework for video diffusion fine-tuning. Diffusion-DRF employs a frozen, off-the-shelf Vision-Language Model (VLM) as the critic, eliminating the need for reward model training. Instead of relying on a single scalar reward, it decomposes each user prompt into multi-dimensional questions with freeform dense VQA explanation queries, yielding information-rich feedback. By direct differentiable optimization over this…
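The idea of multi-dimensional, differentiable feedback from a frozen critic can be illustrated with a toy sketch. This is not the paper's implementation: the critic here is a hypothetical linear scorer rather than a VLM, the "video" is a three-number feature vector, and the names `QUESTIONS`, `yes_logit`, and `reward_and_grad` are invented for illustration. The sketch only shows how a per-question reward yields a gradient that pushes hardest on the dimension the critic rates lowest, which a single scalar score does not expose directly.

```python
import math

# Hypothetical stand-in for a frozen critic: one fixed weight vector per
# question. A real critic would be a vision-language model; this is a toy.
QUESTIONS = {
    "subject": [1.0, 0.0, 0.0],
    "motion":  [0.0, 1.0, 0.0],
    "setting": [0.0, 0.0, 1.0],
}

def yes_logit(weights, video):
    """Toy logit of the critic answering "yes" to one question."""
    return sum(w * v for w, v in zip(weights, video))

def per_question_rewards(video):
    """log sigmoid(logit) for each question: the multi-dimensional feedback."""
    return {q: -math.log1p(math.exp(-yes_logit(w, video)))
            for q, w in QUESTIONS.items()}

def reward_and_grad(video):
    """Mean per-question reward and its gradient w.r.t. the video features."""
    n = len(QUESTIONS)
    rewards = per_question_rewards(video)
    grad = [0.0] * len(video)
    for w in QUESTIONS.values():
        z = yes_logit(w, video)
        # d/dz log sigmoid(z) = 1 - sigmoid(z) = 1 / (1 + exp(z))
        coeff = 1.0 / (1.0 + math.exp(z))
        for i, wi in enumerate(w):
            grad[i] += coeff * wi / n
    return sum(rewards.values()) / n, grad

# Gradient ascent on the toy "video"; the critic's weights never change.
video = [2.0, -2.0, 0.5]  # strong on subject, weak on motion
start, _ = reward_and_grad(video)
for _ in range(200):
    _, g = reward_and_grad(video)
    video = [v + 0.5 * gi for v, gi in zip(video, g)]
end, _ = reward_and_grad(video)
```

In this toy, the first gradient step is largest along the "motion" feature, because that is the question with the lowest score. The critic is never updated; only the input being scored changes.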

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning