InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

TL;DR
InstructVideo fine-tunes text-to-video diffusion models with human feedback. It recasts reward fine-tuning as editing, so only part of the DDIM sampling chain has to be run, and it repurposes image reward models to align generated videos with human preferences.
Contribution
The paper presents a novel reward fine-tuning method that reduces computational costs and leverages image reward models for improved video generation quality.
Findings
Enhanced video quality with human feedback fine-tuning
Reduced fine-tuning computational cost through partial inference
Effective use of image reward models for video preference alignment
Abstract
Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2.…
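The "reward fine-tuning as editing" idea in the abstract can be illustrated with a minimal NumPy sketch: corrupt an existing video with the forward diffusion process up to an intermediate timestep, then run only the remaining DDIM steps instead of the full chain. This is a sketch under stated assumptions, not the authors' implementation. The linear beta schedule, the function names (`make_alpha_bar`, `corrupt`, `ddim_step`, `partial_ddim_edit`), the `start_frac` parameter, and the `eps_model` callable are all hypothetical, and the reward computation and gradient update are omitted.

```python
import numpy as np

def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule; returns the cumulative products alpha_bar_t.
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def corrupt(x0, t, alpha_bar, rng):
    # Forward diffusion q(x_t | x_0): mix the clean video with Gaussian noise.
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def ddim_step(x_t, eps, t, t_prev, alpha_bar):
    # Deterministic DDIM update (eta = 0) from timestep t to t_prev.
    a_t = alpha_bar[t]
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps

def partial_ddim_edit(x0, eps_model, alpha_bar,
                      num_ddim_steps=20, start_frac=0.3, rng=None):
    # Corrupt x0 to an intermediate timestep, then denoise with only the
    # last `start_frac` of the DDIM schedule rather than all of it.
    if rng is None:
        rng = np.random.default_rng(0)
    num_train_steps = len(alpha_bar)
    # Full DDIM schedule, descending from the noisiest timestep to 0.
    schedule = np.linspace(num_train_steps - 1, 0, num_ddim_steps).round().astype(int)
    keep = max(1, int(round(start_frac * num_ddim_steps)))
    partial = schedule[-keep:]
    x = corrupt(x0, partial[0], alpha_bar, rng)
    for i, t in enumerate(partial):
        t_prev = partial[i + 1] if i + 1 < len(partial) else -1
        eps = eps_model(x, t)  # hypothetical noise-prediction network
        x = ddim_step(x, eps, t, t_prev, alpha_bar)
    return x, len(partial)
```

With `num_ddim_steps=20` and `start_frac=0.3`, the loop calls the noise predictor 6 times instead of 20. In a reward fine-tuning loop, the returned video would then be scored by a reward model and the score used to update the diffusion model.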
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment
Methods: Diffusion
