V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models
Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu

TL;DR
This paper introduces V.I.P., a framework that combines iterative online preference distillation with dataset filtering to shrink video diffusion models (36.2% and 67.5% fewer parameters on two T2V models) while maintaining or improving their quality.
Contribution
The paper proposes ReDPO, a distillation method that combines Direct Preference Optimization (DPO) with supervised fine-tuning (SFT), and V.I.P., a dataset filtering framework, to improve both the efficiency and the quality of text-to-video models.
Findings
Achieved 36.2% and 67.5% parameter reduction on two T2V models.
Maintained or surpassed full model performance after distillation.
Validated effectiveness on multiple leading T2V models.
Abstract
With growing interest in deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost has become crucial, leading to extensive research on pruning and knowledge distillation methods while maintaining performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher's outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method, ReDPO, that integrates DPO and SFT. Our approach leverages DPO to guide the student model to focus on recovering only the targeted properties, rather than passively imitating the teacher, while also utilizing SFT to enhance overall performance. We additionally propose V.I.P., a novel framework for filtering and…
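The abstract excerpt above does not give the ReDPO objective, so the following is only a minimal sketch of how a preference term and an SFT term could be combined into one loss. It assumes a Diffusion-DPO-style surrogate in which scalar denoising errors stand in for log-likelihoods. The function names, the `beta` and `sft_weight` parameters, and the choice to apply the SFT term to the preferred sample are all hypothetical and are not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_term(err_w: float, err_l: float,
             ref_err_w: float, ref_err_l: float,
             beta: float = 1.0) -> float:
    """Preference loss on scalar denoising errors (assumed surrogate).

    err_w / err_l: student error on the preferred / rejected video.
    ref_err_w / ref_err_l: same errors for a frozen reference model.
    The loss falls when the student improves on the preferred sample
    more than it improves on the rejected one, relative to the reference.
    """
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)
    return -log_sigmoid(-beta * margin)


def combined_loss(err_w: float, err_l: float,
                  ref_err_w: float, ref_err_l: float,
                  beta: float = 1.0, sft_weight: float = 1.0) -> float:
    """Hypothetical DPO + SFT objective.

    The SFT term here is simply the student's denoising error on the
    preferred sample, weighted by sft_weight (an assumed hyperparameter).
    """
    preference = dpo_term(err_w, err_l, ref_err_w, ref_err_l, beta)
    return preference + sft_weight * err_w
```

When the student matches the reference on both samples, the preference term equals log 2 and contributes no directional signal, so only the SFT term drives the update. Setting `sft_weight=0.0` recovers a pure preference objective in this sketch.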
