Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

TL;DR
This paper introduces a data-efficient fine-tuning method for controllable text-to-video generation. The method learns new controls from sparse, low-quality synthetic data and outperforms models fine-tuned on photorealistic real data.
Contribution
The authors propose a fine-tuning strategy that adds new controls to video generation, such as physical camera parameters (e.g., shutter speed or aperture), using sparse, low-quality synthetic data. They also provide a framework that explains why this works.
Findings
Fine-tuning on sparse, low-quality synthetic data is enough to enable the desired generative controls.
Models fine-tuned on this synthetic data outperform models fine-tuned on photorealistic real data.
The accompanying framework justifies this result both intuitively and quantitatively.
Abstract
Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
