Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng; Nilesh Kulkarni; David Hyde; Dmitriy Smirnov

arXiv:2511.17844·cs.CV·April 9, 2026

TL;DR

This paper introduces a data-efficient fine-tuning method that teaches text-to-video diffusion models new controls, such as camera shutter speed or aperture, from sparse, low-quality synthetic data, and reports better results than fine-tuning on photorealistic real data.

Contribution

The authors propose a fine-tuning strategy that adds new generative controls to text-to-video diffusion models using sparse, low-quality synthetic data, together with a framework that justifies its effectiveness both intuitively and quantitatively.

Findings

01

Fine-tuning on sparse, low-quality synthetic data is enough to enable the desired controls, such as shutter speed or aperture.

02

Models fine-tuned on simple synthetic data outperform models fine-tuned on photorealistic real data.

03

The framework provides intuitive and quantitative justification for the results.

Abstract

Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
