Interactive Video Generation via Domain Adaptation
Ishaan Rawal, Suryansh Kumar

TL;DR
This paper introduces domain adaptation-inspired techniques, including mask normalization and a temporal intrinsic diffusion prior, to enhance interactive video generation quality and control in text-conditioned diffusion models.
Contribution
The paper proposes two domain adaptation-inspired methods, mask normalization and a temporal intrinsic diffusion prior, to address perceptual degradation and the initialization gap in interactive video generation, improving both quality and controllability.
Findings
Mask normalization reduces perceptual degradation.
Temporal intrinsic diffusion prior improves spatio-temporal consistency.
Together, the two techniques give better control over object trajectories in generated videos.
Abstract
Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address the initialization gap, where the randomly sampled initial…
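The first failure mode, a shift in activation statistics when attention is masked, and the general idea of correcting it by distribution matching can be illustrated with a small sketch. The abstract does not give the formula for mask normalization, so everything below is an assumption made for illustration: the helper names (`attention`, `column_stats`, `match_moments`), the toy half-and-half mask, and the choice of per-feature moment matching against the unmasked output as the form of distribution matching. It is not the authors' layer.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, mask=None):
    """Scaled dot-product attention; mask[i][j] == False blocks query i from key j."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for i, q in enumerate(queries):
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        if mask is not None:
            # Blocked positions get a large negative score, so their weight is ~0.
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(w[j] * values[j][d] for j in range(len(values)))
                    for d in range(len(values[0]))])
    return out


def column_stats(rows):
    """Per-feature mean and (population) standard deviation across tokens."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n)
            for d in range(dims)]
    return means, stds


def match_moments(source, reference, eps=1e-6):
    """Assumed stand-in for distribution matching: shift and scale each
    feature of `source` so its mean/std match those of `reference`."""
    s_mean, s_std = column_stats(source)
    r_mean, r_std = column_stats(reference)
    return [[(x - s_mean[d]) / (s_std[d] + eps) * r_std[d] + r_mean[d]
             for d, x in enumerate(row)] for row in source]


random.seed(0)
n_tokens, dim = 8, 4
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

# Hypothetical trajectory mask: tokens attend only within their own half.
mask = [[(i < 4) == (j < 4) for j in range(n_tokens)] for i in range(n_tokens)]

plain = attention(x, x, x)          # what a pretrained model was trained on
masked = attention(x, x, x, mask)   # statistics differ from `plain`
normalized = match_moments(masked, plain)
```

After the last line, `column_stats(normalized)` agrees with `column_stats(plain)` up to the small `eps` term, while `column_stats(masked)` generally does not. Using the unmasked output on the same input as the reference is a convenience of this toy setup; how the reference statistics are obtained in the actual method is not stated in the abstract.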
Taxonomy
Topics: Video Analysis and Summarization · Advanced Vision and Imaging · Multimedia Communication and Technology
Methods: Softmax · Attention Is All You Need · Diffusion · ALIGN
