Interactive Video Generation via Domain Adaptation

Ishaan Rawal; Suryansh Kumar

arXiv:2505.24253·cs.CV·June 2, 2025

TL;DR

This paper introduces domain adaptation-inspired techniques, including mask normalization and a temporal intrinsic diffusion prior, to enhance interactive video generation quality and control in text-conditioned diffusion models.

Contribution

The paper proposes two methods inspired by domain adaptation, mask normalization and a temporal intrinsic diffusion prior, to address perceptual degradation and the initialization gap in interactive video generation, improving both quality and controllability.

Findings

1. Mask normalization reduces perceptual degradation.
2. Temporal intrinsic diffusion prior improves spatio-temporal consistency.
3. Enhanced control over object trajectories in generated videos.

Abstract

Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address initialization gap, where the randomly sampled initial…
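
The abstract attributes the quality loss to a shift in activation statistics once attention is masked, and describes mask normalization as a fix based on distribution matching. The page does not say how that layer is built, so the following is only a toy sketch of the general idea, not the authors' method: it compares attention outputs with and without a mask and then matches the first two moments of the masked outputs to the unmasked ones. The sizes, the mask pattern, and the helper names (`attend`, `match_moments`) are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; logits of -inf receive zero weight."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, mask=None):
    """Scaled dot-product attention for one query over scalar values.

    mask[i] == False blocks key i by setting its logit to -inf.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = []
    for i, key in enumerate(keys):
        score = scale * sum(q * k for q, k in zip(query, key))
        if mask is not None and not mask[i]:
            score = -math.inf
        logits.append(score)
    weights = softmax(logits)
    return sum(w * v for w, v in zip(weights, values))


def mean_std(xs):
    """Population mean and standard deviation of a list of floats."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)


def match_moments(xs, ref, eps=1e-8):
    """Shift and rescale xs so its mean and std match those of ref.

    Assumption: a generic moment-matching step standing in for
    "distribution matching"; the paper's actual layer may differ.
    """
    mu_x, sd_x = mean_std(xs)
    mu_r, sd_r = mean_std(ref)
    return [(x - mu_x) / (sd_x + eps) * sd_r + mu_r for x in xs]


random.seed(0)
dim, n_keys, n_queries = 8, 16, 200  # hypothetical toy sizes
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
values = [random.gauss(0, 1) for _ in range(n_keys)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
mask = [i < 4 for i in range(n_keys)]  # hypothetical: only 4 keys stay visible

plain = [attend(q, keys, values) for q in queries]          # statistics the model was trained on
masked = [attend(q, keys, values, mask) for q in queries]   # shifted statistics under masking
matched = match_moments(masked, plain)                      # masked outputs pulled back toward them

print("unmasked mean/std:", mean_std(plain))
print("masked   mean/std:", mean_std(masked))
print("matched  mean/std:", mean_std(matched))
```

Restricting each query to a few keys changes the mean and spread of the attention output, which is the kind of shift a pretrained network never saw during training. Matching moments is one simple way to undo that shift before the next layer consumes the activations.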

Taxonomy

Topics: Video Analysis and Summarization · Advanced Vision and Imaging · Multimedia Communication and Technology

Methods: Softmax · Attention Is All You Need · Diffusion · ALIGN