SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang

TL;DR
SwitchCraft is a training-free framework for multi-event video generation that aligns frame-level attention with event prompts, improving scene clarity and temporal consistency.
Contribution
It introduces Event-Aligned Query Steering (EAQS) and the Auto-Balance Strength Solver (ABSS) to address the challenges of multi-event video generation without any training.
Findings
Significantly improves prompt alignment and event clarity.
Enhances scene consistency in multi-event videos.
Operates without additional training or fine-tuning.
Abstract
Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive…
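The abstract does not give the EAQS or ABSS formulation, so the following is only a toy illustration of the general idea of making each frame attend mainly to its own event's prompt tokens. Everything in it is an assumption made for illustration: the uniform frame-to-event schedule, the additive bias on attention logits (a stand-in, not the paper's query steering), the fixed `strength` value (where ABSS would choose it adaptively), and all function names.

```python
import math


def assign_frames_to_events(num_frames, num_events):
    """Hypothetical uniform schedule: split the frames into equal event segments."""
    return [min(f * num_events // num_frames, num_events - 1)
            for f in range(num_frames)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def steer_attention(logits, token_event, frame_event, strength):
    """Bias per-frame attention logits toward the tokens of the frame's event.

    logits:      [num_frames][num_tokens] raw attention scores
    token_event: event index of each prompt token
    frame_event: event index of each frame
    strength:    fixed bias magnitude (assumed; not solved for adaptively)
    """
    steered = []
    for f, row in enumerate(logits):
        biased = [
            score + (strength if token_event[t] == frame_event[f] else -strength)
            for t, score in enumerate(row)
        ]
        steered.append(softmax(biased))
    return steered


# Two events, four frames, four prompt tokens (two tokens per event).
frame_event = assign_frames_to_events(num_frames=4, num_events=2)
token_event = [0, 0, 1, 1]
flat_logits = [[0.0] * 4 for _ in range(4)]
attention = steer_attention(flat_logits, token_event, frame_event, strength=2.0)
```

With `strength=0.0` every frame attends uniformly to all tokens, which corresponds to the uniform prompt injection the abstract identifies as the problem. Raising `strength` shifts each frame's attention mass toward its own event's tokens.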
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation
