SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

Qianxun Xu; Chenxi Song; Yujun Cai; Chi Zhang

arXiv:2602.23956·cs.CV·March 24, 2026

Open Access

TL;DR

SwitchCraft is a training-free framework that improves multi-event video generation by aligning attention with event prompts, enhancing scene clarity and temporal consistency without additional training.

Contribution

The paper introduces Event-Aligned Query Steering (EAQS) and the Auto-Balance Strength Solver (ABSS), which together address multi-event video generation without any training.

Findings

01. Significantly improves prompt alignment and event clarity.

02. Enhances scene consistency in multi-event videos.

03. Operates without additional training or fine-tuning.

Abstract

Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive…
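The idea of matching frames to event prompts, rather than injecting the whole prompt uniformly across time, can be illustrated with a toy NumPy sketch. This is not the authors' method: the abstract does not specify how EAQS modifies queries or how ABSS chooses the steering strength. The sketch instead assumes a simple additive bias on cross-attention logits with a fixed `strength`, and every name in it (`frame_event_weights`, `steer_attention`, `sharpness`, `strength`, `token_event`) is hypothetical.

```python
import numpy as np

def frame_event_weights(num_frames, num_events, sharpness=4.0):
    """Softly assign each frame to events laid out in order along the clip.

    Returns an array of shape (num_frames, num_events) whose rows sum to 1.
    Assumption: events occupy equal-length, consecutive time spans.
    """
    # Frame positions and event centers, both normalized to [0, 1].
    t = (np.arange(num_frames) + 0.5) / num_frames
    centers = (np.arange(num_events) + 0.5) / num_events
    # Events whose center is closer in time get a larger logit.
    logits = -sharpness * num_events * np.abs(t[:, None] - centers[None, :])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def steer_attention(logits, token_event, weights, strength=2.0):
    """Bias cross-attention so each frame favors its own event's tokens.

    logits:      (frames, queries, tokens) raw attention scores
    token_event: (tokens,) event index per prompt token, -1 for shared tokens
    weights:     (frames, events) output of frame_event_weights
    strength:    fixed here; an adaptive solver would choose it instead
    """
    bias = np.zeros((logits.shape[0], logits.shape[2]))
    owned = token_event >= 0  # shared tokens keep a zero bias
    bias[:, owned] = strength * np.log(weights[:, token_event[owned]] + 1e-8)
    biased = logits + bias[:, None, :]
    biased -= biased.max(axis=-1, keepdims=True)
    p = np.exp(biased)
    return p / p.sum(axis=-1, keepdims=True)

# Toy usage: 8 frames, 2 events, 4 prompt tokens (two per event).
weights = frame_event_weights(num_frames=8, num_events=2)
token_event = np.array([0, 0, 1, 1])
attn = steer_attention(np.zeros((8, 3, 4)), token_event, weights)
```

With uniform input scores, early frames end up attending mostly to the first event's tokens and late frames to the second's. A larger `strength` sharpens that separation, which hints at why the steering strength has to be balanced rather than maximized.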

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation