TL;DR
SWIFT is a training-free framework for efficient, coherent multi-prompt long-video generation. It combines adaptive memory management with semantic injection, significantly reducing inference cost.
Contribution
SWIFT introduces a Semantic Injection Cache, head-wise semantic injection, and an Adaptive Dynamic Window to improve efficiency and semantic coherence in long-video diffusion models.
Findings
Achieves 22.6 FPS on a single H100 GPU.
Preserves generation quality compared to state-of-the-art methods.
Reduces average inference cost through adaptive memory allocation.
Abstract
Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than…
