SSG-DiT: A Spatial Signal Guided Framework for Controllable Video Generation
Peng Hu, Yu Gu, Liang Luo, and Fuji Ren

TL;DR
SSG-DiT is a two-stage framework for controllable video generation that improves semantic consistency and spatial relationship control by using spatial signals and a lightweight adapter. It reports state-of-the-art results on the VBench benchmark.
Contribution
The paper presents SSG-DiT, a new framework combining spatial signal prompting and a dual-branch attention mechanism for improved controllable video synthesis.
Findings
Outperforms existing models on the VBench benchmark
Achieves superior spatial relationship control
Maintains high semantic consistency in generated videos
Abstract
Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT…
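To make the conditioning idea concrete, the following is a minimal NumPy sketch of one way a dual-branch cross-attention adapter could combine a text condition with a visual prompt. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight layout in `params`, and the tanh gate are all hypothetical, and the truncated abstract does not confirm any of them. The gate is a device borrowed from common adapter designs so that a closed gate reproduces the frozen text-only branch.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, k, v):
    # Scaled dot-product attention: q is (n_q, d), k and v are (n_kv, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def dual_branch_attention(video_tokens, text_tokens, prompt_tokens, params, gate):
    # Hypothetical layout: the query and text projections stand in for the
    # frozen backbone; the "vis" projections and the gate stand in for a
    # lightweight adapter. None of this is taken from the paper.
    q = video_tokens @ params["w_q"]
    text_out = cross_attention(
        q, text_tokens @ params["w_k_text"], text_tokens @ params["w_v_text"]
    )
    vis_out = cross_attention(
        q, prompt_tokens @ params["w_k_vis"], prompt_tokens @ params["w_v_vis"]
    )
    # tanh(0) == 0, so gate=0.0 returns exactly the text-only branch.
    return text_out + np.tanh(gate) * vis_out


# Toy inputs: 6 video tokens, 4 text tokens, 3 visual-prompt tokens, width 8.
rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(6, d))
text = rng.normal(size=(4, d))
prompt = rng.normal(size=(3, d))
params = {
    name: rng.normal(size=(d, d))
    for name in ("w_q", "w_k_text", "w_v_text", "w_k_vis", "w_v_vis")
}

out_closed = dual_branch_attention(video, text, prompt, params, gate=0.0)
out_open = dual_branch_attention(video, text, prompt, params, gate=1.0)
print(out_closed.shape, out_open.shape)
```

With the gate closed the output equals the text-only cross-attention, so in this sketch the visual prompt only influences the result once the gate moves away from zero.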
