SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation

Peng Hu, Yu Gu, Liang Luo, and Fuji Ren

arXiv:2508.17062 · cs.CV · August 26, 2025

TL;DR

SSG-DiT is a two-stage framework for controllable video generation. It improves semantic consistency and spatial relationship control by conditioning generation on spatial signals through a lightweight adapter, and it reports state-of-the-art results on the VBench benchmark.

Contribution

The paper presents SSG-DiT, a framework that combines spatial signal prompting with a dual-branch attention mechanism to improve controllable video synthesis.

Findings

1. Outperforms existing models on the VBench benchmark
2. Achieves superior spatial relationship control
3. Maintains high semantic consistency in generated videos

Abstract

Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT…
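
The abstract is cut off before it explains how the joint condition reaches the frozen video DiT. As a rough illustration only, the sketch below shows one common way a lightweight adapter can add a second conditioning signal to a frozen attention layer: the frozen branch attends to text tokens, a separate trainable branch attends to visual prompt tokens, and the two outputs are summed. This decoupled design is an assumption, not the paper's confirmed method, and every name here (`dual_branch_attention`, `gate`, the weight dictionaries, the token shapes) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_k, w_v):
    """Single-head cross-attention: each query attends over the context tokens.

    The query projection is omitted for brevity; queries are used as given.
    """
    keys = context @ w_k
    values = context @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def dual_branch_attention(video_tokens, text_tokens, prompt_tokens,
                          frozen, adapter, gate):
    """Hypothetical dual-branch layer (an assumption, not the paper's code).

    frozen:  key/value weights of the pre-trained model, never updated.
    adapter: key/value weights of the lightweight trainable branch.
    gate:    scalar weight on the adapter branch; 0.0 recovers the frozen model.
    """
    text_out = cross_attention(video_tokens, text_tokens,
                               frozen["w_k"], frozen["w_v"])
    prompt_out = cross_attention(video_tokens, prompt_tokens,
                                 adapter["w_k"], adapter["w_v"])
    return text_out + gate * prompt_out


# Toy shapes chosen for illustration only.
rng = np.random.default_rng(0)
dim = 8
video_tokens = rng.normal(size=(16, dim))   # latent video tokens (queries)
text_tokens = rng.normal(size=(5, dim))     # encoded text prompt
prompt_tokens = rng.normal(size=(4, dim))   # encoded spatially aware visual prompt
frozen = {"w_k": rng.normal(size=(dim, dim)), "w_v": rng.normal(size=(dim, dim))}
adapter = {"w_k": rng.normal(size=(dim, dim)), "w_v": rng.normal(size=(dim, dim))}

out = dual_branch_attention(video_tokens, text_tokens, prompt_tokens,
                            frozen, adapter, gate=0.5)
baseline = dual_branch_attention(video_tokens, text_tokens, prompt_tokens,
                                 frozen, adapter, gate=0.0)
```

With `gate=0.0` the layer reduces to the frozen text-only attention, which is why an adapter of this kind can be trained without disturbing the pre-trained weights.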
