Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations

Yuji Wang; Moran Li; Xiaobin Hu; Ran Yi; Jiangning Zhang; Han Feng; Weijian Cao; Yabiao Wang; Chengjie Wang; Lizhuang Ma

arXiv:2507.04705 · cs.CV · October 28, 2025

TL;DR

This paper introduces a spatial-temporal decoupled framework for identity-preserving text-to-video generation. It separates spatial representations (layout) from temporal representations (motion dynamics) to improve identity preservation and motion consistency, and it placed runner-up in the 2025 ACM MultiMedia Challenge.

Contribution

The paper proposes a novel decoupled approach with semantic prompt optimization and stage-wise generation to enhance spatiotemporal coherence in text-to-video synthesis.

Findings

01. Achieves excellent spatiotemporal consistency.

02. Secures the runner-up place in the 2025 ACM MultiMedia Challenge.

03. Demonstrates strong identity preservation and text relevance.

Abstract

Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and stage-wise decoupled generation paradigm. The former module decouples the prompt into spatial and…

Taxonomy

Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Topic Modeling