StarVid: Enhancing Semantic Alignment in Video Diffusion Models via Spatial and SynTactic Guided Attention Refocusing

Yuanhang Li; Qi Mao; Lan Chen; Zhen Fang; Lei Tian; Xinyan Xiao; Libiao Jin; Hua Wu

arXiv:2409.15259 · cs.CV · March 4, 2025

Open Access

TL;DR

StarVid is a training-free approach that enhances semantic alignment in text-to-video diffusion models. It combines LLM-guided motion trajectory planning with syntax-guided attention refocusing to improve fidelity for multiple objects and their motions.

Contribution

The paper proposes a plug-and-play, training-free method that combines LLM-guided motion trajectory planning with a syntax-guided contrastive constraint on attention to improve semantic consistency in T2V generation.

Findings

01. Outperforms baseline methods in semantic accuracy.

02. Produces higher-quality videos with better object-motion alignment.

03. Remains effective in complex multi-object scenarios.

Abstract

Recent advances in text-to-video (T2V) generation with diffusion models have garnered significant attention. However, these models typically perform well only in scenes with a single object and motion; in compositional scenarios with multiple objects and distinct motions, they struggle to accurately reflect the semantic content of text prompts. To address these challenges, we propose StarVid, a plug-and-play, training-free method that improves semantic alignment between multiple subjects, their motions, and text prompts in T2V models. StarVid first leverages the spatial reasoning capabilities of large language models (LLMs) for two-stage motion trajectory planning based on text prompts. These trajectories serve as spatial priors, guiding a spatial-aware loss to refocus cross-attention (CA) maps onto distinctive regions. Furthermore, we propose a syntax-guided contrastive constraint to strengthen…
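
To make the idea of a trajectory acting as a spatial prior for a cross-attention loss more concrete, here is a minimal NumPy sketch. It is not the paper's formulation: the linear box interpolation stands in for an LLM-planned trajectory, and the loss (one minus the fraction of attention mass inside the box, squared) is a common choice in layout-guided attention work that is assumed here for illustration. The names `box_mask`, `interpolate_boxes`, and `spatial_refocus_loss` are hypothetical.

```python
import numpy as np


def box_mask(box, height, width):
    """Binary mask for a normalized (x0, y0, x1, y1) box on an H x W grid."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    r0, r1 = int(np.floor(y0 * height)), int(np.ceil(y1 * height))
    c0, c1 = int(np.floor(x0 * width)), int(np.ceil(x1 * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask


def interpolate_boxes(start_box, end_box, num_frames):
    """Linearly interpolate boxes across frames.

    Assumption: a stand-in for a planned trajectory, not the paper's planner.
    """
    start = np.asarray(start_box, dtype=np.float32)
    end = np.asarray(end_box, dtype=np.float32)
    weights = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - weights) * start + weights * end


def spatial_refocus_loss(attn_maps, boxes, eps=1e-8):
    """Penalize attention mass that falls outside each frame's box.

    attn_maps: (F, H, W) non-negative cross-attention maps for one token.
    boxes:     (F, 4) normalized boxes, one per frame.
    Assumed loss form: mean over frames of (1 - inside / total) ** 2.
    """
    num_frames, height, width = attn_maps.shape
    losses = []
    for frame in range(num_frames):
        mask = box_mask(boxes[frame], height, width)
        inside = (attn_maps[frame] * mask).sum()
        total = attn_maps[frame].sum() + eps
        losses.append((1.0 - inside / total) ** 2)
    return float(np.mean(losses))


# Usage: an object moving from the top-left to the bottom-right quadrant.
boxes = interpolate_boxes((0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0), num_frames=3)
uniform_attn = np.ones((3, 8, 8))
focused_attn = np.stack([box_mask(b, 8, 8) for b in boxes])
loss_uniform = spatial_refocus_loss(uniform_attn, boxes)
loss_focused = spatial_refocus_loss(focused_attn, boxes)
```

In this toy setup the loss is near zero when attention already sits inside the planned boxes and grows as attention spreads elsewhere. In a training-free pipeline, a loss of this kind would typically be differentiated with respect to the latent at each denoising step, which this sketch does not attempt.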

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Pose and Action Recognition

Methods: Diffusion · Focus