SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
Jiesong Lian, Zixiang Zhou, Ruizhe Zhong, Yuan Zhou, Qinglin Lu, Rui Wang, Long Hu, Yixue Hao, Baoru Huang

TL;DR
SARA introduces a semantically adaptive relational alignment method for video diffusion models, enhancing prompt relevance and interaction fidelity by selectively distilling token relations based on text-conditioned saliency.
Contribution
SARA proposes an adaptive supervision-routing mechanism: a text-conditioned saliency decides which token pairs receive relational supervision, which improves fine-grained text alignment and motion quality in video diffusion models.
Findings
SARA outperforms SFT, VideoREPA, and MoAlign on VBench benchmarks.
SARA improves both text alignment and motion quality.
SARA achieves better results in a blind user study.
Abstract
Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator…
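The idea of letting a text-conditioned saliency decide which token pairs carry supervision can be illustrated with a minimal sketch. The excerpt above does not define the pair-routing operator, so everything here is an assumption for illustration: pair weights are taken as the normalised product of per-token saliencies, relations are cosine similarities, and the penalty is an L1 difference. The function names (`cosine_relations`, `pair_weights`, `saliency_weighted_trd`) are hypothetical and do not come from the paper.

```python
import math


def cosine_relations(tokens):
    """Pairwise cosine similarity between token feature vectors."""
    # Guard against zero vectors by falling back to a norm of 1.0.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
         for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def pair_weights(saliency):
    """ASSUMPTION: weight of pair (i, j) is s_i * s_j, normalised to sum to 1.

    This stands in for the paper's pair-routing operator, which the
    excerpt does not specify.
    """
    n = len(saliency)
    raw = [[saliency[i] * saliency[j] for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in raw)
    if total == 0:
        # No salient tokens: fall back to uniform weighting over all pairs.
        return [[1.0 / (n * n)] * n for _ in range(n)]
    return [[w / total for w in row] for row in raw]


def saliency_weighted_trd(student, teacher, saliency):
    """Saliency-weighted L1 gap between student and teacher relation matrices."""
    rel_s = cosine_relations(student)
    rel_t = cosine_relations(teacher)
    w = pair_weights(saliency)
    n = len(saliency)
    return sum(
        w[i][j] * abs(rel_s[i][j] - rel_t[i][j])
        for i in range(n)
        for j in range(n)
    )


# Toy example: token 2 disagrees between student and teacher features.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

loss_all = saliency_weighted_trd(student, teacher, [1.0, 1.0, 1.0])
loss_masked = saliency_weighted_trd(student, teacher, [1.0, 1.0, 0.0])
```

In this toy case the only mismatched relations involve token 2, so setting its saliency to zero removes them from the loss, while uniform saliency keeps them. The student and teacher feature dimensions need not match, because only the token-by-token relation matrices are compared.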
