SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

Jiesong Lian; Zixiang Zhou; Ruizhe Zhong; Yuan Zhou; Qinglin Lu; Rui Wang; Long Hu; Yixue Hao; Baoru Huang

arXiv:2605.07800 · cs.CV · May 11, 2026

TL;DR

SARA introduces a semantically adaptive relational alignment method for video diffusion models, enhancing prompt relevance and interaction fidelity by selectively distilling token relations based on text-conditioned saliency.

Contribution

SARA proposes an adaptive supervision-routing mechanism that improves fine-grained text alignment and motion quality in video diffusion models.

Findings

01 SARA outperforms SFT, VideoREPA, and MoAlign on VBench benchmarks.

02 SARA improves both text alignment and motion quality.

03 SARA achieves better results in a blind user study.

Abstract

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator…
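
The idea of letting a saliency signal decide which token pairs carry relational supervision can be illustrated with a small sketch. This is not the paper's method: the abstract is truncated before the pair-routing operator is defined, so the outer-product pair weighting, the cosine-similarity relation matrix, the L1 discrepancy, and the names `cosine_relations` and `weighted_trd_loss` are all assumptions made for illustration. The saliency vector stands in for the output of a text-conditioned aligner, which is not modelled here.

```python
import numpy as np


def cosine_relations(tokens: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between token features, shape (N, N)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8
    normed = tokens / norms
    return normed @ normed.T


def weighted_trd_loss(student_tokens: np.ndarray,
                      teacher_tokens: np.ndarray,
                      saliency: np.ndarray) -> float:
    """Saliency-weighted token-relation discrepancy (illustrative only).

    student_tokens: (N, D_s) features from the diffusion model.
    teacher_tokens: (N, D_t) features from a frozen visual foundation model.
    saliency:       (N,) non-negative per-token relevance to the prompt.
    """
    rel_student = cosine_relations(student_tokens)
    rel_teacher = cosine_relations(teacher_tokens)
    # ASSUMPTION: a pair (i, j) is weighted by saliency[i] * saliency[j].
    # The paper's actual pair-routing operator may differ.
    pair_weight = np.outer(saliency, saliency)
    discrepancy = np.abs(rel_student - rel_teacher)
    return float((pair_weight * discrepancy).sum() / (pair_weight.sum() + 1e-8))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(6, 4))
    student = teacher.copy()
    student[0] = -student[0]  # corrupt the relations involving token 0

    on_entity = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
    off_entity = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
    print("loss, saliency on corrupted token:", weighted_trd_loss(student, teacher, on_entity))
    print("loss, saliency elsewhere:", weighted_trd_loss(student, teacher, off_entity))
```

In this toy setup the relational error is only penalised when the saliency covers the corrupted token, which is the behaviour the abstract attributes to prompt-aware allocation of the pairwise supervision budget.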
