SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang; Wenkai Dong; Yuxin Song; Bo Fang; Qi Zhang; Jing Wang; Fan Chen; Hui Zhang; Haocheng Feng; Yu Lu; Hang Zhou; Chun Yuan; Jingdong Wang

arXiv:2603.19228·cs.CV·March 20, 2026

TL;DR

SAMA is an instruction-guided video editing framework that factorizes editing into semantic anchoring and motion modeling, improving robustness and generalization without relying on external priors.

Contribution

The paper factorizes video editing into semantic anchoring and motion alignment, which improves both zero-shot and supervised editing performance.

Findings

01. Achieves state-of-the-art open-source video editing results

02. Strong zero-shot editing capabilities from pre-training alone

03. Competitive with leading commercial systems

Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed…
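
The two ingredients named in the abstract, sparse anchor frames and cube inpainting, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the evenly spaced anchor selection, the single-cube mask, and the zero fill value are all assumptions made for illustration, and the video is a plain nested list rather than a latent tensor.

```python
import random


def select_anchor_frames(num_frames, num_anchors):
    """Pick sparse anchor frame indices.

    Assumption: anchors are evenly spaced; the abstract only says "sparse".
    """
    if num_frames <= 0 or num_anchors <= 0:
        return []
    if num_anchors >= num_frames:
        return list(range(num_frames))
    if num_anchors == 1:
        return [0]
    step = (num_frames - 1) / (num_anchors - 1)
    return [round(i * step) for i in range(num_anchors)]


def cube_mask(num_frames, height, width, cube, rng):
    """Build mask[t][y][x], True inside one random space-time cube.

    Assumption: a single axis-aligned cube of size (ct, ch, cw).
    """
    ct, ch, cw = cube
    if ct > num_frames or ch > height or cw > width:
        raise ValueError("cube does not fit inside the video")
    t0 = rng.randrange(num_frames - ct + 1)
    y0 = rng.randrange(height - ch + 1)
    x0 = rng.randrange(width - cw + 1)
    return [
        [
            [
                t0 <= t < t0 + ct and y0 <= y < y0 + ch and x0 <= x < x0 + cw
                for x in range(width)
            ]
            for y in range(height)
        ]
        for t in range(num_frames)
    ]


def corrupt(video, mask, fill=0.0):
    """Overwrite masked values; the clean video serves as the restoration target."""
    return [
        [
            [fill if m else v for v, m in zip(v_row, m_row)]
            for v_row, m_row in zip(v_frame, m_frame)
        ]
        for v_frame, m_frame in zip(video, mask)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy single-channel video: 8 frames of 6x6 values (hypothetical sizes).
    video = [[[rng.random() for _ in range(6)] for _ in range(6)] for _ in range(8)]
    anchors = select_anchor_frames(len(video), 3)
    mask = cube_mask(8, 6, 6, cube=(2, 3, 3), rng=rng)
    corrupted = corrupt(video, mask)
    print("anchor frames:", anchors)
    print("masked values:", sum(m for f in mask for r in f for m in r))
```

In this toy setup a restoration model would receive `corrupted` and be trained to reproduce `video`; because the hidden region spans several consecutive frames, filling it in plausibly requires reasoning about motion rather than a single image.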

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Human Pose and Action Recognition