What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing
Hangyu Lin, Chao Wen, Chengming Xu, Jianxiong Gao, Jiangning Zhang, Xiaobin Hu, Yanwei Fu

TL;DR
This paper investigates whether aligning a vision-language model (VLM) to a DiT preserves fine-grained semantics in video editing, and finds that the alignment often acts as a semantic bottleneck.
Contribution
It introduces TRACE-Edit, a diagnostic dataset and protocol to systematically evaluate semantic preservation in VLM-to-DiT alignment for video editing.
Findings
Alignment degrades fine-grained structural semantics
VLM-to-DiT alignment is a major semantic bottleneck
Proposed diagnostic tools reveal limitations in current models
Abstract
Flow-matching-based video generative models increasingly rely on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of Diffusion Transformers (DiTs). However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this hypothesis is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To investigate it rigorously, we propose a controlled data processing pipeline based on video composition that yields TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic…
