What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

Hangyu Lin; Chao Wen; Chengming Xu; Jianxiong Gao; Jiangning Zhang; Xiaobin Hu; Yanwei Fu

arXiv:2605.20795 · cs.CV · May 21, 2026

TL;DR

This paper investigates whether the connector that aligns vision-language models (VLMs) with DiTs preserves fine-grained semantics in video editing, and finds that this alignment often acts as a semantic bottleneck.

Contribution

It introduces TRACE-Edit, a diagnostic dataset and protocol to systematically evaluate semantic preservation in VLM-to-DiT alignment for video editing.

Findings

01 Alignment degrades fine-grained structural semantics

02 VLM-to-DiT alignment is a major semantic bottleneck

03 Proposed diagnostic tools reveal limitations in current models

Abstract

Flow-matching-based video generative models increasingly rely on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of DiTs. However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To investigate this rigorously, we propose a controlled data processing pipeline based on video composition that yields TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic…
