MiVE: Multiscale Vision-language features for reference-guided video Editing
Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

TL;DR
MiVE is a multiscale vision-language framework that uses hierarchical features from VLMs to improve reference-guided video editing, and it reports state-of-the-art results in human preference tests.
Contribution
The paper proposes MiVE, a novel approach that repurposes VLMs as multiscale feature extractors to enhance video editing fidelity and detail preservation.
Findings
MiVE outperforms existing methods in human preference tests.
Hierarchical features from VLMs improve editing accuracy.
MiVE surpasses both academic and commercial systems.
Abstract
Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically: early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that…
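The layer-wise observation in the abstract can be illustrated with a toy sketch. The code below is not the MiVE implementation, which this summary does not describe. It only shows the general idea of tapping hidden states at several depths of an encoder stack instead of using the final layer alone. Every name in it (`make_toy_layer`, `collect_hidden_states`, `fuse_by_concat`, `tap_layers`) is hypothetical. The "layers" are simple averaging functions standing in for real VLM blocks, and concatenation is an assumed fusion choice.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Tokens = List[Vector]              # one feature vector per token
Layer = Callable[[Tokens], Tokens]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for one encoder block (not a real VLM layer)."""
    def layer(tokens: Tokens) -> Tokens:
        # Pull every token toward the mean over all tokens, so outputs of
        # deeper layers carry less token-specific (local) information.
        dim = len(tokens[0])
        mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        return [[(1 - scale) * t[d] + scale * mean[d] for d in range(dim)]
                for t in tokens]
    return layer


def collect_hidden_states(layers: Sequence[Layer], tokens: Tokens,
                          tap_layers: Sequence[int]) -> List[Tokens]:
    """Run the stack and keep the outputs of the layers in tap_layers.

    States are returned in depth order, regardless of the order of
    tap_layers.
    """
    taps = set(tap_layers)
    kept: List[Tokens] = []
    hidden = tokens
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in taps:
            kept.append(hidden)
    return kept


def fuse_by_concat(states: Sequence[Tokens]) -> Tokens:
    """Assumed fusion: concatenate per-token features from each tapped layer."""
    num_tokens = len(states[0])
    return [[value for state in states for value in state[i]]
            for i in range(num_tokens)]


# Usage: tap an early layer (index 0) and the last layer (index 3).
layers = [make_toy_layer(0.5) for _ in range(4)]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = collect_hidden_states(layers, tokens, tap_layers=[0, 3])
fused = fuse_by_concat(states)   # 3 tokens, each with 2 + 2 = 4 features
```

In this toy setup, the first two features of each fused token come from the early layer and still differ clearly between tokens. The last two come from the deepest layer and are nearly identical across tokens. That mirrors, in a very simplified way, the local-versus-global split the abstract describes.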
