Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin; Guoqiang Liang; Ziyun Zeng; Zechen Bai; Yanzhe Chen; Mike Zheng Shou

arXiv:2603.02175·cs.CV·May 14, 2026

TL;DR

Kiwi-Edit introduces a scalable data generation pipeline and a unified architecture for instruction- and reference-guided video editing, achieving state-of-the-art results on controllable editing tasks.

Contribution

The paper presents a data synthesis pipeline, a large-scale dataset (RefVIE) with an accompanying benchmark (RefVIE-Bench), and a unified editing model (Kiwi-Edit) that together improve instruction- and reference-guided video editing.

Findings

01  Achieved state-of-the-art performance on instruction-reference-following tasks.

02  Developed a scalable pipeline to generate high-quality training data.

03  Proposed a unified architecture that enhances reference fidelity and instruction following.

Abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for…
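
The pair-to-quadruplet step described in the abstract can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: the abstract does not list the four elements of a quadruplet, so the sketch assumes (source video, instruction, reference image, edited video), and the names `EditPair`, `EditQuadruplet`, `to_quadruplet`, and `placeholder_reference` are hypothetical. The reference generator is a stub that only builds a file name; in the paper this role is played by an image generative model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EditPair:
    """An existing video editing pair (hypothetical layout)."""
    source_video: str   # path to the unedited clip
    instruction: str    # natural-language edit instruction
    edited_video: str   # path to the edited clip


@dataclass(frozen=True)
class EditQuadruplet:
    """Assumed quadruplet: the pair plus a synthesized reference image."""
    source_video: str
    instruction: str
    reference_image: str
    edited_video: str


# Signature of a reference generator: (edited_video, instruction) -> image path.
ReferenceFn = Callable[[str, str], str]


def to_quadruplet(pair: EditPair, make_reference: ReferenceFn) -> EditQuadruplet:
    """Attach a synthesized reference image to an existing editing pair."""
    reference = make_reference(pair.edited_video, pair.instruction)
    return EditQuadruplet(
        source_video=pair.source_video,
        instruction=pair.instruction,
        reference_image=reference,
        edited_video=pair.edited_video,
    )


def placeholder_reference(edited_video: str, instruction: str) -> str:
    """Stub standing in for an image generative model (assumption).

    It only derives a file name and does not generate an image.
    """
    return f"{edited_video}.ref.png"


if __name__ == "__main__":
    pair = EditPair("src.mp4", "replace the cup with a red mug", "tgt.mp4")
    print(to_quadruplet(pair, placeholder_reference))
```

Passing the generator as a function keeps the conversion independent of any particular image model, so the stub can be swapped for a real generator without changing the data layout.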

Code & Models

Repositories

showlab/Kiwi-Edit (GitHub)
