Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

TL;DR
Kiwi-Edit introduces a scalable data generation pipeline and a unified architecture for instruction and reference-guided video editing, achieving state-of-the-art results in controllable editing tasks.
Contribution
The paper presents a novel data synthesis pipeline, a large-scale dataset, and a unified editing model that significantly improve instruction and reference-guided video editing.
Findings
Achieved state-of-the-art performance on instruction-reference-following tasks.
Developed a scalable pipeline to generate high-quality training data.
Proposed a unified architecture that enhances reference fidelity and instruction following.
Abstract
Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for…
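The pair-to-quadruplet step described in the abstract can be pictured with a minimal sketch. The field names, the `EditPair` and `EditQuadruplet` types, and the `synthesize_reference` callable are all hypothetical: the abstract says only that existing editing pairs become quadruplets with a synthesized reference, so the layout below (source video, instruction, reference image, edited video) is an assumption, and the stub stands in for an image generative model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EditPair:
    """An existing video editing pair (hypothetical layout)."""
    source_video: str   # path to the unedited clip
    instruction: str    # natural-language edit instruction
    edited_video: str   # path to the edited clip


@dataclass(frozen=True)
class EditQuadruplet:
    """A training quadruplet with an added reference image (assumed layout)."""
    source_video: str
    instruction: str
    reference_image: str  # synthesized reference scaffold
    edited_video: str


def to_quadruplet(
    pair: EditPair,
    synthesize_reference: Callable[[EditPair], str],
) -> EditQuadruplet:
    """Attach a synthesized reference image to an existing editing pair.

    `synthesize_reference` is a placeholder for whatever image generative
    model produces the reference; it is not the authors' API.
    """
    reference = synthesize_reference(pair)
    return EditQuadruplet(
        source_video=pair.source_video,
        instruction=pair.instruction,
        reference_image=reference,
        edited_video=pair.edited_video,
    )


def stub_reference(pair: EditPair) -> str:
    """Stand-in generator: derives a reference path from the edited clip name."""
    return pair.edited_video.rsplit(".", 1)[0] + "_ref.png"


pair = EditPair(
    source_video="clips/0001_src.mp4",
    instruction="replace the dog with a red fox",
    edited_video="clips/0001_edit.mp4",
)
quad = to_quadruplet(pair, stub_reference)
print(quad.reference_image)
```

Passing the generator in as a callable keeps the sketch independent of any particular image model; a real pipeline would also need quality filtering, which is not shown here.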
Code & Models
- 🤗 linyq/kiwi-edit-5b-instruct-only-diffusers · model · 552 downloads · 7 likes
- 🤗 linyq/kiwi-edit-5b-instruct-reference-diffusers · model · 171 downloads · 10 likes
- 🤗 linyq/kiwi-edit-5b-reference-only-diffusers · model · 59 downloads · 4 likes
- 🤗 linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage3_img_vid_refvid_720x1280_81f · model · 39 downloads · 3 likes
- 🤗 linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage1_img_only · model · 28 downloads · 2 likes
- 🤗 linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_720x1280_81f · model · 33 downloads · 2 likes
- 🤗 linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage3_refvid_only_720x1280_81f_pad_first · model · 22 downloads · 2 likes
- 🤗 linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_600x600_81f · model · 13 downloads · 2 likes
- 🤗 AEmotionStudio/kiwi-edit-instruct · model
- 🤗 AEmotionStudio/kiwi-edit-reference · model · 1 download
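Any of the repositories above can presumably be fetched with the generic Hugging Face Hub client. The sketch below assumes the `huggingface_hub` package is installed and that the repository is publicly accessible; `fetch_checkpoint` is a hypothetical helper name, and nothing here reflects how the authors load or run the model.

```python
from typing import Optional

# Repository ID copied from the list above.
REPO_ID = "linyq/kiwi-edit-5b-instruct-reference-diffusers"


def fetch_checkpoint(repo_id: str = REPO_ID, local_dir: Optional[str] = None) -> str:
    """Download every file in a Hub repository and return the local folder path.

    The import is deferred so that this module loads even where
    huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    # Requires network access and several gigabytes of disk space (assumed).
    print(fetch_checkpoint())
```

This only downloads the files; how to build an inference pipeline from them depends on the repository's own model card.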
