VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing

Jiahao Hu; Tianxiong Zhong; Xuebo Wang; Boyuan Jiang; Xingye Tian; Fei Yang; Pengfei Wan; Di Zhang

arXiv:2411.15260·cs.CV·July 15, 2025

TL;DR

This paper introduces VIVID-10M, a large-scale dataset, and VIVID, a versatile, interactive video editing model that enables efficient, high-quality local editing with improved user interactivity and state-of-the-art performance.

Contribution

The paper presents VIVID-10M, the first large-scale hybrid image-video local editing dataset, and VIVID, an interactive editing model that supports entity addition, modification, and deletion.

Findings

01 VIVID-10M contains 9.7 million samples covering diverse editing tasks.

02 The VIVID model achieves state-of-the-art results in video local editing.

03 Interactive keyframe-guided editing reduces latency and improves user control.

Abstract

Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset VIVID-10M and a baseline model VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model…

Code & Models

Datasets

KlingTeam/VIVID-10M
dataset · 288 dl

Taxonomy

Topics: COVID-19 diagnosis using AI · Cell Image Analysis Techniques · Advanced Vision and Imaging