VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

Andong Deng; Dawei Du; Zhenfang Chen; Wen Zhong; Fan Chen; Guang Chen; Chia-Wen Kuo; Longyin Wen; Chen Chen; Sijie Zhu

arXiv:2605.03276 · cs.CV · May 12, 2026

TL;DR

VEBENCH is a comprehensive benchmark for evaluating large multimodal models' abilities in real-world video editing tasks, focusing on editing knowledge and operational reasoning.

Contribution

It introduces VEBENCH, a new benchmark of high-quality edited videos and QA tasks that assesses models' editing technique recognition and workflow reasoning capabilities.

Findings

01. Current models lag behind human-level editing cognition.

02. Extensive experiments show significant performance gaps.

03. VEBENCH provides a foundation for future research in video editing AI.

Abstract

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models'…
