From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Xiangfeng Wang; Xiao Li; Yadong Wei; Xueyu Song; Yang Song; Xiaoqiang Xia; Fangrui Zeng; Zaiyi Chen; Liu Liu; Gu Xu; Tong Xu

arXiv:2507.02790 · cs.CV · October 6, 2025

TL;DR

This paper introduces HIVE, a multimodal, human-inspired framework for automatic video editing that improves coherence and engagement by understanding narrative context, character interactions, and scene structure, outperforming existing methods.

Contribution

The paper presents a novel multimodal narrative understanding framework for automatic video editing, incorporating character, dialogue, and scene analysis, along with a new dataset for benchmarking.

Findings

01 Outperforms existing automatic editing baselines

02 Significantly narrows quality gap with human editing

03 Effective in both general and advertisement video editing tasks

Abstract

The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks:…
