A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Panwen Hu; Nan Xiao; Feifei Li; Yongquan Chen; Rui Huang

arXiv:2411.04942·cs.CV·November 8, 2024

Open Access

TL;DR

This paper introduces a novel two-stage automatic video editing approach that uses a pre-trained vision-language model for context extraction and reinforcement learning to improve editing decisions, applicable to diverse video content.

Contribution

The work presents a general editing framework leveraging VLM for context and RL for decision-making, addressing the limitations of scene-specific editing systems.

Findings

01. Effective context representation with VLM improves editing relevance.

02. RL-based framework enhances editing quality and decision accuracy.

03. Method demonstrates superior performance on a real movie dataset.

Abstract

In this era of videos, automatic video editing techniques attract more and more attention from industry and academia since they can reduce workloads and lower the requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting, yet the automatic systems for general editing, e.g., movie or vlog editing which covers various scenes and events, were rarely studied before, and converting the event-driven editing method to a general scene is nontrivial. In this paper, we propose a two-stage scheme for general editing. Firstly, unlike previous works that extract scene-specific features, we leverage the pre-trained Vision-Language Model (VLM) to extract the editing-relevant representations as editing context. Moreover, to close the gap between the professional-looking videos and the automatic productions generated with…
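To make the two-stage idea concrete, here is a minimal sketch, not the authors' implementation. The abstract does not state the policy architecture, the reward, or the RL algorithm, so everything here is an assumption for illustration: a hand-written `context` vector stands in for a VLM embedding, each candidate shot is a small feature vector, the policy is a linear softmax over candidates, and the update is plain REINFORCE. All function names (`policy`, `reinforce_step`, `features`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def features(context, candidate):
    """Element-wise interaction between editing context and one candidate shot."""
    return [c * x for c, x in zip(context, candidate)]


def policy(weights, context, candidates):
    """Probability of picking each candidate shot given the editing context."""
    scores = [
        sum(w * f for w, f in zip(weights, features(context, cand)))
        for cand in candidates
    ]
    return softmax(scores)


def reinforce_step(weights, context, candidates, reward_fn, lr=0.1, rng=random):
    """Sample one editing decision and apply a single REINFORCE update."""
    probs = policy(weights, context, candidates)
    action = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    reward = reward_fn(action)  # assumed scalar reward; the paper's reward is not given here
    feats = [features(context, cand) for cand in candidates]
    # Gradient of log pi(action) for a linear softmax policy:
    # features of the chosen action minus the expected features under the policy.
    expected = [
        sum(p * f[j] for p, f in zip(probs, feats)) for j in range(len(weights))
    ]
    grad = [feats[action][j] - expected[j] for j in range(len(weights))]
    new_weights = [w + lr * reward * g for w, g in zip(weights, grad)]
    return new_weights, action, reward


if __name__ == "__main__":
    rng = random.Random(0)
    context = [1.0, 0.5, -0.5]  # placeholder for a VLM-derived context vector
    candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    weights = [0.0, 0.0, 0.0]

    def toy_reward(action):
        # Toy assumption: shot 0 is the "professional" choice.
        return 1.0 if action == 0 else 0.0

    for _ in range(200):
        weights, _, _ = reinforce_step(
            weights, context, candidates, toy_reward, rng=rng
        )
    print(policy(weights, context, candidates))
```

In this toy setup the policy starts uniform and shifts probability toward the rewarded shot. A real system would replace the placeholder context with embeddings from a pre-trained VLM and the toy reward with a signal reflecting professional editing quality.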

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need