ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang, Yanru Zhang

TL;DR
This paper introduces ESA, an energy-based optimization framework for automatic video shot assembly. It learns editing style from reference videos so that the assembled shot sequences match both the intended artistic style and the narrative requirements of the script.
Contribution
The paper presents a novel energy-based model that learns from reference videos to automate shot assembly, capturing artistic and narrative styles in video editing.
Findings
Automates shot sequencing so that the result aligns with the style of the reference videos
Learns and replicates artistic styles from reference videos
Enables users with no editing experience to produce compelling videos
Abstract
Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ…
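The abstract is cut off before it describes the energy model, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that each script slot already has a small set of candidate shots (the output of the visual-semantic matching step), that each shot carries a labeled attribute such as shot size, and that "energy" is the negative log of a smoothed attribute-transition probability estimated from a reference video. All names here (`transition_energy`, `sequence_energy`, `assemble`, the `size` key) are hypothetical, and the exhaustive search stands in for whatever optimizer the paper actually uses.

```python
import itertools
import math
from collections import Counter


def transition_energy(reference, smoothing=1.0):
    """Return an energy function for one attribute, learned from a reference.

    `reference` is the ordered list of attribute labels (e.g. shot sizes)
    observed in a reference video. Assumption: energy is the negative log
    of the smoothed probability of moving from label a to label b.
    """
    pairs = Counter(zip(reference, reference[1:]))
    firsts = Counter(reference[:-1])
    num_labels = len(set(reference))

    def energy(a, b):
        p = (pairs[(a, b)] + smoothing) / (firsts[a] + smoothing * num_labels)
        return -math.log(p)

    return energy


def sequence_energy(shots, energy_fns):
    """Sum the transition energies over consecutive shots and attributes."""
    total = 0.0
    for prev, cur in zip(shots, shots[1:]):
        for key, fn in energy_fns.items():
            total += fn(prev[key], cur[key])
    return total


def assemble(candidates, energy_fns):
    """Pick one shot per script slot so that total energy is minimal.

    Exhaustive search; only practical for a handful of slots.
    """
    return min(
        itertools.product(*candidates),
        key=lambda seq: sequence_energy(seq, energy_fns),
    )


# Toy data: shot sizes from a reference video, and candidates per script slot.
reference_sizes = ["wide", "medium", "close", "wide", "medium", "close"]
candidates = [
    [{"id": "A", "size": "wide"}, {"id": "B", "size": "close"}],
    [{"id": "C", "size": "medium"}, {"id": "D", "size": "wide"}],
    [{"id": "E", "size": "close"}, {"id": "F", "size": "wide"}],
]
energy_fns = {"size": transition_energy(reference_sizes)}
best = assemble(candidates, energy_fns)
print([shot["id"] for shot in best])
```

In this toy setup the lowest-energy sequence is the one whose size transitions are most frequent in the reference. Further attributes mentioned in the abstract, such as camera motion, could be added as extra entries in `energy_fns`, and a larger candidate pool would need a cheaper search than full enumeration.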
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
