ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing

Yaosen Chen; Wei Wang; Tianheng Zheng; Xuming Wen; Han Yang; Yanru Zhang

arXiv:2511.02505·cs.CV·November 20, 2025

TL;DR

This paper introduces ESA, an energy-based optimization framework for automatic video shot assembly. It learns assembly style from reference videos so that the resulting shot sequence follows both the narrative requirements and the artistic style of the reference.

Contribution

The paper presents a novel energy-based model that learns from reference videos to automate shot assembly, capturing artistic and narrative styles in video editing.

Findings

01. Automates shot sequencing so that the assembled sequence aligns with the reference style

02. Learns and replicates artistic styles from reference videos

03. Enables users with no editing experience to produce compelling videos

Abstract

Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ…
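The abstract is cut off before it describes the energy function, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each shot is labeled with a `(shot_size, camera_motion)` pair, that the energy of a sequence is the sum of smoothed negative log frequencies of attribute transitions observed in a reference video, and that the candidate subsets per script sentence are already available from the matching step. The names `learn_pair_energy` and `assemble`, the smoothing scheme, and the exhaustive search are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import product


def learn_pair_energy(reference):
    """Build a pairwise energy from a labeled reference shot sequence.

    Assumption: energy = -log of the add-one smoothed frequency of each
    attribute transition in the reference. Lower energy = more typical.
    """
    counts = Counter(zip(reference, reference[1:]))
    total = sum(counts.values())

    def pair_energy(a, b):
        return -math.log((counts[(a, b)] + 1) / (total + 1))

    return pair_energy


def assemble(candidates, pair_energy):
    """Pick one shot per script slot so the total energy is minimal.

    Exhaustive search is used only to keep the sketch short; it is
    exponential in the number of slots.
    """
    def energy(seq):
        return sum(pair_energy(x["attrs"], y["attrs"])
                   for x, y in zip(seq, seq[1:]))

    best = min(product(*candidates), key=energy)
    return list(best), energy(best)


# Hypothetical reference video, labeled as (shot_size, camera_motion).
reference = [("wide", "static"), ("medium", "pan"), ("close", "static"),
             ("wide", "static"), ("medium", "pan")]

# Hypothetical candidate subsets, one list per script sentence.
candidates = [
    [{"id": "a1", "attrs": ("wide", "static")},
     {"id": "a2", "attrs": ("close", "static")}],
    [{"id": "b1", "attrs": ("medium", "pan")},
     {"id": "b2", "attrs": ("close", "static")}],
    [{"id": "c1", "attrs": ("close", "static")},
     {"id": "c2", "attrs": ("medium", "pan")}],
]

pair_energy = learn_pair_energy(reference)
best_sequence, best_energy = assemble(candidates, pair_energy)
print([shot["id"] for shot in best_sequence], round(best_energy, 3))
```

In this toy setup the lowest-energy sequence is the one whose wide, medium, close progression mirrors the transitions seen in the reference. A practical system would likely replace the exhaustive search with dynamic programming or a sampling-based optimizer and add further energy terms, but the source excerpt does not specify these.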

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications