From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations
Yuzhi Li, Haojun Xu, Feng Tian

TL;DR
This paper explores the use of Large Language Models in video editing by introducing a structured language representation called L-Storyboard and a reasoning strategy named StoryFlow, improving task accuracy and coherence.
Contribution
It introduces L-Storyboard as a novel intermediate representation and proposes the StoryFlow strategy to enhance the stability and logical consistency of LLM-based video editing tasks.
Findings
L-Storyboard improves the mapping between visual information and language descriptions.
StoryFlow enhances logical consistency and stability in shot sequence ordering.
Experimental results show significant improvements in interpretability and coherence.
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable reasoning and generalization capabilities in video understanding; however, their application in video editing remains largely underexplored. This paper presents the first systematic study of LLMs in the context of video editing. To bridge the gap between visual information and language-based reasoning, we introduce L-Storyboard, an intermediate representation that transforms discrete video shots into structured language descriptions suitable for LLM processing. We categorize video editing tasks into Convergent Tasks and Divergent Tasks, focusing on three core tasks: Shot Attributes Classification, Next Shot Selection, and Shot Sequence Ordering. To address the inherent instability of divergent task outputs, we propose the StoryFlow strategy, which converts the divergent multi-path reasoning…
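To make the idea of a language-based intermediate representation concrete, the following is a minimal sketch of how discrete shots might be turned into structured text and then posed to an LLM as a Next Shot Selection question. It is not the paper's actual L-Storyboard schema: the class `ShotDescription`, its fields (`shot_size`, `camera_motion`, `content`), and the helper functions are all hypothetical names chosen for illustration, and no model call is made.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShotDescription:
    """One video shot expressed in language.

    All field names are illustrative assumptions, not the paper's schema.
    """
    shot_id: str
    shot_size: str      # hypothetical attribute, e.g. "wide", "close-up"
    camera_motion: str  # hypothetical attribute, e.g. "static", "pan"
    content: str        # free-text description of what the shot shows


def render_storyboard(shots: List[ShotDescription]) -> str:
    """Serialize shots into one structured text line per shot."""
    return "\n".join(
        f"[{s.shot_id}] size={s.shot_size}; motion={s.camera_motion}; "
        f"content={s.content}"
        for s in shots
    )


def next_shot_prompt(context: List[ShotDescription],
                     candidates: List[ShotDescription]) -> str:
    """Build a text prompt asking which candidate should follow the context.

    The prompt wording is an assumption made for this sketch.
    """
    return (
        "Storyboard so far:\n"
        + render_storyboard(context)
        + "\n\nCandidate next shots:\n"
        + render_storyboard(candidates)
        + "\n\nAnswer with the id of the best next shot."
    )


if __name__ == "__main__":
    context = [
        ShotDescription("S1", "wide", "static", "A street at dawn"),
        ShotDescription("S2", "medium", "pan", "A cyclist enters the frame"),
    ]
    candidates = [
        ShotDescription("C1", "close-up", "static", "The cyclist's face"),
        ShotDescription("C2", "wide", "static", "An empty beach"),
    ]
    print(next_shot_prompt(context, candidates))
```

Keeping each shot on a single line with fixed keys is one possible design choice: it gives the model a uniform format to reason over and makes the answer (a shot id) easy to parse back out.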
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
