REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing
Weihan Xu, Yimeng Ma, Jingyue Huang, Yang Li, Wenye Ma, Taylor Berg-Kirkpatrick, Julian McAuley, Paul Pu Liang, Hao-Wen Dong

TL;DR
REGen is a novel multimodal video editing framework that generates short videos with embedded clips from long videos, maintaining narrative coherence and supporting the creation of engaging documentary teasers.
Contribution
The paper introduces REGen, a retrieval-embedded generation system that combines a large language model with a retrieval mechanism to produce coherent short videos that quote clips from longer inputs.
Findings
Effective insertion of short clips while maintaining narrative coherence
Outperforms existing methods in coherence, alignment, and realism
Validated through objective evaluations and subjective surveys
Abstract
Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips into their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports…
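The two-stage idea in the abstract (a script with quote placeholders, followed by retrieval that fills them) can be illustrated with a minimal sketch. This is not the paper's retrieval model: it swaps in plain cosine similarity over hand-supplied embedding vectors, and the `<QUOTE>` marker, the `fill_placeholders` function, the `[clip:...]` output format, and the greedy no-reuse rule are all hypothetical choices made only for illustration.

```python
import math
import re

# Hypothetical placeholder marker; the token used by the actual system is not given here.
PLACEHOLDER = "<QUOTE>"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fill_placeholders(script, query_vectors, clips):
    """Replace each placeholder with the id of its best-matching unused clip.

    script: narration text containing PLACEHOLDER markers
    query_vectors: one embedding per placeholder, in script order (assumed given)
    clips: dict mapping clip id -> embedding of that candidate clip (assumed given)
    """
    if script.count(PLACEHOLDER) != len(query_vectors):
        raise ValueError("need exactly one query vector per placeholder")
    if len(clips) < len(query_vectors):
        raise ValueError("not enough candidate clips")

    remaining = dict(clips)
    chosen = []
    for query in query_vectors:
        # Greedy choice: highest cosine similarity among clips not yet used.
        best = max(remaining, key=lambda cid: cosine(query, remaining[cid]))
        chosen.append(best)
        del remaining[best]

    # Substitute placeholders left to right with the chosen clip ids.
    ids = iter(chosen)
    filled = re.sub(re.escape(PLACEHOLDER), lambda _m: f"[clip:{next(ids)}]", script)
    return filled, chosen
```

Separating script generation from clip selection means the narration can be written first for coherence, while the choice of footage is deferred to a similarity search. In a real system the query and clip vectors would come from learned encoders rather than being passed in by hand.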
Taxonomy
Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
