REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Weihan Xu; Yimeng Ma; Jingyue Huang; Yang Li; Wenye Ma; Taylor Berg-Kirkpatrick; Julian McAuley; Paul Pu Liang; Hao-Wen Dong

arXiv:2505.18880 · cs.CV · May 27, 2025

TL;DR

REGen is a multimodal video editing framework that generates short videos from long ones, embedding clips quoted from the source while keeping the narrative coherent. It supports the creation of engaging documentary teasers.

Contribution

The paper introduces REGen, a retrieval-embedded generation system that pairs a large language model with a retrieval model to produce coherent short videos that quote clips from longer inputs.

Findings

01 Effective insertion of short clips while maintaining narrative coherence

02 Outperforms existing methods in coherence, alignment, and realism

03 Validated through objective evaluations and subjective surveys

Abstract

Short videos are an effective tool for promoting content and improving knowledge accessibility. Existing extractive video summarization methods struggle to produce a coherent narrative, while existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips into their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports…
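The two-stage flow described in the abstract (a script with quote placeholders, then retrieval to fill them) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the `<QUOTE>` marker, the `fill_quotes` helper, the clip dictionaries, and the toy bag-of-words similarity, which merely stands in for the paper's learned retrieval model and is not the authors' method.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder marker; the actual token used by REGen is not stated here.
PLACEHOLDER = "<QUOTE>"


def embed(text):
    """Toy bag-of-words vector, standing in for a learned multimodal encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fill_quotes(script_lines, clips):
    """Replace each placeholder line with the clip closest to the adjacent narration."""
    output = []
    for i, line in enumerate(script_lines):
        if line != PLACEHOLDER:
            output.append(("narration", line))
            continue
        # Use the lines immediately before and after the placeholder as the query.
        neighbours = script_lines[max(0, i - 1): i + 2]
        query = embed(" ".join(l for l in neighbours if l != PLACEHOLDER))
        best = max(clips, key=lambda c: cosine(query, embed(c["transcript"])))
        output.append(("clip", best["id"]))
    return output
```

In this sketch the script is assumed to come from a separate generation step; `fill_quotes` only covers the retrieval stage, and a real system would score candidate clips with learned audio-visual and text features rather than word overlap.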

Videos

REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training