Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations
Yiqing Shen, Chenjia Li, Mathias Unberath

TL;DR
This paper introduces RIVER, a model trained with reinforcement learning that performs reasoning-based implicit video editing on digital twin representations and outperforms existing methods on RVEBenchmark, VegGIE, and FiVE.
Contribution
The paper defines the reasoning video editing task and proposes RIVER, a first model for it, which uses digital twin representations and multi-hop reasoning to resolve implicit queries into concrete editing targets.
Findings
RIVER achieves state-of-the-art results on RVEBenchmark, VegGIE, and FiVE datasets.
RIVER effectively interprets implicit queries through multi-hop reasoning.
The digital twin representation enhances spatial, temporal, and semantic understanding.
Abstract
Text-driven video editing enables users to modify video content using only text queries. While existing methods can modify video content if explicit descriptions of editing targets with precise spatial locations and temporal boundaries are provided, these requirements become impractical when users attempt to conceptualize edits through implicit queries referencing semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, and a first model attempting to solve this complex task, RIVER (Reasoning-based Implicit Video Editor). RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large…
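To make the idea of a digital twin concrete, here is a minimal sketch of a structure that keeps semantic attributes, per-frame boxes (a temporal trajectory), and a spatial relation, and uses them to resolve an implicit query in two hops. It is an illustration under stated assumptions, not RIVER's actual representation: the names `TwinObject`, `left_of`, `trajectory`, and `resolve_target`, the box format, and the sample scene are all hypothetical, and a plain rule-based lookup stands in for the model's learned reasoning.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


@dataclass
class TwinObject:
    """One tracked object in a hypothetical digital twin of a video."""
    object_id: int
    label: str                   # semantic class, e.g. "dog"
    attributes: Dict[str, str]   # extra semantic attributes, e.g. {"color": "brown"}
    boxes: Dict[int, Box]        # frame index -> bounding box

    def center(self, frame: int) -> Tuple[float, float]:
        x1, y1, x2, y2 = self.boxes[frame]
        return ((x1 + x2) / 2, (y1 + y2) / 2)


def trajectory(obj: TwinObject) -> List[Tuple[float, float]]:
    """Box centers in frame order: the object's temporal trajectory."""
    return [obj.center(f) for f in sorted(obj.boxes)]


def left_of(a: TwinObject, b: TwinObject, frame: int) -> bool:
    """Spatial relation: a's center lies to the left of b's center."""
    return a.center(frame)[0] < b.center(frame)[0]


def resolve_target(
    objects: List[TwinObject],
    anchor_label: str,
    relation: Callable[[TwinObject, TwinObject, int], bool],
    frame: int,
) -> List[TwinObject]:
    """Hop 1: find the anchor by label. Hop 2: find objects related to it."""
    anchors = [o for o in objects if o.label == anchor_label and frame in o.boxes]
    targets = []
    for anchor in anchors:
        for o in objects:
            if o is not anchor and frame in o.boxes and relation(o, anchor, frame):
                targets.append(o)
    return targets


# Made-up scene: a person, a dog, and a ball that only appears in frame 1.
objects = [
    TwinObject(0, "person", {"clothing": "red jacket"},
               {0: (10, 20, 30, 90), 1: (12, 20, 32, 90)}),
    TwinObject(1, "dog", {"color": "brown"},
               {0: (50, 60, 80, 90), 1: (55, 60, 85, 90)}),
    TwinObject(2, "ball", {}, {1: (90, 80, 100, 90)}),
]

# Implicit query: "whatever is to the left of the dog" in frame 1.
targets = resolve_target(objects, "dog", left_of, frame=1)
target_ids = [t.object_id for t in targets]
```

In this sketch the query never names the person; the target is inferred from the dog's position, and only the resolved object IDs and boxes would then be handed to a separate editing step, which mirrors the decoupling of reasoning from generation described in the abstract.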
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Graph Neural Networks
