RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives

Jaehong Yoon; Shoubin Yu; Mohit Bansal

arXiv:2405.18406·cs.CV·October 6, 2025

TL;DR

RACCooN is a flexible video editing framework that automatically generates natural language descriptions of videos and uses these narratives to guide diverse editing tasks, simplifying user interaction and enhancing editing accuracy.

Contribution

It introduces a multi-granular spatiotemporal pooling strategy for video description and leverages auto-generated narratives to improve video editing and content generation.
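As a rough illustration of what pooling at multiple granularities can mean, the sketch below average-pools a small T x H x W grid of scalar features at several temporal and spatial resolutions and concatenates the results. This is a generic grid-based stand-in, not the paper's pooling method; the function names `pool_grid` and `multi_granular_pool` and the default granularities are assumptions made for this example.

```python
from statistics import fmean


def pool_grid(video, t_bins, s_bins):
    """Average-pool a T x H x W grid of scalar features into
    t_bins x s_bins x s_bins cells (hypothetical helper)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if t_bins > T or s_bins > H or s_bins > W:
        raise ValueError("bin count exceeds grid size")

    def edges(n, bins):
        # Split range(n) into `bins` contiguous, non-empty intervals.
        return [(i * n // bins, (i + 1) * n // bins) for i in range(bins)]

    pooled = []
    for t0, t1 in edges(T, t_bins):
        for y0, y1 in edges(H, s_bins):
            for x0, x1 in edges(W, s_bins):
                cell = [video[t][y][x]
                        for t in range(t0, t1)
                        for y in range(y0, y1)
                        for x in range(x0, x1)]
                pooled.append(fmean(cell))
    return pooled


def multi_granular_pool(video, granularities=((1, 1), (2, 2), (4, 4))):
    """Concatenate pooled tokens from coarse to fine granularity.
    Each entry is (temporal bins, spatial bins per side); the default
    values are illustrative assumptions."""
    tokens = []
    for t_bins, s_bins in granularities:
        tokens.extend(pool_grid(video, t_bins, s_bins))
    return tokens
```

The coarsest level contributes one token summarizing the whole clip, while finer levels contribute localized tokens, which is one way to expose both holistic context and object-level detail to a captioning model.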

Findings

1. Effective multi-granular video description generation.
2. Versatile editing capabilities including removal, addition, and modification.
3. Enhanced video content generation using auto-generated narratives.

Abstract

Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling…
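The two-stage flow described in the abstract can be sketched as plain control flow: a V2P step produces a narrative, an optional user step refines it, and a P2V step consumes the resulting prompt. Everything here is hypothetical scaffolding for illustration: `Narrative`, `edit_narrative`, and `run_pipeline` are names invented for this sketch, and the `describe` and `generate` callables stand in for the captioning model and the video diffusion model, whose real interfaces are not given in this text.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Narrative:
    """Hypothetical container for a V2P description."""
    holistic: str                                          # scene-level context
    objects: Dict[str, str] = field(default_factory=dict)  # object name -> detail

    def to_prompt(self) -> str:
        details = " ".join(f"{name}: {desc}." for name, desc in self.objects.items())
        return f"{self.holistic} {details}".strip()


def edit_narrative(narrative: Narrative, op: str, name: str, desc: str = "") -> Narrative:
    """Apply one user edit (remove, add, or modify) to the object details."""
    objects = dict(narrative.objects)
    if op == "remove":
        objects.pop(name, None)   # no-op if the narrative never mentioned the object
    elif op in ("add", "modify"):
        objects[name] = desc
    else:
        raise ValueError(f"unknown edit operation: {op}")
    return replace(narrative, objects=objects)


def run_pipeline(
    frames: Sequence[Any],
    describe: Callable[[Sequence[Any]], Narrative],
    generate: Callable[[Sequence[Any], str], Any],
    refine: Optional[Callable[[Narrative], Narrative]] = None,
) -> Any:
    narrative = describe(frames)                     # stage 1: Video-to-Paragraph
    if refine is not None:
        narrative = refine(narrative)                # optional user refinement
    return generate(frames, narrative.to_prompt())   # stage 2: Paragraph-to-Video
```

Passing the models in as callables keeps the sketch independent of any particular captioner or diffusion backbone; in this toy model, removing an object the narrative never mentioned simply changes nothing.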

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The pipeline for video editing is novel and intuitive. The idea of leveraging the recent development of MLLMs to tackle video editing tasks is interesting.
- The experiments are comprehensive. The results of single object prediction are good and outperform some strong baselines. The results of video editing show remarkable improvement.
- The source code is provided in the supplementary materials.
- The method can be integrated with an inversion-based video editing method, and the results a…

Weaknesses

The authors could provide more video results to demonstrate the actual editing performance in different settings, for example video comparisons with FateZero and TokenFlow, as well as results of the proposed method combined with VideoCrafter and DynamiCrafter.

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The proposed Video-to-Paragraph (V2P) method serves as an effective video captioner, capable of capturing multi-granular spatiotemporal features.
- This framework achieves state-of-the-art performance in video editing tasks.
- A novel dataset is introduced that can serve as a benchmark for video editing. It contains 7.2K high-quality detailed video paragraphs and 5.5K object-level detailed caption-mask pairs.

Weaknesses

1. The proposed editing approach heavily relies on the accuracy of the Video-to-Paragraph (V2P) method, leading to potentially unnatural modifications.
   - (1) For video object removal and modification, if the paragraph does not mention a specific object (e.g., the female character's earrings are absent from the description in Figure 4), it is unclear how that object can be effectively removed or modified.
   - (2) While adding objects does not depend on the paragraph's accuracy, it is…

Reviewer 03 · Rating 5 · Confidence 5

Strengths

+ The paper is well-written and easy to understand, and the research questions are quite interesting.
+ The experimental results are relatively sufficient.

Weaknesses

- The research goal of this work is not focused. The original objective is to address video editing, but in practice, it is solving the problem of video captioning.
- The method lacks some technical innovations. The approach to video editing can be summarized as text-based video inpainting, but it seems that there is no focused and innovative design specifically tailored to this goal.

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques

Methods: Inpainting · Diffusion