RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
Jaehong Yoon, Shoubin Yu, Mohit Bansal

TL;DR
RACCooN is a flexible video editing framework that automatically generates natural language descriptions of videos and uses these narratives to guide diverse editing tasks, simplifying user interaction and enhancing editing accuracy.
Contribution
It introduces a multi-granular spatiotemporal pooling strategy for video description and leverages auto-generated narratives to improve video editing and content generation.
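To make the idea of pooling at several granularities concrete, here is a minimal NumPy sketch. It is a generic illustration only: this summary does not specify how RACCooN's pooling works, so the function name `multi_granular_pool`, the `(T, H, W, C)` feature layout, and the choice of global, per-frame, and grid-region averages are all assumptions, not the authors' method.

```python
import numpy as np


def multi_granular_pool(feats: np.ndarray, grid: int = 2) -> np.ndarray:
    """Pool patch features of shape (T, H, W, C) at three granularities.

    Hypothetical illustration, not the paper's pooling strategy.
    Returns an array of shape (1 + T + grid * grid, C).
    """
    T, H, W, C = feats.shape
    if H % grid or W % grid:
        raise ValueError("H and W must be divisible by grid")

    # Coarsest level: one token summarizing the whole clip.
    global_tok = feats.mean(axis=(0, 1, 2))[None, :]            # (1, C)

    # Temporal level: one token per frame, averaged over space.
    frame_toks = feats.mean(axis=(1, 2))                         # (T, C)

    # Spatial level: one token per grid cell, averaged over time.
    cells = feats.reshape(T, grid, H // grid, grid, W // grid, C)
    region_toks = cells.mean(axis=(0, 2, 4)).reshape(grid * grid, C)

    return np.concatenate([global_tok, frame_toks, region_toks], axis=0)


feats = np.arange(4 * 4 * 4 * 3, dtype=float).reshape(4, 4, 4, 3)
tokens = multi_granular_pool(feats, grid=2)
```

With 4 frames and a 2x2 grid, `tokens` holds 1 global, 4 frame-level, and 4 region-level tokens. A captioning model could consume the concatenated tokens so that both holistic context and localized detail are available.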
Findings
Effective multi-granular video description generation.
Versatile editing capabilities including removal, addition, and modification.
Enhanced video content generation using auto-generated narratives.
Abstract
Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling…
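The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Narrative`, `v2p`, `p2v`, `edit_video`, and the stub describer and editor are hypothetical names, and in the real system the describer would be a multimodal LLM and the editor a video diffusion model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Narrative:
    """Hypothetical container for an auto-generated description."""
    holistic: str             # scene-level context
    objects: Dict[str, str]   # object name -> focused detail

    def to_paragraph(self) -> str:
        details = " ".join(
            f"The {name} is {desc}." for name, desc in self.objects.items()
        )
        return f"{self.holistic} {details}".strip()


def v2p(frames: List[Any], describer: Callable[[List[Any]], Narrative]) -> Narrative:
    """Stage 1 (Video-to-Paragraph): describe the video automatically."""
    return describer(frames)


def p2v(frames: List[Any], paragraph: str,
        editor: Callable[[List[Any], str], Any]) -> Any:
    """Stage 2 (Paragraph-to-Video): condition the editor on the paragraph."""
    return editor(frames, paragraph)


def edit_video(frames: List[Any],
               describer: Callable[[List[Any]], Narrative],
               editor: Callable[[List[Any], str], Any],
               refine: Optional[Callable[[str], str]] = None) -> Any:
    paragraph = v2p(frames, describer).to_paragraph()
    if refine is not None:    # user refinement is optional
        paragraph = refine(paragraph)
    return p2v(frames, paragraph, editor)


# Stubs standing in for the real models (assumptions for illustration).
def stub_describer(frames: List[Any]) -> Narrative:
    return Narrative("A dog runs on a beach.", {"dog": "brown and small"})


def stub_editor(frames: List[Any], paragraph: str) -> Dict[str, Any]:
    return {"num_frames": len(frames), "prompt": paragraph}


frames = ["frame0", "frame1", "frame2", "frame3"]
result = edit_video(frames, stub_describer, stub_editor,
                    refine=lambda p: p.replace("brown", "white"))
```

Here the user's only input is a small change to the generated paragraph (`brown` to `white`), which is the interaction pattern the abstract describes for modification; omitting `refine` passes the auto-generated description through unchanged.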
Peer Reviews
Decision · Submitted to ICLR 2025
- The pipeline for video editing is novel and intuitive. The idea of leveraging the recent development of MLLMs to tackle video editing tasks is interesting.
- The experiments are comprehensive. The results of single object prediction are good and outperform some strong baselines. The results of video editing show remarkable improvement.
- The source code is provided in the supplementary materials.
- The method can be integrated with an inversion-based video editing method, and the results a
The authors could provide more video results to demonstrate actual editing performance in different settings, for example video comparisons with FateZero and TokenFlow, as well as results of the proposed method combined with VideoCrafter and DynamiCrafter.
- The proposed Video-to-Paragraph (V2P) method serves as an effective video captioner, capable of capturing multi-granular spatiotemporal features.
- This framework achieves state-of-the-art performance in video editing tasks.
- A novel dataset is introduced that can serve as a benchmark for video editing; it contains 7.2K high-quality detailed video paragraphs and 5.5K object-level detailed caption-mask pairs.
1. The proposed editing approach heavily relies on the accuracy of the Video-to-Paragraph (V2P) method, leading to potentially unnatural modifications.
  - (1) For video object removal and modification, if the paragraph does not mention a specific object (e.g., in the description of Figure 4, the absence of the female character’s earrings), it raises the question of how to effectively remove or modify that object.
  - (2) While adding objects does not depend on the paragraph's accuracy, it is
+ The paper is well-written and easy to understand, and the research questions are quite interesting.
+ The experimental results are relatively sufficient.
- The research goal of this work is not focused. The stated objective is video editing, but in practice the work mainly solves video captioning.
- The method lacks technical innovation. The approach to video editing can be summarized as text-based video inpainting, and there appears to be no focused, innovative design tailored specifically to that goal.
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques
Methods: Inpainting · Diffusion
