TL;DR
CutClaw is an autonomous multi-agent system that efficiently creates short, rhythm-aligned videos from hours-long footage by leveraging multimodal models and hierarchical decomposition.
Contribution
It introduces a novel multi-agent framework with hierarchical multimodal decomposition for automated, narrative-consistent video editing synchronized with music.
Findings
Outperforms state-of-the-art baselines in video quality and rhythm alignment.
Effectively captures both fine-grained details and global structures in videos.
Demonstrates the potential for autonomous, long-form video editing.
Abstract
Editing video so that it aligns with audio has become a form of digital craft on social media. However, manual video editing is time-consuming and repetitive, which has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that edits hours-long raw footage into meaningful short videos by using multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos that are synchronized with music, follow the given instructions, and are visually appealing. In detail, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures…
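To make the idea of rhythm-aligned cuts concrete, here is a minimal sketch of one way candidate cut points could be snapped to musical beats. It is an illustration only, not CutClaw's method: the function name `snap_to_beats`, the nearest-beat rule, and the sample timestamps are all assumptions introduced here, and the abstract does not describe how the system aligns cuts with music.

```python
from bisect import bisect_left


def snap_to_beats(cut_times, beat_times):
    """Move each candidate cut time (seconds) to the nearest beat timestamp.

    Hypothetical helper for illustration; not taken from the paper.
    """
    beats = sorted(beat_times)
    if not beats:
        # Without beat information, leave the cuts unchanged.
        return list(cut_times)

    snapped = []
    for t in cut_times:
        i = bisect_left(beats, t)
        # Only the beat just before and the beat at or after t can be nearest.
        candidates = beats[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda b: abs(b - t)))
    return snapped


# Assumed example data: beats every 0.5 s and three rough cut points.
beats = [0.0, 0.5, 1.0, 1.5, 2.0]
cuts = [0.4, 1.2, 1.9]
print(snap_to_beats(cuts, beats))  # [0.5, 1.0, 2.0]
```

A nearest-beat rule is the simplest choice. A real editor would likely also weigh shot content and narrative constraints before moving a cut.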
