TL;DR
EditCtrl is a novel, efficient video editing framework that localizes computation to edited regions, significantly reducing costs and improving quality over existing methods.
Contribution
It introduces a local video context module and a lightweight global embedder, enabling real-time, high-quality, multi-region video editing with minimal computation.
Findings
EditCtrl is 10 times more compute-efficient than state-of-the-art generative editing methods.
It improves editing quality despite reduced computational cost.
It supports multi-region editing with text prompts and autoregressive content propagation.
Abstract
High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it…
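The abstract's claim that cost is proportional to edit size can be illustrated with a toy calculation. The sketch below is not the authors' implementation: the function names, the nested-list mask layout, and the 4x4x4 token grid are all assumptions made for illustration. It only shows how selecting masked tokens shrinks the set a local module would process.

```python
from typing import List, Tuple

# Hypothetical layout: mask[frame][row][col], 1 = token inside the edit mask.
Mask = List[List[List[int]]]


def masked_token_indices(mask: Mask) -> List[Tuple[int, int, int]]:
    """Return (frame, row, col) for every token inside the edit mask."""
    return [
        (t, r, c)
        for t, frame in enumerate(mask)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if value
    ]


def masked_fraction(mask: Mask) -> float:
    """Fraction of all video tokens that fall inside the edit mask."""
    total = sum(len(row) for frame in mask for row in frame)
    return len(masked_token_indices(mask)) / total


# Toy video: 4 frames of 4x4 tokens, with a 2x2 edit region in every frame.
toy_mask: Mask = [
    [[1 if 1 <= r <= 2 and 1 <= c <= 2 else 0 for c in range(4)] for r in range(4)]
    for _ in range(4)
]

local_tokens = masked_token_indices(toy_mask)
fraction = masked_fraction(toy_mask)
print(f"{len(local_tokens)} of 64 tokens are processed locally ({fraction:.0%})")
```

Under the simplifying assumption that work scales linearly with the number of tokens processed, a module restricted to these tokens would do about a quarter of the full-video work in this toy case. The actual savings depend on architecture details not given in the abstract.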
