Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising
Shuangquan Lyu, Steven Mao, Yue Ma

TL;DR
This paper presents a unified method for long video inpainting and outpainting that extends text-to-video diffusion models to produce arbitrarily long, spatially edited videos with high fidelity and temporal consistency.
Contribution
The authors introduce a novel approach combining LoRA fine-tuning and overlap-and-blend co-denoising for high-quality, long-range video editing without seams or drift.
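For readers unfamiliar with LoRA, the standard low-rank update can be written as below. This is the generic LoRA formulation, not a detail taken from the paper: the rank r, the scaling factor \alpha, and the choice of which layers of the diffusion model receive adapters are assumptions that this summary does not specify.

```latex
% Generic LoRA update for one frozen weight matrix W_0 (symbols are illustrative).
% Only A and B are trained; W_0 stays fixed.
\[
  W' = W_0 + \frac{\alpha}{r}\, B A,
  \qquad
  B \in \mathbb{R}^{d \times r},\quad
  A \in \mathbb{R}^{r \times k},\quad
  r \ll \min(d, k)
\]
```

Because only A and B are updated, the number of trainable parameters per adapted layer is r(d + k) rather than dk, which is what makes fine-tuning a large pre-trained video model tractable.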
Findings
Outperforms baseline methods in PSNR/SSIM metrics
Enables editing over hundreds of frames
Maintains high perceptual realism (LPIPS)
Abstract
Generating long videos remains a fundamental challenge, and achieving high controllability in video inpainting and outpainting is particularly demanding. To address both challenges simultaneously, we introduce a novel and unified approach for long video inpainting and outpainting that extends text-to-video diffusion models to generate arbitrarily long, spatially edited videos with high fidelity. Our method leverages LoRA to efficiently fine-tune a large pre-trained video diffusion model, such as Alibaba's Wan 2.1, for masked-region video synthesis, and employs an overlap-and-blend temporal co-denoising strategy with high-order solvers to maintain consistency across long sequences. In contrast to prior work that struggles with fixed-length clips or exhibits stitching artifacts, our system enables…
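The overlap-and-blend idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear ramp weights, and the use of one scalar per frame in place of a latent tensor are all assumptions made for illustration, and the high-order solver is replaced by a caller-supplied `denoise_window` stand-in.

```python
def window_starts(num_frames, window, overlap):
    """Start indices of overlapping windows that together cover every frame."""
    if not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < window")
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # final window sits flush with the end
    return starts


def ramp_weights(length, overlap):
    """Weights that ramp up near each window edge and are 1.0 in the middle."""
    weights = []
    for i in range(length):
        edge = min(i, length - 1 - i)  # distance to the nearest window edge
        weights.append(min(1.0, (edge + 1) / (overlap + 1)))
    return weights


def blend_step(latents, denoise_window, window, overlap):
    """One co-denoising step: denoise each window, then blend the overlaps.

    `latents` is a list with one value per frame (a scalar here, hypothetical
    stand-in for a per-frame latent). `denoise_window` is a hypothetical
    callable that maps a list of frame values to a list of the same length.
    """
    num_frames = len(latents)
    length = min(window, num_frames)
    weights = ramp_weights(length, overlap)
    total = [0.0] * num_frames
    norm = [0.0] * num_frames
    for start in window_starts(num_frames, window, overlap):
        out = denoise_window(latents[start:start + length])
        for offset, (value, w) in enumerate(zip(out, weights)):
            total[start + offset] += w * value
            norm[start + offset] += w
    # Normalize so that frames covered by several windows get a weighted mean.
    return [t / n for t, n in zip(total, norm)]
```

Running `blend_step` once per diffusion timestep keeps all windows on a shared set of latents, so neighboring windows agree on their overlapping frames before the next step. The ramp weights down-weight frames near a window boundary, which is one common way to avoid visible seams; the paper may use a different weighting.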
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Image Enhancement Techniques
