TL;DR
PISCO is a video diffusion model for precise, controllable video instance insertion that requires minimal user input while preserving scene fidelity and scene interaction.
Contribution
PISCO introduces Variable-Information Guidance and Distribution-Preserving Temporal Masking to improve sparse control in video diffusion models, and contributes a new benchmark, PISCO-Bench.
Findings
PISCO outperforms existing inpainting and editing baselines.
Performance improves with additional control signals.
PISCO achieves high-fidelity, scene-consistent instance insertion.
Abstract
The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation, which relies on exhaustive prompt engineering and "cherry-picking", towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task imposes several requirements: precise spatio-temporal placement, physically consistent scene interaction, and faithful preservation of the original dynamics, all achieved with minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO…
