Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Ziyun Zeng, Yiqi Lin, Guoqiang Liang, Mike Zheng Shou

TL;DR
Sparkle introduces a new large-scale dataset and benchmark for instruction-guided video background replacement, enabling more realistic and temporally consistent scene synthesis.
Contribution
We develop a scalable pipeline for generating high-quality background guidance data and use it to create Sparkle, the largest dataset and benchmark for this task.
Findings
Our model, trained on the Sparkle dataset, outperforms existing baselines on evaluation benchmarks.
Sparkle achieves more natural and temporally consistent background replacements.
The decoupled guidance approach improves data quality and model performance.
Abstract
In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently…
