DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
Weitao Wang, Zichen Wang, Hongdeng Shen, Yulei Lu, Xirui Fan, Suhui Wu, Jun Zhang, Haoqian Wang, Hao Zhang

TL;DR
DreamSwapV is a versatile, mask-guided video editing framework that enables high-fidelity subject swapping in videos, accommodating various scales and attributes with improved guidance and a dedicated dataset.
Contribution
It introduces a novel mask-guided, subject-agnostic framework with multiple conditions and an adaptive mask strategy for high-quality, customizable video subject swapping.
Findings
Outperforms existing methods on VBench indicators
Introduces a new DreamSwapV-Benchmark dataset
Achieves high-fidelity, flexible subject swapping
Abstract
With the rapid progress of video generation, demand for customized video editing is surging, where subject swapping constitutes a key component yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains--such as human-body animation or hand-object interaction--or rely on some indirect editing paradigm or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video for customization with a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject…
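The abstract does not spell out the adaptive mask strategy, but one review below summarizes it as adjusting a grid size to the subject's scale. The following is a minimal sketch of that general idea only, not the authors' implementation: the function name `adaptive_grid_mask`, the parameters `cells_across_subject` and `min_cell`, and the bounding-box measure of scale are all assumptions made for illustration.

```python
import numpy as np


def adaptive_grid_mask(mask, cells_across_subject=4, min_cell=2):
    """Coarsen a binary mask onto a grid whose cell size grows with the subject.

    Hypothetical illustration: every name and default here is an assumption,
    not something taken from the paper.
    """
    mask = np.asarray(mask).astype(bool)
    h, w = mask.shape
    out = np.zeros_like(mask)
    if not mask.any():
        return out

    # Assumed measure of subject scale: longer side of the bounding box.
    ys, xs = np.nonzero(mask)
    extent = int(max(ys.max() - ys.min() + 1, xs.max() - xs.min() + 1))

    # Larger subjects get larger grid cells; small ones keep a fine grid.
    cell = max(min_cell, extent // cells_across_subject)

    # Fill every grid cell that contains at least one subject pixel.
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            if mask[y0:y0 + cell, x0:x0 + cell].any():
                out[y0:y0 + cell, x0:x0 + cell] = True
    return out


# Usage: a small and a large square subject on blank frames.
small = np.zeros((32, 32), dtype=bool)
small[8:12, 8:12] = True
large = np.zeros((64, 64), dtype=bool)
large[5:37, 5:37] = True

small_out = adaptive_grid_mask(small)
large_out = adaptive_grid_mask(large)
```

In this sketch the coarsened mask always covers the original one, and a larger subject is snapped to a coarser grid, so the exact silhouette is hidden more aggressively as the subject grows. The boundary augmentation with geometric shapes that the review also mentions is not modeled here.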
Peer Reviews
Decision: ICLR 2026 Poster
S1) The method is well ablated, and the ablations justify the design choices that yield finer details and more robust subject-context integration.
S2) The adaptive mask is a good design: it dynamically adjusts the grid size based on the subject's scale and augments mask boundaries with geometric shapes.
S3) It's great to see the discussion on handling long videos, as most related works can only edit videos of a certain length / training length.
W1) In Table 1, the automatic metrics "Video Quality & Video Consistency" show AnyV2V with high scores on most quantitative indicators (subject consistency, background consistency, motion smoothness), nearly on par with other leading methods. However, the user study metric (human-rated reference detail, subject interaction, and visual fidelity) shows AnyV2V achieving nearly zero. This shows the proposed metrics are insufficiently sensitive to qualitative breakdowns. It seems only the reference a
* The combination of a multi-condition fusion module and adaptive mask strategy is intuitive yet effective, addressing common artifacts, e.g., boundary leakage and poor subject-context blending.
* The proposed DreamSwapV-Benchmark is a valuable addition to the field, with well-defined metrics and an attempt at quantitative evaluation for a task lacking standard benchmarks.
* The paper is clearly written and easy to follow, with well-organized figures and methodological explanations.
* While the new benchmark is appreciated, the evaluation relies heavily on VBench-style metrics that may not capture identity preservation or semantic consistency robustly. A few qualitative examples are shown, but it remains unclear how the model generalizes to truly out-of-domain subjects or complex dynamic interactions.
* Some ablations, e.g., condition fusion variants, mask augmentation parameters, are only qualitatively discussed. Quantitative ablations would make the argument stronger.
* T
1. The proposed condition fusion module and adaptive mask strategy enable fine-grained control, resulting in better subject-context interaction together with high-quality visual improvements.
2. The paper introduces a new benchmark, DreamSwapV-Benchmark, in the customized image editing domain, and the proposed method shows improvements over previous baselines in both quantitative metrics and user studies.
3. The model can be extended beyond subject swapping to related tasks like video inpainting
1. The method relies on many detailed design choices and a two-phase training process across multiple datasets. While these improve performance, they make the system complicated and harder to reproduce.
2. Some baselines, like AnyV2V, are training-free or designed for broader editing rather than subject swapping, which makes the comparison less fair.
Taxonomy
Topics: Digital Rights Management and Security · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
