TL;DR
MAGREF is a novel framework for any-reference video generation that uses masked guidance and subject disentanglement to produce high-fidelity videos with consistent identities and reduced artifacts, conditioned on diverse references and text prompts.
Contribution
It introduces masked guidance and subject disentanglement mechanisms, along with a four-stage data pipeline, to improve identity consistency and reduce artifacts in any-reference video synthesis.
Findings
Outperforms state-of-the-art methods on benchmark datasets.
Effectively maintains identity consistency across generated videos.
Reduces copy-paste artifacts and subject confusion.
Abstract
We tackle the task of any-reference video generation, which aims to synthesize videos conditioned on arbitrary types and combinations of reference subjects, together with textual prompts. This task faces persistent challenges, including identity inconsistency, entanglement among multiple reference subjects, and copy-paste artifacts. To address these issues, we introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism, enabling flexible synthesis conditioned on diverse reference images and textual prompts. Specifically, masked guidance employs a region-aware masking mechanism combined with pixel-wise channel concatenation to preserve appearance features of multiple subjects along the channel dimension. This design preserves identity consistency and maintains the capabilities of…
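The masked guidance described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes, following the reviews below, that the reference subjects are pasted onto a blank canvas, that a binary mask marks the regions they occupy, and that canvas and mask are concatenated with the noisy video along the channel dimension. The function names `compose_canvas` and `build_model_input`, the side-by-side slot layout, the pixel-space (rather than latent-space) tensors, and the repetition of the canvas across all frames are hypothetical choices made for illustration.

```python
import numpy as np


def compose_canvas(subjects, canvas_hw, rng):
    """Paste subject crops onto a blank canvas in shuffled, non-overlapping slots.

    Hypothetical layout: the canvas is split into equal-width vertical slots,
    one per subject, and the subject order is randomly permuted.
    Returns the canvas (H, W, 3) and a binary region mask (H, W, 1).
    """
    H, W = canvas_hw
    n = len(subjects)
    canvas = np.zeros((H, W, 3), dtype=np.float32)
    mask = np.zeros((H, W, 1), dtype=np.float32)
    slot_w = W // n
    order = rng.permutation(n)  # random placement order (assumption)
    for slot, idx in enumerate(order):
        crop = subjects[idx]
        h, w = crop.shape[:2]
        if h > H or w > slot_w:
            raise ValueError("subject crop does not fit into its canvas slot")
        x0 = slot * slot_w
        canvas[:h, x0:x0 + w] = crop   # copy the subject's pixel values
        mask[:h, x0:x0 + w] = 1.0      # mark the region the subject occupies
    return canvas, mask


def build_model_input(noisy_frames, canvas, mask):
    """Concatenate noisy frames, reference canvas, and mask along channels.

    noisy_frames: (T, H, W, C). The canvas and mask are repeated over the
    T frames (an assumption of this sketch), giving (T, H, W, C + 3 + 1).
    """
    T = noisy_frames.shape[0]
    ref = np.broadcast_to(canvas, (T, *canvas.shape))
    region = np.broadcast_to(mask, (T, *mask.shape))
    return np.concatenate([noisy_frames, ref, region], axis=-1)


rng = np.random.default_rng(0)
subjects = [
    np.full((4, 4, 3), 1.0, dtype=np.float32),  # stand-in for subject A
    np.full((4, 4, 3), 2.0, dtype=np.float32),  # stand-in for subject B
]
canvas, mask = compose_canvas(subjects, (8, 16), rng)
noisy = np.zeros((2, 8, 16, 4), dtype=np.float32)
model_input = build_model_input(noisy, canvas, mask)
print(model_input.shape)
```

Because the conditioning only adds input channels, a design like this would leave the backbone's architecture untouched, which matches the reviewers' remark that identity is preserved "without backbone changes". The fixed slot width also makes the reviewers' scalability concern concrete: each added subject shrinks the space available to the others.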
Peer Reviews
Decision·ICLR 2026 Poster
1. Technically sound: Masked guidance and pixel-wise concatenation are simple yet effective extensions of I2V backbones.
2. Comprehensive results: The experiments and ablations clearly support the claimed improvements.
1. My major concern is that the pixel-wise channel concatenation may limit scalability when the number of reference subjects grows. It is unclear how the model handles more subjects simultaneously. Would temporal or latent-level concatenation yield more flexible conditioning in such cases?
2. While pixel-wise concatenation effectively preserves subject appearance, it may inherently limit global-level customization. Since it injects spatially grounded features, the model mainly captures concrete
- The writing is easy to understand, and the figures are well drawn.
- The proposed region-aware masking method preserves subject identity without backbone changes.
- Experimentally, the paper achieves the best single-ID and multi-subject scores.
1. In this paper, multiple subjects are introduced into the videos through a blank canvas. Several subjects are added directly to the canvas with their pixel values. Although this function is useful, my major concerns are listed below:
   a) The canvas size is limited. How many subjects can be placed on the canvas without harming the model’s generation ability?
   b) The positions of these subjects are randomly shuffled during training; in my opinion, the locations of different subject
- This paper proposes a novel structure to condition video generation on multiple images by combining multiple images into one. It also proposes a data pipeline to collect large-scale
- The proposed subject disentanglement mechanism is novel and effective for text-prompt alignment across the different subjects in the image.
- The results show that the proposed framework outperforms the other baselines. The ablation study shows that all the proposed modules are meaningful.
- The paper is well-written and easy to
The paper is generally good. Some minor points:
- The computational resources are not mentioned, e.g. how many GPUs were used.
- No details are given about the dataset, e.g. its source and size, or possible privacy problems with human face data.
