Identity-Consistent Video Generation under Large Facial-Angle Variations
Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, and Jingdong Wang

TL;DR
This paper introduces a multi-view conditioned framework, $\text{Mv}^2\text{ID}$, for identity-consistent video generation under large facial-angle variations, effectively balancing identity preservation and motion naturalness without costly cross-paired data.
Contribution
The paper proposes a novel multi-view conditioned approach with region-masking and reference decoupled-RoPE mechanisms, enabling identity-consistent video synthesis with in-paired supervision.
Findings
Significantly improves identity consistency in generated videos.
Maintains natural facial motion despite large angle variations.
Outperforms existing methods trained with cross-paired data.
Abstract
Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance consistency and naturalness, we propose $\text{Mv}^2\text{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference…
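The abstract does not say how masked regions are chosen, so the following is only an illustrative sketch of the general idea behind region masking on multi-view references, not the authors' implementation. It assumes one random rectangle is zeroed out per reference view, so that no single view is complete and identity cues have to be gathered across views. The function name `mask_reference_views`, the `mask_frac` parameter, the rectangular mask shape, and the zero fill value are all hypothetical choices made for this example.

```python
import random


def mask_reference_views(views, mask_frac=0.4, seed=None):
    """Zero out one random rectangle in each reference view (illustrative only).

    views: list of 2D grids (lists of rows of numbers), one grid per view.
    mask_frac: rectangle side as a fraction of each image side (assumed value).
    Returns (masked_views, boxes) with boxes[i] = (top, left, height, width).
    """
    rng = random.Random(seed)
    masked_views, boxes = [], []
    for view in views:
        h, w = len(view), len(view[0])
        mh = max(1, int(h * mask_frac))
        mw = max(1, int(w * mask_frac))
        # Sample the rectangle independently for each view, so different
        # views hide different regions and remain complementary.
        top = rng.randint(0, h - mh)
        left = rng.randint(0, w - mw)
        masked = [row[:] for row in view]  # copy, leaving the input untouched
        for r in range(top, top + mh):
            for c in range(left, left + mw):
                masked[r][c] = 0
        masked_views.append(masked)
        boxes.append((top, left, mh, mw))
    return masked_views, boxes


# Toy usage: three 8x8 "views" filled with ones.
views = [[[1] * 8 for _ in range(8)] for _ in range(3)]
masked, boxes = mask_reference_views(views, mask_frac=0.5, seed=0)
```

In a real training pipeline the masking would presumably act on image tensors or latent features and might follow facial regions rather than arbitrary rectangles. Those details are not recoverable from the abstract.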
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Visual Attention and Saliency Detection
