Identity-Consistent Video Generation under Large Facial-Angle Variations

Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, and Jingdong Wang

arXiv:2603.21299 · cs.CV · March 24, 2026

Open Access

TL;DR

This paper introduces a multi-view conditioned framework, $\text{Mv}^2\text{ID}$, for identity-consistent video generation under large facial-angle variations, effectively balancing identity preservation and motion naturalness without costly cross-paired data.

Contribution

The paper proposes a novel multi-view conditioned approach with region-masking and reference decoupled-RoPE mechanisms, enabling identity-consistent video synthesis with in-paired supervision.

Findings

01. Significantly improves identity consistency in generated videos.

02. Maintains natural facial motion despite large angle variations.

03. Outperforms existing methods trained with cross-paired data.

Abstract

Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces the naturalness of facial motion. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance consistency and naturalness, we propose $\text{Mv}^2\text{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy that prevents shortcut learning and extracts essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Visual Attention and Saliency Detection