X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention
Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu

TL;DR
X-NeMo is a diffusion-based portrait animation method that effectively captures subtle facial expressions and prevents identity leakage by using a disentangled, latent motion descriptor controlled through cross-attention.
Contribution
It introduces a novel end-to-end training framework with a 1D identity-agnostic motion descriptor and dual GAN supervision, improving expressiveness and disentanglement in zero-shot facial reenactment.
Findings
Outperforms state-of-the-art methods in expression quality
Reduces identity leakage in facial reenactment
Captures fine-grained facial motions accurately
Abstract
We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from a driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN…
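The idea of steering generation with a 1D motion latent through cross-attention can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `motion_cross_attention`, all tensor sizes, the random projection matrices, and the choice to split the latent into several tokens are assumptions made only for illustration. Image feature tokens act as queries, and the motion descriptor supplies keys and values, so the motion signal carries no spatial layout of its own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def motion_cross_attention(image_tokens, motion_latent, w_q, w_k, w_v, num_motion_tokens):
    """image_tokens: (N, C) features; motion_latent: (D,) 1D descriptor."""
    # Assumption: split the 1D latent into a few tokens so that the
    # attention has more than one key to choose from.
    motion_tokens = motion_latent.reshape(num_motion_tokens, -1)  # (T, D/T)
    q = image_tokens @ w_q    # (N, A) queries from the image features
    k = motion_tokens @ w_k   # (T, A) keys from the motion latent
    v = motion_tokens @ w_v   # (T, C) values from the motion latent
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
    # Residual update: each image token receives a motion-dependent offset.
    return image_tokens + weights @ v, weights

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
N, C, D, T, A = 16, 8, 32, 4, 8
image_tokens = rng.normal(size=(N, C))
motion_latent = rng.normal(size=(D,))
w_q = rng.normal(size=(C, A))
w_k = rng.normal(size=(D // T, A))
w_v = rng.normal(size=(D // T, C))

out, weights = motion_cross_attention(image_tokens, motion_latent, w_q, w_k, w_v, T)
print(out.shape, weights.shape)
```

Because the latent enters only as keys and values, every image location decides for itself how much of each motion token to absorb. This contrasts with the spatially aligned injection schemes (PoseGuider, ControlNet) that the reviews mention.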
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper is well written and easy to follow. 2. This paper proposes a new portrait animation pipeline that effectively addresses the longstanding issues of identity entanglement and loss of motion expressiveness. 3. Extensive experiments demonstrate the effectiveness of this method. 4. Great work! The motivation and experimental results for each component are solid. The demo in the supplementary materials also looks very impressive (if it isn’t cherry-picked).
1. Since the motion model is trained, could it struggle to adapt to out-of-distribution (OOD) motions? Could you provide extreme or unusual facial expressions to demonstrate robustness? 2. In the results provided in the paper and the demo, the facial features of the driving and reference images are quite similar. Could you provide more examples where facial features (such as eyes, mouth, nose, etc.), face position, or head pose are inconsistent? 3. As stated in W2, I also cannot tell if this paper truly
1. This work proposes a feasible solution to address the limitations of previous portrait animation methods that rely on explicit motion descriptors or the integration of motion information through PoseGuider and ControlNet. 2. This study demonstrates strong visual performance across various samples, showcasing its robust capabilities in motion transfer and stability. 3. This work includes comprehensive comparisons with prior methods and an ablation study to validate the proposed techniques.
1. A temporal evaluation of spatially aligned motion injection versus attention-based motion injection is recommended. Intuitively, spatially aligned motion injection is expected to provide better temporal consistency due to its stronger spatial priors. 2. Additional analysis and experiments are needed to clarify why X-NeMo achieves such high levels of temporal consistency. Other methods, such as LivePortrait and X-Portrait, also include a stage for training temporal modules, yet they still exhi
+ The paper is well-structured. + The problem of portrait animation with high expressiveness and identity preservation is important. + The use of a 1D latent motion descriptor and cross-attention for motion control is reasonable.
- What is the definition of zero-shot here? Firstly, the model is trained. Secondly, several reference images are provided at inference. Thirdly, the description of zero-shot is missing. - The method has three training stages. What is meant by end-to-end learning, as described in several places in the paper? It seems each component is trained separately. - The only approach to obtaining identity-agnostic features is to augment the images with color jitter, scaling, and affine transformation. Such
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation
