IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer
Bo Chen, Tao Liu, Qi Chen, Xie Chen, Zilong Zheng

TL;DR
IMTalker introduces an efficient, high-fidelity talking face generation framework that uses implicit motion transfer via cross-attention, improving motion accuracy, identity preservation, and synchronization over prior explicit flow-based methods.
Contribution
The paper proposes a novel implicit motion transfer approach that combines a cross-attention mechanism with an identity-adaptive module, enhancing global motion modeling and identity disentanglement in talking face synthesis.
Findings
Outperforms prior methods in motion accuracy and identity preservation.
Achieves real-time generation at 40-42 FPS on a high-end GPU.
Demonstrates superior audio-lip synchronization quality.
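To put the reported frame rate in perspective, the short sketch below converts frames per second into a per-frame time budget. The helper name `frame_budget_ms` is made up for illustration; only the 40-42 FPS figures come from the summary above.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# The two ends of the reported 40-42 FPS range.
for fps in (40, 42):
    print(f"{fps} FPS -> {frame_budget_ms(fps):.1f} ms per frame")
```

At these rates the whole pipeline has roughly 24 to 25 ms to produce each frame.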
Abstract
Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator…
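The abstract's central idea, replacing flow-based warping with cross-attention, can be illustrated with a generic scaled dot-product cross-attention in NumPy. This is only a minimal sketch under stated assumptions, not the authors' architecture: the token counts, the feature dimension, and the choice of which tensor supplies queries versus keys and values are all hypothetical, and the names `motion_queries` and `source_feats` do not come from the paper.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray):
    """Scaled dot-product cross-attention.

    queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v).
    Returns the attended output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


rng = np.random.default_rng(0)
n_tokens, dim = 16, 8  # hypothetical sizes, chosen only for the demo

# Assumption: queries encode the target (driving) motion, while keys and
# values come from source-image features that carry identity and appearance.
motion_queries = rng.standard_normal((n_tokens, dim))
source_feats = rng.standard_normal((n_tokens, dim))

out, weights = cross_attention(motion_queries, source_feats, source_feats)
print(out.shape, weights.shape)
```

Unlike a local flow field, every output token here can draw on every source token, which is one way to read the abstract's claim about modeling global motion.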
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
