MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation

Dechao Meng; Steven Xiao; Xindi Zhang; Guangyuan Wang; Peng Zhang; Qi Wang; Bang Zhang; Liefeng Bo

arXiv:2506.22065·cs.CV·June 30, 2025

Open Access

TL;DR

MirrorMe is a real-time, high-fidelity framework for audio-driven half-body animation. It builds on a diffusion transformer and adds identity-injection, audio-conditioning, and training mechanisms to improve temporal coherence and gesture control.

Contribution

The paper introduces MirrorMe, a novel real-time framework using diffusion transformers with new mechanisms for identity preservation, audio synchronization, and multi-level training for high-quality animated videos.

Findings

01 State-of-the-art fidelity and lip-sync accuracy

02 Enhanced temporal stability in animations

03 Effective gesture control including hand poses

Abstract

Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into denoising processes, their reliance on frame-by-frame UNet architectures introduces prohibitive latency and struggles with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent space denoising. To address LTX's trade-offs between compression and semantic fidelity, we propose three innovations: 1. A reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; 2. A causal…
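The first innovation in the abstract, identity injection by concatenating reference-image latents with the video latents and applying self-attention, can be illustrated with a toy sketch. This is not the authors' implementation: the token values are made-up stand-ins for VAE latents, the names `self_attention` and `inject_reference` are hypothetical, and the learned query/key/value projections of a real transformer are replaced by identity maps to keep the example short.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections
    # (an assumption for brevity; real models learn these projections).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def inject_reference(ref_tokens, video_tokens):
    # Concatenate reference tokens with video tokens along the sequence axis,
    # attend over the joint sequence, then keep only the video positions.
    joint = ref_tokens + video_tokens
    attended = self_attention(joint)
    return attended[len(ref_tokens):]

# Toy stand-ins for VAE-encoded latents (hypothetical values).
ref_tokens = [[1.0, 0.0], [0.0, 1.0]]
video_tokens = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
out = inject_reference(ref_tokens, video_tokens)
```

Because every video token attends to the reference tokens in the same pass as the other video tokens, appearance information from the reference can reach every frame position without a separate conditioning branch, which is the general idea the abstract's wording suggests.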

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Motion and Animation