TL;DR
OmniAvatar is an audio-driven full-body video generation model that combines a pixel-wise multi-hierarchical audio embedding with prompt control to produce natural, lip-synced human animation across diverse scenarios.
Contribution
It introduces a pixel-wise multi-hierarchical audio embedding strategy and a LoRA-based training approach for improved full-body animation with precise control.
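To make the LoRA idea concrete, here is a minimal sketch of a low-rank adapter on a single linear layer. It illustrates the general LoRA technique only: the pretrained weight stays frozen and a small low-rank update is learned, which is why the foundation model's prompt-following ability can be preserved. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialization are all illustrative assumptions; this summary does not state OmniAvatar's actual rank, target layers, or hyperparameters.

```python
import numpy as np


class LoRALinear:
    """Hypothetical frozen linear layer with a trainable low-rank update.

    Computes y = x W^T + (alpha / rank) * x A^T B^T, where W is frozen
    and only A and B would be trained.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        # Assumed initialization: small random A, zero B, so the adapter
        # starts as an exact no-op on the pretrained layer.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x):
        base = x @ self.weight.T              # frozen path
        update = (x @ self.A.T) @ self.B.T    # low-rank trainable path
        return base + self.scale * update

    def merged_weight(self):
        # After training, the update can be folded into a single matrix.
        return self.weight + self.scale * (self.B @ self.A)

    def trainable_params(self):
        return self.A.size + self.B.size


# Usage: wrap a random stand-in for a pretrained 8x16 projection.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
layer = LoRALinear(W, rank=2, alpha=2.0)
y = layer.forward(rng.normal(size=(3, 16)))  # shape (3, 8)
```

With rank 2, the adapter in this toy example trains 48 parameters instead of the 128 in the full weight matrix, and because `B` starts at zero the wrapped layer initially reproduces the frozen layer's output exactly.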
Findings
Outperforms existing models in facial and semi-body video generation
Achieves high lip-sync accuracy and natural movements
Enables versatile domain-specific video creation
Abstract
Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses…
