TL;DR
GenLCA is a scalable 3D diffusion model, trained on millions of in-the-wild videos, that generates and edits photorealistic full-body avatars from text and image inputs.
Contribution
A visibility-aware diffusion training strategy, combined with a pretrained avatar reconstruction model repurposed as an animatable 3D tokenizer, enables training on large-scale real-world video data for high-quality avatar generation.
Findings
Outperforms existing solutions in quality and realism.
Supports diverse and high-fidelity avatar generation and editing.
Effectively utilizes large-scale real-world videos for 3D diffusion training.
Abstract
We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D…
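The abstract excerpt is cut off before it describes how partial observations are handled, so the following is only an illustrative sketch of one common way to make a training loss visibility-aware: tokens for body parts that were never observed are excluded from the denoising objective, so they do not pull the model toward blurry or transparent outputs. The function name `visibility_masked_mse`, the `(num_tokens, dim)` token layout, and the boolean `visible` mask are all hypothetical and are not taken from GenLCA.

```python
import numpy as np

def visibility_masked_mse(pred, target, visible):
    """Mean squared error averaged over visible tokens only.

    Hypothetical helper, not the paper's loss.
    pred, target: arrays of shape (num_tokens, dim)
    visible:      boolean array of shape (num_tokens,), True where the
                  token's body region was observed in the source video
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    weights = np.asarray(visible, dtype=np.float64)

    # Per-token squared error, averaged over the feature dimension.
    per_token = ((pred - target) ** 2).mean(axis=1)

    # Average only over visible tokens; unobserved tokens contribute nothing.
    denom = weights.sum()
    if denom == 0:
        return 0.0  # nothing observed, so there is no supervision signal
    return float((per_token * weights).sum() / denom)


# Two tokens: the first is observed, the second is not.
pred = np.array([[1.0, 1.0], [5.0, 5.0]])
target = np.zeros((2, 2))
loss_partial = visibility_masked_mse(pred, target, [True, False])
loss_full = visibility_masked_mse(pred, target, [True, True])
```

In this toy case the large error on the unobserved second token is ignored when the mask marks it as not visible, which is the behavior a visibility-aware objective is meant to provide.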
