AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text
Jianfeng Zhang, Xuanmeng Zhang, Huichao Zhang, Jun Hao Liew, Chenxu Zhang, Yi Yang, Jiashi Feng

TL;DR
AvatarStudio is a generative framework that creates high-fidelity, animatable 3D human avatars from textual descriptions. It combines coarse NeRF-based generation with an explicit mesh representation and a DensePose-conditioned diffusion model for pose control and high-resolution rendering.
Contribution
The paper introduces AvatarStudio, a new coarse-to-fine generative model that produces explicit textured 3D meshes from text, enabling animation and high-quality rendering, surpassing prior static or less controllable methods.
Findings
Outperforms previous text-to-avatar methods in quality and controllability.
Supports high-resolution rendering and precise pose control.
Enables multimodal avatar animations and style-guided creation.
Abstract
We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, followed by incorporating SMPL-guided articulation into the explicit mesh representation to support avatar animation and high resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By…
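The Score Distillation Sampling (SDS) supervision mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `sds_gradient`, `toy_denoiser`, the weighting `1 - alpha_bar_t`, and the use of a target image as the conditioning signal are all assumptions made for illustration. In the paper the noise predictor is a 2D diffusion model conditioned on DensePose, and the gradient is propagated back to the 3D representation rather than applied to an image directly.

```python
import numpy as np

def sds_gradient(image, cond, denoiser, alpha_bar_t, rng):
    """SDS-style gradient w.r.t. a rendered image (illustrative sketch)."""
    noise = rng.standard_normal(image.shape)
    # Forward diffusion: mix the render with Gaussian noise at level alpha_bar_t.
    noisy = np.sqrt(alpha_bar_t) * image + np.sqrt(1.0 - alpha_bar_t) * noise
    # The (conditioned) denoiser predicts the noise that was added.
    noise_pred = denoiser(noisy, cond, alpha_bar_t)
    # Assumed weighting w(t) = 1 - alpha_bar_t; other choices are possible.
    weight = 1.0 - alpha_bar_t
    return weight * (noise_pred - noise)

def toy_denoiser(noisy, cond, alpha_bar_t):
    """Hypothetical stand-in: an ideal noise predictor for data equal to `cond`."""
    return (noisy - np.sqrt(alpha_bar_t) * cond) / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
target = np.ones((4, 4))   # stands in for what the conditioning asks for
image = np.zeros((4, 4))   # stands in for the current render
for _ in range(200):
    grad = sds_gradient(image, target, toy_denoiser, alpha_bar_t=0.5, rng=rng)
    image = image - 0.1 * grad  # plain gradient descent on the render
```

With this toy denoiser the random noise cancels out, so each step moves `image` a fixed fraction of the way toward `target`. A real diffusion model only approximates the noise, which is why SDS gradients are noisy in practice.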
Peer Reviews
Decision·Submitted to ICLR 2024
* The proposed method generates high-quality human avatars from text input alone, and the generated avatars show clear appearance details. Experiments show that the proposed method outperforms existing pipelines. Moreover, the authors also demonstrate stylized avatar creation given a style image as an additional condition, which is very impressive.
* The authors propose using a DensePose-conditioned ControlNet for SDS supervision. Experiments show that it can achieve precise and sta
* In the Abstract and Introduction, the authors claim that using ControlNet conditioned on DensePose benefits view consistency, but I cannot find any experiments to support this claim. In Figure 6(b), the authors conduct an ablation study to evaluate the effects of different SDS supervision, but the results only show that leveraging skeleton-based ControlNet may suffer from leg pose error. Existing methods like DreamHuman and DreamAvatar are typically based on the original Stable Diffusion or
- The paper is well-written and easy to follow.
- The proposed method achieves SOTA results in text-to-3D avatar generation.
- The paper introduces several well-motivated techniques to improve the generation and animation quality, including using deep marching tetrahedra, DensePose-guided ControlNet, part-based super-resolution, and SDS optimization in both canonical and deformed space. While some of these techniques have been used in other related tasks, demonstrating their effectiveness in this
While one main focus of the paper is to improve animation, the animation still lacks realism. The animation is modeled via pure LBS with SMPL skinning weights and topology, so it cannot generate realistic non-linear cloth deformation and cannot handle loose clothing with other topologies such as skirts (skirts are split, as shown in the animation results on the webpage).
The paper tries to solve an important current problem. The strategy seems to work and is not counterintuitive. The results are also very encouraging.
The paper stitches together existing methods and produces an intuitive pipeline of the kind used for other text-to-3D asset creation. NeRF with SMPL followed by Score Distillation Sampling seems very intuitive, so novelty is a concern. The text prompts are also simple. Creating an avatar takes 2.5 hours, which is too long. While the ablation shows favorable results, I am not sure why the coarse and fine stages are required separately.
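One review above points out that animation relies on pure linear blend skinning (LBS) with SMPL skinning weights. As a rough illustration of what that means, the sketch below implements generic LBS in NumPy. The function name, array shapes, and the two-joint example are assumptions for illustration and do not reflect SMPL's actual joint count or the paper's code.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    """Pose vertices by blending per-joint rigid transforms.

    vertices:   (V, 3) rest-pose positions
    weights:    (V, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) homogeneous joint transforms
    """
    num_vertices = vertices.shape[0]
    # Append 1 to each vertex so 4x4 transforms can be applied.
    homo = np.concatenate([vertices, np.ones((num_vertices, 1))], axis=1)
    # Per-vertex transform = weighted sum of joint transforms, shape (V, 4, 4).
    blended = np.einsum("vj,jab->vab", weights, transforms)
    # Apply each vertex's blended transform to that vertex.
    posed = np.einsum("vab,vb->va", blended, homo)
    return posed[:, :3]

# Toy example: joint 0 stays fixed, joint 1 translates by 2 along y.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
identity = np.eye(4)
shift = np.eye(4)
shift[:3, 3] = [0.0, 2.0, 0.0]
posed = linear_blend_skinning(vertices, weights, np.stack([identity, shift]))
```

Because every vertex moves by a linear blend of rigid joint motions, this formulation has no mechanism for cloth dynamics or for surfaces that should not follow the body's topology, which is consistent with the limitation the reviewer raises about skirts and loose clothing.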
Taxonomy
Topics: Human Motion and Animation · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
