Pose Modulated Avatars from Video
Chunjin Song, Bastian Wandt, Helge Rhodin

TL;DR
This paper introduces a novel two-branch neural network that explicitly models pose-dependent frequency variations to improve the realism and detail of neural radiance field-based human avatars reconstructed from sparse video data.
Contribution
It proposes a frequency-aware, explicit modeling approach with a graph neural network and global frequency modulation, outperforming existing methods in detail preservation and generalization.
Findings
Outperforms state-of-the-art methods in detail preservation
Enhances generalization to unseen poses
Effectively models pose-dependent cloth and skin deformation
Abstract
It is now possible to reconstruct dynamic human motion and shape from a sparse set of cameras using Neural Radiance Fields (NeRF) driven by an underlying skeleton. However, a challenge remains to model the deformation of cloth and skin in relation to skeleton pose. Unlike existing avatar models that are learned implicitly or rely on a proxy surface, our approach is motivated by the observation that different poses necessitate unique frequency assignments. Neglecting this distinction yields noisy artifacts in smooth areas or blurs fine-grained texture and shape details in sharp regions. We develop a two-branch neural network that is adaptive and explicit in the frequency domain. The first branch is a graph neural network that models correlations among body parts locally, taking skeleton pose as input. The second branch combines these correlation features to a set of global frequencies…
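The two-branch idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract is truncated before it says how the global frequencies are used, so the 4-joint chain skeleton, the single mean-aggregation message-passing step, the random weights `W_gnn` and `W_freq`, and the choice to rescale standard NeRF octave bands are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (assumptions, not taken from the paper).
NUM_JOINTS = 4   # joints in a simple chain 0-1-2-3
POSE_DIM = 6     # per-joint pose vector, e.g. a 6D rotation
FEAT_DIM = 8     # per-part feature width
NUM_FREQS = 5    # number of global frequency bands

# Skeleton adjacency with self-loops, row-normalised so each joint
# averages over itself and its direct neighbours.
A = np.eye(NUM_JOINTS)
for i in range(NUM_JOINTS - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_norm = A / A.sum(axis=1, keepdims=True)

# Random stand-ins for learned weights.
W_gnn = rng.normal(scale=0.5, size=(POSE_DIM, FEAT_DIM))
W_freq = rng.normal(scale=0.5, size=(FEAT_DIM, NUM_FREQS))


def part_features(pose):
    """Branch 1 (assumed form): one message-passing step over the skeleton."""
    return np.tanh(A_norm @ pose @ W_gnn)  # (NUM_JOINTS, FEAT_DIM)


def global_frequencies(feats):
    """Branch 2 (assumed form): pool part features into global frequencies."""
    pooled = feats.mean(axis=0)            # (FEAT_DIM,)
    base = 2.0 ** np.arange(NUM_FREQS)     # standard NeRF octave bands
    scale = np.exp(pooled @ W_freq)        # positive, pose-dependent factors
    return base * scale                    # (NUM_FREQS,)


def modulated_encoding(points, freqs):
    """Sinusoidal encoding of 3D points using pose-dependent frequencies."""
    angles = points[:, :, None] * freqs[None, None, :]            # (N, 3, K)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(points.shape[0], -1)                       # (N, 6K)


pose_a = rng.normal(size=(NUM_JOINTS, POSE_DIM))
pose_b = rng.normal(size=(NUM_JOINTS, POSE_DIM))
points = rng.uniform(-1.0, 1.0, size=(10, 3))

freqs_a = global_frequencies(part_features(pose_a))
freqs_b = global_frequencies(part_features(pose_b))
enc_a = modulated_encoding(points, freqs_a)
enc_b = modulated_encoding(points, freqs_b)
```

The same query points receive different encodings under `pose_a` and `pose_b` because the frequency bands themselves depend on pose, which is the behaviour the abstract argues a fixed-frequency encoding cannot provide.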
Peer Reviews
Decision: ICLR 2024 poster
In terms of novelty, the problem of animatable neural human avatar creation is a widely studied one. However, this paper proposes the new idea of modulating the frequency bands used in NeRFs to correctly model wrinkles on clothes based on the deforming pose. This idea is conceptually sound and provides an interesting novel insight into the problem of human avatar creation. Modeling the deformation of loose clothing is still a fairly unsolved problem within this domain and hence ad
The main weaknesses are in the results and experiments.
1. Overall, the results in the supplementary videos are quite blurry. The improvement in the texture quality of the wrinkles with the proposed method is also subtle and hard to appreciate. The numerical results in Table 2 of the paper correlate with this and show only marginal improvement in the reported metrics. Do the authors believe these numerical improvements are statistically significant?
2. For
- The problem of avatar modeling has high practical significance
- The frequency modulation approach makes sense to introduce the details in cases where they are required, which can serve as a regularization measure
- The paper is fairly well written
- The comparison lacks modern baselines, such as Vid2Avatar, MonoHuman, and HumanNeRF, which were referenced in the related work.
- Video results have very low FPS, and therefore the temporal smoothness of the proposed approach cannot be evaluated.
- It is unclear whether GNNs are actually needed for this task; e.g., Vid2Avatar uses pose conditioning without GNNs to directly produce the embeddings via an MLP
- No experiments on loose clothing where the method's effectiveness for high-frequ
* The paper is well-written. Technical details are also well-elaborated.
* Based on the evaluation, the overall quality of the results seems to be satisfactory. Additionally, the quantitative results show better performance compared to DANBO.
Motivation:
* I find the motivation in Figure 1 to be unclear. The two poses are quite different - the first contains wrinkles while the second doesn't - but their frequency distribution appears quite similar. Perhaps it would be better to choose a sample with poses that are closer to each other and have more significant differences in frequency.
* The paper mentioned that even when a subject is in a similar pose, the frequency distributions can still be distinct. This seems contradictory to the
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
Methods: Graph Neural Network
