Learning Neural Volumetric Representations of Dynamic Humans in Minutes

Chen Geng; Sida Peng; Zhen Xu; Hujun Bao; Xiaowei Zhou

arXiv:2302.12237·cs.CV·February 27, 2023

TL;DR

This paper introduces a fast method for reconstructing dynamic human scenes as neural volumetric videos from sparse multi-view videos, achieving similar quality to slower methods in just minutes.

Contribution

A novel part-based voxelized human representation and 2D motion parameterization scheme enable rapid learning of neural volumetric videos from sparse views.

Findings

01. Training time is reduced to about 5 minutes on a single GPU.

02. Visual quality is competitive with prior methods.

03. The model is learned about 100 times faster than prior per-scene optimization methods.

Tables

Table 1. Quantitative comparison of our method and baseline methods on the ZJU-MoCap and MonoCap datasets. We use bold text for the best and underlined text for the second best metric value across methods. Our method achieves the fastest training speed and shows competitive rendering results. Note that NHP [34] and PixelNeRF [99] are additionally pretrained for 10 hours. LPIPS$^{*}$ = LPIPS $\times 10^{3}$.

Method	Training Time	ZJU-MoCap PSNR ↑	ZJU-MoCap SSIM ↑	ZJU-MoCap LPIPS* ↓	MonoCap PSNR ↑	MonoCap SSIM ↑	MonoCap LPIPS* ↓
Ours	~5 min	31.01	0.971	38.45	32.61	0.988	16.68
HumanNeRF [92]	10 h	30.66	0.969	33.38	32.68	0.987	15.52
AS [57]	~10 h	30.38	0.975	37.23	32.48	0.988	13.18
AN [56]	~10 h	29.77	0.965	46.89	31.07	0.985	19.47
NB [58]	~10 h	29.03	0.964	42.47	32.36	0.986	16.70
NHP [34]	~1 h fine-tuning	28.25	0.955	64.77	30.51	0.980	27.14
PixelNeRF [99]	~1 h fine-tuning	24.71	0.892	121.86	26.43	0.960	43.98

Table 2(a). Ablation studies on the proposed components.

Variant	PSNR	SSIM	LPIPS*
Ours	32.09	0.982	23.47
Ours w/o Part	30.11	0.974	45.84
Ours w/o UV	31.40	0.979	30.99
Ours w/o Perc	30.55	0.976	44.33

Equations

$\Phi_{\text{LBS}}(\mathbf{x},\mathbf{w},\boldsymbol{\rho})=\Big(\sum_{j=1}^{J}w_{k,j}G_{j}\Big)^{-1}\mathbf{x},$

$\Delta\Phi(u,v,t)=\text{MLP}_{\text{res}}(\psi_{\text{res}}(u,v,t)),$

$\Phi(\mathbf{x},\mathbf{w},u,v,\boldsymbol{\rho},t)=\Phi_{\text{LBS}}(\mathbf{x},\mathbf{w},\boldsymbol{\rho})+\Delta\Phi(u,v,t).$

$\mathcal{V}_{k}=\{v_{i}\mid\arg\max w_{i}\in\Omega_{k}\},$

$\mathcal{E}_{k}=\{(v_{i},v_{j})\mid v_{i}\in\mathcal{V}_{k},\,v_{j}\in\mathcal{V}_{k}\}.$

$(\sigma_{k},\mathbf{z})=\text{MLP}_{\sigma_{k}}(\psi_{k}(\mathbf{x})),$

$\mathbf{c}_{k}=\text{MLP}_{\mathbf{c}_{k}}(\mathbf{z},\mathbf{d},\boldsymbol{\ell}_{t}).$

$(\sigma,\mathbf{c})=(\sigma_{k^{*}},\mathbf{c}_{k^{*}}),\quad\text{where}\quad k^{*}=\arg\max_{k}\sigma_{k}.$

$L_{\text{rgb}}=\|\tilde{I}_{P}-I_{P}\|_{2}+\|F_{\text{vgg}}(\tilde{I}_{P})-F_{\text{vgg}}(I_{P})\|_{2},$

Code & Models

Repositories

zju3dv/instant-nvr (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · 3D Shape Modeling and Analysis

Full text

Learning Neural Volumetric Representations of Dynamic Humans in Minutes

Chen Geng*, Sida Peng*, Zhen Xu*, Hujun Bao, Xiaowei Zhou

* Equal contribution

State Key Laboratory of CAD&CG, Zhejiang University

Abstract

This paper addresses the challenge of quickly reconstructing free-viewpoint videos of dynamic humans from sparse multi-view videos. Some recent works represent the dynamic human as a canonical neural radiance field (NeRF) and a motion field, which are learned from videos through differentiable rendering. But the per-scene optimization generally requires hours. Other generalizable NeRF models leverage learned prior from datasets and reduce the optimization time by only finetuning on new scenes at the cost of visual fidelity. In this paper, we propose a novel method for learning neural volumetric videos of dynamic humans from sparse view videos in minutes with competitive visual quality. Specifically, we define a novel part-based voxelized human representation to better distribute the representational power of the network to different human parts. Furthermore, we propose a novel 2D motion parameterization scheme to increase the convergence rate of deformation field learning. Experiments demonstrate that our model can be learned 100 times faster than prior per-scene optimization methods while being competitive in the rendering quality. Training our model on a $512\times 512$ video with 100 frames typically takes about 5 minutes on a single RTX 3090 GPU. The code will be released on our project page: https://zju3dv.github.io/instant_nvr.

1 Introduction

Creating volumetric videos of human performers has many applications, such as immersive telepresence, video games, and movie production. Recently, some methods [58, 92] have shown that high-quality volumetric videos can be recovered from sparse multi-view videos by representing dynamic humans with neural scene representations. However, they typically require more than 10 hours of training on a single GPU. The expensive time and computational costs limit the large-scale application of volumetric videos. Generalizable methods [99, 34] utilize learned prior from datasets of dynamic humans to reduce the training time by only finetuning on novel human performers. These techniques could increase the optimization speed by a factor of 2-5 at the cost of some visual fidelity.

To speed up the process of optimizing a neural representation for view synthesis of dynamic humans, we analyze the structural prior of the human body and motion and propose a novel dynamic human representation that achieves 100x speedup during optimization while maintaining competitive visual fidelity. Specifically, to model a dynamic human, we first transform world-space points to a canonical space using a novel motion parameterization scheme and inverse linear blend skinning (LBS) [35]. Then, the color and density of these points are estimated using the canonical human model.

The innovation of our proposed representation is two-fold. First, we observe that different human parts have different shape and texture complexity. For example, the face of a human performer typically exhibits more complexity than a flatly textured torso region, thus requiring more representational power to depict. Motivated by this, our method decomposes the canonical human body into multiple parts and represents the human body with a structured set of voxelized NeRF [47] networks to bring the convergence rate of these different parts to the same level. In contrast to a single-resolution representation, the part-based body model utilizes the human body prior to represent the human shape and texture efficiently, heuristically distributing varying amounts of representational power to human parts with spatially varying complexity.

Second, we notice that human motion typically occurs around a surface instead of in a volume, that is, near-surface points that are projected to neighboring regions on a parametric human model have similar motion behavior. Thus we propose a novel motion parameterization technique that models the 3D human deformation in a 2D domain. This idea is similar to the displacement map and bump map [15, 14] in traditional Computer Graphics to represent detailed deformation on a 2D texture domain. We extend the technique of displacement map [15, 14] to represent human motions by restricting the originally 3D deformation field [59, 40, 56] onto the 2D surface of a parametric human model, such as SMPL [43]. This technique significantly increases the convergence rate of a deformation field by reducing the dimensionality at which the neural representation needs to model the motion.

Experiments demonstrate that our method significantly accelerates the optimization of neural human representations while being competitive with recent human modeling methods on the rendering quality. As shown in Figure 1, our model can be trained in around 5 minutes to produce a volumetric video of a dynamic human from a 100-frame monocular video of $512\times 512$ resolution on an RTX 3090 GPU.

To summarize, our key contributions are:

• A novel part-based voxelized human representation for more efficient human body modeling.

• A 2D motion parameterization scheme for more efficient deformation field modeling.

• A 100x speedup in optimization compared to previous neural human representations while maintaining competitive rendering quality.

2 Related work

Implicit neural representation and rendering.

There have been many 3D scene representations, such as multi-view images [77, 60], textured meshes [85, 38], point clouds [61, 1], and voxels [75, 42]. Recently, some methods [76, 11, 96, 44, 51, 41, 103] propose implicit neural representations to represent scenes, which use MLP networks to predict scene properties for any point in 3D space, such as occupancy [44, 67], signed distance [51, 41], and semantics [103, 23]. This enables them to describe continuous and high-resolution 3D scenes. To perform novel view synthesis, neural radiance field (NeRF) [47] models scenes as implicit fields of density and color. NeRF is optimized from images with volume rendering techniques, which produces impressive image synthesis results. Many works improve NeRF in various aspects, including rendering quality [3, 4], rendering speed [63, 39, 98, 27, 22], scene scale [81, 88, 93, 64], and reconstruction quality [95, 89, 50]. Some methods [52, 59, 18, 37] extend NeRF to dynamic scenes.

Human modeling.

Reconstructing high-quality 3D human models is essential for synthesizing free-viewpoint videos of human performers. Traditional methods leverage multi-view stereo techniques [71, 72, 24] or depth fusion [79, 13, 17] to reconstruct human geometries, which require complicated hardware, such as dense camera arrays or depth sensors. To reduce the requirement of the capture equipment, some methods [67, 68, 2] train networks to learn human priors from datasets containing a large amount of 3D ground-truth human models, enabling them to infer human geometry and texture from even a single image. However, due to the limited diversity of training data, these methods do not generalize well to humans under complex poses. Recently, some methods [69, 9, 86, 46] model the shapes of dynamic humans as implicit neural representations and attempt to optimize them from human scans. Another line of works [58, 34, 94, 102, 101, 92, 30, 56, 40, 100, 36, 91, 104, 65, 28] exploits dynamic implicit neural representations and differentiable renderers to reconstruct 3D human models from videos. To represent dynamic humans, Neural Actor [40] augments the neural radiance field with the linear blend skinning model [35]. It additionally adopts a residual deformation field to better predict human motions. To overcome the inaccuracy of input human poses, [78, 92] optimize the parameters of human poses jointly with the human representations during training. These methods typically require a lengthy training process to produce high-quality human models. In contrast, we introduce a part-based voxelized human representation to model the canonical human body, which significantly accelerates the optimization process. Although [16, 46] have proposed part-based implicit functions, they focus on human shape modeling and do not show that the part-based representation can be used to reduce the training time.

Accelerating the optimization of neural representations.

Many differentiable rendering-based methods [47, 39, 3] optimize a separate neural representation for each scene. The optimization process generally takes several hours on a modern GPU, which is time-consuming and costly to scale. Inspired by multi-view stereo matching [72], some methods [90, 8, 12, 87, 73, 99, 45] train a network on multi-view datasets to learn to infer radiance fields from input images. This enables them to quickly fine-tune neural representations to unseen scenes. [62, 19] leverage the auto-decoder [51] to capture the scene priors for efficient fine-tuning. [5, 82] utilize meta-learning techniques [49, 20] to initialize network parameters, thereby improving the training speed. Some methods [70, 97, 7, 80, 55, 10] attempt to design scene representations that support efficient training. [83, 74, 48, 21] augment the approximation ability of networks by designing encoding techniques. Multiresolution hash encoding [48] defines multiresolution feature vector arrays for a scene and uses the hash technique [84] to assign each input coordinate a feature vector as the encoded input, which significantly improves the training speed.

3 Method

This paper aims to quickly create a 3D video from a sparse multi-view video that captures a dynamic human. Following Neural Actor [40], we assume that the cameras are calibrated, and the human pose and foreground human mask of each image are provided.

In this section, we build a dynamic human model that is comprised of a part-based voxelized human representation and a dimensionality reduction motion parameterization scheme (Section 3.1). Then, Section 3.2 discusses how to efficiently optimize the proposed representation. Finally, we provide implementation details in Section 3.3.

3.1 Proposed human representation

As shown in Figure 2, our dynamic human representation consists of a motion parameterization field and a part-based voxelized human model. (a) For a query point $\mathbf{x}$ , the motion parameterization field first transforms it to the canonical space correspondence $\mathbf{x}^{can}$ using the inverse LBS [35] algorithm and by parameterizing the 3D points onto 2D UV coordinates to predict the residual deformation $\Delta\mathbf{x}$ . (b) Then, $\mathbf{x}^{can}$ is fed into the part-based voxelized human model in canonical space to predict and aggregate the density and color $(\sigma,\mathbf{c})$ , where the canonical human body is decomposed into $K$ parts, each of which is represented using an MHE-augmented [48] NeRF network.

Motion parameterization on 2D surface domain.

To regress the canonical correspondence $\mathbf{x}^{can}$ of a query point $\mathbf{x}$ , we first find its nearest surface point $\mathbf{p}$ on the posed SMPL mesh. Using the strategy in [40], the blend weight $\mathbf{w}$ and UV coordinate $(u,v)$ of surface point $\mathbf{p}$ are obtained from the SMPL model.

Given the blend weight $\mathbf{w}$ and UV coordinate $(u,v)$ , the motion field maps the query point $\mathbf{x}$ to the canonical space correspondence $\mathbf{x}^{can}$ . The motion field is comprised of an inverse LBS module [35] and a residual deformation module. Given a query point $\mathbf{x}$ and blend weight $\mathbf{w}$ , we use the inverse LBS module to transform it to the unposed space, which is defined as:

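$$\Phi_{\text{LBS}}(\mathbf{x},\mathbf{w},\boldsymbol{\rho})=\Big(\sum_{j=1}^{J}w_{k,j}G_{j}\Big)^{-1}\mathbf{x},$$

where $\boldsymbol{\rho}$ denotes the human pose, $\{G_{j}\}_{j=1}^{J}$ are transformation matrices derived from $\boldsymbol{\rho}$ [43], and $w_{k,j}$ is the $j$-th entry of the blend weight (the part index $k$ is introduced with the part-based representation below). The detailed derivation of the inverse LBS algorithm can be found in the supplementary material.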
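The inverse LBS step can be illustrated with a minimal sketch. It assumes that each bone transform is a 4x4 homogeneous matrix stored as nested lists and that the blend weights sum to one. The helper names (`blend_transforms`, `invert_4x4`, `inverse_lbs`, `translation`) and the toy bones are hypothetical and are not taken from the released code.

```python
# Minimal sketch of inverse linear blend skinning (LBS).
# Assumption: bone transforms are 4x4 homogeneous matrices (nested lists)
# and the blend weights of a point sum to one. All names are illustrative.

def blend_transforms(weights, transforms):
    """Return the weighted sum  sum_j w_j * G_j  of 4x4 matrices."""
    blended = [[0.0] * 4 for _ in range(4)]
    for w, G in zip(weights, transforms):
        for r in range(4):
            for c in range(4):
                blended[r][c] += w * G[r][c]
    return blended


def invert_4x4(M):
    """Invert a 4x4 matrix by Gauss-Jordan elimination with partial pivoting."""
    n = 4
    aug = [[float(v) for v in M[r]] + [1.0 if r == c else 0.0 for c in range(n)]
           for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("blended transform is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def inverse_lbs(x, weights, transforms):
    """Map a posed-space point x = (x, y, z) back to the unposed space."""
    inv = invert_4x4(blend_transforms(weights, transforms))
    xh = [x[0], x[1], x[2], 1.0]
    out = [sum(inv[r][c] * xh[c] for c in range(4)) for r in range(4)]
    return tuple(v / out[3] for v in out[:3])


def translation(tx, ty, tz):
    """Toy bone transform: a pure translation."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


# Two toy bones translate by +2 and +4 along x. Blended half and half they
# shift a point by +3 along x, and inverse LBS undoes that shift.
bones = [translation(2, 0, 0), translation(4, 0, 0)]
unposed = inverse_lbs((3.5, 1.0, -1.0), [0.5, 0.5], bones)
```

In this toy case `unposed` is the point shifted back by 3 along x. A practical implementation would batch this over all sampled points on the GPU rather than loop in Python.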

The transformed point $\Phi_{\text{LBS}}(\mathbf{x},\mathbf{w},\boldsymbol{\rho})$ is then deformed to the human surface using the residual deformation module. Specifically, the current time $t$ is first concatenated with the UV coordinate $(u,v)$ to serve as the parameterization of the query point $\mathbf{x}$ at frame $t$. This motion parameterization is inspired by the displacement map and bump map techniques [15, 14] in traditional computer graphics pipelines. It essentially reduces the dimensionality of a 4D space-time sequence down to the 3D surface-time domain by utilizing the human deformation prior. Then, we apply the multiresolution hash encoding [48] $\psi_{\text{res}}$ to $(u,v,t)$ and forward the encoded input through a network $\text{MLP}_{\text{res}}$ to regress the residual deformation $\Delta\Phi$. The residual and the full human motion at frame $t$ are defined as:

$$\Delta\Phi(u,v,t)=\text{MLP}_{\text{res}}(\psi_{\text{res}}(u,v,t)),$$

$$\Phi(\mathbf{x},\mathbf{w},u,v,\boldsymbol{\rho},t)=\Phi_{\text{LBS}}(\mathbf{x},\mathbf{w},\boldsymbol{\rho})+\Delta\Phi(u,v,t).$$

Two main observations inspired us to use the $(u,v,t)$ motion parameterization. First, we observe that typical human motion happens at the surface level instead of the volumetric level. Near-surface points that share similar UV coordinates on the parametric model show similar motions. Utilizing this prior with a surface parameterization [15, 14], we can reduce the required 4D volumetric motion to the 3D surface-time domain, greatly decreasing the amount of information the deformation network has to learn. Based on a similar idea, [6] diffuses the surface motion to the full 3D space. Second, a naive $(x,y,z,t)$ encoding would introduce quartic memory overhead on an explicitly defined voxel structure, which is intractable in practice. Instead, by parameterizing the motion to $(u,v,t)$, we can reduce the memory footprint to a more practical cubic level. Experiments demonstrate that the motion parameterization scheme effectively reduces the dimensionality of the deformation field, thus greatly increasing the convergence rate of the human model.
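To make the quartic-versus-cubic memory argument concrete, the sketch below counts the cells of a dense grid. The per-axis resolution of 128 is a hypothetical value chosen for illustration, not a setting from the paper, and the time axis is assumed to use the same resolution as the spatial axes.

```python
def dense_grid_entries(resolution, dims):
    """Number of cells in a dense grid with `resolution` cells per axis."""
    return resolution ** dims


R = 128  # hypothetical per-axis resolution, not a value from the paper
xyzt_entries = dense_grid_entries(R, 4)  # (x, y, z, t): quartic growth
uvt_entries = dense_grid_entries(R, 3)   # (u, v, t): cubic growth
ratio = xyzt_entries // uvt_entries      # saving factor, equal to R
```

Under these assumptions the $(u,v,t)$ grid is smaller by a factor equal to the per-axis resolution, and the gap widens as the resolution grows.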

Part-based voxelized human representation.

Muller et al. [48] propose the multi-resolution hash encoding (MHE) to improve the approximation ability and training speed of implicit neural representations. MHE is defined on an explicit set of voxel grids of different resolutions. Given an input coordinate, it applies the hash encoding on each level and queries the corresponding voxel grid to trilinearly interpolate the feature of the input point for this level. Then, the concatenated multi-resolution feature is fed into a small MLP network to predict the target value. Note that [48] concatenates the features of multi-resolution hash encoding for the same point to mitigate the effect of hash collision, while our part-based voxelized human representation introduces spatially varying resolution to efficiently encode human parts with different complexity.
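The encoding described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the paper: every level is hashed (dense indexing for coarse levels is omitted), the features are randomly initialised rather than trained, and the class and function names are hypothetical. The hashing primes are the ones reported in [48].

```python
import math
import random

PRIMES = (1, 2654435761, 805459861)  # spatial-hash primes reported in [48]


class HashGridLevel:
    """One resolution level: a hash table of small feature vectors."""

    def __init__(self, resolution, table_size, feature_dim, rng):
        self.resolution = resolution
        self.table_size = table_size
        self.table = [[rng.uniform(-1e-4, 1e-4) for _ in range(feature_dim)]
                      for _ in range(table_size)]

    def _slot(self, ix, iy, iz):
        # Hash an integer grid vertex into a table slot.
        h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
        return h % self.table_size

    def encode(self, point):
        """Trilinearly interpolate the features of the 8 surrounding vertices."""
        scaled = [p * self.resolution for p in point]
        base = [int(math.floor(s)) for s in scaled]
        frac = [s - b for s, b in zip(scaled, base)]
        feature = [0.0] * len(self.table[0])
        for corner in range(8):
            offs = [(corner >> axis) & 1 for axis in range(3)]
            weight = 1.0
            for f, o in zip(frac, offs):
                weight *= f if o else (1.0 - f)
            entry = self.table[self._slot(*(b + o for b, o in zip(base, offs)))]
            feature = [acc + weight * e for acc, e in zip(feature, entry)]
        return feature


def build_levels(num_levels, base_res, max_res, table_size, feature_dim, seed=0):
    """Create levels whose resolutions grow geometrically from base to max."""
    rng = random.Random(seed)
    growth = (max_res / base_res) ** (1.0 / max(num_levels - 1, 1))
    return [HashGridLevel(int(base_res * growth ** i), table_size, feature_dim, rng)
            for i in range(num_levels)]


def encode(levels, point):
    """Concatenate the interpolated features of every level."""
    out = []
    for level in levels:
        out.extend(level.encode(point))
    return out


# Hypothetical settings: 4 levels, 2 features per level, 2**10 table entries.
levels = build_levels(num_levels=4, base_res=4, max_res=32,
                      table_size=2 ** 10, feature_dim=2)
code = encode(levels, (0.3, 0.7, 0.1))  # query point inside the unit cube
```

In the part-based representation, each human part would own its own set of levels with its own table size, which is how more parameters can be assigned to more complex parts.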

In contrast to [56, 40], which use a single neural radiance field (NeRF) to represent the canonical human model, we decompose the human body into multiple parts with different complexity and adopt a structured set of MHE-augmented NeRF networks with varying resolutions as the body representation. Specifically, we manually divide the human body into multiple parts based on a parametric human model (such as SMPL [43]), as shown in Figure 2. Note that other parametric human models [66, 54] can also be used in our method. We use the blend weights defined in the SMPL model [43] to decompose the SMPL template mesh $\mathcal{M}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ represents the vertices and $\mathcal{E}$ represents the edges. Let the $i$-th vertex $v_{i}$ have blend weight $w_{i}$, and for each part $k$ we define $\Omega_{k}$ as the set of bones that belong to this part. The detailed setting of $\Omega_{k}$ can be found in the supplementary material. The mesh of the $k$-th part is defined as $\mathcal{M}_{k}=(\mathcal{V}_{k},\mathcal{E}_{k})$, where:

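$$\mathcal{V}_{k}=\{v_{i}\mid\arg\max w_{i}\in\Omega_{k}\},$$

$$\mathcal{E}_{k}=\{(v_{i},v_{j})\mid v_{i}\in\mathcal{V}_{k},\,v_{j}\in\mathcal{V}_{k}\}.$$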
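The part decomposition can be sketched as follows: a vertex joins part $k$ when the bone with its largest blend weight lies in $\Omega_{k}$, and an edge is kept when both endpoints fall in the same part. The sketch assumes that only edges of the original template mesh are considered; the function name, the toy weights, and the part names are hypothetical.

```python
def split_mesh_into_parts(blend_weights, edges, part_bones):
    """
    blend_weights: per-vertex lists of weights, one weight per bone.
    edges: (i, j) vertex index pairs of the template mesh.
    part_bones: dict mapping a part name to its set of bone indices (Omega_k).
    Returns a dict mapping each part name to (vertex index set, edge list).
    """
    # Index of the bone with the largest blend weight, for every vertex.
    dominant = [max(range(len(w)), key=w.__getitem__) for w in blend_weights]
    parts = {}
    for name, bones in part_bones.items():
        verts = {i for i, bone in enumerate(dominant) if bone in bones}
        part_edges = [(i, j) for i, j in edges if i in verts and j in verts]
        parts[name] = (verts, part_edges)
    return parts


# Toy mesh with 4 vertices and 3 bones (values are illustrative only).
weights = [[0.9, 0.1, 0.0], [0.6, 0.4, 0.0], [0.1, 0.2, 0.7], [0.0, 0.3, 0.7]]
edges = [(0, 1), (1, 2), (2, 3)]
parts = split_mesh_into_parts(weights, edges, {"torso": {0, 1}, "head": {2}})
```

Edges that straddle two parts, such as `(1, 2)` in the toy mesh, are dropped from both part meshes.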

To regress the density and color of a query point $\mathbf{x}$ , we first find the nearest surface point $\mathbf{p}_{k}$ on each human part $\mathcal{M}_{k}$ of the posed SMPL mesh. Using the strategy in [40], the blend weight $\mathbf{w}_{k}$ and UV coordinate $(u_{k},v_{k})$ of surface point $\mathbf{p}_{k}$ are obtained from the SMPL model. With $(u_{k},v_{k},t)$ , we use the motion parameterization scheme defined in Section 3.1 to transform the query point to the space of the $k$ -th human part. We predefine the parameters of the multiresolution hash encoding function $\psi_{k}$ for the $k$ -th part. Given the transformed point, we first apply the multiresolution hash encoding to the transformed point and then feed the encoded point $\psi_{k}(\mathbf{x})$ to a small NeRF network to predict the density and color. The density network $\text{MLP}_{\sigma_{k}}$ is defined as:

$$(\sigma_{k},\mathbf{z})=\text{MLP}_{\sigma_{k}}(\psi_{k}(\mathbf{x})),$$

where $\sigma_{k}$ means the density and $\mathbf{z}$ is a feature vector. Then, we take the feature vector $\mathbf{z}$ and the viewing direction $\mathbf{d}$ as the input for the color regression. Similar to [52], a latent embedding $\boldsymbol{\ell}_{t}$ for each video frame $t$ is introduced to model the temporally-varying appearance. The color network is defined as:

$$\mathbf{c}_{k}=\text{MLP}_{\mathbf{c}_{k}}(\mathbf{z},\mathbf{d},\boldsymbol{\ell}_{t}).$$

Finally, we have $K$ predictions $\{(\sigma_{k},\mathbf{c}_{k})\}_{k=1}^{K}$. The density and color $(\sigma,\mathbf{c})$ of the query point $\mathbf{x}$ are calculated as:

$$(\sigma,\mathbf{c})=(\sigma_{k^{*}},\mathbf{c}_{k^{*}}),\quad\text{where}\quad k^{*}=\arg\max_{k}\sigma_{k}.$$
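The aggregation rule keeps the prediction of the part with the highest density. A minimal sketch, with a hypothetical function name and toy values:

```python
def aggregate_parts(predictions):
    """
    predictions: list of (density, color) pairs, one per human part.
    Returns the pair with the largest density and the index of its part.
    """
    best = max(range(len(predictions)), key=lambda k: predictions[k][0])
    return predictions[best], best


# Toy predictions from K = 3 parts (values are illustrative only).
per_part = [(0.2, (1.0, 0.0, 0.0)), (3.5, (0.0, 1.0, 0.0)), (1.0, (0.0, 0.0, 1.0))]
(sigma, color), k_star = aggregate_parts(per_part)
```

Here the second part wins because its density of 3.5 is the largest, so its color is used for the query point.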

In contrast to [56, 40] that represent the body with a single NeRF network, our part-based voxelized human representation can assign different densities of model parameters to different human parts with different complexity, thereby enabling us to efficiently distribute the representational power of the network. Experiments show that our proposed body representation significantly improves the rate of convergence.

3.2 Training

The proposed representation can be learned from sparse multi-view videos by minimizing the difference between rendered and observed images. The volume rendering technique [47, 32] is used to synthesize the pixel color. Given a pixel at frame $t$, we emit a camera ray and sample points along the ray. Then, the sampled points are fed into the dynamic human representation to predict their colors and densities, which are finally accumulated into the pixel color. During each training iteration, we randomly sample an image patch from the input image and compute the Mean Squared Error (MSE) loss and perceptual loss [31] to train the model parameters, which are defined as:

$$L_{\text{rgb}}=\|\tilde{I}_{P}-I_{P}\|_{2}+\|F_{\text{vgg}}(\tilde{I}_{P})-F_{\text{vgg}}(I_{P})\|_{2},$$

where $\tilde{I}_{P}$ is the rendered image patch, $I_{P}$ is the ground truth image patch, and $F_{\text{vgg}}$ extracts image features using the pretrained VGG network [31]. Ablation study demonstrates that perceptual loss is essential for rendering quality and fast training.
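The accumulation of sampled colors and densities into a pixel color can be sketched with the standard NeRF quadrature [47], where each sample contributes its color weighted by its opacity and by the transmittance of the samples in front of it. The function name and the toy ray are hypothetical, and the sketch renders a single ray rather than an image patch.

```python
import math


def render_ray(densities, colors, deltas):
    """
    Accumulate samples along one ray, ordered from near to far.
    densities: per-sample density sigma_i.
    colors: per-sample (r, g, b) tuples.
    deltas: distance between consecutive samples.
    Returns the pixel color and the accumulated opacity.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of the sample
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha            # light left for later samples
    return pixel, 1.0 - transmittance


# Toy ray: an empty sample followed by a nearly opaque red sample.
pixel, opacity = render_ray([0.0, 1e9],
                            [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)],
                            [0.1, 0.1])
```

The empty sample contributes nothing, so the pixel takes the color of the opaque sample behind it. The perceptual term of the loss additionally requires a pretrained VGG network and is not shown here.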

In addition to the image rendering loss, two regularization techniques are used to facilitate the learning of the neural representations. First, we apply the regularizer in [4] to concentrate densities on the human surface. Second, the residual deformation field is regularized to be small and smooth. More details of the loss terms are described in the supplementary material.

3.3 Implementation details

We adopt the Adam optimizer [33] with a learning rate of $5\times 10^{-4}$. We train our model on an RTX 3090 GPU, which takes around 5 minutes to produce photorealistic results. Our method is implemented purely with the PyTorch framework [53] to demonstrate the effectiveness of our representation. It also enables us to fairly compare with baseline methods [58, 56, 34] implemented in PyTorch. The details of the network architectures and hyper-parameters are presented in the supplementary material.

4 Experiments

4.1 Datasets

ZJU-MoCap [58] dataset is a widely-used benchmark for human modeling from videos. It provides foreground human masks and SMPL parameters. Following [92], we select 6 human subjects (377, 386, 387, 392, 393, 394) from this dataset to conduct our experiments. One camera is used for training, and the remaining cameras are used for evaluation. For each human subject, we select 1 frame every 5 frames and collect 100 frames for training. Please refer to the supplementary material for more detailed experiment settings of all characters.

MonoCap dataset contains four multi-view videos collected by [57] from the DeepCap dataset [26] and the DynaCap dataset [25]. It provides camera parameters and human masks. [57] additionally estimate the SMPL parameters for each image. We adopt the setting of training and test camera views in [57]. For each subject, 100 frames are selected for training, and we sample 1 frame every 5 frames. Detailed configurations of all sequences are described in the supplementary material.

4.2 Comparison with the state-of-the-art methods

Baselines.

We compare our method with subject-specific optimization methods [58, 56, 92, 57] and generalizable methods [99, 34]. All the baselines are implemented in pure PyTorch [53] for a fair comparison. Here we list only the average metric values of all selected characters on each dataset due to the size limit. We provide more detailed qualitative and quantitative comparisons in the supplementary material.

(1) Subject-specific optimization methods. Neural Body (NB) [58] anchors a set of latent codes to the SMPL mesh and regresses the radiance field from the posed latent codes. Animatable NeRF (AN) [56] deforms the canonical NeRF with the skeleton-driven framework and models non-rigid deformations by learning blend weight fields. AS [57] extends AN [56] with a signed distance field and a pose-dependent deformation field to better model the residual deformation and geometric details of dynamic humans. HumanNeRF (HN) [92] optimizes a volumetric representation of the person in a canonical space along with the estimated human pose.

(2) Generalizable methods. PixelNeRF [99] trains a network to infer the radiance field from a single image. Neural Human Performer (NHP) [34] anchors image features to vertices of the SMPL mesh and aggregates temporal features using a transformer, which are decoded into a human model. For each evaluated subject (e.g. one subject of MonoCap), we first pretrain the network on the other dataset (e.g. ZJU-MoCap) and then finetune it on the evaluated subject until it converges.

Results on the ZJU-MoCap dataset.

Table 1 compares our method with NB [58], AN [56], PixelNeRF [99], NHP [34], HN [92] and AS [57] on novel view synthesis. Our proposed representation can be optimized within around 5 minutes to produce photorealistic rendering results, while [58, 56, 92, 57] require around 10 hours to finish training and [99, 34] require 10 hours of pretraining and 1 hour of fine-tuning. [92, 57] exhibit better results than [58, 56]. However, they all require a lengthy optimization process and fail to produce reasonable renderings in only 5 minutes because their models have not converged yet. Generalizable methods [99, 34] fail to render humans with reasonable shapes under the monocular setting. Our method achieves comparable results on all three evaluated metrics even when trained for only minutes, which shows the effectiveness of our novel human representation. We present qualitative results of our method and baselines in Figure 3.
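As a rough check of the reported speedup, the approximate training times in Table 1 can be compared directly. Since both times are approximate, the ratios below are only indicative:

```latex
\frac{10\ \text{h}}{5\ \text{min}} = \frac{600\ \text{min}}{5\ \text{min}} = 120,
\qquad
\frac{1\ \text{h}}{5\ \text{min}} = \frac{60\ \text{min}}{5\ \text{min}} = 12.
```

The first ratio is consistent with the roughly 100x speedup over per-scene optimization methods. The second ratio covers only the fine-tuning stage of the generalizable methods and excludes their 10 hours of pretraining.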

Results on the MonoCap dataset.

Table 1 summarizes the quantitative comparison between our method and other baselines on the MonoCap dataset. Our model again achieves competitive visual quality while only requiring 1/100 of the training time due to our efficient part-based voxelized human representation and effective motion parameterization scheme. Figure 3 indicates that our method can produce better appearance details than [58, 56, 57]. Although [58, 56] have shown impressive rendering results given 4-view videos, they do not perform well on monocular inputs. [58] implicitly aggregates the temporal information using structured latent codes, which may not work well on monocular videos with complex human motions. [56] uses a learnable blend weight field to model human motion, which has a higher dimension and could be hard to converge well given single-view supervision. [92, 57] demonstrate similar visual quality and show that representing the human motion with LBS model and residual deformation works particularly well, but their models require 100x more time to optimize.

4.3 Ablation Studies

We perform ablation studies on the 377 sequence of the ZJU-MoCap dataset to analyze how the proposed components affect the performance and training speed of our method.

4.3.1 Ablation Studies on Proposed Components.

Table 2(a) lists the quantitative results of the ablation studies on our proposed components. All models are trained for 5 minutes. "Ours w/o Part" represents the canonical human body with a single MHE-augmented NeRF [48] network, which drastically degrades the rate of convergence. To keep the comparison fair, this variant has a similar number of parameters (302M) to ours (286M). However, this change leads to a significant PSNR decrease of $1.98\text{dB}$ because it treats all parts as equally complex and thus wastes representational power. In "Ours w/o UV", the residual deformation network $\text{MLP}_{\text{res}}$ takes hash encoded $(\mathbf{x},t)$ as input, which degrades the PSNR from $32.09\text{dB}$ to $31.40\text{dB}$ with the same training time because of severe hash collision and limited resolution. "Ours w/o Perc" does not adopt the perceptual loss during training, which in turn increases the LPIPS distance. This comparison illustrates the importance of the perceptual loss for visual fidelity. Figure 4 and Figure 5 provide a more intuitive qualitative comparison among variants of the proposed method.

Analysis of the part-based voxelized human representation.

MHE [48] defines a multiresolution hash table of trainable features to embed input coordinates to a high-dimension space. We find that simply increasing the size of the hash table does not always result in better performance with the same training time, because a bigger hash table leads to higher memory consumption and increases the time of each training iteration. The proposed part-based voxelized human representation allows us to adapt the hash table size according to the complexity of the human part, allowing us to efficiently represent the human body. Table 2(a) demonstrates the effectiveness of the part-based voxelized human representation. To further validate this representation, we design two variants that use the hash tables of size $2^{15}$ and $2^{20}$ in all human parts respectively. Table 2(c) summarizes the ablation studies, indicating that varying the model parameters in human parts improves the performance.

Analysis of the motion parameterization scheme.

Table 2(a) shows that our model works better when the residual deformation network $\text{MLP}_{\text{res}}$ takes parameterized 3D surface-time $(u,v,t)$ coordinates as input, compared with taking 4D space-time $(\mathbf{x},t)$ as input. Note that $(u,v,t)$ makes MHE much more memory efficient than $(\mathbf{x},t)$ . To further validate the effectiveness of our motion parameterization, we additionally design three variants of the $\text{MLP}_{\text{res}}$ input. (1) PE: positional encoded $(\mathbf{x},t)$ . (2) XYZ-Code: hash encoded $\mathbf{x}$ and a per-frame learnable latent code [57]. (3) XYZ-Pose: hash encoded $\mathbf{x}$ and the pose parameter [40, 92]. When taking the positional encoded $(\mathbf{x},t)$ as input, we use a larger network for $\text{MLP}_{\text{res}}$ . The results in Table 2(b) indicate that hash encoded $(u,v,t)$ achieves the best performance.

Analysis of Robustness.

To evaluate the robustness of the proposed system, we measure the time needed to reach an evaluation PSNR of 30 dB over five runs on the “377” sequence. The resulting training time has a mean of $76.00s$ and a standard deviation of $13.56s$, showing the stability of the proposed method.
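A measurement of this kind can be sketched as follows. The helper names and the log format are our own assumptions, the two example logs contain made-up placeholder numbers rather than measurements from the paper, and the text does not state whether the sample or the population standard deviation was reported; the sample version is used here.

```python
import statistics


def time_to_target(log, target_psnr=30.0):
    """Return the first elapsed time (s) at which evaluation PSNR reaches the target.

    `log` is an iterable of (elapsed_seconds, psnr_db) pairs in time order.
    Returns None if the target is never reached.
    """
    for elapsed, psnr in log:
        if psnr >= target_psnr:
            return elapsed
    return None


def summarize(times):
    """Return the mean and the sample standard deviation of a list of times."""
    return statistics.mean(times), statistics.stdev(times)


# Placeholder logs for two runs; these numbers are invented for illustration only.
run_a = [(20.0, 27.5), (40.0, 29.1), (60.0, 30.2), (80.0, 30.9)]
run_b = [(20.0, 27.0), (40.0, 28.8), (60.0, 29.7), (80.0, 30.4)]

times = [time_to_target(run_a), time_to_target(run_b)]
mean_s, std_s = summarize(times)
print(f"time to 30 dB: mean {mean_s:.2f}s, std {std_s:.2f}s")
```

The granularity of such a measurement is bounded by how often the evaluation PSNR is logged, so a finer evaluation interval gives a more precise time-to-target.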

5 Limitations

Although our method can quickly reconstruct high-quality human models from videos, some challenges remain. First, it is desirable to infer human models from sparse multi-view images. Improving generalizable methods like [99, 34] with our proposed components may be a direction to solve this problem. Second, our method currently relies on accurate SMPL parameters, which can be difficult to obtain in in-the-wild settings. It would be interesting to use the techniques in [78, 92] to optimize the human pose parameters jointly with the training of human avatars. Third, we can only reconstruct foreground dynamic humans, while dynamic scenes typically include both foreground and background entities. It may be possible to combine our method with ST-NeRF [29] to quickly reconstruct dynamic scenes containing foreground and background objects.

6 Conclusion

We introduced a novel dynamic human representation that can be quickly optimized from videos and used for generating free-viewpoint videos of the human performer. This representation consists of a part-based voxelized human model in the canonical space and a motion parameterization scheme that transforms points from the world space to the canonical space. The part-based voxelized human model decomposes the human body into multiple parts and represents each part with an MHE-augmented NeRF network, which efficiently distributes the network's representational power and significantly improves the training speed. When predicting the motion of a query point, the motion field reparameterizes the point coordinate as a 2D surface-level UV coordinate, which effectively reduces the dimensionality of the motion the network is required to model and thereby boosts the convergence rate. Experiments demonstrate that our proposed representation can be optimized in 1/100 of the time of previous methods while still maintaining competitive rendering quality. We show that, given a 100-frame monocular video of $512\times 512$ resolution, our method can produce photorealistic free-viewpoint videos in minutes on an RTX 3090 GPU.

Bibliography

[1] Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. In ECCV, 2020.
[2] Thiemo Alldieck, Mihai Zanfir, and Cristian Sminchisescu. Photorealistic monocular 3d reconstruction of humans wearing clothing. arXiv preprint arXiv:2204.08906, 2022.
[3] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
[4] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. arXiv preprint arXiv:2111.12077, 2021.
[5] Alexander Bergman, Petr Kellnhofer, and Gordon Wetzstein. Fast training of neural lumigraph representations using meta learning. Advances in Neural Information Processing Systems, 34, 2021.
[6] Bharat Lal Bhatnagar, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration. In NeurIPS, 2020.
[7] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. arXiv preprint arXiv:2203.09517, 2022.
[8] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.