Disentangled Human Body Embedding Based on Deep Hierarchical Neural Network

Boyi Jiang; Juyong Zhang; Jianfei Cai; Jianmin Zheng

arXiv:1905.05622·cs.CV·April 20, 2020


TL;DR

This paper introduces a hierarchical neural network architecture for learning disentangled 3D human body shape and pose embeddings, enabling accurate reconstruction and flexible body generation.

Contribution

It proposes a novel hierarchical reconstruction pipeline and a large dataset for improved disentangled embedding learning of 3D human bodies.

Findings

01 Achieves superior reconstruction accuracy.

02 Enables flexible 3D human body generation.

03 Demonstrates effectiveness in various applications.


Tables (9)

Table 1. The second column shows the number of our constructed neutral and pose meshes. We also present the number of meshes used from existing datasets.

Dataset	Neutral	SPRING [56]	SCAPE [5]	Hasler et al. [21]
Number	$\mathbf{3183}$	3048	70	517
Dataset	Pose	FAUST [9]	Dyna [44]	MANO [47]
Number	$\mathbf{2411}$	99	907	818

Table 2. MED (mm) for the test dataset consisting of 160 neutral meshes and 160 pose meshes.

Test Dataset	Ours	Baseline	meshVAE
Neutral (160)	$4.67$	4.99	5.26
Pose (160)	$2.75$	3.19	3.13

Table 3. Quantitative comparison of different methods on our shape scan dataset. The mean PMD (mm), the standard deviation, and the number of valid points for testing (without the hand part) are given. We also give the errors and the standard deviations (wh) for all points in the last two columns for reference only, although our method does not consider the hand part.

Methods	mean	std	#points	mean(wh)	std(wh)
Ours	$4.9$	$6.8$	545263	6.5	10.9
Baseline	5.2	7.5	543848	7.0	11.9
meshVAE	5.4	7.2	544794	6.9	11.1
SMPL	6.4	8.5	546020	6.4	8.2
SMPL-X	6.1	7.2	543853	$6.1$	$7.1$
Adam	12.1	13.3	547843	11.5	12.9

Table 4. Quantitative comparison of different methods on the DFaust [10] scan dataset. The mean PMD (mm), the standard deviation, and the number of valid points for testing (without the hand part) are reported. We also give the errors (wh) for all points in the last two columns for reference only, although our method does not consider the hand part.

Methods	mean	std	#points	mean(wh)	std(wh)
Ours	$2.9$	$4.5$	30953504	$3.6$	7.9
Ours_s	3.1	4.7	30952186	3.9	8.1
Baseline	3.3	4.8	30956373	4.1	8.2
Baseline_s	3.6	4.9	30956386	4.4	8.2
SMPL	4.6	5.5	31015202	4.5	$5.8$
SMPL_s	4.8	5.8	31012640	4.8	6.1
SMPL-X	4.8	6.8	30972558	4.8	7.2
SMPL-X_s	4.9	6.9	30975033	4.9	7.1
meshVAE	3.2	4.6	30956136	4.0	8.0
Adam	14.2	15.9	30956046	14.0	15.9
Adam_s	14.2	15.9	30956413	14.0	15.9

Table 5. Quantitative comparison of sparse reconstruction on the DFaust [10] scan dataset with different methods. The mean PMD (mm), the standard deviation, and the number of valid points for testing (without the hand part) are reported.

Methods	mean	std	#points
Ours	6.3	6.4	30962584
Ours_s	$6.2$	$6.3$	30963799
SMPL	6.9	7.5	31017249
SMPL_s	6.7	7.2	31018236

Table 6. Quantitative evaluations of 3D pose estimation on H3.6M [25]. Superscripts 1 and 2 stand for ground truth and estimated 2D joints input, respectively. Ours_e is our pose expanded model. The error is the mean Euclidean distance (mm) after Procrustes Analysis [18].

Methods	Ours$^{1}$	Ours_e$^{1}$	Ours_e$^{2}$	SMPLify [8]
Mean	95.4	65.8	86.7	82.3
Median	89.2	55.9	76.0	69.3

Table 7. Comparison on the FAUST challenge. AE and WE denote the average and worst errors (cm), respectively. Superscripts 1 and 2 denote the corresponding errors computed with the registered mesh expressed by our model and by the optimized vertex coordinates, respectively. In the table, n.a. indicates that quantitative results are not available. Among the supervised methods, we achieve the best results.

Methods	Inter AE	Inter WE	Intra AE	Intra WE
FMNet [35]	4.83	9.56	2.44	26.16
FARM [39]	4.12	9.98	2.81	19.42
Oshri et al. [19]	n.a.	n.a.	2.51	24.36
LBS-AE [33]	4.08	10.38	2.16	6.07
Ours$^{1}$	2.27	3.16	1.40	2.52
Ours$^{2}$	2.22	3.46	1.37	3.06
Ours_e$^{1}$	2.52	3.59	1.17	2.39
Ours_e$^{2}$	$1.99$	$2.99$	$1.01$	$2.08$

Table 8. Structure information of the MLPs in the decoder, listing the number of units composing each MLP and the input and output feature dimensions of each stacked unit.

MLP	$\mathcal{C}_{s}$ & $\mathcal{D}_{s}$	$\mathcal{C}_{p}$ & $\mathcal{D}_{p}$	$\mathcal{T}_{c}$	$\mathcal{T}_{d}$
Number of units	2	2	1	1
Dimensions	$50, 400, 800$	$72, 400, 800$	$800, 144$	$800, 9|\mathcal{V}|$

Table 9. The ablation study of hyperparameters in the training loss. We report the loss on the test dataset for models trained with different parameter settings. For the $\ell_{1}$ training loss, we fix the KL loss weights $\lambda_{s}$ and $\lambda_{p}$ to 1.0, and set the ratios $(\lambda_{r_{1}}:\lambda_{r_{c_{1}}})$ and $(\lambda_{r_{2}}:\lambda_{r_{c_{2}}})$ to (1:0.6). We test different values for the two groups of parameters. In the last row, we report the optimal test error of the model trained with the $\ell_{2}$ loss. Its KL loss is larger than that of the models trained with the $\ell_{1}$ loss when they achieve equivalent test accuracy.

$\lambda_{r_{1}}, \lambda_{r_{2}}$	$E_{sKL}$	$E_{pKL}$	$E_{L1_{1}}$	$E_{L1_{2}}$	$E_{L1_{c_{1}}}$	$E_{L1_{c_{2}}}$
1e3,1e4	73.51	17.30	0.063	0.055	0.062	0.049
2.5e3,2.5e4	94.97	86.97	0.035	0.053	0.031	0.047
5e3,5e4	150.30	147.36	0.029	0.054	0.026	0.047
1e4,5e4	199.96	280.00	0.021	0.054	0.019	0.047
$\ell_{2}$	126.85	168.38	0.031	0.055	0.028	0.048

Equations (43)

$$\arg\min_{\mathbf{T}_{i}} \sum_{j\in\mathcal{N}(i)} c_{ij} \left\| (\mathbf{p}_{i}-\mathbf{p}_{j}) - \mathbf{T}_{i}(\mathbf{q}_{i}-\mathbf{q}_{j}) \right\|_{2}^{2}$$

$$\arg\min_{\mathbf{T}_{\mathcal{V}_{k}}} \sum_{i\in\mathcal{V}_{k}} \left\| (\mathbf{p}_{i}-\bar{\mathbf{p}}_{\mathcal{V}_{k}}) - \mathbf{T}_{\mathcal{V}_{k}}(\mathbf{q}_{i}-\bar{\mathbf{q}}_{\mathcal{V}_{k}}) \right\|_{2}^{2},$$

$$\mathbf{r}_{\mathcal{V}_{k}} = \mathbf{u}_{\mathcal{V}_{k}}(\theta_{\mathcal{V}_{k}} + 2\pi m),$$

$$m = \arg\min_{j} \left\| \mathbf{u}_{\mathcal{V}_{k}}(\theta_{\mathcal{V}_{k}} + 2\pi j) - \frac{1}{|\mathcal{V}_{k}|} \sum_{i\in\mathcal{V}_{k}} \mathbf{r}_{i} \right\|_{2}^{2}.$$

$$\arg\min_{\{\mathbf{p}_{i}\}} \sum_{j\in\mathcal{N}(i)} c_{ij} \left\| (\mathbf{p}_{i}-\mathbf{p}_{j}) - \mathbf{T}_{i}(\mathbf{q}_{i}-\mathbf{q}_{j}) \right\|_{2}^{2}$$

$$2\sum_{j\in\mathcal{N}(i)} c_{ij}\,\mathbf{e}_{ij} = \sum_{j\in\mathcal{N}(i)} c_{ij}(\mathbf{T}_{i}+\mathbf{T}_{j})(\mathbf{q}_{j}-\mathbf{q}_{i}),$$

$$\mathbf{f} = \mathcal{W}(\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p})) + \mathcal{D}(\mathbf{e}_{s},\mathbf{e}_{p})$$

$$\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p}) = \mathcal{T}_{c}(\mathcal{C}_{s}(\mathbf{e}_{s}) + \mathcal{C}_{p}(\mathbf{e}_{p})),$$

$$\mathcal{D}(\mathbf{e}_{s},\mathbf{e}_{p}) = \mathcal{T}_{d}(\mathcal{D}_{s}(\mathbf{e}_{s}) + \mathcal{D}_{p}(\mathbf{e}_{p})).$$

$$\mathbf{b}$$

$$\text{s.t.}\quad W_{ij} \geq 0$$

$$E_{L1_{1}} = \frac{1}{9|\mathcal{V}|} \| \hat{\mathbf{f}}_{s} - \mathbf{f}_{s} \|_{1}, \quad E_{L1_{2}} = \frac{1}{9|\mathcal{V}|} \| \hat{\mathbf{f}} - \mathbf{f} \|_{1}.$$

$$E_{L1_{c_{1}}} = \frac{1}{9\times 16} \| \hat{\mathbf{g}}_{s} - \mathbf{g}_{s} \|_{1}, \quad E_{L1_{c_{2}}} = \frac{1}{9\times 16} \| \hat{\mathbf{g}} - \mathbf{g} \|_{1}.$$

$$E_{sKL} = D_{KL}(q(\mathbf{e}_{s}\mid\mathbf{f}) \,\|\, p(\mathbf{e}_{s})),$$

$$E_{pKL} = D_{KL}(q(\mathbf{e}_{p}\mid\mathbf{f}) \,\|\, p(\mathbf{e}_{p})),$$

$$\mathrm{Loss}$$

$$+ \lambda_{p} E_{pKL} + \lambda_{r_{2}} E_{L1_{2}} + \lambda_{r_{c_{2}}} E_{L1_{c_{2}}}.$$

$$\arg\min_{\mathbf{R},\mathbf{t},\mathbf{p}_{i},\mathbf{w}} E_{prior} + \lambda_{1} E_{icp} + \lambda_{2} E_{lan} + \lambda_{3} \|\mathbf{w}\|_{1}$$

$$E_{prior} = \sum_{i} \sum_{j\in\mathcal{N}(i)} c_{ij} \left\| (\mathbf{p}_{i}-\mathbf{p}_{j}) - \mathbf{T}_{i}(\mathbf{w})(\mathbf{q}_{i}-\mathbf{q}_{j}) \right\|_{2}^{2}$$

$$E_{icp} = \sum_{i} \left\| \mathbf{n}_{l(i)}^{T}(\mathbf{R}\mathbf{p}_{i} + \mathbf{t} - \mathbf{v}_{l(i)}) \right\|$$

$$E_{lan} = \sum_{i\in\mathcal{L}} \left\| \mathbf{R}\mathbf{p}_{i} + \mathbf{t} - \mathbf{v}_{l(i)} \right\|_{2}^{2}$$

$$\min_{\mathbf{e}_{s},\mathbf{e}_{p},\mathbf{R},\mathbf{t}} \lambda \sum_{i}^{|\mathcal{V}|} \left\| \mathbf{R}\mathbf{p}_{i}(\mathbf{e}_{s},\mathbf{e}_{p}) + \mathbf{t} - \mathbf{q}_{i} \right\|_{2}^{2} + \lambda_{\beta} \|\mathbf{e}_{s}\|_{2}^{2} + \lambda_{\theta} \|\mathbf{e}_{p}\|_{2}^{2}$$

$$\min_{\mathbf{e}_{s},\mathbf{e}_{p},\mathbf{R},\mathbf{t}} \sum_{i}^{\mathrm{joint}} \lambda\,\rho\left(\Pi_{K}(\mathbf{R}\mathbf{J}_{i}(\mathbf{e}_{s},\mathbf{e}_{p}) + \mathbf{t}) - \mathbf{j}_{i}\right)$$

$$+ \lambda_{g} E_{g}(R(\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p}))) + \lambda_{\beta} \|\mathbf{e}_{s}\|_{2}^{2} + \lambda_{\theta} \|\mathbf{e}_{p}\|_{2}^{2},$$


Code & Models

Repositories

Juyong/DHNN_BodyRepresentation (PyTorch, official)


Full text


Boyi Jiang, Juyong Zhang†, Jianfei Cai, and Jianmin Zheng

B. Jiang and J. Zhang are with the School of Mathematical Sciences, University of Science and Technology of China. J. Cai is with the Faculty of IT, Monash University. J. Zheng is with the School of Computer Science and Engineering, Nanyang Technological University.

†Corresponding author. Email: [email protected].

Abstract

Human bodies exhibit various shapes for different identities or poses, but the body shape has certain similarities in structure and thus can be embedded in a low-dimensional space. This paper presents an autoencoder-like network architecture to learn disentangled shape and pose embeddings specifically for the 3D human body. This is inspired by recent progress in deformation-based latent representation learning. To improve the reconstruction accuracy, we propose a hierarchical reconstruction pipeline for the disentangling process and construct a large dataset of human body models with consistent connectivity for training the neural network. Our learned embedding not only achieves superior reconstruction accuracy but also provides great flexibility in 3D human body generation via interpolation, bilinear interpolation, and latent space sampling. The results from extensive experiments demonstrate the strength of our learned 3D human body embedding in various applications.

Index Terms:

3D Body Shape, 3D Human Articulated Body Model, Variational Autoencoder, Deformation Representation, Hierarchical Structure.

1 Introduction

This paper considers the problem of learning a parametric 3D human body model, which can map a low-dimensional latent representation into a high-quality 3D human body mesh via a deep hierarchical neural network. Parametric human body models have a wide range of applications in computer graphics and computer vision. Examples of such applications are body tracking [55, 57], body reconstruction [2, 8, 36] and pose estimation [41, 30]. However, building an expressive and reliable parametric body model is challenging. This is because the human body has abundant variations due to many factors, such as gender, ethnicity, and stature. In particular, different poses may introduce significant deformations of the body, which are difficult to model by conventional linear techniques such as principal component analysis (PCA).

The state-of-the-art work SMPL (skinned multi-person linear model) [37] separates human body variations into shape-related variations and 3D pose variations. The shape-related variations are modeled by a low dimensional linear shape space with shape parameters. The 3D pose variations are handled by a skeleton skinning method with pose parameters derived from 3D joint angles. SMPL has a clear pose definition and can express different human poses of large scales. The parameter-to-mesh computation in SMPL is fast and robust. However, the reconstruction accuracy of the skeleton skinning method relies on the linear shape space of neutral body shapes. The skinning weights of SMPL are shared for different neutral shapes of different identities, which further restricts its reconstruction ability. To capture the pose-dependent deformations and reduce skinning artifacts around joints, SMPL introduces independent pose-related blend shapes as the complement for the intrinsic shape space defined by shape parameters. The pose parameters of SMPL explicitly define the movements of the human skeleton and are very suitable for character animation editing. Due to the excess expression ability of the pose parameters, specific human body pose prior is always needed for applications to avoid occurrences of unnatural body meshes. [8, 1] used some pose prior constraints like joint angle assignment range and self-intersection penalty energy to generate plausible body shapes. [30] adopted a network discriminator to judge whether the generated pose parameters obey the distribution of human motion during training. Different from these works, we introduce a novel disentangled body representation that can achieve better accuracy in body shape reconstruction and whose pose latent parameters encode the prior of human pose distribution to some extent.

With the advance of deep learning, encoder-decoder architectures have demonstrated their capability of extracting latent representations of face geometry [6, 45, 27]. Compared with face shape, human body shape is more complicated, as it contains many joints and very complex movements. Therefore, directly extending neural network-based methods for face shape to human body shape cannot achieve good performance. Recently, Litany et al. [34] proposed a graph convolution-based variational autoencoder for 3D body shape, which directly uses the Euclidean coordinates as the vertex feature and encodes the whole shape without disentangling identity and posture attributes. However, an encoder-decoder architecture operating in the Euclidean domain may decode unnatural body deformations from the latent embedding. Unlike Euclidean coordinates, a mesh deformation representation called ACAP (as consistent as possible), introduced in [15], can handle arbitrarily large rotations in a stable way and has good interpolation and extrapolation properties. Recent studies [50, 51] show that learning on deformation features with an autoencoder or VAE [32] can yield a more powerful latent representation. However, [50, 51] are designed for learning latent representations of general 3D shapes. When applied to 3D human body modeling, they provide only one latent embedding that entangles both shape and pose variations, which is not sufficient for practical use.

Therefore, in this paper, we propose to utilize the neural network to learn two disentangled latent representations from ACAP features: one for shape variations and the other for pose variations, both of which are specifically designed and learned for the human body modeling. Moreover, a coarse-to-fine reconstruction pipeline is integrated into the disentangling process to improve the reconstruction accuracy. Our major contributions are twofold:

• We propose a general framework based on the variational autoencoder architecture for learning disentangled shape and pose latent embeddings of the 3D human body. Our framework introduces a hierarchical representation design, and its basic transformation module leaves great design freedom.

• Learning on ACAP features [15] requires mesh data with the same connectivity, which existing human body datasets do not satisfy. To address this issue, we re-mesh a large set of meshes from multiple existing human body datasets into a standard connectivity via a novel non-rigid registration method and construct a new large-scale human body dataset. The dataset consists of over 5000 human body mesh models with the same connectivity, where each identity has a standard or neutral pose. (Our full framework is available at https://github.com/Juyong/DHNN_BodyRepresentation.)

We have conducted extensive experiments, including various applications. The experimental results demonstrate the strength of our learned 3D human body embedding in terms of modeling accuracy, generation flexibility, etc.

2 Related Work

Human shape models. Human body shape is often constructed and represented via its shape variations [3, 48, 56, 43]. For example, Anguelov et al. [5] proposed to process shape completion by computing the deformation of triangles between the template and target meshes. Performing PCA on the transformation matrices further yields more robust results. Allen et al. [3] and Seo et al. [48] applied PCA to mesh vertex displacements to characterize the non-rigid deformation of human body shapes. Moreover, Allen et al. [3] constructed a correspondence between a set of semantic parameters of body shapes and PCA parameters by linear regression, which facilitates the manipulation of human body shapes. Zhou et al. [58] used a similar idea to semantically reshape human bodies from a single image. To extract more local and fine-grained semantic parameters from body shape representation, Yang et al. [56] introduced local mapping between semantic parameters and per triangle deformation matrix, which provides precise semantic control of human body shapes.

Human pose models. To represent human shape with poses, skeleton skinning is often used, which can directly compute positions of vertices on the body shape. Allen et al. [4] proposed to learn skinning weights for corrective enveloping and solve a highly nonlinear equation to find the relation among pose, skeleton, and skinning weights. Joo et al. [29] stitched hand, face, and body models together to obtain an expressive model that can capture the motion of humans. SMPL [37] explicitly defines body joints, uses the skeleton to represent body pose, and computes vertex positions with the standard skinning method. Hesse et al. [22] followed SMPL’s design and learned a statistical 3D infant body model from sequences of incomplete, low-quality RGB-D images of freely moving infants. Pavlakos et al. [42] expanded SMPL to capture the hand pose and facial expression with a unified representation.

Deformation-based models. Mesh deformations have been used to analyze 3D human body shape and pose [5, 12, 13, 21, 24, 20]. The most representative work is SCAPE [5], which analyzes body shape and pose deformation in terms of the deformation of triangles with respect to a reference mesh. The deformation representation can encode detailed shape variations, but an optimization process is required to obtain the mesh from the deformation representation. This conversion is usually time-consuming, which rules out real-time applications [52]. Chen et al. [11] extended the SCAPE [5] approach for real-time reconstruction of an animating human body. Jain et al. [26] used a common skinning approach for modeling pose-dependent surface variations instead of using per-triangle transformations, which makes the pose estimation much faster than SCAPE [5].

Deep learning for geometric representation. Bagautdinov et al.[6] introduced a ladder VAE architecture to effectively encode face shape in different scales, which can achieve high reconstruction accuracy. Anurag et al.[45] defined upsampling and downsampling operations on the face mesh and used graph structure convolution to encode latent representation, which can obtain high reconstruction accuracy even for extreme facial expressions. The method proposed in [27] disentangles identity and expression attributes with two VAE branches and then fuses them back to the input mesh. By exploiting the strong non-linear expression capability of neural network and a deformation representation, the method outperforms previous methods in the decomposition of facial identity and expression. Hamu et al. [7] introduced a 3D shape generative model for genus-zero shapes and adopted a novel 3D shape tensor representation to make it suitable for arbitrary connectivity. However, the lack of disentangled representation restricts its wide application. Higgins et al. [23] proposed a novel strategy to automatically learn disentangled latent representations from raw data in a completely unsupervised manner.

Deformation representation. Geometric representation based on Euclidean coordinates is not invariant under translation and rotation, and cannot handle large-scale deformations well [15]. Gao et al. [14] proposed to use the rotation difference on each directed edge to define the deformation. This representation is called RIMD (rotation-invariant mesh difference), which is translation and rotation invariant. RIMD is suitable for mesh interpolation and extrapolation, but reconstructing vertex coordinates from RIMD requires solving a very complicated optimization problem. The RIMD feature encodes a plausible deformation space. With the RIMD feature, Tan et al. [51] designed a fully connected mesh variational autoencoder network to extract a latent deformation embedding. However, it does not provide disentangled shape and pose latent embeddings for 3D human modeling.

Gao et al. [15] further proposed another representation called ACAP (as-consistent-as-possible) feature, which allows more efficient reconstruction and derivative computations. Using the ACAP feature, Tan et al. [50] proposed a convolutional autoencoder to extract localized deformation components from mesh data sets. Gao et al. [16] also used the ACAP feature to achieve an automatic unpaired shape deformation transfer between two sets of meshes. Wu et al. [53] used a simplified ACAP representation to model caricature face geometry.

3 Deformation Representation

This section presents our shape features that are used to represent the human body. We adopt a hierarchical architecture to represent and reconstruct the body shape. In particular, we propose a coarse-level shape feature based on anatomical body components and the ACAP feature to represent the human body shape with a general pose.

ACAP Feature. Assume that a mesh dataset consists of $N$ meshes with the same connectivity. We choose one mesh as the reference and the other meshes are considered to be deformed from the reference. We denote the $i$ -th vertex coordinates of the reference mesh and the target mesh by $\mathbf{q}_{i}\in{\mathbb{R}^{3}}$ and $\mathbf{p}_{i}\in{\mathbb{R}^{3}}$ , respectively. The deformation at vertex $\mathbf{p}_{i}$ is described locally by an affine transform matrix $\mathbf{T}_{i}\in\mathbb{R}^{3\times 3}$ that maps the one-ring neighbor of $\mathbf{q}_{i}$ in the reference mesh to its corresponding vertex on target mesh. The matrix is computed by minimizing

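$$\arg\min_{\mathbf{T}_{i}} \sum_{j\in\mathcal{N}(i)} c_{ij} \left\| (\mathbf{p}_{i}-\mathbf{p}_{j}) - \mathbf{T}_{i}(\mathbf{q}_{i}-\mathbf{q}_{j}) \right\|_{2}^{2} \tag{1}$$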

where $c_{ij}$ is the cotangent weight and $\mathcal{N}(i)$ is the index set of the one-ring neighbors of the $i$-th vertex. Using polar decomposition, $\mathbf{T}_{i}=\mathbf{R}_{i}\mathbf{S}_{i}$, the deformation matrix $\mathbf{T}_{i}$ is decomposed into a rigid component represented by a rotation matrix $\mathbf{R}_{i}$ and a non-rigid component represented by a real symmetric matrix $\mathbf{S}_{i}$. Following [15], the rotation matrix $\mathbf{R}_{i}$ can be further represented by a vector $\mathbf{r}_{i}\in{\mathbb{R}^{3}}$, and the symmetric matrix $\mathbf{S}_{i}$ can be represented by a vector $\mathbf{s}_{i}\in{\mathbb{R}^{6}}$. To resolve the ambiguity of the axis-angle representation of a rotation matrix, Gao et al. [15] proposed an integer programming approach to solve for the optimal $\mathbf{r}_{i}$ globally and make all $\mathbf{r}_{i}$ as consistent as possible. Interested readers can refer to [15] for details. Once $\mathbf{r}_{i}$ and $\mathbf{s}_{i}$ are available, we concatenate all $[{\mathbf{r}_{i},\mathbf{s}_{i}}]$ together to form the ACAP feature vector $\mathbf{f}\in{\mathbb{R}^{9|\mathcal{V}|}}$ for the target mesh, where $\mathcal{V}$ represents the entire set of mesh vertices. In this way, we convert the target mesh into its ACAP feature representation. As shown in [15], by eliminating the ambiguity of the axis-angle representation globally, the ACAP feature demonstrates excellent linear interpolation properties. Thus ACAP is a good linear space mapping of 3D shape collections with the same connectivity.
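The per-vertex step described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation of [15]: the function names and the ordering of the six entries of $\mathbf{s}_{i}$ are assumptions, and the global integer programming step that makes all $\mathbf{r}_{i}$ consistent is omitted.

```python
import numpy as np

def polar_decompose(T):
    """Split a 3x3 matrix T into a rotation R and a symmetric S with T = R @ S."""
    U, sigma, Vt = np.linalg.svd(T)
    # Flip the last singular direction if needed so that det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    S = Vt.T @ D @ np.diag(sigma) @ Vt
    return R, S

def rotation_to_axis_angle(R):
    """Axis-angle vector r = theta * u with theta in [0, pi).

    Not robust for theta close to pi, where sin(theta) vanishes.
    """
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta * axis

def acap_vertex_feature(T):
    """9-dim feature [r, s] of one vertex (assumed layout: upper triangle of S)."""
    R, S = polar_decompose(T)
    r = rotation_to_axis_angle(R)
    s = S[np.triu_indices(3)]
    return np.concatenate([r, s])
```

Stacking the output of `acap_vertex_feature` over all vertices gives a vector of length $9|\mathcal{V}|$, matching the dimension of $\mathbf{f}$.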

Coarse Level Deformation Feature. A human body is composed of some anatomical components, and the deformation of a component can be viewed as the main deformation for each vertex belonging to the component. According to the segmentation of [5], we partition a human body into 16 anatomical parts as shown in Fig. 1. We denote by $\mathcal{V}_{k}$ the set of mesh vertices belonging to the $k$ -th part. Similar to Eq. (1), we compute its deformation $\mathbf{T}_{\mathcal{V}_{k}}$ :

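$$\arg\min_{\mathbf{T}_{\mathcal{V}_{k}}} \sum_{i\in\mathcal{V}_{k}} \left\| (\mathbf{p}_{i}-\bar{\mathbf{p}}_{\mathcal{V}_{k}}) - \mathbf{T}_{\mathcal{V}_{k}}(\mathbf{q}_{i}-\bar{\mathbf{q}}_{\mathcal{V}_{k}}) \right\|_{2}^{2}, \tag{2}$$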

where $\bar{\mathbf{p}}_{\mathcal{V}_{k}}$ is the mean position of the target mesh’s $k$-th part. Similarly, we can represent $\mathbf{T}_{\mathcal{V}_{k}}$ using $\mathbf{r}_{\mathcal{V}_{k}}\in{\mathbb{R}^{3}}$ and $\mathbf{s}_{\mathcal{V}_{k}}\in{\mathbb{R}^{6}}$. An axis-angle vector represents the same rotation whenever its angle is shifted by a multiple of $2\pi$, which makes $\mathbf{r}_{\mathcal{V}_{k}}$ ambiguous. The ACAP feature, however, has already eliminated this ambiguity for all $\mathbf{r}_{i}$, $i\in{\mathcal{V}_{k}}$, so all $\mathbf{r}_{i}$ have consistent radian values. Therefore, we choose the specific $\mathbf{r}_{\mathcal{V}_{k}}$ that is closest to the mean of all $\mathbf{r}_{i}$ of the $k$-th part. Specifically, we modify $\mathbf{r}_{\mathcal{V}_{k}}$ into

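$$\mathbf{r}_{\mathcal{V}_{k}} = \mathbf{u}_{\mathcal{V}_{k}}(\theta_{\mathcal{V}_{k}} + 2\pi m), \tag{3}$$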

where $\theta_{\mathcal{V}_{k}}$ and $\mathbf{u}_{\mathcal{V}_{k}}$ are the length and the normalized vector of the initial $\mathbf{r}_{\mathcal{V}_{k}}$ , respectively, and $m$ is computed by solving the following optimization problem

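$$m = \arg\min_{j} \left\| \mathbf{u}_{\mathcal{V}_{k}}(\theta_{\mathcal{V}_{k}} + 2\pi j) - \frac{1}{|\mathcal{V}_{k}|} \sum_{i\in\mathcal{V}_{k}} \mathbf{r}_{i} \right\|_{2}^{2}. \tag{4}$$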

Once $\mathbf{r}_{\mathcal{V}_{k}}$ and $\mathbf{s}_{\mathcal{V}_{k}}$ are found for all parts, we concatenate all $[\mathbf{r}_{\mathcal{V}_{k}},\mathbf{s}_{\mathcal{V}_{k}}]$ together to form the coarse-level feature $\mathbf{g}\in{\mathbb{R}^{144}}$ . Each $[\mathbf{r}_{\mathcal{V}_{k}},\mathbf{s}_{\mathcal{V}_{k}}]$ encodes the optimal affine transformation of the $k$ -th part relative to the reference part. In the first column of Fig. 2, we show a group of coarse level deformation shapes of target meshes.
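The selection of $m$ for one part can be sketched as a small search over integer shifts. This is a minimal illustration under stated assumptions: the function name, the finite search range `search`, and the handling of a zero rotation (whose axis is undefined) are not taken from the paper.

```python
import numpy as np

def align_part_rotation(r_part, r_vertices, search=3):
    """Shift the angle of r_part by 2*pi*m so it is closest to the mean of r_vertices.

    r_part:     (3,) initial axis-angle vector of the part.
    r_vertices: (n, 3) consistent per-vertex axis-angle vectors of the same part.
    search:     hypothetical bound; integers j in [-search, search] are tried.
    """
    theta = np.linalg.norm(r_part)
    if theta < 1e-8:
        # Axis undefined for a zero rotation; this sketch leaves it unchanged.
        return r_part.copy()
    u = r_part / theta
    mean_r = r_vertices.mean(axis=0)
    candidates = [u * (theta + 2.0 * np.pi * j) for j in range(-search, search + 1)]
    errors = [np.sum((c - mean_r) ** 2) for c in candidates]
    return candidates[int(np.argmin(errors))]
```

Because the per-vertex vectors $\mathbf{r}_{i}$ are already consistent, their mean serves as the reference against which the part rotation is disambiguated.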

ACAP to Mesh. Converting a given ACAP feature vector $\mathbf{f}\in{\mathbb{R}^{9|\mathcal{V}|}}$ to the target mesh is easy. In particular, we directly reconstruct $\mathbf{T}_{i}$ from $[\mathbf{r}_{i},\mathbf{s}_{i}]$ [15]. The vertex coordinates $\mathbf{p}_{i}$ of the target mesh can be obtained by solving

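$$\arg\min_{\{\mathbf{p}_{i}\}} \sum_{j\in\mathcal{N}(i)} c_{ij} \left\| (\mathbf{p}_{i}-\mathbf{p}_{j}) - \mathbf{T}_{i}(\mathbf{q}_{i}-\mathbf{q}_{j}) \right\|_{2}^{2} \tag{5}$$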

which is equivalent to the following system of linear equations:

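$$2\sum_{j\in\mathcal{N}(i)} c_{ij}\,\mathbf{e}_{ij} = \sum_{j\in\mathcal{N}(i)} c_{ij}(\mathbf{T}_{i}+\mathbf{T}_{j})(\mathbf{q}_{j}-\mathbf{q}_{i}), \tag{6}$$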

where $\mathbf{e}_{ij}=\mathbf{p}_{j}-\mathbf{p}_{i}$ . Note that Eq. (6) is translation-independent. Thus we need to specify the position of one vertex. Then the amended linear system can be rewritten as $\mathbf{A}\mathbf{p}=\mathbf{b}$ where $\mathbf{A}$ is a fixed and sparse coefficient matrix, for which a pre-decomposing operation can be executed to save the computation time.
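A minimal dense sketch of this reconstruction is given below, assuming the linear system has one row per vertex of the form $2\sum_{j} c_{ij}(\mathbf{p}_{j}-\mathbf{p}_{i}) = \sum_{j} c_{ij}(\mathbf{T}_{i}+\mathbf{T}_{j})(\mathbf{q}_{j}-\mathbf{q}_{i})$, with the row of one anchor vertex replaced by a position constraint. The function name and argument layout are hypothetical. A practical implementation would store $\mathbf{A}$ as a sparse matrix and factorize it once, as the text suggests; the dense solve here is only for readability.

```python
import numpy as np

def acap_to_vertices(q, neighbors, weights, T, anchor_index, anchor_position):
    """Recover target vertices p from per-vertex transforms T.

    q:         (n, 3) reference vertex positions.
    neighbors: dict i -> list of one-ring neighbor indices j.
    weights:   dict (i, j) -> weight c_ij (cotangent weights in the paper).
    T:         sequence of n (3, 3) transforms.
    """
    n = q.shape[0]
    A = np.zeros((n, n))
    b = np.zeros((n, 3))
    for i in range(n):
        for j in neighbors[i]:
            c = weights[(i, j)]
            A[i, j] += 2.0 * c
            A[i, i] -= 2.0 * c
            b[i] += c * (T[i] + T[j]) @ (q[j] - q[i])
    # The system is translation-independent, so pin one vertex to fix the translation.
    A[anchor_index] = 0.0
    A[anchor_index, anchor_index] = 1.0
    b[anchor_index] = anchor_position
    return np.linalg.solve(A, b)
```

Since $\mathbf{A}$ depends only on the connectivity, the weights, and the anchor, it stays the same for every mesh decoded against the same reference; only $\mathbf{b}$ changes.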

Scaling Deformation Feature. Following the strategy of Tan et al. [51], we rescale each dimension of $\mathbf{f}$ and $\mathbf{g}$ to $[-0.95,0.95]$ independently. This strategy normalizes each dimension of the features and reduces learning difficulty of reconstructing deformation features $\mathbf{f}$ and $\mathbf{g}$ .
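One way to realize this rescaling is a per-dimension min-max map, sketched below. The linear min-max form, the function names, and the treatment of constant dimensions are assumptions; the text only states that each dimension is rescaled to $[-0.95,0.95]$ independently.

```python
import numpy as np

def fit_scaler(features, bound=0.95):
    """Compute per-dimension offsets and spans from an (N, D) feature matrix."""
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    # Guard against division by zero for dimensions that never vary.
    span = np.where(hi > lo, hi - lo, 1.0)
    return lo, span, bound

def scale(features, scaler):
    """Map each dimension linearly so its training range becomes [-bound, bound]."""
    lo, span, bound = scaler
    return (features - lo) / span * (2.0 * bound) - bound

def unscale(scaled, scaler):
    """Invert scale(), e.g. before converting a decoded feature back to a mesh."""
    lo, span, bound = scaler
    return (scaled + bound) / (2.0 * bound) * span + lo
```

Keeping the range strictly inside $(-1,1)$ lets a decoder with a $\tanh$-like output reach every target value, and the inverse map must be applied before the ACAP-to-mesh conversion.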

4 Overview

This section gives a detailed description of our proposed representation for 3D human body. We adopt the ACAP feature to represent human body shape considering its linear space mapping for large scale deformation. With the ACAP feature, we can use addition operations to represent the composition of non-rigid deformations.

Our proposed human body representation is motivated by the following two factors: semantics and precision. For semantics, an identity and pose disentangled body representation is required for many human body related applications. Therefore, for an ACAP feature $\mathbf{f}$ of a human body, we denote its latent parameters by a set of disentangled parameters $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ , where $\mathbf{e}_{s}$ and $\mathbf{e}_{p}$ control the shape variations determined by identity and posture, respectively. We define the neutral pose shape feature $\mathbf{f}_{s}$ of $\mathbf{f}$ as the ACAP feature decoded from $\{\mathbf{e}_{s},\mathbf{0}\}$ . The last column of Fig. 2 shows a posed human body and its corresponding neutral body. In this paper, the latent representation denotes a compressed representation of the original shape model, which is the only information the decoder is allowed to use to reconstruct the input shape model as faithfully as possible.

To improve the representation accuracy, we adopt a hierarchical strategy. Specific to the human body, a natural idea is to utilize the deformation of anatomical components as the bridge to the final shape. From Section 3, we know that the deformation of body components is encoded by the coarse level deformation feature $\mathbf{g}\in{\mathbb{R}^{9\times 16}}$ of $\mathbf{f}\in{\mathbb{R}^{9|\mathcal{V}|}}$. We use $\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p})$ to represent the mapping from the latent parameters to $\mathbf{g}$, and denote the coarse feature of $\mathbf{f}_{s}$ by $\mathbf{g}_{s}$. The deformation of components encoded by $\mathbf{g}$ has a much lower dimension than $\mathbf{f}$, and each vertex feature of $\mathbf{f}$ encompasses a similar base deformation determined by the related components. Therefore, based on the articulated structure, we model a base part $\mathbf{b}=\mathcal{B}(\mathbf{e}_{s},\mathbf{e}_{p})$ of $\mathbf{f}$ with $\mathcal{W}(\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p}))$, where $\mathcal{W}$ is a linear blend skinning operation that recovers the deformation of each vertex on $\mathbf{b}$ by linearly blending the deformations of related components on $\mathbf{g}$. Similarly, we use $\mathbf{b}_{s}$ to represent the neutral counterpart of $\mathbf{b}$. A group of coarse features and base features is visualized in the first two columns of Fig. 2.

Considering that base feature $\mathbf{b}$ only encodes the optimal affine transformation relative to the reference mesh based on anatomical components, which does not include the fine-scale deformations caused by identities, soft tissues movement and different postures, we introduce difference features $\mathbf{d}=\mathcal{D}(\mathbf{e}_{s},\mathbf{e}_{p})$ and $\mathbf{d}_{s}=\mathcal{D}(\mathbf{e}_{s},\mathbf{0})$ to recover $\mathbf{f}$ and $\mathbf{f}_{s}$ better. Our final proposed human body representation can be expressed as:

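$$\mathbf{f}=\mathcal{W}(\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p}))+\mathcal{D}(\mathbf{e}_{s},\mathbf{e}_{p}),\qquad\mathbf{f}_{s}=\mathcal{W}(\mathcal{C}(\mathbf{e}_{s},\mathbf{0}))+\mathcal{D}(\mathbf{e}_{s},\mathbf{0}),\tag{7}$$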

and we can further represent $\mathcal{C}$ and $\mathcal{D}$ as:

[TABLE]

$\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p})$ aims to reconstruct the coarse level deformation feature $\mathbf{g}$ by summing two independent parts $\mathcal{C}_{s}(\mathbf{e}_{s})$ and $\mathcal{C}_{p}(\mathbf{e}_{p})$ and then applying a mapping $\mathcal{T}_{c}$ , which is introduced to enhance the non-linearity of the representation and thus improve its expressive ability. For the difference feature $\mathcal{D}(\mathbf{e}_{s},\mathbf{e}_{p})$ , we follow the same design. As shown in the first row of Fig. 2, we can get the neutral pose counterparts $\mathbf{g}_{s},\mathbf{b}_{s}$ and $\mathbf{d}_{s}$ of all corresponding features by setting $\mathbf{e}_{p}$ to $\mathbf{0}$ . For the body representation in Eq. (7), each mapping can be implemented as an MLP (multilayer perceptron) of arbitrary complexity. In this way, an end-to-end neural network can be integrated with this representation.
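The composition described above can be sketched with toy components. This is only an illustration under stated assumptions, not the authors' network: the sizes are tiny, each mapping is a random bias-free $\tanh$ layer standing in for an MLP, the skinning operation $\mathcal{W}$ is replaced by a fixed copy rule, and the names `D_s`, `D_p` and `T_d` for the difference path are hypothetical (the text only states that this path follows the same design as $\mathcal{C}$).

```python
import math
import random

random.seed(0)

def tanh_layer(n_in, n_out):
    """Random bias-free tanh layer R^n_in -> R^n_out, standing in for an MLP."""
    weights = [[random.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return lambda x: [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Toy sizes; the paper uses 50 (shape), 72 (pose), 16 x 9 (coarse) and 9|V| (full).
N_SHAPE, N_POSE, N_HIDDEN, N_COARSE, N_FULL = 3, 4, 5, 6, 12

C_s, C_p, T_c = tanh_layer(N_SHAPE, N_HIDDEN), tanh_layer(N_POSE, N_HIDDEN), tanh_layer(N_HIDDEN, N_COARSE)
D_s, D_p, T_d = tanh_layer(N_SHAPE, N_HIDDEN), tanh_layer(N_POSE, N_HIDDEN), tanh_layer(N_HIDDEN, N_FULL)

def coarse(e_s, e_p):
    """C(e_s, e_p): sum the shape and pose branches, then apply T_c."""
    return T_c(add(C_s(e_s), C_p(e_p)))

def difference(e_s, e_p):
    """D(e_s, e_p): same design as the coarse path (T_d is an assumed name)."""
    return T_d(add(D_s(e_s), D_p(e_p)))

def skin(g):
    """Placeholder for W: every full-resolution entry copies one coarse entry."""
    return [g[j % N_COARSE] for j in range(N_FULL)]

def decode(e_s, e_p):
    """Return (f, g) with f = W(C(e_s, e_p)) + D(e_s, e_p)."""
    g = coarse(e_s, e_p)
    return add(skin(g), difference(e_s, e_p)), g

e_s = [0.5, -1.0, 0.2]
e_p = [1.0, 0.0, -0.5, 0.3]
f, g = decode(e_s, e_p)                   # posed body features
f_s, g_s = decode(e_s, [0.0] * N_POSE)    # neutral counterpart: e_p = 0
```

The same `decode` call produces both the posed and the neutral features, so the neutral outputs depend on `e_s` alone.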

In the next few sections, we will give the implementation details of our proposed human body representation. In particular, we first present our neural network architecture and loss function design in Section 5. Then we give the construction of our body shape dataset in Section 6, and we show how to use the proposed learned embedding in Section 7. Finally, the detailed experimental evaluations are reported in Section 8.

5 Embedding Learning

5.1 Network Architecture

In this work, our goal is to learn a disentangled human body representation with a hierarchical reconstruction pipeline. We define the coarse shape, the base shape, the difference shape and the body shape as $\mathbf{g},\mathbf{b},\mathbf{d},\mathbf{f}$ , respectively. To learn disentangled and hierarchical representation, we need large scale training data with ground truth $\{\mathbf{f},\mathbf{g},\mathbf{f}_{s},\mathbf{g}_{s}\}$ to supervise our embedding learning.

We use a VAE-like architecture in our end-to-end representation learning. Fig. 3 shows the proposed architecture. For the encoder, we first feed $\mathbf{f}\in{\mathbb{R}^{9|\mathcal{V}|}}$ into a shared MLP (multilayer perceptron) $\mathcal{T}$ to generate a 400-dimensional hidden feature. Then we use the standard VAE [32] encoder structure $\{\mathcal{T}_{s},\mathcal{T}_{p}\}$ to generate the shape and pose latent representations $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ separately. Specifically, $\mathcal{T}$ is composed of two fully connected layers with $\tanh$ as the activation function. $\{\mathcal{T}_{s},\mathcal{T}_{p}\}$ have similar structures: each uses a fully connected layer without activation to output the mean values and another fully connected layer with a $2\times\mathrm{sigmoid}$ activation to output the standard deviations. We set the shape embedding $\mathbf{e}_{s}$ to $50$ dimensions and the pose embedding $\mathbf{e}_{p}$ to $72$ dimensions, i.e., $\mathbf{e}_{s}\in{\mathbb{R}^{50}}$ and $\mathbf{e}_{p}\in{\mathbb{R}^{72}}$ , to roughly match the dimensions of the shape and pose parameters in SMPL [37].
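The two output heads of $\mathcal{T}_{s}$ and $\mathcal{T}_{p}$ can be illustrated as follows. This is a minimal sketch that skips the fully connected layers and starts from hypothetical head outputs; it assumes the usual VAE sampling rule $\mathbf{e}=\boldsymbol{\mu}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, which the text implies by referring to the standard VAE encoder but does not spell out.

```python
import math
import random

def two_sigmoid(x):
    """2 * sigmoid(x), which lies in (0, 2) and equals 1 at x = 0."""
    return 2.0 / (1.0 + math.exp(-x))

def encode_head(mean_out, std_pre, rng):
    """Turn the two head outputs into (sample, mean, std) for one embedding."""
    mean = list(mean_out)                        # mean head: no activation
    std = [two_sigmoid(x) for x in std_pre]      # std head: 2 * sigmoid
    sample = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    return sample, mean, std

rng = random.Random(0)
# Hypothetical head outputs; in the network they come from the 400-d hidden feature.
e_s, mu_s, sigma_s = encode_head([0.1] * 50, [0.0] * 50, rng)    # shape, R^50
e_p, mu_p, sigma_p = encode_head([0.0] * 72, [-1.0] * 72, rng)   # pose, R^72
```

With this activation the standard deviation is bounded by 2 and equals 1, the prior's value, when the pre-activation is zero.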

Our decoder follows the design of Eq. (7). There are two paths called base path and difference path. Each path takes $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ as input, and corresponds to $\mathcal{W}(\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p}))$ and $\mathcal{D}(\mathbf{e}_{s},\mathbf{e}_{p})$ in Eq. (7), respectively. The decoder outputs $\hat{\mathbf{f}}$ by summing reconstructed base feature $\mathbf{b}$ and difference feature $\mathbf{d}$ of the two paths and produces $\hat{\mathbf{g}}$ with $\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p})$ , and $\{\hat{\mathbf{f}},\hat{\mathbf{g}}\}$ aims to reconstruct $\{\mathbf{f},\mathbf{g}\}$ . Meanwhile, the decoder outputs $\{\hat{\mathbf{f}}_{s},\hat{\mathbf{g}}_{s}\}$ by another calculation with $\{\mathbf{e}_{s},\mathbf{0}\}$ as inputs, where $\{\hat{\mathbf{f}}_{s},\hat{\mathbf{g}}_{s}\}$ aim to reconstruct $\{\mathbf{f}_{s},\mathbf{g}_{s}\}$ . The detailed structure of the decoder is given in the Appendix.

The learnable skinning layer $\mathcal{W}$ is introduced to construct the base feature $\mathbf{b}\in{\mathbb{R}^{9|\mathcal{V}|}}$ from the coarse level feature $\mathbf{g}\in{\mathbb{R}^{144}}$ . The skinning method has shown its ability for human body modeling based on Euclidean coordinates [37]. Our learnable skinning layer exploits this method in the feature space. In particular, we use a learnable sparse matrix $\mathbf{W}\in{\mathbb{R}^{|\mathcal{V}|\times 16}}$ to transform the coarse level feature, reshaped as $\mathbf{g}\in{\mathbb{R}^{16\times 9}}$ , to the base feature $\mathbf{b}\in{\mathbb{R}^{|\mathcal{V}|\times 9}}$ , i.e.,

$$\mathbf{b}=\mathbf{W}\mathbf{g},\tag{9}$$

where each row of $\mathbf{b}$ is a convex combination of rows of the coarse-level feature $\mathbf{g}$ . Moreover, we constrain the $i$-th row $\mathbf{W}_{i}$ to be non-zero only on the parts near the $i$-th vertex, to avoid overfitting and non-smooth solutions.
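The skinning step can be sketched on a toy example. The text does not say how the convexity and locality constraints are enforced, so the masked softmax below is an assumption chosen only because it yields non-negative rows that sum to one and vanish outside the allowed parts; the logits and the `nearby` mask are hypothetical.

```python
import math

def masked_softmax(logits, mask):
    """Softmax restricted to entries where mask is True; all others become 0."""
    peak = max(x for x, keep in zip(logits, mask) if keep)
    exps = [math.exp(x - peak) if keep else 0.0 for x, keep in zip(logits, mask)]
    total = sum(exps)
    return [x / total for x in exps]

def skin(W, g):
    """b = W g for W of shape (vertices, parts) and g of shape (parts, dims)."""
    dims = len(g[0])
    return [[sum(w * part[k] for w, part in zip(row, g)) for k in range(dims)]
            for row in W]

# Toy case: 2 vertices, 3 parts, 2 feature dimensions (the paper uses |V|, 16, 9).
logits = [[0.0, 0.0, 5.0], [1.0, 1.0, 1.0]]
nearby = [[True, True, False], [False, True, True]]   # allowed parts per vertex
W = [masked_softmax(l, m) for l, m in zip(logits, nearby)]
g = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
b = skin(W, g)
```

Each row of `b` then lies inside the convex hull of the rows of `g` that the mask allows for that vertex.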

5.2 Loss Function

We use $\ell_{1}$ error for the feature reconstruction:

[TABLE]

Similarly, for coarse-level feature reconstruction, we define

[TABLE]

For the shape and pose embedding, since we use VAE as the encoder, KL divergence losses are needed to regularize the distribution of latent parameters:

[TABLE]

where $q(\mathbf{e}|\mathbf{f})$ is the posterior probability, $p(\mathbf{e})$ is the prior multivariate normal distribution, and $D_{KL}$ is the KL divergence formulation. See [32] for more details of the KL divergence formulation. The total loss is given in the following form:

[TABLE]

The configuration details of all related hyperparameters and the choice of loss function are given in the Appendix.
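The individual loss terms can be sketched as below. The $\ell_{1}$ term and the closed-form KL divergence between a diagonal Gaussian and the standard normal prior are standard. The rest is an assumption: averaging (rather than summing) the $\ell_{1}$ error, and the pairing of the weight names `r1`, `r2`, `rc1`, `rc2`, `s`, `p` with particular terms, are guesses based on the hyperparameter names listed in the Appendix.

```python
import math

def l1(pred, target):
    """Mean absolute error between two feature vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def kl_to_standard_normal(mean, std):
    """KL( N(mean, diag(std^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mean, std))

def total_loss(pred, target, post, w):
    """Weighted sum of four reconstruction terms and two KL terms."""
    rec = (w["r1"] * l1(pred["f"], target["f"])
           + w["r2"] * l1(pred["f_s"], target["f_s"])
           + w["rc1"] * l1(pred["g"], target["g"])
           + w["rc2"] * l1(pred["g_s"], target["g_s"]))
    kl = (w["s"] * kl_to_standard_normal(*post["shape"])
          + w["p"] * kl_to_standard_normal(*post["pose"]))
    return rec + kl
```

Here `post["shape"]` and `post["pose"]` are `(mean, std)` pairs produced by the encoder heads.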

6 Constructing Training Data

To facilitate data-driven 3D human body analysis, we need to have a large number of 3D human mesh models. Thus, we collect data from several publicly available datasets. In particular, SCAPE [5] and FAUST [9] provide meshes of several subjects with different poses. Hasler et al. [21] provide 520 body meshes for about 100 subjects with relatively low resolution. MANO [47] collects the body and hand shapes of several people. Dyna [44] and DFaust [10] release the alignments of several subjects’ movement scan sequences. For the rest-pose body data set, CAESAR database [46] is the largest commercially available dataset that contains 3D scans of over 4500 American and European subjects in a standard pose. Yang et al. [56] convert a large part of the CAESAR dataset to the same connectivity with the SCAPE dataset. All these datasets have very different connectivity structures and different poses for each identity.

Our proposed embedding learning network has two main requirements for the training data. First, the connectivity of the whole dataset must be the same to facilitate the ACAP feature computation. Second, to disentangle human body variations into shape and pose latent embeddings, we need to define a neutral pose as the specific pose that represents the body variations only caused by identity, i.e., intrinsic factors among individuals. In other words, we need to construct a neutral pose mesh for each identity in our dataset.

For the first requirement, we need to convert our collected public datasets, like FAUST [9], SCAPE [5] and Hasler et al. [21] into the same connectivity. Considering vertex density and data amount, we modify the connectivity shared by SCAPE [5] and SPRING [56] to eliminate several non-manifold faces and treat this connectivity as the standard one. Specifically, we set the mesh graph structure with $|\mathcal{V}|=12500$ vertices and 24495 faces, which is much denser than SMPL [37] that has 6890 vertices. We choose one mesh of SCAPE [5] as the reference mesh, as shown in Fig. 1, for the ACAP feature computation.

For the second requirement, SPRING [56] is a dataset with a consistent and simple pose, which can be regarded as our neutral pose.

Connectivity Conversion. We formulate our connectivity conversion problem as a non-rigid registration problem from the reference connectivity to a mesh in a target connectivity dataset. We adopt the data-driven non-rigid deformation method of Gao et al. [15] to solve our problem. First, we define the prior human body deformation space by a base of ACAP features. We use 70 pose meshes of SCAPE [5] to cover the pose variations, and choose 70 shape meshes of different individuals from SPRING [56] to cover the shape variations. With the computed 140 ACAP features (see Section 3), we get a matrix $\mathbf{F}\in\mathbb{R}^{9|\mathcal{V}|\times 140}$ . Then, we extract a sparse base $\mathbf{C}\in\mathbb{R}^{9|\mathcal{V}|\times K}$ from $\mathbf{F}$ , by using the sparse dictionary learning method [40]. Unlike [15], we extract the sparse base based on human body parts instead of automatically selecting the basis deformation center. See Fig. 1 for the segmentation of human body parts. In this way, we can now use a vector $\mathbf{w}\in{\mathbb{R}^{K}}$ to obtain an ACAP feature: $\mathbf{f}(\mathbf{w})=\mathbf{C}\mathbf{w}.$

Second, we manually mark a set of corresponding vertices between the reference and the target connectivity, denoted as $\{i,l(i)\},i\in{\mathcal{L}}$ , where $\mathcal{L}$ is the index set of markers on our reference connectivity and $l(i)$ represents the index of the corresponding marker on the target connectivity.

Finally, we formulate our connectivity conversion problem as:

[TABLE]

where $\mathbf{R}$ and $\mathbf{t}$ represent the rotation and translation of the global rigid transformation, $E_{\textrm{icp}}$ is the point-to-plane ICP energy, $\mathbf{n}_{l(i)}$ is the normal of vertex $\mathbf{v}_{l(i)}$ on the target mesh, $\mathbf{p}_{i}$ is a vertex to be optimized on the reference mesh connectivity, $\mathbf{q}_{i}$ is a vertex of the reference mesh, and $E_{\textrm{lan}}$ is for sparse landmark constraints. $E_{\textrm{prior}}$ is the formulation from Gao et al. [15], which uses the extracted sparse deformation base $\mathbf{C}$ to generate transformation $\mathbf{T}_{i}(\mathbf{w})$ so as to constrain the movements of $\mathbf{p}_{i}$ . By default, we set $\lambda_{1}$ , $\lambda_{2}$ and $\lambda_{3}$ to 5.0, 1.0 and 0.3, respectively.

By using this connectivity conversion method, we convert 916 meshes from Dyna [44], all 100 meshes of FAUST [9], 517 meshes of Hasler et al. [21] and 852 meshes of MANO [47] to the standard connectivity and align the converted meshes to the reference mesh.

Neutral Pose Construction. We compute the average shape of SPRING [56] as the target neutral pose. For each subject, we choose the posture mesh with the smallest rigid transformation to the target as the reference mesh, and apply ARAP (as rigid as possible) deformation [49] to deform the reference mesh to the target neutral pose. Specifically, we manually label several landmark pairs for both meshes on arms, forearms, legs, spine, etc. Then we use the deviation of orientations determined by each landmark pair on both meshes as soft constraints to deform the posture mesh to the neutral pose with ARAP deformation. In this way, we generate another 135 neutral meshes.

Finally, with the method described above, we obtain 2385 converted pose meshes plus another 70 from SCAPE [5], and 135 deformed neutral meshes plus 3048 from SPRING [56]. We compute their ACAP features $\mathbf{f}$ and corresponding coarse level features $\mathbf{g}$ using the method described in Section 3. After removing a few bad results, we eventually get 5594 feature pairs. We choose the corresponding neutral features $\{\mathbf{f}_{s},\mathbf{g}_{s}\}$ for every pair $\{\mathbf{f},\mathbf{g}\}$ , and construct the final dataset. Then, we randomly choose 160 neutral meshes and 160 pose meshes as testing data, and use the rest as training data. Table I shows the numbers of meshes used from each dataset and in our constructed dataset.
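The counts quoted in this section can be cross-checked with a few lines of arithmetic. The per-dataset numbers come from the text; the number of discarded results and the training-set size are derived here and rest on the assumption that the 5594 pairs cover both pose and neutral meshes and that the 320 test meshes are drawn from them.

```python
# Meshes converted to the standard connectivity (numbers from the text).
converted = {"Dyna": 916, "FAUST": 100, "Hasler et al.": 517, "MANO": 852}

converted_pose = sum(converted.values())    # converted pose meshes
pose_total = converted_pose + 70            # plus the SCAPE pose meshes
neutral_total = 135 + 3048                  # deformed neutral meshes plus SPRING
all_meshes = pose_total + neutral_total

kept_pairs = 5594                           # after removing a few bad results
discarded = all_meshes - kept_pairs         # derived, not stated in the text

test_size = 160 + 160                       # neutral + pose test meshes
train_size = kept_pairs - test_size         # derived under the assumption above

print(converted_pose, pose_total, neutral_total, all_meshes, discarded, train_size)
```

The converted total agrees with the 2385 pose meshes quoted above, and the neutral total agrees with the neutral count listed in Table I.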

7 Use of the Embedding

Once the embedding learning is done, we only need to keep the trained decoder plus the ACAP-to-mesh converter of Eq. (6). We denote this generator by $\mathcal{M}(\mathbf{e}_{s},\mathbf{e}_{p})$ , which takes shape and pose parameters $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ as input and outputs a mesh in the predefined connectivity. For various online applications such as reconstruction, we just need to optimize the low-dimensional embedding $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ so that $\mathcal{M}(\mathbf{e}_{s},\mathbf{e}_{p})$ fits the input data, which could be an image, a video, a point cloud, a mesh, etc.

Let us use the mesh input as an example. Given a mesh with our pre-determined connectivity, we want to find the optimal $\{\mathbf{e}_{s}^{*},\mathbf{e}_{p}^{*}\}$ whose $\mathcal{M}(\mathbf{e}_{s}^{*},\mathbf{e}_{p}^{*})$ best reconstructs the given mesh. Here, we do not use our trained encoder to obtain $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ since the encoder requires converting the given mesh into ACAP features, which is complex and time-consuming. Instead, we optimize $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ directly by only using the decoder:

[TABLE]

where rotation $\mathbf{R}$ and translation $\mathbf{t}$ are the global rigid transformation parameters, ${\mathbf{p}}_{i}$ is the $i$ -th vertex position of the decoded mesh of $\mathcal{M}(\mathbf{e}_{s},\mathbf{e}_{p})$ , and $\mathbf{q}_{i}$ is the $i$ -th vertex of the given mesh. For this optimization with per vertex constraints, we assign $\lambda$ to $1.0\times 10^{6}$ , $\lambda_{\beta}$ and $\lambda_{\theta}$ to $1.0$ . This model generally takes about 200 iterations to achieve millimeter reconstruction accuracy with Adam optimization.
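The decoder-only fitting can be illustrated on a toy problem. This is a sketch under strong assumptions, not the paper's optimization: the decoder is replaced by a hypothetical linear map `A e + c`, the rigid transformation is omitted, shape and pose are merged into a single code, and the objective is assumed to be a squared vertex residual weighted by $\lambda$ plus a squared-norm regularizer on the code. The weights (1.0e6 and 1.0) and the 200 iterations follow the text; the learning rate is a guess.

```python
import math

def decode(A, c, e):
    """Toy linear decoder: stacked vertex coordinates p = A e + c."""
    return [sum(a * x for a, x in zip(row, e)) + ci for row, ci in zip(A, c)]

def objective(A, c, e, q, lam, lam_reg):
    """lam * sum of squared vertex residuals + lam_reg * ||e||^2."""
    data = sum((p - t) ** 2 for p, t in zip(decode(A, c, e), q))
    return lam * data + lam_reg * sum(x * x for x in e)

def gradient(A, c, e, q, lam, lam_reg):
    """Analytic gradient of the objective with respect to e."""
    r = [p - t for p, t in zip(decode(A, c, e), q)]
    return [2.0 * lam * sum(row[j] * ri for row, ri in zip(A, r)) + 2.0 * lam_reg * e[j]
            for j in range(len(e))]

def fit(A, c, q, lam=1.0e6, lam_reg=1.0, steps=200, lr=0.01):
    """Minimise the objective over e with a hand-written Adam update."""
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    e = [0.0] * len(A[0])
    m = [0.0] * len(e)
    v = [0.0] * len(e)
    for t in range(1, steps + 1):
        grad = gradient(A, c, e, q, lam, lam_reg)
        for j, g_j in enumerate(grad):
            m[j] = beta1 * m[j] + (1.0 - beta1) * g_j
            v[j] = beta2 * v[j] + (1.0 - beta2) * g_j * g_j
            m_hat = m[j] / (1.0 - beta1 ** t)      # bias-corrected first moment
            v_hat = v[j] / (1.0 - beta2 ** t)      # bias-corrected second moment
            e[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return e

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [0.0, 0.0, 0.0]
q = decode(A, c, [0.3, -0.2])    # target generated from a known code
e_fit = fit(A, c, q)
```

In practice an automatic-differentiation framework would supply the gradient of the real decoder; the hand-written gradient is only possible here because the toy decoder is linear.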

8 Experiments

In this section, we quantitatively evaluate our model’s capability for reconstruction and present some qualitative results and potential applications. We set several baseline methods for comparison in different tasks. To show the benefit of our hierarchical reconstruction pipeline, we train a baseline architecture called “Baseline” that removes the base path in the decoder. To compare the effect of disentangling shape and pose variations, we train the non-disentangled meshVAE [51] architecture on our dataset. To evaluate the representation ability, we compare our method with the widely used SMPL model [37] and its variant SMPL-X model [42]. We also perform a comparison with the Adam body model [54], even though it is mainly designed to estimate body movement rather than body geometry. We integrate the official gender-neutral model code into the PyTorch framework and implement the optimization in the same framework with the Adam optimizer [31].

Computation Time. Our implementation is based on PyTorch. Our mesh decoder $\mathcal{M}(\mathbf{e}_{s},\mathbf{e}_{p})$ takes about 10 ms to map an embedding to a mesh on a TITAN Xp GPU.

8.1 Quantitative Evaluation

In this section, we evaluate the performance of reconstruction, 3D pose estimation, and alignment on FAUST. For reconstruction, we perform quantitative evaluations on two types of data. The first type of data is from our test dataset, where all the meshes have the same connectivity. We use the mean Euclidean distance of vertices (MED) as the measurement. The second type of data is the general scan data of human bodies. We compute the distance between each point of scan point cloud and the corresponding reconstructed mesh as the measure. The distance is computed with the AABB tree, and we denote this error measurement by PMD (point-mesh distance). Note that all test point clouds are obtained by scanning the human body with an open hand, while the fists of our template body mesh are closed. Thus it is unfair to include the scan points of hand parts when comparing with SMPL and SMPL-X as they have an open hand model. Therefore when computing the PMD values, we ignore the hand part and mainly focus on the body part. For reference, we also report the errors of all points in related tables.
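The two simplest quantities used in this evaluation can be written down directly. The sketch below computes the MED for meshes with shared connectivity and a cumulative error distribution from a list of per-point errors. The point-to-mesh distance itself (PMD, computed with an AABB tree in the paper) is not reproduced here; the function names are illustrative.

```python
import math

def mean_euclidean_distance(pred, target):
    """MED: average distance between corresponding vertices of two meshes."""
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(pred)

def cumulative_error_distribution(errors, thresholds):
    """Fraction of points whose error is at most each threshold (CED curve)."""
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in thresholds]

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
med = mean_euclidean_distance(pred, target)               # (3 + 4) / 2
ced = cumulative_error_distribution([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 4.0])
```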

Reconstruction from Our Test Dataset. We compare the reconstruction capability of our method, Baseline, and meshVAE on our test dataset. We obtain the embedding by solving Eq. (15) for each method. Tab. II reports the MED errors, and Fig. 4 visualizes the reconstruction results and their respective error maps. It can be seen that our method outperforms Baseline and meshVAE. In particular, the MED of our method is lower than that of other methods, which demonstrates the effectiveness of our disentangled and hierarchical architecture design.

Reconstruction from Shape Scan Data. We now test the performance of reconstruction from scan data of human bodies with different identities. We use highly accurate scan data of six males and females with varying body shapes at the same neutral pose. These subjects do not appear in our training dataset, and all wear tight clothes. The scan system includes four Xtion sensors. When we scan a subject, the subject stands in the center of the scene, and the four sensors rotate around the subject. We use the collected multiview RGB and depth data to recover the high-accuracy geometry of the subject.

We label eight corresponding landmarks on the scans and use this sparse correspondence to generate coarse alignment with scan data. Then we use point-to-plane iterative closest point (ICP) optimization. For our method, Baseline, and meshVAE, we use the latent parameter regularization of Eq. (15). As for the optimization of SMPL, SMPL-X, and Adam models, we adopt the pose prior from [8] and the shape regularization to constrain their parameters. All the optimizations are implemented based on the Adam method with PyTorch.

We compute PMD for each point in the scan data, and draw the Cumulative Errors Distribution (CED) curve in Fig. 5. Tab. III gives numerical comparison and Fig. 6 shows two examples on the shape scan dataset. Again, our method has the best reconstruction accuracy.

Reconstruction from General Scan Data. We run the reconstruction for different poses using our method and the baselines on the DFaust [10] dataset. DFaust provides ten subjects with several sequences of motion scans, represented as registered meshes. DFaust contains a few subjects that are also in our training datasets Dyna, FAUST, or MANO. We choose three subjects from DFaust, labeled 50007, 50009, and 50020, as our test set. We remove the subjects that appear in DFaust from our training set, and use 1973 poses and 3021 neutral shapes to re-train our model, the Baseline model, and meshVAE. We sample data from DFaust at 40-frame intervals and finally obtain 108, 65, and 69 test data sets for the three subjects, respectively.

We use a similar point-to-plane ICP registration method with 79 sparse landmarks to carry out a coarse alignment for general pose scan data. For methods that disentangle shape and pose, we perform another optimization by sharing shape parameters among all scan data of one subject, and we denote this approach by a suffix $s$ in the method’s name.

We compute PMD for each point in the scan data and draw the Cumulative Errors Distribution (CED) curve in Fig. 7. Tab. IV gives numerical comparison. Fig. 8 presents several sets of scan data, the reconstructed meshes of our method and SMPL, and the error maps on scan point clouds. It can be seen that our method has the best reconstruction accuracy, and Ours_s achieves the second-best reconstruction accuracy. The results indicate that our method can effectively disentangle shape and pose variations of a human body.

Reconstruction with Sparse Constraints. In this experiment, we test our reconstruction with the constraints of sparse marker points. Motion capture systems usually use sparse markers to capture human movements, and thus the ability to reconstruct 3D human body from sparse markers is important. In particular, we perform the test on the selected data of DFaust. We manually mark 39 landmarks in the registered mesh of DFaust, our template, and SMPL template. We use these sparse corresponding landmarks to reconstruct the mesh and compute PMD errors for scan data. Tab. V shows the numerical results on the test dataset.

Even without careful optimization for locations and offsets of sparse markers on the human body as Mosh [36] did, we still get a similar accuracy as SMPL. Moreover, we select two sparse marker motion sequences from CMU MOCAP (mocap.cs.cmu.edu) to test our method. Fig. 9 shows the reconstruction results. These experimental results indicate that our latent embedding achieves a reasonable dimensionality reduction for the human body shape manifold and can reproduce plausible human body shapes with few marker constraints.

3D Body Pose Estimation from 2D Joints. Although our representation does not define an explicit skeleton like SMPL [37] does, we can still get a rough estimation of a joint by taking the average of manually selected related points on the body mesh. Taking the elbow joint as an example, we select vertices around the elbow as the related points. Using this simple strategy, we can generate estimated positions of joints for the wrist, knee, and others.
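The averaging strategy described above amounts to a centroid over a hand-picked vertex set. In this small sketch the vertex coordinates and the index list `elbow_ids` are hypothetical placeholders, not values from the paper.

```python
def estimate_joint(vertices, related_ids):
    """Average the positions of the vertices selected for one joint."""
    n = len(related_ids)
    return tuple(sum(vertices[i][k] for i in related_ids) / n for k in range(3))

vertices = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (9.0, 9.0, 9.0)]
elbow_ids = [0, 1, 2]    # hypothetical indices of vertices around the elbow
elbow = estimate_joint(vertices, elbow_ids)
```

Because the estimate is a fixed linear function of the vertex positions, it remains differentiable with respect to the decoded mesh.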

Given 2D human joint positions, we can use our representation to reconstruct the 3D human body model by solving

[TABLE]

where rotation $\mathbf{R}$ and translation $\mathbf{t}$ are the global rigid transformation parameters, ${J}_{i}(\mathbf{e}_{s},\mathbf{e}_{p})$ is the $i$-th joint position of the mesh decoded from $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ , $\mathbf{j}_{i}$ is the $i$-th 2D joint position, $\Pi_{K}$ is the given camera projection matrix with intrinsic parameters $K$ , $\rho$ is the robust differentiable Geman-McClure penalty function [17] and $\mathcal{R}$ is the operation computing the relative rotation of two articulated anatomical components from the reconstructed coarse feature $\mathcal{C}(\mathbf{e}_{s},\mathbf{e}_{p})$ . We compute the relative rotations of elbows and knees and use a penalty function $E_{g}$ similar to that of [8] to prevent unnatural bending. We use weights $\lambda$ , $\lambda_{g}$ , $\lambda_{\beta}$ and $\lambda_{\theta}$ to control the importance of each term in the objective function. In our experiments, we set the values of these weights to 55, 400, 5 and 10 as the default configuration.
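For reference, one common parameterization of the Geman-McClure penalty is $\rho_{\sigma}(x)=\sigma^{2}x^{2}/(\sigma^{2}+x^{2})$. The sketch below uses this form as an assumption; the text does not state which scaling or which value of $\sigma$ the authors use, so `sigma` is a hypothetical parameter.

```python
def geman_mcclure(residual, sigma=1.0):
    """Robust penalty: roughly quadratic near zero, saturating at sigma^2."""
    return (sigma ** 2 * residual ** 2) / (sigma ** 2 + residual ** 2)

small = geman_mcclure(0.1)      # close to the squared residual
large = geman_mcclure(1.0e6)    # bounded by sigma^2, so outliers have limited pull
```

The saturation is what makes the reprojection term tolerant of badly detected 2D joints.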

We use the initialization strategy of SMPLify [8] and its experiment configuration on H3.6M [25]. The only difference is that we compute the results at five-frame intervals. In Tab. VI, we give the quantitative results under different configurations.

First, we use ground truth 2D joints as our input and get a mean error of 95.4mm. We think that more abundant pose data can improve the result because our training pose dataset lacks large-scale actions like sitting down. Therefore, we sample the Moshed CMU dataset [38] and use 28600 meshes with abundant poses as our training set to train a new model. As these meshes have the same connectivity with the SMPL template, we use the joints regressor of SMPL to estimate the 3D joints. We use this pose expanded model (Ours_e) to perform an evaluation on H3.6M with ground truth 2D joints and get a mean error of 65.8mm. In Fig. 10, we visualize several results of different sequences with the largest error. We can observe that our estimated body pose is reasonable with the 2D joints, even if it has a notable error due to the ambiguity of joint depth.

Then, to compare with SMPLify [8], we use its supplied estimated 2D joints as input and get a mean error of 86.7mm. The results of SMPLify are better than ours. However, SMPLify utilizes some prior knowledge like a gender-specific model, a specific joint regressor, a collision penalty, and a pose prior, while our method does not utilize any prior knowledge except the training data.

We also estimate 3D pose on the LSP [28] dataset. Some qualitative results are depicted in Fig. 12. The results show that our representation can roughly recover the human body from 2D joint locations in images in the wild.

Performance on FAUST. In this experiment, we evaluate the alignment of our method on the FAUST benchmark [9], which consists of 200 real test scans of human bodies. The ground-truth correspondences of this challenge are not available, and the accuracy evaluation is obtained by submitting correspondence results online.

Given each challenge pair, we use our model to register each scan individually. Then we use the registered meshes as the common domain to establish a point-to-point correspondence between the pair scans.

We adopt the optimization strategy of the connectivity conversion of Eq. (14). Instead of constraining the reconstructed mesh within the shape space of our model, we use our model as a geometry prior and introduce free body mesh points as additional optimization parameters. We optimize our model parameters $\{\mathbf{e}_{s},\mathbf{e}_{p}\}$ and the free vertex coordinates simultaneously, and finally get two registered body meshes for each scan. One is expressed by our model directly and the other is the result mesh with optimized vertex coordinates. Fig. 11 visualizes two examples of registered results of test scans.

The optimization pipeline described above needs some sparse landmarks as initialization. To test the robustness of our method, we first use only the five landmarks estimated by [39] for initialization. This works well for poses without large bending of arms and legs, but it cannot generate good registration results for large-scale poses like a deep squat. Therefore, we add another five landmarks in the areas of the elbows, knees, and buttocks. In this way, with ten landmarks in total, we can get correct registration results for all the 200 test scans.

We report the quantitative comparison with the official ranking in Tab. VII. With manually placed landmarks as supervision, our method can achieve the best performance. As our body mesh does not model the hand part, which might result in large errors, we also report the results of Ours_e model, which achieves better results.

8.2 Qualitative Evaluation

Pose Transfer. To demonstrate the robustness and disentanglement of our proposed model, we use it for pose transfer by retrieving $\mathbf{e}_{s}$ of a body mesh and combining it with $\mathbf{e}_{p}$ from another body mesh to generate a new one. Fig. 13 gives four examples of pose transfer. It can be observed that the generated meshes look natural and have similar pose and identity as the reference meshes.

Global Interpolation. To test the capability of our representation for interpolation between two random people with different poses, we qualitatively compare our method with Baseline, meshVAE, and SMPL. Given the source and target meshes, we first use the reconstruction methods described in Section 8.1 to extract the respective parameter values. Then we linearly interpolate between the source and target parameters to generate a list of new parameter values, and finally we use the decoder to construct the meshes. Fig. 15 shows the front view and the back view of the resulting meshes. The four methods produce plausible results for interpolation (i.e., the interpolation parameter lies within [0,1]), but for extrapolation, SMPL generates weird body movements compared to our method. The pose parameters of SMPL record the relative rotation between two joints, which does not consider human body movement prior. This may explain the weird extrapolation results produced by SMPL.

Bilinear Interpolation. Our representation separates shape and pose parameters, which allows us to perform interpolation on each category of the parameters. For example, given two meshes with different shapes and poses, we first extract their shape and pose parameters, and then we linearly interpolate the shape or pose parameters. Fig. 14 shows the results of such interpolation. We can see that each column has a consistent pose, and each row corresponds to a specific person. Even for extrapolation, the results are reasonably good.
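Interpolating shape and pose independently can be sketched as a grid of codes. The layout follows the description above (one row per shape, one column per pose); the helper names and the toy two-dimensional codes are illustrative, and decoding each code into a mesh is left out.

```python
def lerp(u, v, t):
    """Linear interpolation; t outside [0, 1] gives extrapolation."""
    return [(1.0 - t) * a + t * b for a, b in zip(u, v)]

def bilinear_grid(e_s_a, e_p_a, e_s_b, e_p_b, ts):
    """grid[i][j] = (shape interpolated at ts[i], pose interpolated at ts[j])."""
    return [[(lerp(e_s_a, e_s_b, t_shape), lerp(e_p_a, e_p_b, t_pose))
             for t_pose in ts]
            for t_shape in ts]

e_s_a, e_p_a = [1.0, 0.0], [0.0, 2.0]     # toy codes for mesh A
e_s_b, e_p_b = [3.0, 4.0], [-2.0, 0.0]    # toy codes for mesh B
grid = bilinear_grid(e_s_a, e_p_a, e_s_b, e_p_b, [0.0, 0.5, 1.0])
```

The corner `grid[0][-1]` pairs the shape of A with the pose of B, which is exactly the pose transfer case.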

New Model Generation. Since we encode our shape parameters and pose parameters with VAE architecture separately, we can generate new body models by randomly sampling the two sets of embedding parameters.

In Fig. 16, we generate neutral meshes by randomly sampling on the embedded shape space. The generated shapes have abundant variations. In Fig. 17, we randomly generate pose meshes by sampling on the embedded shape and pose spaces. The generated meshes have plausible and different postures.
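Generating new codes can be sketched as drawing from the prior. The sketch assumes that sampling is done from the standard normal distribution used as the prior in the KL terms, and that a neutral body corresponds to a zero pose code as in the representation defined earlier; the decoder that turns each code into a mesh is omitted.

```python
import random

def sample_embedding(dim, rng):
    """Draw one code from the standard normal prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)
N_SHAPE, N_POSE = 50, 72

# Neutral body: random shape, zero pose.
neutral_code = (sample_embedding(N_SHAPE, rng), [0.0] * N_POSE)
# Posed body: random shape and random pose.
posed_code = (sample_embedding(N_SHAPE, rng), sample_embedding(N_POSE, rng))
```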

Registration to Depth Images. We also fit our representation model to a sequence of depth images. Eq. (16) is adapted for this purpose. To smooth the results in the temporal domain, we apply a smoothness energy to the pose parameters and share one shape parameter for the entire sequence. We use Kinect v2 to collect depth data. For each frame, we convert the depth image to a mesh for the convenience of point-to-plane ICP registration. Besides the depth data, we also use the 3D joint locations predicted by the Kinect v2 SDK. This prediction is not very accurate and only provides a coarse initialization. Fig. 18 shows an example of such registration to a sequence of depth images, where the color images are not used in our algorithm and are shown only for visualization.

9 Limitations

Our work has several limitations. First, while our representation defines a coarse-level shape, it lacks an explicit and simple method for computing the position of body skeleton from latent embedding. To estimate a joint of the skeleton, currently we just average the positions of those mesh vertices related to the joint. This estimation, however, is not very accurate and may introduce errors into the target human pose.

Second, for the neutral pose, we directly use the common pose of SPRING [56]. Nevertheless, the postures of SPRING are not totally consistent. There exist small misalignments in this dataset. For example, arms may have small swings, and heads may have some deviations in their orientation. These misalignments affect learning accuracy.

Third, as the ACAP feature used in the algorithm is defined on triangular meshes, our method cannot be directly applied to other data types such as voxels or point clouds.

Fourth, our method requires mesh data to have consistent connectivity. To use or handle scan data, usually we need to perform time-consuming connectivity conversion. A possible solution to avoid the connectivity conversion is to adopt a self-supervised training loss and a discriminator on decoded ACAP feature. We will explore this problem as future work.

10 Conclusions

We have presented a general framework for learning and reconstructing 3D human body models. A VAE-like architecture is used to learn disentangled human body shape and pose embeddings, and the model is trained end-to-end. A coarse-to-fine pipeline is proposed to reconstruct highly accurate body models. To make full use of the fitting ability of neural networks, we have constructed a large dataset consisting of models with consistent connectivity. These models are represented by neutral shapes corresponding to their identities and deformation information for individual shape variations. Experimental results have demonstrated the advantages of our learned embedding in terms of the accuracy of reconstruction and the flexibility for model recreation. The trained model and the constructed dataset will be made publicly available. We believe that the learned embedding and dataset will be useful for various human body related applications.

Acknowledgments

We thank VRC Inc. (Japan) for sharing the scanned human shape models with us in Fig. 6 and Tab. III. This research is partially supported by National Natural Science Foundation of China (No. 61672481), Youth Innovation Promotion Association CAS (No. 2018495), NTU Data Science and Artificial Intelligence Research Center (DSAIR) (No. M4082285), MoE Tier-2 Grant (2016-T2-2-065, 2017-T2-1-076) of Singapore, and the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the National Research Foundation, Singapore.

Appendix

Network Details.

In our network, all basic transformation modules except for the learnable skinning layer are designed as MLPs, each of which is a stack of unit structures. Each unit is composed of a fully connected layer followed by a $\tanh$ activation function. Tab. VIII gives the detailed information of each MLP in the decoder.
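As a rough illustration of the unit structure described above, the following NumPy sketch stacks fully connected layers, each followed by $\tanh$. The layer sizes, the weight initialization, and the function names `init_mlp` and `mlp_forward` are assumptions made for this example; the actual sizes are those listed in Tab. VIII.

```python
import numpy as np


def init_mlp(layer_sizes, seed=0):
    """Create one (weight, bias) pair per unit for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Illustrative initialization, not taken from the paper.
        weight = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
        params.append((weight, np.zeros(n_out)))
    return params


def mlp_forward(params, x):
    """Apply each unit in turn: a fully connected layer, then tanh."""
    for weight, bias in params:
        x = np.tanh(x @ weight + bias)
    return x


# Hypothetical sizes: a 32-dimensional input mapped to 128 dimensions.
params = init_mlp([32, 64, 128])
out = mlp_forward(params, np.zeros((24, 32)))
print(out.shape)  # (24, 128)
```

Because every unit ends in $\tanh$, the outputs of such a stack lie in $[-1, 1]$.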

Training Details.

The hyperparameters $\lambda_{s}$, $\lambda_{p}$, $\lambda_{r_{1}}$, $\lambda_{r_{2}}$, $\lambda_{r_{c_{1}}}$ and $\lambda_{r_{c_{2}}}$ in Eq. (13) control the trade-off between the KL loss and the reconstruction loss. To find the optimal configuration, we fix $\lambda_{s}$ and $\lambda_{p}$ to 1 and adjust the reconstruction-related hyperparameters. Tab. IX shows the ablation study of these parameters. To balance the KL loss and the reconstruction loss, we use the second configuration to train for about 1600 epochs, and then fine-tune the trained model with the first configuration for another 600 epochs. We set the batch size to 24. Each batch is composed of equal numbers of samples from the Neutral and Pose datasets. The learning rate is set to $1.0\times 10^{-4}$. The entire training can be completed in less than 15 hours on a single NVIDIA TITAN Xp GPU.
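The training setup above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the total loss is taken to be a weighted sum of the six terms, the reconstruction weights are placeholders (the real values are in Tab. IX), sampling is without replacement, and all function and key names are hypothetical.

```python
import random

BATCH_SIZE = 24
# Two-stage schedule from the text: about 1600 epochs with the second
# configuration of Tab. IX, then 600 fine-tuning epochs with the first.
SCHEDULE = [("second configuration", 1600), ("first configuration", 600)]
# lambda_s and lambda_p are fixed to 1 in the text; the other four values
# are placeholders, not the values from Tab. IX.
WEIGHTS = {"s": 1.0, "p": 1.0, "r1": 1.0, "r2": 1.0, "rc1": 1.0, "rc2": 1.0}


def config_for_epoch(epoch, schedule=SCHEDULE):
    """Return the configuration name active at a 0-based epoch index."""
    start = 0
    for name, n_epochs in schedule:
        if epoch < start + n_epochs:
            return name
        start += n_epochs
    raise ValueError("epoch is past the end of the schedule")


def sample_batch(neutral_ids, pose_ids, rng, batch_size=BATCH_SIZE):
    """Draw equal numbers of samples from the Neutral and Pose datasets."""
    half = batch_size // 2
    return rng.sample(neutral_ids, half), rng.sample(pose_ids, half)


def total_loss(terms, weights=WEIGHTS):
    """Weighted sum of KL and reconstruction terms (assumed form)."""
    return sum(weights[name] * terms[name] for name in weights)


rng = random.Random(0)
neutral_part, pose_part = sample_batch(list(range(100)), list(range(200)), rng)
print(len(neutral_part), len(pose_part))  # 12 12
print(config_for_epoch(1600))  # first configuration
```

Keeping the schedule as data makes the switch from the second to the first configuration explicit and easy to adjust, since the first stage is only approximately 1600 epochs.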

Unlike [51, 50], we use $\ell_{1}$ instead of $\ell_{2}$ as the reconstruction loss because we find that the $\ell_{2}$ loss often results in a higher KL loss to achieve equivalent feature reconstruction accuracy. The last row of Tab. IX shows the optimal test error of the model trained with the $\ell_{2}$ loss.
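For reference, the two reconstruction penalties compared above can be written as below. This sketch assumes a mean reduction over feature entries and uses made-up residual values; it only illustrates how the two penalties weight errors and does not reproduce the KL behaviour reported in Tab. IX.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equal-length feature vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Made-up example: three small residuals and one large one.
target = [0.0, 0.0, 0.0, 0.0]
pred = [0.1, 0.1, 0.1, 2.0]
print(l1_loss(pred, target), l2_loss(pred, target))
```

In this example the single large residual contributes most of the $\ell_{2}$ value, whereas the $\ell_{1}$ value grows only linearly with it.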

Bibliography

[1] T. Alldieck, M. Kassubeck, B. Wandt, B. Rosenhahn, and M. Magnor. Optical flow-based 3d human motion estimation from monocular video. In German Conference on Pattern Recognition, pages 347–360. Springer, 2017.
[2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] B. Allen, B. Curless, and Z. Popović. The space of human body shapes: reconstruction and parameterization from range scans. In ACM transactions on graphics (TOG), volume 22, pages 587–594. ACM, 2003.
[4] B. Allen, B. Curless, Z. Popović, and A. Hertzmann. Learning a correlated model of identity and pose-dependent body shape variation for real-time synthesis. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 147–156. Eurographics Association, 2006.
[5] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: shape completion and animation of people. In ACM transactions on graphics (TOG), volume 24, pages 408–416. ACM, 2005.
[6] T. Bagautdinov, C. Wu, J. Saragih, P. Fua, and Y. Sheikh. Modeling facial geometry using compositional vaes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[7] H. Ben-Hamu, H. Maron, I. Kezurer, G. Avineri, and Y. Lipman. Multi-chart generative surface modeling. ACM Transactions on Graphics (TOG), 37(6):215, 2019.
[8] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.