Learning Body Shape and Pose from Dense Correspondences

Yusuke Yoshiyasu; Lucas Gamez

arXiv:1907.11955·cs.CV·July 30, 2019


TL;DR

This paper introduces a novel method for learning 3D human body shape and pose from 2D images using dense correspondences, eliminating the need for 3D annotations or motion capture data.

Contribution

It proposes a 'deform-and-learn' training strategy that leverages dense correspondences and deformable surface registration to learn 3D human models from 2D images without 3D labels.

Findings

01 Successfully learns 3D human shape and pose from 2D images.

02 Does not require 3D pose annotations or motion capture data.

03 Reaches 106.25 mm MPJPE (Table 1), close to HMR [KBJM18] at 106.84 mm, while HMR (paired) attains 87.97 mm.

Abstract

In this paper, we address the problem of learning 3D human pose and body shape from a 2D image dataset, without having to use a 3D dataset (body shape and pose). The idea is to use dense correspondences between image points and a body surface, which can be annotated on in-the-wild 2D images, and to extract and aggregate 3D information from them. To do so, we propose a training strategy called “deform-and-learn”, in which we alternate deformable surface registration and training of deep convolutional neural networks (ConvNets). Unlike previous approaches, our method does not require 3D pose annotations from a motion capture (MoCap) system or human intervention to validate 3D pose annotations.

Tables

Table 1: Comparisons with state of the art. MPJPE [mm] is used as the error metric.

| Kudo et al. [KOMO18] | Rhodin et al. [RSF18] | HMR [KBJM18] | HMR (paired) | Ours | Our cGANs | Ours (refine) |
|---|---|---|---|---|---|---|
| 173.2 | 131.7 | 106.84 | 87.97 | 106.25 | 139.9 | 108.46 |

Table 2: Comparisons to HMR (paired) in terms of per-pixel error and per-vertex error.

| | HMR (paired) | Ours | Ours (refine) |
|---|---|---|---|
| COCO per-pixel err. [pixel] | 13.9 | 18.6 | 12.02 |
| H3.6M per-pixel err. [pixel] | 7.3 | 9.9 | 9.2 |
| H3.6M per-vertex recon. err. [mm] | 75.0 | 102.7 | 97.2 |

Table 3: Comparisons of MPJPE [mm] between training datasets. Note that the first-iteration results are compared.

| MS COCO | Human3.6M | Both |
|---|---|---|
| 181.7 | 147.4 | 137.5 |

Table 4: Comparisons between training strategies. MPJPE [mm] is used as the error metric.

| Single step (Eq. (1)) | Single step (w/o $\mathcal{L}_{\mathrm{adv}}$ and $\mathcal{L}_{\mathrm{joint}}$) | def-learn (1 iter, 200 epochs) | def-learn (5 iter) |
|---|---|---|---|
| n/a | 148.1 | 134.8 | 106.25 |

Equations

$$\mathcal{L} = a\,\mathcal{L}_{\mathrm{dense}} + b\,\mathcal{L}_{\mathrm{geo}} + c\,\mathcal{L}_{\mathrm{adv}}$$

$$\mathbf{x} = s\,\Pi(\mathbf{R}\,X(\mathbf{S},\mathbf{a})) + \mathbf{t}$$

$$\mathcal{L}_{\mathrm{adv}}^{G}$$

$$\mathcal{L}_{\mathrm{adv}}^{D}$$

$$\mathcal{L}_{\mathrm{ratio}} = \sum_{e\in\mathcal{B}} \left\| \frac{l_{e}}{l_{\mathrm{trunk}}} - \frac{\bar{l}_{e}}{\bar{l}_{\mathrm{trunk}}} \right\|^{2}$$

$$\mathcal{L}_{\mathrm{sym}} = \sum_{i,j\in\mathcal{B}_{s}} \left\| l_{i} - l_{j} \right\|^{2}$$

$$\mathcal{L}^{G} = \epsilon\,\mathcal{L}_{\mathrm{adv}}^{G} + (\mathcal{L}_{\mathrm{ratio}} + \mathcal{L}_{\mathrm{sym}})$$

$$\mathcal{L}_{\mathrm{regist}} = \omega_{\mathrm{dense}}\mathcal{L}_{\mathrm{dense}} + \omega_{\mathrm{KP}}\mathcal{L}_{\mathrm{KP}} + \omega_{\mathrm{scale}}\mathcal{L}_{\mathrm{scale}} + \omega_{\mathrm{joint}}\mathcal{L}_{\mathrm{joint}} + \omega_{\mathrm{det}}\mathcal{L}_{\mathrm{det}}$$

$$\mathcal{L}_{\mathrm{dense}} = \sum_{i\in\mathcal{C}} \left\| \mathbf{p}_{i} - \mathbf{x}_{\mathrm{idx}(i)} \right\|^{2}$$

$$\mathcal{L}_{\mathrm{KP}} = \sum_{i\in\mathcal{J}} \left\| x_{i} - \bar{x}_{i} \right\|^{2} + \sum_{i\in\mathcal{J}} \left\| y_{i} - \bar{y}_{i} \right\|^{2} + \sum_{i\in\mathcal{J}} \left\| z_{i} - z_{i}^{\mathrm{GAN}} \right\|^{2}$$

$$\mathcal{L}_{\mathrm{scale}} = \sum_{e\in\mathcal{B}} \left\| S_{e} - S_{\mathrm{adj}(e)} \right\|^{2}$$

$$\mathcal{L}_{\mathrm{joint}} = \sum_{i\in\mathcal{J}} \left\| \mathbf{a}_{i} \right\|^{2} + \sum_{i\in\mathcal{J}^{\prime}} \left\| \exp(\mathbf{a}_{i}) \right\|^{2}$$

$$\mathcal{L}_{\mathrm{det}} = \exp(-\det(\mathbf{R}))$$

$$\mathcal{L}_{\mathrm{conv}} = \alpha\,\mathcal{L}_{\mathrm{regress}} + \beta\,\mathcal{L}_{\mathrm{dense}} + \gamma\,\mathcal{L}_{\mathrm{KP}}$$

$$\mathcal{L}_{\mathrm{regress}} = \sum_{i} \mathrm{smooth}_{L1}(\theta_{i} - \theta_{i}^{\mathrm{anno}})$$


Full text


Y. Yoshiyasu and L. Gamez

CNRS-AIST JRL


1 Introduction

Estimating 3D human pose and body shape from a single image is a challenging yet important problem, with a wide variety of applications such as computer animation and virtual try-on in fashion.

Capturing and modeling 3D body shape and pose has mostly been done in controlled settings using specialized 3D scanners, such as whole-body laser range scanners and motion capture (MoCap) systems. With the progress of deep convolutional neural networks (ConvNets), 3D body shape can be estimated from a single image by regressing the parameters of statistical human body models. Most current methods rely on 3D databases for both body shape and pose, which still require expensive 3D scanning systems to construct and extend the datasets used for training.

Nonetheless, the capability of current methods to express body shape and pose is rather limited, for two main reasons. First, regression of body shape parameters is inherently difficult for deep ConvNets: the mapping between the input image and the parameters of statistical body models is highly nonlinear and currently difficult to learn. The second challenge is the lack of large-scale 3D datasets. In fact, most 3D body shape datasets are limited to the age range from young adult to middle age. Also, MoCap datasets for 3D human pose estimation are limited to a small variety of subjects, since they need a complicated experimental setup in which MoCap and RGB video cameras have to be synchronized. Because those motion data are acquired in a controlled environment, they differ somewhat from the natural poses found in in-the-wild images.

“Can we learn 3D human body shape and pose directly from 2D images?” In this paper, we tackle this challenging problem to bypass the scarcity of 3D datasets by extracting and aggregating 3D information from dense correspondences annotated on 2D images. We propose a strategy called “deform-and-learn”, in which we alternate deformable surface registration, which fits a 3D model to 2D images, and training of a deep neural network that predicts 3D body shape and pose from a single image. Given dense image-to-surface correspondences, the first step, registration, fits a template model to images. The result is then used as supervisory signals of 3D body shape and pose for training deep ConvNets in the second step. These two processes are iterated to improve accuracy. Unlike previous approaches, our method does not require statistical body shape models, 3D pose annotations from MoCap datasets or human intervention to validate 3D poses.

The contributions of this paper are summarized as follows:

• We propose a deform-and-learn training strategy that alternates deformable registration and training of deep ConvNets for estimating human body shape and pose. It uses dense correspondences annotated on 2D images, without having to use a 3D dataset (body shape and pose).

• To design a pose prior from a 2D pose dataset, we propose conditional generative adversarial networks (cGANs) for detecting 3D human joint positions from 2D keypoints. We incorporate geometric constraints in the cGANs to further constrain 3D human pose predictions. The results are used as soft constraints to guide the training of deep ConvNets for body shape and pose estimation.

• We propose a skeleton-based deformable registration technique using back propagation, which can be implemented using a deep learning framework and parallelized on a GPU. With automatic differentiation (autograd), adding and minimizing various types of losses becomes simple, which frees our method from relying on 3D datasets and pre-built 3D statistical models.

• We propose deep ConvNets that predict body shape and pose using scalings of body segments as the body shape representation. With a final refinement step based on deformable registration using dense correspondence predictions, we can further align the body mesh model to an image.

2 Related Work

**Human body shape modeling and surface registration** Previously, 3D body shape was modeled using 3D scanners. The first approach in this line of work is by Allen et al. [ACP03], who fit a template 3D body model to the CAESAR dataset, which contains a couple of thousand subjects, and used principal component analysis (PCA) to model the space of human body shape. Later, several techniques were proposed to extend the method of Allen et al. to handle both body shape and pose variations (such as SCAPE [ASK∗05] and SMPL [BKL∗16]) and even dynamic deformations (e.g., Dyna [PMRMB15]). Nonrigid surface registration techniques have been used in body shape modeling to fit template mesh models to 3D scans [HAWG08, ACP03, SP04, ARV07, YLSL10].

**Estimating 3D joint positions from an image** Early approaches predict 3D joint positions from key points, assuming that almost perfect 2D key points have already been extracted from an image [RKS12]. The first method based on ConvNets directly regresses 3D joint positions from an image [LC14]. Recent techniques achieve higher accuracy with an end-to-end framework that predicts 2D joints with heatmaps and then regresses 3D joint positions or depths from them [ZHS∗17, MRC∗16]. Martinez et al. [MHRL17], on the other hand, proposed a very simple network architecture that maps 2D joint coordinates to 3D joint positions, resulting in two separate networks that together also achieve high accuracy. Pavlakos et al. [PZDD16] used a volumetric heatmap representation, which is ConvNet-friendly and avoids regressing real values in a highly nonlinear manner. Some methods regress kinematic parameters [ZSZ∗16, YSAM18] to preserve human skeletal structures.

**Body shape from an image** A common way to predict 3D human body shape and pose from an image is to employ pre-built statistical human models. Guan et al. [GWBB09] first manually extract 2D keypoints and silhouettes of the human body. The first automatic method was proposed in SMPLify [BKL∗16], where the statistical human model called SMPL was fitted by optimization to 2D keypoints estimated from an image using ConvNets. Tan et al. [TBC17] proposed an indirect approach that learns body shape and pose by minimizing the difference between the estimated and real silhouettes. Tung et al. [TWYF17] proposed a self-supervised motion capture technique that optimizes SMPL body parameters, and Kanazawa et al. [KBJM18] proposed an end-to-end learning system for human body shape and pose based on generative adversarial networks (GANs). More recently, silhouettes [PZZD18, VCR∗18] and part segmentations [OLPM∗18] have been incorporated to improve prediction accuracy. On the other hand, in DensePose [RNI18], uv coordinates and part indices are directly annotated on images to establish dense image-to-surface correspondences, but this is still not a complete 3D representation. The approach most similar to ours is that of Lassner et al. [LRK∗17], who proposed a method to construct a 3D human body shape and pose dataset by fitting a SMPL model to images. Compared to them, our approach does not require statistical 3D pose/shape priors or human intervention to validate pose fits.

**Unsupervised and weakly-supervised approaches** Given 2D keypoints of human joints, Kudo et al. [KOMO18] and Chen et al. [CTA∗19] used conditional generative adversarial networks (cGANs) to estimate 3D human pose from 2D pose datasets only. Rhodin et al. [RSF18] used an auto-encoder to compress multi-view images into latent variables from which they reconstructed 3D human pose, which does not need a large amount of 3D pose supervision.

3 Problem formulation

The goal of our work is to learn a model that can predict 3D body shape and pose from a single image using deep ConvNets, without having to use a 3D dataset. To the best of our knowledge, this paper is the first to achieve this. To that end, we use dense correspondence annotations (Fig. 4) between image points and a body surface, which can be annotated on in-the-wild 2D images and provide rich information about body shape and pose. Compared to silhouettes and part segmentation, dense correspondence annotations are less noisy around boundaries, and they require only a little additional human effort: the annotation time is almost the same as that of part segmentation [RNI18].

Although dense correspondences between a body surface and image points contain rich information, they are not sufficient by themselves for recovering 3D body shape and pose, especially depth. The strategy we take in this paper is to incorporate geometric and kinematic losses imposed on body parameters, as well as an adversarial loss defined from 2D key points, to constrain the space of body shape and pose, since we do not have direct access to a 3D dataset for training. Consequently, the total loss we define and wish to minimize is as follows:

$$\mathcal{L} = a\,\mathcal{L}_{\mathrm{dense}} + b\,\mathcal{L}_{\mathrm{geo}} + c\,\mathcal{L}_{\mathrm{adv}} \tag{1}$$

where $\mathcal{L}_{\mathrm{dense}}$ is the dense correspondence loss, which penalizes inconsistency between the body model and images in terms of dense correspondences, $\mathcal{L}_{\mathrm{geo}}$ is the geometric and kinematic loss for regularization, and $\mathcal{L}_{\mathrm{adv}}$ is the adversarial loss, which constrains the distribution of the predicted poses to be close to that of the 2D keypoint annotations. The weights $a$, $b$ and $c$ control the relative strengths of the terms.

**Deform-and-learn iterative training strategy** Directly minimizing all of the losses in Eq. (1) at the same time is difficult, and in our experience the error stayed high. Instead, we decouple the problem into three components: A) training of conditional generative adversarial networks (cGANs) that predict 3D joints from 2D keypoints; B) optimization of latent body parameters based on image-surface registration; C) learning of body parameters by providing the latent supervision obtained in step B. The first component is trained once and provides soft constraints on 3D joints in the second and third steps. The second and third components are iterated several times to improve accuracy, which we refer to as the “deform-and-learn” training strategy.

We found that decoupling the training phase into three steps and providing supervision on latent variables works effectively. In fact, recent approaches showed that providing supervision on latent body parameters is effective in stabilizing and improving training [OLPM∗18]. Here we reconstruct latent variables from dense annotations by deformable surface registration. Note that deformable registration is a local optimizer and is sensitive to the initial solution. This is why we propose the iterative training strategy “deform-and-learn”, where we alternate between deformable registration and learning. This strategy gradually improves performance by updating the initial solution of the registration phase and then the latent supervision in the learning phase.
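The alternation between steps B and C can be illustrated with a deliberately tiny toy problem. The sketch below is not the authors' implementation: each sample has a single scalar "body parameter", registration is a few gradient steps on a quadratic (a local optimizer that depends on its starting point), and the "ConvNet" is a one-weight linear regressor. All names (`register`, `learn`, `deform_and_learn`) and the data are hypothetical.

```python
# Toy illustration only: one scalar parameter per sample, hypothetical names.

def register(theta_init, target, steps=3, lr=0.2):
    """Step B stand-in: a few gradient steps on (theta - target)^2.
    With so few steps the result still depends on theta_init."""
    theta = theta_init
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)
    return theta


def learn(features, theta_fit):
    """Step C stand-in: least-squares fit of theta = w * feature."""
    num = sum(f * t for f, t in zip(features, theta_fit))
    den = sum(f * f for f in features)
    return num / den


def deform_and_learn(features, targets, iterations=5):
    w = 0.0  # untrained regressor: every sample starts from the neutral value 0
    errors = []
    for _ in range(iterations):
        init = [w * f for f in features]                        # prediction used as initial solution
        fit = [register(i, t) for i, t in zip(init, targets)]   # step B: registration
        w = learn(features, fit)                                # step C: learning from fitted values
        pred = [w * f for f in features]
        errors.append(sum(abs(p - t) for p, t in zip(pred, targets)) / len(targets))
    return w, errors


features = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]  # hidden relation: theta = 2 * feature
w, errors = deform_and_learn(features, targets)
```

In this toy setting the error shrinks at every round because each registration starts closer to its optimum than the previous one did, which mirrors the motivation given above for iterating the two steps.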

**Body shape and pose model** To fit a template mesh model to an image, we use a skeleton-based parametric deformable model, which is a modified version of SMPL [BKL∗16]. The template mesh consists of $n$ vertices, where $n$ is 6980 in this paper. The vertex positions of the template, $\mathbf{v}_{1}\ldots\mathbf{v}_{n}$, are denoted by an $n\times 3$ matrix, $\mathbf{v}=[\mathbf{v}_{1}\ldots\mathbf{v}_{n}]^{\mathrm{T}}$. The pose of the body is defined by a skeleton rig with 23 joints, where the pose parameters ${\mathbf{a}}\in\mathbb{R}^{24\times 3}$ are defined by the axis-angle representation of the relative rotation between segments. The body model is posed by the joint parameters ${\mathbf{a}}$ via forward kinematics. Instead of using a low-dimensional shape space as in [BKL∗16], which can be learned from thousands of registered 3D scans, we model body shape with segment scales ${\mathbf{S}}\in\mathbb{R}^{24}$. This way, body shape can be modeled more flexibly without the need for 3D body scans, since it is not confined to the space of statistical models. Using linear blend skinning, the body deformation model is defined as a function ${\mathbf{v}}=X({\mathbf{S}},{\mathbf{a}})$.

We use the weak-perspective camera model and solve for the global rotation ${\mathbf{R}}\in\mathbb{R}^{3\times 3}$, translation ${\mathbf{t}}\in\mathbb{R}^{2}$ and global scale $s\in\mathbb{R}$. Rather than using another rotation representation such as axis angle, we directly optimize a rotation matrix with 9 parameters, because it represents orientations uniquely in 3D space. Since the optimized matrix can deviate from a true rotation matrix, we apply Gram-Schmidt orthonormalization to it. Thus the total number of parameters representing the human body is 108 (72 for ${\mathbf{a}}$, 24 for ${\mathbf{S}}$, 9 for ${\mathbf{R}}$, 1 for $s$ and 2 for ${\mathbf{t}}$), ${\mathbf{\theta}}=[{\mathbf{a}},{\mathbf{S}},{\mathbf{R}},s,{\mathbf{t}}]$. With the body parameters ${\mathbf{\theta}}$, deformation and projection of vertices ${\mathbf{v}}=X({\mathbf{S}},{\mathbf{a}})$ into an image is achieved as:

$$\mathbf{x} = s\,\Pi(\mathbf{R}\,X(\mathbf{S},\mathbf{a})) + \mathbf{t}$$

where $\Pi$ is an orthogonal projection.
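As a rough sketch of the camera model, the code below orthonormalizes a free 3x3 matrix with Gram-Schmidt and then applies the weak-perspective projection $\mathbf{x} = s\,\Pi(\mathbf{R}\mathbf{v}) + \mathbf{t}$ to already-deformed vertices. Two points are assumptions rather than details from the paper: the rows (not the columns) of the matrix are orthonormalized, and $\Pi$ is taken to drop the third (depth) coordinate. The function names are hypothetical.

```python
import math


def gram_schmidt(m):
    """Orthonormalize the rows of a 3x3 matrix given as a list of 3 rows.
    Assumption: rows are processed in order; the paper does not say rows or columns."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(u):
        n = math.sqrt(dot(u, u))
        return [a / n for a in u]

    r0 = normalize(m[0])
    r1 = normalize([a - dot(m[1], r0) * b for a, b in zip(m[1], r0)])
    r2 = normalize([a - dot(m[2], r0) * b - dot(m[2], r1) * c
                    for a, b, c in zip(m[2], r0, r1)])
    return [r0, r1, r2]


def project(vertices, R, s, t):
    """Weak-perspective projection x = s * Pi(R v) + t for each 3D vertex v.
    Pi is assumed to keep the first two coordinates and discard depth."""
    projected = []
    for v in vertices:
        rv = [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]
        projected.append((s * rv[0] + t[0], s * rv[1] + t[1]))
    return projected


# A matrix that is not a rotation is mapped to an orthonormal one (here the identity).
R = gram_schmidt([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 3.0]])
points_2d = project([(1.0, 2.0, 3.0)], R, 2.0, (10.0, 20.0))
```

Note that Gram-Schmidt alone does not fix the sign of the determinant, which is consistent with the separate determinant loss introduced later for the registration step.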

4 Overview

The overview of our approach is depicted in Fig. 1. We train conditional generative adversarial networks (cGANs) that predict 3D joint positions from 2D joint positions, which guide the registration and the training of deep ConvNets for body shape and pose (Section 5). The deform-and-learn training strategy alternates deformable surface registration, which fits a 3D model to 2D images, and training of a deep neural network that predicts 3D body shape and pose from a single image (Section 6). At the very beginning, registration starts from the T-pose, ${\mathbf{\theta}}_{0}$. Given image-surface dense correspondences, the first registration step fits a template model to images. After registration, we obtain a collection of body parameters ${\mathbf{\theta}}_{\mathrm{fit}}$, which is then used as supervisory signals ${\mathbf{\theta}}_{\mathrm{anno}}$ to train deep ConvNets that predict body parameters ${\mathbf{\theta}}_{\mathrm{conv}}$ (Section 7). The results are used as initial poses of surface registration in the next round. This training process is iterated several times to get better results.

In the inference phase, we optionally perform refinement based on deformable surface registration. We first use the trained deep ConvNets to predict body shape and pose parameters, and then refine the result using the registration technique, starting from the ConvNet result as the initial solution. Thus the components used here are the same as in the training phase, except that the order is flipped.

5 Conditional generative adversarial networks for 3D human pose with geometric constraints

We propose conditional generative adversarial networks (cGANs) to predict depths of joints from 2D keypoints in an unsupervised manner. The results of the generator are used as soft constraints to guide image-surface registration in the next section.

We take a similar approach to Kudo et al. [KOMO18] and Chen et al. [CTA∗19], where the 3D joint positions produced by a generator network ($G$) are projected to the image plane to obtain 2D joint positions and a discriminator ($D$) judges real or fake in 2D image space. The key difference of our model from previous approaches [KOMO18] is that it incorporates geometric constraints, such as bone symmetry constraints, to further constrain the solution space. The network architecture is depicted in Fig. 2. The input to the generator is the 2D key points of $N$ joints and the output is the depths of those joints. The predicted depth values $z_{i}$ are then concatenated with the $x_{i}$ and $y_{i}$ coordinates, rotated around the vertical axis and projected to the image space. The discriminator takes the projected joint positions as fake and the 2D keypoint data as real. For both networks, we use a multi-layer perceptron (MLP) with eight linear layers to map 2D coordinates to depths and to a binary class, respectively.

Let ${\mathbf{u}}$ be the 2D joint positions of a skeleton. Also let us denote an angle around the vertical axis as $\phi$ . Our 3D human pose cGANs uses the following standard adversarial loss functions for $G$ and $D$ :

[TABLE]

where $f$ denotes the rotation and the projection function. Note that we validate the pose from multiple views where we used angles [deg], $\phi=\{45,60,90,135,180,235,270\}$ for each pose.
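A minimal sketch of what the rotation-and-projection function $f$ could look like is given below. It assumes that the vertical axis is the $y$ axis and that the projection is orthographic (the depth after rotation is dropped); neither convention is stated explicitly in the text, and the function name is hypothetical. The view angles are the ones listed above.

```python
import math

# View angles in degrees, as listed in the text.
VIEW_ANGLES_DEG = [45, 60, 90, 135, 180, 235, 270]


def rotate_and_project(joints_xy, depths, phi_deg):
    """Lift 2D joints with predicted depths, rotate about the vertical (y) axis
    by phi_deg, and project back to 2D by dropping the rotated depth."""
    phi = math.radians(phi_deg)
    c, s = math.cos(phi), math.sin(phi)
    return [(x * c + z * s, y) for (x, y), z in zip(joints_xy, depths)]


# One fake 2D pose per view angle would be passed to the discriminator.
joints_xy = [(1.0, 2.0), (0.0, 1.0)]
depths = [3.0, -1.0]
fake_views = [rotate_and_project(joints_xy, depths, a) for a in VIEW_ANGLES_DEG]
```

Under these assumptions, a view rotated by 90 degrees shows the predicted depth directly as the horizontal coordinate, which is why implausible depths become visible to a discriminator that only ever sees 2D poses.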

In addition to the adversarial loss, a geometric loss is also applied. Specifically, we use the bone symmetry loss $\mathcal{L}_{\mathrm{sym}}$, which constrains the left and right limbs to have similar lengths, and the bone ratio loss $\mathcal{L}_{\mathrm{ratio}}$, which minimizes the difference between the predicted normalized bone lengths and those of the dataset. The bone ratio loss $\mathcal{L}_{\mathrm{ratio}}$ is defined as:

$$\mathcal{L}_{\mathrm{ratio}} = \sum_{e\in\mathcal{B}} \left\| \frac{l_{e}}{l_{\mathrm{trunk}}} - \frac{\bar{l}_{e}}{\bar{l}_{\mathrm{trunk}}} \right\|^{2}$$

where $\frac{l_{e}}{l_{\mathrm{trunk}}}$ is the ratio of the length of bone $e$, in the set of bones $\mathcal{B}$ of a skeleton, to the trunk length and $\frac{\bar{l}_{e}}{\bar{l}_{\mathrm{trunk}}}$ is that of the average skeleton. Let $\mathcal{B}_{s}$ be the set of symmetric pairs of bone segments, which contains the indices of bones such as the left and right forearm. Then the bone symmetry loss $\mathcal{L}_{\mathrm{sym}}$ is defined as:

$$\mathcal{L}_{\mathrm{sym}} = \sum_{i,j\in\mathcal{B}_{s}} \left\| l_{i} - l_{j} \right\|^{2}$$

where $l_{i}$ and $l_{j}$ are the lengths of the bones in a symmetric pair. We combine the above losses to train the generator such that the loss is:

$$\mathcal{L}^{G} = \epsilon\,\mathcal{L}_{\mathrm{adv}}^{G} + (\mathcal{L}_{\mathrm{ratio}} + \mathcal{L}_{\mathrm{sym}})$$

where $\epsilon$ is the weight for controlling the strength of the adversarial term, which we set to 0.1 in this paper.
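The geometric terms of the generator loss are simple enough to write out directly. The sketch below evaluates the bone ratio loss, the bone symmetry loss and their combination with an adversarial term weighted by $\epsilon = 0.1$. The function names, the way bones are indexed and the example numbers are hypothetical, and the adversarial value is passed in as a plain number since its definition is not reproduced here.

```python
def bone_ratio_loss(lengths, trunk, mean_lengths, mean_trunk):
    """Sum over bones of (l_e / l_trunk - mean_l_e / mean_l_trunk)^2."""
    return sum((l / trunk - m / mean_trunk) ** 2
               for l, m in zip(lengths, mean_lengths))


def bone_symmetry_loss(lengths, symmetric_pairs):
    """Sum over symmetric pairs (i, j) of (l_i - l_j)^2."""
    return sum((lengths[i] - lengths[j]) ** 2 for i, j in symmetric_pairs)


def generator_loss(adv, ratio, sym, eps=0.1):
    """L^G = eps * L_adv^G + (L_ratio + L_sym), with eps = 0.1 as in the text."""
    return eps * adv + (ratio + sym)


# Hypothetical skeleton with two bones that form one left/right pair.
lengths = [2.0, 4.0]
ratio = bone_ratio_loss(lengths, trunk=4.0, mean_lengths=[1.0, 2.0], mean_trunk=4.0)
sym = bone_symmetry_loss(lengths, symmetric_pairs=[(0, 1)])
total = generator_loss(adv=1.0, ratio=ratio, sym=sym)
```

Because the ratio loss divides by the trunk length, it is invariant to the overall size of the predicted skeleton, which suits a setting where absolute scale cannot be observed from 2D keypoints.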

6 Image-surface deformable registration

We propose a deformable surface registration technique to fit a template mesh model to images to obtain 3D body shape and pose annotations for training deep ConvNets. Here deformable registration is formulated as a gradient-based method based on back propagation, which can be implemented with a deep learning framework and parallelized on a GPU. With the automatic differentiation mechanisms provided by a deep learning framework, adding and minimizing various kinds of losses becomes easy and straightforward. As a result, the proposed deformable registration technique incorporates kinematic, geometric and correspondence losses.

Given image-surface dense correspondences annotated on images, the template mesh is fitted to images by optimizing body parameters ${\mathbf{\theta}}=[{\mathbf{a}},{\mathbf{S}},{\mathbf{R}},s,{\mathbf{t}}]$ subject to kinematic and geometric constraints. In total, the overall loss function for our registration is of the form:

$$\mathcal{L}_{\mathrm{regist}} = \omega_{\mathrm{dense}}\mathcal{L}_{\mathrm{dense}} + \omega_{\mathrm{KP}}\mathcal{L}_{\mathrm{KP}} + \omega_{\mathrm{scale}}\mathcal{L}_{\mathrm{scale}} + \omega_{\mathrm{joint}}\mathcal{L}_{\mathrm{joint}} + \omega_{\mathrm{det}}\mathcal{L}_{\mathrm{det}}$$

where $\mathcal{L}_{\mathrm{dense}}$ and $\mathcal{L}_{\mathrm{KP}}$ are the dense correspondence and key point losses, which penalize misalignment between the body model and images in terms of dense correspondences and key points. The losses $\mathcal{L}_{\mathrm{scale}}$ and $\mathcal{L}_{\mathrm{joint}}$ are the segment scaling smoothness and kinematic losses for regularization. The transformation determinant loss $\mathcal{L}_{\mathrm{det}}$ makes the determinant of the global transformation positive. In addition, $\omega_{\mathrm{dense}}$, $\omega_{\mathrm{KP}}$, $\omega_{\mathrm{scale}}$, $\omega_{\mathrm{joint}}$ and $\omega_{\mathrm{det}}$ are the respective weights for the above losses. The initialization of body parameters is provided by the predictions of deep ConvNets. For the very first iteration, where ConvNet predictions are not available, the segment scale ${\mathbf{S}}$ is set to 1 for all segments and the pose ${\mathbf{a}}$ is set to 0 for all joints, which means that registration starts from the T-pose.

6.1 Correspondence fit loss

The correspondence loss comprises two losses: the dense correspondence loss $\mathcal{L}_{\mathrm{dense}}$ and the key point loss $\mathcal{L}_{\mathrm{KP}}$.

**Dense correspondence loss** Let us define a set of image-surface correspondences $\mathcal{C}=\{(\mathbf{p}_{1},\mathbf{v}_{\mathrm{idx}(1)})\ldots(\mathbf{p}_{N},\mathbf{v}_{\mathrm{idx}(N)})\}$, where $\mathbf{p}_{i}$ are the image points and $\mathrm{idx}(i)$ is the index of the mesh vertex matched with image point $i$. Now we can define the dense correspondence loss as:

$$\mathcal{L}_{\mathrm{dense}} = \sum_{i\in\mathcal{C}} \left\| \mathbf{p}_{i} - \mathbf{x}_{\mathrm{idx}(i)} \right\|^{2}$$

where a mean squared error (MSE) is used to calculate the loss.

**Key point loss** To produce 3D poses with statistically valid depths, the results of the cGANs are used to guide deformable registration. Instead of attaching a discriminator to the registration framework, the depth values from the cGANs and the ground truth 2D joint coordinates are provided as soft constraints on the positions of the 3D joints through an MSE loss:

$$\mathcal{L}_{\mathrm{KP}} = \sum_{i\in\mathcal{J}} \left\| x_{i} - \bar{x}_{i} \right\|^{2} + \sum_{i\in\mathcal{J}} \left\| y_{i} - \bar{y}_{i} \right\|^{2} + \sum_{i\in\mathcal{J}} \left\| z_{i} - z_{i}^{\mathrm{GAN}} \right\|^{2}$$

where $\bar{x}_{i}$ and $\bar{y}_{i}$ are the ground truth 2D key point coordinates and $z_{i}^{\mathrm{GAN}}$ is the depth at joint $i$ predicted by the cGANs.
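The two correspondence terms can be sketched in a few lines. The code below computes the dense correspondence loss between annotated image points $\mathbf{p}_{i}$ and the projected vertices they are matched to, and the key point loss that ties the $x$, $y$ coordinates of the model joints to 2D annotations and their depths to the cGAN predictions. Following the remark that an MSE is used, both functions average over the number of terms; that normalization, the data layout and the function names are assumptions of this sketch.

```python
def dense_loss(image_points, projected_vertices, idx):
    """Mean squared 2D distance between each image point p_i and the projected
    vertex x_idx(i) it is matched to. idx[i] is the matched vertex index."""
    total = 0.0
    for i, (px, py) in enumerate(image_points):
        vx, vy = projected_vertices[idx[i]]
        total += (px - vx) ** 2 + (py - vy) ** 2
    return total / len(image_points)


def keypoint_loss(joints_3d, keypoints_2d, gan_depths):
    """Mean over joints of (x - x_bar)^2 + (y - y_bar)^2 + (z - z_GAN)^2."""
    total = 0.0
    for (x, y, z), (kx, ky), kz in zip(joints_3d, keypoints_2d, gan_depths):
        total += (x - kx) ** 2 + (y - ky) ** 2 + (z - kz) ** 2
    return total / len(joints_3d)


# Two annotated pixels matched to vertices 0 and 1 of a three-vertex toy mesh.
d = dense_loss([(0.0, 0.0), (3.0, 4.0)],
               [(1.0, 0.0), (0.0, 0.0), (3.0, 4.0)],
               [0, 1])
k = keypoint_loss([(1.0, 2.0, 3.0)], [(1.0, 2.0)], [1.0])
```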

6.2 Geometric and kinematic loss

Since we attract the template mesh to 2D image coordinates, the problem is ill-posed and deformations are not constrained. Thus we introduce regularization terms that avoid extreme deformations.

**Segment scaling smoothness** To avoid extreme segment scalings, we introduce the scaling smoothness loss, which minimizes the difference between the scalings of adjacent segments:

$$\mathcal{L}_{\mathrm{scale}} = \sum_{e\in\mathcal{B}} \left\| S_{e} - S_{\mathrm{adj}(e)} \right\|^{2}$$

**Joint angle smoothness and limit loss** To prevent extreme poses, we introduce a joint angle smoothness loss and a joint limit loss. The joint smoothness loss is enforced at every joint in the skeleton, $\mathcal{J}$, and helps to avoid extreme bending. To avoid hyper-extensions, which bend certain joints such as the elbows and knees (denoted by $\mathcal{J}^{\prime}$) in the negative direction, we introduce the joint limit loss. The regularizations that act on joints are thus represented as:

$$\mathcal{L}_{\mathrm{joint}} = \sum_{i\in\mathcal{J}} \left\| \mathbf{a}_{i} \right\|^{2} + \sum_{i\in\mathcal{J}^{\prime}} \left\| \exp(\mathbf{a}_{i}) \right\|^{2}$$

where the first term minimizes joint angles whereas the latter term penalizes rotations violating natural constraints by taking exponential and minimizing it.

**Transformation determinant loss** Since we use a rotation matrix to represent the global rotation at the root, it is necessary to constrain the matrix so that its determinant stays positive. Thus, we define the transformation determinant loss as:

$$\mathcal{L}_{\mathrm{det}} = \exp(-\det(\mathbf{R}))$$
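The three regularizers of this section, $\mathcal{L}_{\mathrm{scale}} = \sum_{e} \| S_{e} - S_{\mathrm{adj}(e)} \|^{2}$, $\mathcal{L}_{\mathrm{joint}} = \sum_{i\in\mathcal{J}} \| \mathbf{a}_{i} \|^{2} + \sum_{i\in\mathcal{J}^{\prime}} \| \exp(\mathbf{a}_{i}) \|^{2}$ and $\mathcal{L}_{\mathrm{det}} = \exp(-\det(\mathbf{R}))$, can be evaluated as in the sketch below. The adjacency list, the choice of limited joints and the elementwise reading of $\exp(\mathbf{a}_{i})$ are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def scale_smoothness_loss(S, adjacency):
    """Sum of squared differences between scales of adjacent segments.
    adjacency is a list of (segment, adjacent_segment) index pairs."""
    return sum((S[e] - S[a]) ** 2 for e, a in adjacency)


def joint_loss(angles, limited_joints):
    """Squared norm of every axis-angle vector, plus the squared norm of the
    elementwise exponential for joints with limits (assumed reading)."""
    smooth = sum(x * x for a in angles for x in a)
    limit = sum(math.exp(x) ** 2 for i in limited_joints for x in angles[i])
    return smooth + limit


def det3(R):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (R[0][0] * (R[1][1] * R[2][2] - R[1][2] * R[2][1])
            - R[0][1] * (R[1][0] * R[2][2] - R[1][2] * R[2][0])
            + R[0][2] * (R[1][0] * R[2][1] - R[1][1] * R[2][0]))


def det_loss(R):
    """exp(-det(R)): small for det = +1, large for a reflection with det = -1."""
    return math.exp(-det3(R))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reflection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
```

With these definitions a proper rotation costs $e^{-1}$ while a reflection costs $e$, so gradient descent is pushed away from mirrored solutions, which would otherwise fit the 2D evidence equally well.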

7 Estimating 3D human body shape and pose from a single image

7.1 Deep ConvNets for body shape and pose regression

Using the results obtained by deformable registration as annotations for training deep ConvNets, we regress body shape and pose parameters from an image. We also add the dense correspondence and key point losses of Section 6.1 as additional supervision. In total, we minimize a loss function of the form:

$$\mathcal{L}_{\mathrm{conv}} = \alpha\,\mathcal{L}_{\mathrm{regress}} + \beta\,\mathcal{L}_{\mathrm{dense}} + \gamma\,\mathcal{L}_{\mathrm{KP}}$$

where $\mathcal{L}_{\mathrm{regress}}$ is the regression loss for body parameters and $\alpha$, $\beta$ and $\gamma$ are the respective weights. Letting $\theta_{i}$ be the parameters for the $i$-th sample, the regression loss is defined as:

$$\mathcal{L}_{\mathrm{regress}} = \sum_{i} \mathrm{smooth}_{L1}(\theta_{i} - \theta_{i}^{\mathrm{anno}})$$

where $\theta_{i}^{\mathrm{anno}}$ is the annotation provided by the registration step. Here we use the smooth L1 loss because of its robustness to outliers. This choice was more effective than the L2 loss at decreasing the error during the iterative training strategy in the presence of potential outliers and noisy annotations.
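To make the robustness argument concrete, the sketch below implements the common piecewise form of the smooth L1 loss (quadratic below a threshold, linear above it, which is also the default behavior of PyTorch's smooth L1 loss). The text does not spell out which variant or threshold was used, so the threshold `beta = 1.0` and the function names are assumptions.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic for |x| < beta, linear otherwise (assumed threshold beta = 1)."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * x * x / beta
    return ax - 0.5 * beta


def regress_loss(theta, theta_anno):
    """Sum of smooth L1 penalties between predicted and annotated parameters."""
    return sum(smooth_l1(p - a) for p, a in zip(theta, theta_anno))


small_residual = smooth_l1(0.5)   # behaves like 0.5 * x^2
large_residual = smooth_l1(3.0)   # grows linearly, unlike 0.5 * x^2 = 4.5
loss = regress_loss([0.5, 3.0], [0.0, 0.0])
```

Because large residuals are penalized linearly, a badly registered sample contributes a bounded gradient, so noisy annotations from the registration step pull the regressor less strongly than they would under an L2 loss.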

The body model is similar to the one we used for registration, except for the pose representation: we found that using quaternions rather than axis angles improved the stability and convergence of training, probably because quaternion components lie between -1 and 1 and are easier for ConvNets to learn than axis angles. The other parameters are the same as those used in Section 6, which results in 132 parameters in total. Note that the global rotation is regressed using 9 parameters and Gram-Schmidt orthogonalization is used to turn the transformation into a rotation. We use ResNet50 [HZRS15] pretrained on the ImageNet dataset as the base network.
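The two parameter counts quoted in the paper (108 for registration, 132 for the ConvNet) follow from simple bookkeeping, reproduced below. The breakdown assumes 24 pose entries, matching ${\mathbf{a}}\in\mathbb{R}^{24\times 3}$, with 3 values each for axis angle and 4 each for quaternions; the constant and function names are hypothetical.

```python
NUM_SEGMENTS = 24        # rows of the pose matrix a and entries of the scale vector S
ROTATION_PARAMS = 9      # global rotation R as a full 3x3 matrix
SCALE_PARAMS = 1         # global scale s
TRANSLATION_PARAMS = 2   # 2D translation t


def count_parameters(values_per_joint):
    """Total size of theta = [a, S, R, s, t] for a given pose representation."""
    pose = NUM_SEGMENTS * values_per_joint
    shape = NUM_SEGMENTS
    return pose + shape + ROTATION_PARAMS + SCALE_PARAMS + TRANSLATION_PARAMS


axis_angle_total = count_parameters(3)   # representation used for registration
quaternion_total = count_parameters(4)   # representation regressed by the ConvNet
```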

7.2 Inference and final refinement based on registration

During the inference phase, there are two steps: 3D body parameter prediction and skeleton-based deformation. Since body shape/pose parameters are highly non-linear and are difficult to regress and predict accurately using deep ConvNets, we optionally provide a way to refine ConvNet predictions. This is based on the deformable registration technique proposed in Section 6. In order to define the dense correspondence term, we use DensePose [RNI18] to obtain dense uv maps and part indices (Fig. 3), which are then converted to image-surface correspondences. In addition, the simple baseline 2D human pose detector [XWW18] is used to obtain 2D human joint positions from an image and to define the key point loss. The pre-trained models from [RNI18] and [XWW18] are used.

8 Experimental results

8.1 Implementation and training detail

Our method is implemented using PyTorch. Training takes 2-3 days using an NVIDIA Quadro P6000 graphics card with 24 GB memory. We use the Adam optimizer for all the steps in our approach. The multi-view cGANs are trained for 60 epochs with a batch size of 1024 and a learning rate of 0.0002. At each iteration, the body regressor is trained for 50 epochs with a batch size of 30 and a learning rate of 0.0001. From the 1st to the 4th iteration of training, we used both the Human3.6M dataset and the MS COCO dataset. At the last iteration (5th), we fine-tune the network on the Human3.6M dataset only. We set the parameters in the loss function to $\alpha=\gamma=1$ and $\beta=10$. For deformable surface registration, we use a learning rate of 0.1 and a batch size of 10. We empirically set the parameters to $\omega_{\mathrm{dense}}=1000$, $\omega_{\mathrm{KP}}=1$, $\omega_{\mathrm{scale}}=10$, $\omega_{\mathrm{joint}}=0.001$ and $\omega_{\mathrm{det}}=1$. For the first training iteration, in order to recover the global rotation, we set $\omega_{\mathrm{scale}}=100$ and $\omega_{\mathrm{joint}}=1$ to make the body model stiff, which is a common strategy in deformable registration [ARV07]. We perform 300 forward-backward passes during the registration step at the 1st iteration. From the second iteration onward, 100 forward-backward passes were sufficient, since we start from the ConvNet prediction.

8.2 Dataset

**DensePose** The DensePose dataset [RNI18] contains images with dense annotations of part-specific UV coordinates (Fig. 4), which are provided on the MS COCO images. To obtain part-specific UV coordinates, the body surface of a SMPL human body model is partitioned into 24 regions and each of them is unwrapped so that vertices have UV coordinates. Thus, every vertex on the model has a unique parameterization. Based on this, images are manually annotated with part indices and UV coordinates to establish dense image-to-surface correspondences. To use these dense correspondences in 3D model fitting, we find, for every annotated image pixel, the closest surface vertex in the UV coordinates of the same part. The nearest neighbor search is done in this direction because image pixels are usually coarser than surface vertices. We were able to obtain approximately 15k annotated training images with the whole body contained in the image region.
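The conversion from part-specific UV annotations to vertex indices can be sketched as a nearest-neighbor search restricted to each body part. The brute-force search, the tuple layout `(part, u, v)` and the function name below are assumptions for illustration; a spatial index would be the usual choice for a mesh with thousands of vertices.

```python
def match_pixels_to_vertices(pixels, vertices):
    """For each annotated pixel (part, u, v), return the index of the closest
    vertex in UV space among vertices of the same part, or None if the part
    has no vertices. vertices[k] = (part, u, v) for vertex index k."""
    by_part = {}
    for index, (part, u, v) in enumerate(vertices):
        by_part.setdefault(part, []).append((index, u, v))

    matches = []
    for part, u, v in pixels:
        candidates = by_part.get(part, [])
        if not candidates:
            matches.append(None)
            continue
        best = min(candidates, key=lambda c: (c[1] - u) ** 2 + (c[2] - v) ** 2)
        matches.append(best[0])
    return matches


# Toy mesh: two vertices on part 1, one vertex on part 2.
vertices = [(1, 0.0, 0.0), (1, 1.0, 1.0), (2, 0.0, 0.0)]
pixels = [(1, 0.9, 0.8), (2, 0.5, 0.5), (3, 0.1, 0.1)]
idx = match_pixels_to_vertices(pixels, vertices)
```

The resulting index list plays the role of $\mathrm{idx}(i)$ in the dense correspondence loss: each annotated pixel is paired with exactly one mesh vertex.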

**Human3.6M ** The Human3.6M dataset [IPOS14] is a large-scale dataset for 3D human pose detection. It contains 3.6 million images of 15 everyday activities, such as walking, sitting and making a phone call, performed by 7 professional actors and captured from four different views. The 3D joint positions captured by MoCap systems are available in the dataset, as are their 2D projections into the images. To obtain dense annotations for this dataset, we use MoSh [LMB14] to obtain SMPL body and pose parameters from the raw 3D MoCap markers and then project the mesh vertices onto the images to get dense correspondences between the images and a template mesh. Note that some of the results are not well aligned to the markers and camera coordinates, resulting in a training dataset of around 17k images with dense correspondence annotations.
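As an illustration of the projection step, the sketch below maps mesh vertices to pixel coordinates. It assumes an ideal pinhole camera with intrinsics `K` and extrinsics `(R, t)` and ignores lens distortion; the text does not state the camera model used, so these symbols and the function name are assumptions.

```python
import numpy as np


def project_vertices(vertices, K, R, t):
    """Project (N, 3) world-space vertices to (N, 2) pixel coordinates.

    K is the 3x3 intrinsic matrix, R the 3x3 world-to-camera rotation and
    t the 3-vector translation. Also returns the camera-space depth of each vertex.
    """
    vertices = np.asarray(vertices, dtype=float)
    K = np.asarray(K, dtype=float)
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)

    cam = vertices @ R.T + t              # world -> camera coordinates
    uvw = cam @ K.T                       # camera -> homogeneous pixel coordinates
    pixels = uvw[:, :2] / uvw[:, 2:3]     # perspective divide
    return pixels, cam[:, 2]
```

Each projected vertex then pairs a pixel location with a known vertex index on the template mesh, which is the form of dense correspondence used for the DensePose annotations.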

**MPII 2D human pose ** The images from the MPII 2D human pose dataset [APGB] are used for testing and were not used in training. The 2D keypoint labels in this dataset, however, were used to train the cGANs.

8.3 Protocol and metric

We followed the same evaluation protocol as previous approaches [PZDD16, ZHS∗17] for evaluation on the Human3.6M dataset, where we use 5 subjects (S1, S5, S6, S7, S8) for training and the remaining 2 subjects (S9, S11) for testing. The error metric for evaluating 3D joint positions is the mean per joint position error (MPJPE), measured in mm. Following [ZHS∗17], the output joint positions from the ConvNets are scaled so that the sum of all 3D bone lengths is equal to that of a canonical average skeleton.
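A minimal sketch of the metric and the bone-length normalization follows. The bone list and the choice to scale about a root joint are assumptions made for illustration; the text only states that the sum of bone lengths is matched to that of a canonical skeleton.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance over joints."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def total_bone_length(joints, bones):
    """Sum of bone lengths; `bones` is a list of (parent, child) joint indices."""
    joints = np.asarray(joints, dtype=float)
    return float(sum(np.linalg.norm(joints[c] - joints[p]) for p, c in bones))


def rescale_to_canonical(joints, bones, canonical_length, root=0):
    """Uniformly scale a skeleton about its root so its total bone length matches."""
    joints = np.asarray(joints, dtype=float)
    scale = canonical_length / total_bone_length(joints, bones)
    return joints[root] + scale * (joints - joints[root])
```

With `pred` and `gt` given in millimetres, `mpjpe(rescale_to_canonical(pred, bones, L), gt)` gives the error after normalization, where `L` is the total bone length of the canonical skeleton.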

We also evaluate the fit of the body model to images using the mean per-pixel error and the mean per-vertex error, which measure the distances from the ground truth to the predicted vertices in 2D image space and in 3D space, respectively. Before calculating the per-vertex error, we obtain a similarity transformation by Procrustes analysis and align the predicted vertices to the ground truth; this is similar to the reconstruction error in 3D joint estimation.
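The Procrustes-aligned per-vertex error can be sketched with the standard SVD-based least-squares similarity fit. This is a generic formulation, not necessarily the exact routine used by the authors, and the function names are hypothetical.

```python
import numpy as np


def similarity_align(pred, gt):
    """Align `pred` (N, 3) to `gt` (N, 3) with the best scale, rotation and translation."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_pred, gt - mu_gt

    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.ones(3)
    if np.linalg.det(vt.T @ u.T) < 0:
        d[-1] = -1.0
    rot = vt.T @ np.diag(d) @ u.T
    scale = np.sum(s * d) / np.sum(p ** 2)
    return scale * p @ rot.T + mu_gt


def per_vertex_error(pred, gt):
    """Mean vertex-to-vertex distance after similarity alignment."""
    aligned = similarity_align(pred, gt)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(aligned - gt, axis=1)))
```

Because the alignment removes global scale, rotation and translation, this error reflects only differences in articulated pose and shape.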

8.4 Qualitative results

In Figs. 5, 10 and 11, we show our results on body shape and pose estimation before and after refinement. As the figures show, our technique can predict 3D body shape and pose from in-the-wild images. Before refinement, the predicted poses are close to the images, but misalignments remain, especially at the hands and feet. After refinement, the mesh is attracted toward the image points based on the dense correspondence predictions.

8.5 Comparison with state-of-the-art

**3D joint position and rotation ** We compared our method with state-of-the-art techniques (Table 1). Here we only consider unsupervised or weakly-supervised techniques that do not use full 3D supervision. Kudo et al. [KOMO18] use conditional GANs to predict depths from 2D joint coordinates; apart from ours, this is the only technique that learns a model from 2D information only. Rhodin et al. [RSF18] use an auto-encoder to compress visual features and reconstruct the 3D pose from them, which does not require a large amount of 3D human poses. Our technique outperforms both in terms of MPJPE (106.25 mm versus 173.2 mm and 131.7 mm) and predicts not only joints but also the orientations of limb segments as well as body shape, represented in the form of segment scales. Note that even our 3D human pose cGANs outperform [KOMO18] (139.9 mm versus 173.2 mm) by incorporating geometric constraints. Our method is on par with HMR (unpaired), which uses 3D pose and body shape datasets to train GANs that provide 3D constraints in an unsupervised learning manner. In contrast, we need dense annotations on 2D images and do not use any 3D annotations such as body shape and pose. Note that HMR (paired) additionally uses images paired with 3D poses for supervised learning, which makes it slightly better than ours (87.97 mm versus 106.25 mm), but this requires an experimental setup with a MoCap system and synchronized video cameras to construct the dataset.

**Per-pixel and per-vertex error ** To evaluate the alignment of the body model to images, we measured the mean per-vertex error and the mean per-pixel error and compared them with HMR (paired), as shown in Table 2. HMR (paired) obtained better results than ours on the Human3.6M dataset in both vertex alignment and pixel alignment, as it uses a large amount of 3D pose data paired with images whereas ours only uses 2D annotations. For this dataset, our refinement was not very effective, probably because there are no large variations in subjects, actions and background. For the MS COCO dataset, our refinement was effective because this dataset is challenging for deep ConvNets to predict body parameters from, due to large variations in background, body shape/pose and clothing. Thanks to the refinement step, we achieve a better fit than HMR in terms of the mean per-pixel error (12.02 versus 13.9 pixels). Fig. 6 shows the comparison between ours and HMR (paired). Our method with refinement produces better alignment than HMR, especially around the feet and hands. Our method captures more natural appearances of body shape and pose, as the prior and constraints used do not come from a 3D dataset that is limited to some age range or captured in a controlled environment.

8.6 Ablation studies

**Loss ** We compared the registration results obtained by varying the losses (Fig. 7). Without $\mathcal{L}_{\mathrm{dense}}$, the alignment between the body model and the image is poor, and even the pose fit is not satisfactory. The losses $\mathcal{L}_{\mathrm{geo}}$ and $\mathcal{L}_{\mathrm{det}}$ play an important role in mitigating distortions. With our 3D human pose cGANs, the depths of a skeleton can be constrained by making the resulting distribution close to that of the data, which can, for example, prevent the body from inclining or reclining.

**Dataset ** We also compared the results of the ConvNets when changing the training dataset, i.e., using dense COCO only, Human3.6M only, and both. The MPJPE results after the 1st iteration are shown in Table 3. Combining MS COCO, which contains in-the-wild images, with Human3.6M, which includes domain knowledge, gives better results than using either dataset alone.

8.7 Is the iterative training strategy effective?

To show the effectiveness of our iterative training strategy, we plot the history of MPJPE in Fig. 8. Here, the MPJPE values after deformable registration are computed on the training dataset. Our deform-and-learn strategy starts from image-surface registration using the T-pose as the initial pose. After the first registration phase, the train-set MPJPE of the registration results is approx. 100 mm. Then, the ConvNets are trained using these registration results as supervision. After 1 iteration, the test-set MPJPE of the ConvNet predictions is 140 mm, which is still relatively high. Next, deformable surface registration is performed again using the ConvNet results as its initialization. These two steps are iterated several times. This strategy is effective in gradually decreasing the error, which is visually noticeable in Fig. 9. In fact, the MPJPE decreased by approximately 30 mm, from around 140 mm to 106 mm.
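The alternation described above can be summarized as a short control loop. This is a schematic sketch only: `register`, `train` and `predict` are hypothetical callables standing in for deformable surface registration, ConvNet training and ConvNet inference, and whether training resumes from the previous weights is left to `train`. The toy stand-ins at the end do no real fitting; they only record the order of calls.

```python
from typing import Any, Callable, Optional, Sequence


def deform_and_learn(
    images: Sequence[Any],
    register: Callable[[Sequence[Any], Optional[Sequence[Any]], int], Sequence[Any]],
    train: Callable[[Sequence[Any], Sequence[Any], int], Any],
    predict: Callable[[Any, Sequence[Any]], Sequence[Any]],
    num_iterations: int = 5,
) -> Any:
    """Alternate registration and ConvNet training (schematic only)."""
    init = None  # None stands for "start every fit from the T-pose"
    model = None
    for iteration in range(1, num_iterations + 1):
        # Deform: fit the body model to each image, starting from `init`.
        fits = register(images, init, iteration)
        # Learn: train the regressor with the registration results as supervision.
        model = train(images, fits, iteration)
        # The predictions initialize the next registration round.
        init = predict(model, images)
    return model


# Toy stand-ins that only log the call order.
call_log = []


def toy_register(images, init, iteration):
    call_log.append(("register", iteration, "T-pose" if init is None else "ConvNet"))
    return [0.0] * len(images)


def toy_train(images, fits, iteration):
    call_log.append(("train", iteration))
    return {"trained_iterations": iteration}


def toy_predict(model, images):
    return [model["trained_iterations"]] * len(images)


final_model = deform_and_learn(
    ["img0", "img1"], toy_register, toy_train, toy_predict, num_iterations=3
)
```

In the toy run, only the first registration starts from the T-pose; every later registration is initialized from the ConvNet predictions of the previous round.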

We also compared different training strategies in Table 4. The single-step learning strategy, which incorporates all the losses from Eq. (1) when training the ConvNets (including the training of a discriminator), poses a difficult problem, and the error stayed high. By omitting the discriminator and the joint angle regularization loss $\mathcal{L}_{\mathrm{joint}}$, we were able to train the model, but the MPJPE was not satisfactory. Also, instead of iterating registration and learning, we tried performing one iteration of deform-and-learn and training longer (200 epochs). This improved the MPJPE slightly, but not as much as five iterations of deform-and-learn. Note also that a longer deformable registration (600 iterations) in a single step improves the train-set MPJPE by only 10 mm (from 100 mm to 90 mm), which is inferior to our deform-and-learn strategy, which achieves a train-set MPJPE of 60 mm after registration.

8.8 Inference time

We measured the time required for the inference phase, which can be divided into five major steps: 3D body parameter prediction, skeleton-based deformation, 2D keypoint prediction (optional), DensePose prediction (optional) and refinement (optional). The 3D body parameter prediction step itself takes only approx. 0.035 sec. The skeleton deformation step also takes approx. 0.035 sec, which means that inference can be performed in approximately 0.07 sec given the cropped image of a human. The other steps required for refinement take 0.03 sec and 0.08 sec for 2D joint prediction and DensePose prediction, respectively. The refinement step takes 5-6 seconds for 50 iterations. Adding up all the steps, our technique including refinement takes roughly 5 to 6 seconds to process one image. Compared to SMPLify [BKL∗16] and its variants [LRK∗17], which take over 30 sec, our technique is faster as we start from a better initial pose and shape.
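The per-step timings can be added up explicitly. The snippet below uses only the figures quoted in the text; the constant and function names are illustrative.

```python
from typing import Tuple

# Per-step timings in seconds, as quoted in the text.
BODY_PARAMETER_PREDICTION = 0.035
SKELETON_DEFORMATION = 0.035
KEYPOINT_PREDICTION_2D = 0.03   # optional, needed for refinement
DENSEPOSE_PREDICTION = 0.08     # optional, needed for refinement
REFINEMENT_RANGE = (5.0, 6.0)   # optional, 50 iterations


def inference_time(with_refinement: bool) -> Tuple[float, float]:
    """Return the (lowest, highest) total time per cropped image in seconds."""
    base = BODY_PARAMETER_PREDICTION + SKELETON_DEFORMATION
    if not with_refinement:
        return base, base
    extra = KEYPOINT_PREDICTION_2D + DENSEPOSE_PREDICTION
    return base + extra + REFINEMENT_RANGE[0], base + extra + REFINEMENT_RANGE[1]
```

Without refinement the total is about 0.07 sec per image; with refinement it lies between about 5.18 and 6.18 sec, so the iterative refinement dominates the cost.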

8.9 Failure cases and limitations

As our refinement step relies on DensePose predictions, the final result will be worse if these predictions are erroneous; for example, DensePose occasionally confuses the left and right hands and feet, which results in distortions. While we represent body shape by segment scales, it is difficult to estimate a child’s shape (Fig. 10), as the body style is very different from that of the template mesh. A mechanism to select template meshes from different body styles would be useful for these cases. As with most other approaches, our method cannot recover the absolute scale of the body shape. Our network is currently designed for estimating the body shape and pose of a single person, and we would like to extend it to multi-person settings.

9 Conclusion

We presented a deep learning technique for estimating 3D human body shape and pose from a single color image. To that end, we proposed an iterative training approach that alternates between deformable surface registration and training of deep ConvNets, which gradually improves the accuracy of predictions by extracting and aggregating 3D information from dense correspondences provided on 2D images. This approach allows us to learn 3D body shape and pose from 2D datasets only, without having to use 3D annotations, which are in general very expensive to obtain. More importantly, as our approach does not rely on statistical body models or 3D annotations captured in a controlled environment, it is not restricted to the space of a pre-built statistical model and can capture the body shape and pose details contained in in-the-wild images better than previous approaches.

In future work, we would like to address the modeling of clothing and details. We are interested in designing a single unified network that can handle everything from 2D detection to 3D body shape/pose prediction at once, which would be more efficient and faster. It would also be beneficial to find a representation of the 3D human body and pose that is better suited to ConvNets than the parametric body shape/pose representation.

Bibliography


[ACP03] Allen B., Curless B., Popović Z.: The space of human body shapes: reconstruction and parameterization from range scans. ACM Trans. Graph. 22, 3 (2003), 587–594.
[APGB] Andriluka M., Pishchulin L., Gehler P., Bernt S.:
[ARV07] Amberg B., Romdhani S., Vetter T.: Optimal Step Nonrigid ICP Algorithms for Surface Registration. In CVPR (2007).
[ASK∗05] Anguelov D., Srinivasan P., Koller D., Thrun S., Rodgers J., Davis J.: SCAPE: shape completion and animation of people. ACM Trans. Graph. 24 (2005), 408–416.
[BKL∗16] Bogo F., Kanazawa A., Lassner C., Gehler P., Romero J., Black M. J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV (2016), Springer, pp. 561–578.
[CTA∗19] Chen C., Tyagi A., Agrawal A., Drover D., Rohith M. V., Stojanov S., Rehg J. M.: Unsupervised 3d pose estimation with geometric self-supervision. CoRR abs/1904.04812 (2019).
[GWBB09] Guan P., Weiss A., Balan A., Black M. J.: Estimating human shape and pose from a single image. In Int. Conf. on Computer Vision, ICCV (2009), pp. 1381–1388.
[HAWG08] Huang Q.-X., Adams B., Wicke M., Guibas L. J.: Non-rigid registration under isometric deformations. In Proceedings of the Symposium on Geometry Processing (2008), pp. 1449–1457.