Analytical Derivatives for Differentiable Renderer: 3D Pose Estimation by Silhouette Consistency

Zaiqiang Wu; Wei Jiang

arXiv:1906.07870·cs.CV·June 20, 2019


TL;DR

This paper introduces an analytical gradient-based differentiable renderer that improves accuracy and efficiency over numerical methods, enabling effective 3D pose estimation from silhouettes without joint supervision.

Contribution

It proposes a novel differentiable mesh renderer with analytical gradients derived from continuous pixel intensity, enhancing 3D pose estimation accuracy and efficiency.

Findings

1. Achieves competitive 3D pose estimation from silhouettes without joint supervision.

2. Outperforms previous differentiable renderers in accuracy and efficiency.

3. Demonstrates effectiveness in multi-viewpoint silhouette-based 3D reconstruction.


Tables

Table 1: Quantitative results compared with other state-of-the-art methods

Method	Per-vertex error (mm)
L2EPS [28] (supervised)	117.7
N3MR [3] (unsupervised)	172.2
Ours (unsupervised)	142.8

Table 2: Quantitative results for different rendering resolutions, with and without anti-aliasing

Resolution	Anti-aliasing	Per-vertex error (mm)
$32 \times 32$	No	181.9
$32 \times 32$	Yes	173.7
$64 \times 64$	No	153.4
$64 \times 64$	Yes	142.8

Table 3: Elapsed time in ms of one iteration of our method and N3MR at different resolutions. The number of SMPL models is set to 1

Resolution	Ours (ms)	N3MR (ms)
$16 \times 16$	17.11	16.08
$32 \times 32$	17.21	16.11
$64 \times 64$	17.63	18.00
$128 \times 128$	18.50	23.64

Table 4: Elapsed time in ms of one iteration of our method and N3MR with different numbers of SMPL models. The rendering resolution is set to $64 \times 64$

Number of SMPL models	Ours (ms)	N3MR (ms)
1	17.63	18.00
2	28.35	29.20
3	36.72	38.96
4	46.65	49.14

Equations

$$I(i,j)=\frac{1}{S}\iint_{\Omega_{i,j}}p(x,y)\,dx\,dy$$

$$I(i,j)=\frac{1}{F^{2}}\sum_{k=1}^{F^{2}}p(x_{k},y_{k})$$

$$\lim_{F\to\infty}\frac{1}{F^{2}}\sum_{k=1}^{F^{2}}p(x_{k},y_{k})=\frac{1}{S}\iint_{\Omega_{i,j}}p(x,y)\,dx\,dy$$

$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{1}{S}\iint_{\Omega_{i,j}}\frac{\partial p(x,y)}{\partial x_{0}}\,dx\,dy$$

$$\alpha(x,y)=Ax+By+C$$

$$p(x,y)=\begin{cases}p_{1}, & \text{if } \alpha(x,y)<0 \text{ and } (x,y)\in\Omega_{0}\\ p_{0}, & \text{if } \alpha(x,y)>0 \text{ and } (x,y)\in\Omega_{0}\end{cases}$$

$$p(x,y)=p_{0}h(\alpha(x,y))+p_{1}h(-\alpha(x,y)),\quad (x,y)\in\Omega_{0}$$

$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{1}{S}\iint_{\Omega_{0}}\frac{\partial p(x,y)}{\partial x_{0}}\,dx\,dy$$

$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{p_{1}-p_{0}}{S}\iint_{\Omega_{0}}\delta(\alpha(x,y))\left(-\frac{\partial\alpha(x,y)}{\partial x_{0}}\right)dx\,dy$$

$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{p_{1}-p_{0}}{S}\iint_{\Omega_{0}}\delta(Ax+By+C)\,(y_{1}-y)\,dx\,dy$$

$$\begin{cases}t=Ax+By\\ k=-Bx+Ay\end{cases}$$

$$\begin{aligned}\frac{\partial I(i,j)}{\partial x_{0}}&=\frac{p_{1}-p_{0}}{S(A^{2}+B^{2})}\int_{k_{0}}^{k_{1}}\left(y_{1}-\frac{Ak-BC}{A^{2}+B^{2}}\right)dk\\ &=\frac{p_{1}-p_{0}}{S(A^{2}+B^{2})}\left(\left(y_{1}+\frac{BC}{A^{2}+B^{2}}\right)(k_{1}-k_{0})-\frac{A(k_{1}^{2}-k_{0}^{2})}{2(A^{2}+B^{2})}\right)\end{aligned}$$

$$\begin{cases}k_{0}=-Bx_{0}^{\prime}+Ay_{0}^{\prime}\\ k_{1}=-Bx_{1}^{\prime}+Ay_{1}^{\prime}\end{cases}$$

$$\frac{\partial I(i,j)}{\partial y_{0}},\quad \frac{\partial I(i,j)}{\partial x_{1}},\quad \frac{\partial I(i,j)}{\partial y_{1}}$$

$$\frac{\partial I(i,j)}{\partial v_{k}}=\begin{cases}\sum_{n=1}^{N_{e}}\frac{\partial I(i,j)}{\partial v_{k}^{n}}, & \text{if } N_{e}>0\\ 0, & \text{if } N_{e}=0\end{cases}$$

$$\begin{aligned}E_{sl}&=\sum_{i=1}^{N_{s}}\|\mathcal{R}_{i}(\hat{P})-S_{i}\|_{2}^{2}\\ &=\sum_{i=1}^{N_{s}}\|\mathcal{R}_{i}(\mathcal{M}(\beta,\theta;\Phi))-S_{i}\|_{2}^{2}\end{aligned}$$

$$E_{spt}=\frac{N_{sec}}{N_{v}}$$

$$E=E_{sl}+\lambda E_{spt}$$

$$E_{p}=\frac{1}{N_{v}}\sum_{i=1}^{N_{v}}\|\hat{P}_{i}-P_{i}\|_{2}$$


Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Computer Graphics and Visualization Techniques

Full text

Analytical Derivatives for Differentiable Renderer: 3D Pose Estimation by Silhouette Consistency

Zaiqiang Wu, Wei Jiang

Zhejiang University

Abstract

Differentiable rendering is widely used in optimization-based 3D reconstruction, which requires gradients from differentiable operations for gradient-based optimization. Existing differentiable renderers obtain the gradients of rendering via numerical techniques, which are of low accuracy and efficiency. Motivated by this fact, a differentiable mesh renderer with analytical gradients is proposed. The main obstacle to making rasterization-based rendering differentiable is the discrete sampling operation. To make rasterization differentiable, the pixel intensity is defined as a double integral over the pixel area, and the integral is approximated by anti-aliasing with an average filter. The analytical gradients with respect to the vertex coordinates can then be derived from the continuous definition of pixel intensity. To demonstrate the effectiveness and efficiency of the proposed differentiable renderer, experiments on 3D pose estimation from multi-viewpoint silhouettes alone were conducted. The experimental results show that 3D pose estimation without 3D or 2D joint supervision is capable of producing competitive results both qualitatively and quantitatively. The results also show that the proposed differentiable renderer is of higher accuracy and efficiency than the previous differentiable renderer it is compared with.

Keywords: inverse graphics, differentiable renderer, 3D pose estimation

1 Introduction

In recent years, convolutional neural networks (CNNs) have achieved appealing results in image understanding tasks such as single-image 3D reconstruction. Differentiable operations are essential for training neural networks with the back-propagation algorithm. For instance, 3D reconstruction in a generative manner requires a differentiable renderer to construct the loss for supervision. However, due to the discrete sampling operation, traditional rendering algorithms (e.g., rasterization and ray tracing [1]) are not differentiable and cannot be directly applied in a 3D reconstruction framework.

Many researchers have worked on differentiating the rendering process so that it can be incorporated into a gradient-based optimization framework. Loper et al. [2] proposed a general-purpose differentiable renderer named OpenDR, which is capable of rendering triangular meshes into images and automatically acquiring derivatives with respect to the model parameters. However, the derivatives of OpenDR are computed by a numerical method, which lacks accuracy. Furthermore, OpenDR is not compatible with existing deep learning frameworks. Kato et al. [3] proposed a differentiable renderer designed for neural networks, but this method still relies on numerical methods to compute derivatives. Liu et al. [4] proposed a differentiable renderer called SoftRas, which focuses only on silhouette rendering; however, this method needs to generate a probability map for each triangle in the mesh, which results in high memory consumption and blurry rendering results.

To address the issues mentioned above, we propose the first differentiable silhouette renderer with analytical derivatives, which is of higher efficiency and accuracy than previous methods. It is worth mentioning that a general-purpose renderer is not necessary in 3D reconstruction tasks, since the illumination and material parameters are usually unknown; a differentiable renderer that focuses on synthesizing silhouettes is therefore enough for supervision. The forward pass of our renderer is similar to rasterization with anti-aliasing. The backward pass, however, differs from previous methods, which depend on access to the rendered frame buffers and obtain derivatives by numerical methods. The highlight of our work is that the derivatives of pixel intensities with respect to the vertex coordinates are obtained by our proposed analytical method, without accessing the frame buffers or applying any numerical method.

To obtain the derivatives of rasterization, each pixel intensity is defined as the average value of the intensity function over the pixel region. This average is obtained by a double integral of the pixel intensity function over the pixel region. Since only the silhouette is considered in this paper, there is no need to deal with self-occlusion. Based on the integral expression of pixel intensity, the expression of the derivatives can be obtained and simplified to an analytical expression without integral forms. With the analytical expression of the derivatives, it is convenient and efficient to implement the backward pass of rendering. Our main contributions are summarized below.

• The analytical expressions of the derivatives of rasterization are derived, and a novel non-numerical approach is proposed to implement the backward pass of a differentiable renderer efficiently.

• Experiments were conducted to demonstrate that our proposed method is of higher accuracy and efficiency than the previous state-of-the-art method.

• The potential of 3D pose estimation by silhouette consistency without 2D and 3D joints is shown in the experiments we conducted.

2 Related Work

2.1 Differentiable Renderer

Computer vision problems have long been viewed as inverse graphics. Computer graphics aims to render an image from the object shape, texture and illumination. In contrast, inverse graphics aims to estimate the object shape, texture and illumination from an input image. Differentiable rendering offers a straightforward and practical technique to infer the parameters of 3D models by gradient-based methods.

Gkioulekas et al. [5] developed an algorithmic framework to infer internal scattering parameters for heterogeneous materials. Gradients are leveraged for optimization to solve this inverse problem; however, this approach is limited to specific illumination problems. Mansinghka et al. [6] proposed a probabilistic graphics model to estimate scene parameters from observations. Loper and Black [2] introduced an approximate differentiable renderer called OpenDR that makes it easy to render a 3D model and automatically obtain derivatives w.r.t. the model parameters. However, OpenDR has no interface to popular deep learning libraries, which makes it difficult to incorporate into a deep learning framework. Kato et al. [3] introduced a differentiable rendering pipeline which approximates the rasterization gradient with a hand-designed function. More recently, Li et al. [7] presented a differentiable ray tracer which is able to compute derivatives of a scalar function over the rendered image w.r.t. arbitrary scene parameters. However, the forward and backward passes of this method are performed by Monte Carlo ray tracing, which makes it time-consuming and impractical to incorporate into a learning-based framework.

With the development of deep learning and CNNs, there is a growing trend for researchers to implement the forward and backward passes of differentiable rendering in a deep learning framework [8, 9, 10, 11, 12, 13, 14, 15]. Nguyen-Phuoc et al. [16] presented RenderNet, a convolutional network which learns the direct map from scene parameters to the corresponding rendered images. However, the shortcoming of RenderNet is that it is computationally expensive, since it is composed of convolutional networks.

In this paper, we focus on exploring a rasterization-based differentiable renderer with analytical derivatives. The main difference between our work and the Neural 3D Mesh Renderer [3] is that, instead of approximating the derivatives with hand-designed functions, we derive an analytical expression that yields the derivatives with higher efficiency and accuracy.

2.2 Single-image 3D reconstruction

Inferring 3D shape from images is a traditional and challenging problem in computer vision. With the surge of deep learning, 3D reconstruction from a single image has become an active research topic in recent years.

Most learning-based approaches learn the mapping from a 2D image to a 3D shape with 3D supervision. Some of these methods predict a depth map to reconstruct the 3D shape [17, 18], while others predict 3D shapes directly [3, 19, 20, 21, 22, 23, 24].

When it comes to 3D pose estimation, statistical body shape models such as SMPL [25] and SCAPE [26] are frequently employed due to their low-dimensional representation. Bogo et al. [27] proposed an iterative optimization-based approach to reconstruct 3D human pose and shape from a single image by minimizing the reprojection error between the 2D image and the statistical body shape model. Pavlakos et al. [28] presented an end-to-end framework to predict the parameters of the statistical body shape model by training CNNs with single images and 3D ground truth.

Since 3D ground truth models are hard to obtain, 3D reconstruction without 3D supervision is also attracting increasing attention. Yan et al. [29] proposed perspective transformer nets (PTN) to infer 3D voxels from silhouette images taken from multiple viewpoints. Recent works predict 3D polygon meshes using a differentiable renderer with only 2D silhouette supervision. We follow these works in the choice of supervision, but we use a statistical body shape model named SMPL to represent the 3D shape of the human body and optimize the 3D pose with the gradients obtained by our proposed differentiable renderer.

3 Analytical derivatives for rasterization

Rasterization is the process of mapping scene geometry described in a vector graphics format to raster images. The main obstacle that prevents rasterization from being differentiable is the discrete sampling operation: pixel intensities are sampled only at the central point of each pixel. Due to the discrete sampling operation and limited resolution, aliasing often appears in the rendered images. Anti-aliasing techniques are used to remove the aliasing effect and smooth the rendered images. In traditional anti-aliasing techniques, an image with higher resolution is rendered and down-sampled to the expected resolution with an average filter. Inspired by this approach, it is natural to assume that if a 3D model is rendered into an image with infinite resolution and down-sampled to the expected resolution using an average filter, the sampling operation will be continuous and differentiable. Since infinite resolution cannot be achieved, the rendering resolution is set to a higher, finite value to approximate the ideal situation. The forward pass of our renderer works the same as the standard graphics pipeline with anti-aliasing, but the backward derivatives are derived under the hypothesis that the image is rendered at infinite resolution and down-sampled to the expected resolution by an average filter.

3.1 Forward rendering

The forward pass of our proposed differentiable renderer follows the standard graphics method [30]. To ensure the consistency between the forward and backward propagation, anti-aliasing is applied to smooth the rendered images.

Rendering a model at infinite resolution and down-sampling to the expected resolution means that each pixel intensity equals the double integral of a scalar function $p(x,y)$ of two variables $x$ and $y$ over the region within the pixel, normalized by the pixel area. The scalar function $p(x,y)$ represents the continuous distribution of intensity in screen space.

Since we only focus on synthesizing silhouettes, there are only two possible values for $p(x,y)$: the foreground intensity $p_{1}$ and the background intensity $p_{0}$. Consider an image with $H$ rows and $W$ columns. The pixel intensity $I(i,j)$ of the pixel in the $i$-th row and $j$-th column can be represented as:

$$I(i,j)=\frac{1}{S}\iint_{\Omega_{i,j}}p(x,y)\,dx\,dy \tag{1}$$

where $\Omega_{i,j}$ represents the region of the pixel in row $i$ and column $j$, and $S$ denotes the area of the pixel region.

However, the integral in Equation 1 is hard to evaluate numerically, so we use anti-aliasing to approximate it, as shown in Figure 1. The anti-aliasing we adopt is fairly rudimentary compared to more modern techniques. With this approach, each pixel is divided into multiple coverage samples, and the intensities at these samples are averaged to produce the intensity of the original pixel. With $F$-times anti-aliasing, i.e., $F^{2}$ samples per pixel, the pixel intensity is obtained as:

$$I(i,j)=\frac{1}{F^{2}}\sum_{k=1}^{F^{2}}p(x_{k},y_{k})$$

where $x_{k}$ and $y_{k}$ represent the coordinates of the $k$-th sampling point in screen space.

As the number of samples grows, this average converges to the integral in Equation 1:

$$\lim_{F\to\infty}\frac{1}{F^{2}}\sum_{k=1}^{F^{2}}p(x_{k},y_{k})=\frac{1}{S}\iint_{\Omega_{i,j}}p(x,y)\,dx\,dy$$

When implementing the code, we set $F$ to $4$ for the tradeoff between accuracy and speed.
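The averaging step can be illustrated with a minimal pure-Python sketch. It is not the authors' C++ implementation: the function names, the regular sub-pixel sample grid, and the brute-force point-in-triangle test are assumptions made for illustration, and coordinates are taken in pixel units so that $S=1$.

```python
def edge_function(ax, ay, bx, by, px, py):
    """Signed area term: > 0 when (px, py) lies to the left of the directed edge a -> b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def inside_triangle(tri, px, py):
    """Point-in-triangle test that accepts either vertex winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    e0 = edge_function(x0, y0, x1, y1, px, py)
    e1 = edge_function(x1, y1, x2, y2, px, py)
    e2 = edge_function(x2, y2, x0, y0, px, py)
    return (e0 >= 0 and e1 >= 0 and e2 >= 0) or (e0 <= 0 and e1 <= 0 and e2 <= 0)


def render_silhouette(triangles, height, width, F=4, p0=0.0, p1=1.0):
    """Average F*F regularly spaced samples of p(x, y) inside every pixel."""
    image = [[0.0] * width for _ in range(height)]
    for i in range(height):          # row index, maps to y
        for j in range(width):       # column index, maps to x
            total = 0.0
            for sy in range(F):
                for sx in range(F):
                    # Sample at the centre of each sub-pixel cell.
                    x = j + (sx + 0.5) / F
                    y = i + (sy + 0.5) / F
                    covered = any(inside_triangle(t, x, y) for t in triangles)
                    total += p1 if covered else p0
            image[i][j] = total / (F * F)
    return image


# A right triangle covering the lower-left half of a 4x4 image (pixel units).
tri = [((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))]
img = render_silhouette(tri, height=4, width=4, F=4)
```

Pixels fully inside or outside the triangle receive $p_{1}$ or $p_{0}$, while pixels crossed by the hypotenuse receive an intermediate value that approximates the covered area fraction.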

3.2 Derivatives computation

With the continuous definition of pixel intensity in Equation 1, the derivatives with respect to the vertices can be derived. Consider an edge consisting of vertices $v_{a}$ and $v_{b}$ located at the boundary of the silhouette, whose coordinates are denoted as $(x_{0},y_{0})$ and $(x_{1},y_{1})$. Assume that this edge intersects the region of the pixel in the $i$-th row and $j$-th column. The partial derivative $\frac{\partial I(i,j)}{\partial x_{0}}$ can be written as:

$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{1}{S}\iint_{\Omega_{i,j}}\frac{\partial p(x,y)}{\partial x_{0}}\,dx\,dy$$

For notational convenience we denote that $A=y_{1}-y_{0}$ , $B=x_{0}-x_{1}$ , $C=x_{1}y_{0}-x_{0}y_{1}$ . The equation of the edge can be represented as:

$$\alpha(x,y)=Ax+By+C \tag{6}$$

Assume that if $\alpha(x,y)<0$, the point $(x,y)$ is in the foreground region, and vice versa. Let $\Omega_{0}$ be an appropriate sub-region of $\Omega_{i,j}$ such that $\Omega_{0}$ only covers the edge connecting $v_{a}$ and $v_{b}$. The intensity distribution function $p(x,y)$ can then be written as:

$$p(x,y)=\begin{cases}p_{1}, & \text{if } \alpha(x,y)<0 \text{ and } (x,y)\in\Omega_{0}\\ p_{0}, & \text{if } \alpha(x,y)>0 \text{ and } (x,y)\in\Omega_{0}\end{cases}$$

The equation above can be simplified with the Heaviside step function $h$ :

$$p(x,y)=p_{0}h(\alpha(x,y))+p_{1}h(-\alpha(x,y)),\quad (x,y)\in\Omega_{0} \tag{8}$$

The partial derivative $\frac{\partial I(i,j)}{\partial x_{0}}$ can be rewritten as:

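$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{1}{S}\iint_{\Omega_{0}}\frac{\partial p(x,y)}{\partial x_{0}}\,dx\,dy \tag{10}$$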

From Equation 8 and Equation 10 we can obtain the partial derivative $\frac{\partial I(i,j)}{\partial x_{0}}$ as:

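$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{p_{1}-p_{0}}{S}\iint_{\Omega_{0}}\delta(\alpha(x,y))\left(-\frac{\partial\alpha(x,y)}{\partial x_{0}}\right)dx\,dy \tag{12}$$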

where $\delta$ denotes the Dirac delta function.

Substituting Equation 6 into Equation 12, the partial derivative $\frac{\partial I(i,j)}{\partial x_{0}}$ can be represented as:

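$$\frac{\partial I(i,j)}{\partial x_{0}}=\frac{p_{1}-p_{0}}{S}\iint_{\Omega_{0}}\delta(Ax+By+C)\,(y_{1}-y)\,dx\,dy \tag{13}$$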

To eliminate the Dirac delta function, we perform the following variable substitution:

$$\begin{cases}t=Ax+By\\ k=-Bx+Ay\end{cases}$$

After variable substitution, Equation 13 can be rewritten as:

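$$\begin{aligned}\frac{\partial I(i,j)}{\partial x_{0}}&=\frac{p_{1}-p_{0}}{S(A^{2}+B^{2})}\int_{k_{0}}^{k_{1}}\left(y_{1}-\frac{Ak-BC}{A^{2}+B^{2}}\right)dk\\ &=\frac{p_{1}-p_{0}}{S(A^{2}+B^{2})}\left(\left(y_{1}+\frac{BC}{A^{2}+B^{2}}\right)(k_{1}-k_{0})-\frac{A(k_{1}^{2}-k_{0}^{2})}{2(A^{2}+B^{2})}\right)\end{aligned}$$

where $A^{2}+B^{2}$ is the squared $L^{2}$ length of the edge, which accounts for the Jacobian of the variable substitution, and $k_{0}$ and $k_{1}$ are the lower and upper limits of integration, obtained with the Liang-Barsky algorithm [31].

To illustrate how the lower and upper limits are determined, the two new endpoints after clipping are denoted as $v_{a}^{\prime}$ and $v_{b}^{\prime}$, as shown in Figure 2, with coordinates $(x_{0}^{\prime},y_{0}^{\prime})$ and $(x_{1}^{\prime},y_{1}^{\prime})$ respectively. The lower and upper limits are then obtained as:

$$\begin{cases}k_{0}=-Bx_{0}^{\prime}+Ay_{0}^{\prime}\\ k_{1}=-Bx_{1}^{\prime}+Ay_{1}^{\prime}\end{cases}$$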
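The clipping step can be sketched in Python as follows. This is an illustrative version of Liang-Barsky segment clipping against an axis-aligned pixel box; the function names and the `(xmin, ymin, xmax, ymax)` box layout are assumptions, not the paper's code.

```python
def liang_barsky(x0, y0, x1, y1, xmin, ymin, xmax, ymax):
    """Clip segment (x0, y0)-(x1, y1) to an axis-aligned box.

    Returns the clipped endpoints ((x0', y0'), (x1', y1')), or None when the
    segment misses the box.
    """
    dx, dy = x1 - x0, y1 - y0
    t_enter, t_exit = 0.0, 1.0
    # Each (p, q) pair encodes one box boundary: the point at parameter t is
    # on the inner side of that boundary while p * t <= q.
    for p, q in ((-dx, x0 - xmin), (dx, xmax - x0), (-dy, y0 - ymin), (dy, ymax - y0)):
        if p == 0.0:
            if q < 0.0:
                return None  # parallel to this boundary and outside it
            continue
        t = q / p
        if p < 0.0:
            t_enter = max(t_enter, t)  # segment enters through this boundary
        else:
            t_exit = min(t_exit, t)    # segment leaves through this boundary
        if t_enter > t_exit:
            return None
    return ((x0 + t_enter * dx, y0 + t_enter * dy),
            (x0 + t_exit * dx, y0 + t_exit * dy))


def integration_limits(va, vb, pixel_box):
    """Limits (k0, k1) of the line integral for edge va -> vb clipped to pixel_box."""
    (x0, y0), (x1, y1) = va, vb
    clipped = liang_barsky(x0, y0, x1, y1, *pixel_box)
    if clipped is None:
        return None
    A, B = y1 - y0, x0 - x1
    (cx0, cy0), (cx1, cy1) = clipped
    return (-B * cx0 + A * cy0, -B * cx1 + A * cy1)


# Edge crossing the unit pixel [0, 1] x [0, 1] from bottom to top.
k_limits = integration_limits((0.3, -1.0), (0.6, 2.0), (0.0, 0.0, 1.0, 1.0))
```

Because $k$ increases along the direction from $v_{a}$ to $v_{b}$, keeping the clipped endpoints in the same order as the original ones gives $k_{0}\leq k_{1}$.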

The same procedure can be easily adapted to obtain the partial derivatives $\frac{\partial I(i,j)}{\partial y_{0}}$ , $\frac{\partial I(i,j)}{\partial x_{1}}$ and $\frac{\partial I(i,j)}{\partial y_{1}}$ as follows.

[TABLE]

With the analytical expressions above, the derivatives can be obtained without any numerical method, which leaves room for improvement in both accuracy and efficiency.
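As a sanity check, the closed-form expression for $\frac{\partial I(i,j)}{\partial x_{0}}$ can be compared against a finite difference of the exact pixel coverage. The sketch below is illustrative only: it assumes a unit pixel $[0,1]\times[0,1]$ with $S=1$, $p_{0}=0$, $p_{1}=1$, an edge that enters through the bottom side and leaves through the top side, and clipped endpoints supplied by hand. The helper `coverage` is a hypothetical reference used only for this special case.

```python
def dI_dx0(va, vb, va_clip, vb_clip, p0=0.0, p1=1.0, S=1.0):
    """Closed-form dI(i,j)/dx0 for one edge va -> vb crossing a pixel of area S."""
    (x0, y0), (x1, y1) = va, vb
    A, B, C = y1 - y0, x0 - x1, x1 * y0 - x0 * y1
    n2 = A * A + B * B  # squared edge length, from the Jacobian of the substitution
    k0 = -B * va_clip[0] + A * va_clip[1]
    k1 = -B * vb_clip[0] + A * vb_clip[1]
    integral = (y1 + B * C / n2) * (k1 - k0) - A * (k1 * k1 - k0 * k0) / (2.0 * n2)
    return (p1 - p0) / (S * n2) * integral


def coverage(x0, y0, x1, y1):
    """Exact intensity of the unit pixel for an upward edge crossing bottom and top sides.

    The foreground (alpha < 0) lies to the left of such an edge, so the covered
    area is the trapezoid between x = 0 and the edge.
    """
    xb = x0 + (x1 - x0) * (0.0 - y0) / (y1 - y0)  # crossing with y = 0
    xt = x0 + (x1 - x0) * (1.0 - y0) / (y1 - y0)  # crossing with y = 1
    return 0.5 * (xb + xt)


va, vb = (0.3, -1.0), (0.6, 2.0)
# The edge crosses the pixel at (0.4, 0) and (0.5, 1).
analytic = dI_dx0(va, vb, va_clip=(0.4, 0.0), vb_clip=(0.5, 1.0))

eps = 1e-6
numeric = (coverage(va[0] + eps, va[1], *vb) - coverage(va[0] - eps, va[1], *vb)) / (2 * eps)
```

For this configuration both values equal $0.5$: moving $v_{a}$ to the right shifts the two crossing points by $\frac{2}{3}$ and $\frac{1}{3}$ of the displacement, whose average is $\frac{1}{2}$.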

3.3 Backward gradients flow

Consider a 3D mesh consisting of a set of vertices $\{v_{1}^{o},v_{2}^{o},\dots,v_{N_{v}}^{o}\}$ and faces $\{f_{1},f_{2},\dots,f_{N_{f}}\}$, where $v_{k}^{o}\in\mathbb{R}^{3}$ represents the position of the $k$-th vertex in the 3D object space and $f_{k}\in\mathbb{N}^{3}$ represents the indices of the three vertices of the $k$-th triangle face. To render this 3D mesh, the vertices $\{v_{k}^{o}\}$ in object space are projected into screen space as vertices $\{v_{k}\}$, $v_{k}\in\mathbb{R}^{2}$.

The scalar loss function over the rendered image for optimization is denoted as $L$. The partial derivatives $\{\frac{\partial L}{\partial I(i,j)}|i=1,\dots,H,j=1,\dots,W\}$ can be computed through an automatic differentiation library. Given the partial derivatives of the loss function $L$ with respect to the pixel intensities $\{\frac{\partial L}{\partial I(i,j)}\}$, our goal is to compute the derivatives of the pixel intensities with respect to the vertices $\{\frac{\partial I(i,j)}{\partial v_{k}}\}$. The derivatives $\{\frac{\partial L}{\partial v_{k}}\}$ can then be obtained by the chain rule, which completes the backward gradient flow.

It should be noted that the gradient flow is sparse, since $\frac{\partial I(i,j)}{\partial v_{k}}\neq 0$ only if at least one edge containing $v_{k}$ intersects the pixel region of $I(i,j)$. We only have to consider the specific $i$, $j$ and $k$ for which $\frac{\partial I(i,j)}{\partial v_{k}}\neq 0$; this allows pixels that contribute no gradient to the current triangle to be skipped when traversing the array of triangles, which improves efficiency.

To efficiently retrieve the pixels that contribute gradient to the current triangle, pixels outside the bounding box of the current triangle are excluded first. The Liang-Barsky clipping algorithm [31] is then used to determine whether a pixel intersects the current triangle. As shown in Figure 3, a pixel intersects the triangle only if at least one edge of the triangle intersects the pixel.

Gradients only flow at the boundary pixels of the silhouette image, so edge detection is performed on the rendered image to determine the pixels that gradients can flow into; computation is required only at the boundary of the silhouette.
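The text does not specify which edge detector is used. One simple possibility, shown here purely as an assumption, is to mark every pixel whose intensity differs from at least one of its 4-neighbours; the function name `boundary_mask` and the tolerance are hypothetical.

```python
def boundary_mask(image, tol=1e-6):
    """Mark pixels whose intensity differs from at least one 4-neighbour."""
    h, w = len(image), len(image[0])
    mask = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                inside = 0 <= ni < h and 0 <= nj < w
                if inside and abs(image[i][j] - image[ni][nj]) > tol:
                    mask[i][j] = True
                    break
    return mask


# Small anti-aliased silhouette: 1.0 inside, 0.0 outside, 0.5 on partially covered pixels.
silhouette = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
mask = boundary_mask(silhouette)
```

This criterion is conservative: it also flags background pixels adjacent to the silhouette, which a subsequent edge-pixel intersection test would discard.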

Consider a boundary pixel in the $i$-th row and $j$-th column. We need to determine the partial derivatives of the pixel intensity with respect to the location of the $k$-th vertex $v_{k}$, denoted as $\frac{\partial I(i,j)}{\partial v_{k}}$. Assume that $N_{e}$ edges containing $v_{k}$ intersect the pixel in row $i$, column $j$. The derivatives of the pixel intensity $I(i,j)$ with respect to the position of $v_{k}$ can be represented as:

$$\frac{\partial I(i,j)}{\partial v_{k}}=\begin{cases}\sum_{n=1}^{N_{e}}\frac{\partial I(i,j)}{\partial v_{k}^{n}}, & \text{if } N_{e}>0\\ 0, & \text{if } N_{e}=0\end{cases}$$

where $\frac{\partial I(i,j)}{\partial v_{k}^{n}}$ represents the derivatives computed by the $n$ -th edge.
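The sum over edges combines with the chain rule into a sparse accumulation over (pixel, vertex) pairs. The sketch below shows one way this could be organized; the sparse list format and the function name are assumptions for illustration, and the per-edge derivatives are taken as given.

```python
def accumulate_vertex_gradients(dL_dI, contributions, num_vertices):
    """Chain rule: dL/dv_k = sum over pixels and edges of dL/dI(i,j) * dI(i,j)/dv_k^n.

    `contributions` is a sparse list of (i, j, k, (dI_dx, dI_dy)) tuples, one per
    edge through vertex k that intersects pixel (i, j). Vertices with no entry
    keep a zero gradient, matching the N_e = 0 case.
    """
    grads = [[0.0, 0.0] for _ in range(num_vertices)]
    for i, j, k, (dI_dx, dI_dy) in contributions:
        grads[k][0] += dL_dI[i][j] * dI_dx
        grads[k][1] += dL_dI[i][j] * dI_dy
    return grads


dL_dI = [[0.0, 2.0], [1.0, 0.0]]
contributions = [
    (0, 1, 0, (0.5, 0.0)),    # first edge of vertex 0 crossing pixel (0, 1)
    (0, 1, 0, (0.25, 0.1)),   # second edge of vertex 0 crossing the same pixel
    (1, 0, 2, (-1.0, 0.5)),   # one edge of vertex 2 crossing pixel (1, 0)
]
grads = accumulate_vertex_gradients(dL_dI, contributions, num_vertices=3)
```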

To verify our method, we used our differentiable renderer to generate per-pixel gradients with respect to translation, rotation and scaling. The visualized results are presented in Figure 4. From the visualized per-pixel gradient images, it can be concluded that our proposed differentiable renderer is able to generate correct gradients with respect to the vertex locations, which enables gradient-based optimization for 3D pose estimation.

4 3D pose estimation

To show the effectiveness of our method, we performed 3D pose estimation experiments based on a statistical body shape model using our proposed differentiable silhouette renderer. Following the work of [27], an iterative optimization-based method is presented to estimate the pose parameters of the statistical body shape model by minimizing the error between reprojected silhouettes and ground truth silhouettes. The images and 3D ground truth used in the experiments are from a 3D pose dataset named UP-3D [33]. Unlike previous works, ground truth 2D and 3D joints are not necessary for the 3D pose estimation experiments in this paper.

4.1 Statistical body shape model

A statistical body shape model named SMPL [25] is employed as our representation of the 3D body. The essential notation of the SMPL model is provided here. The SMPL model can be viewed as a function $\mathcal{M}(\beta,\theta;\Phi)$, where $\beta$ are the shape parameters, $\theta$ are the pose parameters and $\Phi$ are fixed parameters learned from a dataset of body scans [34]. The output of the SMPL function is the set of vertices $P\in\mathbb{R}^{N\times 3}$ of a body mesh, with $N=6890$. The shape parameters $\beta\in\mathbb{R}^{10}$ are the linear coefficients of a small number of principal body shapes. The pose parameters $\theta\in\mathbb{R}^{24\times 3}$ are expressed in axis-angle representation and define the relative rotations between parts of the skeleton. Additionally, the 3D joints $J\in\mathbb{R}^{24\times 3}$ are conveniently obtained as a sparse linear combination of mesh vertices.
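To make the axis-angle pose representation concrete, the following sketch converts each of the 24 joint rotation vectors into a rotation matrix with Rodrigues' formula. It only illustrates the parameterization of $\theta$; it is not the SMPL implementation, and the function name is an assumption.

```python
import math


def axis_angle_to_matrix(rx, ry, rz):
    """Rodrigues' formula: rotation matrix for the axis-angle vector (rx, ry, rz)."""
    angle = math.sqrt(rx * rx + ry * ry + rz * rz)
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = rx / angle, ry / angle, rz / angle  # unit rotation axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t,     x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t,     y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


# Pose theta: 24 joints x 3 axis-angle components (all zeros is the rest pose).
theta = [[0.0, 0.0, 0.0] for _ in range(24)]
theta[1] = [0.0, 0.0, math.pi / 2]  # rotate one joint by 90 degrees about z
rotations = [axis_angle_to_matrix(*r) for r in theta]
```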

In our experiments, the shape parameters $\beta$ are fixed and our goal is optimizing the pose parameters $\theta$ to minimize the errors between the ground truth silhouettes and reprojected silhouettes.

4.2 Data preparation

It is assumed that only images and multi-viewpoint silhouettes are available in the 3D pose estimation task. The ground truth silhouettes are generated by rendering the 3D ground truth meshes of UP-3D [33] from $4$ azimuth angles (with a step of $90^{\circ}$) at a fixed elevation angle ($0^{\circ}$) under the same camera setup, as illustrated in Figure 5. The resolution of the silhouettes is set to $64\times 64$.
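The four viewpoints can be enumerated as in the sketch below. The camera distance, the starting azimuth of $0^{\circ}$, and the y-up coordinate convention are assumptions chosen for illustration; only the $90^{\circ}$ azimuth step and the $0^{\circ}$ elevation come from the text.

```python
import math


def camera_position(azimuth_deg, elevation_deg=0.0, distance=2.0):
    """Camera centre on a sphere around the origin (y-up convention, assumed)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (
        distance * math.cos(el) * math.sin(az),
        distance * math.sin(el),
        distance * math.cos(el) * math.cos(az),
    )


azimuths = [k * 90.0 for k in range(4)]            # 0, 90, 180, 270 degrees
cameras = [camera_position(a) for a in azimuths]   # elevation fixed at 0 degrees
```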

4.3 Method

Given a single image $I$ and its multi-viewpoint 2D silhouettes $\{S_{i}\}$, the 3D body model is fitted by minimizing a weighted sum of error terms.

The differentiable silhouette rendering process is denoted as $\mathcal{R}$, and the silhouette error term $E_{sl}$ can be represented as:

$$\begin{aligned}E_{sl}&=\sum_{i=1}^{N_{s}}\|\mathcal{R}_{i}(\hat{P})-S_{i}\|_{2}^{2}\\ &=\sum_{i=1}^{N_{s}}\|\mathcal{R}_{i}(\mathcal{M}(\beta,\theta;\Phi))-S_{i}\|_{2}^{2}\end{aligned}$$

where $P$ and $\hat{P}$ denote the ground truth vertices and the estimated vertices, $N_{s}$ denotes the total number of silhouettes, $\mathcal{R}_{i}$ denotes rendering from the camera in the $i$-th position, and $S_{i}$ denotes the $i$-th ground truth silhouette.

To discourage the body model from self-intersection, a self-intersection penalty term $E_{spt}$ from [35] is adopted. This self-intersection penalty term can be represented as:

$$E_{spt}=\frac{N_{sec}}{N_{v}}$$

where $N_{sec}$ denotes the number of vertices in self-intersection region, $N_{v}$ denotes the total number of vertices.

The backward gradients of $E_{spt}$ are obtained by a hand-designed algorithm which produces gradients that pull vertices out of the self-intersection region. The details of this algorithm are beyond the scope of this paper; we refer interested readers to [35].

The objective function can be written as the weighted sum of the two error terms above:

$$E=E_{sl}+\lambda E_{spt} \tag{27}$$

where $\lambda$ is a scalar weight.
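Evaluating the objective is straightforward once the rendered silhouettes and the number of self-intersecting vertices are available. The sketch below treats both as given inputs; the function names and the nested-list image format are assumptions, while $\lambda=0.001$ and $N_{v}=6890$ follow the values stated in the paper.

```python
def silhouette_error(rendered, targets):
    """E_sl: sum over views of the squared L2 distance between silhouettes."""
    total = 0.0
    for r_img, s_img in zip(rendered, targets):
        for r_row, s_row in zip(r_img, s_img):
            total += sum((r - s) ** 2 for r, s in zip(r_row, s_row))
    return total


def self_intersection_penalty(num_intersecting, num_vertices):
    """E_spt: fraction of vertices lying inside a self-intersection region."""
    return num_intersecting / num_vertices


def objective(rendered, targets, num_intersecting, num_vertices, lam=0.001):
    """E = E_sl + lambda * E_spt."""
    e_sl = silhouette_error(rendered, targets)
    e_spt = self_intersection_penalty(num_intersecting, num_vertices)
    return e_sl + lam * e_spt


# One 2x2 view that differs from its target in a single pixel by 0.5.
rendered = [[[0.0, 1.0], [0.5, 1.0]]]
targets = [[[0.0, 1.0], [1.0, 1.0]]]
E = objective(rendered, targets, num_intersecting=689, num_vertices=6890)
```

In this toy case $E_{sl}=0.25$ and $E_{spt}=0.1$, so the penalty contributes only $10^{-4}$ to the total, which shows how small its weight is relative to the silhouette term.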

5 Experiments

In this section, experiments of 3D pose estimation are performed to evaluate the effectiveness of our method. The details of our experiments setup are provided. The results of qualitative comparison and quantitative comparison are presented to demonstrate the effectiveness of our method.

5.1 Experimental setup

5.1.1 Dataset

Our proposed method is evaluated on UP-3D [33]. This dataset contains color images and the corresponding ground truth 3D poses, represented as pose parameters of the SMPL model. Since our iterative optimization-based method is sensitive to the initial pose, results are reported on the subset of UP-3D selected by Tan et al. [36], which limits the range of global rotation of the SMPL models.

5.1.2 Evaluation metric

For quantitative evaluation, the per-vertex error from [28] is used as the metric for the accuracy of the 3D pose when comparing with other methods. As shown in Figure 6, the surface of the body mesh is represented by vertices and triangles. The accuracy of pose estimation can be effectively evaluated by measuring the error of each vertex. The per-vertex error $E_{p}$ can be represented as:

$$E_{p}=\frac{1}{N_{v}}\sum_{i=1}^{N_{v}}\|\hat{P}_{i}-P_{i}\|_{2}$$

where $N_{v}$ denotes the total number of vertices, $\hat{P}_{i}$ denotes the estimated location of vertices, $P_{i}$ denotes the ground truth location of vertices.
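The metric can be computed as in the following short sketch, which assumes the two vertex lists are in one-to-one correspondence and expressed in the same units; the function name is illustrative.

```python
import math


def per_vertex_error(estimated, ground_truth):
    """E_p: mean Euclidean distance between corresponding vertices."""
    assert len(estimated) == len(ground_truth)
    total = 0.0
    for p_hat, p in zip(estimated, ground_truth):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p_hat, p)))
    return total / len(estimated)


estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
error = per_vertex_error(estimated, ground_truth)
```

Here the first vertex is off by a distance of 5 and the second matches exactly, so the mean error is 2.5.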

5.1.3 Implementation details

The resolution of output images of differentiable renderer is set to $64\times 64$ , and the multiple of anti-aliasing $F$ is set to $4$ . The number of silhouettes $N_{s}$ is set to $4$ . The code is implemented in C++ with interface to the automatic differentiation library PyTorch [37], which allows us to employ their built-in optimizers and optimize the pose parameters of SMPL model easily. The objective function is minimized with Adam optimizer [38] with $\alpha=1.5\times 10^{-4}$ , $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . $\lambda$ in Equation 27 is set to $0.001$ across all experiments.
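For reference, the Adam update with the stated hyperparameters can be written out in plain Python as below. This is a didactic sketch rather than the paper's setup: the quadratic toy objective stands in for the silhouette objective, the value of `eps` is an assumed default, and in PyTorch the equivalent configuration would be `torch.optim.Adam(params, lr=1.5e-4, betas=(0.9, 0.999))`.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=1.5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a flat list of parameters; returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * g        # first-moment estimate
        vi = beta2 * vi + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = mi / (1.0 - beta1 ** t)            # bias correction
        v_hat = vi / (1.0 - beta2 ** t)
        new_theta.append(th - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


# Toy stand-in for the pose objective: E(theta) = sum(theta_i^2), gradient 2 * theta_i.
theta = [0.5, -0.25]
m, v = [0.0, 0.0], [0.0, 0.0]
history = []
for t in range(1, 101):
    grad = [2.0 * th for th in theta]
    theta, m, v = adam_step(theta, grad, m, v, t)
    history.append(sum(th * th for th in theta))
```

With this small step size each parameter moves by roughly $1.5\times 10^{-4}$ per iteration, so many iterations are needed for a pose to change noticeably.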

5.2 Qualitative comparison

The proposed differentiable renderer is compared with the Neural 3D Mesh Renderer (N3MR) [3] by conducting 3D pose estimation in the same experimental setup. To demonstrate the effectiveness of our approach, we also compare our results with those of a direct prediction method, Learning to Estimate 3D Human Pose and Shape from a Single Color Image (L2EPS) by Pavlakos et al. [28].

From the results shown in Figure 7, it is apparent that the Neural 3D Mesh Renderer suffers from local minima, which often result in failed predictions. Due to the discontinuous forward rendering pass without any smoothing filter and the inconsistency between forward and backward propagation, the optimization process is unstable and tends to fall into local minima. In contrast, we apply anti-aliasing in the forward rendering to make the intensity of each pixel as close as possible to the continuous definition in Equation 1, which achieves consistency between forward and backward propagation and stabilizes the optimization.

Although our method performs 3D pose estimation without any 2D joint error term, the results are comparable with those of the learning-based method [28], whose model is trained with 3D ground truth. Since 3D ground truth and 2D joint locations are more difficult to obtain than silhouettes, our method makes 3D pose estimation possible without any 2D joint locations or 3D ground truth.

5.3 Quantitative comparison

We report the per-vertex error of the different approaches in Table 1. As seen in Table 1, our differentiable renderer outperforms N3MR [3] in 3D pose estimation. Our result is worse than that of L2EPS [28] because the method in [28] leverages 3D ground truth, whereas our method leverages only 2D silhouettes and predicts 3D pose in an unsupervised manner.

5.4 Ablation analysis

In this section, we conduct controlled experiments to validate the necessity of different components.

5.4.1 Self-intersection penalty term.

We investigate the influence of the self-intersection penalty term [35] (SPT) on 3D pose estimation by running the experiment without it. In Figure 8 we visually compare the results of 3D pose estimation with and without SPT. As shown in Figure 8, the result without SPT suffers from self-intersection, whereas the experiment with SPT yields a more plausible result.

5.4.2 Anti-aliasing.

To demonstrate the importance of anti-aliasing in the forward pass of our differentiable renderer, we quantitatively compare 3D pose estimation with and without anti-aliasing. The results are given in Table 2. As seen in Table 2, anti-aliasing improves the accuracy of 3D pose estimation, especially when the resolution is low.

5.5 Running time analysis

To demonstrate the efficiency of our differentiable renderer, we compare our method with N3MR [3] at different resolutions and with different numbers of SMPL models. For a fair comparison, we implemented a CPU version of N3MR based on the released GPU version. All experiments in this section were performed on a laptop with an Intel(R) Core(TM) i7-8750H processor. We recorded the elapsed time of a single forward and backward pass of the two renderers in Table 3 and Table 4. As seen in Table 3 and Table 4, the speed advantage of our method over N3MR becomes more pronounced as the number of triangles and the resolution increase.
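A measurement of this kind can be sketched as follows. The helper `time_pass` and the callable `forward_backward` are hypothetical, and the warm-up call and averaging over several repeats are assumptions of this sketch rather than a description of the exact protocol used for Table 3 and Table 4.

```python
import time


def time_pass(forward_backward, repeats=10):
    """Return the mean wall-clock seconds of one call to `forward_backward`.

    One warm-up call is made first and excluded from the measurement.
    """
    forward_backward()  # warm-up, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        forward_backward()
    return (time.perf_counter() - start) / repeats


def dummy_pass():
    """Stand-in workload; a real test would render and back-propagate."""
    return sum(i * i for i in range(10000))


print(f"{time_pass(dummy_pass) * 1e3:.3f} ms per pass")
```

Running both renderers through the same helper on the same machine keeps the comparison consistent across resolutions and mesh sizes.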

6 Conclusion

In this paper, we proposed a novel method to obtain analytical derivatives for a differentiable silhouette renderer. We conducted experiments on 3D pose estimation by silhouette consistency to show the effectiveness and efficiency of the proposed method. Unlike previous works such as N3MR [3], which use a numerical approach to obtain derivatives, we proposed a continuous definition of pixel intensity and derived the analytical derivatives from that definition. We adopt anti-aliasing to ensure that the intensity of each pixel is close to the continuous definition. Experiments have shown that the accuracy and stability of the optimization benefit from the consistency between the forward and backward propagations of our differentiable renderer. Since we focus only on synthesizing silhouettes, only a few pixels and edges need to be considered. We employ a quadtree to accelerate the retrieval of the edges into which the gradient of the current pixel may back-propagate. As shown in the experiments, our implementation is more efficient than N3MR [3].
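The idea of using a quadtree for edge retrieval can be sketched as follows. This is an illustration under simplifying assumptions, not our actual data structure: edges are indexed by their axis-aligned bounding boxes, and the names `QuadTree`, `edge_bbox`, `build_edge_quadtree`, `capacity` and `max_depth` are all hypothetical.

```python
def boxes_overlap(a, b):
    """True if closed boxes (xmin, ymin, xmax, ymax) intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def edge_bbox(p, q):
    """Axis-aligned bounding box of the 2D segment from p to q."""
    return (min(p[0], q[0]), min(p[1], q[1]), max(p[0], q[0]), max(p[1], q[1]))


class QuadTree:
    """Quadtree over a square; each leaf stores the edges overlapping it."""

    def __init__(self, x0, y0, size, capacity=2, max_depth=6):
        self.box = (x0, y0, x0 + size, y0 + size)
        self.size = size
        self.capacity = capacity
        self.max_depth = max_depth
        self.items = []       # (edge index, bbox) pairs held by a leaf
        self.children = None  # four sub-quadrants after a split

    def insert(self, index, bbox):
        if not boxes_overlap(bbox, self.box):
            return
        if self.children is not None:
            for child in self.children:
                child.insert(index, bbox)
            return
        self.items.append((index, bbox))
        if len(self.items) > self.capacity and self.max_depth > 0:
            self._split()

    def _split(self):
        half = self.size / 2
        x0, y0 = self.box[0], self.box[1]
        self.children = [
            QuadTree(x0 + dx * half, y0 + dy * half, half,
                     self.capacity, self.max_depth - 1)
            for dy in (0, 1) for dx in (0, 1)
        ]
        items, self.items = self.items, []
        for index, bbox in items:  # push stored edges down to the children
            for child in self.children:
                child.insert(index, bbox)

    def query(self, bbox, found=None):
        """Return indices of edges whose bounding box overlaps `bbox`."""
        found = set() if found is None else found
        if not boxes_overlap(bbox, self.box):
            return found
        if self.children is not None:
            for child in self.children:
                child.query(bbox, found)
        else:
            for index, item_bbox in self.items:
                if boxes_overlap(item_bbox, bbox):
                    found.add(index)
        return found


def build_edge_quadtree(edges, size, capacity=2):
    """Index a list of ((x1, y1), (x2, y2)) edges inside a size x size image."""
    tree = QuadTree(0.0, 0.0, float(size), capacity=capacity)
    for index, (p, q) in enumerate(edges):
        tree.insert(index, edge_bbox(p, q))
    return tree


edges = [((1, 1), (5, 2)), ((40, 40), (60, 50)), ((10, 30), (12, 35)),
         ((2, 3), (3, 8)), ((50, 5), (55, 9)), ((30, 30), (33, 31))]
tree = build_edge_quadtree(edges, size=64)
print(tree.query((2, 1, 3, 2)))  # candidate edges for the pixel at (2, 1)
```

A query for one pixel visits only the quadrants that overlap it, so the number of edges tested stays small when edges are spread across the image. For a very simple mesh the cost of building the tree can outweigh this saving.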

Our method has two main limitations. First, our differentiable renderer is not general-purpose: it cannot obtain derivatives with respect to texture and lighting parameters, which limits its application in inverse graphics. Second, our implementation requires constructing a quadtree recursively, which leads to lower efficiency than the previous method when the mesh is very simple or the output resolution is quite low.

Future work may include deriving analytical derivatives for a general-purpose renderer so that gradients can back-propagate into arbitrary scene parameters. It may also include developing a parallelizable algorithm to enable an efficient GPU implementation.

Bibliography

[1] Whitted, T.: An improved illumination model for shaded display. In: ACM SIGGRAPH 2005 Courses, ACM (2005) 4
[2] Loper, M.M., Black, M.J.: OpenDR: An approximate differentiable renderer. In: European Conference on Computer Vision, Springer (2014) 154–169
[3] Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 3907–3916
[4] Liu, S., Chen, W., Li, T., Li, H.: Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. arXiv preprint arXiv:1901.05567 (2019)
[5] Gkioulekas, I., Levin, A., Zickler, T.: An evaluation of computational imaging techniques for heterogeneous inverse scattering. In: European Conference on Computer Vision, Springer (2016) 685–701
[6] Mansinghka, V.K., Kulkarni, T.D., Perov, Y.N., Tenenbaum, J.: Approximate Bayesian image interpretation using generative probabilistic graphics programs. In: Advances in Neural Information Processing Systems. (2013) 1520–1528
[7] Li, T.M., Aittala, M., Durand, F., Lehtinen, J.: Differentiable Monte Carlo ray tracing through edge sampling. In: SIGGRAPH Asia 2018 Technical Papers, ACM (2018) 222
[8] Zienkiewicz, J., Davison, A., Leutenegger, S.: Real-time height map fusion using differentiable rendering. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE (2016) 4280–4287