Walking on Thin Air: Environment-Free Physics-based Markerless Motion Capture

Micha Livne; Leonid Sigal; Marcus A. Brubaker; David J. Fleet

arXiv:1812.01203 · cs.CV · December 5, 2018

TL;DR

This paper introduces an automatic, environment-free physics-based motion capture method that estimates human pose and contact online from noisy depth data, without prior scene calibration.

Contribution

It presents a novel physics-based motion capture approach that operates online without environment calibration, using a data-driven body model and contact estimation from torque trajectories.

Findings

01. Improves tracking accuracy over state-of-the-art methods.
02. Reduces visual artifacts like foot-skate and jitter.
03. Works effectively with noisy single depth camera data.

Abstract

We propose a generative approach to physics-based motion capture. Unlike prior attempts to incorporate physics into tracking that assume the subject and scene geometry are calibrated and known a priori, our approach is automatic and online. This distinction is important since calibration of the environment is often difficult, especially for motions with props, uneven surfaces, or outdoor scenes. The use of physics in this context provides a natural framework to reason about contact and the plausibility of recovered motions. We propose a fast data-driven parametric body model, based on linear-blend skinning, which decouples deformations due to pose, anthropometrics and body shape. Pose (and shape) parameters are estimated using robust ICP optimization with physics-based dynamic priors that incorporate contact. Contact is estimated from torque trajectories and predictions of which contact…

Tables

Table I: Accuracy of the contact predictor with and without the Kalman filter. Values are the percentage of correctly predicted ground-truth contact states.

Contact Point	No Kalman Filter (%)	With Kalman Filter (%)
Left Toe	94.6	98.9
Right Toe	95.6	98.5
Left Heel	77.4	84.0
Right Heel	87.9	95.9

Table II: Quantitative results. The metrics MJPE and MJIE are defined in Sec. IV-B. Accurate numbers for Baak were not available; its error is at best 5 [cm], as shown in his work from 2013.

Method	MJPE	MJIE
Proposed Method (Data Only)	1.7 [cm]	0.975
Proposed Method (Physics)	1.89 [cm]	0.973
Ganapathi	N/A	0.971
Baak	~5 [cm]	N/A

Table III: A comparison between our model, SCAPE [23], Hasler [24], and Allen [32]. Semantics refers to the direct interpretation of model parameters. Semantic parameters, such as an explicit anthropometrics (skeleton) representation, are useful in the context of tracking. In our case, they allow direct control over which set of parameters to optimize.

Property	Our Model	SCAPE	Hasler	Allen
Mesh recon.	Linear	Nonlinear (Least Squares)	Nonlinear (Poisson)	Linear
Mesh recon. speed	1.82 [ms] (full), 1.17 [ms] (pose)	1 [s], mesh only (given matrices)	25 [s]	13 [ms]
Semantics	Skeleton, Body shape, Anthropometrics	Skeleton, Body shape	None	Skeleton
Mean reconstruction error	5.3 [mm]	N/A	54 [mm]	4.9 [mm]

Equations

$$\mathcal{M}(\Theta) = \{\mathbf{p}(\Theta; \mathbf{B}_{\boldsymbol{\ell}}, \mathbf{B}_{\boldsymbol{\beta}}), \mathcal{E}\}$$

$$\tilde{\mathbf{p}}^{s}(\boldsymbol{\ell}^{s}, \boldsymbol{\beta}^{s}) = \mathbf{B}_{\boldsymbol{\ell}}(\boldsymbol{\ell}^{s} - \hat{\boldsymbol{\ell}}) + \mathbf{B}_{\boldsymbol{\beta}}\boldsymbol{\beta}^{s}$$

$$\mathbf{p}_{i}^{s}(\Theta) = \sum_{b \in \mathcal{B}_{i}} w_{ib}\,\mathbf{M}_{b}(\boldsymbol{\ell}, \mathbf{q})\,\mathbf{M}_{b}^{-1}(\boldsymbol{\ell}, \tilde{\mathbf{q}}^{s})\,\tilde{\mathbf{p}}_{i}^{s}(\boldsymbol{\ell}, \boldsymbol{\beta})$$

$$\mathbf{E}(\mathbf{q}_{k-1}, \mathbf{q}_{k}, \mathbf{q}_{k+1}) \equiv D_{1}\mathcal{L}^{d}(\mathbf{q}_{k}, \mathbf{q}_{k+1}) + D_{2}\mathcal{L}^{d}(\mathbf{q}_{k-1}, \mathbf{q}_{k}) = \mathbf{f}_{k+1}$$

$$\mathcal{L}_{c}(\mathbf{q}) = \mathbf{g}(\mathbf{q})^{T}\boldsymbol{\lambda}$$

$$p(c_{k}^{i} \mid \mathbf{f}_{k}) = \sigma_{i}(\mathbf{f}_{k})$$

$$\mathbf{E}(\mathbf{q}_{k-1}, \mathbf{q}_{k}, \mathbf{q}_{k+1}) + \frac{\partial \mathbf{g}^{T}}{\partial \mathbf{q}_{k}}\boldsymbol{\lambda} = \mathbf{f}_{k+1}$$

$$\left\| \mathbf{I}_{root}\left(\mathbf{E} + \frac{\partial \mathbf{g}^{T}}{\partial \mathbf{q}_{k}}\boldsymbol{\lambda}\right) \right\|_{2}^{2}$$

$$\boldsymbol{\lambda}^{*} = \left(\frac{\partial \mathbf{g}^{T}}{\partial \mathbf{q}_{k}}\,\mathbf{I}_{root}\,\frac{\partial \mathbf{g}}{\partial \mathbf{q}_{k}}\right)^{-1}\frac{\partial \mathbf{g}^{T}}{\partial \mathbf{q}_{k}}\,\mathbf{I}_{root}\,\mathbf{E}$$

$$\mathbf{f}_{k+1}^{*} = \mathbf{E} + \frac{\partial \mathbf{g}^{T}}{\partial \mathbf{q}_{k}}\boldsymbol{\lambda}^{*}$$

$$p(\Theta_{k} \mid \mathcal{D}_{k}, \Theta_{k-1:k-2}) \propto p(\mathcal{D}_{k} \mid \Theta_{k})\,p(\Theta_{k} \mid \Theta_{k-1:k-2})$$

$$-\log p(\mathcal{D}_{k} \mid \Theta_{k}) = \sum_{(\mathbf{p}^{\prime}, \mathbf{d}^{\prime}) \in \Psi_{k}} \| \mathbf{p}^{\prime} - \mathbf{d}^{\prime} \|_{2}^{2}$$

$$-\log p(\Theta_{k} \mid \Theta_{k-1:k-2}) = \gamma_{1}\| \mathbf{f}_{root}^{k} \|^{2} + \gamma_{2}\| \mathbf{f}_{-root}^{k} \|^{2}$$

$$\mathcal{F}(\Theta_{k-2:k}, \mathcal{D}_{k}) = -\log p(\mathcal{D}_{k} \mid \Theta_{k}) - \log p(\Theta_{k} \mid \Theta_{k-1:k-2})$$

$$\phi_{ratio} = (r_{1}, \ldots, r_{4})^{T}, \qquad r_{j} \equiv \frac{d(\mathbf{g}_{j+1}, \bar{\mathbf{g}})}{d(\mathbf{g}_{j}, \bar{\mathbf{g}})}$$

$$\phi_{pos} = \left\{\begin{array}{l}\text{angles between all triplets}\\ \text{angles between all orientations}\end{array}\right\}$$

$$\mathcal{F} = \sum_{s \in \mathcal{S}} \mathcal{F}^{s}(\Theta^{s}, \mathcal{D})$$

$$\begin{pmatrix}\ldots & \mathbf{M}_{b}(\mathbf{q}_{j})\,\tilde{\mathbf{M}}_{b}(\tilde{\mathbf{q}}^{s})^{-1}\,\tilde{\mathbf{p}}_{i}^{s} & \ldots\end{pmatrix}\mathbf{w}_{i} = \mathbf{p}_{i}^{j} \;\Rightarrow\; \mathbf{A}_{i}^{s,j}\,\mathbf{w}_{i} = \mathbf{p}_{i}^{j}$$

$$\mathbf{T}_{i}^{s,j}\,\tilde{\mathbf{p}}_{i}^{s} = \mathbf{p}_{i}^{j}$$

$$\tilde{\mathbf{p}}^{s} \approx \mathbf{B}_{\boldsymbol{\ell}}\,\boldsymbol{\ell}^{s}$$


Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis

Full text

Walking on Thin Air: Environment-Free Physics-based Markerless Motion Capture

Micha Livne¹, Leonid Sigal², Marcus A. Brubaker³ and David J. Fleet¹

¹Department of Computer Science, University of Toronto, Toronto, Canada
{mlivne, fleet}@cs.toronto.edu

²Department of Computer Science, University of British Columbia, Vancouver, Canada
[email protected]

³Lassonde School of Engineering, York University, Toronto, Canada
[email protected]

Abstract

We propose a generative approach to physics-based motion capture. Unlike prior attempts to incorporate physics into tracking that assume the subject and scene geometry are calibrated and known a priori, our approach is automatic and online. This distinction is important since calibration of the environment is often difficult, especially for motions with props, uneven surfaces, or outdoor scenes. The use of physics in this context provides a natural framework to reason about contact and the plausibility of recovered motions. We propose a fast data-driven parametric body model, based on linear-blend skinning, which decouples deformations due to pose, anthropometrics and body shape. Pose (and shape) parameters are estimated using robust ICP optimization with physics-based dynamic priors that incorporate contact. Contact is estimated from torque trajectories and predictions of which contact points were active. To our knowledge, this is the first approach to take physics into account without explicit a priori knowledge of the environment or body dimensions. We demonstrate effective tracking from a noisy single depth camera, improving on state-of-the-art results quantitatively and producing better qualitative results, reducing visual artifacts like foot-skate and jitter.

Index Terms: Computer Graphics; Computer Vision; Physics; 3D Human Pose Tracking

I Introduction

Markerless motion capture methods enable reconstruction of detailed motion and dynamic geometry of the body (and sometimes garments) from multiple streams of video [1] or depth data [2, 3]. Recent human tracking methods are able to handle video captured in the wild, but still suffer from visually significant artifacts (jittering, feet/contact skating). This issue is significant as people are sensitive to such artifacts (e.g., foot-skate is perceptible at levels less than 21 mm [4]).

To address these challenges we propose a generative 3D human tracking approach that takes physics-based prior knowledge into account when estimating pose over time. The use of physics in this context is compelling as it provides a natural framework to reason about contact and the plausibility of the recovered motions online. Prior attempts to use physics for tracking assume that the subject and scene geometry are known a priori and calibrated [5, 6], that contact states are annotated by a user [7], or that optimization can be performed off-line (i.e., in batch) [8]. In contrast, our approach is online, without manual input. Beginning with the first frame, the subject and the contact state(s) are estimated online during tracking, without a priori knowledge of the environment. This is an important distinction, as calibration of the environment can be difficult, especially when capturing motions with props, on uneven surfaces or outdoors.

Our main contribution is the use of a physics-based prior without an explicit model of the environment. To our knowledge, it is the first tracking approach to incorporate physics without any explicit a priori knowledge of the environment or body dimensions. We demonstrate that the approach is effective in tracking from a single depth camera, improving on state-of-the-art results quantitatively and qualitatively, greatly reducing visually unpleasant artifacts such as foot-skate and jitter.

II Related Work

3D Human Tracking: Markerless motion capture, estimating the skeletal motion of a subject, has a rich history in vision and graphics (for an extensive survey see [9]). Methods can be broken into two classes: model-based and regression-based (or generative vs. discriminative). Regression-based methods estimate pose directly by regressing it from image feature descriptors (e.g., [10, 11, 12, 3, 13, 14]). Model-based approaches exploit a generative model for the body and image, and optimize for generative parameters that explain the image observations (e.g., [15, 16]). Regression-based methods are faster, but generally less accurate (unless the problem domain is highly constrained). Model-based approaches may be more accurate, but tend to be slower as they require iterative or stochastic optimization, and suffer from local optima.

Use of Physics in Tracking: Physics-based tracking has been proposed as a way to regularize pose under the assumption that physics is a universal prior that requires no assumptions about one’s motion (given a physical model). Early work dates back to [17] and [18]; however, these methods focused on simple motions in the absence of contact. More recently Brubaker et al. [5] proposed a low-dimensional model of the lower body to track walking subjects from monocular video. A more general data-driven physics-based filter, applicable to a variety of motions, was proposed in [6]. In [19] a controller-based approach is proposed where a physics-based full body controller, instead of a sequence of poses, is estimated. In all cases the body proportions and the scene geometry were assumed to be known. In [7] a physics-based tracking approach is formulated as a batch optimization problem with known contact points and ground geometry. In [8] the parameters of the planar ground model are estimated from data; however, the method assumes a parametric structure of ground geometry (a plane) and reasonably accurate 3D input, obtained using a binocular system. We build on this work with one notable distinction: we assume no knowledge of ground geometry or subject proportions. A notable exception in prior work is [20], which attempted to encode ground constraints directly in the kinematics, without a physics model. Like other methods, however, it required prior knowledge of the ground plane.

Contact Estimation: We briefly note that contact estimation and sampling has been used in other domains of graphics as well. One example is hand manipulation [21], where a randomized search over the hand-object contacts is proposed as the strategy for finding the pose of a hand manipulating an object over time. Contact invariant optimization [22] attempts to sidestep the problem of explicit contact estimation by searching over the space of contacts at the same time as the behaviour of the character. Such approaches, while interesting, require batch processing and long compute times, making them inapplicable to real-time capable full body tracking.

III Method

Our tracking pipeline is depicted in Fig. 2. In this section we describe a fast body mesh model (Sec. III-A), a discrete formulation of a physics engine and a physics-based motion prior (Sec. III-B), and a tracking framework that uses both to facilitate physics-based 3D human tracking (Sec. III-C). We also describe a fast and simple method for pre-processing the input point cloud that allows us to initialize tracking automatically (Sec. III-D).

III-A Fast Data-driven Parametric Body Model

In what follows we exploit a new SCAPE-like model (see [23]) for tracking. With an explicit skeleton, anthropometrics (bone length) and body shape parameters, our model is easy to manipulate and control. The anthropometrics parameters offer direct control over deformations due to bone lengths. The body shape parameters allow for control over the shape, independent of anthropometrics. To the best of our knowledge, explicit control over anthropometrics and shape is not straightforward with other existing body models.

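The body is modelled as a 3D triangulated mesh and comprises 69 DOF and 26 body parts

$$\mathcal{M}(\Theta) = \{\mathbf{p}(\Theta; \mathbf{B}_{\boldsymbol{\ell}}, \mathbf{B}_{\boldsymbol{\beta}}), \mathcal{E}\} \tag{1}$$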

where ${\mathbf{B}_{\boldsymbol{\ell}}}$ and ${\mathbf{B}_{\boldsymbol{\beta}}}$ are basis matrices, which capture variations in the mesh due to anthropometrics and body shape respectively, and ${\Theta}=({\mathbf{q}},{\boldsymbol{\ell}},{\boldsymbol{\beta}})$ , where ${\mathbf{q}}$ denotes articulated pose, and ${\boldsymbol{\ell}}$ and ${\boldsymbol{\beta}}$ denote coordinate vectors within the two subspaces. The $N$ mesh vertices of a canonical pose (called the template pose $\tilde{{\mathbf{q}}}^{s}$ ) for a given subject $s$ are given by a vector $\tilde{{\mathbf{p}}}^{s}\in\mathbb{R}^{3N\times 1}$

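$$\tilde{\mathbf{p}}^{s}(\boldsymbol{\ell}^{s}, \boldsymbol{\beta}^{s}) = \mathbf{B}_{\boldsymbol{\ell}}(\boldsymbol{\ell}^{s} - \hat{\boldsymbol{\ell}}) + \mathbf{B}_{\boldsymbol{\beta}}\boldsymbol{\beta}^{s} \tag{2}$$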

where $\hat{{\boldsymbol{\ell}}}$ denotes the mean anthropometrics within the subspace. The anthropometrics basis ${\mathbf{B}_{\boldsymbol{\ell}}}$ represents a linear mapping from bone lengths (relative to the mean) to a base template mesh. The basis ${\mathbf{B}_{\boldsymbol{\beta}}}$ provides a linear mapping of body shape coefficients into a deformation from a base template mesh. We enforce orthogonality of the two subspaces during the basis learning stage. We discuss how we learn the basis in the supplementary material. The final mesh is calculated using Linear Mesh Blending (LMB)

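$$\mathbf{p}_{i}^{s}(\Theta) = \sum_{b \in \mathcal{B}_{i}} w_{ib}\,\mathbf{M}_{b}(\boldsymbol{\ell}, \mathbf{q})\,\mathbf{M}_{b}^{-1}(\boldsymbol{\ell}, \tilde{\mathbf{q}}^{s})\,\tilde{\mathbf{p}}_{i}^{s}(\boldsymbol{\ell}, \boldsymbol{\beta}) \tag{3}$$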

where $\mathcal{B}_{i}$ is the set of bones (i.e., rigid body parts) that influence the position of vertex $i$ , $w_{ib}$ is the influence of bone $b$ on vertex $i$ (assumed to be constant for all poses), and ${\mathbf{p}}^{s}\in\mathbb{R}^{3N\times 1}$ is the final mesh. We trained our model on the Hasler dataset [24]. Fig. 3 and Fig. 4 depict our LMB model.
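The two-stage construction described above (a linear template from the anthropometrics and shape bases, followed by linear blending of per-bone transforms) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, array layouts, and the toy bases below are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def template_vertices(B_len, B_shape, bone_lengths, mean_lengths, shape_coeffs):
    """Linear template mesh B_len (l - l_mean) + B_shape beta, as an (N, 3) array."""
    flat = B_len @ (bone_lengths - mean_lengths) + B_shape @ shape_coeffs
    return flat.reshape(-1, 3)

def blend_vertices(template, weights, bones_posed, bones_template):
    """Blend per-bone rigid transforms of the template vertices.

    template:       (N, 3) template-pose vertices
    weights:        (N, B) influence w_ib of bone b on vertex i (rows sum to 1)
    bones_posed:    (B, 4, 4) homogeneous bone transforms in the target pose
    bones_template: (B, 4, 4) homogeneous bone transforms in the template pose
    """
    homog = np.hstack([template, np.ones((len(template), 1))])
    relative = bones_posed @ np.linalg.inv(bones_template)  # M_b(q) M_b^{-1}(q~)
    per_bone = np.einsum("bij,nj->nbi", relative, homog)    # vertex n moved by bone b
    return np.einsum("nb,nbi->ni", weights, per_bone)[:, :3]

# Toy data (hypothetical): two vertices, two bones, one shape coefficient.
B_len = np.zeros((6, 2))
B_len[4, 0] = 2.0       # bone length 0 stretches vertex 1 along y
B_shape = np.zeros((6, 1))
B_shape[3, 0] = 1.0     # the shape coefficient shifts vertex 1 along x
template = template_vertices(B_len, B_shape, np.array([1.5, 1.0]),
                             np.array([1.0, 1.0]), np.array([0.25]))

weights = np.array([[1.0, 0.0], [0.5, 0.5]])
rest = np.tile(np.eye(4), (2, 1, 1))
posed = rest.copy()
posed[1, 2, 3] = 2.0    # translate bone 1 by 2 units along z
skinned = blend_vertices(template, weights, posed, rest)
```

In this toy case vertex 1 is influenced equally by both bones, so it moves by half of bone 1's translation, which is the averaging behaviour that also causes the known volume-collapse artifacts of linear blending.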

III-B Environment-Free Physics-Based Priors

Priors are used in optimization to regularize the loss, pushing the optimal solution to a more desired manifold. As such, we would like our priors to be as generic as possible in order to generalize well. Physics-based priors exploit physical dynamics as an informative but general prior on motion, to help ensure that tracking yields a plausible motion. To that end we formulate our model of articulated dynamics using discrete mechanics [25]. This has many desirable properties such as direct mapping to discrete observations, conservation of energy, and computational efficiency (see [26]).

Variational Integrator: In the variational formulation of Lagrangian mechanics, the motion of a system is described by a function known as a discrete Lagrangian, $\mathcal{L}^{d}({\mathbf{q}}_{k-1},{\mathbf{q}}_{k})$ where ${\mathbf{q}}_{k}$ denotes the generalized coordinates of the system (e.g., a stick figure) at time step $k$. The discrete Lagrangian is an approximation of the continuous Lagrangian, and is used in a discrete formulation of the principle of least action to derive discrete mechanics (see [25] for more details). The evolution of the system is then given by the discrete Euler-Lagrange equations

$$\mathbf{E}({\mathbf{q}}_{k-1},{\mathbf{q}}_{k},{\mathbf{q}}_{k+1}) \equiv D_{1}\mathcal{L}^{d}({\mathbf{q}}_{k},{\mathbf{q}}_{k+1}) + D_{2}\mathcal{L}^{d}({\mathbf{q}}_{k-1},{\mathbf{q}}_{k}) = \mathbf{f}_{k+1} \tag{4}$$

where $\mathbf{f}$ is the vector of net generalized forces applied to the system, and $D_{i}$ denotes the partial derivative with respect to the $i$-th argument of the function.
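To make the discrete Euler-Lagrange residual concrete, the sketch below evaluates it for a deliberately simple system: a point mass in 1D gravity with a midpoint-rule discrete Lagrangian. The choice of system, the midpoint rule, and the finite-difference derivatives are all assumptions made for illustration; the paper's articulated body model is far richer.

```python
def discrete_lagrangian(q0, q1, h, mass=1.0, g=9.81):
    """Midpoint-rule discrete Lagrangian of a point mass in 1D gravity."""
    velocity = (q1 - q0) / h
    height = 0.5 * (q0 + q1)
    return h * (0.5 * mass * velocity ** 2 - mass * g * height)

def del_residual(q_prev, q_curr, q_next, h, eps=1e-6):
    """D1 L^d(q_k, q_{k+1}) + D2 L^d(q_{k-1}, q_k), using central differences."""
    d1 = (discrete_lagrangian(q_curr + eps, q_next, h)
          - discrete_lagrangian(q_curr - eps, q_next, h)) / (2 * eps)
    d2 = (discrete_lagrangian(q_prev, q_curr + eps, h)
          - discrete_lagrangian(q_prev, q_curr - eps, h)) / (2 * eps)
    return d1 + d2

h = 0.01                                                     # time step in seconds
free_fall = [-0.5 * 9.81 * (k * h) ** 2 for k in range(3)]   # ballistic trajectory
at_rest = [0.0, 0.0, 0.0]                                    # hovering without support

residual_fall = del_residual(*free_fall, h)   # close to zero: no net force needed
residual_rest = del_residual(*at_rest, h)     # nonzero: something must hold the mass
```

A ballistic trajectory needs no generalized force, whereas a mass that stays in place under gravity produces a nonzero residual of magnitude $m g h$. This is the same signal the tracker interprets as an unexplained force, and hence as evidence for missing contact.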

Contact: Contact is one of the greatest challenges, both computationally and theoretically, with physical dynamics. Such problems are less severe if contact times and locations are known, or provided by a user (e.g., [7]), but in most real-world tracking problems contact is unknown at inference time. Despite the added complexity, contact represents a strong constraint on motion (e.g., foot-skate should not happen during contact), and as such is a desirable element of the prior. To avoid dependency on prior knowledge of the environment or manual intervention, we infer contact states as part of a generative model. This reduces the computational challenge of handling inequality constraints to that of enforcing holonomic constraints, wherein one adds a constraint term, $\mathcal{L}_{c}$, to the Lagrangian $\mathcal{L}^{d}$. For a set of constraints, given by an equation $\mathbf{g}({\mathbf{q}})=0$, $\mathcal{L}_{c}$ is constructed as

$$\mathcal{L}_{c}({\mathbf{q}}) = \mathbf{g}({\mathbf{q}})^{T}\boldsymbol{\lambda} \tag{5}$$

where $\boldsymbol{\lambda}$ is a vector of Lagrange multipliers.

Root Forces: We refer to the forces applied to the root node of the kinematic tree as root forces. The root node represents the global translation/rotation, and forces applied to it represent external forces applied to our physical model. Newton’s second law states that changes (over time) to the total momentum of a physical system are equal to the external forces applied to the system. Our model represents a system that has no external forces but contact. As a result, we use the existence of root forces as an indicator that our contact model is incomplete (following [8]). By choosing a contact configuration that minimizes the external forces, we enforce the assumption that contact is the only way for our model to change its momentum and propel itself. Intuitively, applying a direct force to the root node of a kinematic tree can be thought of as a human wearing a jet-pack. By minimizing the root forces, we discourage that option.

Contact Estimation: To determine contact, we trained an independent binary classifier (per possible contact point) on the forces of tracked subjects, assuming contact-free motions. Effectively, we learn to infer contact from the forces that drive our model in the absence of contact. Currently we use four possible contact points at the heel and toe of each foot. A logistic regressor is trained to estimate the probability of contact given these forces:

$$p(c_{k}^{i} \mid \mathbf{f}_{k}) = \sigma_{i}(\mathbf{f}_{k}) \tag{6}$$

where $c_{k}^{i}$ is a binary variable indicating the contact state of point $i$ in frame $k$ , $\mathbf{f}_{k}$ is the vector of net generalized forces at frame $k$ (in the absence of contact), and $\sigma_{i}(\mathbf{x})=(1+\exp(-\alpha_{0}^{i}-\mathbf{x}^{T}\boldsymbol{\alpha}_{1}^{i}))^{-1}$ is a sigmoid function. Parameters $\alpha_{0}^{i},\boldsymbol{\alpha}_{1}^{i}$ were learned for each contact point independently.
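A per-contact-point logistic predictor of this form is compact enough to sketch directly. In the example below the parameter values and the two-dimensional force vector are made up for illustration (learned values are not given in the text); the $0.8$ decision threshold is the one reported in Sec. IV-B.

```python
import math

def contact_probability(forces, bias, weights):
    """Sigmoid of bias + forces . weights for a single contact point."""
    activation = bias + sum(w * f for w, f in zip(weights, forces))
    return 1.0 / (1.0 + math.exp(-activation))

def active_contacts(forces, params, threshold=0.8):
    """Classify each contact point independently from the contact-free forces."""
    return {name: contact_probability(forces, bias, weights) >= threshold
            for name, (bias, weights) in params.items()}

# Hypothetical parameters (alpha_0, alpha_1) for two of the four contact points.
params = {
    "left_toe": (0.0, [1.0, 0.0]),
    "left_heel": (0.0, [0.0, 1.0]),
}
forces = [2.0, -1.0]   # toy stand-in for the generalized force vector f_k
contacts = active_contacts(forces, params)
```

Because each point is classified independently, correlated contact patterns (for example heel and toe of the same foot) are not exploited, which is the simplification discussed in Sec. V.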

Physics-based Prior: With contact state determined, we can estimate the contact forces by minimizing root forces (following [8]). We can then write Eq. 4 with the additional contact constraints as

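$$\mathbf{E}({\mathbf{q}}_{k-1},{\mathbf{q}}_{k},{\mathbf{q}}_{k+1}) + \frac{\partial \mathbf{g}^{T}}{\partial {\mathbf{q}}_{k}}\boldsymbol{\lambda} = \mathbf{f}_{k+1} \tag{7}$$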

where $\boldsymbol{\lambda}$ is a vector of Lagrange multipliers for the holonomic contact constraints $\mathbf{g}$ . Given the selected contact configuration (i.e., active contact points), we can estimate $\mathbf{f}$ such that the forces on the root node are minimized. We achieve that by minimizing the squared norm of the root forces

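$$\left\| \mathbf{I}_{root}\left(\mathbf{E} + \frac{\partial \mathbf{g}^{T}}{\partial {\mathbf{q}}_{k}}\boldsymbol{\lambda}\right) \right\|_{2}^{2} \tag{8}$$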

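where $\mathbf{I}_{root}$ is a square selection matrix of the same size as ${\mathbf{q}}$, with ones on the diagonal to select the six degrees of freedom of the root node. The regularized least-squares solution (obtained by adding a small constant to the diagonal of $\mathbf{I}_{root}$) yields the contact forces required to minimize root forces, i.e.,

$$\boldsymbol{\lambda}^{*} = \left(\frac{\partial \mathbf{g}^{T}}{\partial {\mathbf{q}}_{k}}\,\mathbf{I}_{root}\,\frac{\partial \mathbf{g}}{\partial {\mathbf{q}}_{k}}\right)^{-1}\frac{\partial \mathbf{g}^{T}}{\partial {\mathbf{q}}_{k}}\,\mathbf{I}_{root}\,\mathbf{E} \tag{9}$$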

We can then calculate the final forces

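$$\mathbf{f}_{k+1}^{*} = \mathbf{E} + \frac{\partial \mathbf{g}^{T}}{\partial {\mathbf{q}}_{k}}\boldsymbol{\lambda}^{*} \tag{10}$$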

that are used as a prior in tracking. The prior over the forces $\mathbf{f}^{*}_{k+1}$ is defined twofold. We would like to minimize the root forces as a generic prior for a plausible motion, and we would like to minimize the internal torques to reduce jitter. Notice that given a contact configuration, both $\boldsymbol{\lambda}^{*}$ and $\mathbf{f}^{*}_{k+1}$ are functions of $\left({\mathbf{q}}_{k-1},{\mathbf{q}}_{k},{\mathbf{q}}_{k+1}\right)$ , thus our physics-based prior over the forces is applied directly over poses ${\mathbf{q}}$ .
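The root-force minimization above is a small weighted least-squares problem, sketched here in NumPy. Assumptions to note: `J` stands for the constraint Jacobian $\partial\mathbf{g}/\partial\mathbf{q}_{k}$ with one row per active constraint, the regularization constant `eps` is arbitrary, and the sign of the multipliers follows from minimizing the squared root forces directly, so it may differ from the paper's convention. The function name and toy numbers are hypothetical.

```python
import numpy as np

def minimize_root_forces(E, J, root_dofs, eps=1e-6):
    """Return (lam, f) minimizing the squared root forces of f = E + J^T lam.

    E:         (n,) contact-free generalized forces
    J:         (m, n) Jacobian of the active contact constraints w.r.t. q_k
    root_dofs: indices of the six root degrees of freedom
    eps:       small constant added to the diagonal of the selection matrix
    """
    selector = np.full(E.shape[0], eps)
    selector[root_dofs] += 1.0                 # diagonal of the regularized I_root
    JS = J * selector                          # same as J @ diag(selector)
    lam = -np.linalg.solve(JS @ J.T, JS @ E)   # normal equations of the weighted problem
    return lam, E + J.T @ lam

# Toy system (hypothetical): 8 DOF, the first 6 are the root, 2 contact constraints.
J = np.zeros((2, 8))
J[0, 0], J[0, 6] = 1.0, 0.5
J[1, 1], J[1, 7] = 1.0, 0.5
lam_true = np.array([3.0, -2.0])
E = -J.T @ lam_true                            # forces fully explainable by contact
lam, f = minimize_root_forces(E, J, root_dofs=np.arange(6))
```

When the contact-free forces can be explained entirely by the active contacts, as in this toy case, the recovered multipliers match the generating ones and the remaining forces vanish. With an incomplete contact set the root residual stays large, which is what the prior penalizes.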

III-C Registration and Tracking

Tracking is accomplished in an online fashion, by maximizing the posterior distribution over state parameters at each frame. As is common in online filtering, we assume conditional observation independence, and a second-order Markov model to account for acceleration in the physics-based prior. Accordingly, the posterior over state parameters at time $k$ is proportional to the data likelihood and the conditional distribution over state parameters given those at previous time steps

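$$p({\Theta}_{k} \mid \mathcal{D}_{k}, {\Theta}_{k-1:k-2}) \propto p(\mathcal{D}_{k} \mid {\Theta}_{k})\,p({\Theta}_{k} \mid {\Theta}_{k-1:k-2}) \tag{11}$$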

where $\mathcal{D}_{k}$ is an input 3D point cloud at time $k$ . By assuming Gaussian noise in observations, the negative log likelihood of the data term in Eq. 11 becomes

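$$-\log p(\mathcal{D}_{k} \mid {\Theta}_{k}) = \sum_{({\mathbf{p}}^{\prime}, {\mathbf{d}}^{\prime}) \in \Psi_{k}} \| {\mathbf{p}}^{\prime} - {\mathbf{d}}^{\prime} \|_{2}^{2} \tag{12}$$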

where $\Psi_{k}$ holds all matching body model points ${{\mathbf{p}}^{\prime}\in{\mathbf{p}}({\Theta}_{k})}$ and data points ${{\mathbf{d}}^{\prime}\in\mathcal{D}_{k}}$ at time $k$ . The matching was done in a standard Iterative Closest Point (ICP [27]) manner, and matched closest body and data points with a maximal distance threshold and pruning of back facing vertices. The data term captures the discrepancy between the model surface of the body, encoded by mesh vertices ${\mathbf{p}}_{i}({\Theta}_{k})$ , and the observed depth data points.

The negative log likelihood of the conditional state probability is based on the physics-based priors, as described above, and takes the following form:

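$$-\log p({\Theta}_{k} \mid {\Theta}_{k-1:k-2}) = \gamma_{1}\| \mathbf{f}_{root}^{k} \|^{2} + \gamma_{2}\| \mathbf{f}_{-root}^{k} \|^{2} \tag{13}$$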

where $\gamma_{1},\gamma_{2}$ are prior weights, $\mathbf{f}_{-root}^{k}$ comprises all but the root forces and accounts for smoothness in torques, and $\mathbf{f}_{root}^{k}$ are the root forces, which account for physical plausibility.

Tracking is formulated as the optimization of a global objective $\mathcal{F}$ , to find the parameters ${\Theta}_{k}$ at each frame that minimize errors between the body model, denoted ${\mathcal{M}}({\Theta}_{k})$ , and an input 3D point cloud $\mathcal{D}_{k}$ at time $k$ . The objective is the negative log likelihood of Eq. 11, i.e.,

$$\mathcal{F}({\Theta}_{k-2:k}, \mathcal{D}_{k}) = -\log p(\mathcal{D}_{k} \mid {\Theta}_{k}) - \log p({\Theta}_{k} \mid {\Theta}_{k-1:k-2}) \tag{14}$$

with the two terms as defined in Eq. 12 and Eq. 13 above. A natural way to optimize this objective function is to use a variant of ICP, i.e., by alternating between correspondence and parameter optimization. Empirically, ICP tends to be both fast and accurate. We register our model in the first frame by optimizing Eq. 14 w.r.t. all parameters $({\mathbf{q}},{\boldsymbol{\ell}},{\boldsymbol{\beta}})$, and in the following frames update the pose ${\mathbf{q}}$ only, while holding ${\boldsymbol{\ell}}$ and ${\boldsymbol{\beta}}$ fixed.
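As a rough illustration of how the objective combines the data term (Eq. 12) and the physics prior (Eq. 13), the sketch below performs one correspondence step with a distance threshold and then evaluates the objective. It is a simplified stand-in: brute-force nearest neighbours replace any spatial index, back-face pruning is omitted, the forces are passed in rather than derived from poses, and all names, weights, and thresholds are hypothetical.

```python
import numpy as np

def match_closest(model_pts, data_pts, max_dist):
    """Pair each model point with its closest data point, dropping distant pairs."""
    diff = model_pts[:, None, :] - data_pts[None, :, :]
    dist = np.linalg.norm(diff, axis=2)                   # (N, M) pairwise distances
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(model_pts)), nearest] <= max_dist
    return model_pts[keep], data_pts[nearest[keep]]

def tracking_objective(model_pts, data_pts, f_root, f_joints,
                       gamma1, gamma2, max_dist=0.5):
    """Sum of squared matched distances plus weighted squared force norms."""
    p, d = match_closest(model_pts, data_pts, max_dist)
    data_term = np.sum((p - d) ** 2)
    prior_term = gamma1 * np.sum(f_root ** 2) + gamma2 * np.sum(f_joints ** 2)
    return data_term + prior_term

# Toy frame: the third model vertex has no data point within the threshold.
model_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
data_pts = np.array([[0.0, 0.0, 0.1], [1.0, 0.1, 0.0]])
matched_model, matched_data = match_closest(model_pts, data_pts, max_dist=0.5)
value = tracking_objective(model_pts, data_pts,
                           f_root=np.array([1.0, 0.0]),
                           f_joints=np.array([0.0, 2.0]),
                           gamma1=0.1, gamma2=0.01)
```

In a full ICP loop this evaluation would alternate with a parameter update: correspondences are recomputed for the current pose, then the pose is updated to lower the objective with the correspondences held fixed.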

III-D Pre-processing

The proposed ICP algorithm requires initialization of the body model parameters ${\Theta}$ . In what follows we describe a fast and simple initialization method. Following [28], we exploit the observation that the geodesic distances between human end-effectors (i.e., head, hands, feet) are both large and relatively independent of body pose, in order to automatically initialize tracking.

Geodesic Extrema as Scale/Rotation Invariant Mesh Features: Given an input point cloud $\mathcal{D}$ , we generate a mesh $\mathcal{D}_{mesh}$ (i.e., connecting vertices with edges) using a greedy projection method for fast triangulation of unordered point clouds [29]. In case of grid-based depth input data, we use a method similar to [28], with a cut-off distance between nearby vertices (i.e., threshold over maximal distance).

Given a connected component, we extract the first five geodesic extrema $\{{\mathbf{g}}_{i}\,|\,{\mathbf{g}}_{i}\in\mathcal{D}_{mesh}\}_{i=1}^{5}$ from the geodesic centroid of the mesh, $\bar{{\mathbf{g}}}$ , as in [28]. We order the geodesic extrema by geodesic distance, so that $d({\mathbf{g}}_{i},\bar{{\mathbf{g}}})\leq d({\mathbf{g}}_{i+1},\bar{{\mathbf{g}}})$ , where $d(\cdot,\cdot)$ is the geodesic distance between two points. We define two features from which we detect human-like meshes, for labelling end-effectors and for finding initial poses for tracking. The first is a 4D vector that encodes the ratio of the ordered geodesic distances:

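$$\phi_{ratio} = (r_{1}, \ldots, r_{4})^{T}, \qquad r_{j} \equiv \frac{d({\mathbf{g}}_{j+1}, \bar{{\mathbf{g}}})}{d({\mathbf{g}}_{j}, \bar{{\mathbf{g}}})} \tag{15}$$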

These features act like moments to describe the geodesic eccentricity of the point cloud. The second feature encodes geometric shape:

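$$\phi_{pos} = \left\{\begin{array}{l}\text{angles between all triplets}\\ \text{angles between all orientations}\end{array}\right\} \tag{16}$$

where orientation is defined as the vector between a geodesic extremum and the point 30 cm along the geodesic path to the geodesic centroid.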
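The distance-ratio feature $\phi_{ratio}$ described above is simple to compute once the five geodesic distances to the centroid are known. The sketch below assumes those distances are already available (computing geodesics on the mesh is outside its scope); the function name and the sample values are hypothetical.

```python
def ratio_features(geodesic_distances):
    """Ratios of consecutive geodesic distances after sorting in ascending order."""
    ordered = sorted(geodesic_distances)
    if len(ordered) != 5 or ordered[0] <= 0.0:
        raise ValueError("expected five positive geodesic distances")
    return [ordered[j + 1] / ordered[j] for j in range(4)]

# Toy distances (in metres) from the geodesic centroid to five extrema.
features = ratio_features([4.0, 1.0, 16.0, 2.0, 8.0])
```

Because only ratios are used, the feature is unchanged when all distances are scaled by a common factor, which is what makes it usable across subjects of different sizes.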

Detecting human-like components: We use the distance-ratio features to detect connected-components that might be people in the scene. To that end we learn a 4D Gaussian distribution over $\phi_{ratio}$ of human meshes. This distribution then provides a probability that a connected component is a plausible person. A threshold of $0.1$ on that probability is used to cull non-human components. Even with this simple method we accurately detect about 90% of the human components with minimal false positives, which is sufficient for our application.
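One way to turn a fitted 4D Gaussian into a thresholdable score is sketched below. The text does not specify how the probability is computed, so the choice here is an assumption: the score is the chi-square tail probability of the squared Mahalanobis distance, which for four degrees of freedom has the closed form $e^{-x/2}(1 + x/2)$. The training data is synthetic and all names are hypothetical.

```python
import numpy as np

def fit_gaussian(features):
    """Fit mean and inverse covariance to an (n_samples, 4) feature array."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return mean, np.linalg.inv(cov)

def human_score(feature, mean, cov_inv):
    """Chi-square (4 dof) tail probability of the squared Mahalanobis distance."""
    diff = feature - mean
    m2 = float(diff @ cov_inv @ diff)
    return np.exp(-0.5 * m2) * (1.0 + 0.5 * m2)

# Synthetic stand-in for ratio features measured on human meshes.
rng = np.random.default_rng(0)
train = np.array([1.2, 1.3, 1.25, 1.4]) + 0.05 * rng.standard_normal((50, 4))
mean, cov_inv = fit_gaussian(train)

typical = human_score(mean, mean, cov_inv)                           # at the mean
outlier = human_score(np.array([3.0, 3.0, 3.0, 3.0]), mean, cov_inv)
is_human = [score >= 0.1 for score in (typical, outlier)]            # threshold from the text
```

Under this scoring, a threshold of $0.1$ keeps components whose features lie within the region containing 90% of the fitted Gaussian's mass, which is at least consistent with the detection rate of about 90% reported above.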

Pose Initialization: Given the extrema feature descriptors, we can register an unregistered point cloud $\mathcal{D}$ by finding poses from a database of labelled point clouds whose features $\phi_{pos}$ are most similar to those of the point cloud (see examples in Fig. 5). In more detail: we fetch the closest pose from a pose database (by L2 distance in $\phi_{pos}$), align the database mesh (with mean ${\Theta}$) to the data point cloud (using ellipsoids fitted to the vertices), and then estimate ${\Theta}$ with ICP.

III-E Method Summary

To summarize, the tracking pipeline is as follows:

1: Divide $\mathcal{D}$ into connected components

2: Remove non-human connected components by using $\phi_{ratio}$

3: Initialize pose and register first frame (ICP)

4: for all frames do

5: Initialize pose with previous pose

6: Execute ICP

7: end for

IV Experiments

IV-A Execution Speed

The tracking system was implemented in Python, with the core physical components in C++. It was evaluated on a desktop running OS X, with a 2.3 GHz Intel Core i7 and 8 GB RAM. At present it runs at $0.2[fps]$ with physical priors, and at $1.96[fps]$ with the body model only.

IV-B Quantitative Comparison

We used the SMMC-10 dataset [30] for quantitative comparison. It comprises synchronized Vicon mocap marker data and Mesa SwissRanger ToF depth data. The depth data have significant amounts of noise, as can be seen in Fig. 7 and the accompanying video. We compare our method to [28] and [31] on the same dataset. We achieve state-of-the-art tracking accuracy (see Table II). We report two metrics: Mean Joint Prediction Error (MJPE), the RMSE of predicting the mocap markers from our skeleton joints, and Mean Joint Indicator Error (MJIE), the fraction of joint predictions that were within $10[cm]$ of the target joint. Interestingly, when judged solely on MSE-based metrics (as defined above), the physics-based prior does not appear to significantly affect performance (e.g., see Table II). However, the accompanying video demonstrates that an MSE metric does not reflect the physical plausibility of a motion. That is, many motions share a given MSE value, and most of them are not physically plausible.
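The two metrics can be written down directly from their definitions. The sketch below assumes that predicted marker positions and ground-truth markers are given as matched lists of 3D points in metres; how markers are predicted from skeleton joints is not covered, and the function names and sample points are hypothetical.

```python
import math

def joint_errors(predicted, markers):
    """Euclidean distance between each predicted and ground-truth position."""
    return [math.dist(p, m) for p, m in zip(predicted, markers)]

def mjpe(predicted, markers):
    """Root-mean-square of the per-marker distances."""
    errors = joint_errors(predicted, markers)
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mjie(predicted, markers, radius=0.10):
    """Fraction of predictions within `radius` (10 cm) of their target."""
    errors = joint_errors(predicted, markers)
    return sum(e <= radius for e in errors) / len(errors)

# Toy example with per-marker errors of 3 cm, 4 cm, 0 cm and 20 cm.
markers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
predicted = [(0.03, 0.0, 0.0), (1.0, 0.04, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.2)]
rmse_value = mjpe(predicted, markers)
indicator_value = mjie(predicted, markers)
```

The toy numbers show why the two metrics complement each other: a single 20 cm outlier dominates the RMSE, while the indicator metric only records that one of four predictions missed the 10 cm radius.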

Due to the low SNR of the depth scans in the SMMC-10 dataset, and consequently of our force estimates, we used an online Kalman filter for noise filtering. We found that this improves contact prediction, reducing the prediction error by roughly 50% (see Table I).
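The state model of the Kalman filter is not specified here, so the sketch below assumes the simplest option: a scalar random-walk model applied independently to each force component, with hypothetical process and measurement variances. It is meant only to illustrate online filtering, where each estimate uses measurements up to the current frame.

```python
def kalman_smooth(measurements, process_var=1e-3, meas_var=1e-1):
    """Online scalar Kalman filter with a random-walk state model."""
    estimate, variance = measurements[0], meas_var
    filtered = [estimate]
    for z in measurements[1:]:
        variance += process_var                 # predict: state may drift slightly
        gain = variance / (variance + meas_var)
        estimate += gain * (z - estimate)       # update with the new measurement
        variance *= 1.0 - gain
        filtered.append(estimate)
    return filtered

constant = kalman_smooth([2.0] * 5)               # a clean signal passes through unchanged
noisy = kalman_smooth([1.0, -1.0] * 10)           # alternating noise around zero is damped
```

The ratio of `process_var` to `meas_var` controls the trade-off: a smaller ratio damps noise more strongly but reacts more slowly to the abrupt force changes that occur at contact events.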

We used a threshold of $0.8$ on the contact probability to reduce sticky contact (i.e., false positive contact predictions). Sticky contact is a consequence of enforcing holonomic (equality) contact constraints instead of inequality constraints. Despite simplifying the joint distribution over all contact points into an independent probabilistic model per contact point, our model was accurate enough to allow tracking with 4 possible contact points. The contact prediction model predicted the correct contact configuration with more than $95\%$ mean accuracy (see Table I).

IV-C Qualitative Comparison

Fig. 6 depicts the process of registering an accurate laser scan with our model. Notice how we learn anthropometrics, body shape and pose. Similarly, in Fig. 7 we register our model to the first frame in a tracking sequence. Due to the low dimensionality of the model we are still able to register a plausible model to very noisy data.

We tested our registration technique (Sec. III-C) on the Hasler [24] and SMMC-10 [31] datasets with promising initial results. When inspecting the tracking results in the accompanying video, there are some visible artifacts in the mesh model. Those, however, are mainly due to different pose distributions in training the mesh model and during tracking, rather than due to a fundamental limitation of the model (excluding known artifacts of linear mesh blending such as volume collapse). Another interesting property of our mesh model is volume prediction per registered mesh. We used the mesh volume to calculate the inertial description needed for the physical priors, treating the volume as water. When compared with ground truth, we had an average weight prediction error of $1.5\%$ over $115$ subjects.
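Estimating mass from a closed triangle mesh, treating the enclosed volume as water, can be sketched with the standard signed-tetrahedron volume formula. The function names and the test shape are hypothetical, the mesh is assumed to be closed and consistently oriented, and the density of 1000 kg/m³ is the usual value for water.

```python
def mesh_volume(vertices, triangles):
    """Volume of a closed, consistently oriented triangle mesh (signed tetrahedra)."""
    total = 0.0
    for i, j, k in triangles:
        (ax, ay, az), (bx, by, bz), (cx, cy, cz) = vertices[i], vertices[j], vertices[k]
        # Scalar triple product a . (b x c): six times the signed volume of the
        # tetrahedron spanned by the origin and the triangle.
        total += (ax * (by * cz - bz * cy)
                  - ay * (bx * cz - bz * cx)
                  + az * (bx * cy - by * cx))
    return abs(total) / 6.0

def mass_as_water(volume_m3, density=1000.0):
    """Mass in kilograms of the given volume at the density of water."""
    return density * volume_m3

# Unit right tetrahedron with outward-facing triangles; its volume is 1/6 m^3.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
triangles = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
volume = mesh_volume(vertices, triangles)
mass = mass_as_water(volume)
```

Per-part masses and inertia tensors could be obtained the same way by applying the computation to each body part's sub-mesh, although how the paper partitions the mesh for this purpose is not described here.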

The true power of our approach lies in the reduction of visual artifacts. While we do not remove jitter entirely, it is attenuated compared with data-only tracking. A more dramatic result is that foot-slide is removed in cases where contact is correctly detected. Although false positive contact estimates (wrong contact predictions) caused occasional visual artifacts, the benefit of removing foot-slide is much more noticeable, as is evident in the accompanying video. To better understand how the force predictor works, Fig. 8 plots ground truth contacts, along with the corresponding forces. Due to the lack of contact ground truth, we considered the lowest marker, along with all markers up to $5[cm]$ away, as being in contact. Despite the noisy ground truth, our simple predictor was able to perform well on most frames.

Fig. 9 demonstrates how adding the contact constraint acts as a strong motion prior. While smoothing pose will reduce jitter, it will also reduce discontinuities in motion due to contact. On the other hand, applying physical contact constraints can smooth jitter while allowing abrupt changes in motion.

V Discussion

We propose an online physics-based 3D human tracking approach that incorporates physics-based priors into tracking without the need for subject calibration or knowledge about the environment. The use of physics in this context is compelling as it allows us to minimize visual artifacts, most notably jitter and foot-skate, which result from noise and occlusions. We demonstrate that we can infer contact from joint torque trajectories computed by inverse dynamics. We show that our method is effective at tracking from a noisy single depth sensor and produces quantitative results that are on par with or better than the current state-of-the-art, while at the same time qualitatively reducing visual artifacts.

Our contact prediction model, while conceptually compelling, is relatively simple. For example, we predict all of the contact points independently, despite the fact that contact patterns (especially for contact points on connected segments) are clearly correlated. We believe the prediction model can be further improved by structured prediction that incorporates these correlations in contact state. While our method predicts the environment online, it currently does not aggregate these predictions, which may be important for longer sequences.

We also note that in our method the ground can both push and pull on the body when contact is established. This can sometimes be seen in the video. While this behaviour is not realistic in terms of underlying physical behaviour, we nevertheless believe it allows us to overcome vast amounts of noise in the observations (especially where feet can easily be confused with occluders and the ground plane).

Finally, we note that MSE metrics do not capture the dynamics of a motion. For example, the same MJPE can represent a motion that is the ground truth with a constant added to it, or the ground truth with zero-mean Gaussian noise whose standard deviation equals that constant. This exposes the limitation of relying on an MSE metric to assess the quality of tracking results, and highlights the advantage of using a physics-based prior. In a sense, the physics-based prior shapes the results to be qualitatively superior, despite not improving the MSE metric itself.
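The point about MSE can be checked numerically with a small sketch. It uses a synthetic one-dimensional sine trajectory as a stand-in for ground truth, and the helper names `rmse` and `jitter` are hypothetical; the jitter measure (mean absolute frame-to-frame change of the error) is one simple choice, not a metric defined in the paper.

```python
import numpy as np


def rmse(estimate, ground_truth):
    """Root-mean-square error between two trajectories."""
    return np.sqrt(np.mean((estimate - ground_truth) ** 2))


def jitter(estimate, ground_truth):
    """Mean absolute frame-to-frame change of the tracking error."""
    error = estimate - ground_truth
    return np.mean(np.abs(np.diff(error)))


rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 100_000)
ground_truth = np.sin(t)  # synthetic 1D trajectory, for illustration only
c = 0.05

# Case 1: ground truth shifted by a constant c.
offset_track = ground_truth + c
# Case 2: ground truth plus zero-mean Gaussian noise with standard deviation c.
noisy_track = ground_truth + rng.normal(0.0, c, size=ground_truth.shape)

print("RMSE   offset:", rmse(offset_track, ground_truth),
      " noisy:", rmse(noisy_track, ground_truth))
print("jitter offset:", jitter(offset_track, ground_truth),
      " noisy:", jitter(noisy_track, ground_truth))
```

Both tracks have nearly the same RMSE (exactly `c` for the offset, approximately `c` for the noise), yet only the noisy track shows frame-to-frame jitter, which is the kind of artifact a squared-error score cannot distinguish.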

VI Learning the Body Model

In order to learn the body mesh model in Sec. III-A we use the Hasler dataset [24], which consists of 111 subjects with 520 poses, all with registered meshes. We learn the model by minimizing Eq. 14 w.r.t. the weights $W=\left\{w_{ib}\right\}$, the different mesh templates $\left\{\tilde{{\mathbf{p}}}^{s}\right\}_{s}$ per subject $s$, and the template pose and anthropometrics $\left\{{\mathbf{q}}^{s},{\boldsymbol{\ell}}^{s}\right\}_{s}$ per subject $s$. We define the number of weights per vertex based on joint proximity along the kinematic tree, using BFS with a distance of 3. Since we optimize all parameters w.r.t. the same reconstruction error function (Eq. 14), we get an accurate reconstruction despite the simplicity of the model, when compared with other state-of-the-art models (Table III).
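The BFS-based selection of candidate joints per vertex can be sketched as below. The sketch assumes the kinematic tree is given as an undirected adjacency map, that each vertex is first associated with one starting joint, and that the distance bound of 3 is inclusive; the function name `joints_within` is hypothetical.

```python
from collections import deque


def joints_within(adjacency, start, max_dist=3):
    """Return the set of joints within `max_dist` hops of `start`.

    adjacency: dict mapping each joint index to a list of neighboring
    joints in the kinematic tree (undirected).
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        joint = queue.popleft()
        if dist[joint] == max_dist:
            continue  # do not expand beyond the distance bound
        for neighbor in adjacency[joint]:
            if neighbor not in dist:
                dist[neighbor] = dist[joint] + 1
                queue.append(neighbor)
    return set(dist)
```

A vertex would then carry non-zero skinning weights only for the joints in this set, which keeps the per-vertex least-squares problems small.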

Notice that some of the models in Table III were trained on different datasets. However, our main goal is to demonstrate that our model is comparable to state-of-the-art models, rather than to provide a comprehensive comparison. In our dataset only 43 out of 111 subjects have more than a single pose, which is required for our training. However, those subjects account for 86% of the total number of poses (450 out of 520 poses).

VI-A Model Parameters Optimization

While it is possible to optimize for the weights, mesh templates and pose simultaneously, it is a slow, non-convex and nonlinear optimization. Instead we alternated the optimization between the parameters, which yields a much faster optimization process, and is also convex w.r.t. the mesh templates and weights. Our learning process includes the following steps:

1. Initialize ${\mathbf{q}}^{s}$ for all subjects by fitting landmarks (based on the registered meshes).

2. Repeat until convergence:

(a) Optimize weights $W=\{w_{ib}\}$ given current poses and mesh templates. We optimize a global reconstruction error function, as $W$ is shared among all subjects:

[TABLE]

where $\mathcal{S}$ is the set of all subjects with more than a single pose in our dataset. By examining Eq. 3, it is clear that Eq. 17 is convex w.r.t. $W$. Thus, we can define $\mathbf{A}_{i}^{s,j}$ by rewriting Eq. 3 as

[TABLE]

where $j$ is a pose index (over all poses of subject $s$), and $\mathbf{w}_{i}$ is the vector of weights of vertex $i$. By concatenating $\mathbf{A}^{j}_{i},{\mathbf{p}}^{j}_{i}$ of all subjects $s$ and all poses $j$, the least-squares solution is easy to calculate.

(b) Optimize the mesh template per subject, given current weights and poses. We optimize the mesh template independently per subject. By examining Eq. 3, it is clear that Eq. 14 is convex w.r.t. $\tilde{{\mathbf{p}}}^{s}$, and we can define $\mathbf{T}_{i}^{s,j}$ by rewriting Eq. 3 as

[TABLE]

per vertex $i$ and per pose $j$. By concatenating the matrices $\mathbf{T}_{i}^{s,j}$ for all poses $j$ of a subject, a simple least-squares solution can be used here as well.

(c) Optimize pose ${\mathbf{q}}^{j}$ for all poses of all subjects. All poses can be estimated independently, using nonlinear, non-convex optimization of Eq. 14 w.r.t. the pose parameters ${\mathbf{q}}$. We used BFGS to optimize the pose parameters ${\mathbf{q}}_{s}^{j}$ for all poses of all subjects. Note: since the optimization is local, good initialization is required.

Practically, two full iterations of $(a),(b),(c)$ were enough to get close to convergence. The results of the model parameters optimization phase are $W$, the shared weights to be used in LMB, the mesh template $\tilde{{\mathbf{p}}}^{s}$ per subject, the bone lengths ${\boldsymbol{\ell}}^{s}$ per subject, and the pose vector ${\mathbf{q}}^{j}_{s}$ per pose $j$ and subject $s$.
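The two linear steps (a) and (b) can be sketched under a stated assumption: Eq. 3 is replaced here by standard linear-blend skinning, $\mathbf{p}_{i}^{j}=\sum_{b}w_{ib}\,\mathbf{T}_{b}^{j}\,[\tilde{\mathbf{p}}_{i};1]$, with each bone transform given as a $3\times 4$ matrix. The paper's actual Eq. 3 also involves anthropometrics and may differ. The names `skin`, `solve_weights` and `solve_template` are hypothetical, a single subject is used, and the least-squares solves are unconstrained (no sum-to-one or non-negativity on the weights).

```python
import numpy as np


def skin(weights, transforms, template):
    """Standard LBS for one pose.

    weights: (V, B), transforms: (B, 3, 4), template: (V, 3) -> (V, 3).
    """
    homo = np.hstack([template, np.ones((len(template), 1))])  # (V, 4)
    per_bone = np.einsum("bkl,vl->vbk", transforms, homo)      # (V, B, 3)
    return np.einsum("vb,vbk->vk", weights, per_bone)


def solve_weights(transforms, template, targets):
    """Step (a): least-squares weights per vertex, template and poses fixed.

    transforms: (J, B, 3, 4) bone transforms for J poses,
    targets: (J, V, 3) registered vertex positions.
    """
    J, B = transforms.shape[:2]
    V = template.shape[0]
    homo = np.hstack([template, np.ones((V, 1))])
    weights = np.zeros((V, B))
    for i in range(V):
        # Column b holds bone b's prediction of vertex i, stacked over poses.
        A = np.einsum("jbkl,l->jkb", transforms, homo[i]).reshape(3 * J, B)
        y = targets[:, i, :].reshape(3 * J)
        weights[i] = np.linalg.lstsq(A, y, rcond=None)[0]
    return weights


def solve_template(transforms, weights, targets):
    """Step (b): least-squares template per vertex, weights and poses fixed."""
    J = transforms.shape[0]
    V = weights.shape[0]
    template = np.zeros((V, 3))
    for i in range(V):
        # Blended 3x4 transform of vertex i in every pose.
        T = np.einsum("b,jbkl->jkl", weights[i], transforms)   # (J, 3, 4)
        A = T[:, :, :3].reshape(3 * J, 3)
        y = (targets[:, i, :] - T[:, :, 3]).reshape(3 * J)
        template[i] = np.linalg.lstsq(A, y, rcond=None)[0]
    return template
```

Alternating these two solves with a local pose optimizer for step (c) gives the overall loop; with several subjects, the rows built in `solve_weights` would be stacked over all subjects before solving, since $W$ is shared.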

VI-B Basis Learning

Once we learn the model parameters as explained above, we can train a linear regressor with basis ${\mathbf{B}_{\boldsymbol{\ell}}}$ from bone lengths to a mesh template $\tilde{{\mathbf{p}}}$, s.t.

[TABLE]

where ${\mathbf{B}_{\boldsymbol{\ell}}}$ is learned with a least squares formulation.

By applying PCA to the null space of the linear regression basis (i.e., the residual between the regressed mesh template and $\tilde{{\mathbf{p}}}$), we can learn the body shape basis ${\mathbf{B}_{\boldsymbol{\beta}}}$. We used the first 10 principal components as a linear basis. Once we have the two bases, ${\mathbf{B}_{\boldsymbol{\ell}}},{\mathbf{B}_{\boldsymbol{\beta}}}$, we can generate new mesh templates given any desired bone lengths ${\boldsymbol{\ell}}$ and body shape score ${\boldsymbol{\beta}}$, as shown in Eq. 2, and generate a mesh for any given pose by using Eq. 3.
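The two-stage basis learning can be sketched as follows. The sketch assumes templates are flattened into row vectors, that the regressor includes a constant bias term (an assumption, since Eq. 2 is not reproduced here), and that new templates are formed as the regressed template plus a linear combination of shape components. The names `learn_bases` and `generate_template` are hypothetical.

```python
import numpy as np


def learn_bases(bone_lengths, templates, n_shape=10):
    """Learn an anthropometric basis B_l and a shape basis B_beta.

    bone_lengths: (S, L) bone lengths per subject,
    templates: (S, D) flattened mesh templates per subject (D = 3 * V).
    Returns B_l of shape (L + 1, D) and B_beta of shape (n_shape, D).
    """
    S = bone_lengths.shape[0]
    # Append a constant column so the regressor has a bias (assumption).
    X = np.hstack([bone_lengths, np.ones((S, 1))])
    B_l = np.linalg.lstsq(X, templates, rcond=None)[0]
    # What bone lengths cannot explain is attributed to body shape.
    residual = templates - X @ B_l
    # With a bias column the residuals are already zero-mean over subjects,
    # so PCA reduces to an SVD of the residual matrix.
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    B_beta = Vt[:n_shape]
    return B_l, B_beta


def generate_template(lengths, beta, B_l, B_beta):
    """Flattened mesh template for given bone lengths and shape scores."""
    return np.append(lengths, 1.0) @ B_l + beta @ B_beta
```

Shape scores for a training subject can be recovered by projecting its residual onto the rows of `B_beta`, since those rows are orthonormal.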

Bibliography

[1] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun, “Performance capture from sparse multi-view video,” ACM Trans. Graph., vol. 27, no. 3, pp. 98:1–98:10, Aug. 2008.
[2] X. Wei, P. Zhang, and J. Chai, “Accurate realtime full-body motion capture using a single depth camera,” ACM Trans. Graph., vol. 31, no. 6, pp. 188:1–188:12, Nov. 2012.
[3] A. Haque, B. Peng, Z. Luo, A. Alahi, S. Yeung, and F. Li, “Viewpoint invariant 3d human pose estimation with recurrent error feedback,” CoRR, vol. abs/1603.07076, 2016.
[4] M. Prazak, L. Hoyet, and C. O’Sullivan, “Perceptual evaluation of footskate cleanup,” ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2011.
[5] M. A. Brubaker, D. J. Fleet, and A. Hertzmann, “Physics-based person tracking using the anthropomorphic walker,” International Journal of Computer Vision, vol. 87, no. 1, pp. 140–155, 2010.
[6] M. Vondrak, L. Sigal, and O. Jenkins, “Dynamical simulation priors for human motion tracking,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 1, pp. 52–65, 2013.
[7] X. Wei and J. Chai, “Videomocap: Modeling physically realistic human motion from monocular video sequences,” ACM Trans. Graphics (SIGGRAPH), vol. 29, no. 4, 2010.
[8] M. A. Brubaker, L. Sigal, and D. J. Fleet, “Estimating contact dynamics,” in Proc. IEEE ICCV, 2009.