Visibility Constrained Generative Model for Depth-based 3D Facial Pose Tracking

Lu Sheng; Jianfei Cai; Tat-Jen Cham; Vladimir Pavlovic; King Ngi Ngan

arXiv:1905.02114 · cs.CV · May 7, 2019

TL;DR

This paper introduces a robust generative framework for depth-based 3D facial pose tracking that adapts in real-time to occlusions and expression changes, improving accuracy over previous methods.

Contribution

It presents a novel statistical 3D morphable model with online adaptation and a ray visibility constraint to enhance robustness against occlusions.

Findings

1. Outperforms state-of-the-art depth-based methods on Biwi and ICT-3DHP datasets.

2. Effective in unconstrained scenarios with heavy occlusions and expression variations.

3. Demonstrates the benefit of visibility constraints over ICP-based pose estimation.

Tables

Table I: Facial Pose Datasets Summary

Dataset	$N_{seq}$	$N_{frm}$	$N_{subj}$	$\boldsymbol{\omega}_{\max}$
BIWI [21]	24	$\sim$15K	25	$\pm 75^{\circ}$ yaw, $\pm 60^{\circ}$ pitch
ICT-3DHP [48]	10	$\sim$14K	10	$\pm 75^{\circ}$ yaw, $\pm 45^{\circ}$ pitch

Table II: Component-wise Runtime Comparison (MATLAB Platform)

Component	ICP	RVS	RVS+TE	RVS+TE+PSO	IA
$\Delta t$ (sec)	0.0156	0.0193	0.0233	0.328	0.131

Table III: Evaluations on Biwi dataset

Method	Yaw error ($^{\circ}$)	Pitch error ($^{\circ}$)	Roll error ($^{\circ}$)	Translation error (mm)
Ours	1.6	1.7	1.8	5.5
Sheng et al. [27]	2.3	2.0	1.9	6.9
RF [21]	8.9	8.5	7.9	14.0
Martin [49]	3.6	2.5	2.6	5.8
CLM-Z [48]	14.8	12.0	23.3	16.7
TSP [20]	3.9	3.0	2.5	8.4
PSO [44]	11.1	6.6	6.7	13.8
Meyer et al. [37]	2.1	2.1	2.4	5.9
Li et al.$^{\star}$ [38]	2.2	1.7	3.2	$-$

Table IV: Evaluations on ICT-3DHP dataset

Method	Yaw error ($^{\circ}$)	Pitch error ($^{\circ}$)	Roll error ($^{\circ}$)
Ours	2.5	3.0	2.7
Sheng et al. [27]	3.4	3.2	3.3
RF [21]	7.2	9.4	7.5
CLM-Z [48]	6.9	7.1	10.5
Li et al.$^{\star}$ [38]	3.3	3.1	2.9

Equations

$\mathbf{f}=\bar{\mathbf{f}}+\mathcal{C}\times_{2}\mathbf{w}_{\text{id}}^{\top}\times_{3}\mathbf{w}_{\text{exp}}^{\top},$

$\mathbf{f}=\bar{\mathbf{f}}+\mathcal{C}\times_{2}\boldsymbol{\mu}_{\text{id}}\times_{3}\boldsymbol{\mu}_{\text{exp}}+\mathcal{C}\times_{2}\boldsymbol{\epsilon}_{\text{id}}\times_{3}\boldsymbol{\mu}_{\text{exp}}+\mathcal{C}\times_{2}\boldsymbol{\mu}_{\text{id}}\times_{3}\boldsymbol{\epsilon}_{\text{exp}}+\mathcal{C}\times_{2}\boldsymbol{\epsilon}_{\text{id}}\times_{3}\boldsymbol{\epsilon}_{\text{exp}}.$

$p_{\mathcal{M}}(\mathbf{f})=\mathcal{N}(\mathbf{f}|\boldsymbol{\mu}_{\mathcal{M}},\boldsymbol{\Sigma}_{\mathcal{M}}),$

$p(\mathbf{f},\mathbf{w}_{\text{id}})=\mathcal{N}(\mathbf{f}|\bar{\mathbf{f}}+\mathbf{P}_{\text{id}}\mathbf{w}_{\text{id}},\boldsymbol{\Sigma}_{\mathcal{E}})\,\mathcal{N}(\mathbf{w}_{\text{id}}|\boldsymbol{\mu}_{\text{id}},\boldsymbol{\Sigma}_{\text{id}}),$

$[u,v,1]^{\top}=\pi(\mathbf{p})=\mathbf{K}[x/z,y/z,1]^{\top}$

$\mathbf{p}=\pi^{-1}(\mathbf{x},\mathbf{D}(\mathbf{x}))=\mathbf{K}^{-1}[u,v,1]^{\top}\cdot\mathbf{D}(\mathbf{x}),$

$\mathbf{q}_{n}=\mathbf{T}(\boldsymbol{\theta})\circ\mathbf{f}_{n}=e^{\alpha}\mathbf{R}(\boldsymbol{\omega})\mathbf{f}_{n}+\mathbf{t},\quad n\in\{1,\ldots,N_{\mathcal{M}}\}$

$p_{\mathcal{Q}}(\mathbf{q}_{n};\boldsymbol{\theta})=\mathcal{N}(\mathbf{q}_{n}|\mathbf{T}(\boldsymbol{\theta})\circ\boldsymbol{\mu}_{\mathcal{M},[n]},e^{2\alpha}\boldsymbol{\Sigma}_{\mathcal{M},[n]}^{(\boldsymbol{\omega})}),$

$y_{n}=\Delta(\mathbf{q}_{n};\mathbf{p}_{n})=\mathbf{n}_{n}^{\top}(\mathbf{q}_{n}-\mathbf{p}_{n}),$

$\mathcal{N}(y_{n}|\Delta(\mathbf{T}(\boldsymbol{\theta})\circ\boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_{n}),\sigma_{o}^{2}+e^{2\alpha}\mathbf{n}_{n}^{\top}\boldsymbol{\Sigma}_{\mathcal{M},[n]}^{(\boldsymbol{\omega})}\mathbf{n}_{n}),$

$\gamma_{n}=1:\ \Delta(\mathbf{T}(\boldsymbol{\theta})\circ\boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_{n})\leq\sigma_{o}^{2}+e^{2\alpha}\mathbf{n}_{n}^{\top}\boldsymbol{\Sigma}_{\mathcal{M},[n]}^{(\boldsymbol{\omega})}\mathbf{n}_{n}.$

$\gamma_{n}=0:\ \Delta(\mathbf{T}(\boldsymbol{\theta})\circ\boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_{n})>\sigma_{o}^{2}+e^{2\alpha}\mathbf{n}_{n}^{\top}\boldsymbol{\Sigma}_{\mathcal{M},[n]}^{(\boldsymbol{\omega})}\mathbf{n}_{n}.$

$p_{\mathcal{P}}(y_{n};\boldsymbol{\theta})=\mathcal{N}(y_{n}|0,\sigma_{o}^{2})^{\gamma_{n}}\,\mathcal{U}_{\mathcal{O}}(y_{n})^{1-\gamma_{n}},$

$\mathcal{L}_{\text{rvs}}(\boldsymbol{\theta})=D_{\text{KL}}[p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})\,\|\,p_{\mathcal{P}}(\mathbf{y};\boldsymbol{\theta})]$

$W(\mathbf{x}_{m}^{t-1},\Delta\boldsymbol{\theta})=\pi(\mathbf{T}(\Delta\boldsymbol{\theta})\circ\pi^{-1}(\mathbf{x}_{m}^{t-1},\mathbf{D}_{t-1}(\mathbf{x}_{m}^{t-1}))).$

$\mathcal{L}_{t}(\Delta\boldsymbol{\theta})=\frac{1}{2\sigma_{t}^{2}}\sum_{m=1}^{M^{t-1}}(\mathbf{D}_{t}(W(\mathbf{x}_{m}^{t-1},\Delta\boldsymbol{\theta}))-Z(\mathbf{x}_{m}^{t-1};\Delta\boldsymbol{\theta}))^{2},$

$\mathcal{L}_{s}(\Delta\boldsymbol{\theta})=\frac{1}{2}\lambda_{s}^{(t)}(\Delta\alpha)^{2},$

$\mathcal{L}(\Delta\boldsymbol{\theta})=\mathcal{L}_{\text{rvs}}(\boldsymbol{\theta}^{(t-1)}+\Delta\boldsymbol{\theta})+\mathcal{L}_{t}(\Delta\boldsymbol{\theta})+\mathcal{L}_{s}(\Delta\boldsymbol{\theta}).$

$p(\boldsymbol{\mu}_{\text{id}},\boldsymbol{\Sigma}_{\text{id}})=\mathcal{N}(\boldsymbol{\mu}_{\text{id}}|\mathbf{m},\beta^{-1}\boldsymbol{\Sigma}_{\text{id}})\,\mathcal{W}^{-1}(\boldsymbol{\Sigma}_{\text{id}}|\boldsymbol{\Psi},\nu)$

$p_{\ell}(\mathbf{y}^{t},\mathcal{P}^{t}|\mathbf{w}_{\text{id}};\boldsymbol{\theta}^{(t)})=\prod_{n=1}^{N_{\mathcal{M}}}\mathcal{U}_{\mathcal{O}}(y_{n}^{t})^{1-\gamma_{n}^{(t)}}\times\prod_{n=1}^{N_{\mathcal{M}}}p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n}^{t}|\mathbf{w}_{\text{id}};\boldsymbol{\theta}^{(t)})^{\gamma_{n}^{(t)}}\,p_{\mathcal{Q}}(\mathbf{p}_{n}^{t}|\mathbf{w}_{\text{id}};\boldsymbol{\theta}^{(t)})^{\gamma_{n}^{(t)}},$

$\kappa^{(t)}=\frac{1}{N_{\mathcal{M}}}\sum_{n=1}^{N_{\mathcal{M}}}\gamma_{n}^{(t)}.$

$\mathbf{m}_{\text{new}}=\frac{N_{C}\bar{\mathbf{w}}_{\text{id}}+\beta\mathbf{m}}{N_{C}+\beta},\quad\text{and}\quad\beta_{\text{new}}=\beta+N_{C}$

$\boldsymbol{\Psi}_{\text{new}}=\boldsymbol{\Psi}+N_{C}\mathbf{S}+\frac{\beta N_{C}}{\beta+N_{C}}(\bar{\mathbf{w}}_{\text{id}}-\mathbf{m})(\bar{\mathbf{w}}_{\text{id}}-\mathbf{m})^{\top},\quad\text{and}\quad\nu_{\text{new}}=\nu+N_{C},$

$p(k|\mathbf{w}_{\text{id}};\{\mathcal{I}_{k}\}_{k=1}^{K},\mathcal{I}_{0})=\frac{p_{\mathcal{I}_{k}}(\mathbf{w}_{\text{id}})}{\sum_{k=1}^{K}p_{\mathcal{I}_{k}}(\mathbf{w}_{\text{id}})+p_{\mathcal{I}_{0}}(\mathbf{w}_{\text{id}})},$

Taxonomy

Topics: Face recognition and analysis · 3D Shape Modeling and Analysis · Human Pose and Action Recognition

Full text

Visibility Constrained Generative Model for Depth-based 3D Facial Pose Tracking

Lu Sheng, Jianfei Cai, Tat-Jen Cham, Vladimir Pavlovic, and King Ngi Ngan

L. Sheng is with the College of Software, Beihang University, China. E-mail: [email protected]

K. N. Ngan is with University of Electronic Science and Technology. E-mail: [email protected]

J. Cai and T-J. Cham are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: {asjfcai, astjcham}@ntu.edu.sg

V. Pavlovic is with the Department of Computer Science, Rutgers University, USA. E-mail: [email protected]

Abstract

In this paper, we propose a generative framework that unifies depth-based 3D facial pose tracking and face model adaptation on-the-fly, in unconstrained scenarios with heavy occlusions and arbitrary facial expression variations. Specifically, we introduce a statistical 3D morphable model that flexibly describes the distribution of points on the surface of the face model, with an efficient switchable online adaptation that gradually captures the identity of the tracked subject and rapidly constructs a suitable face model when the subject changes. Moreover, unlike prior art that employed ICP-based facial pose estimation, to improve robustness to occlusions, we propose a ray visibility constraint that regularizes the pose based on the face model’s visibility with respect to the input point cloud. Ablation studies and experimental results on Biwi and ICT-3DHP datasets demonstrate that the proposed framework is effective and outperforms competing state-of-the-art depth-based methods.

Index Terms:

3D facial pose tracking, generative model, depth, online Bayesian model, mixture of Gaussian models

1 Introduction

Robust 3D facial pose tracking is a central task in many computer vision and computer graphics problems, with applications in facial performance capture, human-computer interaction, and VR/AR on modern mobile devices. Although facial pose tracking has been successfully performed on RGB data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] for well-constrained scenes, challenges posed by illumination variations, shadows, and substantial occlusions make these RGB-based facial pose tracking approaches less reliable in unconstrained scenarios. The utilization of depth data from commodity real-time range sensors has led to more robust 3D facial pose tracking, not only by enabling registration in the 3D metric space but also by providing cues for occlusion reasoning.

Although promising results have been demonstrated by leveraging both RGB and depth data [11, 12, 13, 14, 15, 16, 17], or even RGB data alone [6, 2] in unconstrained facial pose tracking, existing approaches are not yet able to reliably cope with RGB data affected by inconsistent or poor lighting conditions. For example, mobile applications like FaceID or Animoji (which heavily employ the face tracking module in their systems) are bound to fail in dark scenes such as bedrooms and theaters, or scenes under complex illuminations such as parties and clubs. Furthermore, RGB data may be deliberately suppressed in scenarios where privacy is a major concern. Therefore, it is meaningful to study robust 3D facial pose tracking using depth data alone, complementary to traditional RGB-based tracking systems.

Several key challenges need to be addressed in the context of depth-based tracking: (1) coping with complex self-occlusions and other occlusions caused by hair, accessories, hands, etc.; (2) sustaining an always-on face tracker that can dynamically adapt to any user without manual recalibration; and (3) providing stability over time to variations in user expressions. Unlike previous depth-based discriminative or data-driven methods [18, 19, 20, 21, 22, 23, 24] that require complex training or manual calibration, in this paper we propose a framework that unifies pose tracking and face model adaptation on-the-fly, offering highly accurate, occlusion-aware and uninterrupted 3D facial pose tracking, as shown in Fig. 1. The contributions of this work are fourfold:

We introduce a decomposable statistical formulation for a 3D morphable face model, improving upon the earlier 3DMM models [25, 26]. This formulation encourages a group-wise pose estimation for any potential face model and enables expression-invariant face model updating.
We propose an occlusion-aware pose estimation mechanism based on minimizing an information-theoretic ray visibility score that optimizes the visibility of our statistical face model. The mechanism is based on the intuition that valid poses imply that the face points must either be co-located (i.e., visible) with the observed point cloud or be occluded by the point cloud. Without any need for explicit correspondences, our method is highly effective in handling various types of occlusions.
We introduce a flow-based constraint to enforce temporal coherence among poses of neighboring frames, in which the poses are regularized by per-pixel depth flows between adjacent frames rather than predicting the motion patterns from the previous pose trajectories.
We present an online switchable identity adaptation approach to gradually adapt the face model to the captured subject, or instantaneously switch among stored personalized identity models and create novel face models for new identities.

We present a comprehensive ablation study for the proposed facial tracking framework. Moreover, experiments on the Biwi and ICT-3DHP datasets demonstrate the superiority of the proposed method over competing depth-based tracking systems. Note that an early version of this work was published in [27]. Compared to that version, this paper makes substantial extensions, including a flow-based temporal coherence constraint, an online switchable identity adaptation method, and a comprehensive ablation study. With these new technical changes, our proposed framework significantly improves on the results of our early method, outperforming the state-of-the-art methods on two benchmark datasets.

2 Related Work

Facial pose tracking and model regression methods typically consider the RGB videos as their input modality. They largely rely on tracking the dynamics of sparse 3D facial features that correspond to parametric 3D face models [28, 29, 30, 10, 1, 8]. In the presence of reliable feature detection, the facial pose can be tracked accurately under moderate occlusions and smooth motion patterns. Recent advances in discriminative pose estimation and face model reconstruction, which employ deep learning [5, 7, 4] or random forest [31, 1, 8, 32] paradigms, have shown promising results in many applied scenarios. To improve robustness of prior work, explicit modeling of the occlusions has also been considered [31, 2, 6].

With the introduction and development of depth sensors, a variety of depth-based 3D facial pose tracking and model personalization frameworks have been proposed. One category of approaches employs sparse depth features, such as facial surface curvatures [18], nose tips [19], or triangular surface patch descriptors [20], as a means of robust 3D facial pose estimation. However, these methods may fail when such features cannot be detected under conditions of highly noisy depth data, extreme poses or large occlusions. A different family of approaches considers discriminative methods based on random forests [21, 22], deep Hough networks [23], or finding a dense correspondence field between the input depth image and a predefined canonical face model [24, 33]. Although promising and often accurate, these methods require sophisticated supervised training with large-scale, tediously labeled datasets.

A different modeling strategy involves rigid and non-rigid registration of 3D face models to the depth images, either through the use of 3D morphable models [25, 34, 12, 35, 36, 11, 37, 38, 39, 40, 41, 2], or brute-force per-vertex 3D face reconstruction [42, 14, 15, 16, 14, 25]. Although such systems may be accurate, they often require offline initialization or user calibration to create face models specific to individual users. Subsequent works have been developed to gradually refine the 3D morphable model over time during active tracking [38, 37, 36, 11, 13, 15]. Our proposed method falls into this category. Inspired by statistical models [25, 26, 2, 41], we enhance this face model through a decomposable statistical formulation, in which the shape variations from identity and expression are explicitly disentangled.

Occlusion handling is vital for robust 3D facial pose tracking. While occlusions may be identified through face segmentation [11, 6, 2] or patch-based feature learning [22, 21, 23, 24], face model registration frameworks based on iterative closest point (ICP) do not handle ambiguous correspondences well [43, 11, 14, 15]. Possible remedies include particle swarm optimization [44] for optimizing complex objective functions [37]. Recently, Wang et al. [45] tackled partial registration of general moving subjects and improved occlusion handling by considering multi-view visibility consistency. Our proposed ray visibility score incorporates a similar visibility constraint between the face model and the input point cloud but with a probabilistic formulation, which handles uncertainties in the 3D face model more robustly and is thus less vulnerable to the local minima that are frequently encountered in ICP.

Our online switchable identity adaptation falls in the category of adaptive online approaches [11, 13, 15, 36, 37, 27]. However, unlike prior work we also tackle the important problem of identity instantiation and switching, which enables us to rapidly adapt to new users.

3 Probabilistic 3D Face Parameterization

In this section, we introduce the 3D morphable face model with a probabilistic interpretation, which acts as an effective prior for facial pose estimation and face identity adaptation.

3.1 Multilinear Face Model

3D face shapes are usually represented by triangular meshes, but in this paper we only focus on the parameterization of vertices, while leaving the edges unchanged. Thus the face shape can be simplified as a vector constructed by an ordered 3D vertex list $\mathbf{f}=[x_{1},y_{1},z_{1},\ldots,x_{N_{\mathcal{M}}},y_{N_{\mathcal{M}}},z_{N_{\mathcal{M}}}]^{\top}$ , where the $n^{\text{th}}$ vertex is $[x_{n},y_{n},z_{n}]^{\top}\in\mathbb{R}^{3},\forall n\in\{1,\ldots,N_{\mathcal{M}}\}$ . $N_{\mathcal{M}}$ is the total number of vertices in the model.

We apply the multilinear model [25, 26] to parametrically generate 3D faces that adapt to different identities and expressions. It is controlled by a three-dimensional tensor $\mathcal{C}\in\mathbb{R}^{3N_{\mathcal{M}}\times N_{\text{id}}\times N_{\text{exp}}}$, whose dimensions correspond to shape, identity and expression, respectively. Thus, the multilinear model represents a 3D face shape as

$$\mathbf{f}=\bar{\mathbf{f}}+\mathcal{C}\times_{2}\mathbf{w}_{\text{id}}^{\top}\times_{3}\mathbf{w}_{\text{exp}}^{\top}, \tag{1}$$

where $\mathbf{w}_{\text{id}}\in\mathbb{R}^{N_{\text{id}}}$ and $\mathbf{w}_{\text{exp}}\in\mathbb{R}^{N_{\text{exp}}}$ are the linear weights for identity and expression, respectively, $\times_{i}$ denotes the $i$-th mode product, and $\bar{\mathbf{f}}$ is the mean face in the training dataset. The core tensor $\mathcal{C}$, which encodes the subspaces of the shape variations in faces, is calculated by applying high-order singular value decomposition (HOSVD) to the training dataset, i.e., $\mathcal{C}=\mathcal{T}\times_{2}\mathbf{U}_{\text{id}}\times_{3}\mathbf{U}_{\text{exp}}$. Here $\mathbf{U}_{\text{id}}$ and $\mathbf{U}_{\text{exp}}$ are the unitary matrices of the mode-$2$ and mode-$3$ HOSVD of the 3D data tensor $\mathcal{T}$, which stacks the mean-subtracted face offsets from the training dataset along the identity and expression dimensions, respectively.
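As a minimal sketch of how Eq. (1) is evaluated, the two mode products can be written as a double contraction of the core tensor with the identity and expression weights. The function name `mode_product_face`, the nested-list tensor layout, and the toy numbers are illustrative assumptions, not part of the original method description.

```python
def mode_product_face(mean_face, core, w_id, w_exp):
    """Evaluate f = mean_face + C x_2 w_id^T x_3 w_exp^T.

    core[s][i][e] is the (assumed) nested-list layout of the core tensor:
    s indexes shape coordinates, i identity modes, e expression modes.
    """
    face = []
    for s, mean_coord in enumerate(mean_face):
        # Contract the identity and expression modes for coordinate s.
        offset = sum(
            core[s][i][e] * w_id[i] * w_exp[e]
            for i in range(len(w_id))
            for e in range(len(w_exp))
        )
        face.append(mean_coord + offset)
    return face


# Toy example: one vertex (3 coordinates), 2 identity modes, 2 expression modes.
mean_face = [0.0, 0.0, 1.0]
core = [
    [[1.0, 0.0], [0.0, 2.0]],  # x coordinate
    [[0.0, 1.0], [1.0, 0.0]],  # y coordinate
    [[0.5, 0.5], [0.5, 0.5]],  # z coordinate
]
w_id = [1.0, 2.0]
w_exp = [3.0, 1.0]
face = mode_product_face(mean_face, core, w_id, w_exp)
print(face)  # [7.0, 7.0, 7.0]
```

With realistic tensor sizes one would normally use an array library for this contraction; the explicit loops here only make the index structure of the mode products visible.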

3.2 Statistical Modeling of the Multilinear Face Model

Rather than matching a deterministic face template to the target point cloud, we apply a statistical model in which the potential face shape varies within a learned shape uncertainty around the mean face, which gives us a better chance of finding a suitable face prototype compatible with the target point cloud. Such a model provides a probabilistic prior for robust face pose tracking.

3.2.1 Identity and Expression Priors

According to Eq. 1, the multilinear model is controlled by the identity weight $\mathbf{w}_{\text{id}}$ and expression weight $\mathbf{w}_{\text{exp}}$ . It is convenient to assume that $\mathbf{w}_{\text{id}}$ and $\mathbf{w}_{\text{exp}}$ both follow Gaussian distributions: $\mathbf{w}_{\text{id}}=\boldsymbol{\mu}_{\text{id}}+\boldsymbol{\epsilon}_{\text{id}},\boldsymbol{\epsilon}_{\text{id}}\sim\mathcal{N}(\boldsymbol{\epsilon}_{\text{id}}|\mathbf{0},\boldsymbol{\Sigma}_{\text{id}})$ and $\mathbf{w}_{\text{exp}}=\boldsymbol{\mu}_{\text{exp}}+\boldsymbol{\epsilon}_{\text{exp}},\boldsymbol{\epsilon}_{\text{exp}}\sim\mathcal{N}(\boldsymbol{\epsilon}_{\text{exp}}|\mathbf{0},\boldsymbol{\Sigma}_{\text{exp}})$ .

Notice that $\boldsymbol{\mu}_{\text{id}}$ (or $\boldsymbol{\mu}_{\text{exp}}$ ) should not be $\mathbf{0}$ as it will possibly make the face model $\mathbf{f}$ insensitive to $\mathbf{w}_{\text{exp}}$ (or $\mathbf{w}_{\text{id}}$ ) [46]. If we assume either $\boldsymbol{\mu}_{\text{id}}$ (or $\boldsymbol{\mu}_{\text{exp}})\simeq\mathbf{0}$ , the variation of the expression (or identity) parameters will be less significant for forming the face shape, i.e., $\mathcal{C}\times_{2}\mathbf{w}_{\text{id}}^{\top}\times_{3}\mathbf{w}_{\text{exp}}^{\top}\simeq\mathbf{0}$ .

3.2.2 Multilinear Face Model

The canonical face model $\mathcal{M}$ with respect to $\mathbf{w}_{\text{id}}$ and $\mathbf{w}_{\text{exp}}$ can be written by re-organizing Eq. (1), as

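$$\mathbf{f}=\bar{\mathbf{f}}+\mathcal{C}\times_{2}\boldsymbol{\mu}_{\text{id}}\times_{3}\boldsymbol{\mu}_{\text{exp}}+\mathcal{C}\times_{2}\boldsymbol{\epsilon}_{\text{id}}\times_{3}\boldsymbol{\mu}_{\text{exp}}+\mathcal{C}\times_{2}\boldsymbol{\mu}_{\text{id}}\times_{3}\boldsymbol{\epsilon}_{\text{exp}}+\mathcal{C}\times_{2}\boldsymbol{\epsilon}_{\text{id}}\times_{3}\boldsymbol{\epsilon}_{\text{exp}}. \tag{2}$$

The last term in Eq. (2) is usually negligible in the shape variation, as illustrated in Fig. 2. Therefore, $\mathcal{M}$ approximately follows a Gaussian distribution as

$$p_{\mathcal{M}}(\mathbf{f})=\mathcal{N}(\mathbf{f}|\boldsymbol{\mu}_{\mathcal{M}},\boldsymbol{\Sigma}_{\mathcal{M}}), \tag{3}$$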

where its mean face shape is $\boldsymbol{\mu}_{\mathcal{M}}=\bar{\mathbf{f}}+\mathcal{C}\times_{2}\boldsymbol{\mu}_{\text{id}}\times_{3}\boldsymbol{\mu}_{\text{exp}}$ , and its variance matrix is $\boldsymbol{\Sigma}_{\mathcal{M}}=\mathbf{P}_{\text{id}}\boldsymbol{\Sigma}_{\text{id}}\mathbf{P}_{\text{id}}^{\top}+\mathbf{P}_{\text{exp}}\boldsymbol{\Sigma}_{\text{exp}}\mathbf{P}_{\text{exp}}^{\top}$ . The projection matrices $\mathbf{P}_{\text{id}}$ and $\mathbf{P}_{\text{exp}}$ for identity and expression are defined as $\mathbf{P}_{\text{id}}=\mathcal{C}\times_{3}\boldsymbol{\mu}_{\text{exp}}\in\mathbb{R}^{3N_{\mathcal{M}}\times N_{\text{id}}}$ and $\mathbf{P}_{\text{exp}}=\mathcal{C}\times_{2}\boldsymbol{\mu}_{\text{id}}\in\mathbb{R}^{3N_{\mathcal{M}}\times N_{\text{exp}}}$ , respectively.

Since we are also interested in estimating the identity distribution for personalization of the face shape, we convert the canonical face distribution into a joint distribution for the face shape $\mathbf{f}$ and the identity parameter $\mathbf{w}_{\text{id}}$ , as

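$$p(\mathbf{f},\mathbf{w}_{\text{id}})=\mathcal{N}(\mathbf{f}|\bar{\mathbf{f}}+\mathbf{P}_{\text{id}}\mathbf{w}_{\text{id}},\boldsymbol{\Sigma}_{\mathcal{E}})\,\mathcal{N}(\mathbf{w}_{\text{id}}|\boldsymbol{\mu}_{\text{id}},\boldsymbol{\Sigma}_{\text{id}}), \tag{4}$$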

where the expression variance is $\boldsymbol{\Sigma}_{\mathcal{E}}=\mathbf{P}_{\text{exp}}\boldsymbol{\Sigma}_{\text{exp}}\mathbf{P}_{\text{exp}}^{\top}$ .

As shown in Fig. 2, the overall shape variation (represented as per-pixel standard deviation) is, unsurprisingly, most significant in the facial region as compared to other parts of the head. We further observe that this shape variation is dominated by differences in identities, as encoded by $\boldsymbol{\Sigma}_{\mathcal{I}}=\mathbf{P}_{\text{id}}\boldsymbol{\Sigma}_{\text{id}}\mathbf{P}_{\text{id}}^{\top}$ . As expected, the shape uncertainties of expressions, quantified by $\boldsymbol{\Sigma}_{\mathcal{E}}$ are usually localized around the mouth and chin, as well as the regions around cheek and eyebrow. More importantly, the variation by the residual term in Eq. (2) has a much lower magnitude than those caused solely by identity and expression. Note that the shape variations from identity or expression heavily depend on the dataset statistics that underpin this probabilistic multilinear model.

3.2.3 Estimating Hyper-parameters

We employ the FaceWarehouse dataset [25] as the training dataset since it contains face meshes with a comprehensive set of expressions ( $N_{\text{exp}}=50$ ) and a variety of identities ( $N_{\text{id}}=150$ ) from different ages, genders and races.

Assigning each face mesh in the training dataset to two one-hot vectors $\mathbf{x}_{\text{id}}$ and $\mathbf{x}_{\text{exp}}$ for identity and expression, respectively, we find that $\mathbf{x}_{\text{id}}$ / $\mathbf{x}_{\text{exp}}$ and $\mathbf{w}_{\text{id}}$ / $\mathbf{w}_{\text{exp}}$ are linearly connected. Because the face mesh is written as $\bar{\mathbf{f}}+\mathcal{T}\times_{2}\mathbf{x}_{\text{id}}^{\top}\times_{3}\mathbf{x}_{\text{exp}}^{\top}=\bar{\mathbf{f}}+\mathcal{C}\times_{2}(\mathbf{U}_{\text{id}}^{\top}\mathbf{x}_{\text{id}})^{\top}\times_{3}(\mathbf{U}_{\text{exp}}^{\top}\mathbf{x}_{\text{exp}})^{\top}=\bar{\mathbf{f}}+\mathcal{C}\times_{2}\mathbf{w}_{\text{id}}^{\top}\times_{3}\mathbf{w}_{\text{exp}}^{\top},$ we have $\mathbf{w}_{\text{id}}=\mathbf{U}_{\text{id}}^{\top}\mathbf{x}_{\text{id}}$ and $\mathbf{w}_{\text{exp}}=\mathbf{U}_{\text{exp}}^{\top}\mathbf{x}_{\text{exp}}$ .

The mean face $\bar{\mathbf{f}}$ requires $\bar{\mathbf{x}}_{\text{id}}=\frac{1}{N_{\text{id}}}\mathbf{1}$ and $\bar{\mathbf{x}}_{\text{exp}}=\frac{1}{N_{\text{exp}}}\mathbf{1}$ . The variances $\text{Var}(\mathbf{x}_{\text{id}})\simeq\frac{1}{N_{\text{id}}}\mathbf{I}$ and $\text{Var}(\mathbf{x}_{\text{exp}})\simeq\frac{1}{N_{\text{exp}}}\mathbf{I}$ , where $\mathbf{I}$ is the identity matrix. Thus the hyper-parameters in the prior distributions can be estimated accordingly, such that $\boldsymbol{\mu}_{\text{id}}=\frac{1}{N_{\text{id}}}\mathbf{U}_{\text{id}}^{\top}\mathbf{1}$ and $\boldsymbol{\mu}_{\text{exp}}=\frac{1}{N_{\text{exp}}}\mathbf{U}_{\text{exp}}^{\top}\mathbf{1}$ , and $\boldsymbol{\Sigma}_{\text{id}}\simeq\frac{1}{N_{\text{id}}}\mathbf{U}_{\text{id}}^{\top}\mathbf{U}_{\text{id}}=\frac{1}{N_{\text{id}}}\mathbf{I}$ and $\boldsymbol{\Sigma}_{\text{exp}}\simeq\frac{1}{N_{\text{exp}}}\mathbf{U}_{\text{exp}}^{\top}\mathbf{U}_{\text{exp}}=\frac{1}{N_{\text{exp}}}\mathbf{I}$ .
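The approximation $\text{Var}(\mathbf{x}_{\text{id}})\simeq\frac{1}{N_{\text{id}}}\mathbf{I}$ can be checked with a short calculation. The sketch below assumes, for illustration only, that every identity is equally likely in the training set, so that $\mathbf{x}_{\text{id}}$ is uniform over the one-hot vectors $\mathbf{e}_{1},\ldots,\mathbf{e}_{N_{\text{id}}}$; the expression case is analogous.

```latex
\begin{align*}
\mathbb{E}[\mathbf{x}_{\text{id}}]
  &= \frac{1}{N_{\text{id}}}\sum_{i=1}^{N_{\text{id}}}\mathbf{e}_{i}
   = \frac{1}{N_{\text{id}}}\mathbf{1}, \\
\mathbb{E}[\mathbf{x}_{\text{id}}\mathbf{x}_{\text{id}}^{\top}]
  &= \frac{1}{N_{\text{id}}}\sum_{i=1}^{N_{\text{id}}}\mathbf{e}_{i}\mathbf{e}_{i}^{\top}
   = \frac{1}{N_{\text{id}}}\mathbf{I}, \\
\text{Var}(\mathbf{x}_{\text{id}})
  &= \mathbb{E}[\mathbf{x}_{\text{id}}\mathbf{x}_{\text{id}}^{\top}]
   - \mathbb{E}[\mathbf{x}_{\text{id}}]\,\mathbb{E}[\mathbf{x}_{\text{id}}]^{\top}
   = \frac{1}{N_{\text{id}}}\mathbf{I}-\frac{1}{N_{\text{id}}^{2}}\mathbf{1}\mathbf{1}^{\top}
   \simeq \frac{1}{N_{\text{id}}}\mathbf{I}.
\end{align*}
```

The dropped term has entries of order $1/N_{\text{id}}^{2}$, one factor of $N_{\text{id}}$ smaller than the diagonal entries that are kept; with $N_{\text{id}}=150$ this is $1/22500$ against $1/150$.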

4 Probabilistic Facial Pose Tracking

In this section, we present our probabilistic facial pose tracking approach. Fig. 3 shows the overall architecture, which consists of two main components: 1) robust facial pose tracking, and 2) online switchable identity adaptation. The goal of the first component is to estimate the rigid facial pose $\boldsymbol{\theta}$ , given an input depth image and the current facial model. The second component aims to update the distribution of the identity parameter $\mathbf{w}_{\text{id}}$ and the face model $p_{\mathcal{M}}(\mathbf{f})$ , given the previous face model, the current pose parameter, and the input depth image.

4.1 Notation and Prerequisites

Notation. The input depth sequence is $\{\mathbf{D}_{t}\}_{t=1}^{T}$, where $T$ is the number of frames. The goal of the proposed system is to estimate the facial poses $\{\boldsymbol{\theta}^{(t)}\}_{t=1}^{T}$ and all identities $\{\mathcal{I}_{k}\}_{k=1}^{K}$ contained in this sequence. Each identity $\mathcal{I}_{k}$ is parameterized by the parameters of its identity distribution; see Sec. 4.4 for a detailed description.

2D and 3D Conversion. Let the matrix-form camera parameters be $\mathbf{K}=\left[\begin{smallmatrix}f&0&u_{\mathbf{o}}\\ 0&f&v_{\mathbf{o}}\\ 0&0&1\end{smallmatrix}\right]$, where $f$ is the focal length and $\mathbf{o}=[u_{\mathbf{o}},v_{\mathbf{o}}]$ is the principal point. A 3D point $\mathbf{p}=[x,y,z]^{\top}$ can be perspectively projected onto the pixel coordinates

$$[u,v,1]^{\top}=\pi(\mathbf{p})=\mathbf{K}[x/z,y/z,1]^{\top}. \tag{5}$$

Inversely, a pixel $\mathbf{x}=[u,v]$ can be back-projected as

$$\mathbf{p}=\pi^{-1}(\mathbf{x},\mathbf{D}(\mathbf{x}))=\mathbf{K}^{-1}[u,v,1]^{\top}\cdot\mathbf{D}(\mathbf{x}), \tag{6}$$

where $\mathbf{D}(\mathbf{x})$ is the depth at the pixel $\mathbf{x}$ .
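The projection and back-projection above can be sketched in a few lines, with $\mathbf{K}$ and $\mathbf{K}^{-1}$ expanded by hand for the pinhole model. The function names and the intrinsics (`f`, `u0`, `v0`) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def project(point, f, u0, v0):
    """Perspective projection pi(p): 3D point [x, y, z] -> pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (f * x / z + u0, f * y / z + v0)


def back_project(pixel, depth, f, u0, v0):
    """Back-projection pi^{-1}(x, D(x)) = K^{-1} [u, v, 1]^T * D(x)."""
    u, v = pixel
    return ((u - u0) * depth / f, (v - v0) * depth / f, depth)


# Hypothetical intrinsics for a 640x480 depth sensor (assumed values).
f, u0, v0 = 575.0, 320.0, 240.0
p = (100.0, -50.0, 800.0)  # millimetres
pixel = project(p, f, u0, v0)
p_back = back_project(pixel, p[2], f, u0, v0)
```

Back-projecting the projected pixel with the original depth recovers the 3D point up to floating-point error, which is a quick consistency check for any implementation of the two mappings.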

Rigid Transformation. The rigid facial pose $\boldsymbol{\theta}$ consists of the rotation angles $\boldsymbol{\omega}\in\mathbb{R}^{3}$ , the translation vector $\mathbf{t}\in\mathbb{R}^{3}$ , and an auxiliary scale factor $\alpha$ . Thus $\boldsymbol{\theta}=\{\boldsymbol{\omega},\mathbf{t},\alpha\}$ indicates a transformation of the face model,

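$$\mathbf{q}_{n}=\mathbf{T}(\boldsymbol{\theta})\circ\mathbf{f}_{n}=e^{\alpha}\mathbf{R}(\boldsymbol{\omega})\mathbf{f}_{n}+\mathbf{t},\quad n\in\{1,\ldots,N_{\mathcal{M}}\}, \tag{7}$$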

where $\mathbf{R}(\boldsymbol{\omega})\in\mathbb{R}^{3\times 3}$ is the rotation matrix converted from $\boldsymbol{\omega}$ by Rodrigues’ rotation formula. The exponential scale $e^{\alpha}$ ensures the necessary positivity of the scale factor, $\forall\alpha\in\mathbb{R}$ .

The auxiliary scale factor is introduced to model possible deviations of scale beyond those observed in the training dataset, for example when tracking children. Although an iterative optimization that alternates between model personalization and rigid pose estimation could resolve this issue on its own, an explicit global scale effectively compensates for the scale incompatibility and speeds up the optimization.
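The transformation $\mathbf{q}_{n}=e^{\alpha}\mathbf{R}(\boldsymbol{\omega})\mathbf{f}_{n}+\mathbf{t}$ can be sketched as follows. The sketch assumes that $\boldsymbol{\omega}$ is an axis-angle vector (direction gives the axis, norm gives the angle), which is the usual input to Rodrigues' formula; the helper names `rodrigues` and `transform` are illustrative.

```python
import math


def rodrigues(omega):
    """Rotation matrix R(omega) from an axis-angle vector (assumed convention)."""
    angle = math.sqrt(sum(c * c for c in omega))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / angle for c in omega)
    # Cross-product matrix of the unit axis and its square.
    cross = [[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]]
    cross2 = [[sum(cross[i][m] * cross[m][j] for m in range(3)) for j in range(3)]
              for i in range(3)]
    s, c = math.sin(angle), 1.0 - math.cos(angle)
    # R = I + sin(angle) * cross + (1 - cos(angle)) * cross^2
    return [[(1.0 if i == j else 0.0) + s * cross[i][j] + c * cross2[i][j]
             for j in range(3)] for i in range(3)]


def transform(theta, f_n):
    """Apply q_n = e^alpha * R(omega) * f_n + t to a single vertex f_n."""
    omega, t, alpha = theta
    R = rodrigues(omega)
    scale = math.exp(alpha)  # always positive, for any real alpha
    return [scale * sum(R[i][j] * f_n[j] for j in range(3)) + t[i] for i in range(3)]


# 90-degree rotation about the z axis, scale 2 (alpha = ln 2), shift of 10 along z.
theta = ([0.0, 0.0, math.pi / 2], [0.0, 0.0, 10.0], math.log(2.0))
q = transform(theta, [1.0, 0.0, 0.0])  # approximately [0, 2, 10]
```

Parameterizing the scale as $e^{\alpha}$ lets an unconstrained optimizer update $\alpha$ freely without ever producing a non-positive scale.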

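Transformed Face Distribution. We described the probabilistic formulation of the morphable face model in Sec. 3. The rigidly transformed face model $\mathcal{Q}$ has, for each $\mathbf{q}_{n}\in\mathcal{Q}$, a marginal distribution similar to Eq. (3), but with rotation:

$$p_{\mathcal{Q}}(\mathbf{q}_{n};\boldsymbol{\theta})=\mathcal{N}(\mathbf{q}_{n}|\mathbf{T}(\boldsymbol{\theta})\circ\boldsymbol{\mu}_{\mathcal{M},[n]},e^{2\alpha}\boldsymbol{\Sigma}_{\mathcal{M},[n]}^{(\boldsymbol{\omega})}), \tag{8}$$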

where $\boldsymbol{\mu}_{\mathcal{M},[n]}$ and $\boldsymbol{\Sigma}_{\mathcal{M},[n]}^{(\boldsymbol{\omega})}$ are the mean and the rotated variance matrix for point $\mathbf{f}_{n}$ , respectively. Moreover, we have $\boldsymbol{\Sigma}_{\mathcal{M},[n]}^{(\boldsymbol{\omega})}=\mathbf{R}(\boldsymbol{\omega})\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{R}(\boldsymbol{\omega})^{\top}$ . $\boldsymbol{\mu}_{\mathcal{M},[n]}$ and $\boldsymbol{\Sigma}_{\mathcal{M},[n]}$ are the $n^{\text{th}}$ blocks corresponding to $\mathbf{f}_{n}$ in $\boldsymbol{\mu}_{\mathcal{M}}$ and $\boldsymbol{\Sigma}_{\mathcal{M}}$ .
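The way the per-point Gaussian changes under the rigid transformation can be illustrated with a small sketch: the mean is transformed like a point, while the covariance is rotated and scaled by $e^{2\alpha}$. The helper names and the toy covariance are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def transform_gaussian(mu, sigma, R, t, alpha):
    """Mean and covariance of q = e^alpha * R * f + t for f ~ N(mu, sigma)."""
    scale = math.exp(alpha)
    mean_q = [scale * sum(R[i][j] * mu[j] for j in range(3)) + t[i] for i in range(3)]
    rotated = matmul(matmul(R, sigma), transpose(R))            # R Sigma R^T
    cov_q = [[scale ** 2 * v for v in row] for row in rotated]  # e^{2 alpha} R Sigma R^T
    return mean_q, cov_q


# Exact 90-degree rotation about z: the large variance along x moves to y.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
sigma = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.25]]
mean_q, cov_q = transform_gaussian([1.0, 0.0, 0.0], sigma, R, [0.0, 0.0, 5.0], 0.0)
```

The translation $\mathbf{t}$ affects only the mean, which is why it does not appear in the covariance term of the transformed distribution.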

4.2 Face Localization

The face localization procedure infers the probable area containing a face in a depth image. It is a vital preprocessing step to extract the point cloud $\mathcal{P}$ from the depth image and a fairly good pose initialization for the face tracking system.

System Initialization. By applying an efficient filtering-based head localization method [37] to the first frame $\mathbf{D}_{1}$, the face ROI is localized at the pixel $\mathbf{c}$ at which the highest correlation with the average human head-shoulder template is achieved. The size of the ROI is depth-adaptive: its height is $h=f\bar{h}/d_{\mathbf{c}}$ and its width is $w=f\bar{w}/d_{\mathbf{c}}$, where $d_{\mathbf{c}}=\mathbf{D}_{1}(\mathbf{c})$ is the depth at $\mathbf{c}$. The value $\bar{h}=240\text{mm}$ is empirically set as the average head height, and $\bar{w}=320\text{mm}$ is around twice the average head width, which generally ensures coverage of the entire face. The point cloud $\mathcal{P}$ is a set of 3D points converted from depth pixels in the face ROI. In addition, the initial pose consists of no rotation, i.e., $\boldsymbol{\omega}=\mathbf{0}$, and a translation $\mathbf{t}=\pi^{-1}(\mathbf{c},d_{\mathbf{c}})$. The scale is initialized to $\alpha=0$ by default.
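The depth-adaptive ROI size can be computed directly from the two formulas above. In this sketch the function name `face_roi_size` and the focal length of 575 pixels are hypothetical; only $\bar{h}=240$ mm and $\bar{w}=320$ mm come from the text.

```python
def face_roi_size(f, depth_mm, head_height_mm=240.0, roi_width_mm=320.0):
    """Depth-adaptive ROI size in pixels: h = f * h_bar / d_c, w = f * w_bar / d_c."""
    if depth_mm <= 0:
        raise ValueError("depth at the ROI centre must be positive")
    return f * head_height_mm / depth_mm, f * roi_width_mm / depth_mm


# Hypothetical focal length of 575 px; head at 800 mm and then at 1600 mm.
h_near, w_near = face_roi_size(575.0, 800.0)
h_far, w_far = face_roi_size(575.0, 1600.0)
print(h_near, w_near)  # 172.5 230.0
print(h_far, w_far)    # 86.25 115.0
```

Doubling the distance halves the ROI in both dimensions, so the window keeps covering the same metric extent of the head regardless of how far the subject is from the sensor.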

During Tracking. The point cloud is extracted in a manner similar to the system initialization, but the center pixel from the previous frame is corrected by the translation vector. The pose is initialized from the previous frame pose estimate.

4.3 Robust Facial Pose Tracking

An optimal pose suggests that the input point cloud $\mathcal{P}$ should fall within a high density region of the distribution of the warped face model $\mathcal{Q}$. However, in uncontrolled scenarios, we often encounter self-occlusions or object-to-face occlusions, like hair, glasses and fingers/hands, as shown in Fig. 4. In these scenarios, even if the face model $\mathcal{Q}$ and the point cloud $\mathcal{P}$ are correctly aligned, $\mathcal{Q}$ can only partially fit a subset of 3D points in $\mathcal{P}$ while leaving the remaining points in $\mathcal{Q}$ occluded.

Therefore, it is important to find the visible parts of $\mathcal{Q}$, based on which we can robustly track the facial pose. We do not follow correspondence-based methods, such as distance thresholding and normal vector compatibility checks [11], to identify the visible regions, since finding reliable correspondences is itself challenging. Instead, we propose a ray visibility constraint to regularize the visibility of each face model point, based on our developed statistical face prior.

4.3.1 Ray Visibility Constraint

Denote the ray connecting the camera center to a face model point $\mathbf{q}_{n}$ as $\vec{v}_{\mathbf{q}_{n}}$ . This ray intersects with the point cloud $\mathcal{P}$ at a point $\mathbf{p}_{n}$ , which can be found by matching the pixel location of $\mathbf{q}_{n}$ in the input depth image [11, 15].

The role of the proposed ray visibility constraint (RVC) is to examine the physically reasonable relative position between $\mathbf{p}_{n}$ and $\mathbf{q}_{n}$ if the face model complies with a valid facial pose:

1. Visible $\mathbf{q}_{n}$ is close to the local surface around the connected $\mathbf{p}_{n}$ in the point cloud $\mathcal{P}$ ;

2. Occluded $\mathbf{q}_{n}$ must be located further away and behind the surface around $\mathbf{p}_{n}$ ;

3. Invalid $\mathbf{q}_{n}$ is in front of the surface around $\mathbf{p}_{n}$ , which is impossible and should be avoided.

Fig. 5 shows an example. In this study, we propose a statistical formulation of the RVC to automatically adjust the pose of the face model $\mathcal{Q}$ . Eventually, the face model will tightly but partially fit the point cloud $\mathcal{P}$ while leaving the rest of the points as occlusions. The shape uncertainty caused by expressions is also absorbed in this constraint, so that expression variations do not harm the pose estimation. Note that in our setting, invalid point pairs are penalized so that the invalid face points are pushed farther away.

Statistical Formulation. Assume the local surface of $\mathcal{P}$ around $\mathbf{p}_{n}$ is the plane $y=\mathbf{n}_{n}^{\top}(\mathbf{p}-\mathbf{p}_{n})$ , where $\mathbf{n}_{n}$ is the normal vector at $\mathbf{p}_{n}$ and $y$ is the signed distance of $\mathbf{p}$ to this plane. Thus the signed distance $y_{n}$ of a face point $\mathbf{q}_{n}$ to the surface around $\mathbf{p}_{n}$ is

$$y_{n}=\Delta(\mathbf{q}_{n};\mathbf{p}_{n})=\mathbf{n}_{n}^{\top}(\mathbf{q}_{n}-\mathbf{p}_{n}),$$

as visualized in Fig. 6(a). Similarly to Eq. (8), the distribution of the signed distance $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n};\boldsymbol{\theta})$ , called the projected face distribution onto the surface around $\mathcal{P}$ , will be

[TABLE]

where $\sigma_{o}^{2}$ is the noise variance describing the surface modeling errors and the depth sensor’s systematic errors. Moreover, the surface distribution around $\mathbf{p}_{n}$ is assumed Gaussian $p_{\mathcal{P}}(y_{n})=\mathcal{N}(y_{n}|0,\sigma_{o}^{2})$ , whose variance subsumes the modeling and systematic noise.

By profiling the distributions of the point pair $\{\mathbf{p}_{n},\mathbf{q}_{n}\}$ from the ray $\vec{v}_{\mathbf{q}_{n}}$ in Fig. 6(c), the visibility (as $\gamma_{n}=1$ ) is interpreted as the projected face distribution $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n};\boldsymbol{\theta})$ “overlapping” or being “in front” of the surface distribution $p_{\mathcal{P}}(y_{n})$ . More precisely, we define visible surface points $\mathbf{q}_{n}$ as those for which $\Delta(\mathbf{T}(\boldsymbol{\theta})\circ\boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_{n})$ is inside one standard deviation significance of $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n})$ or negative (the normal $\mathbf{n}_{n}$ is set to point away from the camera center, thus a negative $y_{n}$ means that the face point $\mathbf{q}_{n}$ is in front of the surface of $\mathbf{p}_{n}$ ),

[TABLE]

The occlusion (as $\gamma_{n}=0$ ) requires that $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n};\boldsymbol{\theta})$ is “behind” the major mass of the surface distribution, thus the signed distance $y_{n}$ should always be positive and beyond the confidence interval of $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n};\boldsymbol{\theta})$ , i.e.,

[TABLE]
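The labeling rule described above can be sketched as follows. This is only an illustration: the threshold is assumed here to be one standard deviation `sigma` of the projected face distribution, and the function names are hypothetical. The signed distance follows the plane definition $y=\mathbf{n}_{n}^{\top}(\mathbf{p}-\mathbf{p}_{n})$ evaluated at $\mathbf{q}_{n}$.

```python
def signed_distance(q, p, n):
    """y = n^T (q - p); n is assumed unit-length and pointing away from the camera."""
    return sum(ni * (qi - pi) for ni, qi, pi in zip(n, q, p))


def visibility_label(y, sigma):
    """gamma = 1 (visible) if y is negative or within one std, else 0 (occluded).

    Assumption: the occlusion threshold is exactly one standard deviation.
    """
    return 1 if y <= sigma else 0


p = (0.0, 0.0, 1000.0)   # cloud point hit by the ray (mm)
n = (0.0, 0.0, 1.0)      # surface normal, pointing away from the camera
sigma = 5.0              # illustrative standard deviation (mm)

y_close = signed_distance((0.0, 0.0, 1003.0), p, n)   # slightly behind the surface
y_behind = signed_distance((0.0, 0.0, 1040.0), p, n)  # far behind: something occludes it
y_front = signed_distance((0.0, 0.0, 990.0), p, n)    # in front: physically invalid

labels = [visibility_label(y, sigma) for y in (y_close, y_behind, y_front)]
```

A point in front of the surface keeps the label $\gamma_{n}=1$ on purpose: it then contributes to the alignment cost, which pushes the face model back behind the measured surface.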

4.3.2 Ray Visibility Score

The ray visibility score (RVS) is converted from the RVC for the rigid facial pose estimation, which measures the compatibility between the distributions of the transformed face model $\mathcal{Q}$ and the input point cloud $\mathcal{P}$ .

For a ray $\vec{v}_{\mathbf{q}_{n}}$ between the face model point $\mathbf{q}_{n}$ and the cloud point $\mathbf{p}_{n}$ , as the visibility $\gamma_{n}$ for $\mathbf{q}_{n}$ is judged by the pose $\boldsymbol{\theta}$ , the distribution of $\mathbf{p}_{n}$ is now refined as

[TABLE]

where $\mathcal{U_{O}}(y_{n})=U_{\mathcal{O}}$ is a uniform distribution over the range $0<y_{n}<2500$ mm. Eq. (11) takes into account the visibility labels. When $\mathbf{q}_{n}$ is visible, $\mathbf{p}_{n}$ has a compatible surface distribution of $\mathcal{N}(y_{n}|0,\sigma_{o}^{2})$ . However, if $\mathbf{q}_{n}$ is occluded, $\mathbf{p}_{n}$ can be arbitrary as long as it is in front of $\mathbf{q}_{n}$ , which we model as a uniform distribution $\mathcal{U_{O}}(y_{n})$ . $p_{\mathcal{P}}(y_{n};\boldsymbol{\theta})$ can be regarded as a noisy face measurement contaminated by occlusions, while the projected face distribution $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n};\boldsymbol{\theta})$ represents the face model with its own uncertainties along the local surfaces. If ideally aligned, these distributions should be co-located and cover each other in a statistical manner.

The RVS score $\mathcal{S}(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ thus measures the similarity between $p_{\mathcal{P}}(\mathbf{y};\boldsymbol{\theta})=\prod_{n=1}^{N_{\mathcal{M}}}p_{\mathcal{P}}(y_{n};\boldsymbol{\theta})$ and $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})=\prod_{n=1}^{N_{\mathcal{M}}}p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n};\boldsymbol{\theta})$ by the Kullback-Leibler divergence,

[TABLE]

so that the more similar $p_{\mathcal{P}}(\mathbf{y};\boldsymbol{\theta})$ and $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})$ are, the smaller $\mathcal{L}_{\text{rvs}}(\boldsymbol{\theta})$ is. Thus, the optimal pose parameter $\boldsymbol{\theta}^{*}$ is the one minimizing the RVS score, i.e., $\boldsymbol{\theta}^{*}=\arg\min_{\boldsymbol{\theta}}\mathcal{L}_{\text{rvs}}(\boldsymbol{\theta})$ . The visibility labels are instantaneously obtained when the pose parameter is given. Note that Eq. (12) not only accounts for the visible points but also penalizes the number of occluded points, thus avoiding a degenerate solution where a majority of the face points are labeled as occluded.
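For visible points both distributions are one-dimensional Gaussians, so the Kullback-Leibler divergence has a closed form. The sketch below evaluates that standard closed form for a single ray; it is not the full Eq. (12), since the direction of the divergence, the occluded (uniform) term and the model variance `var_q` are assumptions made for illustration.

```python
import math


def kl_gauss_1d(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for one-dimensional Gaussians."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)


SIGMA_O2 = 25.0  # sigma_o^2 from the implementation details (mm^2)


def visible_ray_cost(y_n, var_q):
    """Cost of one visible ray: surface N(0, sigma_o^2) against the projected
    face distribution, assumed here to be N(y_n, var_q)."""
    return kl_gauss_1d(0.0, SIGMA_O2, y_n, var_q)


aligned = visible_ray_cost(0.0, 25.0)  # identical distributions
shifted = visible_ray_cost(5.0, 25.0)  # face point 5 mm off the surface
```

With equal variances the cost reduces to $y_{n}^{2}/(2\sigma^{2})$, a point-wise squared distance like the one minimized by point-to-plane ICP.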

4.3.3 Temporal Constraint

The single-frame rigid pose estimation can effectively employ the proposed ray visibility score; however, the lack of temporal cohesion results in unstable pose and identity estimation over a series of frames. In this part, we relieve these drawbacks by enforcing temporal pose coherence over adjacent frames. Moreover, since a fixed identity should bear a fixed scale, the estimated scale should be accumulated over a long period of the sequence to stabilize the identity estimation.

Temporal Coherence. Since the face model and the point clouds should concurrently follow the same rigid motion between adjacent frames, the pixel-wise flows in the input sequences can be used to regularize the pose changes of the face model. Consider a facial pixel $\mathbf{x}_{m}^{t-1}$ at frame $t-1$ . A change in pose of $\Delta\boldsymbol{\theta}$ will induce a change of the pixel location as

[TABLE]

Therefore, an optimal $\Delta\boldsymbol{\theta}$ will eliminate the depth difference between $\mathbf{D}_{t}(\mathbf{W}(\mathbf{x}_{m}^{t-1},\Delta\boldsymbol{\theta}))$ in frame $t$ and the depth value of the transformed point as $\mathcal{Z}(\mathbf{x}_{m}^{t-1};\Delta\boldsymbol{\theta})=[0,0,1]\mathbf{T}(\Delta\boldsymbol{\theta})\circ\pi^{-1}(\mathbf{x}_{m}^{t-1},\mathbf{D}_{t-1}(\mathbf{x}_{m}^{t-1}))$ . We use a temporal smoothness term for pose changes in adjacent frames to account for this fact, i.e.,

[TABLE]

for a set of visible pixels $\{\mathbf{x}_{m}^{t-1}\}_{m=1}^{M^{t-1}}$ in frame $t-1$ , with the temporal variance controlled by $\sigma_{t}^{2}$ .
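The depth difference that the temporal term penalizes can be sketched for a single pixel as below. This is a simplified illustration under several assumptions: a pinhole camera with hypothetical intrinsics `f` and `c`, a purely rigid motion given as a rotation matrix `R` and translation `t`, nearest-pixel lookup in place of interpolation, and no bounds checking.

```python
import numpy as np


def back_project(x, d, f, c):
    """pi^{-1}: pixel x = (u, v) with depth d -> 3D point in the camera frame."""
    return np.array([(x[0] - c[0]) * d / f, (x[1] - c[1]) * d / f, d])


def project(P, f, c):
    """pi: 3D point -> pixel coordinates."""
    return np.array([f * P[0] / P[2] + c[0], f * P[1] / P[2] + c[1]])


def depth_residual(x, D_prev, D_cur, R, t, f, c):
    """D_t(W(x, dtheta)) - Z(x; dtheta) for one pixel x of frame t-1."""
    P = back_project(x, D_prev[x[1], x[0]], f, c)
    P_new = R @ P + t                                   # rigid motion applied to the point
    u, v = np.rint(project(P_new, f, c)).astype(int)    # warped pixel, nearest neighbor
    return D_cur[v, u] - P_new[2]


f, c = 575.0, (320.0, 240.0)            # illustrative intrinsics only
D_prev = np.full((480, 640), 1000.0)    # flat surface at 1000 mm in frame t-1
D_cur = np.full((480, 640), 1010.0)     # the same surface, 10 mm farther in frame t

r_good = depth_residual((320, 240), D_prev, D_cur, np.eye(3), np.array([0.0, 0.0, 10.0]), f, c)
r_bad = depth_residual((320, 240), D_prev, D_cur, np.eye(3), np.zeros(3), f, c)
```

The correct incremental motion drives the residual to zero, while leaving the pose unchanged produces the full 10 mm discrepancy.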

In comparison with prediction-correction techniques like Kalman filtering, which smooth the facial poses according to the statistics of their preceding estimates, the proposed method explicitly relates the flows between adjacent facial points to the motion of the face model. It is therefore capable of rapidly capturing the incremental motion without pose flickers, and avoids the over-smoothing introduced by online filtering such as Kalman filtering or average filtering [47].

Scale Accumulation. The scale $e^{\alpha}$ belongs to the face model rather than to the rigid pose, which means that a fixed tracked subject should have a fixed scale. Therefore, the estimation of the scale factor requires explicit accumulation from the preceding estimates as long as the subject remains the same. The scale loss accounts for the mismatch between the previous scale and the current estimate

[TABLE]

where $\Delta\alpha=\alpha-\alpha^{(t-1)}$ , and $\alpha^{(t-1)}$ is the estimated scale factor in the previous frame. $\lambda^{(t)}_{s}$ is the cumulative precision across the frames, written as $\lambda^{(t)}_{s}=\lambda^{(t-1)}_{s}+\frac{1}{\sigma_{s}^{2}}$ , where $\sigma_{s}^{2}$ is a predefined variance of the scale factor.
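The precision recursion can be illustrated with a few lines of code. The recursion itself is taken from the text; the zero initial precision and the quadratic form of `scale_penalty` are assumptions made only for this sketch.

```python
SIGMA_S2 = 0.04  # sigma_s^2 from the implementation details


def accumulate_precision(lambda_prev, sigma_s2=SIGMA_S2):
    """lambda_s^(t) = lambda_s^(t-1) + 1 / sigma_s^2."""
    return lambda_prev + 1.0 / sigma_s2


def scale_penalty(alpha, alpha_prev, lam):
    """Assumed quadratic penalty on the scale change, weighted by the precision."""
    return 0.5 * lam * (alpha - alpha_prev) ** 2


lam = 0.0       # assumption: no accumulated confidence before the first frame
history = []
for _ in range(4):
    lam = accumulate_precision(lam)
    history.append(lam)

early = scale_penalty(0.05, 0.0, history[0])  # same scale change, early in the sequence
late = scale_penalty(0.05, 0.0, history[-1])  # same scale change, after four frames
```

Since the precision grows linearly with the number of frames, the same scale deviation becomes progressively more expensive, which is what stabilizes the scale for a fixed subject.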

4.3.4 Rigid Pose Estimation

The overall formulation for the rigid pose tracking combines the ray visibility score and the temporal constraint, as

[TABLE]

We seek to estimate the incremental pose parameters $\Delta\boldsymbol{\theta}$ between adjacent frames rather than the absolute poses $\boldsymbol{\theta}$ between the canonical face model and the input point cloud.

Solving Eq. (16) is challenging since $\mathcal{L}_{\text{rvs}}(\boldsymbol{\theta}^{(t-1)}+\Delta\boldsymbol{\theta})$ is highly nonlinear with no closed-form solution. In this work, we apply a recursive estimation method. In particular, in each iteration, we alternately determine the visibility labels $\boldsymbol{\gamma}$ given the previous pose parameters, and then estimate the incremental pose parameters $\Delta\boldsymbol{\theta}$ . In the first sub-problem for visibility labels, we apply the ray visibility constraint to all point pairs along $\{\vec{v}_{\mathbf{q}_{n}}\}_{n=1}^{N_{\mathcal{M}}}$ coming from the current pose estimate. In the second sub-problem for incremental pose estimation, we apply the quasi-Newton update using the trust region approach for the overall cost, given the current visibility labels. The process repeats until convergence or until a predefined number of iterations is reached.
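The alternation between visibility labeling and pose update can be illustrated on a deliberately reduced toy problem. In this sketch the pose is a single depth shift `s` along the rays, the threshold is one assumed standard deviation `sigma`, and the quasi-Newton trust-region step is replaced by a closed-form least-squares update over the visible points; none of these simplifications come from the paper.

```python
def alternate_pose_and_visibility(q, p, sigma, s0=0.0, max_iters=20, tol=1e-6):
    """Toy alternation: q are model depths, p are measured depths along the same rays."""
    s = s0
    gamma = [1] * len(q)
    for _ in range(max_iters):
        # Sub-problem 1: visibility labels from the current pose estimate.
        y = [qi + s - pi for qi, pi in zip(q, p)]
        gamma = [1 if yi <= sigma else 0 for yi in y]
        offsets = [pi - qi for qi, pi, g in zip(q, p, gamma) if g]
        if not offsets:
            break  # nothing visible: the pose cannot be updated
        # Sub-problem 2: pose update given the labels (least squares on visible points).
        s_new = sum(offsets) / len(offsets)
        converged = abs(s_new - s) < tol
        s = s_new
        if converged:
            break
    return s, gamma


q = [1000.0] * 6                                     # model depths before alignment (mm)
p = [1020.0, 1021.0, 1019.0, 1020.0, 900.0, 905.0]   # last two rays hit an occluder
shift, gamma = alternate_pose_and_visibility(q, p, sigma=5.0)
```

The two rays blocked by the closer occluder are labeled as occluded and excluded from the update, so the estimated shift is determined by the four face measurements only.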

Two criteria are applied for detecting bad local minima: 1) The estimated poses are unreasonable, i.e., they show sudden pose changes over adjacent frames (e.g., large rotation changes $|\Delta\boldsymbol{\omega}|>\pi/4$ or large translation changes $|\Delta\mathbf{t}|>100$ mm) or impossible pose parameters outside their possible ranges (e.g., as depicted in Tab. I). 2) The proportion of occluded pixels, relative to the total number of pixels in the face region, exceeds 50%. Therefore, particle swarm optimization (PSO) [44, 37] is optionally added to tackle poor initialization and bad local minima. Our facial pose tracking approach is listed in Algorithm 1, which highlights the key steps leading to the pose and visibility estimation.
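A minimal check of these criteria might look as follows. The function name is hypothetical, $|\cdot|$ is assumed to be the Euclidean norm, and the test against the valid pose ranges of Tab. I is omitted.

```python
import math


def needs_reinit(d_omega, d_t, occluded_ratio,
                 max_rot=math.pi / 4, max_trans_mm=100.0, max_occluded=0.5):
    """Return True if the estimate looks like a bad local minimum.

    d_omega: incremental rotation vector (rad); d_t: incremental translation (mm);
    occluded_ratio: occluded pixels divided by all pixels in the face region.
    """
    rot = math.sqrt(sum(w * w for w in d_omega))
    trans = math.sqrt(sum(x * x for x in d_t))
    return rot > max_rot or trans > max_trans_mm or occluded_ratio > max_occluded


ok = needs_reinit((0.1, 0.0, 0.0), (5.0, 0.0, 0.0), occluded_ratio=0.2)
jump = needs_reinit((1.0, 0.0, 0.0), (5.0, 0.0, 0.0), occluded_ratio=0.2)
hidden = needs_reinit((0.1, 0.0, 0.0), (5.0, 0.0, 0.0), occluded_ratio=0.6)
```

Only frames for which such a check fires would trigger the more expensive PSO refinement.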

4.4 Online Switchable Identity Adaptation

Concurrently with the rigid pose tracking, the face model is also progressively adapted to the tracked subject’s identity. Moreover, the proposed identity adaptation can instantaneously switch to different users and instantiate novel face models for new identities. To accomplish this, we track the switchable identity distributions in an online Bayesian updating scheme.

4.4.1 Online Identity Adaptation

As depicted in Sec. 3.2, the face model is personalized by the identity distribution $p^{\star}(\mathbf{w}_{\text{id}})=\mathcal{N}(\mathbf{w}_{\text{id}}|\boldsymbol{\mu}_{\text{id}}^{\star},\boldsymbol{\Sigma}_{\text{id}}^{\star})$ . However, the exact $p^{\star}(\mathbf{w}_{\text{id}})$ is unknown without adequate depth samples. Our goal is to sequentially update the face identity parameters $\boldsymbol{\mu}_{\text{id}}$ and $\boldsymbol{\Sigma}_{\text{id}}$ so as to gradually match the statistics of the input identity.

Formally, we apply an online Bayesian model to fulfill the sequential estimate. The priors for $\boldsymbol{\mu}_{\text{id}}$ and $\boldsymbol{\Sigma}_{\text{id}}$ are assigned as Normal Inverse-Wishart conjugate priors [47]:

[TABLE]

in which $\boldsymbol{\Sigma}_{\text{id}}$ follows the inverse-Wishart distribution $\mathcal{W}^{-1}$ , and $\boldsymbol{\mu}_{\text{id}}$ conditionally depends on $\boldsymbol{\Sigma}_{\text{id}}$ and follows a Gaussian distribution. Therefore, given a streaming set of samples $\mathbf{w}_{\text{id}}$ , the parameters $\{\mathbf{m},\beta,\boldsymbol{\Psi},\nu\}$ can be analytically updated [47], and we may simply employ this set of parameters to estimate the expected mean $\mathbb{E}[\boldsymbol{\mu}_{\text{id}}]=\mathbf{m}$ and variance $\mathbb{E}[\boldsymbol{\Sigma}_{\text{id}}]=\boldsymbol{\Psi}/(\nu-N_{\text{id}}-1)$ of the identity distribution $p(\mathbf{w}_{\text{id}})$ .

Therefore, the identity adaptation amounts to estimating the parameter set $\{\mathbf{m},\beta,\boldsymbol{\Psi},\nu\}$ from a series of estimated $\{\mathbf{w}_{\text{id}}^{(t)}\}_{t=1}^{T}$ . An additional identity parameter is the scale factor $\alpha$ , estimated from the tracking parameters. We thus denote the complete identity model as $\mathcal{I}=\{\mathbf{m},\beta,\boldsymbol{\Psi},\nu,\alpha\}$ .

– Estimating $\mathbf{w}_{\text{id}}$ . We directly estimate $\mathbf{w}_{\text{id}}$ by fitting the face model $\mathcal{Q}$ and the point cloud $\mathcal{P}$ in a maximum likelihood sense. The likelihood consists of point-to-point and point-to-plane costs [43] for measuring the discrepancy between paired points in the visible region. Thus, $\mathbf{w}_{\text{id}}^{(t)}$ in frame $t$ is achieved by maximizing

[TABLE]

where $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_{n}^{t}|\mathbf{w}_{\text{id}};\boldsymbol{\theta}^{(t)})$ and $p_{\mathcal{Q}}(\mathbf{p}_{n}^{t}|\mathbf{w}_{\text{id}};\boldsymbol{\theta}^{(t)})$ are transferred from Eq. (8) and (10) but have the form of $p_{\mathcal{M}}(\mathbf{f}|\mathbf{w}_{\text{id}})$ in Eq. (4). The confidence of the estimated $\mathbf{w}_{\text{id}}^{(t)}$ is modeled by the portion of the visible region relative to the entire head area in the face model,

[TABLE]

Therefore, given a clip of input frames, we can gather the identity parameters and their confidence scores as a tuple set $\mathcal{C}=\{(\mathbf{w}_{\text{id}}^{(1)},\kappa^{(1)}),(\mathbf{w}_{\text{id}}^{(2)},\kappa^{(2)}),\ldots,(\mathbf{w}_{\text{id}}^{(T)},\kappa^{(T)})\}$ .

– Updating $\{\mathbf{m},\beta,\boldsymbol{\Psi},\nu\}$ . Given the estimated identity parameters $\{\mathbf{w}_{\text{id}}^{(t)}\}_{t=1}^{T}$ , the distribution $p(\boldsymbol{\mu}_{\text{id}},\boldsymbol{\Sigma}_{\text{id}})$ is updated via online parameter updating based on Bayesian posteriors

[TABLE]

where $N_{\mathcal{C}}=\sum_{t=1}^{T}\kappa^{(t)}$ , $\bar{\mathbf{w}}_{\text{id}}=\frac{1}{N_{\mathcal{C}}}\sum_{t=1}^{T}\kappa^{(t)}\mathbf{w}_{\text{id}}^{(t)}$ and $\mathbf{S}=\frac{1}{N_{\mathcal{C}}}\sum_{t=1}^{T}\kappa^{(t)}(\mathbf{w}_{\text{id}}^{(t)}-\bar{\mathbf{w}}_{\text{id}})(\mathbf{w}_{\text{id}}^{(t)}-\bar{\mathbf{w}}_{\text{id}})^{\top}$ as the weighted scatter matrix of the samples. As more frames are acquired, the identity distribution more closely captures the statistics of the identity samples, where the mean value converges to the weighted mean of all samples, and its variance becomes narrower and converges to the samples’ scatter matrix.
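The update can be sketched with the standard conjugate Normal Inverse-Wishart posterior, using the confidence-weighted quantities $N_{\mathcal{C}}$, $\bar{\mathbf{w}}_{\text{id}}$ and $\mathbf{S}$ defined above. This is the textbook form with the weights acting as effective sample counts; whether it matches the paper's update term by term is an assumption, and the function names are hypothetical.

```python
import numpy as np


def niw_update(m, beta, Psi, nu, W, kappa):
    """Weighted Normal Inverse-Wishart posterior update (textbook conjugate form)."""
    m = np.asarray(m, dtype=float)
    W = np.asarray(W, dtype=float)          # samples w_id^(t), one per row
    kappa = np.asarray(kappa, dtype=float)  # confidence scores kappa^(t)
    N = kappa.sum()                         # N_C
    w_bar = (kappa[:, None] * W).sum(axis=0) / N
    diff = W - w_bar
    S = (kappa[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / N
    d = (w_bar - m)[:, None]
    m_new = (beta * m + N * w_bar) / (beta + N)
    Psi_new = Psi + N * S + (beta * N / (beta + N)) * (d @ d.T)
    return m_new, beta + N, Psi_new, nu + N


def expected_covariance(Psi, nu):
    """E[Sigma_id] = Psi / (nu - N_id - 1), valid for nu > N_id + 1."""
    dim = Psi.shape[0]
    return Psi / (nu - dim - 1)


# Two-dimensional toy example with two equally confident samples.
m1, beta1, Psi1, nu1 = niw_update(
    m=np.zeros(2), beta=1.0, Psi=np.eye(2), nu=4.0,
    W=[[1.0, 0.0], [3.0, 0.0]], kappa=[1.0, 1.0],
)
cov1 = expected_covariance(Psi1, nu1)
```

As the accumulated weight grows, `m_new` moves toward the weighted sample mean and the prior terms lose influence, which is consistent with the convergence behavior described above.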

4.4.2 Switchable Identity Adaptation

To seamlessly create a novel identity as well as switch among multiple identities, we extend our method to an online mixture model. The switching function is the posterior with respect to the identity ID $k$ conditioned on the current identity parameter $\mathbf{w}_{\text{id}}$ and the stored identity models $\{\mathcal{I}_{k}\}_{k=1}^{K}$ :

[TABLE]

where $\mathcal{I}_{0}$ represents the generic identity model. $p_{\mathcal{I}_{k}}(\mathbf{w}_{\text{id}})$ is the identity prior where $\boldsymbol{\mu}_{\text{id}}$ and $\boldsymbol{\Sigma}_{\text{id}}$ are controlled by $\mathcal{I}_{k}$ . This posterior assigns a sample $\mathbf{w}_{\text{id}}$ to one identity model $k$ according to the Gaussian mixture model.

Therefore, we can automatically cluster the set of identity samples and confidence scores $\mathcal{C}=\{\mathbf{w}_{\text{id}}^{(t)},\kappa^{(t)}\}_{t=1}^{T}$ into $\{\mathcal{C}_{k}\}_{k=1}^{K}$ according to this MAP estimation, and update the $k^{\text{th}}$ identity model with its corresponding clustered tuples.

Note that once the indicator refers to the generic model $\mathcal{I}_{0}$ , that is, $|\mathcal{C}_{0}|>0$ , we instantiate a new identity model, initialized by the generic model, which is subsequently updated, as shown in lines 10 to 13 in Alg. 2.
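The switching decision can be sketched as a MAP assignment over Gaussian identity priors. In this illustration each stored model is reduced to a mean and covariance, the mixing weights are assumed equal, and index 0 plays the role of the generic model; all names and numbers are hypothetical.

```python
import numpy as np


def gaussian_logpdf(w, mean, cov):
    """Log-density of a multivariate Gaussian N(w | mean, cov)."""
    d = w - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(w) * np.log(2.0 * np.pi) + logdet + maha)


def select_identity(w, models):
    """Return the index of the best-matching model; models[0] is the generic one."""
    scores = [gaussian_logpdf(w, mean, cov) for mean, cov in models]
    return int(np.argmax(scores))


models = [
    (np.zeros(2), 100.0 * np.eye(2)),   # generic model: broad prior
    (np.array([5.0, 5.0]), np.eye(2)),  # stored identity 1: narrow, personalized
]

k_known = select_identity(np.array([5.2, 4.9]), models)   # close to identity 1
k_new = select_identity(np.array([-20.0, 15.0]), models)  # far from every stored identity

if k_new == 0:
    # The generic model wins: start a new identity initialized from it.
    models.append((models[0][0].copy(), models[0][1].copy()))
```

A sample that no personalized model explains well falls back to the broad generic prior, which is the signal used to instantiate a new identity.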

The presented face model is taken from the identity model indicated in the last frame, and its scale factor is read from the stored model accordingly.

We summarize our online switchable identity adaptation in Alg. 2. In practice, each identity continues adaptation until its adapted face model converges, i.e., the average point-wise mean squared error between adjacent face models is smaller than a given threshold.
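The stopping rule can be sketched as below; the threshold value and function names are hypothetical, since the text does not state the threshold.

```python
def mean_pointwise_sq_error(model_a, model_b):
    """Average squared Euclidean distance between corresponding 3D points."""
    if len(model_a) != len(model_b) or not model_a:
        raise ValueError("models must be non-empty and have the same number of points")
    total = sum(
        sum((a - b) ** 2 for a, b in zip(pa, pb))
        for pa, pb in zip(model_a, model_b)
    )
    return total / len(model_a)


def has_converged(prev_model, cur_model, threshold_mm2):
    """True once consecutive adapted face models differ by less than the threshold."""
    return mean_pointwise_sq_error(prev_model, cur_model) < threshold_mm2


prev_model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cur_model = [(0.0, 0.0, 0.1), (1.0, 0.0, -0.1)]
mse = mean_pointwise_sq_error(prev_model, cur_model)
```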

5 Experiments and Discussions

5.1 Datasets and System Setup

Datasets. We evaluate the proposed method on two public depth-based benchmark datasets, i.e., the Biwi Kinect head pose dataset [21] and ICT 3D head pose (ICT-3DHP) dataset [48]. The dataset summaries are listed in Tab. I.

Biwi Dataset: $15$ K RGB-D images of $20$ subjects (different genders and races) in $24$ sequences, with large ranges in rotations and translations. The recorded faces suffer from occlusions by hair and accessories and from shape variations caused by facial expressions. The groundtruth poses and face models were generated by the RGB-D based FaceShift [14].

ICT-3DHP Dataset: $10$ Kinect RGB-D sequences including $6$ males and $4$ females. This dataset contains occlusions and distortions similar to those in the Biwi dataset. Each subject also exhibits arbitrary expression variations. The groundtruth rotations were collected by a Polhemus Fastrak flock of birds tracker attached to a cap worn by each subject.

Implementation. We implemented the proposed depth-based 3D facial pose tracking model in MATLAB. The results reported in this paper were measured on a 3.4 GHz Intel Core i7 processor with 16GB RAM. No GPU acceleration was applied. The face model is specified as $N_{\mathcal{M}}=11510,N_{\text{id}}=150,N_{\text{exp}}=47$ . In practice, we employ a truncated multilinear model with smaller dimensions as $\tilde{N}_{\text{id}}=28,\tilde{N}_{\text{exp}}=7$ for the sake of efficiency. We set the noise variance as $\sigma_{o}^{2}=25$ , and the outlier distribution is $\mathcal{U_{O}}(y)=U_{\mathcal{O}}=\frac{1}{2500}$ . $\sigma_{t}^{2}$ is empirically set to $75$ , and $\sigma_{s}^{2}$ is $0.04$ . Note that the measurement unit in this paper is millimeter (mm). The online face adaptation is performed every $5$ frames to avoid overfitting to partial facial scans.

Visualization. We plot the point clouds by MATLAB’s parula color map. Each point’s color represents its depth to the camera, where the farther point has a warmer color, and the nearer point has a cooler color. The presented face model is visualized as transformed face mesh based on Phong shading. The visibility masks are marked in red and overlaid on the transformed face mesh.

Note that the face model has holes inside the mouth and eyes, and thus the values shown inside these holes in Fig. 12, Fig. 14(c) and Fig. 15 are not meaningful.

5.2 Ablation Study

5.2.1 Facial Poses with Generic Face Models

The rigid facial pose estimation driven by the ray visibility constraint is robust to noise, occlusions and moderate face deformations. For example, the point clouds in Fig. 7 are noisy, with missing measurements and quantization errors, as well as heavy occlusions (hair in (a) and fingers in (c)) and severely partial face scans ((a) and (b)), but the proposed method is still able to produce successful poses that fit the point clouds, even based on the generic face model and a naïve pose initialization.

The proposed facial pose estimation outperforms previous methods such as iterative closest points (ICP) [43], given coarse initial poses proposed by Meyer’s face localization method [37]. ICP iteratively revises the transformation between the face model and the point clouds according to the point-to-plane distance between matched points. As shown in Fig. 8, the proposed method only needs the set of rays $\mathcal{V}=\{\vec{v}_{\mathbf{q}_{n}}\}_{n=1}^{N_{\mathcal{M}}}$ and does not require explicit correspondences during estimation. In contrast, ICP and its variants are not able to check the visibility of each matched point pair, and thus cannot guarantee a reasonable pose. For example, as shown in Fig. 8(d), ICP matches the face model with the hair but is not aware of the fact that the face cannot occlude the point cloud. Moreover, the ray visibility score (RVS) is less vulnerable to bad local minima, since it rewards a complete overlap of the surface distributions $p_{\mathcal{P}}(\mathbf{y})$ and $p_{\mathcal{Q}\rightarrow\mathcal{P}}(\mathbf{y};\boldsymbol{\theta})$ , which is much less sensitive than point estimate methods like maximum likelihood (ML) or maximum a posteriori (MAP). For example, maximizing the likelihood $p_{\mathcal{Q}\rightarrow\mathcal{P}}(\mathbf{y};\boldsymbol{\theta})$ in Eq. (10) may just seek a local mode that fails to catch the major mass of the distribution, as shown in Fig. 8(e). On the contrary, the Kullback-Leibler divergence in RVS ensures that the face model distribution with the optimal $\boldsymbol{\theta}$ covers the bulk of information conveyed in $p_{\mathcal{P}}(\mathbf{y})$ . Fig. 7 and 8 both reveal the superiority of the RVS and RVS+PSO methods in handling unconstrained facial poses with large rotations and heavy occlusions, even with a generic face model, in which the particle swarm optimization refines the facial poses one step further.

5.2.2 Facial Poses with Personalized Face Models

Fig. 9 shows some tracking results on the Biwi and ICT-3DHP datasets based on the gradually adapted face models. Although the generic model already achieves good performance on challenging cases, as shown in Fig. 7 and 8, the personalized face model yields even better results in both the rotation and translation metrics. As visualized by the angle and translation error histograms in Fig. 10, both the angle and the translation errors of the personalized model are smaller than those of the generic model: the personalized model produces much narrower histograms with far fewer large-error outliers, suggesting that the personalized face model indeed benefits the face tracking.

Moreover, the personalized face shape distribution enables the face model to fit compactly with the input point cloud so as to better capture some challenging poses than the generic face model (in Fig. 9(a)), while the personalized expression distribution makes the estimated facial pose robust to changes in the expression space (in Fig. 9(b)). Angle error curves in Fig. 11 demonstrate the superiority of the personalized face model in eliminating more angle errors that relate to the shape mismatches between the face model and the point cloud. For the probe frame $t_{0}$ , both the generic and personalized face models have small pitch and yaw errors. Although both models seem to produce good fitting results in the camera view, the mismatch between the generic face model and the point clouds leads to a much larger roll error and makes the model fail to fit some essential facial regions such as the nose tip, as visualized in the side view. In contrast, the personalized face model is less vulnerable to profiled faces.

5.2.3 Visibility Detection and Occlusion Handling

Meanwhile, our method efficiently infers the occlusions, owing in part to the reliable visibility detection embedded in the ray visibility constraint. As shown in Fig. 12 and Fig. 1(a), the proposed method checks the visibility of the face model with respect to the input point cloud, such that the visibility mask tightly covers the visible regions of the face model and rejects various occlusions placed in front of it. Therefore, the proposed pose estimation takes only the reliable facial points into consideration and is thus robust to severe occlusions, e.g., self-occlusions like profiled faces, accessories, hands, etc.

We also challenge our robust facial pose tracking with different levels of occlusions, in a way similar to that used by Hsieh et al. [11]. The synthesized occlusion regions have gradually increasing sizes and are randomly placed around the face center. The tracking accuracy is measured on the Biwi dataset, as visualized in Fig. 13. The proposed visibility constraint can handle a large amount of occlusion without a drastic increase in the tracking errors.

5.2.4 Online Face Model Adaptation

The proposed method provides an online identity adaptation that progressively adapts the face model to one test subject, as shown in Fig. 14. More personalized face models are also visualized in Fig. 16. The statistical face model has the ability to cover various identities of different ages, races and genders. We also compare the refined face models to the groundtruth face meshes in the Biwi dataset, as visualized in Fig. 15. As more frames come into the system, the personalized face models gradually have smaller shape differences (measured as the point-to-plane cost for matched 3D points) to the groundtruth meshes. Note that the speed of convergence depends on the ratio of occlusions with respect to the facial region, and fewer occlusions lead to faster adaptation.

However, the adaptation quality is sometimes degraded by falsely aggregating occluded face regions (short hair, beard and so on) into the model update, so the personalized face models may suffer from distortions around the sides of the head, as shown in Fig. 16.

The proposed online switch scheme is able to either instantaneously create a novel identity model for an unseen tester, or switch among the stored personalized models according to the change of active testers. As visualized in Fig. 17, six depth video clips capturing three different identities (marked by ID $1$ to $3$ ) from the Biwi dataset are fed into the proposed system. Initially, the first identity model is generated from the generic model, and it is gradually warped to the shape of ID $1$ given a series of depth frames referring to ID $1$ . Since the subsequent frames after $t_{1}$ capture a novel identity ID $2$ , the proposed system automatically adds a novel identity model according to the MAP estimation from Eq. (23). The presented face model is replaced by that of ID $2$ , and it is updated based on the input frames of ID $2$ . The same process also happens for ID $3$ when its frames are fed into the system after $t_{2}$ . Interestingly, the proposed system can quickly select a suitable identity model for the input depth frame from the stored personalized identity models, based on a similar switching criterion from Eq. (23). For example, ID $1$ is immediately selected at time $t_{3}$ , and the presented face model does not receive further meaningful updates, since this identity has already been well personalized and the additional frames do not introduce new knowledge about its facial shape. ID $2$ and ID $3$ are also effectively selected when the right identities are captured again, but we notice that ID $2$ still receives incremental shape adaptation, as additional informative frames help to adapt its identity model. Therefore, the proposed online identity adaptation offers a promising ability to instantaneously switch from one subject to a new one, followed by a smooth update of the facial identity model.

5.2.5 Execution Efficiency

The proposed RVS-based facial pose estimation has a complexity comparable to that of ICP (see Tab. II). Thanks to the analytical KL-divergence for Gaussian distributions, the ray visibility score is analytical and contains a point-wise squared cost term similar to that of ICP. The added temporal coherence term requires a smaller computational budget than RVS, and the scale accumulation only adds a marginal computational cost (RVS+TE in Tab. II). PSO is indeed the bottleneck of our system (see RVS+TE+PSO in Tab. II), but since it is only optionally added to tackle tracking failures, it does not substantially slow down the execution. Since the identity adaptation requires a relatively large linear solver for estimating $\mathbf{w}_{\text{id}}$ , it (IA in Tab. II) costs considerably more computation than the tracking module (RVS+TE).

5.3 Quantitative Comparisons with Prior Art

We compare our final model with a number of prior methods [21, 37, 49, 48, 20, 44, 38, 27] for depth-based 3D facial pose tracking on the Biwi [21] and ICT-3DHP [48] datasets. Tab. III shows the average absolute errors for the rotation angles and the average Euclidean errors for the translation on the Biwi dataset. The rotational errors are further reported separately for the yaw, pitch and roll angles. Similarly, in Tab. IV we evaluate the average rotation errors on the ICT-3DHP dataset, since groundtruth translations are unavailable for the ICT-3DHP dataset [48]. Note that the results of the reference methods are taken directly from those reported by their respective authors in the literature.
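The two error measures can be computed as in the following sketch. The (yaw, pitch, roll) ordering, the use of degrees and the function names are assumptions, and angle wrap-around is ignored for brevity.

```python
import math


def mean_abs_angle_errors(est, gt):
    """Per-axis mean absolute error for sequences of (yaw, pitch, roll) tuples."""
    n = len(est)
    return tuple(sum(abs(e[i] - g[i]) for e, g in zip(est, gt)) / n for i in range(3))


def mean_translation_error(est, gt):
    """Mean Euclidean distance between estimated and groundtruth translations (mm)."""
    return sum(math.dist(e, g) for e, g in zip(est, gt)) / len(est)


angles_est = [(10.0, 5.0, 0.0), (12.0, 5.0, 1.0)]
angles_gt = [(11.0, 5.0, 0.0), (10.0, 4.0, 1.0)]
yaw_err, pitch_err, roll_err = mean_abs_angle_errors(angles_est, angles_gt)

trans_est = [(0.0, 0.0, 1000.0), (3.0, 4.0, 1000.0)]
trans_gt = [(0.0, 0.0, 1000.0), (0.0, 0.0, 1000.0)]
trans_err = mean_translation_error(trans_est, trans_gt)
```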

On the Biwi dataset, the proposed method produces the overall lowest rotation errors among the depth-based head pose tracking methods [27, 21, 48, 49, 20, 38, 44, 37]. Although no appearance information is used, the proposed approach outperforms the existing state-of-the-art method [38] (marked with $\star$ in Tab. III and IV) that employed both RGB and depth data. Similar conclusions can also be drawn on the ICT-3DHP dataset, where the proposed method also achieves a superior performance on estimating the rotation parameters in comparison with the random forests [21] and CLM-Z [48]. Our performance is slightly superior to Li [38] even without using any color information.

In comparison with the previously released conference paper [27], the superiority of our proposed method is attributed to more effective constraints, namely the temporal coherence and the scale stabilization. As a result, not only is the flickering of the estimated poses among adjacent frames eliminated, but the tracking robustness is also improved, since ambiguous cases such as heavily occluded or profiled faces gain more confidence from the per-pixel facial flow across adjacent frames.

As for the translation parameters, the proposed method also achieves very competitive performance on the Biwi dataset. The proposed method outperforms the prior art, especially the methods proposed by Meyer et al. [37] and Sheng et al. [27]. In comparison to Sheng et al. [27], the introduction of temporal coherence and scale accumulation remarkably increases the translation accuracy, and makes the proposed method outperform the previous state-of-the-art methods, such as Meyer et al. [37].

5.4 Limitations

The proposed system is inevitably vulnerable when the input depth video is contaminated by heavy noise, outliers and quantization errors. Since effective cues like facial landmarks are inaccessible without color information, difficult facial poses with extremely large rotation angles or occlusions are sometimes hard to track well. Even though this problem can be relieved by enforcing temporal coherence, the results still accumulate tracking errors from the previous frames. Moreover, tightly fitting occlusions, such as face masks, veils and short hair, cannot be well handled by the proposed method.

6 Conclusions and Future Work

We propose a robust 3D facial pose tracking method for commodity depth sensors that achieves state-of-the-art performance on popular facial pose datasets. The proposed generative face model and the ray visibility constraint ensure robust groupwise 3D facial pose tracking that accounts for all inherent identities and expressions and effectively handles heavy occlusions, profiled faces and expression variations. The online switchable identity adaptation is able to gradually personalize the face model for one specific user, to instantaneously switch among stored identities, and to create a new identity model for a novel user.

Several directions for future work would be beneficial. Effective long-term temporal coherence still deserves attention, since it provides smoother and more complex tracking trajectories. Moreover, effective depth-based features would help provide semantic correspondences that eliminate trivial or outlier solutions.

Acknowledgments

This research is supported by the BeingTogether Centre, a collaboration between Nanyang Technological University (NTU) Singapore and University of North Carolina (UNC) at Chapel Hill. The BeingTogether Centre is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centres in Singapore Funding Initiative. This work is also supported in part by the MoE Tier-2 Grant (2016-T2-2-065) of Singapore.

Bibliography

[1] C. Cao, Y. Weng, S. Lin, and K. Zhou, “3d shape regression for real-time facial animation,” ACM Trans. Graph., vol. 32, no. 4, p. 41, 2013.
[2] B. Egger, S. Schönborn, A. Schneider, A. Kortylewski, A. Morel-Forster, C. Blumer, and T. Vetter, “Occlusion-aware 3d morphable models and an illumination prior for face image analysis,” Int. J. Comput. Vis., Jan 2018.
[3] J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou, “3d face morphable models ‘in-the-wild’,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, July 2017.
[4] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos, “Large pose 3d face reconstruction from a single image via direct volumetric cnn regression,” in Proc. IEEE Int. Conf. Comput. Vis. IEEE, Oct 2017.
[5] E. Richardson, M. Sela, R. Or-El, and R. Kimmel, “Learning detailed face reconstruction from a single image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, July 2017.
[6] S. Saito, T. Li, and H. Li, “Real-time facial segmentation and performance capture from RGB input,” in Proc. Euro. Conf. Comput. Vis., 2016.
[7] P. Dou, S. K. Shah, and I. A. Kakadiaris, “End-to-end 3d face reconstruction with deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, July 2017.
[8] C. Cao, Q. Hou, and K. Zhou, “Displaced dynamic expression regression for real-time facial tracking and animation,” ACM Trans. Graph., vol. 33, no. 4, p. 43, 2014.