Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data

Onorina Kovalenko; Vladislav Golyanik; Jameel Malik; Ahmed Elhayek; Didier Stricker

arXiv:1905.04789·cs.CV·November 13, 2019


TL;DR

This paper introduces SfAM, a model-based method for monocular 3D reconstruction of articulated objects. It requires no training data, performs on par with learning-based methods on public benchmarks, outperforms previous NRSfM methods, is robust to noisy 2D annotations, and generalizes to different object types.

Contribution

SfAM is a training-free, model-based approach that combines NRSfM with bone length constraints to accurately recover 3D structures from 2D observations.

Findings

01. Achieves comparable accuracy to learning-based methods on benchmarks.

02. Outperforms previous non-rigid structure from motion techniques.

03. Demonstrates robustness to noisy 2D annotations and generalizes to various objects.


Tables

Table 1: The reconstruction error $\mathcal{E}_{3D}$ (in mm) of SfAM and previous methods on the Human 3.6m dataset for evaluation Protocols #1, #2 and #3 (P1, P2, P3). “*” indicates learning-based methods which are trained on Human 3.6m Ionescu et al. (2014). We outperform all model-based approaches and come very close to the tuned supervised learning techniques.

Method	P1	P2	P3
Zhou et al. (2016) *	106.7	-	-
Akhter and Black (2015)	-	181.1	-
Ramakrishna et al. (2012)	-	157.3	-
Bogo et al. (2016)	-	82.3	-
Kanazawa et al. (2018) *	67.5	66.5	-
Moreno-Noguer (2017) *	62.2	-	-
Yasin et al. (2015)	-	-	110.2
Rogez and Schmid (2016)	-	-	88.1
Chen and Ramanan (2017) *	-	-	82.7
Nie et al. (2017) *	-	-	79.5
Sun et al. (2017) *	-	-	48.3
Omran et al. (2018) *	59.9	-	-
Zhou et al. (2018) *	54.7	-	-
Mehta et al. (2017) *	54.6	-	-
Pavlakos et al. (2017) *	51.9	-	-
Kinauer et al. (2017) *	50.3	-	-
Tekin et al. (2017) *	50.1	-	-
Rogez et al. (2019) *	49.2	51.1	42.7
Habibie et al. (2019) *	49.2	-	-
Martinez et al. (2017) *	45.6	-	-
Zhao et al. (2019) *	43.8	-	-
Pavlakos et al. (2018) *	41.8	-	-
Arnab et al. (2019) *	41.6	-	-
Chen et al. (2019) *	41.6	-	-
Sun et al. (2018) *	40.6	-	-
Wandt and Rosenhahn (2019) *	38.2	-	-
Pavllo et al. (2019) *	36.5	-	-
Dabral et al. (2018) *	36.3	-	-
SMSR Ansari et al. (2017)	106.6	105.2	102.9
SMSR Ansari et al. (2017) + Rehan et al. (2014)	145.2	124.0	139.9
Our SfAM	51.2	51.7	53.9

Table 2: The normalized mean 3D error $e_{3D}$ of previous NRSfM methods and our SfAM for synthetic sequences Akhter et al. (2008).

Method	Drink	PickUp	Stretch	Yoga
MP Paladini et al. (2009)	0.4604	0.4332	0.8549	0.8039
PTA Akhter et al. (2008)	0.0250	0.2369	0.1088	0.1625
CSF1 Gotardo and Martínez (2011)	0.0223	0.2301	0.0710	0.1467
CSF2 Gotardo and Martínez (2011)	0.0223	0.2277	0.0684	0.1465
BMM Dai et al. (2014)	0.0266	0.1731	0.1034	0.1150
Lee Lee et al. (2016)	0.8754	1.0689	0.9005	1.2276
PPTA Agudo and Moreno-Noguer (2018)	0.011	0.235	0.084	0.158
SMSR Ansari et al. (2017)	0.0287	0.2020	0.0783	0.1493
SMSR Ansari et al. (2017)+Rehan et al. (2014)	0.4348	0.4965	0.3721	0.4471
Our SfAM	0.0226	0.1921	0.0673	0.1242

Equations

$${\bf W}={\bf R}{\bf S}=\underbrace{{\bf R}({\bf C}\otimes{\bf I}_{3})}_{{\bf M}}{\bf B}={\bf M}{\bf B}, \tag{1}$$

$${\bf W}\cong{\bf M}^{\prime}{\bf B}^{\prime}\cong\underbrace{{\bf M}^{\prime}{\bf Q}}_{{\bf M}}\,\underbrace{{\bf Q}^{-1}{\bf B}^{\prime}}_{{\bf B}}={\bf M}{\bf B}. \tag{2}$$

$${\bf M}^{\prime}_{2t-1:2t}{\bf Q}_{k}=c_{tk}{\bf R}_{t}. \tag{3}$$

$$\begin{cases}{\bf M}^{\prime}_{2t-1}{\bf F}_{k}{\bf M}^{\prime\mathsf{T}}_{2t-1}={\bf M}^{\prime}_{2t}{\bf F}_{k}{\bf M}^{\prime\mathsf{T}}_{2t}=c_{tk}^{2}{\bf I}_{2},\\ {\bf M}^{\prime}_{2t-1}{\bf F}_{k}{\bf M}^{\prime\mathsf{T}}_{2t}=0.\end{cases} \tag{4}$$

$$\underbrace{\begin{bmatrix}{\bf M}^{\prime}_{2t-1}\otimes{\bf M}^{\prime\mathsf{T}}_{2t-1}-{\bf M}^{\prime}_{2t}\otimes{\bf M}^{\prime\mathsf{T}}_{2t}\\ {\bf M}^{\prime}_{2t-1}\otimes{\bf M}^{\prime\mathsf{T}}_{2t}\end{bmatrix}}_{{\bf G}_{t}}\operatorname{vec}({\bf F}_{k})={\bf 0}, \tag{5}$$

$${\bf G}\operatorname{vec}({\bf F}_{k})={\bf 0}, \tag{6}$$

$$\min_{{\bf F}_{k}}\|{\bf G}\operatorname{vec}({\bf F}_{k})\|^{2}. \tag{7}$$

$${\bf S}^{\#}=\begin{bmatrix}X_{11}&\dots&X_{1N}&Y_{11}&\dots&Y_{1N}&Z_{11}&\dots&Z_{1N}\\ \vdots&&\vdots&\vdots&&\vdots&\vdots&&\vdots\\ X_{T1}&\dots&X_{TN}&Y_{T1}&\dots&Y_{TN}&Z_{T1}&\dots&Z_{TN}\end{bmatrix}, \tag{8}$$

$${\bf S}^{\#}=[{\bf P}_{x}\;{\bf P}_{y}\;{\bf P}_{z}]({\bf I}_{3}\otimes{\bf S}), \tag{9}$$

$$\min_{{\bf S}}||{\bf S}^{\#}\boldsymbol{\Pi}||_{*},\quad\text{s. t.}\quad{\bf W}={\bf R}{\bf S}, \tag{10}$$

$${\bf E}_{BL}({\bf S})=\sum_{t=1}^{T}\sum_{b=1}^{B}e_{tb}({\bf S}), \tag{11}$$

$$\min_{{\bf S}}\Big(||{\bf S}^{\#}||_{*}+\frac{\beta}{2}{\bf E}_{BL}({\bf S})\Big),\quad\text{s. t.}\quad{\bf W}={\bf R}{\bf S}, \tag{12}$$

$$\min_{{\bf S}}||{\bf S}^{\#}||_{*}+\frac{\beta}{2}\min_{{\bf A}}{\bf E}_{BL}({\bf A}),\quad\text{s. t.}\quad{\bf W}={\bf R}{\bf S}\;\text{and}\;{\bf A}={\bf S}. \tag{13}$$

$${\bf L}({\bf S},{\bf A},\mu)=\mu||{\bf S}^{\#}||_{*}+\frac{\beta}{2}{\bf E}_{BL}({\bf A})+\frac{1}{2}||{\bf W}-{\bf R}{\bf S}||^{2}_{F}+\frac{1}{2}||{\bf A}-{\bf S}||^{2}_{F}, \tag{14}$$

$$\min_{{\bf S}}{\bf L}({\bf S},\mu)=\min_{{\bf S}}\Big(\mu||{\bf S}^{\#}||_{*}+\frac{1}{2}||{\bf W}-{\bf R}{\bf S}||^{2}_{F}+\frac{1}{2}||{\bf A}-{\bf S}||^{2}_{F}\Big) \tag{15}$$

$$\text{and}\;\min_{{\bf A}}{\bf L}({\bf A})=\min_{{\bf A}}\Big(\frac{\beta}{2}{\bf E}_{BL}({\bf A})+\frac{1}{2}||{\bf A}-{\bf S}||^{2}_{F}\Big). \tag{16}$$

$$g({\bf S}^{\#},{\bf A})=\frac{\partial\frac{1}{2}\big(||{\bf W}-{\bf R}{\bf S}||^{2}_{F}+||{\bf A}-{\bf S}||^{2}_{F}\big)}{\partial{\bf S}^{\#}}=[{\bf P}_{x}\;{\bf P}_{y}\;{\bf P}_{z}]\big({\bf I}_{3}\otimes({\bf R}^{\mathsf{T}}({\bf R}{\bf S}-{\bf W})+({\bf S}-{\bf A}))\big). \tag{17}$$

$$\begin{aligned}{\bf Y}^{(t+1)}&={\bf S}^{\#(t)}-\tau\,g({\bf S}^{\#(t)},{\bf A}^{(t)}),\\ {\bf S}^{\#(t+1)}&=\mathcal{S}_{\tau\mu^{(t)}}({\bf Y}^{(t+1)}),\\ \mu^{(t+1)}&=\rho\,\mu^{(t)},\end{aligned} \tag{18}$$

$${\bf F}({\bf A}):\ \mathbb{R}^{3TN}\to\mathbb{R}^{BT+TN}. \tag{19}$$

$${\bf L}({\bf A})=\|{\bf F}({\bf A})\|_{2}^{2}. \tag{20}$$

$${\bf A}^{\prime}=\arg\min_{{\bf A}}\|{\bf F}({\bf A})\|_{2}^{2}. \tag{21}$$

$${\bf F}({\bf A}_{k}+\Delta{\bf A})\approx{\bf F}({\bf A}_{k})+{\bf J}({\bf A}_{k})\Delta{\bf A}, \tag{22}$$

$$\min_{\Delta{\bf A}}\|{\bf J}({\bf A}_{k})\Delta{\bf A}+{\bf F}({\bf A}_{k})\|^{2}. \tag{23}$$

$$[{\bf J}({\bf A}_{k})^{\mathsf{T}}{\bf J}({\bf A}_{k})+\lambda_{k}{\bf I}]\Delta{\bf A}=-{\bf J}({\bf A}_{k})^{\mathsf{T}}{\bf F}({\bf A}_{k}), \tag{24}$$

$$\mathcal{E}_{3D}=\min_{G}\frac{1}{T}\frac{1}{N}\sum_{t=1}^{T}\sum_{n=1}^{N}||\overline{{\bf S}_{n}^{t}}-G({\bf S}_{n}^{t})||_{2}, \tag{25}$$

$$e_{3D}=\min_{G}\frac{1}{\sigma T}\frac{1}{N}\sum_{t=1}^{T}\sum_{n=1}^{N}||\overline{{\bf S}_{n}^{t}}-G({\bf S}_{n}^{t})||_{2},\quad\text{with}\quad\sigma=\min_{G}\frac{1}{3T}\sum_{t=1}^{T}(\sigma_{tx}+\sigma_{ty}+\sigma_{tz}), \tag{26}$$


Full text

Abstract

Recovery of articulated 3D structure from 2D observations is a challenging computer vision problem with many applications. Current learning-based approaches achieve state-of-the-art accuracy on public benchmarks but are restricted to specific types of objects and motions covered by the training datasets. Model-based approaches do not rely on training data but show lower accuracy on these datasets. In this paper, we introduce a model-based method called Structure from Articulated Motion (SfAM), which can recover multiple object and motion types without training on extensive data collections. At the same time, it performs on par with learning-based state-of-the-art approaches on public benchmarks and outperforms previous non-rigid structure from motion (NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique while integrating a soft spatio-temporal constraint on the bone lengths. We use alternating optimization strategy to recover optimal geometry (i.e., bone proportions) together with 3D joint positions by enforcing the bone lengths consistency over a series of frames. SfAM is highly robust to noisy 2D annotations, generalizes to arbitrary objects and does not rely on training data, which is shown in extensive experiments on public benchmarks and real video sequences. We believe that it brings a new perspective on the domain of monocular 3D recovery of articulated structures, including human motion capture.


1 Introduction

3D structure recovery of articulated objects (i.e., comprising multiple connected rigid parts) from a set of 2D point tracks through multiple monocular images is a challenging computer vision problem Ramakrishna et al. (2012); Wandt et al. (2016); Zhou et al. (2016); Leonardos et al. (2016). Articulated structure recovery is ill-posed due to missing information about the third dimension Lee and Chen (1985). Its applications include gesture and activity recognition, character animation in movies and games, and motion analysis in sport and robotics.

Recently, multiple learning-based approaches that recover 3D structures from 2D landmarks have been introduced Hossain and Little (2018); Zhou et al. (2017); Mehta et al. (2017); Martinez et al. (2017). These methods show state-of-the-art accuracy across public benchmarks. However, they are restricted to a specific kind of structure (e.g., human skeleton) and require extensive datasets for training. Moreover, they often fail to recover poses that are different from the training examples (see Section 4.2.5). When a scene includes different types of articulated objects, different methods have to be applied to reconstruct the whole scene.

In this paper, we introduce a general approach for accurate recovery of 3D poses of any articulated structure from 2D observations that does not rely on training data (see Figure 1). We build upon the recent progress in non-rigid structure from motion (NRSfM), which is a general technique for non-rigid 3D reconstruction from 2D point tracks. However, when considering an articulated object as a general non-rigid one, reconstructions can evince significant variations in the distances between the connected joints (see Section 4.2.3). These distances have to remain nearly constant across all articulated poses. Our method relies on this assumption and imposes a spatio-temporal constraint on the bone lengths.

We call our approach Structure from Articulated Motion (SfAM). We apply an articulated structure term as a soft constraint on top of the classic optimization problem of NRSfM Dai et al. (2014). This term enforces the bone lengths, though not known in advance, to remain constant across all frames. Our optimization strategy alternates between the classic NRSfM problem and our articulated structure term until they both converge. This allows for recovering the geometry together with the 3D joint positions, and the method does not rely on known bone lengths. Even when starting from a rough, incorrect initialization of the articulated structure (e.g., one in which a human arm is longer than a leg), SfAM still converges to the correct structure proportions (see Section 4.2.3). Figure 2 illustrates the significant difference between results produced by a general-purpose NRSfM technique Ansari et al. (2017) and our SfAM.

To summarise, our contributions are:

• A generic framework for articulated structure recovery which achieves state-of-the-art accuracy among non-learning-based methods across public datasets. Moreover, it shows performance close to state-of-the-art learning-based methods but at the same time is not restricted to specific objects (see Section 4) and does not require training data.

• SfAM recovers sequence-specific bone proportions together with 3D joints (see Section 3). Thus, it does not need known bone lengths.

• The articulated prior energy term makes our approach robust to noisy 2D observations (see Section 4.2.2) by imposing additional constraints on the 3D structure.

In this paper, we show that a non-learning-based approach can perform on par with state-of-the-art learning-based methods and even outperform some of them in real-world scenes (see Section 4.2.5). We demonstrate the effectiveness of SfAM for the recovery of different articulated structures through extensive quantitative and qualitative evaluation on different datasets Ionescu et al. (2014); Akhter et al. (2011); Tompson et al. (2014) and real-world scenes (see Section 4). To the best of our knowledge, our SfAM is the first NRSfM approach evaluated on such comprehensive datasets as Human 3.6m Ionescu et al. (2014) and NYU hand pose Tompson et al. (2014). As a side effect, our method can be used for precise articulated model estimation, e.g., to generate personalized human skeleton rigs (see Section 4.2.3). This contrasts sharply with most recent supervised learning approaches, which require extensive labeled databases for training and still often fail when unfamiliar poses are observed (see Section 4.2.5). Moreover, for learning-based methods, minor changes in the inputs lead to significant variations in the poses, which makes their results very difficult or impossible to reproduce.

2 Related Work

Rigid and Non-Rigid Structure from Motion. Factorization-based Structure from Motion (SfM) is a general technique for 3D structure recovery from 2D point tracks. An SfM problem is well-posed for rigid objects due to the rigidity constraint Tomasi and Kanade (1992). Early extensions of the method of Tomasi and Kanade (1992) for the non-rigid case rely on rank and orthonormality constraints Bregler et al. (2000); Brand (2005). Subsequent methods investigated shape basis priors Xiao et al. (2004), temporal smoothness priors Bartoli et al. (2008), trajectory space constraints Akhter et al. (2008) as well as such fundamental questions as shape basis uniqueness Hartley and Vidal (2008); Akhter et al. (2009). More recent methods combine priors in the metric and trajectory spaces Gotardo and Martínez (2011). To improve the reconstruction of stronger nonlinear deformations, Zhu et al. (2014) introduce unions of linear subspaces. Dai et al. (2014) propose an NRSfM method with as few additional constraints as possible. Lately, the focus of NRSfM research has shifted to the problem of scalability Ansari et al. (2017); Kumar et al. (2018), i.e., consistent performance across different scenarios and linear computational complexity in the number of points. Our SfAM is a scalable approach which builds upon the work of Ansari et al. (2017). In contrast to Ansari et al. (2017), we recover articulated structures with higher accuracy.

Articulated and Multibody Structure from Motion. Over the last few years, several SfM approaches for articulated motion recovery were proposed. Some of them relax the global rigidity constraint for multiple parts Paladini et al. (2012); Costeira and Kanade (1998) so that each of the parts is constrained to be rigid. They can handle relatively simple articulated motions, as the segmentation and the structure composition are assumed to be unknown Paladini et al. (2012). As a result, these methods are hardly applicable to such complicated scenarios as human and hand pose recovery. Tresadern and Reid (2005), Yan and Pollefeys (2008) and Paladini et al. (2012) address the articulated case with two rigid body parts and detect a hinge joint. Later, an approach with spatial smoothness and segmentation dealing with an arbitrary number of rigid parts was proposed by Fayad et al. (2011). Park and Sheikh (2011) reconstruct trajectories given parent trajectories and known bone length, known camera, and root motion for each frame. Their objective is highly nonlinear and requires good initialization of trajectory parameters. In contrast, our method recovers sequence-specific bone proportions and does not rely on given bone lengths. Next, Valmadre et al. (2012) propose a dynamic-programming approach for the reconstruction of articulated 3D trees from input 2D joint positions operating in linear time. Multibody SfM methods reconstruct multiple independent rigid body transformations and non-rigid deformations in the same scene Costeira and Kanade (1998); Kumar et al. (2017). In contrast, our approach is more general as it imposes a soft constraint of articulated motion on top of classic NRSfM.

Piecewise and Locally Rigid Structure from Motion. Piecewise rigid approaches interpret the structure as locally rigid in the spatial domain Golyanik et al. (2019); Taylor et al. (2010). Several methods divide the structure into patches, each of which can deform non-rigidly Fayad et al. (2010); Lee et al. (2016). Operating at a high level of granularity allows these methods to reconstruct large deformations, as opposed to methods relying on linear low-rank subspace models Fayad et al. (2010). Rehan et al. (2014) penalize deviations of the bone lengths from the average distances between the joints over the whole sequence. This form of constraint does not guarantee a realistic reconstruction though, as it struggles to compensate for inaccurate 2D estimations or 3D inaccuracies in short time intervals.

Monocular 3D Human Body and Hand Pose Estimation. Bone length constraints are widely used in the single-view regression of 3D human poses. One of the early works in this domain operates on single uncalibrated images and imposes constraints on the relative bone lengths Taylor (2000). It is capable of reconstructing a human pose up to scale. Later, an enhancement for multiple frames with bone symmetry and rigidity constraints (joints representing the same bone move rigidly relative to each other) was introduced by Wei and Chai (2009). Akhter and Black (2015) use a pose prior that captures pose-dependent joint angle limits. Ramakrishna et al. (2012) use a sum of squared bone lengths term that can still lead to unrealistic poses. Wandt et al. (2016) constrain the bone lengths to be invariant. Their trilinear factorization approach relies on pre-trained body poses serving as a shape prior and transcendental functions modeling periodic motion peculiar to the human gait. An adaptation of this approach to hand gestures would require the acquisition of a new shape prior. Wandt et al. (2018) constrain the sum of squared bone lengths of the articulated structure to be invariant throughout the image sequence. However, the length of each bone can still vary. One of the modern methods for human pose and appearance estimation is MonoPerfCap of Xu et al. (2018). It imposes implicit bone length constraints through a dense template tailored to a specific person and captured in an external acquisition process.

Recently, many learning-based approaches for human pose and hand pose estimation have been presented in the literature Rogez et al. (2019); Kanazawa et al. (2018); Pavlakos et al. (2018); Moreno-Noguer (2017); Martinez et al. (2017); Malik et al. (2019, 2018a, 2018b, 2017). In Zhou et al. (2017), weak supervision constrains the output of the network with fixed bone proportions taken from the training dataset. Sun et al. (2017) exploit a joint connection structure and use bones instead of joints for pose representation. Wandt and Rosenhahn (2019) use a kinematic chain representation and include bone length information in their loss function during training. In contrast to our SfAM, the method of Wandt and Rosenhahn (2019) is not as robust to noisy 2D input (see Section 4.2.2). All these methods are highly specialized and rely on extensive collections of training data. In contrast, our SfAM is a general approach that can cope with different articulated structures, with no need for labeled datasets.

3 The Proposed SfAM Approach

Figure 3 shows a high-level overview of our approach. Following factorization-based NRSfM Dai et al. (2014), we first recover the camera pose using 2D landmarks (Section 3.2). For 3D structure recovery, we extend the target energy function of the classic NRSfM problem Ansari et al. (2017); Dai et al. (2014) by our articulated prior term (Section 3.3.1).

We assume that sparse 2D correspondences are given. In Section 3.3.2, we show how our new energy is efficiently optimized by alternating between the fixed-point continuation algorithm Ma et al. (2011) and the Levenberg–Marquardt method Levenberg (1944); Marquardt (1963). This leads to an accurate reconstruction of articulated motions of different structures.

3.1 Factorization Model

The input to SfAM is the measurement matrix ${\bf W}=[{\bf W}_{1},{\bf W}_{2},\ldots,{\bf W}_{T}]^{\mathsf{T}}\in\mathbb{R}^{2T\times N}$ with $N$ 2D joints tracked over $T$ frames. Every ${\bf W}_{t}$ , $t\in\{1,\ldots,T\}$ , is registered to the centroid of the observed structure and the translation is resolved in advance. Most of the NRSfM methods assume orthographic projection, as the intrinsic camera model is usually not known. Even though some benchmarks (e.g., Ionescu et al. (2014)) provide camera parameters, we develop a general approach for uncalibrated settings. Following standard SfM approaches, we assume that every 2D projection ${\bf W}_{t}$ can be factorized into a camera pose-projection matrix ${\bf R}_{t}\in\mathbb{R}^{2\times 3}$ and 3D structure ${\bf S}_{t}\in\mathbb{R}^{3\times N}$ so that ${\bf W}_{t}={\bf R}_{t}{\bf S}_{t}$ . We assume that the articulated structure deforms under the low-rank shape model Bregler et al. (2000); Ansari et al. (2017). Thus, ${\bf S}=[{\bf S}_{1},{\bf S}_{2},\ldots,{\bf S}_{T}]^{\mathsf{T}}$ can be parametrized by the set of unknown basis shapes ${\bf B}\in\mathbb{R}^{3K\times N}$ of cardinality $K$ and the coefficient matrix ${\bf C}\in\mathbb{R}^{T\times K}$ :

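$${\bf W}={\bf R}{\bf S}=\underbrace{{\bf R}({\bf C}\otimes{\bf I}_{3})}_{{\bf M}}{\bf B}={\bf M}{\bf B}, \tag{1}$$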

where ${\bf R}=\operatorname{bkdiag}({\bf R}_{1},{\bf R}_{2},\ldots,{\bf R}_{T})$ is the joint camera pose-projection matrix, ${\bf I}_{3}$ is a $3\times 3$ identity matrix and $\otimes$ denotes Kronecker product.
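The factorization in Equation (1) can be checked numerically on synthetic data. The following is a minimal NumPy sketch and not the authors' implementation: the toy sizes, the random basis shapes and coefficients, and the helper `random_orthographic_camera` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 10, 12, 2  # frames, joints, basis shapes (toy sizes, assumed)

B = rng.standard_normal((3 * K, N))  # basis shapes, 3K x N
C = rng.standard_normal((T, K))      # shape coefficients, T x K

# Stacked 3D structure S = (C kron I_3) B, shape 3T x N.
S = np.kron(C, np.eye(3)) @ B


def random_orthographic_camera(rng):
    """Hypothetical helper: two orthonormal rows of a random 3x3 orthogonal matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q[:2, :]


# Block-diagonal camera matrix R = bkdiag(R_1, ..., R_T), shape 2T x 3T.
R = np.zeros((2 * T, 3 * T))
for t in range(T):
    R[2 * t:2 * t + 2, 3 * t:3 * t + 3] = random_orthographic_camera(rng)

M = R @ np.kron(C, np.eye(3))  # 2T x 3K
W = R @ S                      # measurement matrix, 2T x N

print("W equals M B:", np.allclose(W, M @ B))
print("rank(W) =", np.linalg.matrix_rank(W), "and 3K =", 3 * K)
```

The sketch illustrates why the low-rank shape model matters: although ${\bf W}$ has $2T$ rows, its rank cannot exceed $3K$, which is what makes the SVD-based initialization of Section 3.2 possible.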

3.2 Recovery of Camera Poses

Applying singular value decomposition to ${\bf W}$ , we obtain initial estimates of ${\bf M}$ and ${\bf B}$ from Equation (1) up to an invertible corrective transformation ${\bf Q}\in\mathbb{R}^{3K\times 3K}$ :

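$${\bf W}\cong{\bf M}^{\prime}{\bf B}^{\prime}\cong\underbrace{{\bf M}^{\prime}{\bf Q}}_{{\bf M}}\,\underbrace{{\bf Q}^{-1}{\bf B}^{\prime}}_{{\bf B}}={\bf M}{\bf B}. \tag{2}$$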

In the following, we use the shorthand ${\bf M}^{\prime}_{2t-1:2t}\in\mathbb{R}^{2\times 3K}$ for the $t$ -th pair of rows of ${\bf M}^{\prime}$ and ${\bf Q}_{k}\in\mathbb{R}^{3K\times 3}$ for the $k$ -th column triplet of ${\bf Q}$ , $k\in\{1,\ldots,K\}$ . Considering (1) and (2), for every $t\in\{1,\ldots,T\}$ and $k\in\{1,\ldots,K\}$ , we have:

$${\bf M}^{\prime}_{2t-1:2t}{\bf Q}_{k}=c_{tk}{\bf R}_{t}. \tag{3}$$

Using the orthonormality constraints ${\bf R}_{t}{\bf R}_{t}^{\mathsf{T}}={\bf I}_{2}$ and denoting ${\bf F}_{k}={\bf Q}_{k}{\bf Q}_{k}^{\mathsf{T}}$ , we obtain:

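$$\begin{cases}{\bf M}^{\prime}_{2t-1}{\bf F}_{k}{\bf M}^{\prime\mathsf{T}}_{2t-1}={\bf M}^{\prime}_{2t}{\bf F}_{k}{\bf M}^{\prime\mathsf{T}}_{2t}=c_{tk}^{2}{\bf I}_{2},\\ {\bf M}^{\prime}_{2t-1}{\bf F}_{k}{\bf M}^{\prime\mathsf{T}}_{2t}=0.\end{cases} \tag{4}$$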

Therefore, the following systems of equations can be written for every $t$ and $k$ :

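$$\underbrace{\begin{bmatrix}{\bf M}^{\prime}_{2t-1}\otimes{\bf M}^{\prime\mathsf{T}}_{2t-1}-{\bf M}^{\prime}_{2t}\otimes{\bf M}^{\prime\mathsf{T}}_{2t}\\ {\bf M}^{\prime}_{2t-1}\otimes{\bf M}^{\prime\mathsf{T}}_{2t}\end{bmatrix}}_{{\bf G}_{t}}\operatorname{vec}({\bf F}_{k})={\bf 0}, \tag{5}$$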

where $\operatorname{vec}(\cdot)$ is the vectorization operator, which stacks the entries of an $m\times n$ matrix into a column vector of length $mn$ . Stacking all ${\bf G}_{t}$ vertically, we obtain:

$${\bf G}\operatorname{vec}({\bf F}_{k})={\bf 0}, \tag{6}$$

where ${\bf G}=[{\bf G}_{1},{\bf G}_{2},\ldots,{\bf G}_{T}]^{\mathsf{T}}$ . Finding an optimal ${\bf F}_{k}$ can be performed by solving the optimization problem:

$$\min_{{\bf F}_{k}}\|{\bf G}\operatorname{vec}({\bf F}_{k})\|^{2}. \tag{7}$$

Due to the rank-3 constraint on every ${\bf F}_{k}$ , this problem is solved by the iterative shrinkage-thresholding (IST) method Beck and Teboulle (2009). Once an optimal ${\bf F}$ is found, the corrective transformation ${\bf Q}$ is recovered by Cholesky decomposition. Using ${\bf Q}$ , ${\bf R}$ is recovered from Equations (1)–(4).
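The starting point of this section, the SVD-based factorization of Equation (2) and its ambiguity up to ${\bf Q}$, can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: the function name `initial_factorization`, the symmetric split of the singular values between the two factors, and the synthetic exactly rank-$3K$ input are illustrative choices, not details given in the paper. The metric upgrade itself (solving for ${\bf F}_{k}$ and ${\bf Q}$) is not reproduced here.

```python
import numpy as np


def initial_factorization(W, K):
    """Rank-3K factorization W ~= M' B' via truncated SVD (assumed split of singular values)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = 3 * K
    root = np.sqrt(s[:r])
    M_prime = U[:, :r] * root           # 2T x 3K
    B_prime = root[:, None] * Vt[:r]    # 3K x N
    return M_prime, B_prime


rng = np.random.default_rng(1)
T, N, K = 10, 12, 2  # toy sizes (assumed)

# Synthetic measurement matrix of exact rank 3K.
W = rng.standard_normal((2 * T, 3 * K)) @ rng.standard_normal((3 * K, N))
M_prime, B_prime = initial_factorization(W, K)

# Any invertible Q yields another valid factorization: (M' Q)(Q^-1 B') = M' B'.
Q = np.eye(3 * K) + 0.1 * rng.standard_normal((3 * K, 3 * K))
M_alt = M_prime @ Q
B_alt = np.linalg.solve(Q, B_prime)

print("M' B' reproduces W:", np.allclose(M_prime @ B_prime, W))
print("ambiguity holds:", np.allclose(M_alt @ B_alt, W))
```

For noisy real measurements the truncated SVD gives only the best rank-$3K$ approximation of ${\bf W}$, and the orthonormality constraints above are what select a meaningful ${\bf Q}$ among all invertible candidates.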

3.3 Articulated Structure Recovery

3.3.1 Articulated Structure Representation

Having found ${\bf R}$ , we recover ${\bf S}$ . Note that we optionally rely on an updated ${\bf W}$ after the smooth shape trajectory step which imposes additional constraints on point trajectories and reduces the overall number of unknowns; please refer to Ansari et al. (2017) for more details.

We rearrange the shape matrix ${\bf S}$ to

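$${\bf S}^{\#}=\begin{bmatrix}X_{11}&\dots&X_{1N}&Y_{11}&\dots&Y_{1N}&Z_{11}&\dots&Z_{1N}\\ \vdots&&\vdots&\vdots&&\vdots&\vdots&&\vdots\\ X_{T1}&\dots&X_{TN}&Y_{T1}&\dots&Y_{TN}&Z_{T1}&\dots&Z_{TN}\end{bmatrix}, \tag{8}$$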

where $(X_{tn},Y_{tn},Z_{tn}),n\in\{1,\ldots,N\}$ is a 3D coordinate of each joint in ${\bf S}$ . ${\bf S}^{\#}$ can be represented as:

$${\bf S}^{\#}=[{\bf P}_{x}\;{\bf P}_{y}\;{\bf P}_{z}]({\bf I}_{3}\otimes{\bf S}), \tag{9}$$

where ${\bf P}_{x},{\bf P}_{y},{\bf P}_{z}\in\mathbb{R}^{T\times 3T}$ are binary row selectors. We follow Ansari et al. (2017); Dai et al. (2014) and represent the optimal non-rigid structure by:

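$$\min_{{\bf S}}||{\bf S}^{\#}\boldsymbol{\Pi}||_{*},\quad\text{s. t.}\quad{\bf W}={\bf R}{\bf S}, \tag{10}$$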

where $\boldsymbol{\Pi}=({\bf I}-\frac{1}{T}\boldsymbol{1}\boldsymbol{1}^{\mathsf{T}})$ ( $\boldsymbol{1}$ is a vector of ones) and $||.||_{*}$ denotes the nuclear norm. Note that $\operatorname{rank}({\bf S}^{\#})\leq K$ , and the mean 3D component is removed from ${\bf S}^{\#}$ . As shown in Figure 2, non-rigid structures recovered by the optimization of (10) can have significant variations in bone lengths. This often leads to unrealistic poses and body proportions. Unlike general non-rigid structures, in articulated structures, individual rigid parts or bones have constant lengths throughout the whole sequence. Moreover, all the bones follow constant proportions. These constraints are called articulated priors. We incorporate the articulated priors into the objective function (10) in the form of the following energy term:

$${\bf E}_{BL}({\bf S})=\sum_{t=1}^{T}\sum_{b=1}^{B}e_{tb}({\bf S}), \tag{11}$$

where $e_{tb}({\bf S})=(D_{b}^{t}-L_{b})^{2}$ is an energy term for bone $b$ and frame $t$ , and $L_{b}$ is the initial normalized length of bone $b$ . The normalization is done with respect to the sum of all initial bone lengths. $D_{b}^{t}=||X_{a_{b}}^{t}-X_{c_{b}}^{t}||_{2}$ is the Euclidean distance between joints $X_{a_{b}}^{t}$ and $X_{c_{b}}^{t}$ connected by bone $b$ ; $B$ is the number of bones of the articulated structure. Vectors $a=[X_{a_{1}},X_{a_{2}},\ldots,X_{a_{B}}]$ and $c=[X_{c_{1}},X_{c_{2}},\ldots,X_{c_{B}}]$ define the parent and child joints of bones, respectively.
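The bone-length energy of Equation (11) is straightforward to evaluate. Below is a minimal NumPy sketch; the function names, the three-joint toy skeleton and the storage layout (${\bf S}$ as a $3T\times N$ array with rows ordered $x,y,z$ per frame) are assumptions for illustration only.

```python
import numpy as np


def normalized_lengths(raw_lengths):
    """Normalize initial bone lengths by their sum, as described for L_b."""
    raw = np.asarray(raw_lengths, dtype=float)
    return raw / raw.sum()


def bone_length_energy(S, parents, children, L):
    """E_BL(S) = sum_t sum_b (D_b^t - L_b)^2 for S stored as a 3T x N array."""
    T = S.shape[0] // 3
    frames = S.reshape(T, 3, -1)                             # (T, 3, N)
    diff = frames[:, :, parents] - frames[:, :, children]    # (T, 3, B)
    D = np.linalg.norm(diff, axis=1)                         # (T, B) bone lengths
    return float(np.sum((D - L) ** 2))


# Toy skeleton (assumed): joints 0-1-2 forming two bones of length 0.5 each.
pose = np.array([[0.0, 0.5, 0.5],    # x coordinates of the 3 joints
                 [0.0, 0.0, 0.5],    # y coordinates
                 [0.0, 0.0, 0.0]])   # z coordinates
parents, children = np.array([0, 1]), np.array([1, 2])
L = normalized_lengths([1.0, 1.0])   # -> two bones of normalized length 0.5

# A rigidly rotated copy keeps all bone lengths, so the energy stays zero.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
S_rigid = np.vstack([pose, Rz @ pose])

# Stretching the second bone to length 1.0 in one frame is penalized.
stretched = pose.copy()
stretched[1, 2] = 1.0
S_stretched = np.vstack([pose, stretched])

print(bone_length_energy(S_rigid, parents, children, L))
print(bone_length_energy(S_stretched, parents, children, L))
```

In this toy case the rotated sequence has zero energy, while the stretched frame contributes $(1.0-0.5)^{2}$, which is the behaviour the articulated prior is meant to capture.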

Unlike some previous works Dabral et al. (2018); Akhter and Black (2015); Yasin et al. (2015); Zhou et al. (2017), we do not require predefined bone lengths or proportions. SfAM recovers optimal articulated structure that minimizes the total energy:

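$$\min_{{\bf S}}\Big(||{\bf S}^{\#}||_{*}+\frac{\beta}{2}{\bf E}_{BL}({\bf S})\Big),\quad\text{s. t.}\quad{\bf W}={\bf R}{\bf S}, \tag{12}$$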

where $\beta$ is a scalar weight. Implementation of articulated prior (11) as a soft constraint makes the overall method robust to incorrect initialization of bone lengths.
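The rearrangement of ${\bf S}$ into ${\bf S}^{\#}$ and the nuclear norm used in the total energy can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the function names are hypothetical, and the row selectors are taken to be of size $T\times 3T$ so that the product with ${\bf I}_{3}\otimes{\bf S}$ is well defined.

```python
import numpy as np


def rearrange(S):
    """Map S (3T x N) to S# (T x 3N) = [X | Y | Z] by slicing rows."""
    return np.hstack([S[0::3], S[1::3], S[2::3]])


def rearrange_with_selectors(S):
    """Same map written as S# = [P_x P_y P_z] (I_3 kron S)."""
    T = S.shape[0] // 3
    # P_x, P_y, P_z pick the x, y, z row of every frame (each T x 3T).
    selectors = [np.kron(np.eye(T), e[None, :]) for e in np.eye(3)]
    return np.hstack(selectors) @ np.kron(np.eye(3), S)


rng = np.random.default_rng(4)
T, N, K = 9, 7, 2  # toy sizes (assumed)
C = rng.standard_normal((T, K))
B = rng.standard_normal((3 * K, N))
S = np.kron(C, np.eye(3)) @ B        # low-rank shape model, 3T x N

S_sharp = rearrange(S)
nuclear_norm = np.linalg.norm(S_sharp, ord="nuc")  # sum of singular values

print("both constructions agree:", np.allclose(S_sharp, rearrange_with_selectors(S)))
print("rank(S) =", np.linalg.matrix_rank(S), " rank(S#) =", np.linalg.matrix_rank(S_sharp))
```

The rank comparison shows why the nuclear norm is applied to ${\bf S}^{\#}$ rather than to ${\bf S}$: under the low-rank shape model the rearranged matrix has rank at most $K$, whereas ${\bf S}$ itself can have rank up to $3K$.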

3.3.2 Energy Optimization

Since (12) contains a nonlinear term ${\bf E}_{BL}({\bf S})$ , we introduce an auxiliary variable ${\bf A}$ and obtain the following optimization problem which is linear with respect to ${\bf S}$ :

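$$\min_{{\bf S}}||{\bf S}^{\#}||_{*}+\frac{\beta}{2}\min_{{\bf A}}{\bf E}_{BL}({\bf A}),\quad\text{s. t.}\quad{\bf W}={\bf R}{\bf S}\;\text{and}\;{\bf A}={\bf S}. \tag{13}$$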

We rewrite (13) in the Lagrangian form:

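$${\bf L}({\bf S},{\bf A},\mu)=\mu||{\bf S}^{\#}||_{*}+\frac{\beta}{2}{\bf E}_{BL}({\bf A})+\frac{1}{2}||{\bf W}-{\bf R}{\bf S}||^{2}_{F}+\frac{1}{2}||{\bf A}-{\bf S}||^{2}_{F}, \tag{14}$$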

where $||.||_{F}$ denotes the Frobenius norm and $\mu$ is a parameter. We split (14) into two subproblems:

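$$\min_{{\bf S}}{\bf L}({\bf S},\mu)=\min_{{\bf S}}\Big(\mu||{\bf S}^{\#}||_{*}+\frac{1}{2}||{\bf W}-{\bf R}{\bf S}||^{2}_{F}+\frac{1}{2}||{\bf A}-{\bf S}||^{2}_{F}\Big) \tag{15}$$

$$\text{and}\;\min_{{\bf A}}{\bf L}({\bf A})=\min_{{\bf A}}\Big(\frac{\beta}{2}{\bf E}_{BL}({\bf A})+\frac{1}{2}||{\bf A}-{\bf S}||^{2}_{F}\Big). \tag{16}$$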

We alternate between the subproblems (15) and (16) and iterate until convergence. ${\bf A}$ remains fixed in (15) and ${\bf S}$ remains fixed in (16). In every optimization step, the subproblem (15) updates the 3D structure so that it more accurately projects to the observed 2D landmarks. The subproblem (16) penalizes the difference in bone lengths among all frames while recovering the sequence-specific bone proportions. The bone lengths of the recovered optimal 3D structures are almost constant throughout the whole image sequence but different from the initial $L_{b}$ .

The subproblem (15) is linear and solved by the fixed-point continuation (FPC) method Ma et al. (2011). First, we obtain the gradient of $\frac{1}{2}(||{\bf W}-{\bf R}{\bf S}||^{2}_{F}+||{\bf A}-{\bf S}||^{2}_{F})$ with respect to ${\bf S}^{\#}$ :

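$$g({\bf S}^{\#},{\bf A})=\frac{\partial\frac{1}{2}\big(||{\bf W}-{\bf R}{\bf S}||^{2}_{F}+||{\bf A}-{\bf S}||^{2}_{F}\big)}{\partial{\bf S}^{\#}}=[{\bf P}_{x}\;{\bf P}_{y}\;{\bf P}_{z}]\big({\bf I}_{3}\otimes({\bf R}^{\mathsf{T}}({\bf R}{\bf S}-{\bf W})+({\bf S}-{\bf A}))\big). \tag{17}$$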

Next, FPC for $\min_{{\bf S}}{\bf L}({\bf S},\mu)$ instantiates as:

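$$\begin{aligned}{\bf Y}^{(t+1)}&={\bf S}^{\#(t)}-\tau\,g({\bf S}^{\#(t)},{\bf A}^{(t)}),\\ {\bf S}^{\#(t+1)}&=\mathcal{S}_{\tau\mu^{(t)}}({\bf Y}^{(t+1)}),\\ \mu^{(t+1)}&=\rho\,\mu^{(t)},\end{aligned} \tag{18}$$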

where $\mathcal{S}_{\nu}(\cdot)$ is the matrix shrinkage operator Ma et al. (2011) and $\tau>0$ is a free parameter.
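The FPC iteration (a gradient step, singular value shrinkage, then a decrease of $\mu$) can be sketched in a few lines of NumPy. This is a toy sketch under stated assumptions rather than the paper's C++ implementation: the parameter values `mu=1.0`, `rho=0.5`, `tau=0.5`, the fixed iteration count, the zero initialization and the synthetic data with ${\bf A}$ set to the true structure are all illustrative choices.

```python
import numpy as np


def rearrange(S):
    """S (3T x N) -> S# (T x 3N) = [X | Y | Z]."""
    return np.hstack([S[0::3], S[1::3], S[2::3]])


def unrearrange(S_sharp):
    """Inverse of rearrange: S# (T x 3N) -> S (3T x N)."""
    T, N = S_sharp.shape[0], S_sharp.shape[1] // 3
    S = np.empty((3 * T, N))
    S[0::3] = S_sharp[:, :N]
    S[1::3] = S_sharp[:, N:2 * N]
    S[2::3] = S_sharp[:, 2 * N:]
    return S


def shrink(Y, nu):
    """Matrix shrinkage: soft-threshold the singular values of Y by nu."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - nu, 0.0)) @ Vt


def fpc(W, R, A, mu=1.0, rho=0.5, tau=0.5, iters=30):
    """Toy fixed-point continuation for the S-subproblem with A held fixed."""
    S = np.zeros_like(A)
    for _ in range(iters):
        # Gradient of 0.5*||W - R S||_F^2 + 0.5*||A - S||_F^2 with respect to S.
        grad_S = R.T @ (R @ S - W) + (S - A)
        Y = rearrange(S) - tau * rearrange(grad_S)   # gradient step in S# layout
        S = unrearrange(shrink(Y, tau * mu))         # shrinkage with threshold tau*mu
        mu *= rho                                    # continuation on mu
    return S


# Synthetic toy problem (all sizes and data are assumptions).
rng = np.random.default_rng(2)
T, N, K = 8, 10, 2
S_true = np.kron(rng.standard_normal((T, K)), np.eye(3)) @ rng.standard_normal((3 * K, N))
R = np.zeros((2 * T, 3 * T))
for t in range(T):
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R[2 * t:2 * t + 2, 3 * t:3 * t + 3] = q[:2]
W = R @ S_true

S_hat = fpc(W, R, A=S_true)
rel_err = np.linalg.norm(S_hat - S_true) / np.linalg.norm(S_true)
print("relative error:", rel_err)
```

Because rearranging ${\bf S}$ into ${\bf S}^{\#}$ only permutes entries, the gradient with respect to ${\bf S}^{\#}$ is the rearranged gradient with respect to ${\bf S}$, which is what Equation (17) expresses with the row selectors. In the full method ${\bf A}$ would be the current output of the bone-length subproblem rather than the ground truth used in this toy setup.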

The second subproblem (16) is nonlinear and is optimized in each iteration of (18) using the Levenberg–Marquardt solver of Ceres (Agarwal et al.). Let $r_{l}$ , $l\in\{1,\ldots,TN\}$ , denote the residuals of $\frac{1}{2}||{\bf A}-{\bf S}||^{2}_{F}$ . We aggregate all residuals $e_{tb}({\bf A})$ from (11) (note that ${\bf S}$ in (11) is substituted by ${\bf A}$ ) and $r_{l}$ into a single function:

$${\bf F}({\bf A}):\ \mathbb{R}^{3TN}\to\mathbb{R}^{BT+TN}. \tag{19}$$

Next, the objective function (16) can be compactly written in terms of ${\bf A}$ as:

$${\bf L}({\bf A})=\|{\bf F}({\bf A})\|_{2}^{2}. \tag{20}$$

The target nonlinear energy optimization problem consists of finding an optimal parameter set ${\bf A}^{\prime}$ so that:

$${\bf A}^{\prime}=\arg\min_{{\bf A}}\|{\bf F}({\bf A})\|_{2}^{2}. \tag{21}$$

We solve (21) iteratively. In every optimization step $k$ , the objective is linearized in the vicinity of the current solution ${\bf A}_{k}$ by the first-order Taylor expansion:

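$${\bf F}({\bf A}_{k}+\Delta{\bf A})\approx{\bf F}({\bf A}_{k})+{\bf J}({\bf A}_{k})\Delta{\bf A}, \tag{22}$$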

with ${\bf J}({\bf A}_{k})\in\mathbb{R}^{(BT+TN)\times 3TN}$ being the Jacobian of ${\bf F}$ evaluated at ${\bf A}_{k}$ . For every iteration, the objective for $\Delta{\bf A}$ reads:

$$\min_{\Delta{\bf A}}\|{\bf J}({\bf A}_{k})\Delta{\bf A}+{\bf F}({\bf A}_{k})\|^{2}. \tag{23}$$

In Ceres (Agarwal et al.), the optimum is computed in the least-squares sense with the Levenberg–Marquardt method:

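$$[{\bf J}({\bf A}_{k})^{\mathsf{T}}{\bf J}({\bf A}_{k})+\lambda_{k}{\bf I}]\Delta{\bf A}=-{\bf J}({\bf A}_{k})^{\mathsf{T}}{\bf F}({\bf A}_{k}), \tag{24}$$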

where $\lambda_{k}>0$ is a parameter and ${\bf I}$ is an identity matrix.
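The damped normal equations above can be illustrated with a small, self-contained Levenberg–Marquardt loop in NumPy. This is a didactic sketch and not how Ceres is implemented or called: the finite-difference Jacobian, the simple accept/reject rule for $\lambda_{k}$, the toy skeleton and the value of `beta` are assumptions. The residual vector is also built differently from the paper's ${\bf F}({\bf A})$: it uses one residual per coordinate of ${\bf A}-{\bf S}$ and scales the residuals so that their squared norm equals the objective of subproblem (16).

```python
import numpy as np


def residuals(a, S, bones, L, beta):
    """Stacked residuals whose squared norm is beta/2 * E_BL(A) + 1/2 * ||A - S||_F^2."""
    A = a.reshape(S.shape)                                              # (T, N, 3)
    D = np.linalg.norm(A[:, bones[:, 0]] - A[:, bones[:, 1]], axis=2)   # (T, B)
    bone_res = np.sqrt(beta / 2.0) * (D - L).ravel()
    prox_res = np.sqrt(0.5) * (A - S).ravel()
    return np.concatenate([bone_res, prox_res])


def numerical_jacobian(f, a, eps=1e-6):
    """Forward-difference Jacobian (an assumption; solvers usually use exact derivatives)."""
    f0 = f(a)
    J = np.empty((f0.size, a.size))
    for j in range(a.size):
        step = np.zeros_like(a)
        step[j] = eps
        J[:, j] = (f(a + step) - f0) / eps
    return J


def levenberg_marquardt(f, a0, lam=1e-2, iters=20):
    """Solve (J^T J + lam I) delta = -J^T F and adapt lam by accept/reject."""
    a = a0.copy()
    cost = np.sum(f(a) ** 2)
    for _ in range(iters):
        F, J = f(a), numerical_jacobian(f, a)
        delta = np.linalg.solve(J.T @ J + lam * np.eye(a.size), -J.T @ F)
        new_cost = np.sum(f(a + delta) ** 2)
        if new_cost < cost:          # accept the step and trust the model more
            a, cost, lam = a + delta, new_cost, lam / 10.0
        else:                        # reject the step and increase damping
            lam *= 10.0
    return a, cost


def bone_deviation(a, S, bones, L):
    A = a.reshape(S.shape)
    D = np.linalg.norm(A[:, bones[:, 0]] - A[:, bones[:, 1]], axis=2)
    return float(np.sum((D - L) ** 2))


# Toy data (assumed): a 3-joint chain with unit bones, observed with noise in 5 frames.
rng = np.random.default_rng(3)
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
S = rest[None] + 0.1 * rng.standard_normal((5, 3, 3))    # (T, N, 3)
bones = np.array([[0, 1], [1, 2]])
L = np.array([1.0, 1.0])
beta = 10.0

f = lambda a: residuals(a, S, bones, L, beta)
initial_cost = np.sum(f(S.ravel()) ** 2)
a_opt, final_cost = levenberg_marquardt(f, S.ravel())

print("cost:", initial_cost, "->", final_cost)
print("bone deviation:", bone_deviation(S.ravel(), S, bones, L),
      "->", bone_deviation(a_opt, S, bones, L))
```

Starting from ${\bf A}={\bf S}$, the proximity term is zero, so any accepted step trades a small displacement of the joints for bone lengths that are closer to $L_{b}$, which mirrors the role of subproblem (16) in the alternation.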

The algorithm is summarized in Algorithm 1.

4 Experiments and Results

We extensively evaluate our SfAM on several datasets including Human 3.6m Ionescu et al. (2014), the synthetic sequences of Akhter et al. (2011) and the NYU hand pose dataset Tompson et al. (2014). Moreover, we demonstrate qualitative results on challenging community videos. In total, our SfAM is compared to over thirty state-of-the-art model-based and learning-based methods (see Tables 1 and 2). We also implement SMSR of Ansari et al. (2017), which is the approach most closely related to our SfAM, and evaluate it on Ionescu et al. (2014); Tompson et al. (2014) as well as community videos. Moreover, we extend SMSR Ansari et al. (2017) with the local rigidity constraint of Rehan et al. (2014) and include it in our comparison.

In Section 4.2.2, we evaluate the robustness of our approach to inaccuracies in 2D landmarks. The proposed SfAM recovers correct articulated structures given highly inaccurate initial bone lengths in Section 4.2.3. Finally, in Section 4.2.5, we highlight the numerous cases when our method performs better than state-of-the-art learning-based approaches in real-world scenes.

In all experiments, we use a sliding time window of $200$ frames. For sequences shorter than $200$ frames, we run our method on the whole sequence at once. All experiments are performed on a system with 32 GB RAM and a twelve-core Intel Xeon CPU running at 3.6 GHz. Our framework is implemented in C++. The average processing time for a single frame from the Human 3.6m dataset Ionescu et al. (2014) with given 2D annotations amounts to $140$ ms.
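The windowing rule can be sketched as follows. The window length of $200$ frames and the treatment of short sequences come from the text. The stride and the handling of the trailing frames are not specified there, so the `stride` parameter and the final catch-up window below are assumptions, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_frames, window=200, stride=200):
    """Return (start, end) frame ranges with end exclusive.

    Sequences no longer than `window` are processed at once.
    `stride` and the tail handling are assumptions of this sketch.
    """
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Cover the remaining frames with one last full-length window.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

short = sliding_windows(150)
long_seq = sliding_windows(450)
```

With these assumptions, a 450-frame sequence is covered by three windows, the last of which overlaps the previous one so that every window keeps the full length.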

4.1 Evaluation Methodology

We follow the established evaluation methodology in the area of NRSfM and rigidly align our 3D reconstructions to the ground truth. We report the reconstruction error $\mathcal{E}_{3D}$ in mm between ground truth joint positions $\overline{{\bf S}_{n}^{t}}$ and aligned 3D reconstructions $G({\bf S}_{n}^{t})$ :

[TABLE]

where $n\in\{1,\ldots,N\}$ , $t\in\{1,\ldots,T\}$ , $T$ is the number of frames in the sequence and $N$ is the number of joints of the articulated object. For some datasets, we report the normalized mean 3D error:

[TABLE]

where $\sigma_{tx},\sigma_{ty}$ and $\sigma_{tz}$ denote normalized variances of reconstructions $G({\bf S}_{n}^{t})$ along the $x,y,z$ -axes respectively.
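As an illustration of the first metric, the sketch below computes a mean per-joint Euclidean distance over all $T$ frames and $N$ joints. It assumes that $\mathcal{E}_{3D}$ is this plain average and that the rigid alignment $G(\cdot)$ has already been applied to the reconstruction; the function name and the nested-list layout are hypothetical.

```python
import math

def reconstruction_error(gt, rec_aligned):
    """Mean per-joint Euclidean distance over T frames and N joints.

    gt, rec_aligned: nested lists of shape [T][N][3] in the same units (e.g. mm).
    rec_aligned is assumed to be rigidly aligned to gt beforehand.
    """
    total, count = 0.0, 0
    for frame_gt, frame_rec in zip(gt, rec_aligned):
        for p, q in zip(frame_gt, frame_rec):
            total += math.dist(p, q)
            count += 1
    return total / count

# One frame, two joints: the first joint is off by 5 mm, the second is exact.
gt = [[[0.0, 0.0, 0.0], [0.0, 0.0, 100.0]]]
rec = [[[0.0, 3.0, 4.0], [0.0, 0.0, 100.0]]]
err = reconstruction_error(gt, rec)
```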

4.2 Human Pose Estimation

4.2.1 Human 3.6m Dataset

Human 3.6m Ionescu et al. (2014) is currently the largest dataset for monocular 3D human pose sensing. It is widely used for the evaluation of learning-based human pose estimation methods. Table 1 gives an overview of the quantitative results on Human 3.6m Ionescu et al. (2014). We mark approaches that are trained on Human 3.6m Ionescu et al. (2014) with “*”. We follow three common evaluation protocols. In Protocol #1, we compare the methods on two subjects ($S9$ and $S11$). The original framerate of $50$ fps is reduced to $10$ fps. The learning-based approaches marked with “*” use subjects $S1$, $S5$, $S6$, $S7$, $S8$ and all camera views for training. Testing is done for all cameras. For Protocol #2, only the frontal view (“camera3”) is used for evaluation. For Protocol #3, evaluation is done on every $64$th frame of subject $S11$ for all cameras. The learning-based approaches marked with “*” use subjects $S1$, $S5$, $S6$, $S7$, $S8$ and $S9$ for training.
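The frame selection used in Protocols #1 and #3 amounts to regular subsampling. The sketch below assumes that the first frame of a sequence is kept and that frames are then taken at a fixed step; the offset is not stated in the text, and `subsample` is a hypothetical helper.

```python
def subsample(frames, step):
    """Keep every `step`-th frame, starting with the first one (an assumption)."""
    return frames[::step]

frames = list(range(320))                  # frame indices of a 50 fps sequence
protocol1 = subsample(frames, 50 // 10)    # 50 fps reduced to 10 fps
protocol3 = subsample(frames, 64)          # every 64th frame
```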

For all methods and under all evaluation protocols, we report the reconstruction error $\mathcal{E}_{3D}$ after the rigid alignment of the recovered structures with ground truth. In our method, the bone lengths are initialized with the average values for all the subjects from the dataset.

As Table 1 shows, SfAM achieves accuracy competitive with the best-performing learning-based approaches that are trained on Human 3.6m Ionescu et al. (2014). In Section 4.2.5, we demonstrate that our approach works better in real-world scenes which differ from this dataset.

In Figure 4, we visualize several reconstructions of highly challenging scenes by SMSR Ansari et al. (2017) and the proposed SfAM. See Figure 8 for additional visualizations.

4.2.2 Robustness to Inaccurate 2D Point Tracks

We validate the robustness of our approach to inaccuracies in 2D landmarks on Human 3.6m Ionescu et al. (2014).

We compare SfAM to state-of-the-art learning-based methods Wandt and Rosenhahn (2019); Moreno-Noguer (2017); Martinez et al. (2017) trained on ground truth 2D data. We add Gaussian noise with increasing values of the standard deviation to the 2D ground truth point tracks. The reconstruction error as a function of the standard deviation of the noise is plotted in Figure 5a. SfAM is more robust than the compared methods for moderate and high perturbations, and its error grows very slowly with the increasing noise level. In contrast, the errors of Wandt and Rosenhahn (2019); Moreno-Noguer (2017); Martinez et al. (2017) grow very fast even with a low level of noise. Note that we evaluate our method on a higher level of noise than Wandt and Rosenhahn (2019); Moreno-Noguer (2017); Martinez et al. (2017). The average error of the currently best performing 2D detectors is between 10–15 pixels Wei et al. (2016); Newell et al. (2016). We see that, for 10–15 pixels, SfAM has an error comparable to that of the most accurate learning-based approaches while not relying on training data and generalizing to different object classes.
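The perturbation of the 2D point tracks can be sketched as below. The text only states that Gaussian noise with a given standard deviation is added; that the noise is zero-mean, independent and applied to each coordinate separately is an assumption, as are the function name, the `[T][N][2]` layout and the seeding.

```python
import random

def add_gaussian_noise(tracks, sigma, seed=0):
    """Return a copy of the 2D tracks with i.i.d. zero-mean Gaussian noise added.

    tracks: nested lists of shape [T][N][2] in pixels.
    sigma:  standard deviation of the noise in pixels.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    return [[[x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)]
             for x, y in frame] for frame in tracks]

tracks = [[[10.0, 20.0], [30.0, 40.0]]]
noisy = add_gaussian_noise(tracks, sigma=10.0)
```

Sweeping `sigma` over increasing values and re-running the reconstruction on each perturbed copy would give an error curve of the kind described for Figure 5a.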

4.2.3 Robustness to Incorrectly Initialized Bone Lengths and Real Bone Length Recovery

We study the accuracy of SfAM in recovering articulated structures given incorrectly initialized bone proportions (normalized bone lengths) on the subject $S11$ from Human 3.6m Ionescu et al. (2014). Starting from the ground truth initialization of bone lengths (obtained from the dataset), we change every bone length by adding Gaussian noise with increasing standard deviations in the range $[0, 70]$ mm. This allows us to analyze the recovered bone lengths and the robustness of SfAM to noise in a controlled and well-defined setting. The results of the experiment are plotted in Figure 5b. If the structure is initialized with anthropometric priors from C. Gordon et al. (1989), the error increases by only 3%. Note that our error in bone length estimation is only slightly affected by the increasing levels of noise. It is equal to $54$ mm with ground truth initialization and grows to just $66$ mm with $\sigma=70$ mm. Note that the anthropometric prior corresponds to $\sigma\approx 15$ mm.

Given incorrect initial bone lengths, SfAM recovers not only correct poses, but also accurate sequence-specific bone lengths. We calculate the average difference between ground truth bone lengths of subject $S11$ and the initial ones, provided to our method. We do the same for the recovered structures. The results are best viewed in Figure 5c. Thus, SfAM can be used for precise skeleton estimation.

We also calculate standard deviations of bone lengths of the reconstructed objects for SMSR Ansari et al. (2017) and SfAM. Figure 5d shows that the standard deviation of bone lengths is very high for SMSR Ansari et al. (2017), as it considers a human as a general non-rigid object and changes the bone lengths from frame to frame. SfAM reduces the average standard deviation by $514$ % leading to a more accurate pose reconstruction and structure recovery. In Figure 5d, “Upper Legs” and “Lower Legs” denote bones between the hip/knee and knee/ankle, respectively; “Upper Arms” and “Lower Arms” denote bones between shoulder/elbow and elbow/wrist, respectively.
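The per-bone statistic discussed here can be computed as in the following sketch. It assumes the population standard deviation of each bone's length over the frames of a sequence; whether the population or the sample estimate was used is not stated in the text, and the function name and data layout are hypothetical.

```python
import math
import statistics

def bone_length_std(poses, bones):
    """Standard deviation of each bone's length over all frames.

    poses: nested lists of shape [T][N][3] with reconstructed joint positions.
    bones: list of (i, j) joint index pairs, one per bone.
    """
    return [statistics.pstdev([math.dist(frame[i], frame[j]) for frame in poses])
            for i, j in bones]

# Two frames, one bone whose length changes from 1 to 3 between frames.
stretching = [[[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 0.0], [0.0, 0.0, 3.0]]]
# The same bone merely translated: its length stays constant.
rigid = [[[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
         [[5.0, 0.0, 0.0], [5.0, 0.0, 1.0]]]
std_stretching = bone_length_std(stretching, [(0, 1)])
std_rigid = bone_length_std(rigid, [(0, 1)])
```

A reconstruction that keeps bone lengths constant over time yields values near zero, whereas a method that treats the body as a general non-rigid object can produce large values.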

4.2.4 Synthetic NRSfM Datasets

The synthetic sequences of Akhter et al. (2011) are commonly used for the evaluation of sparse NRSfM. We compare our approach with previous NRSfM methods on challenging synthetic sequences with a large variety of human motions: Drink, Pickup, Stretch, and Yoga Akhter et al. (2008). Some pairs of joints remain locally rigid in these sequences. We activate the articulated constraint for those points and evaluate our method. Table 2 shows the results of SfAM and previous NRSfM methods.

The errors $e_{3D}$ for other listed methods are taken from PPTA Agudo and Moreno-Noguer (2018) and SMSR Ansari et al. (2017). Only PPTA Agudo and Moreno-Noguer (2018) outperforms SfAM on Drink, whereas CSF2 Gotardo and Martínez (2011) achieves a comparable $e_{3D}$ . SfAM achieves the most consistent performance among all compared algorithms.

4.2.5 Real-World Videos

Our algorithm is capable of recovering human motion from challenging real-world videos. We compare our results with the state-of-the-art learning-based approach of Martinez et al. (2017) and one of the best performing general-purpose NRSfM methods, SMSR Ansari et al. (2017). Since ground truth 2D annotations are not available, we use OpenPose Cao et al. (2017) for 2D human body landmark extraction. Bone lengths are initialized with the values from anthropometric data tables C. Gordon et al. (1989). As Figure 6 shows, Martinez et al. (2017) fails to correctly recover poses that are different from the training dataset Ionescu et al. (2014). SMSR Ansari et al. (2017) produces unrealistic human body structures. In contrast to Martinez et al. (2017); Ansari et al. (2017), our method successfully recovers 3D human poses in real-world scenes.

4.3 Hand Pose Estimation

We also evaluate SfAM on the NYU hand pose dataset Tompson et al. (2014), which provides 2D and 3D ground truth annotations for $8252$ different hand poses. The hand model consists of $30$ bones. Hand pose recovery is a challenging problem due to occlusion and many degrees of freedom. We compare the performance of our approach with SMSR Ansari et al. (2017) and its modification with the local rigidity constraint from Rehan et al. (2014). Quantitatively, SfAM achieves $\mathcal{E}_{3D}$ of $14.2$ mm. In contrast, $\mathcal{E}_{3D}$ of SMSR Ansari et al. (2017) is $22.2$ mm, and SMSR with articulated body constraints Rehan et al. (2014) shows $\mathcal{E}_{3D}$ of $19.4$ mm. Hence, the error of Ansari et al. (2017) is 56% higher than that of SfAM ($22.2/14.2\approx 1.56$); equivalently, the inclusion of our articulated prior term reduces the error of Ansari et al. (2017) by about 36%. The qualitative results are shown in Figure 7. As with human bodies, SfAM achieves lower error because it keeps bone lengths constant between frames. When SMSR Ansari et al. (2017) fails to reconstruct the correct 3D pose, SfAM still outputs plausible results.

5 Conclusions

We present a new method for 3D articulated structure recovery from 2D landmarks. The proposed approach is general and not restricted to specific structures or motions. Integrating our soft articulated prior term into a general-purpose NRSfM approach with alternating optimization yields accurate and stable reconstructions.

In contrast to the vast majority of state-of-the-art approaches, SfAM does not require training data or known bone lengths. By ensuring consistency of bone lengths throughout the whole sequence, it optimizes sequence-specific bone proportions and recovers 3D structures. In extensive experiments, it proves its generalizability and shows accuracy close to state-of-the-art on public benchmarks. It also shows a remarkable improvement in accuracy compared to other model-based approaches. Moreover, our method outperforms learning-based approaches in complicated real-world videos. All in all, we show that high accuracy on benchmarks can be achieved without the need for training and parameter tuning for specific datasets.

In future work, we plan to apply SfAM to animal shape estimation and recovery of personalized human skeletons. We also believe it can boost the development of methods for human and hand pose estimation with semi-supervision.

Funding

This research was funded by the project VIDETE of the German Federal Ministry of Education and Research (BMBF), Grant No. 01IW18002.

Abbreviations

The following abbreviations are used in this manuscript:

[TABLE]


References


1. Ramakrishna et al. (2012) Ramakrishna, V.; Kanade, T.; Sheikh, Y. Reconstructing 3D Human Pose from 2D Image Landmarks. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 573–586.
2. Wandt et al. (2016) Wandt, B.; Ackermann, H.; Rosenhahn, B. 3D Reconstruction of Human Motion from Monocular Image Sequences. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2016, 38, 1505–1516.
3. Zhou et al. (2016) Zhou, X.; Zhu, M.; Derpanis, K.; Daniilidis, K. Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
4. Leonardos et al. (2016) Leonardos, S.; Zhou, X.; Daniilidis, K. Articulated motion estimation from a monocular image sequence using spherical tangent bundles. In Proceedings of the International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 587–593.
5. Lee and Chen (1985) Lee, H.J.; Chen, Z. Determination of 3D human body postures from a single view. Comput. Vis. Graph. Image Process. (ICVGIP) 1985, 30, 148–168.
6. Hossain and Little (2018) Hossain, M.R.I.; Little, J.J. Exploiting Temporal Information for 3D Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 69–86.
7. Zhou et al. (2017) Zhou, X.; Huang, Q.; Sun, X.; Xue, X.; Wei, Y. Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 398–407.
8. Mehta et al. (2017) Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision. In Proceedings of the International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 506–516.