Human Pose and Path Estimation from Aerial Video using Dynamic Classifier Selection

Asanka G Perera; Yee Wei Law; Javaan Chahl

arXiv:1812.06408·cs.CV·December 18, 2018


TL;DR

This paper introduces a dynamic classifier selection method for real-time human pose and path estimation from aerial video, combining perspective correction, feature extraction, and classification to improve accuracy and efficiency.

Contribution

The paper presents a novel dynamic classifier selection architecture that enhances human pose and trajectory estimation from aerial video by reducing classification complexity and confining errors.

Findings

01. HOG features outperform CNN features in accuracy.

02. Achieved 99.6% viewpoint accuracy on a walking dataset.

03. Achieved 96.2% pose estimation accuracy.

Abstract

We consider the problem of estimating human pose and trajectory by an aerial robot with a monocular camera in near real time. We present a preliminary solution whose distinguishing feature is a dynamic classifier selection architecture. In our solution, each video frame is corrected for perspective using projective transformation. Then, two alternative feature sets are used: (i) Histogram of Oriented Gradients (HOG) of the silhouette, (ii) Convolutional Neural Network (CNN) features of the RGB image. The features (HOG or CNN) are classified using a dynamic classifier. A class is defined as a pose-viewpoint pair, and a total of 64 classes are defined to represent a forward walking and turning gait sequence. Our solution provides three main advantages: (i) Classification is efficient due to dynamic selection (4-class vs. 64-class classification). (ii) Classification errors are confined to…


Tables

Table 1: Nomenclature

$𝐇$	Homography matrix.
$ϕ$ , $θ$	Elevation angle, azimuth angle.
$ℝ$ , $𝕍$ , $ℙ$	Real number space, vector space, projective space.
$V_{i}$	The $i$ th viewpoint, $i = 1, \dots, 8$ .
$P_{j}$	The $j$ th pose, $j = 1, \dots, 8$ .
$S$	Training sample set.
$K$	Number of classes in a training set.
$K^{'}$	Predicted class.
$𝐌$	An ECOC coding matrix.
$C_{64}$	The 64-class classifier invoked in the initialization stage.
$C_{4} (P, V)$	The 4-class classifier associated with pose $P$ and viewpoint $V$ .

Table 2: Estimation errors of $C_{64}$ and the dynamic classifier using HOG features.

Experiment/dataset	#frames	$e_{\text{pose, with TE}}$ ($C_{64}$)	$e_{\text{pose, with TE}}$ (dynamic classifier)	$e_{\text{viewpoint, with TE}}$ ($C_{64}$)	$e_{\text{viewpoint, with TE}}$ (dynamic classifier)
CMU MoBo	35	62.9%	0%	14.3%	5.7%
HumanEva2	130	73.8%	59.2%	31.5%	19.2%
Scenario 1 ( $h = 2$ m)	250	48.8%	36.8%	6.4%	4.8%
Scenario 2 ( $h = 10$ m)	784	30%	23.5%	11.9%	13%
Scenario 3 ( $h = 10$ m)	1652	27.5%	23.5%	16.2%	16.9%

Table 3: Estimation errors of the dynamic classifier using HOG/CNN features on UAV-captured videos.

Experiment	#frames	$e_{\text{viewpoint}}$ (CNN)	$e_{\text{viewpoint}}$ (HOG)	$e_{\text{pose}}$ (CNN)	$e_{\text{pose}}$ (HOG)
Scenario 1 ( $h = 2$ m), with TE	250	22.8%	4.8%	34%	36.8%
Scenario 1 ( $h = 2$ m), no TE	250	0%	0%	3.2%	1.2%
Scenario 2 ( $h = 10$ m), with TE	787	44.5%	13%	52.7%	23.5%
Scenario 2 ( $h = 10$ m), no TE	787	15.7%	0%	30.4%	3.2%
Scenario 3 ( $h = 10$ m), with TE	1652	30.3%	16.9%	41.5%	23.4%
Scenario 3 ( $h = 10$ m), no TE	1652	16.9%	0.4%	17%	3.8%

Table 4: Estimation errors of $C_{64}$ and the dynamic classifier for perspective-distorted and perspective-corrected videos. Here, “PD” and “PC” refer to “perspective distorted” and “perspective corrected” respectively.

Scenario 2 height	Perspective	#frames	$e_{\text{pose, with TE}}$ ($C_{64}$)	$e_{\text{pose, with TE}}$ (dynamic classifier)	$e_{\text{viewpoint, with TE}}$ ($C_{64}$)	$e_{\text{viewpoint, with TE}}$ (dynamic classifier)
$h = 10$ m	No distortion	787	30%	23.5%	11.9%	13%
$h = 20$ m	PD	784	37.2%	22.1%	18.9%	17.2%
$h = 20$ m	PC	784	49.9%	39.9%	20.9%	20.4%
$h = 30$ m	PD	810	57.9%	56.7%	28.9%	44.8%
$h = 30$ m	PC	810	42.5%	40.6%	25.7%	37.2%
$h = 40$ m	PD	817	68.4%	74.4%	38%	42.6%
$h = 40$ m	PC	817	53.5%	37.3%	30.5%	24.8%

Equations

$$\boldsymbol{x}' = \mathbf{H}\boldsymbol{x}.$$

$$S = \{(\boldsymbol{x}_{1}, y_{1}), (\boldsymbol{x}_{2}, y_{2}), \ldots, (\boldsymbol{x}_{n}, y_{n})\},$$

$$K^{'} = \underset{k \in \{1, \dots, K\}}{\arg\min} \sum_{n=1}^{N} L(m_{k,n}, f_{n}(\boldsymbol{x})).$$
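The decoding rule above can be illustrated with a short sketch. It assumes a coding matrix with entries in {-1, 0, +1} and a hinge loss for $L$; the source only names a generic loss, so the loss choice, the function names and the example scores are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, Sequence


def hinge_loss(code: int, score: float) -> float:
    """Hinge loss between a code entry (-1, 0 or +1) and a learner score.

    Assumption: the source does not specify L; hinge loss is used here only
    as a common choice for SVM-based binary learners.
    """
    return max(0.0, 1.0 - code * score)


def ecoc_decode(
    coding_matrix: Sequence[Sequence[int]],
    scores: Sequence[float],
    loss: Callable[[int, float], float] = hinge_loss,
) -> int:
    """Return the 1-based class k that minimizes sum_n L(m[k][n], f_n(x))."""
    totals = [
        sum(loss(m_kn, s_n) for m_kn, s_n in zip(row, scores))
        for row in coding_matrix
    ]
    best_row = min(range(len(totals)), key=totals.__getitem__)
    return best_row + 1


# Hypothetical one-vs-one coding matrix for K = 3 classes and N = 3 binary
# learners (columns: 1-vs-2, 1-vs-3, 2-vs-3), with made-up learner scores.
M = [
    [+1, +1, 0],
    [-1, 0, +1],
    [0, -1, -1],
]
scores = [-0.8, 0.3, 1.2]
predicted = ecoc_decode(M, scores)
```

With these scores the aggregated losses are 3.5, 1.2 and 4.5 for classes 1, 2 and 3, so the second class is selected.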

$$\{(P_{i}, V_{j}), (P_{i \boxplus 1}, V_{j}), (P_{i \boxplus 1}, V_{j \boxminus 1}), (P_{i \boxplus 1}, V_{j \boxplus 1})\},$$

$$i \boxplus j = (i + j - 1) \bmod 8 + 1,$$

$$i \boxminus j = (i - j - 1) \bmod 8 + 1.$$
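The wrap-around index operations and the four-class candidate set can be sketched as follows. This is a minimal sketch that assumes pose and viewpoint indices both range over 1, ..., 8 and that both operations must return a value in that range; the function names are illustrative and do not come from the paper.

```python
from typing import List, Tuple

NUM_POSES = 8
NUM_VIEWPOINTS = 8


def wrap_add(i: int, j: int, n: int = 8) -> int:
    """Cyclic addition on the index set 1..n (8 wraps around to 1)."""
    return (i + j - 1) % n + 1


def wrap_sub(i: int, j: int, n: int = 8) -> int:
    """Cyclic subtraction on the index set 1..n (1 wraps around to 8)."""
    return (i - j - 1) % n + 1


def candidate_classes(pose: int, viewpoint: int) -> List[Tuple[int, int]]:
    """Pose-viewpoint pairs considered by the 4-class classifier C_4(P, V).

    The subject either stays in the current state, or advances to the next
    pose while keeping the viewpoint or moving to an adjacent viewpoint.
    """
    next_pose = wrap_add(pose, 1, NUM_POSES)
    return [
        (pose, viewpoint),
        (next_pose, viewpoint),
        (next_pose, wrap_sub(viewpoint, 1, NUM_VIEWPOINTS)),
        (next_pose, wrap_add(viewpoint, 1, NUM_VIEWPOINTS)),
    ]
```

For example, `candidate_classes(8, 1)` wraps both indices and yields the pairs (8, 1), (1, 1), (1, 8) and (1, 2).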

$$e_{\text{pose, with TE}} \overset{\mathrm{def}}{=} \frac{\left|\text{\#frames with misclassified poses}\right|}{\text{\#frames}} \times 100\%.$$

$$e_{\text{pose, no TE}} \overset{\mathrm{def}}{=} \frac{\left|\text{\#frames with misclassified poses} - \text{\#frames with pose TE}\right|}{\text{\#frames}} \times 100\%.$$

$$e_{\text{viewpoint, with TE}} \overset{\mathrm{def}}{=} \frac{\left|\text{\#frames with misclassified viewpoints}\right|}{\text{\#frames}} \times 100\%.$$

$$e_{\text{viewpoint, no TE}} \overset{\mathrm{def}}{=} \frac{\left|\text{\#frames with misclassified viewpoints} - \text{\#frames with viewpoint TE}\right|}{\text{\#frames}} \times 100\%.$$
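The four error measures share one form, which the following sketch computes. The frame counts in the example are invented purely for illustration and are not taken from the experiments; the function and variable names are likewise assumptions.

```python
def error_rate(num_misclassified: int, num_frames: int, num_te_frames: int = 0) -> float:
    """Percentage of misclassified frames.

    Leave num_te_frames at 0 for the "with TE" measures; pass the number of
    frames with TE to obtain the corresponding "no TE" measure.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return 100.0 * abs(num_misclassified - num_te_frames) / num_frames


# Hypothetical counts: 200 frames, 50 misclassified, 40 of them with TE.
e_with_te = error_rate(50, 200)
e_no_te = error_rate(50, 200, num_te_frames=40)
accuracy_no_te = 100.0 - e_no_te
```

The accuracy figures quoted in the abstract relate to these errors in the same way: a "no TE" viewpoint error of 0.4% corresponds to an accuracy of 99.6%.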


Full text

A. G. Perera, Y. W. Law, J. Chahl: School of Engineering, University of South Australia, Mawson Lakes, SA 5095, Australia

J. Chahl: Joint and Operations Analysis Division, Defence Science and Technology Group, Melbourne, Victoria 3207, Australia


Abstract

Background / introduction: We consider the problem of estimating human pose and trajectory by an aerial robot with a monocular camera in near real time. We present a preliminary solution whose distinguishing feature is a dynamic classifier selection architecture.

Methods: In our solution, each video frame is corrected for perspective using projective transformation. Then, two alternative feature sets are used: (i) Histogram of Oriented Gradients (HOG) of the silhouette, (ii) Convolutional Neural Network (CNN) features of the RGB image. The features (HOG or CNN) are classified using a dynamic classifier. A class is defined as a pose-viewpoint pair, and a total of 64 classes are defined to represent a forward walking and turning gait sequence.

Results: Our solution provides three main advantages: (i) Classification is efficient due to dynamic selection (4-class vs. 64-class classification). (ii) Classification errors are confined to neighbors of the true viewpoints. (iii) The robust temporal relationship between poses is used to resolve the left-right ambiguities of human silhouettes.

Conclusions: Experiments conducted on both fronto-parallel videos and aerial videos confirm our solution can achieve accurate pose and trajectory estimation for both scenarios. We found that using HOG features provides higher accuracy than using CNN features. For example, applying the HOG-based variant of our scheme to the “walking on a figure 8-shaped path” dataset (1652 frames) achieved estimation accuracies of 99.6% for viewpoints and 96.2% for poses.

Keywords: Pose estimation, Gait estimation, Trajectory estimation, Dynamic classifier selection, UAV, Drone

1 Introduction

The research reported in this article is motivated by application scenarios such as disaster response, in which an unmanned aerial vehicle (UAV) is required to recognize the actions of a human subject and then take responsive actions. Such a scenario poses the following challenges: (i) Before a UAV can even begin recognizing human actions, it first has to compute how to orient itself towards the human subject. (ii) Many UAVs are equipped with only one monocular camera; hence, additional data provided by stereoscopic, infrared and more advanced cameras is unavailable. (iii) Recognizing human actions from videos captured from a stationary platform is already a challenging task, owing to the articulated structure and range of poses of the human body. (iv) The difficulty of recognition is compounded by the quality of the videos, which suffer from perspective distortion, occlusion and motion blur.

We assume the UAV is in hovering flight, having a human subject within its field of view (see https://asankagp.github.io/aerialgaitdataset/). We estimate the gait sequence and movement trajectory of a person from a video captured by a UAV. Our solution consists of the following modules:

The perspective correction module compensates for perspective distortion in aerial videos using a projective transformation technique similar to orrite04shape ; richter11perspectives ; rogez14exploiting . Instead of one homography matrix, pre-annotated homography matrices are used for different levels of distortion caused by different camera elevation angles.

The segmentation and feature extraction module performs segmentation, using Histograms of Oriented Gradients (HOG) dalad05histograms or alternatively, Convolutional Neural Network (CNN) lecun98gradient features as shape descriptors. The model-free approach is combined with silhouette-based shape matching (in the case of HOG) or RGB-based shape matching (in the case of CNN) for efficient processing.

The pose estimation module uses a dynamic classifier selection architecture inspired by woods97combination ; ko08dynamic . A total of 64 classes are defined for the combinations of 8 poses/gaits and 8 viewpoints in a human gait cycle. These views include front, back, side and diagonal views. Instead of performing 64-class classification, our dynamic classifier leverages the temporal relationships between poses and viewpoints, and performs significantly more reliable 4-class classifications.

The trajectory estimation module estimates the trajectory of the human subject by reconstructing the pose sequence using 3-D skeletons and localizing them with respect to the initial pose and viewpoint. The reconstructed poses provide the approximate shape of the ground-truth walking path.

The contribution of the paper is twofold.

• A classifier architecture for efficiently and robustly estimating human gaits in monocular aerial videos. By exploiting the temporal relationships between poses and viewpoints, our method can limit the wrongly estimated viewpoints to the adjacent viewpoints of the ground truth. The loss of limb and joint details in images usually leads to a left-right ambiguity issue, especially in front and back views agarwal06recovering . However, our pose estimation solves this problem by taking into account possible temporal transitions between states. The dynamic classifier architecture presented in this work does not execute all the classifiers in the pool to make a decision. Instead, only the relevant classifier is selected based on the state transition graphs. This is a significant difference compared to similar architectures in the literature woods97combination ; kuncheva01decision ; tulyakov08review . Experimental results confirm the proposed dynamic classifier is suitable for gait estimation in both ground and aerial videos.

• The creation of a training dataset from an aerial platform for gait recognition. This dataset accounts for the natural twists and self-occlusions of a turning human body and minimizes the false positives caused by minor variations in heading.

The rest of this article is organized as follows. Section 2 discusses closely related work on perspective correction, pose estimation and their applications to UAV-based scenarios. Section 3 describes our solution. Section 4 reports experimental results. Discussion of issues and potential improvements is presented in Section 5. Section 6 concludes.

2 Related Work

This work is an extension of the approach described in perera18human . Compared to perera18human , we propose perspective correction for reducing the errors of the dynamic classifier. Compensation for the perspective distortion is analyzed for videos captured at different heights/angles. Further, we combine transfer learning and dynamic classifiers to perform CNN-based classification. The dynamic classifiers are evaluated for HOG and CNN features. The performance measures are calculated with respect to the ground truth of test videos. The dataset used for this work will be publicly available. The study reported in perera18human was performed only with HOG features of silhouettes, and the ground truth pose information was not considered for performance analysis.

The problem of recognizing human pose in statically captured videos has been studied extensively in recent literature wang10review . Here, we discuss some closely related work.

2.1 Perspective correction

Perspective distortion needs to be corrected for processing that is robust to distance changes. Projective transformation, or homography, is an established approach for correcting perspective distortion hartley03multiple , but this traditional approach requires the vanishing point to be manually specified. Rogez et al. rogez06viewpoint ; rogez14exploiting used vertical scene lines to estimate the vanishing point and localize the reconstructed poses based on the vanishing point, but their approach still requires manual determination of which lines are vertical. Our homography step is similar to Rogez et al.’s, except we determine the vanishing point based on the altitude and angle of the camera. Moreover, Rogez et al. conducted their study on statically captured videos, while we use video captured from a UAV.

2.2 Pose estimation by classification

Dynamic classifier selection (DCS) was originally proposed by Woods et al. woods97combination , and is based on the local accuracy estimation of each individual classifier. Their approach selects an individual classifier which is most likely correct for a given sample. The final decision is made only by the selected classifier. Kuncheva kuncheva02switching proposed a classifier selection and fusion method, but the experimental results show that DCS is the best performer while Kuncheva’s is the second best, provided the classifiers have the same structure and training protocol. Ko et al. ko08dynamic developed a relatively similar classifier by integrating a majority voting system. Our classifier follows the DCS principles, but the best individual classifier is selected without executing the entire ensemble of classifiers.

Gaits estimated using direct classification wang10review do not respect any temporal order xue10infrared ; collins02silhouette . A better approach is to consider the temporal relationship between poses, using techniques such as the ratio of the number of pixels in the intersection to the number of pixels in the union of two silhouette frames sarkar05humanid , dynamic time warping veeraraghavan05matching , lower limb joint angles zeng14model and frequency analysis of spatio-temporal gait signatures boulgouris05gait , to mention a few. Furthermore, general approaches to spatio-temporal action recognition can be found in sheikh05exploring ; rao02action ; rapantzikos11spatiotemporal ; chen16action . The temporal order of our reconstructed poses is based on the state transition model of poses and viewpoints.

Recent human pose estimation research has shown significant performance improvements by incorporating deep learning techniques weibo17survey . The state of the art in human pose estimation has adopted convolutional neural networks as the main building block wei16convolutional ; newell16stacked ; rogez17lcr ; pishchulin16deepcut . Deep learning models have also been adopted for human trajectory estimation in a range of settings. Some notable extensions are Recurrent Neural Networks (RNN) rajiv16applying , Behaviour-CNN yi16pedestrian , unsupervised feature learning for classification and regression tharindu17soft , and deep recurrent Long Short Term Memories (LSTMs) labbaci17deep , to mention a few.

Left-right ambiguity is an inherent problem in silhouette-based pose estimation. Some strategies have been proposed in relation to depth information. Shotton et al. shotton11realtime trained classifiers to learn subtle visual cues from silhouettes to resolve the left-right ambiguity. In zhao15strategy , the depth of each pixel in the silhouette was used to expand the 2D shape context into 3D space. Approaches have also been proposed for silhouettes in RGB images. Sigal et al. sigal06measure proposed switching the left-right limbs of their graphical model to fit the silhouette with the smallest error. Some notable studies incorporated temporal information to infer the correct views of silhouettes. Both the Discrete Cosine Transform (DCT) temporal prior in huang17towards and the sequential clonal selection algorithm (CSA) in li14generative handled this issue using the temporal continuity in images. Similarly, Lan et al. lan05beyond selected the left vs. right configuration that is most consistent with the previous frame. Our proposed algorithm also uses temporal information between images, but unlike the above methods, the transitions are determined based on simple state transition graphs.

In recent literature, transfer learning approaches have been used effectively for human motion analysis problems. In a transfer learning setting, the first selected layers of a base network are used to develop the target network yosinki14how . Notable work was done by Chaturvedi et al. chaturvedi15deep , who employed deep transfer learning (DTL) to analyse the trajectories of basketball players using time-delayed Gaussian networks. Martín-Félez et al. Martin-Felez12gait developed a system that learned gait features independently of the identity of people by applying transfer learning on a bipartite ranking model. A transfer learning approach similar to the one described in this paper can be found in farrajota16deep , which introduced a framework for pose and gait estimation of elderly people. Their transfer learning was based on the AlexNet model alex12imagenet , followed by a Siamese network to compare faces and upper/full bodies. Some related work focused on recognizing human actions across changes in the observer viewpoint rahmani17learning ; farhadi08learning . Rahmani et al. rahmani17learning proposed a Robust Non-Linear Knowledge Transfer Model (R-NKTM) for action recognition from novel viewpoints. A similar problem was addressed by Farhadi et al. in farhadi08learning by training a discriminative appearance model. The authors used Maximum Margin Clustering to construct split-based features in the source view, then trained a classifier by transferring the splits in the source view to the target view.

Using UAV imagery for human pose or gait estimation is challenging due to platform mobility and susceptibility to wind gusts. UAVs are deployed in situations where it would be beneficial to interpret human movement, particularly in search and rescue applications andriluka10vision , human-machine interface systems naseer13followme and surveillance systems lim15monocular ; aguilar17pedestrian . When employed in surveillance or search and rescue the movement of human subjects and their trajectory are vital information. Trajectory can be used for semantic analysis of human activities and prediction of future locations from video sequences lao09automatic . Our vision-based trajectory estimation is relatively similar to lim15monocular in terms of visual sensing.

2.3 UAV-based applications

Utilizing UAVs in human tracking and action recognition missions is a relatively new topic. Human detection methods from aerial videos have been suggested in relation to search and rescue missions rudol08human ; andriluka10vision . The primary focus of these studies was to identify humans lying or sitting on the ground. Al-Naji et al. al-naji17remote used a hovering UAV to detect the vital signs of a human subject from the head and neck areas. Some studies focused on human identity recognition in low-resolution aerial videos. Oreifej et al. oreifej10human presented an algorithm relying on a weighted voter-candidate formulation. The algorithm detects targets by analyzing the “blobs” of candidates against voters and addresses the need for human blob detection and tracking. Yeh et al. yeh16fast proposed a relatively similar blob matching approach using an adaptive reference set of previously identified people. A system developed for UAV onboard gesture recognition was proposed by Monajjemi et al. in monajjemi15uav . The system identified periodic movements of waving hands from other periodic movements like walking and running in an outdoor environment. Our experimental set-up is most similar to Monajjemi et al.’s. A crowd detection and localization approach using one UAV and a number of unmanned guided vehicles (UGVs) was presented in minaeian16vision . In contrast, our study uses a simpler configuration, but performs robustly on aerial video.

3 Methodology

This section provides details of the perspective correction, segmentation and feature extraction, pose estimation and trajectory estimation modules. The block diagram of the entire process is given in Fig. 1. See Table 1 for the nomenclature used in this article.

3.1 Perspective correction

The relative orientation between the human subject on the ground and the camera in the sky is captured in a horizontal coordinate system, with coordinates $\phi$ and $\theta$ (see Fig. 2(a)). The camera viewpoint can take any $(\phi,\theta)$ pair depending on the UAV position, where $\phi\in[0,\pi/2]$ and $\theta\in[0,2\pi)$ . A major problem with aerial photography is vertical perspective distortion, which occurs when $\phi>0$ , and worsens as $\phi$ gets larger. At low altitude the distorted human shape tends to have a large head and shoulders and small feet (see Fig. 2(b)). When $\phi=90^{\circ}$ , perspective distortion cannot be corrected. For $60^{\circ}\leq\phi<90^{\circ}$ , the captured images have a severely distorted perspective that is difficult to accurately compensate. Therefore in this study, we limit the maximum $\phi$ to $60^{\circ}$ .

Perspective correction is done by mapping the distorted image plane (see Fig. 2(b)) to the undistorted vertical plane through homography. Segments on the undistorted vertical plane then enable the matching of test and training images.

A homography is a mapping from a projective space to itself. A projective space is an extension of Euclidean space in which two lines always meet at a point, and a point in the projective space is called a homogeneous point. Given an image, for every homogeneous point on the image plane, $\boldsymbol{x}$ , there exists a homography matrix $\mathbf{H}$ (smith04invitation, Section 3.1) that maps it to a homogeneous point, $\boldsymbol{x}^{\prime}$ , on the undistorted vertical plane, i.e.,

$$\boldsymbol{x}' = \mathbf{H}\boldsymbol{x}.$$
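As an illustration of this mapping, the sketch below applies a 3x3 homography to a pixel coordinate by lifting it to a homogeneous point and dividing by the last component afterwards. The matrix `H_example` is a made-up value chosen only to show a perspective effect; it is not one of the pre-annotated matrices used in this work.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def apply_homography(H: Matrix3, point: Tuple[float, float]) -> Tuple[float, float]:
    """Map an image point through x' = H x and return Cartesian coordinates."""
    x, y = point
    homogeneous = (x, y, 1.0)
    mapped = [sum(H[r][c] * homogeneous[c] for c in range(3)) for r in range(3)]
    if abs(mapped[2]) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return (mapped[0] / mapped[2], mapped[1] / mapped[2])


# Hypothetical homography: the third row makes the scale depend on the
# vertical image coordinate, which is the kind of effect perspective
# correction has to undo.
H_example = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.002, 1.0],
]
corrected = apply_homography(H_example, (100.0, 200.0))
```

In practice an image would be warped by applying the same mapping (or its inverse) to every pixel, typically with a library routine rather than a per-point loop.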

The matrix $\mathbf{H}$ depends on the elevation angle $\phi$ . Instead of calculating $\mathbf{H}$ for each frame, we calculate it offline for each of the following values of $\phi$ :

• $\arctan(10/30)=18.4^{\circ}$ ,

• $\arctan(20/30)=33.7^{\circ}$ ,

• $\arctan(30/30)=45.0^{\circ}$ ,

• $\arctan(40/30)=53.1^{\circ}$ ;

and manually pre-annotate videos of the same elevation angle with the corresponding $\mathbf{H}$ . To calculate $\mathbf{H}$ , we manually select four points in a sample video frame to (i) delineate the area of interest and (ii) generate the vertical scene lines, as shown in Fig. 2(b). The vertical scene lines define the homography matrix $\mathbf{H}$ .
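The elevation angles listed above follow from the camera height and a horizontal camera distance of 30 m. The sketch below reproduces them and checks them against the $60^{\circ}$ limit. The helper `nearest_precomputed_height` is a hypothetical convenience for picking which offline matrix to use; in this work the videos are annotated manually.

```python
import math

CAMERA_DISTANCE_M = 30.0
HEIGHTS_M = (10.0, 20.0, 30.0, 40.0)
MAX_ELEVATION_DEG = 60.0


def elevation_deg(height_m: float, distance_m: float = CAMERA_DISTANCE_M) -> float:
    """Elevation angle phi in degrees for a given camera height and distance."""
    return math.degrees(math.atan2(height_m, distance_m))


# Elevation angle, rounded to one decimal place, for each camera height.
ELEVATIONS = {h: round(elevation_deg(h), 1) for h in HEIGHTS_M}


def nearest_precomputed_height(height_m: float) -> float:
    """Hypothetical helper: choose the height whose offline H is closest."""
    return min(HEIGHTS_M, key=lambda h: abs(h - height_m))
```

All four angles stay below the $60^{\circ}$ limit, so each height has a usable perspective correction.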

3.2 Segmentation and feature extraction

After perspective correction, the human silhouette is segmented. The size of the silhouette in the image plane varies depending on the direct distance between the camera and the human subject. Perspective correction alone cannot address this scaling issue. Thus, the test silhouette is scaled up or down to match the scale of the training images. Prior to feature extraction, we use the online video annotation tool VATIC vondrick13efficiently to annotate the test videos. Two types of features are used, resulting in two variants of our scheme: HOG-based and CNN-based.

3.2.1 Feature extraction using HOG

For each frame, the RGB image is converted into a binary image and its bounding box area is segmented. Noise is removed using a Gaussian filter and small objects containing fewer than a threshold number of pixels are also removed. The remaining blob or blobs are considered to represent the human silhouette. Currently, the denoising parameters and segmentation parameters are customized for each video clip to obtain the best possible silhouette, so they are subject to improvements.
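The small-object removal step can be sketched as a connected-component filter on the binary image. This sketch assumes 8-connectivity and a caller-supplied pixel threshold, neither of which is specified above, and it omits the Gaussian filtering step.

```python
from collections import deque
from typing import List

Mask = List[List[int]]


def remove_small_blobs(mask: Mask, min_pixels: int) -> Mask:
    """Keep only connected blobs (8-connectivity) with at least min_pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects every pixel of the current blob.
            blob = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) >= min_pixels:
                for y, x in blob:
                    cleaned[y][x] = 1
    return cleaned


# Toy binary frame: one isolated noise pixel and one 3x2 foreground blob.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
silhouette = remove_small_blobs(noisy, min_pixels=3)
```

Because the threshold is tuned per video clip in the current pipeline, `min_pixels` is left as a parameter rather than fixed.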

For feature extraction, the image window is divided into small spatial regions called “HOG cells” dalad05histograms . The weighted gradients in a HOG cell form a 1-D histogram which represents the orientation of the edge lines. The feature vector is formed from the HOG blocks, each of which represents a group of HOG cells.
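The cell and block structure described above can be illustrated with a simplified HOG computation. This is a sketch under stated assumptions, not the configuration used in the experiments: the cell size, the 9 unsigned orientation bins, the 2x2-cell blocks and the L2 block normalization are common defaults chosen here for illustration, and bin interpolation is omitted.

```python
import math
from typing import List, Sequence

Image = Sequence[Sequence[float]]


def hog_cell_histograms(image: Image, cell_size: int = 4, num_bins: int = 9):
    """Magnitude-weighted histogram of unsigned gradient orientations per cell."""
    rows, cols = len(image), len(image[0])
    cells_y, cells_x = rows // cell_size, cols // cell_size
    hist = [[[0.0] * num_bins for _ in range(cells_x)] for _ in range(cells_y)]
    bin_width = 180.0 / num_bins
    for y in range(cells_y * cell_size):
        for x in range(cells_x * cell_size):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            bin_index = min(int(angle // bin_width), num_bins - 1)
            hist[y // cell_size][x // cell_size][bin_index] += magnitude
    return hist


def hog_descriptor(image: Image, cell_size: int = 4, num_bins: int = 9,
                   block_size: int = 2) -> List[float]:
    """Concatenate L2-normalized blocks of neighbouring cell histograms."""
    cells = hog_cell_histograms(image, cell_size, num_bins)
    feature: List[float] = []
    for by in range(len(cells) - block_size + 1):
        for bx in range(len(cells[0]) - block_size + 1):
            block = [v
                     for dy in range(block_size)
                     for dx in range(block_size)
                     for v in cells[by + dy][bx + dx]]
            norm = math.sqrt(sum(v * v for v in block) + 1e-12)
            feature.extend(v / norm for v in block)
    return feature


# Toy 8x8 silhouette with a single vertical edge down the middle.
edge_image = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]
descriptor = hog_descriptor(edge_image)
```

For the toy edge image all gradient energy falls into the first orientation bin of each cell, which is what one would expect for a purely horizontal intensity change.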

3.2.2 Feature extraction using CNN

For each frame, the RGB image is cropped to meet the size requirement of the deep CNN AlexNet alex12imagenet . AlexNet is an 11-layer network comprising 5 convolutional layers, 3 fully connected layers and 3 max pooling layers (see Fig. 3). The early convolutional layers have small receptive field sizes for learning low-level features, and later layers have larger field sizes for learning higher-level features. AlexNet has been pre-trained on 1.2 million ImageNet deng09imagenet images of 1000 classes, and showed the best performance in the ImageNet Large Scale Visual Recognition Challenge in 2012 alex12imagenet . Some classes of AlexNet are trained on images of humans in different settings; therefore, this pre-trained network was selected for our work.

In a pre-trained network, the weights for the deep layers are pre-determined. Instead of re-training AlexNet with our comparatively small dataset, we apply transfer learning in the standard way. We take the 4096-dimensional vector right before the last fully-connected layer of AlexNet as the feature vector (see Fig. 3). We then use the feature vectors to train an SVM classifier (as described in Section 3.4).

3.3 Pose estimation: classifier training

The training dataset is created to identify the eight sub-steps whittle14gait of the human gait cycle (see Fig. 4). Each sub-step (or pose) has viewpoints from eight radial directions (azimuth angles that are $45^{\circ}$ apart), giving rise to $8\times 8=64$ pose-viewpoint pairs. The finite number of elevation-azimuth angle pairs are equivalent to the discretized viewing hemisphere described in rogez14exploiting ; rogez06viewpoint ; rosales06combining .
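The 64 pose-viewpoint pairs can be enumerated as in the sketch below. The class names follow the $P_{i}V_{j}$ notation used in this article, whereas the numeric label ordering (pose-major, 1 to 64) is an assumption made only for illustration.

```python
from typing import Tuple

NUM_POSES = 8
NUM_VIEWPOINTS = 8


def class_label(pose: int, viewpoint: int) -> int:
    """Map a pose-viewpoint pair (both in 1..8) to a label in 1..64."""
    if not (1 <= pose <= NUM_POSES and 1 <= viewpoint <= NUM_VIEWPOINTS):
        raise ValueError("pose and viewpoint must be in 1..8")
    return (pose - 1) * NUM_VIEWPOINTS + viewpoint


def pose_viewpoint(label: int) -> Tuple[int, int]:
    """Inverse of class_label: recover (pose, viewpoint) from a label."""
    pose_index, viewpoint_index = divmod(label - 1, NUM_VIEWPOINTS)
    return pose_index + 1, viewpoint_index + 1


def class_name(pose: int, viewpoint: int) -> str:
    """Readable class name such as 'P5V4'."""
    return f"P{pose}V{viewpoint}"


ALL_CLASSES = [class_name(p, v)
               for p in range(1, NUM_POSES + 1)
               for v in range(1, NUM_VIEWPOINTS + 1)]
```

Any bijection between pairs and labels would serve equally well; the only requirement is that the training labels and the predicted labels use the same convention.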

We create two training datasets using silhouettes and color images. The silhouette dataset has 1017 images across 64 classes. The color image dataset contains images from our field videos, the MoBo Aligned dataset rogez12fast and the HumanEva dataset sigal09humaneva . To create this dataset, we collected images with varying perspective distortion for each class. The field images recorded from moving subjects have varying perspective distortions because the subjects walked in a circle (see the first three images in Fig. 7). The original MoBo images gross01cmu were recorded on a treadmill using fixed cameras. We changed the backgrounds of MoBo Aligned images and manually added some perspective distortion to the images (see the second three images in Fig. 7) and no modifications were done to HumanEva images (the last two images in Fig. 7). We manually selected 4 points on the MoBo Aligned images and applied perspective transformation to get a bird’s eye view of them. We collected field images of 3 subjects, MoBo Aligned images of 15 subjects and HumanEva images of 2 subjects. The color dataset contains 8111 color images across 64 classes.

We used the entire training dataset (silhouettes or color images) to train the classifiers. The testing has been done using five selected videos (silhouettes or color images): a video each from CMU Motion of Body (MoBo) gross01cmu and HumanEva2 sigal09humaneva datasets (see Sect. 4.1 for more details) and three aerial videos (see Sect. 4.2 for more details). The annotated training and testing data will be publicly released in 2018.

Figures 5–6 show only the silhouette images, but the same technique is followed to create the color dataset. An example of training data collection is shown in Fig. 5. The silhouettes in the figure correspond to the first sub-step of the human gait cycle shown in Fig. 4, namely $P_{1}$ . The training data are collected at a camera distance of 30m and camera height of 10m (i.e., $\phi=18.4^{\circ}$ ), while the human subject walks on a marked circle of radius 5m in clockwise and anticlockwise directions. When walking from $A$ to $B$ on the circle, the orientation gradually changes up to $45^{\circ}$ with respect to the orientation at $A$ . In the training dataset, the images corresponding to the walk from $A$ to $B$ are considered as walking a straight line from $A$ to $B$ . This assumption can introduce a maximum of $22.5^{\circ}$ orientation error. This error can be reduced by selecting more viewpoints (in other words selecting a viewpoint separation angle of less than $45^{\circ}$ ). However, we limit this study to eight viewpoints for simplicity and efficiency. The images captured at locations $A$ , $B$ , $E$ and $F$ on the circle have the maximum orientation error. Only the images captured at the mid-points of chords $AB$ and $EF$ represent the true orientation. The training images are selected in order to cover all of the possible heading directions within the accepted error margin ( $\pm 22.5^{\circ}$ ). The same procedure is followed to create the other 63 pose-viewpoint pairs as well. When the training dataset is used to estimate the poses in a test video of a person walking in a circle, the reconstructed path will not be a perfect circle (even assuming zero estimation errors) but a polygon. The reason for this polygon shape is the orientation angle error associated with each viewpoint.
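The orientation error introduced by using eight viewpoints can be checked numerically. The sketch below snaps a heading to the nearest of eight viewpoints spaced $45^{\circ}$ apart and measures the resulting error; the convention that viewpoint 1 is centred at $0^{\circ}$ is an assumption made for illustration.

```python
import math

SPACING_DEG = 45.0
NUM_VIEWPOINTS = 8


def nearest_viewpoint(heading_deg: float) -> int:
    """Index (1..8) of the viewpoint whose centre is closest to the heading."""
    steps = math.floor((heading_deg % 360.0) / SPACING_DEG + 0.5)
    return int(steps) % NUM_VIEWPOINTS + 1


def quantization_error(heading_deg: float) -> float:
    """Absolute angular difference between heading and its viewpoint centre."""
    centre = (nearest_viewpoint(heading_deg) - 1) * SPACING_DEG
    signed = (heading_deg - centre + 180.0) % 360.0 - 180.0
    return abs(signed)


# Sweep headings in half-degree steps around a full circle.
worst_case = max(quantization_error(h / 2.0) for h in range(720))
```

The worst case over the sweep equals $22.5^{\circ}$, in line with the error margin stated above, and a subject walking in a circle is therefore reconstructed as a path with eight straight segments.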

One advantage of the training data collection method above is that it accounts for the natural twists and self-occlusions of walking better than collecting data only from walking straight. For example, the images captured at points $A$ and $E$ in Fig. 5 have the same orientation error. However, the silhouettes of the same pose can differ, as the person at $A$ turns to his right and the person at $E$ turns to his left on the circle. Another significant advantage is that it reduces the false positives of the classifier arising from slight variations of heading. By including slightly oriented silhouettes (with respect to walking straight) in the dataset, we approximate all of the small variations in orientation to be within the range $[0^{\circ},22.5^{\circ}]$ . This is a useful approximation when analyzing real walking patterns of human subjects because most of the time, people walk in straight lines and do not change their orientation frequently.

The collected training data consists of 64 labels, representing eight sub-steps of the gait cycle and eight viewpoints (see Fig. 6). For each label, the training data consists of silhouettes (for HOG) and RGB images (for CNN) created under different illuminations and orientations. Some sample images of class $P_{5}V_{4}$ are shown in Fig. 7.

3.4 Pose estimation: classifier design

We denote a training dataset of $n$ observations by

[TABLE]

where $\boldsymbol{x}_{i}$ is the $i$ th feature vector, and $y_{i}$ the $i$ th label. Suppose $\boldsymbol{x}_{i}\in\mathcal{X}\subset\mathbb{R}^{m}$ , where $m$ is the dimension of a feature vector; and $y_{i}\in\mathcal{Y}=\{1,...,K\}$ , where $K$ is the number of classes. We can formulate the pose estimation problem, like most classification problems in computer vision garcia06improving , as a $K$ -class classification problem: finding $f\colon\mathcal{X}\rightarrow\mathcal{Y}$ such that the classification error is minimized.

However, many real-world problems are multiclass problems, i.e., $K>2$ . A standard way to create multiclass classifiers such as multiclass SVM is to map a multiclass problem onto many, possibly simpler, two-class problems garcia06improving . A potential solution is a classifier combination method, the basic idea of which is to execute an ensemble of classifiers, and combine their outputs through a voting system, combination function, or weighting function tulyakov08review . In such a design, although each classifier is trained with a subset of the entire training set, each iteration involves the entire ensemble of classifiers. As a more efficient solution, we propose a dynamic classifier selection architecture, combining (i) a state transition model for the pose and viewpoint and (ii) an SVM-based error-correcting output codes (ECOC) framework dietterich95solving for our multiclass pose-viewpoint classification problem. Next, we will discuss this state transition model and ECOC framework in turn.

3.4.1 Viewpoint and pose transition

We model a pose-viewpoint pair as a state, and the transition of states using a state transition graph. This graph should not be confused with a Markov chain, because we do not assign probabilities to state transitions. Our state transition model is similar to Lan et al.’s model lan04unified , with the differences being:

•

Lan et al. used 8 viewpoints and different numbers of poses per viewpoint, whereas we use 8 viewpoints and 8 poses per viewpoint;

•

each of their side views and $45^{\circ}$ views is associated with 4 poses, and each of their front and back views is associated with 1 pose, resulting in a total of 26 states; whereas, our model consists of 64 states.

Our state transition graphs (see Fig. 8) are constructed based on the assumption that the human subject walks forward at a constant speed, does not take sharp turns and does not twist their body while turning. This assumption does not preclude left, right, or backward turns, as long as the turn is not abrupt, as exemplified by the yellow-border windows in Fig. 6. The state transition graphs establish a temporal relationship between the states, and dictate admissible state transitions, on which the classification outputs are based.

The admissible state transitions restrict the next classifier prediction to be one of the states the current state can transition to. Given the current pose and viewpoint, when a new image is available, the associated pose is predicted to be either the current pose (conceivably, the same pose appears in multiple consecutive frames when the video frame rate is high), or the pose in the next sub-step of the gait cycle (see Fig. 4). When the pose changes from the current state to the next state, the viewpoint of the next pose has to be one of the following: the same viewpoint (moving straight), $45^{\circ}$ clockwise from the current viewpoint (turning left), or $45^{\circ}$ anticlockwise from the current viewpoint (turning right).
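The admissible transitions described above can be sketched as a small function that lists the states reachable from a given pose-viewpoint pair. This is a minimal sketch under stated assumptions: pose and viewpoint indices run from 1 to 8 and wrap around cyclically, and a frame that stays in the current pose also keeps the current viewpoint. The names `wrap` and `admissible_states` are hypothetical, and which index direction corresponds to a clockwise turn is not asserted here.

```python
NUM_POSES = 8
NUM_VIEWPOINTS = 8


def wrap(index, count):
    """Map any integer onto the cyclic range 1..count."""
    return (index - 1) % count + 1


def admissible_states(pose, viewpoint):
    """Return the four (pose, viewpoint) pairs allowed for the next frame."""
    next_pose = wrap(pose + 1, NUM_POSES)
    return [
        (pose, viewpoint),                                   # same pose repeated
        (next_pose, viewpoint),                              # next sub-step, moving straight
        (next_pose, wrap(viewpoint + 1, NUM_VIEWPOINTS)),    # next sub-step, 45 degree turn one way
        (next_pose, wrap(viewpoint - 1, NUM_VIEWPOINTS)),    # next sub-step, 45 degree turn the other way
    ]


print(admissible_states(4, 5))
print(admissible_states(8, 1))  # both indices wrap around
```

Restricting each prediction to four candidates is what allows a 4-class classifier to replace the 64-class one after initialization.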

3.4.2 Classification with error-correcting output codes (ECOC)

Considering the pose-viewpoint pairs as multiple classes in the problem domain, we select the ECOC framework dietterich95solving for multiclass classification. The ECOC is considered to be a powerful and popular multiclass classification technique furnkranz02round . Good results have been reported in furnkranz02round ; masulli04effectiveness ; masulli00effectiveness ; ghani00using using ECOC for different multiclass classification problems.

The ECOC framework uses a set of binary classifiers to solve a multiclass classification problem. In a problem domain of $K$ classes, the ECOC framework forms $N$ binary problems (dichotomizers), where $N>K$ . Dietterich et al. dietterich95solving represent the ECOC model of $N$ binary problems using a coding matrix $\mathbf{M}=[m_{k,n}]\in\{-1,+1\}^{K\times N}$ , where each row encodes an $N$ -dimensional binary vector (a codeword), and each column is used to train a binary learner. The coding design is such that $+1$ represents a positive example of a class, whereas $-1$ represents a negative example.

We use the ternary ECOC framework proposed by Allwein et al. allwein00reducing that follows the steps below:

•

The coding matrix is defined as $\mathbf{M}\in\{-1,0,+1\}^{K\times N}$ , where $0$ tells the binary learner to ignore the corresponding class during training.

•

We use the one-versus-one coding design hastie98classification , which constructs $K(K-1)/2$ binary learners.

•

The selected decoding scheme is loss-based decoding allwein00reducing , and the binary learner is an SVM learner.

•

In the classification stage, when an input $x$ is available, the vector of predictions $\boldsymbol{f}(x)=[f_{1}(x)\;\cdots\;f_{N}(x)]$ is formed from the predicted outputs of the $N$ classifiers.

•

The predicted class is the class that minimizes some loss function $L$ (allwein00reducing , Equation (5)):

[TABLE]
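The steps listed above can be illustrated with a small, library-free sketch of a one-versus-one ternary coding matrix and loss-based decoding. Several details are assumptions made for illustration: classes are indexed from 0, the loss is the hinge loss, entries equal to 0 are skipped in the sum (for one-versus-one this does not change the minimizer, since every row has the same number of zeros), and the function names are hypothetical. The binary learner scores are made-up numbers, not outputs of a trained SVM.

```python
from itertools import combinations


def one_vs_one_matrix(num_classes):
    """Ternary coding matrix with one column per unordered pair of classes."""
    pairs = list(combinations(range(num_classes), 2))
    matrix = [[0] * len(pairs) for _ in range(num_classes)]
    for col, (pos, neg) in enumerate(pairs):
        matrix[pos][col] = +1  # positive class for this binary learner
        matrix[neg][col] = -1  # negative class for this binary learner
    return matrix


def hinge(margin):
    """Hinge loss of a signed margin."""
    return max(0.0, 1.0 - margin)


def loss_based_decode(matrix, scores, loss=hinge):
    """Return the class whose codeword has the smallest total loss."""
    totals = []
    for row in matrix:
        # Learners with a 0 entry never saw this class, so they are skipped.
        totals.append(sum(loss(m * s) for m, s in zip(row, scores) if m != 0))
    return min(range(len(matrix)), key=totals.__getitem__)


M = one_vs_one_matrix(4)  # 4 classes give 6 binary learners
# Illustrative scores for learners (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
scores = [0.2, -1.0, 0.3, -1.0, -0.1, 1.0]
print(len(M[0]), loss_based_decode(M, scores))
```

In this example every learner that involves class 2 votes for it with a margin of 1, so its row accumulates zero loss and is selected.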

3.4.3 Classifier combination by dynamic classifier selection (DCS)

Our DCS architecture consists of a single 64-class SVM classifier denoted $C_{64}$ , and 64 4-class SVM classifiers denoted $C_{4}(P_{i},V_{j})$ , $i,j\in\{1,\ldots,8\}$ . The classifier $C_{4}(P_{i},V_{j})$ is associated with pose $P_{i}$ and viewpoint $V_{j}$ , and is trained to recognize the set of four classes:

[TABLE]

where $i,j\in\{1,\ldots,8\}$ and the operators $\boxplus,\boxminus$ are defined as follows:

[TABLE]

For example, the classifier $C_{4}(P_{4},V_{5})$ is trained to recognize the four classes labeled $a$ , $b$ , $c$ and $d$ in Fig. 6.

As depicted in Algorithm 1, our classification process works in two stages: (i) the initialization stage and (ii) the DCS stage. In the initialization stage, the first $q$ video frames are classified using classifier $C_{64}$ . The DCS stage starts with the $(q+1)$ th video frame. In this stage, each frame is classified with a classifier chosen based on the class label predicted by the previous iteration.

To elaborate, consider the example in Fig. 6. Suppose $q=4$ , and the blue- and red-border windows are sample classes predicted by the classifier $C_{64}$ . The red-border window highlights the class predicted for the $q$ th frame. Since this class is $(P_{4},V_{5})$ , the classifier $C_{4}(P_{4},V_{5})$ is chosen to classify the $(q+1)$ th frame. The training subsets for $C_{4}(P_{4},V_{5})$ are highlighted with the yellow-border windows $a$ , $b$ , $c$ and $d$ . A training subset refers to the images corresponding to a single class ko08dynamic .

The most significant difference between the classifier architecture presented here and architectures in the recent literature woods97combination ; kuncheva01decision ; tulyakov08review is that this architecture does not execute all of the classifiers to make a decision. Instead, only the relevant classifier is selected for every next image. The relevance of the classifier is determined by its training subsets, and the training subsets are selected based on the state transition graphs.
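The two-stage process can be sketched as a short loop. This is a minimal sketch rather than a reproduction of Algorithm 1: the classifiers are passed in as plain callables, `c4_bank` is a hypothetical dictionary keyed by the previously predicted (pose, viewpoint) label, and the stub classifiers in the usage lines return fixed labels purely for illustration.

```python
def classify_sequence(frames, c64, c4_bank, q):
    """Classify frames with C64 for the first q frames, then with the selected C4."""
    if q < 1:
        raise ValueError("q must be at least 1 so that a previous label exists")
    labels = []
    for t, frame in enumerate(frames):
        if t < q:
            # Initialization stage: the 64-class classifier sees every class.
            label = c64(frame)
        else:
            # DCS stage: only the classifier tied to the previous label is run.
            label = c4_bank[labels[-1]](frame)
        labels.append(label)
    return labels


# Stub classifiers standing in for trained SVMs (illustrative only)
c64 = lambda frame: (4, 5)
c4_bank = {
    (4, 5): lambda frame: (5, 5),
    (5, 5): lambda frame: (6, 6),
}
result = classify_sequence(["f0", "f1", "f2", "f3"], c64, c4_bank, q=2)
print(result)
```

Only one classifier is evaluated per frame in either stage, which is the property that distinguishes this design from ensemble combination.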

In Algorithm 1, the most resource-demanding component is ECOC SVM classification. The time and space complexities of this component are $\mathcal{O}(n_{\text{sv}})$ and $\mathcal{O}(n_{\text{sv}}m)$ respectively, where $n_{\text{sv}}$ is the number of support vectors, and $m$ is the number of features.

When using HOG features, the SVM model was trained using a one-versus-one coding design, which involves $K(K-1)/2$ support vectors. Final cropped silhouettes were resized to $96\times 160$ pixels. The HOG cell size was selected to be $4\times 4$ resulting in a 32292-dimensional feature vector. In this case, for the 4-class dynamic classifier, $n_{\text{sv}}$ is 6 and $m$ is 32292.
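The figures quoted above can be checked with a few lines of arithmetic. The sketch assumes $2\times 2$-cell blocks, 9 orientation bins and a one-cell block overlap; these HOG parameters are not stated in the text and are assumed here because they are common defaults and reproduce the reported dimension. The function names are hypothetical.

```python
def hog_length(width, height, cell=4, block=2, bins=9, overlap=1):
    """Length of a HOG descriptor for an image of the given size in pixels."""
    cells_x = width // cell
    cells_y = height // cell
    stride = block - overlap                     # block step, measured in cells
    blocks_x = (cells_x - block) // stride + 1
    blocks_y = (cells_y - block) // stride + 1
    return blocks_x * blocks_y * block * block * bins


def one_vs_one_learners(num_classes):
    """Number of binary learners in a one-versus-one coding design."""
    return num_classes * (num_classes - 1) // 2


print(hog_length(96, 160))        # silhouettes resized to 96 x 160 pixels
print(one_vs_one_learners(4))     # 4-class dynamic classifier
print(one_vs_one_learners(64))    # 64-class classifier
```

With these assumed parameters, a $96\times 160$ silhouette yields $23\times 39$ blocks of 36 values each, i.e., 32292 features, and the one-versus-one counts are 6 for four classes and 2016 for 64 classes.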

When using CNN features, the SVM model was trained using a one-versus-all coding design, which involves $K$ support vectors. In this case, for the 4-class dynamic classifier, $n_{\text{sv}}$ is 4 and $m$ is 4096.

3.5 Trajectory estimation

Trajectory estimation refers to estimating the shape of the path traversed by the human subject. It is performed using the estimated viewpoints as inputs, and thus requires the classifier to have minimal viewpoint estimation errors. The estimated trajectory is inevitably a polygonal approximation of the actual shape of the path. Various interpolation techniques could be applied to smooth the estimated trajectory and thereby improve the approximation.

As shown in Algorithm 2, each estimated viewpoint serves as an estimation of the walker’s orientation. For each estimated orientation, a 3-D pose is reconstructed from the estimated pose. The algorithm can be thought of as primarily handling two cases:

•

Whenever an estimated pose is the same as the previous, the reason is assumed to be the camera’s high frame rate and/or the subject’s slow movement, and thus the subject is assumed to remain at the same location.

•

Whenever an estimated pose differs from the previous, the subject is assumed to have moved a fixed distance from the location of the previous pose. When the orientation changes by $x$ degrees, the next pose is positioned at a fixed distance from the location of the previous pose at an angle of $\pm x$ degrees ( $+$ ve for right turns, $-$ ve for left turns). Due to the way the viewpoint angle is discretized, as explained in Section 3.3, $x$ is a multiple of $360^{\circ}/8=45^{\circ}$ .
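The two cases above can be sketched as a short path-integration routine. This is a minimal sketch rather than a reproduction of Algorithm 2: it assumes that viewpoint $V_{j}$ maps to an absolute heading of $(j-1)\times 45^{\circ}$ in the plotting plane, which is equivalent to accumulating relative turns in multiples of $45^{\circ}$, and that each pose change advances the subject by one unitless step. The mapping, the sign convention and the function name are hypothetical.

```python
import math


def estimate_path(states, step=1.0, num_viewpoints=8):
    """Turn a sequence of (pose, viewpoint) estimates into 2-D positions."""
    x, y = 0.0, 0.0
    path = []
    previous_pose = None
    for pose, viewpoint in states:
        if previous_pose is not None and pose != previous_pose:
            # Pose changed: advance a fixed distance along the estimated heading.
            heading = math.radians((viewpoint - 1) * 360.0 / num_viewpoints)
            x += step * math.cos(heading)
            y += step * math.sin(heading)
        # Pose unchanged (or first frame): the subject stays where it was.
        path.append((x, y))
        previous_pose = pose
    return path


path = estimate_path([(1, 1), (1, 1), (2, 1), (3, 3)])
for point in path:
    print(round(point[0], 3), round(point[1], 3))
```

Because each step only updates a running position, the per-frame cost is constant, and the result is a polygon whose edges are aligned with the eight viewpoint directions.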

The trajectory estimation algorithm uses the dynamic classifier’s ability to resolve the left-right ambiguities of images. Without dynamic classifier selection, the classifier $C_{64}$ can make errors between the front and back views (rows 3 and 7 in Fig. 6), as a result of self-occlusions, or loss of joint angle and limb length information after binary conversion. The time and space complexities of Algorithm 2 are both $\mathcal{O}(1)$ .

4 Experimental results

We conducted three groups of experiments, which we discuss in the subsequent subsections. Across the experiments, the scenery and walking patterns vary significantly. These experiments include view variations between front, diagonal, side and back views.

Pose/viewpoint estimation errors are expressed in terms of (i) classification errors and (ii) viewpoint and pose transitional errors (TE). We define transitional errors as follows:

Definition 1

If the classifier prediction is different from the ground truth but is confined between the adjacent viewpoints (or poses), such predictions are considered as viewpoint transition errors (or pose transition errors).

For example, when the ground-truth pose transitions from $P_{1}$ to $P_{2}$ , given the similarity of the poses, it is likely for the classifier to still identify ground truth $P_{2}$ as $P_{1}$ for a few frames. Likewise, the classifier is likely to misclassify $P_{1}$ as $P_{2}$ before the ground-truth transition occurs. In other words, transitional errors can delay or advance true pose/viewpoint estimation. In this example, the performance measures without transitional errors are calculated as follows: when the ground truth is $P_{2}$ , the predicted adjacent poses ( $P_{1}$ and $P_{3}$ ) are considered to be true predictions, and all the other estimations are considered to be incorrect. Hereafter, we use the abbreviation TE in the equations, tables and figures to refer to transitional errors.

We now define pose/viewpoint estimation errors formally:

Definition 2

The percent pose estimation error, including transitional errors, is

[TABLE]

The percent pose estimation error, excluding transitional errors, is

[TABLE]

The percent viewpoint estimation error, including transitional errors, is

[TABLE]

The percent viewpoint estimation error, excluding transitional errors, is

[TABLE]
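The distinction between errors with and without transitional errors can be sketched in code. The sketch assumes that the percent error is the number of wrongly labeled frames divided by the total number of frames, times 100, and that adjacency is cyclic over the eight indices (so that index 8 is adjacent to index 1). Both are assumptions made for illustration, and the function names are hypothetical.

```python
def cyclic_distance(a, b, n=8):
    """Shortest distance between two indices on a cycle of n labels."""
    d = abs(a - b) % n
    return min(d, n - d)


def percent_error(truth, predicted, exclude_te=False, n=8):
    """Percent of frames counted as wrong, optionally forgiving adjacent labels."""
    tolerance = 1 if exclude_te else 0
    wrong = sum(
        1 for t, p in zip(truth, predicted) if cyclic_distance(t, p, n) > tolerance
    )
    return 100.0 * wrong / len(truth)


truth = [1, 1, 2, 2, 3, 8, 4, 4]
predicted = [1, 2, 2, 1, 5, 1, 4, 4]
print(percent_error(truth, predicted))                   # including TE
print(percent_error(truth, predicted, exclude_te=True))  # excluding TE
```

In this made-up example four of eight frames are mislabeled, but three of those are confined to an adjacent label, so the error falls from 50% to 12.5% once transitional errors are excluded. The same function applies to pose indices and to viewpoint indices.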

For trajectory estimation, each estimated trajectory is plotted on a 2-D plane with unitless axes, and the starting location mapped to the origin. Along a trajectory, the estimated poses are reconstructed using Rogez et al.’s 3-D, 13-jointed skeletal models rogez08spatio . The proximity of the estimated trajectories to the actual trajectories was assessed.

4.1 Experiments with publicly available datasets

In this group of experiments, we used two publicly available human motion datasets: (i) CMU Motion of Body (MoBo) gross01cmu and (ii) HumanEva2 sigal09humaneva . Both CMU MoBo and HumanEva2 are recorded indoors by a ground-based camera with a static background. For these datasets, background subtraction was used for foreground/background segmentation, and the experiments were conducted only with HOG features.

From the CMU MoBo dataset, the image sequence for subject 4071 was selected, which shows the subject walking on a treadmill at a constant speed. Fig. 9(a) shows the original images, the segmented silhouettes and the estimated poses. The reconstructed trajectory in Fig. 9(b) was a straight path, but the orientation was skewed by $45^{\circ}$ due to the $45^{\circ}$ error in the first estimated viewpoint. Table 2 shows that the dynamic classifier had significantly lower values for $e_{\text{pose}}$ and $e_{\text{viewpoint}}$ than $C_{64}$ .

From the HumanEva2 dataset, the image sequence for subject S2 combo C2 was selected, which shows the subject rounding a left turn. Figure 10(b) shows the 3-D reconstruction of the estimated poses and trajectory. The viewpoint of the third skeleton is incorrectly estimated as its adjacent viewpoint ( $+45^{\circ}$ error). However, when moving from the third skeleton to the fourth one, the viewpoint was corrected. Table 2 shows that the dynamic classifier had significantly lower values for $e_{\text{pose}}$ and $e_{\text{viewpoint}}$ than $C_{64}$ .

4.2 Experiments with video captured from a UAV

In this group of experiments, three video datasets representing three different scenarios were captured from a rotorcraft UAV (specifically, a 3DR Solo) in a slow and low-altitude flight mode. For recording videos, we used a GoPro Hero 4 Black camera with an anti-fisheye replacement lens (5.4mm, 10MP, IR CUT) and a 3-axis Solo gimbal. The images were sampled at a rate of 30fps. In order to ease the segmentation process, the videos were recorded with an uncluttered background and with the human subject wearing dark clothes. The UAV-captured videos were segmented as described in Section 3.2. These experiments were conducted using both HOG and CNN features.

Certain assumptions were made to ease the coordinate transformation between the camera and the human subject:

•

The human subject stands upright on flat ground.

•

The camera roll angle is zero.

•

The roll, pitch and yaw angles of the UAV are zero during slow flight. Thus, the flight dynamics of the UAV have negligible effects on the camera elevation angle.

•

The human subject is approximately centered in the video.

These are valid assumptions in the case of an aerial platform designed to track human motion in a large field of view (see Fig. 11). The camera elevation angle and height were directly recorded from the UAV control interface. The UAV was operated at a known ground distance (camera distance) from the human subject.

4.2.1 Scenario 1

As depicted in Fig. 11(a), a human subject was filmed from his left walking along a straight line, by a UAV moving in synchrony. To achieve synchrony, the UAV was manually operated to maintain roughly the same speed as the subject, at a constant ground distance of 5m from the subject. The camera was horizontal and 2m above ground. Here are the findings:

•

Table 2 shows that the dynamic classifier had lower values for $e_{\text{pose}}$ and $e_{\text{viewpoint}}$ than $C_{64}$ .

•

When comparing HOG and CNN in Table 3 and Figs. 15–16, significantly lower values for $e_{\text{viewpoint}}$ and slightly higher values for $e_{\text{pose}}$ can be seen for HOG. Zero to very low errors were observed once the transitional errors were removed.

•

Figure 12 shows a successful 3-D reconstruction of the estimated poses and trajectory for a segment of the path using HOG and separately CNN features.

•

The confusion matrix in Fig. 17(a) shows viewpoint confusion was rare and confined to a neighbor of the true viewpoint.

4.2.2 Scenario 2

As depicted in Fig. 11(b), a human subject was filmed walking on a marked circle by a UAV pointing at the center of the circle. The camera was 30m from the center of the circle and 10m above ground. Here are the findings:

•

Table 2 shows that the dynamic classifier had a significantly lower $e_{\text{pose}}$ than $C_{64}$ and a slightly higher $e_{\text{viewpoint}}$ than $C_{64}$ .

•

Table 3 shows that CNN gives higher estimation errors than HOG does. The errors for CNN dropped significantly upon removal of the transitional errors.

•

Figure 13(b) shows the HOG-based estimated trajectory is approximately circular, as is the true trajectory. It also shows a 3-D reconstruction of the estimated poses and trajectory, with a small number of visibly wrong viewpoints. Figure 13(c) shows the trajectory estimated using CNN features failed to follow the ground truth in the second half of the circular path.

•

The confusion matrix in Fig. 17(b) shows viewpoint confusions are confined to neighbors of the true viewpoints. The highest confusion rates are associated with the diagonal viewpoints $V_{2}$ , $V_{4}$ , $V_{6}$ and $V_{8}$ , whereas high classification accuracy was recorded for $V_{1}$ , $V_{3}$ , $V_{5}$ and $V_{7}$ . The reason is that viewpoints $V_{1}$ , $V_{3}$ , $V_{5}$ and $V_{7}$ correspond to the front, back and side views. These four viewpoints suffer minimal self-occlusions in the silhouettes compared to the others, and hence provide better image details.

4.2.3 Scenario 3

As depicted in Fig. 11(c), a human subject was filmed walking on a marked 8-shaped path by a UAV pointing at the center of the path, which was created by joining two circles of radius 5m. The walk starts and ends at the same point in the marked path. The camera distance from the middle of the path, where the two circles meet, was 35m and the camera height was 10m. Perspective distortion of the video frames was negligible due to the small elevation angle, and so, the video frames were segmented without perspective correction. Here are the findings:

•

Table 2 shows that the dynamic classifier has a significantly lower $e_{\text{pose}}$ than $C_{64}$ and a comparable $e_{\text{viewpoint}}$ to $C_{64}$ .

•

Table 3 shows that CNN gave higher estimation errors than HOG did. These results are consistent with those for Scenarios 1–2.

•

Figure 14(b) shows a 3-D reconstruction of the estimated poses and trajectory using HOG features for a segment of the path, which contains some visibly wrong viewpoints. However, Figs. 14(c) and 14(d) show the estimated trajectory approximates a figure 8 well. In Fig. 14(e), results for CNN approximately reflect the shape of the path, but both types of errors, $e_{\text{pose}}$ and $e_{\text{viewpoint}}$ , are significantly higher than those for HOG.

•

The confusion matrix in Fig. 17(c) shows most of the viewpoint confusion was confined to neighbors of the true viewpoints. The worst confusion rate was associated with viewpoint $V_{2}$ , whereas the highest classification accuracy was recorded for $V_{3}$ . Generally, self-occlusions and loss of limb details are comparatively mild in $V_{1}$ , $V_{3}$ , $V_{5}$ and $V_{7}$ , and hence they had the lowest confusion rates. Nevertheless, confusion rates depended largely on individual body dynamics.

4.3 Experiments with perspective distortion

This group of experiments was conducted using HOG features to analyze the effect of perspective distortion in detail. These experiments were extensions of the Scenario 2 experiments discussed in the previous subsection. In addition to 10m, the UAV was flown at heights of 20m, 30m and 40m (see Fig. 18). The lowest height of 10m caused negligible perspective distortion, but at $h=40$ m ( $\phi=53.1^{\circ}$ ), the video suffered from severe perspective distortion. The main observations are:

•

In terms of pose estimation accuracy, perspective correction helped the dynamic classifier, but not $C_{64}$ , which was significantly worse than the dynamic classifier.

•

In terms of viewpoint estimation accuracy, perspective correction helped the dynamic classifier much more than it helped $C_{64}$ .

•

The advantage of perspective correction was more pronounced on more distorted videos.

•

The advantage of perspective correction was more pronounced for the dynamic classifier than $C_{64}$ .

Table 4 once again confirms the advantage of the dynamic classifier over $C_{64}$ , which does not take into account the ordinal relationship between poses. The advantage was more pronounced for more distorted videos, provided perspective correction was applied.
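The elevation angles quoted in this subsection follow from simple trigonometry, which can be sketched as below. The sketch assumes the 30m ground distance of Scenario 2 carries over to all four heights, which is consistent with the two angles reported in the text ( $18.4^{\circ}$ at 10m and $53.1^{\circ}$ at 40m). The function name is hypothetical.

```python
import math


def elevation_deg(height_m, ground_distance_m):
    """Camera elevation angle, in degrees, from camera height and ground distance."""
    return math.degrees(math.atan2(height_m, ground_distance_m))


GROUND_DISTANCE_M = 30.0  # assumed to match Scenario 2
for height in (10.0, 20.0, 30.0, 40.0):
    print(height, round(elevation_deg(height, GROUND_DISTANCE_M), 1))
```

Under this assumption the intermediate heights of 20m and 30m correspond to roughly $33.7^{\circ}$ and $45^{\circ}$ respectively.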

5 Discussion

Our discussion pertains to the dynamic classifier, HOG features, CNN features, perspective correction, limitations of the approach and considerations for practical implementation.

Dynamic classifier. A drawback of the dynamic classifier is its dependence on accurate initial estimation. The solution given here is to use a multiclass classifier for the initialization, namely $C_{64}$ , that recognizes all pose-viewpoint pairs. However, like all classifiers, $C_{64}$ sometimes makes mistakes, throwing the $C_{4}(\cdot,\cdot)$ classifiers off-course. A potential improvement is to re-initialize the dynamic classifier (see Algorithm 1) periodically.

HOG features. HOG features are traditionally considered to be handcrafted features, and in some domains, they have been replaced by CNN features. HOG features do not capture additional information compared to CNN features; the two are significantly different kinds of features. HOG features are based on the weighted gradients in a HOG cell, which represent the orientation of the edge lines. HOG features are low-level features, whereas CNN features are high-level features with the ability to adapt to the task at hand during training.

However, given the robustness achieved in these multiclass classification experiments, HOG features outperform CNN features in pose and trajectory estimation. In our experiments, we extracted HOG features from a silhouette and CNN features from a color image. We attribute the overall robustness of HOG features to their dependence on silhouettes, and hence largely on edges. However, segmentation of aerial images (for HOG) is very challenging due to the varying resolution and background, and can benefit from the latest advances in semantic segmentation.

CNN features. The accuracy of CNN-based feature extraction depends on many factors such as the neural network model, the nature of the original training dataset and the complexity of the test image. The following are possible reasons why we achieved a lower accuracy for CNN compared to HOG:

•

In the HOG approach, all the images are silhouettes and the features are formed from the edge details. In contrast, the CNN is sensitive to high-level features such as texture, background, face and gender, in addition to edges, and this can cause overfitting. An overfitting model learns the noise and random details in training data in addition to the targeted details. A similar observation of HOG features outperforming CNN features in classification due to overfitting by the latter has been reported in sentas18performance . Techniques such as dropout srivastava14dropout and DeCov regularizer cogswell15reducing have been proposed to reduce overfitting and increase generalization. However, we did not apply these techniques, and the scope of our finding is limited to standard transfer learning.

•

Transfer learning is a successful approach for many computer vision-related problems chaturvedi15deep , but it has some constraints from the base network when copying the first $n$ layers of the base network to the first $n$ layers of the target network (left frozen feature layers). As a result, the feature layers do not change during the training of the new task. A possible alternative is to use deep transfer learning (DTL) which offers more flexibility when extracting high-level features yosinki14how . DTL can perform layer-by-layer feature transference to solve a target problem in either a supervised or unsupervised setting kandaswamy17multi .

•

In a CNN, the features detected by earlier layers include low-level image details such as edges and colors. However, in the later layers the features progressively become more specific to the object categories of the original dataset.

•

We used the original weights of AlexNet because our new dataset is very small compared to the original pre-trained dataset. This standard practice of not changing weights for a small dataset helps to reduce overfitting yosinki14how .

•

In many computer vision applications, CNN features outperform low-level features when the neural network has been trained with a sufficiently large, application-specific dataset jain14modeep . On the other hand, HOG does not need such a large dataset to achieve high accuracy.

Considering the factors above, it is not surprising CNN was outperformed by HOG in our experiments.

Left-right ambiguity. To illustrate how our algorithm handles left-right ambiguity (i.e., confusion between front and back views), we present an example in Fig. 19. Here, we consider a subject turning left from a side view to the back view. Once the subject turns $90^{\circ}$ to the left, his shape should be identifiable as the back view rather than the front view. As demonstrated in the figure, the back view cannot be confused with the front view, although they are similar in shape, because the 4 classes are confined to three adjacent viewpoints related to the left turn.

Perspective correction. The results presented in Section 4.3 confirm the intuition that perspective correction is imperative for severely perspective-distorted videos. Our solution has problems with purely frontal or rear views, because frontal and rear silhouettes do not provide sufficient details for differentiating pose. A potential solution is provided by the mobility of the aerial platform itself. The UAV can be programmed to seek a good elevation angle and azimuth angle, before it starts analyzing the human subject’s action. This will require control algorithms and machine intelligence that go beyond the scope of this work.

Limitations of the approach. In this study, we tried to validate the suitability of dynamic classifiers for perspective-distorted image sequences. We limited our work to gait estimation. However, the dynamic classifiers can be extended to estimate complex human poses. Another limitation is that our system cannot handle complex gait sequences such as sharp turns, twists and walking backwards. These are possible extensions to the current system, and can be addressed in future work. We used the standard transfer learning approach with AlexNet. However, an application-specific deep transfer learning framework can offer more flexibility to fine-tune the neural network model. Finally, our training dataset is relatively small. The accuracy and the robustness of the classifiers can be further improved by adding more diverse images to the training data.

Practical implementation. The original motivation for this work was to make UAVs intelligent enough to recognize human activities, so the question of whether the proposed solution can run on an embedded platform is relevant. The most computationally intensive components of the proposed solution include homography, human detection, HOG feature extraction and SVM classification. The most computationally intensive of these is SVM classification, but even this can be implemented on resource-constrained devices anguita07hardware . Further efficiency is ensured by the fact that only a single 4-class classifier needs to run after initialization (recall Algorithm 1). In conclusion, all the algorithmic components are practical for an embedded platform. Note that 3-D reconstruction of the estimated poses and trajectory is meant for visualizations, not embedded applications.

6 Conclusion and future work

As a first step toward solving the problem of estimating human pose and trajectory in monocular videos from an aerial platform, the paper presents a solution that consists of perspective correction by homography, HOG/CNN feature extraction and dynamic classifier selection. The dynamic classifier is the defining feature of our solution, consisting of a 64-class classifier (namely $C_{64}$ ) and 64 4-class classifiers. The dynamic classifier works in conjunction with (i) a state transition model for the pose and viewpoint; and (ii) an SVM-based ECOC framework, which reduces multiclass classification to a set of efficiently solvable binary classification sub-problems (see Sects. 3.3–3.4). Trajectory estimation is for the estimation of the shape of the path traversed by the human subject, and is dependent on viewpoint estimation (see Section 3.5).

Experiments have been conducted with the CMU MoBo and HumanEva2 datasets and our own UAV-captured datasets, using $e_{\text{pose}}$ and $e_{\text{viewpoint}}$ as defined in Equations (6)–(9) as performance measures. The performance measures were calculated using two alternative feature sets (HOG and CNN), and the accuracies were compared before and after removing the transitional errors. Results show that

• The dynamic classifier outperforms $C_{64}$.

• Classification errors in the confusion matrix are evidently confined to neighbors of the true viewpoints. This property of the dynamic classifier enables fast recovery from incorrect estimations.

• HOG features, compared to CNN features, facilitate more accurate estimation.

• The more perspective-distorted a video is, the more necessary perspective correction is for reducing the estimation errors of the dynamic classifier.

• The proposed solution works well with both indoor and outdoor videos, and both ground videos and perspectively distorted aerial videos.

• The estimated trajectories approximate the actual trajectories well.

The solution proposed in this article is limited to estimating walking gaits. Our immediate plan is to extend the current work to the recognition of gestures performed during either walking or standing. Replacing the current HOG descriptors with Yang et al.’s flexible mixtures-of-parts model yang11articulated should provide a promising start.

Acknowledgements.

This project was partly supported by Project Tyche, the Trusted Autonomy Initiative of the Defence Science and Technology Group (grant number myIP6780).

**Compliance with Ethical Standards**

Conflict of Interest. The authors declare that they have no conflict of interest.

Informed Consent. The data collection was conducted under the approval of University of South Australia’s Human Research Ethics Committee (protocol no. 0000035185).

Bibliography


(1) Orrite C, Herrero JE. Shape matching of partially occluded curves invariant under projective transformation. Computer Vision and Image Understanding. 2004;93(1):34–64.
(2) Richter-Gebert J. Perspectives on projective geometry: a guided tour through real and complex geometry. Springer Science & Business Media; 2011.
(3) Rogez G, Orrite C, Guerrero JJ, Torr PHS. Exploiting projective geometry for view-invariant monocular human motion analysis in man-made environments. Computer Vision and Image Understanding. 2014;120:126–140.
(4) Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). vol. 1; 2005. p. 886–893.
(5) LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
(6) Woods K, Kegelmeyer WP, Bowyer K. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997 Apr;19(4):405–410.
(7) Ko AHR, Sabourin R, Britto AS Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition. 2008;41(5):1718–1731.
(8) Agarwal A, Triggs B. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006 Jan;28(1):44–58.