View-Invariant Recognition of Action Style Self-Dissimilarity

Yuping Shen; Hassan Foroosh

arXiv:1705.07609·cs.CV·May 23, 2017

TL;DR

This paper introduces view-invariant self-dissimilarity matrices for classifying action styles, demonstrating their effectiveness in gender recognition across different viewpoints using PCA and FDA frameworks.

Contribution

It presents a novel approach to intra-class dissimilarity for action style classification that is invariant to view and camera parameters.

Findings

01

Effective discrimination of action styles across viewpoints.

02

High accuracy in gender recognition from video data.

03

Invariant measures outperform non-invariant methods.

Abstract

Self-similarity was recently introduced as a measure of inter-class congruence for classification of actions. Herein, we investigate the dual problem of intra-class dissimilarity for classification of action styles. We introduce self-dissimilarity matrices that discriminate between the same actions performed by different subjects regardless of viewing direction and camera parameters. We investigate two frameworks using these invariant style dissimilarity measures based on Principal Component Analysis (PCA) and Fisher Discriminant Analysis (FDA). Extensive experiments performed on the IXMAS dataset indicate remarkably good discriminant characteristics for the proposed invariant measures for gender recognition from video data.

Tables

Table 1. Male and female classification rates with different $d^{\prime}$ for walking action.

$d^{'}$	1	2	3	4	5	6	7	8	9	10	11
Male classification rate	.524	.603	.635	.714	.762	.810	.841	.873	.873	.873	.873
Female classification rate	.476	.571	.603	.714	.746	.762	.825	.825	.873	.889	.889

Table 2. Male and female classification rates with different $d^{\prime}$ for kicking action.

$d^{'}$	1	2	3	4	5	6	7	8	9	10	11	12	13
Male classification rate	.540	.619	.635	.730	.778	.778	.841	.857	.873	.873	.873	.873	.873
Female classification rate	.492	.571	.603	.683	.762	.810	.825	.825	.873	.873	.873	.889	.889

Table 3. Male and female classification rates with different $d^{\prime}$ for throwing action.

$d^{'}$	1	2	3	4	5	6	7	8	9	10	11	12	13
Male classification rate	0.520	0.611	0.635	0.660	0.682	0.703	0.721	0.742	0.751	0.783	0.803	0.823	0.843
Female classification rate	0.498	0.561	0.600	0.681	0.721	0.743	0.754	0.782	0.795	0.832	0.842	0.854	0.8810

Table 4. Male and female classification rates with different $d^{\prime}$ for sit down action.

$d^{'}$	1	2	3	4	5	6	7	8	9	10	11
Male classification rate	0.525	0.600	0.631	0.710	0.751	0.812	0.841	0.855	0.859	0.861	0.863
Female classification rate	0.479	0.601	0.633	0.714	0.750	0.800	0.843	0.854	0.860	0.864	0.869

Table 5. Male and female classification rates with different $d^{\prime}$ for stand up action.

$d^{'}$	1	2	3	4	5	6	7	8	9	10	11
Male classification rate	0.521	0.590	0.630	0.710	0.742	0.810	0.841	0.850	0.852	0.855	0.860
Female classification rate	0.470	0.600	0.629	0.710	0.752	0.801	0.840	0.845	0.850	0.857	0.8600

Equations

$$M(P, Q)$$

$$\mathcal{T}_{j}^{*}=\arg\min_{j} M(\mathcal{T}_{j},\mathcal{R}_{j^{\prime}})$$

$$\mathbf{P}=[p_{ij}],\quad\text{where } p_{ij}=M(\mathcal{T}_{j},\mathcal{R}_{j^{\prime}})$$

$$\mathbf{T}=[t_{ij}],\quad\text{where } t_{ij}=\frac{\sigma_{i1}}{\sigma_{i2}}-\mu_{j}$$

$$\mathbf{x}=[\mathbf{T}_{s},\mathbf{P}_{s}].$$

$$J_{d}=\sum_{k=1}^{N}\left\|\left(\mathbf{m}+\sum_{i=1}^{d}a_{i}^{k}\mathbf{e}_{i}\right)-\mathbf{x}_{k}\right\|^{2}$$

$$\mathbf{m}=\frac{1}{n}\sum_{k=1}^{N}\mathbf{x}_{k},$$

$$\mathbf{a}^{k}=\begin{bmatrix}a^{k}_{1}&a^{k}_{2}&\dots&a^{k}_{d^{\prime}}\end{bmatrix}^{T}$$

$$\mathbf{S}=\sum_{k=1}^{N}(\mathbf{x}_{k}-\mathbf{m})(\mathbf{x}_{k}-\mathbf{m})^{T},$$

$$\tilde{\mathbf{x}}_{k}=\mathbf{A}^{T}(\mathbf{x}_{k}-\mathbf{m}),$$

$$\mathbf{A}=\begin{bmatrix}\mathbf{e}_{1}&\mathbf{e}_{2}&\dots&\mathbf{e}_{d^{\prime}}\end{bmatrix}.$$

$$y=\mathbf{w}^{T}\mathbf{x},$$

$$J(\mathbf{w})=\frac{\mathbf{w}^{T}\mathbf{S}_{B}\mathbf{w}}{\mathbf{w}^{T}\mathbf{S}_{W}\mathbf{w}},$$

$$\mathbf{S}_{W}=\sum_{\mathbf{x}\in w_{0}}(\mathbf{x}-\mathbf{m}_{0})(\mathbf{x}-\mathbf{m}_{0})^{T}+\sum_{\mathbf{x}\in w_{1}}(\mathbf{x}-\mathbf{m}_{1})(\mathbf{x}-\mathbf{m}_{1})^{T},$$

$$\mathbf{S}_{B}=(\mathbf{m}_{1}-\mathbf{m}_{2})(\mathbf{m}_{1}-\mathbf{m}_{2})^{T},$$

$$\mathbf{m}_{i}=\frac{1}{n}\sum_{\mathbf{x}_{i}\in w_{i}}\mathbf{x}_{i}.$$

$$\mathbf{w}=\mathbf{S}_{W}^{-1}(\mathbf{m}_{1}-\mathbf{m}_{2}).$$

$$c=\mathbf{w}\left(\frac{\mathbf{m}_{1}+\mathbf{m}_{2}}{2}\right).$$

Taxonomy

Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Anomaly Detection Techniques and Applications

Full text

Yuping Shen and Hassan Foroosh

Yuping Shen was with the Department of Computer Science, University of Central Florida, Orlando, FL, 32816 USA at the time this project was conducted. Hassan Foroosh is with the Department of Computer Science, University of Central Florida, Orlando, FL, 32816 USA.

Index Terms:

Invariants, Action Style Recognition, Self-Dissimilarity

I Introduction

Human action recognition from video data has a wide range of applications in areas such as surveillance and image retrieval [69, 133, 75, 132, 12, 135, 67, 66, 68, 13, 9, 130], image annotation [140, 139, 141, 138, 137], video post-production and editing [25, 94, 4, 5, 58, 19], and self-localization [71, 70, 76, 77, 78], to name a few.

The literature on human action recognition from video data includes both monocular and multiple view methods [127, 14, 134, 123, 131, 15, 122, 125, 11, 37, 124, 126, 8]. Often, multiple view methods are designed to tackle viewpoint invariant recognition [127, 14, 123, 15, 125, 11, 124, 8], although such methods may require calibration across views [72, 64, 73, 65, 80, 81, 82, 7, 79, 24], image registration [57, 54, 26, 27, 6, 16, 17, 115, 55, 114, 23, 22, 21, 56, 52, 106, 20, 18, 53, 51], or tracking across views [128, 91, 93, 118, 92]. There are also methods that rely on human-object interaction [99, 152, 151], which often require identifying image contents other than humans [88, 148, 41, 40, 90, 95, 2, 1, 50, 3, 39, 44]. Other preprocessing steps that may be needed include image restoration [109, 117, 112, 32, 113, 89, 107, 103, 104, 108, 105, 31, 119, 33, 60, 110, 111, 116], or scene modeling [74, 34, 62, 10].

In this paper, we look at the very specific problem of determining stylistic differences in an action performed by different groups of people, e.g. stylistic differences due to age or gender. Human action style analysis is an important area in interpreting activities, motivated by applications such as surveillance systems and ergonomic evaluation. At the very limit, of course, an individual could be considered as a category of their own, in which case this problem reduces to determining the identity of the individual, e.g. in gait recognition.

The problem of action style analysis is related to the action recognition problem: action recognition systems aim at finding the features that distinguish different actions, while in action style analysis, stylistic features, e.g., stride parameters of walking gaits, are extracted from instances of the same action to reflect the style variations of individuals, or groups of individuals.

II Related Work

It has been shown that humans can recognize actions from limited types of input such as point lights and low quality video [35, 61]. Even with such limited information, we are capable of differentiating stylistic action differences, such as the gender [28, 98] and age [46] of a walking person. There have been several recent works on the study of action style variation in computer vision. Wilson and Bobick [150] use a Parameterized-HMM to model spatial pointing gestures by adding global variation parameters to the output probabilities of the HMM states. In [142] Tenenbaum et al. use a bilinear model to separate perceptual content and style parameters. Davis [46] proposed an approach to determine the age of people based on variations in relative stride length and stride frequency over various walking speeds. In [45] Davis and Taylor use regularities in walking to distinguish typical from atypical gaits. Davis et al. [47] presented a three-mode (body pose, time, and style) expressive-feature model for representing and recognizing performance styles of human actions. The application of style analysis in computer animation for generating new animation styles of human motion has also been reported in [145, 38, 146, 147, 48, 42].

Gait recognition is another problem related to action style analysis, with important implications for domains such as surveillance and medical diagnosis. It is based on the widely held belief that humans can distinguish between the gait patterns of different individuals by examining gait properties such as stride length, bounce, rhythm, and speed. An early report on the ability to recognize people from gait was presented by Beardsworth and Buckner [29]. They showed that the ability to recognize oneself from point-light features is greater than that of recognizing others, a surprising result. The ability of people to identify others using gait information alone has also been supported by the studies of Stevenage et al. [129] and Schollhorn et al. [101]. Existing approaches in gait recognition can be broadly grouped into two categories: model based approaches [86, 144, 96] and model free approaches [30, 59, 87, 84, 136, 100, 143]. The model based methods assume an a priori parameterized model and use its parameters as identity features of a person. They try to fit the model to the 2D image sequences, and when the model and images are matched, the feature correspondence is automatically achieved. One example of a model based approach is proposed in [86], where a model consisting of several ellipses is used, and parameters of these ellipses such as their centroid and eccentricity are used as human identity features. Troje [144] proposed to determine the gender of walkers from trajectories of projection coefficients of body pose. Murray [96] models the hip rotation angle as a simple pendulum, and approximates its motion by simple harmonic motion. A similar work is also reported by Cunado et al. [36], who use an articulated pendulum-like motion model and extract a gait signature by fitting the motion of the thighs to it. The model free approaches can be further divided into deterministic and stochastic methods. Examples of deterministic methods include the work of Benabdelkader et al. [30], where the image self-similarity plot is used as a gait feature. Huang et al. [59] use optical flow to derive a motion image sequence for a walk cycle, and apply principal component analysis to silhouettes to derive so-called eigengaits. Little and Boyd [87] extract frequency and phase features from moments of the motion image, and use template matching to identify people by their gait. Kale et al. [84] extract gait features from width vectors, velocity profiles, etc., and use the sequence of feature vectors to represent gait. They then use a dynamic time-warping approach to match two gait sequences to identify people. Other deterministic approaches include [84, 97, 43]. Examples of stochastic methods include the work of Sundaresan [136], where an HMM [100] is used to represent the gait of each individual. Another approach based on HMM is also reported in [85]. Tolliver et al. [143] extract shape from a cluster of similar poses obtained from a spectral partitioning framework, and use it to identify different individuals.

As in other problems of human motion analysis, view invariance is an important requirement in gait recognition. However, only a few papers in the literature take view invariance into account. Shakhnarovich et al. [102] propose an approach that integrates face and gait recognition from multiple views. As in other works that use multiple view data, their approach is limited by the number of views being used and is not “truly” view-invariant. Kale et al. [83] propose a view-invariant method for the case when the person is far from the camera. They synthesize a side view from any other arbitrary view using a single camera, and apply methods based on the side view of walking to solve the gait recognition problem.

III Action Style Analysis Using Homographies

In a recent work, Shen et al. [120, 121] suggested that the non-rigid motion of an articulated body can be decomposed into rigid motions of planes given by triplets of points corresponding to the joints of the articulated body. This essentially implies that an articulated non-rigid motion can be described by a set of homographies. As a result the non-linear problem of modeling the motion of an articulated body can be reduced to a set of linear problems in terms of motions of a collection of planes associated with point triplets. Their work focused on inter-class classification using invariants associated with these homographies. Herein, we consider a similar decomposition, but will focus on the dual problem of intra-class classification. For example, in the case of two walking sequences by a male and a female subject, the motion of the body point triplet that includes the hip, knee and the foot usually appears different in the two sequences, due to different styles of swagger and body swing in male and female subjects. Therefore, it should be possible to derive invariant style features from the motion of such body point triplets for analysis of intra-class differences, e.g. for applications such as gender identification from video data.

The first step is alignment of the target sequences to a reference sequence, which is discussed in the next section.

III-A Action Sequence Alignment

We call a camera view of a person’s body motion from one body pose to another a pose transition. Our action sequence alignment is based on aligning pose transitions of two actions viewed by two different cameras. We start from a view-invariant similarity measure proposed in [120, 121]: Suppose two different subjects perform actions that are viewed by two different cameras. Any triplet of body points in one view and the corresponding triplet of body points in the second view define a homography ${\bf H}_{1}$ between the two cameras. After transition of the triplets to a new position in space, a second homography ${\bf H}_{2}$ would be induced. If the triplet motions by the two subjects differ only up to a similarity transformation, then the two homographies would be consistent with the fundamental matrix, and as a result the cross-homography defined by ${\bf H}={\bf H}_{1}{\bf H}_{2}^{-1}$ would reduce to a homology. In other words, two of the eigenvalues of ${\bf H}$ would be equal, e.g. $\sigma_{1}=\sigma_{2}$. The equality of the two eigenvalues of the cross-homography ${\bf H}$ would thus provide a similarity measure for the motion of the two body point triplets. Furthermore, this measure is invariant to viewing directions and the camera parameters. Given $11$ body points as shown in Figure 1, there are $\binom{11}{3}=165$ such triplets, all of which could be used to provide a combined measure of similarity of two actions.
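The eigenvalue test described above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function name `closest_eigenvalue_ratio`, the choice of "closest" as smallest distance in the complex plane, and the synthetic matrices are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def closest_eigenvalue_ratio(H1, H2):
    """Ratio (<= 1) of the two closest eigenvalues of H = H1 @ inv(H2)."""
    H = np.asarray(H1, dtype=float) @ np.linalg.inv(np.asarray(H2, dtype=float))
    eig = np.linalg.eigvals(H)  # may be complex for a general homography
    # Assumption: "closest" means smallest distance in the complex plane.
    _, a, b = min((abs(eig[a] - eig[b]), a, b) for a, b in combinations(range(3), 2))
    small, large = sorted((abs(eig[a]), abs(eig[b])))
    return small / large


# All body point triplets that can be formed from 11 tracked points.
triplets = list(combinations(range(11), 3))

# Synthetic check: G = I + v a^T is a homology (two equal eigenvalues),
# so H1 = G @ H2 yields a cross-homography whose closest pair has ratio ~1.
G = np.eye(3) + np.outer([0.2, -0.1, 0.3], [0.5, 0.4, 1.0])
H2 = np.array([[1.0, 0.1, 5.0], [-0.2, 0.9, 3.0], [0.001, 0.002, 1.0]])
H1 = G @ H2
ratio = closest_eigenvalue_ratio(H1, H2)
```

Because the ratio of eigenvalues is unchanged when either homography is rescaled, the arbitrary scale of an estimated homography does not affect the measure.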

Using this invariant measure, we define the following median absolute deviation estimator (MAD) as a measure of similarity of two pose transitions:

[TABLE]

where $\mu$ is the expected value of the ratio of the two closest eigenvalues over a subset of $N$ triplets. Since in our problem, we are only interested in intra-class alignment, $\mu$ can be set to one.

Suppose now we are given a target sequence of $m$ pose transitions $\mathcal{T}_{j}$ , $j=1,...,m$ and a reference sequence of $n$ pose transitions $\mathcal{R}_{j^{\prime}}$ , $j^{\prime}=1,...,n$ . The optimal pose transition $\mathcal{T}_{j}^{*}$ , in the target sequence that best matches a given pose transition $\mathcal{R}_{j^{\prime}}$ in the reference sequence can be obtained by:

$$\mathcal{T}_{j}^{*}=\arg\min_{j} M(\mathcal{T}_{j},\mathcal{R}_{j^{\prime}})$$

In order to find the optimal alignment $\psi : \mathcal{T}\rightarrow\mathcal{R}$ between the two sequences, we build the following matching error matrix

$$\mathbf{P}=[p_{ij}],\quad\text{where } p_{ij}=M(\mathcal{T}_{j},\mathcal{R}_{j^{\prime}})$$

The problem is clearly well suited for dynamic programming, and hence the solution is found as the path in the error matrix that minimizes the cumulative error.
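A minimal sketch of such a dynamic-programming alignment is given below. The paper does not state the allowed step pattern, so the standard dynamic time warping moves (advance in one sequence, in the other, or in both) are assumed here, and the function name `align_by_dynamic_programming` is hypothetical.

```python
import numpy as np


def align_by_dynamic_programming(P):
    """Return (path, cost) of the minimum cumulative-error path through P.

    P[i, j] is the matching error between reference transition i and
    target transition j. Assumed moves: (i-1, j), (i, j-1), (i-1, j-1).
    """
    P = np.asarray(P, dtype=float)
    n, m = P.shape
    D = np.full((n, m), np.inf)  # cumulative error table
    D[0, 0] = P[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                D[i - 1, j] if i > 0 else np.inf,
                D[i, j - 1] if j > 0 else np.inf,
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            D[i, j] = P[i, j] + best_prev
    # Backtrack from the last cell to the first to recover the alignment.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((D[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((D[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path, float(D[-1, -1])


# Toy error matrix whose cheapest path is the main diagonal.
P_demo = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
path_demo, cost_demo = align_by_dynamic_programming(P_demo)
```

The returned path lists matched (reference, target) index pairs, which is the correspondence needed to compare each target pose transition with its reference counterpart.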

Once a sequence is aligned to the reference sequence in a class of actions, the next question to answer is how dissimilar it is to the action class representative, i.e. the reference sequence. We will show in the next section that the dissimilarities are reflected in the motion patterns of planes defined by body point triplets, and can be measured in terms of the matching errors with respect to the reference sequence that represents our action class. Typically an intra-class classification is a much harder problem than an inter-class classification. Our approach has two important desirable characteristics:

1. Our error measures are based on the absolute deviation, and hence by design are meant to maximize discriminative power of our classifier.

2. Our classifier is invariant to camera parameters and orientations, and hence can rely on a much smaller set of training or reference sequences.

In the next section, we introduce our intra-class dissimilarity measures and demonstrate their power.

III-B Self-Dissimilarity

By self-dissimilarity we mean how dissimilar an instance of an action is relative to its class representative. Once a target sequence is aligned to a reference sequence as described above, for every pose transition $\mathcal{T}_{j}$ in the target sequence we will have the corresponding pose transition $\mathcal{R}_{j^{\prime}}$ in the reference sequence. We next build two matrices that reflect the differences in action style in the target sequence $\mathcal{T}$ compared with the reference sequence:

Triplet Deviation Matrix (TDM): We construct a matrix $\mathbf{T}$ as follows:

$$\mathbf{T}=[t_{ij}],\quad\text{where } t_{ij}=\frac{\sigma_{i1}}{\sigma_{i2}}-\mu_{j}$$

where the subscript $j=1,...,m$ are pose transition indices and $i=1,...,N$ are the triplet indices.

Therefore each column of $\mathbf{T}$ is an $N$ -vector containing the absolute deviation of all triplets for the matched pose transition $\mathcal{T}_{j}$ . On the other hand, each row $i$ of the matrix corresponds to the absolute deviations of the triplet $i$ across all matching pose transitions.

Pose Deviation Matrix (PDM): For aligning the target sequence $\mathcal{T}$ to the reference sequence $\mathcal{R}$ we applied dynamic programming on the matrix ${\bf P}$. In this matrix each element is the dissimilarity error of pose $i$ of $\mathcal{R}$ and pose $j$ of $\mathcal{T}$. In a sense, the elements of ${\bf P}$ are a measure of correlation between pose transitions in the two sequences. When $\mathcal{T}$ and $\mathcal{R}$ are the same sequence, ${\bf P}$ is a special case of the self-similarity matrix, which was used for action recognition in [63]. The patterns in ${\bf P}$ represent the characteristic features of the target sequence, and can be used to describe the style deviation of the target sequence from the reference sequence.

TDM and PDM describe the stylistic deviations of a sequence from the reference in two different ways: TDM captures localized low-level (body point triplet level) style deviations, while PDM captures a global motion style deviation by looking at the whole body pose.

In the following sections, we will discuss the properties of TDM and PDM. Without loss of generality, we study the action of kicking as an example. In our study, we use the IXMAS dataset [149], in which 13 actions are captured by 5 cameras and each action is performed 3 times by each of 11 actors. For the kicking action, we randomly chose the “bao1” sequence from camera 2 as the reference sequence.

III-B1 Deviations of Individual Subjects

We selected 3 kicking sequences performed by 3 actors in the data set, aligned them to the reference sequence and computed the corresponding TDM and PDM for each sequence (see Figures 3 and 4).

As shown in Figures 3 and 4, the TDM and PDM have different patterns in the three sequences. The different locations of peaks and valleys in the TDM plots suggest that the corresponding triplets move differently at various time slots in these sequences. The PDM also provides a good indication of different styles in these sequences, although their patterns in the diagonal look similar since they are intra-class measures.

III-B2 View Invariance

To study the invariance of TDM and PDM in different viewpoints with different camera parameters, we arbitrarily selected one instance of a subject and its captured videos by 4 different cameras. Similarly we computed the TDM and PDM for each sequence, which are plotted in Figure 5.

As shown in Figure 5, the computed TDM for a subject under various camera setups have similar peaks and valleys. Although minor variations exist, it is still easy to distinguish individuals from the visual differences of TDMs. The same observation is made regarding the PDMs. These observations show that as expected TDM and PDM are view invariant, since they are based on invariants associated with cross-homographies.

III-C Gender Recognition

In this section, we discuss the application of our solution to gender recognition using the action style representations TDM and PDM. We are provided with a set of training sequences which are labeled as $w_{0}$ (female) and $w_{1}$ (male). Our goal is to find a classifier that correctly categorizes an input action sequence $\mathcal{T}$ as $w_{0}$ or $w_{1}$. As discussed in previous sections, TDM and PDM provide good representations of action styles. However, due to the irregularities in human motion, the same action may be performed slightly differently even by the same subject in various instances, thus producing different patterns in TDM and PDM. While some of these patterns are essential for recognizing action styles, others are merely noise, making style analysis extremely challenging. Another challenge of course is the dimensionality of the problem. To tackle these problems, we first serialize the two matrices $\mathbf{T}$ and $\mathbf{P}$ as two vectors $\mathbf{T}_{s}$ and $\mathbf{P}_{s}$ and then stack them together in a single $d$-dimensional vector

$$\mathbf{x}=[\mathbf{T}_{s},\mathbf{P}_{s}].$$
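As a small illustration, the serialization step might look like the sketch below. The row-major flattening order and the helper name `style_vector` are assumptions; the text only states that the two matrices are serialized and stacked. The sketch also assumes that every sequence yields matrices of the same size, so that all style vectors share the same dimension $d$.

```python
import numpy as np


def style_vector(T, P):
    """Stack the serialized TDM and PDM into one d-dimensional vector."""
    T_s = np.ravel(T)  # row-major serialization (assumed order)
    P_s = np.ravel(P)
    return np.concatenate([T_s, P_s])


# Toy matrices: a 2x3 "TDM" and a 2x2 "PDM" give d = 6 + 4 = 10.
x_demo = style_vector(np.arange(6).reshape(2, 3), np.arange(4).reshape(2, 2))
```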

In the next two sections, we propose two frameworks for gender classification using $\mathbf{x}$ that summarizes TDM and PDM.

III-C1 Gender Classification using PCA

In order to reveal the underlying stylistic information in the TDM and PDM, we need to describe the data in a way that makes “critical” (significant) and “trivial” (insignificant) triplets or pose patterns better discriminated. Principal Component Analysis (PCA) provides a good solution for this purpose, providing both feature selection and dimensionality reduction. Suppose we have a set of $N$ $d$ -dimensional samples $\mathbf{x}_{1},\dots,\mathbf{x}_{N}$ . The goal here is to find a natural set of $d$ orthonormal basis vectors $\{\mathbf{e}_{i}\}$ to represent the samples such that the criterion function

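$$J_{d}=\sum_{k=1}^{N}\left\|\left(\mathbf{m}+\sum_{i=1}^{d}a_{i}^{k}\mathbf{e}_{i}\right)-\mathbf{x}_{k}\right\|^{2}$$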

is minimized, where $\mathbf{m}$ is the sample mean,

$$\mathbf{m}=\frac{1}{n}\sum_{k=1}^{N}\mathbf{x}_{k},$$

and

$$\mathbf{a}^{k}=\begin{bmatrix}a^{k}_{1}&a^{k}_{2}&\dots&a^{k}_{d^{\prime}}\end{bmatrix}^{T}$$

are defined as principal components. The solution is to compute the eigenvalues and eigenvectors of the scatter matrix $\mathbf{S}$

$$\mathbf{S}=\sum_{k=1}^{N}(\mathbf{x}_{k}-\mathbf{m})(\mathbf{x}_{k}-\mathbf{m})^{T},$$

and sort the eigenvalues and eigenvectors according to decreasing eigenvalue. The eigenvectors associated with the $d^{\prime}$ largest eigenvalues are then used as the basis vectors $\{\mathbf{e}_{i}\}$. Usually $d^{\prime}$ is much smaller than $d$, which implies that the $d^{\prime}$ dimensions are inherent subspaces that govern the samples, while the remaining $d-d^{\prime}$ dimensions are merely noise. A sample $\mathbf{x}_{k}$ can now be represented by principal components through projection onto the $d^{\prime}$-dimensional subspace as $\tilde{\mathbf{x}}_{k}$:

$$\tilde{\mathbf{x}}_{k}=\mathbf{A}^{T}(\mathbf{x}_{k}-\mathbf{m}),$$

where

$$\mathbf{A}=\begin{bmatrix}\mathbf{e}_{1}&\mathbf{e}_{2}&\dots&\mathbf{e}_{d^{\prime}}\end{bmatrix}.$$

The basis vectors computed by PCA are in the direction of the largest variance of the training vectors, and they convey the stylistic elements inherent to the specific motion. We call these bases the “eigenstyles”. These eigenstyles span a style space of the specific motion. When a sample $\mathbf{x}_{k}$ is projected onto the style space, its vector $\tilde{\mathbf{x}}_{k}$ describes the significance of these eigenstyles in the sample. We therefore define $\tilde{\mathbf{x}}_{k}$ as the stylistic feature of the sample. A style sample/representation can be reconstructed with some error based on the eigenstyles and its stylistic feature from equation (10).
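The eigenstyle computation can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, samples are stored as rows of `X`, and the scatter matrix is formed explicitly, which is only practical for moderate $d$ (an SVD of the centered data would be the usual alternative for large $d$).

```python
import numpy as np


def fit_eigenstyles(X, d_prime):
    """Return the sample mean m and the d x d' basis A of top eigenvectors."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc  # scatter matrix: sum of (x_k - m)(x_k - m)^T
    eigvals, eigvecs = np.linalg.eigh(S)  # S is symmetric
    order = np.argsort(eigvals)[::-1][:d_prime]  # largest eigenvalues first
    return m, eigvecs[:, order]


def project(x, m, A):
    """Stylistic feature: coordinates of x in the eigenstyle space."""
    return A.T @ (np.asarray(x, dtype=float) - m)


def reconstruct(x_tilde, m, A):
    """Approximate style vector rebuilt from its stylistic feature."""
    return m + A @ x_tilde


# Toy data lying on one line, so a single eigenstyle reconstructs it exactly.
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
m_demo, A_demo = fit_eigenstyles(X_demo, d_prime=1)
x_rec = reconstruct(project(X_demo[3], m_demo, A_demo), m_demo, A_demo)
```

With real style vectors the reconstruction is only approximate, and the residual corresponds to the $d-d^{\prime}$ discarded dimensions.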

We adapted the k-nearest neighbor algorithm to classify sequences represented by the PCA based stylistic feature as follows. Suppose we are provided with a set of stylistic feature vectors $\{\tilde{\mathbf{x}}_{k}|k=1,2,\dots,n\}$ after PCA, and their corresponding labels $\left\{\mathcal{L}_{k}|k=1,2,\dots,n,\ \mathcal{L}_{k}\in\{w_{0},w_{1}\}\right\}$. A target sequence $\mathcal{T}$ is classified as $w_{0}$ or $w_{1}$ based on the following procedure:

1. The PDM and TDM of sequence $\mathcal{T}$ are first computed, and then serialized as a $d$-dimensional vector $\mathbf{x}$.

2. $\mathbf{x}$ is projected onto the eigenstyle space as $\tilde{\mathbf{x}}$.

3. The Euclidean distances between $\tilde{\mathbf{x}}$ and all $\tilde{\mathbf{x}}_{k}$ in the training set are computed, and $\mathcal{T}$ is classified as the label $w_{i}$ that is most frequent among the $k$ training vectors nearest to $\tilde{\mathbf{x}}$, where $k$ is the closest odd integer to $\sqrt{n}$.
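The voting step of this procedure can be sketched as below, assuming the stylistic features have already been computed. The function names are hypothetical, and the tie rule for "closest odd integer" (an even value of $\sqrt{n}$ is rounded up to the next odd integer) is an assumption, since the text does not specify it.

```python
import math
from collections import Counter

import numpy as np


def closest_odd_integer(value):
    """Odd integer nearest to value (ties at even values go upward)."""
    return 2 * math.floor(value / 2) + 1


def knn_classify(x_tilde, train_feats, train_labels):
    """Majority label among the k nearest training features."""
    train_feats = np.asarray(train_feats, dtype=float)
    n = len(train_feats)
    k = closest_odd_integer(math.sqrt(n))
    dists = np.linalg.norm(train_feats - np.asarray(x_tilde, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy one-dimensional features: n = 9 training samples, hence k = 3.
feats_demo = [[0.0], [0.1], [0.2], [0.3], [0.05], [5.0], [5.1], [5.2], [5.3]]
labels_demo = ["w0"] * 5 + ["w1"] * 4
pred_low = knn_classify([0.15], feats_demo, labels_demo)
pred_high = knn_classify([5.05], feats_demo, labels_demo)
```

Using an odd $k$ avoids tied votes in the two-class setting.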

III-C2 Gender Classification using LDA

The PCA method finds eigenstyles to describe as much deviation in data as possible, and provides good features to describe the data. However, the $d-d^{\prime}$ dimensions that are thrown away in PCA may still contain useful information for our classification task. Moreover, PCA is an unsupervised technique that seeks features which are efficient for describing data; it does not make use of the label information in the data. Unlike PCA, Linear Discriminant Analysis (LDA) seeks features that are efficient for discriminating the classes given the labeled data. Suppose the data $\{\mathbf{x}_{i}|i=1\dots n\}$ are categorized into $w_{0}$ and $w_{1}$. LDA projects the data $\mathbf{x}_{i}$ onto a point $y$ on a line $\mathbf{w}$ by a linear combination of the components of $\mathbf{x}$:

$$y=\mathbf{w}^{T}\mathbf{x},$$

and seeks an optimal $\mathbf{w}$ that results in best separation between points with different labels. It is solved by maximizing the objective function:

$$J(\mathbf{w})=\frac{\mathbf{w}^{T}\mathbf{S}_{B}\mathbf{w}}{\mathbf{w}^{T}\mathbf{S}_{W}\mathbf{w}},$$

where the intra-class scatter matrix $\mathbf{S}_{W}$ is defined as

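$$\mathbf{S}_{W}=\sum_{\mathbf{x}\in w_{0}}(\mathbf{x}-\mathbf{m}_{0})(\mathbf{x}-\mathbf{m}_{0})^{T}+\sum_{\mathbf{x}\in w_{1}}(\mathbf{x}-\mathbf{m}_{1})(\mathbf{x}-\mathbf{m}_{1})^{T},$$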

the inter-class scatter matrix $\mathbf{S}_{B}$ is defined as

$$\mathbf{S}_{B}=(\mathbf{m}_{1}-\mathbf{m}_{2})(\mathbf{m}_{1}-\mathbf{m}_{2})^{T},$$

and

$$\mathbf{m}_{i}=\frac{1}{n}\sum_{\mathbf{x}_{i}\in w_{i}}\mathbf{x}_{i}.$$

As discussed in [49], $J(\cdot)$ is independent of $\|\mathbf{w}\|$, and the solution of $\mathbf{w}$ that maximizes $J(\cdot)$ is

$$\mathbf{w}=\mathbf{S}_{W}^{-1}(\mathbf{m}_{1}-\mathbf{m}_{2}).$$

Using $\mathbf{w}$ , we project our style vectors $\mathbf{x}$ on a line, and the projected scalar value $y$ is our extracted stylistic feature. We thus convert the $d$ -dimensional classification problem to a far more manageable one-dimensional one. All that remains for our task is to find a threshold that separates the projected points into $w_{0}$ and $w_{1}$ . Here the decision surface is reduced to a scalar value. We assume that the stylistic vectors of both classes exhibit approximately the same distributions, therefore we choose the separation threshold as

$$c=\mathbf{w}\left(\frac{\mathbf{m}_{1}+\mathbf{m}_{2}}{2}\right).$$
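A compact sketch of this two-class discriminant and its midpoint threshold follows. The function names are hypothetical, the class means are indexed as $\mathbf{m}_{0}$ and $\mathbf{m}_{1}$ to match the class labels $w_{0}$ and $w_{1}$, and a pseudo-inverse is used in place of $\mathbf{S}_{W}^{-1}$ as an assumption to cope with a singular scatter matrix, which arises whenever $d$ exceeds the number of training samples.

```python
import numpy as np


def fit_fisher(X0, X1):
    """Return projection direction w and scalar threshold c.

    X0, X1 hold the style vectors (as rows) of classes w0 and w1.
    """
    X0 = np.asarray(X0, dtype=float)
    X1 = np.asarray(X1, dtype=float)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Intra-class scatter: sum of per-class scatter matrices.
    S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Assumption: pinv instead of inv, so a singular S_W does not fail.
    w = np.linalg.pinv(S_W) @ (m1 - m0)
    c = float(w @ (m0 + m1) / 2.0)  # midpoint of the projected class means
    return w, c


def classify(x, w, c):
    """Label a style vector by thresholding its projection y = w^T x."""
    return "w1" if float(w @ np.asarray(x, dtype=float)) > c else "w0"


# Toy data: two well separated square clusters in the plane.
X0_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X1_demo = X0_demo + 4.0
w_demo, c_demo = fit_fisher(X0_demo, X1_demo)
```

With the direction chosen as $\mathbf{m}_{1}-\mathbf{m}_{0}$, projections of class $w_{1}$ fall above the threshold and those of class $w_{0}$ below it.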

IV Experiments

In this section, we present experiments to demonstrate the effectiveness of our proposed gender recognition from human motion, based on the proposed stylistic features. We tested our methods on all the 13 actions from the IXMAS data set. IXMAS dataset consists of 13 everyday actions performed 3 times by 11 actors at arbitrary positions and orientations, and observed by 5 cameras set up at various viewpoints. We assumed that body points were tracked. We then arbitrarily selected one of the subjects as reference for each action, and selected a small number of sequences as a training set, which included 2 female subjects and 2 male subjects, with reference, training and testing sets completely disjoint. For each action, the PCA based method and LDA method were applied to classify the testing sequences as male or female.

Figure 6 displays the first 3 eigenstyles we obtained for the kicking action. We used different values of $d^{\prime}$ with our PCA method, and measured the resulting classification rates. Results for 5 of the actions are shown below. We found that the remaining actions in the dataset did not provide sufficient information to distinguish male and female actors. Examples of such actions are “watch time”, “cross arms”, “wave hand”, etc. Our explanation for this is that male and female subjects perform these simple tasks in an almost identical manner. Also, these actions are extremely simple and involve very few parts of the body. As a result, little information is present in the data for gender classification. On the other hand, the five actions for which the results are illustrated in the graphs and the tables below involve more sophisticated body part motions, thus providing a better means of distinguishing gender.

With the LDA method, we projected the high dimensional data onto a one-dimensional subspace. Figure 7 illustrates examples of the distribution of the projected points for the two actions of “kicking” and “walking”. As shown in the histograms, the data of the two classes are well separated, and could be distinguished by simple thresholding.

We plot the resulting classification rates based on LDA together with those based on PCA in Figure 8 (a)-(e). As can be seen in these results, the LDA method is more effective in the task of gender recognition, partially because LDA makes better use of the labels in the training set and extracts the features that are more efficient for discriminating classes.

V Conclusion

We propose two invariant measures that can be used for intra-class classification of actions performed by different subjects and captured by different cameras from different viewpoints. We demonstrate their power in discriminating action styles by using these measures as the feature vectors within two frameworks based on PCA (eigenstyles) and LDA. Our paper makes several main contributions: (i) our methods are invariant to viewpoint variations and camera parameters due to the use of view-invariant feature vectors, (ii) only a very small training set is required for our methods, which nonetheless provide very good performance, (iii) we show with extensive experiments that the proposed eigenstyles and LDA methods can reliably classify gender from video data of different actions. Our results can be readily extended to other applications such as age recognition, human identification using gait, and identification of abnormal action features such as carrying extra weight, or walking on an uneven surface.

Bibliography

[1] Muhamad Ali and Hassan Foroosh. Natural scene character recognition without dependency on specific features. In Proc. International Conference on Computer Vision Theory and Applications, 2015.
[2] Muhamad Ali and Hassan Foroosh. A holistic method to recognize characters in natural scenes. In Proc. International Conference on Computer Vision Theory and Applications, 2016.
[3] Muhammad Ali and Hassan Foroosh. Character recognition in natural scene images using rank-1 tensor decomposition. In Proc. of International Conference on Image Processing (ICIP), pages 2891–2895, 2016.
[4] Mais Alnasser and Hassan Foroosh. Image-based rendering of synthetic diffuse objects in natural scenes. In Proc. IAPR Int. Conference on Pattern Recognition, volume 4, pages 787–790, 2006.
[5] Mais Alnasser and Hassan Foroosh. Rendering synthetic objects in natural scenes. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 493–496, 2006.
[6] Mais Alnasser and Hassan Foroosh. Phase shifting for non-separable 2d haar wavelets. IEEE Transactions on Image Processing, 16:1061–1068, 2008.
[7] Nazim Ashraf and Hassan Foroosh. Robust auto-calibration of a ptz camera with non-overlapping fov. In Proc. International Conference on Pattern Recognition (ICPR), 2008.
[8] Nazim Ashraf and Hassan Foroosh. Human action recognition in video data using invariant characteristic vectors. In Proc. of IEEE Int. Conf. on Image Processing (ICIP), pages 1385–1388, 2012.