An Invariant Model of the Significance of Different Body Parts in Recognizing Different Actions
Yuping Shen, Hassan Foroosh

TL;DR
This paper introduces an invariant weighting method for body parts in action recognition from videos, emphasizing the unequal importance of body parts and improving recognition accuracy.
Contribution
The paper proposes a novel, view-invariant method to assign weights to body parts based on their significance in recognizing actions, inspired by human perceptual processes.
Findings
Significant performance improvement when using weighted body parts
Weights are invariant to viewing angles and camera parameters
Method validated through extensive experiments
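Table I: Confusion matrix for the weighted eigenvalue method (rows: ground truth; columns: recognized as). The largest count in each row is the number of correctly recognized sequences.
| Ground-truth | Walk | Jump | Golf Swing | Run | Climb |
|---|---|---|---|---|---|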
| Walk | 46 | 1 | 1 | 2 | |
| Jump | 1 | 48 | 1 | ||
| Golf Swing | 1 | 48 | 1 | ||
| Run | 2 | 48 | |||
| Climb | 4 | 1 | 1 | 44 | |
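Table II: Confusion matrix for the weighted fundamental ratios method (rows: ground truth; columns: recognized as). The largest count in each row is the number of correctly recognized sequences.
| Ground-truth | Walk | Jump | Golf Swing | Run | Climb |
|---|---|---|---|---|---|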
| Walk | 45 | 1 | 1 | 2 | 1 |
| Jump | 2 | 47 | 1 | ||
| Golf Swing | 1 | 47 | 1 | 1 | |
| Run | 3 | 1 | 45 | 1 | |
| Climb | 4 | 1 | 1 | 2 | 42 |
Yuping Shen and Hassan Foroosh
Yuping Shen was with the Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA, at the time this project was conducted. Hassan Foroosh is with the Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.
Abstract
In this paper, we show that different body parts do not play equally important roles in recognizing a human action in video data. We investigate to what extent a body part plays a role in recognition of different actions and hence propose a generic method of assigning weights to different body points. The approach is inspired by the strong evidence in the applied perception community that humans perform recognition in a foveated manner, that is they recognize events or objects by only focusing on visually significant aspects. An important contribution of our method is that the computation of the weights assigned to body parts is invariant to viewing directions and camera parameters in the input data. We have performed extensive experiments to validate the proposed approach and demonstrate its significance. In particular, results show that considerable improvement in performance is gained by taking into account the relative importance of different body parts as defined by our approach.
Index Terms:
Invariants, Action Recognition, Body Parts
I Introduction
Human action recognition from video data has a wide range of applications in areas such as surveillance and image retrieval [61, 122, 67, 121, 13, 124, 59, 58, 60, 14, 10, 119], image annotation [128, 127, 129, 126, 125], video post-production and editing [26, 84, 5, 6, 50, 20], and self-localization [63, 62, 68, 69, 70], to name a few.
The literature on human action recognition from video data includes both monocular and multiple view methods [117, 15, 123, 113, 120, 16, 112, 115, 12, 35, 114, 116, 9]. Often, multiple view methods are designed to tackle viewpoint invariant recognition [117, 15, 113, 16, 115, 12, 114, 9], although such methods may require calibration across views [64, 56, 65, 57, 72, 73, 74, 8, 71, 25], image registration [49, 46, 27, 28, 7, 17, 18, 104, 47, 103, 24, 23, 22, 48, 44, 95, 21, 19, 45, 43], or tracking across views [118, 79, 81, 107, 80]. There are also methods that rely on human-object interaction [87, 136, 135], which often require identifying image contents other than humans [76, 134, 38, 37, 78, 85, 3, 2, 42, 4, 36, 40]. Other preprocessing steps that may be needed include image restoration [98, 106, 101, 30, 102, 77, 96, 92, 93, 97, 94, 29, 108, 31, 53, 99, 100, 105], or scene modeling [66, 32, 54, 11].
In this paper, we look at the specific problem of determining which body parts play a role in recognizing an action, and to what extent. This is crucial information that may serve as a prior for any method in the literature, whether video-based or single-image.
II Related Work
The literature on human action recognition has been extremely active in the past two decades and significant progress has been made in this area [51, 82, 83, 131]. Human action recognition methods start by assuming a model of the human body, e.g. silhouette, body points, stick model, etc., and build algorithms that use the adopted model to recognize body pose and its motion over time. While there have been many different ways of classifying existing action recognition methods in the literature [51, 82, 83, 131], we present herein a different perspective. We consider action recognition at three different levels: (i) the subject level, which requires studying the kinematics of the articulated human body, (ii) the image level, which requires studying the process of imaging the 3D human body both in terms of the geometry and lighting effects, and (iii) the observer level, which requires studying how an observer would interpret visual data and recognize an action. This way of categorizing action recognition is rather uncommon but offers a different point of view on how one can approach this problem.
Herein we are interested in investigating observer level issues from a new point of view. It has long been argued in the applied perception community [90] that human observers use a foveated strategy, which may be interpreted as an approach using “importance sampling”. What this implies is that humans sample only the most significant aspects of an event or action for recognition, and do not give equal importance to every observed data point. Most existing action recognition methods in the literature have primarily focused on subject level [33, 132, 34, 133, 137, 86, 138] and image level [41, 75, 89, 140, 130] issues. A number of methods have focused on the observer level [1, 89], but mostly from a machine learning point of view. A recent work by Sheikh et al. [91] is perhaps closely connected to what we aim to address in this paper. However, they investigate a holistic approach in an abstract framework of action space. Our goal is to propose an observer level approach with a more direct connection to image level features, which we believe would be more “intuitive”. In other words, we are interested in exploring an idea similar to importance sampling, where image level features are directly assigned different importance weights that are derived from the extent to which each feature affects the recognition performance for a given action.
One particular issue that has attracted attention among many of these groups is invariance [39, 52, 55, 86, 88, 110, 111]. Invariance is an important issue in any recognition problem from image or video data, since appearance is substantially affected by the imaging level, i.e. the viewing angle, camera parameters, lighting, etc. In action recognition the issue of invariant recognition is compounded by the additional complexity introduced by the high degrees of freedom of the human body [139].
Our focus in this paper is twofold: to investigate action recognition at the observer level, which takes into account how humans recognize actions by paying attention to only the most significant visual cues, and to introduce this approach within an invariant framework.
III Our Body Model
We use a point-based model similar to [110, 111], which consists of a set of 11 body points (see Figure 1). We also adopt the idea of decomposing the non-rigid motion of the human body into rigid motions of planes given by triplets of body points [110, 111]. This implies that the non-rigid motion of the human body is described by a set of homographies induced by the planes associated with the body point triplets. As a result, the complex problem of estimating the human body motion is reduced to a set of linear problems associated with the motions of planes.
With the triplet representation of human pose and action, we may consider the relative importance of body point triplets for different actions. For instance, the triplets composed of shoulders and hips have similar motion in walking and jogging, and thus contribute little to distinguishing them, while other triplets that consist of shoulder, knee and foot joints carry more essential information about the differences between walking and jogging (see Figure 2). Understanding the roles of body point triplets in human motion/action could help us retrieve more accurate information on human motion, and thus improve the performance of our recognition methods.
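To make the triplet representation concrete, the following minimal sketch enumerates every unordered triplet of an 11-point body model. The point labels are hypothetical placeholders (the actual points are those defined in Figure 1), and `body_point_triplets` is an illustrative helper name, not something taken from the paper.

```python
from itertools import combinations

# Hypothetical labels for an 11-point body model; the paper's Figure 1
# defines the actual points, which are not reproduced here.
BODY_POINTS = [
    "head", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
    "l_hip", "r_hip", "l_knee", "r_knee", "l_foot", "r_foot",
]


def body_point_triplets(points):
    """Return every unordered triplet of body points as a list of tuples."""
    return list(combinations(points, 3))


triplets = body_point_triplets(BODY_POINTS)
print(len(triplets))  # 165, i.e. "11 choose 3"
```

Each triplet spans a plane whose rigid motion is tracked separately, so the number of triplets grows cubically with the number of body points.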
Our problem can be described as follows. We have a database of reference actions. Given a target sequence we are interested in finding out which reference sequence in the database best matches our target sequence. We are interested in performing this task in a manner that is invariant to viewing directions and the camera parameters. Our first step is to align the target sequences to be recognized to the reference sequences in our database, as described in the next section.
III-A Sequence Alignment
The body motion perceived in two different frames is often referred to as a body pose transition. Our sequence alignment method is based on finding corresponding pose transitions in two action sequences viewed by two different cameras. Our starting point is the view-invariant similarity measure that was proposed in [110, 111]: any triplet of body points in one camera image and the corresponding triplet of points in a second camera define a homography between the two cameras. After a pose transition, the triplets move to new positions, defining a new homography. If the triplet motions are similar, then the two homographies are consistent with the same fundamental matrix, and as a result the relative homography obtained by composing one with the inverse of the other becomes a homology, two of whose eigenvalues are equal. The equality of the two eigenvalues of this relative homography thus provides a measure of similarity for the motion of the two point triplets, which is invariant to viewing angles and the camera parameters [110]. Given $n$ body points as shown in Figure 1, there are $\binom{n}{3}$ such triplets, all of which are used to provide a measure of similarity for the two actions being compared. We use the following measure of similarity over all possible triplets for evaluating the similarity of two pose transitions:
[TABLE]
where $a_i$ and $b_i$ are the two closest eigenvalues of the homography defined by the $i$-th triplet, and the number of triplets is $\binom{11}{3} = 165$ for 11 body points.
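The eigenvalue-equality test described above can be sketched as follows. This is only an illustration under stated assumptions: the per-triplet error is assumed to take the normalized form |a - b| / |a + b| for the two closest eigenvalues a and b, and the pose-transition error is assumed to be the mean over triplets. The eigenvalues themselves are taken as given inputs, and the function names are hypothetical.

```python
def closest_pair_error(eigenvalues):
    """Normalized gap between the two closest of three eigenvalues.

    Assumed form: |a - b| / |a + b|, which is 0 when two eigenvalues
    coincide (the homology case) and grows as they move apart.
    """
    e = [complex(v) for v in eigenvalues]
    pairs = [(e[0], e[1]), (e[0], e[2]), (e[1], e[2])]
    # Pick the pair of eigenvalues with the smallest absolute difference.
    a, b = min(pairs, key=lambda p: abs(p[0] - p[1]))
    denom = abs(a + b)
    return abs(a - b) / denom if denom > 0 else float("inf")


def pose_transition_error(eigenvalues_per_triplet):
    """Mean per-triplet error over all triplets (an assumed aggregation)."""
    errors = [closest_pair_error(e) for e in eigenvalues_per_triplet]
    return sum(errors) / len(errors)


# One triplet whose relative homography is a homology, one that is not.
print(pose_transition_error([[1.0, 1.0, 2.5], [1.0, 3.0, 10.0]]))  # 0.25
```

Because only ratios of eigenvalues enter the error, the score is unaffected by the arbitrary scale of a homography.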
Given a target sequence of pose transitions and a reference sequence of pose transitions, we find the optimal alignment between the two actions by building a matching error matrix using equation (1) and applying dynamic programming to find the optimal mapping. Once two sequences are aligned, the main issue is to determine how similar the two actions are. In the next section, we describe our weighting-based action recognition method.
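The alignment step can be illustrated with a generic dynamic-programming alignment over a matching error matrix. The text does not specify the exact recurrence, so this sketch assumes a standard dynamic-time-warping scheme (diagonal, vertical and horizontal moves); `align_sequences` is a hypothetical name.

```python
def align_sequences(cost):
    """Align two sequences given cost[i][j], the matching error between
    target pose transition i and reference pose transition j.

    Returns the minimal accumulated cost and the alignment path as a
    list of (i, j) index pairs from (0, 0) to the last pair.
    """
    m, n = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * n for _ in range(m)]
    acc[0][0] = cost[0][0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best_prev

    # Backtrack from the end to recover the mapping.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[m - 1][n - 1], path


example_cost = [
    [0.1, 0.9, 0.9],
    [0.9, 0.1, 0.9],
    [0.9, 0.2, 0.9],
    [0.9, 0.9, 0.1],
]
total, mapping = align_sequences(example_cost)
print(mapping)  # [(0, 0), (1, 1), (2, 1), (3, 2)]
```

In the example, target transitions 1 and 2 both map to reference transition 1, which is how the alignment absorbs differences in execution speed.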
IV Weighting-based Human Action Recognition
To study the roles of body-point triplets in action recognition, we select two different sequences of a walking action and one sequence of a running action. We then align the second walking sequence and the running sequence to the first walking sequence, using the alignment method described in Section III-A, and obtain the corresponding alignments (mappings). As discussed in Section III-A, the similarity of two poses is computed from the error scores of the motions of all body-point triplets. For each pair of matched poses, we stack the error scores of all triplets as a vector:
[TABLE]
where .
We then build an error score matrix for the walking-to-walking alignment:
[TABLE]
Each row of this matrix holds the dissimilarity scores of one triplet across the sequence, and the mean of each column is the dissimilarity score of the corresponding pair of matched poses. Similarly, we build an error score matrix for the walking-to-running alignment. Both matrices are illustrated in Figure 3.
To study the role of a triplet in distinguishing walking and running, we compare the corresponding rows of the two matrices, as plotted in Figure 4 (a) - (f). We found that some triplets, such as triplets 1, 21 and 90, have similar error scores in both cases, which means that the motion of these triplets is similar in walking and running. On the other hand, triplets 55, 94 and 116 have high error scores in the walking-to-running matrix and low error scores in the walking-to-walking matrix; that is, the motion of these triplets in a running sequence is different from their motion in a walking sequence. Triplets 55, 94 and 116 reflect the variation between the actions of walking and running, and are thus more informative than triplets 1, 21 and 90 for the task of distinguishing walking and running actions.
In the following experiments, we compare sequences of different individuals performing the same action, and study the roles of triplets in categorizing them in the same group of action. We select four sequences of the golf-swing action performed by different individuals, align three of them to the fourth using the alignment method described in Section III-A, and build the three corresponding error score matrices as in the experiments above. As the illustrations of these matrices in Figure 5 (a), (b) and (c) show, the dissimilarity scores of some triplets, such as triplet 120 (see Figure 5 (f)), are very consistent across individuals. Other triplets, such as triplets 20 (Figure 5 (d)) and 162 (Figure 5 (e)), have varying error score patterns across individuals; that is, these triplets represent the variations among individuals performing the same action.
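The construction of an error score matrix can be sketched as below: per-pose error vectors are stacked as columns, so rows index triplets and columns index matched pose pairs. The function names are hypothetical, and the toy values are made up purely to show the layout.

```python
def build_error_matrix(error_vectors):
    """Stack per-pose error vectors as columns.

    error_vectors[p][t] is the error of triplet t at matched pose pair p;
    the result satisfies matrix[t][p] == error_vectors[p][t].
    """
    n_triplets = len(error_vectors[0])
    return [[vec[t] for vec in error_vectors] for t in range(n_triplets)]


def pose_dissimilarity(matrix):
    """Mean of each column: the dissimilarity of each matched pose pair."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [sum(matrix[t][p] for t in range(n_rows)) / n_rows
            for p in range(n_cols)]


# Toy input: 3 matched pose pairs, 2 triplets each (illustrative values).
vectors = [[0.1, 0.3], [0.2, 0.4], [0.3, 0.5]]
matrix = build_error_matrix(vectors)
print(matrix)  # [[0.1, 0.2, 0.3], [0.3, 0.4, 0.5]]
```

Reading a row gives the behavior of one triplet over time, which is what the per-triplet comparisons between actions rely on.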
Definition 1
If a triplet reflects the essential differences between an action and other actions, we call it a significant triplet of that action. All triplets other than significant triplets are referred to as trivial triplets of that action.
A typical significant triplet should (1) convey the variations between actions and/or (2) tolerate the individual variations of the same action. For example, triplets 55, 94 and 116 are significant triplets for walking action, and triplet 20 is a significant triplet for the golf-swing action.
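One simple way to explore which triplets behave like significant triplets is to rank them by how much larger their mean error is against a different action than against the same action. This is an illustrative heuristic only, not the weighting method of this paper, and the function names and toy values are hypothetical.

```python
def row_means(matrix):
    """Mean error of each triplet (row) across the aligned sequence."""
    return [sum(row) / len(row) for row in matrix]


def rank_triplets(same_action_matrix, other_action_matrix):
    """Return triplet indices sorted from most to least discriminative.

    A triplet scores highly when its error is large against a different
    action but small against another instance of the same action.
    """
    same = row_means(same_action_matrix)
    other = row_means(other_action_matrix)
    gaps = [o - s for s, o in zip(same, other)]
    return sorted(range(len(gaps)), key=lambda t: gaps[t], reverse=True)


# Toy error matrices with 3 triplets and 2 matched pose pairs each.
same = [[0.1, 0.1], [0.2, 0.2], [0.1, 0.3]]
other = [[0.1, 0.2], [0.9, 0.7], [0.5, 0.5]]
print(rank_triplets(same, other))  # [1, 2, 0]
```

A hard ranking like this discards magnitude information, which is one reason to prefer continuous weights over a binary significant/trivial split.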
Intuitively, in the task of action recognition, we should place more focus on the significant triplets while reducing the negative impact of trivial triplets, that is, assign appropriate influence factors to the body-point triplets. In our approach to action recognition, this can be achieved by assigning appropriate weights to the similarity errors of body point triplets in equation (1). That is, equation (1) could be rewritten as:
[TABLE]
where the number of triplets is $\binom{n}{3}$, and $n$ is the number of body points in the human body model.
The next question is how to determine the optimal set of weights for different actions. Manual assignment of weights could be biased, is difficult for a large database of actions, and is inefficient when new actions are added. Automatic assignment of weight values is desirable for a robust and efficient action recognition system. To achieve this goal, we propose to use a fixed-size dataset of training sequences to learn the weight values. Suppose we are given a training dataset which consists of action sequences for several different actions, each with a set of pre-aligned sequences performed by various individuals. We associate a weight value with each body joint for each action. Our goal is to find the optimal assignment of these weights, which maximizes the similarity error between sequences of different actions and minimizes that between sequences of the same action. Since the size of the dataset and the alignments of the sequences are fixed, this turns out to be an optimization problem on the weights. Our task is to define a good objective function for this purpose, and to apply optimization to solve the problem.
IV-A Weights on Triplets versus Weights on Body Points
Given a human body model of $n$ points, we could obtain at most $\binom{n}{3}$ triplets, and would need to solve a $\binom{n}{3}$-dimensional optimization problem for weight assignment. Even with a simplified human body model of 11 points, this yields an extremely high-dimensional (165 dimensions) problem. On the other hand, the body point triplets are not independent of each other. In fact, adjacent triplets are correlated through their common body points, and the importance of a triplet is also determined by the importance of its three corners (body points). Therefore, instead of using $\binom{n}{3}$ variables for the weights of triplets, we assign $n$ weights to the body points, where:
[TABLE]
The weight of a triplet is then computed as:
[TABLE]
Note that the definition of in (6) ensures that . Using (6), equation (4) is rewritten as:
[TABLE]
By introducing weights to body points, we reduce the high dimensional optimization problem to a lower dimensional, and more tractable problem.
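The reduction from triplet weights to body-point weights can be sketched as follows. The exact formula for the weight of a triplet is not reproduced here, so this example assumes one plausible choice: the sum of the weights of the three corner points, normalized so that all triplet weights sum to one. Both function names are hypothetical.

```python
from itertools import combinations
from math import comb


def triplet_weights(point_weights):
    """Derive one weight per triplet from per-point weights.

    Assumed rule: sum of the three corner weights, divided by a constant
    that makes the triplet weights sum to 1. Each point belongs to
    comb(n - 1, 2) triplets, which gives the normalizer below.
    """
    n = len(point_weights)
    norm = comb(n - 1, 2) * sum(point_weights)
    return {
        (i, j, k): (point_weights[i] + point_weights[j] + point_weights[k]) / norm
        for i, j, k in combinations(range(n), 3)
    }


def weighted_error(triplet_errors, weights):
    """Weighted sum of per-triplet errors, keyed by triplet index tuple."""
    return sum(weights[t] * triplet_errors[t] for t in weights)


uniform = triplet_weights([1.0 / 11] * 11)
print(len(uniform))  # 165
```

With uniform point weights every triplet receives the same weight, so the weighted error reduces to a plain average over triplets; only 11 free parameters remain instead of 165.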
IV-B Automatic Adjustment of Weights
Before moving on to the automatic adjustment of weights, we first discuss the similarity score of two pre-aligned sequences. Given two sequences and the known alignment between them, their similarity is:
[TABLE]
Here the similarity is evaluated with respect to computed reference poses and a threshold, which we set as suggested in [110, 111]. Therefore, the approximate similarity score of the two sequences is:
[TABLE]
Considering that the remaining quantities are constants given the alignment, the expression above can be further rewritten into a simpler form:
[TABLE]
where the coefficients are constants computed from the preceding similarity score expression.
Now let us return to our problem of automatic weight assignment for action recognition. As discussed earlier, a good objective function would reflect the intuition that significant triplets should be assigned higher weights, while trivial triplets should be assigned lower weights. Suppose we have a training dataset which consists of action sequences for several different actions, each with a set of pre-aligned sequences performed by various individuals. Each action group contains a reference sequence together with the other sequences of that action. To find the optimal weight assignment for a given action, we define the objective function as:
[TABLE]
where and are non-negative constants and
[TABLE]
[TABLE]
[TABLE]
The optimal weights for the action are then computed using:
[TABLE]
In this objective function, we use the reference sequence of the action. The first and second terms are the mean and variance of the similarity scores between the reference sequence and the other sequences of the same action. The third term is the mean of the similarity scores between the reference sequence and all sequences of the other actions. Hence the optimal weights achieve high similarity scores for all sequences of the same action, and low similarity scores for sequences of different actions. The second term may be interpreted as a regularization term that ensures the consistency of sequences in the same group.
As the two mean terms are linear functions of the weights and the variance term is a quadratic polynomial, our objective function is a quadratic polynomial, and the optimization problem becomes a quadratic programming (QP) problem. There are a variety of methods for solving the QP problem, including interior point, active set, conjugate gradient, etc. In our problem, we adopted the conjugate gradient method, starting from a fixed set of initial weight values.
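The structure of the objective can be sketched as below, under explicit assumptions: each similarity score is treated as a linear function of the body-point weights with constant coefficients, and the three terms are combined as (mean over the same action) minus alpha times (variance over the same action) minus beta times (mean over other actions). The signs, the names `alpha` and `beta`, and the toy coefficients are illustrative assumptions rather than the paper's exact definition.

```python
def similarity(coeffs, weights):
    """Similarity score assumed linear in the weights: sum of c_p * w_p."""
    return sum(c * w for c, w in zip(coeffs, weights))


def objective(weights, same_coeffs, other_coeffs, alpha=1.0, beta=1.0):
    """Assumed objective: reward high, consistent similarity within the
    action and penalize similarity to sequences of other actions."""
    same = [similarity(c, weights) for c in same_coeffs]
    other = [similarity(c, weights) for c in other_coeffs]
    mean_same = sum(same) / len(same)
    var_same = sum((s - mean_same) ** 2 for s in same) / len(same)
    mean_other = sum(other) / len(other)
    return mean_same - alpha * var_same - beta * mean_other


# Toy coefficients for a 3-point model: point 0 is consistent within the
# action and dissimilar across actions, so weight on it should pay off.
same_coeffs = [[0.9, 0.5, 0.2], [0.9, 0.4, 0.8]]
other_coeffs = [[0.1, 0.5, 0.5], [0.2, 0.4, 0.6]]
uniform_score = objective([1 / 3, 1 / 3, 1 / 3], same_coeffs, other_coeffs)
focused_score = objective([1.0, 0.0, 0.0], same_coeffs, other_coeffs)
print(focused_score > uniform_score)  # True
```

The variance term is the only quadratic part, which is what makes the overall problem a quadratic program in the weights.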
V Experiments
In this section, we apply the proposed weighting-based approach to the action recognition problem, and compare its performance with the non-weighting methods proposed in [110, 109, 111]. For comparison, we use the same MoCap testing data as in [110, 109, 111], and build a MoCap training dataset which consists of sequences for 5 actions (walk, jump, golf swing, run, and climb): each action is performed by several subjects, and each instance of an action is observed by 17 cameras set up at different random locations. The same 11-point human body model in Figure 1 is adopted in the training data. We use the same set of reference sequences for the 5 actions, and align the sequences in the training set against the reference sequences.
To obtain the optimal weighting for each action, we first aligned all testing sequences against the reference sequence of that action, and stored the similarity scores of triplets for each pair of matched poses. The objective function is then built based on equation (11) and the computed similarity scores of triplets in the alignments. The objective is a 10-dimensional function, and the weights are constrained by
[TABLE]
The optimal weights are then searched to maximize the objective, starting from the initial weight values. The conjugate gradient method is then applied to solve this optimization problem.
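A constrained weight search of this kind can be sketched with projected gradient ascent. This is a swapped-in stand-in, not the conjugate gradient solver used in the paper. It assumes the constraint is that the weights are non-negative and sum to one (which is why 11 weights leave 10 free dimensions), and it assumes a uniform starting point. All names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold of the last valid k
    return [max(x - theta, 0.0) for x in v]


def numerical_gradient(f, w, eps=1e-6):
    """Central-difference gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def maximize_on_simplex(f, n, step=0.1, iterations=200):
    """Projected gradient ascent from uniform weights (assumed start)."""
    w = [1.0 / n] * n
    for _ in range(iterations):
        g = numerical_gradient(f, w)
        w = project_to_simplex([wi + step * gi for wi, gi in zip(w, g)])
    return w


# Toy concave quadratic whose maximizer on the simplex is known.
target = [0.7, 0.2, 0.1]
best = maximize_on_simplex(
    lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target)), n=3
)
print([round(x, 3) for x in best])  # [0.7, 0.2, 0.1]
```

For a real quadratic objective the analytic gradient would replace `numerical_gradient`; the projection step is what keeps every iterate a valid weight assignment.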
After performing the above steps for all the actions, we obtained a set of weights for each action in our database. In order to compare our results with existing unweighted methods, we then carried out full comparisons with two methods proposed recently in the literature: action recognition using the fundamental ratios [109], and the method proposed in [110, 111]. As these methods behave differently, the objective functions we obtain for estimating weights may contain slightly different sets of coefficients. Figure 6 shows the computed weights for walking and jumping for the two methods. Although the weights were slightly different, similar patterns emerged in terms of significant and trivial triplets, as shown in the figure: the same triplets have relatively high weights in both results.
We repeated all the action recognition experiments reported in [109] and [110, 111] using the CMU MoCap data and compared our performance. Results are summarized in Tables I and II for the two methods. The overall recognition rate is 93.6% using the weighted eigenvalue method, and 90.4% using the weighted fundamental ratios method, which are improvements of 2 and 8.8 percentage points over the unweighted cases, respectively.
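The overall recognition rates follow directly from the diagonal entries of the two confusion matrices. The short check below uses the correctly recognized counts from Tables I and II (walk, jump, golf swing, run, climb), with 50 test sequences per action, which is the sum of each table row; the function name is illustrative.

```python
def recognition_rate(correct_per_class, sequences_per_class):
    """Overall recognition rate in percent from per-class correct counts."""
    total = sequences_per_class * len(correct_per_class)
    return 100.0 * sum(correct_per_class) / total


# Diagonal (correct) counts for walk, jump, golf swing, run, climb.
weighted_eigenvalue = [46, 48, 48, 48, 44]
weighted_fundamental_ratios = [45, 47, 47, 45, 42]

print(round(recognition_rate(weighted_eigenvalue, 50), 1))          # 93.6
print(round(recognition_rate(weighted_fundamental_ratios, 50), 1))  # 90.4
```

Climbing is the hardest class for both methods, with the lowest diagonal count in each table.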
VI Conclusion
We propose a generic method of assigning weights to body points using a small training set, in order to improve action recognition from video data. Our method is motivated by studies in the applied perception community on selective processing of visual data for recognition. Our experimental results strongly support our hypothesis that weighting body points differently for different actions leads to significant improvement in performance. Furthermore, since our formulation is based on invariant features, our method shows outstanding performance in the presence of varying camera orientations and parameters. Finally, we believe similar frameworks can be applied to other body models such as silhouette, motion flow, and stick-model.
References
- [1] M. Ahmad and S.W. Lee. HMM-based Human Action Recognition Using Multiview Image Sequences. ICPR’06, 1:263–266, 2006.
- [2] Muhamad Ali and Hassan Foroosh. Natural scene character recognition without dependency on specific features. In Proc. International Conference on Computer Vision Theory and Applications, 2015.
- [3] Muhamad Ali and Hassan Foroosh. A holistic method to recognize characters in natural scenes. In Proc. International Conference on Computer Vision Theory and Applications, 2016.
- [4] Muhammad Ali and Hassan Foroosh. Character recognition in natural scene images using rank-1 tensor decomposition. In Proc. of International Conference on Image Processing (ICIP), pages 2891–2895, 2016.
- [5] Mais Alnasser and Hassan Foroosh. Image-based rendering of synthetic diffuse objects in natural scenes. In Proc. IAPR Int. Conference on Pattern Recognition, volume 4, pages 787–790, 2006.
- [6] Mais Alnasser and Hassan Foroosh. Rendering synthetic objects in natural scenes. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 493–496, 2006.
- [7] Mais Alnasser and Hassan Foroosh. Phase shifting for non-separable 2d haar wavelets. IEEE Transactions on Image Processing, 16:1061–1068, 2008.
- [8] Nazim Ashraf and Hassan Foroosh. Robust auto-calibration of a ptz camera with non-overlapping fov. In Proc. International Conference on Pattern Recognition (ICPR), 2008.
