The role of ego vision in view-invariant action recognition
Gaurvi Goyal, Nicoletta Noceti, Francesca Odone, Alessandra Sciutti

TL;DR
This paper investigates how ego-vision influences view-invariant action recognition, leveraging transfer learning in CNNs to understand the capabilities and limitations of egocentric data for recognizing actions across different viewpoints.
Contribution
It introduces a transfer learning approach in CNNs to analyze view-invariant action recognition in egocentric videos, highlighting the peculiarities and potential of ego-vision.
Findings
Transfer learning improves view-invariance in egocentric action recognition.
Ego-vision data presents unique challenges for view-invariant recognition.
The study provides insights into the limitations of current CNN-based methods for egocentric data.
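Table 1: Validation accuracy (%) on MoCA for each Source→Target view configuration (V0 lateral, V1 egocentric, V2 frontal).

| Source→Target | 0→0 | 1→1 | 2→2 | 0,1,2→0,1,2 | 0,1→2 | 0,2→1 | 1,2→0 | 0→1 | 0→2 | 1→0 | 1→2 | 2→0 | 2→1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|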
| SLP | 93.25 | 91.11 | 92.70 | 87.37 | 68.33 | 46.03 | 68.10 | 47.38 | 68.33 | 47.38 | 32.86 | 66.27 | 34.84 |
| 3DConv | 96.25 | 96.35 | 96.43 | 94.81 | 62.30 | 61.67 | 62.70 | 50.63 | 64.84 | 33.10 | 36.35 | 61.67 | 54.92 |
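
Table 2: Accuracy (%) on the IXMAS benchmark for each Source→Target view configuration.

| Source→Target | Mean | 0→1 | 0→4 | 1→4 | 2→4 | 3→4 | 4→0 | 4→1 | 4→2 | 4→3 | 0,3→4 | 0,1,2,3→4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|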
| DT [20] | 61.7 | 93.9 | 27.6 | 22.4 | 53.3 | 34.8 | 42.1 | 25.8 | 63.3 | 48.8 | – | – |
| Hankelets [12] | 56.4 | 83.7 | 33.6 | 26.9 | 60.1 | 31.2 | 39.6 | 32.8 | 68.1 | 37.4 | – | – |
| SLP | 69.4 | 84.4 | 48.8 | 47.0 | 66.0 | 45.0 | 53.3 | 56.6 | 69.3 | 53.5 | 57.3 | 62.8 |
| 3DConv | 68.5 | 89.0 | 44.4 | 42.5 | 61.3 | 45.4 | 48.6 | 49.1 | 57.9 | 46.7 | 49.2 | 57.9 |
The role of ego vision in view-invariant action recognition
Gaurvi Goyal, Nicoletta Noceti, Francesca Odone
DIBRIS - University of Genova, IT
Alessandra Sciutti
Istituto Italiano di Tecnologia
Abstract
Analysis and interpretation of egocentric video data is becoming more and more important with the increasing availability and use of wearable cameras. Exploring and fully understanding affinities and differences between ego and allo (or third-person) vision is paramount for the design of effective methods to process, analyse and interpret egocentric data. In addition, a deeper understanding of ego-vision and its peculiarities may enable new research perspectives in which first-person viewpoints can act either as a means of easily acquiring large amounts of data to be employed in general-purpose recognition systems, or as a challenging test-bed to assess the usability of techniques specifically tailored to allocentric vision in more challenging settings. Our work, with an eye to cognitive science findings, leverages transfer learning in Convolutional Neural Networks to demonstrate capabilities and limitations of an implicitly learnt view-invariant representation in the specific case of action recognition.
1 Introduction
Action recognition is a core topic in computer vision with applications in a variety of artificial intelligence systems, including human-computer interaction, robotics and video surveillance, to name a few. Action recognition research is currently making considerable leaps forward, with better algorithms and models being proposed more and more frequently. Among the open problems the research community is dealing with, we focus on tolerance to viewpoint changes. This property is not easily obtained in recognition tasks, and requires special care. In the domain of ego-vision, view-invariant action recognition is an important element with two different implications: first, ego-vision systems may provide us with large amounts of streaming data, which could fuel general-purpose recognition systems; conversely, in the design of algorithms for an ego-vision system, one may want to incorporate information learnt from allocentric vision data. While action recognition from the egocentric view has been explored to some degree [16, 18], its view-invariant aspect remains largely untouched. Over the years, the problem of view-invariant motion recognition has been addressed in two different settings, i.e. observing the same dynamic event simultaneously from multiple cameras [27, 13] or considering independent instances of the same dynamic concept [10, 9].
Methodologically, early works approached it as an epipolar geometry problem [19, 24], while later works can be categorized into methods acting at a descriptor level, which design representations explicitly embedding view-invariant information [10, 12, 9, 16], or at a similarity level [23, 8, 26]. This category also includes methods addressing the problem with a transfer learning formulation, from one view to another [27], or to a common virtual view, sometimes in a 3D reference frame [13]. In the last decade, the availability of affordable 3D acquisition systems has facilitated approaches combining multiple types of information, e.g. videos and skeletal data (for more details see [17]).
View-invariant action recognition plays a crucial role in humans, supporting the capability to solve the correspondence problem, i.e., identifying a mapping between others’ actions and one’s own, which is necessary for crucial activities like social learning, imitation or mimicry [15]. Results in neuroscience indicate that such view invariance is a property of higher-order visual areas, such as the Superior Temporal Sulcus (STS) [7], which could however also be supported by pooling together the responses of other view-dependent areas. Indeed, studies in the macaque brain suggest that view-dependent mirror neurons in the premotor cortex (area F5) play an essential role in the formation of view-invariant representations. Alternatively, it has been speculated that a top-down stream of information from view-dependent mirror neurons might modulate the activity of visual representations in the STS, reinforcing the processing of visual patterns that are associated with different views of the same action [2].
Although some human perceptual abilities are immediately available, being part of an innate background of skills, a large portion of them is acquired over time, leveraging what we can generally call the “human experience”. In modern artificial intelligence, its role is often played by a large amount of data. With the success of data-driven methods, especially deep learning, the variety of the available data has a corresponding effect on the variety and effectiveness of applications. Very complex architectures leverage the availability of large datasets, which allow us to learn not only input-output relationships with good generalization properties, but also multiple intermediate representations that can be exploited to address other tasks through transfer learning.
The ability of pre-trained deep neural networks to extract relevant information from new data is documented [22] and applied often, but only recently in action recognition [4, 21].
Considering this context, in our work we assess the potential of pre-trained features in mimicking the role of view-dependent neurons and view-invariant higher-level descriptions. We consider the MoCA (Multimodal Cooking Actions) dataset [14], specifically acquired to study view invariance in both artificial and biological systems (the dataset will soon be made available to the research community). It includes three different views (one egocentric and two allocentric) of a set of 20 different upper-body activities (see Fig. 1). We discuss the effectiveness of intermediate pre-trained features in dealing with different degrees of view invariance, with specific reference to situations in which the egocentric view is involved.
2 Methodology
Our approach is based on learning intermediate-level features with the help of a pre-trained architecture, and using this representation as input to a multi-class classification architecture that depends on the specific task of interest. To learn the representation we consider a variant of the Inception 3D model [4] that takes optical flow estimates as input and uses 3D convolutional filters to incorporate and compress both spatial and temporal information. The model is pre-trained on the ImageNet dataset [5] and on Kinetics-400 [11]. Once trained, the network may be seen as a multi-resolution representation of image sequences.
To pinpoint an appropriate extraction point for intermediate features, we look for intermediate layers that should produce representations tolerant to viewpoint changes without being too tied to a specific classification task; hence a point 2 layers before the end was selected (see [4] for details). Thereafter, for a given multi-class classification task, segmented video clips of the actions are used as inputs to the action recognition pipeline. From them, the optical flow is extracted using the TV-L1 algorithm [25]. The optical flow is fed into the trained Inception 3D model, and the activations, i.e. the learnt intermediate spatio-temporal features, are then fed to a multi-class classifier. In Section 3, we compare results obtained with two classifiers of different complexity: a Single Layered Perceptron (SLP) and a convolutional neural network (3DConv). The simple SLP allows us to comment on the intrinsic ability of the learnt features to deal with view invariance and with the complexity of ego-vision. The more complex 3DConv shows the potential of the approach under different challenging classification tasks.
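To make the last stage of the pipeline concrete, the following is a minimal sketch of a single-layer classifier operating on pre-extracted feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the paper does not give the SLP's training details, so a softmax output, cross-entropy loss and plain stochastic gradient descent are assumed here, and `toy_features` is a hypothetical stand-in for the Inception 3D activations.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SingleLayerPerceptron:
    """One linear layer plus softmax, trained with cross-entropy and SGD.

    Assumption: the paper does not specify these training details.
    """

    def __init__(self, n_features, n_classes):
        self.weights = [[0.0] * n_features for _ in range(n_classes)]
        self.biases = [0.0] * n_classes

    def predict_proba(self, x):
        scores = [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weights, self.biases)
        ]
        return softmax(scores)

    def predict(self, x):
        probs = self.predict_proba(x)
        return probs.index(max(probs))

    def fit(self, xs, ys, lr=0.5, epochs=50, seed=0):
        rng = random.Random(seed)
        order = list(range(len(xs)))
        for _ in range(epochs):
            rng.shuffle(order)
            for i in order:
                probs = self.predict_proba(xs[i])
                for c, p in enumerate(probs):
                    # Gradient of the cross-entropy loss w.r.t. the class-c score.
                    grad = p - (1.0 if c == ys[i] else 0.0)
                    self.biases[c] -= lr * grad
                    for j, v in enumerate(xs[i]):
                        self.weights[c][j] -= lr * grad * v
        return self


def toy_features(n_per_class, n_classes, n_features, seed):
    """Hypothetical stand-in for pre-extracted features: class c is centred on axis c."""
    rng = random.Random(seed)
    xs, ys = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 0.1) for _ in range(n_features)]
            x[c] += 1.0
            xs.append(x)
            ys.append(c)
    return xs, ys


def accuracy(model, xs, ys):
    hits = sum(model.predict(x) == y for x, y in zip(xs, ys))
    return hits / len(ys)


train_x, train_y = toy_features(20, 3, 8, seed=0)
test_x, test_y = toy_features(20, 3, 8, seed=1)
slp = SingleLayerPerceptron(n_features=8, n_classes=3).fit(train_x, train_y)
print(f"toy accuracy: {accuracy(slp, test_x, test_y):.2f}")
```

Because such a classifier has no hidden layers, its accuracy mostly reflects how linearly separable the pre-trained features already are, which is why a simple model of this kind is informative about the features themselves.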
3 Results
The core of our experimental analysis focuses on the MoCA dataset [14], consisting of 20 cooking action primitives, involving one or two arms of a volunteer, with subtle differences between different actions. The dataset comprises synchronized videos of actions from 3 different viewpoints (see Fig. 1): Lateral (V0), Egocentric (V1), and Frontal (V2). Training (TR) and Test (TE) sequences are available for each action and viewpoint. In different iterations of the experiment, we trained the classifiers with a variety of subsets of the TR split and tested on subsets of the TE split. Validation splits were processed using a batch-wise protocol with batch normalization parameters calculated per batch.
The resulting validation accuracies are shown in Table 1. We carried out a set of baseline experiments, where TR and TE come from the same view ({i→i}, with i ∈ {0, 1, 2}). We also include another baseline ({0,1,2→0,1,2}), where a view-invariant model is obtained simply by training the classifier on multiple views. In all these cases, classification performance is high. Next, we consider a one-view-out protocol, in which the classifiers are trained with 2 viewpoints and tested on the third; in this case, there is a notable and expected drop in the capability of the classifiers to correctly classify the actions, but considering that they are not explicitly trained to identify actions view-invariantly, this drop is modest and unsurprising. Notice in particular how the egocentric view is the hardest to classify if it does not participate in the training phase.
Finally, we adopt a one-one protocol, training classifiers on a single viewpoint and evaluating on another, to analyse view-to-view relationships. When both views are allocentric ({0→2}, {2→0}), the resulting values are almost as high as in the one-view-out experiments. But in all cases where V1 is involved in the one-one protocol ({0→1}, {1→0}, {1→2}, {2→1}), there is a noticeable drop in performance. The results highlight the specific challenge in dealing with view invariance when ego-vision is one of the views considered. This appears understandable, considering the smaller amount of dynamic information included in the ego view, but it is also in contrast with findings in cognitive science. Indeed, recent neuroscientific literature indicates that not all views are equally important. The first-person view seems to have a prominent role with respect to other perspectives in terms of responsiveness in the sensorimotor areas of the brain during action observation [1] and has been shown to facilitate certain forms of action understanding (e.g., estimating the size of an object to be grasped) [3]. Beyond the egocentric perspective, the frontal view also seems to have a peculiar role, eliciting a stronger activity in the ventral premotor cortex compared with the lateral view, suggesting a preference for “face-to-face interactions” [6].
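The three evaluation protocols described above (same-view and all-view baselines, one-view-out, and one-one) can be enumerated mechanically. The sketch below is only an illustration of the train/test view combinations; the function names are hypothetical, and the view indices follow the V0/V1/V2 convention given in the text.

```python
from itertools import combinations, permutations

# View indices as used for MoCA in the text.
VIEWS = {0: "lateral", 1: "egocentric", 2: "frontal"}
EGO = 1


def evaluation_splits(views):
    """Return (train_views, test_views) pairs for every protocol."""
    views = sorted(views)
    splits = []
    # Baselines: train and test on the same single view.
    splits += [((v,), (v,)) for v in views]
    # Baseline: train and test on all views together.
    splits.append((tuple(views), tuple(views)))
    # One-view-out: train on all views but one, test on the held-out view.
    for train in combinations(views, len(views) - 1):
        held_out = tuple(v for v in views if v not in train)
        splits.append((train, held_out))
    # One-one: train on one view, test on a different one.
    splits += [((src,), (dst,)) for src, dst in permutations(views, 2)]
    return splits


def label(train, test):
    """Compact 'train->test' label, e.g. '0,2->1'."""
    return ",".join(map(str, train)) + "->" + ",".join(map(str, test))


def ego_is_unseen_target(train, test):
    """True when the egocentric view is tested but never trained on."""
    return EGO in test and EGO not in train


splits = evaluation_splits(VIEWS)
for train, test in splits:
    flag = "  (ego unseen)" if ego_is_unseen_target(train, test) else ""
    print(label(train, test) + flag)
```

For three views this yields 13 configurations, matching the number of result columns in Table 1; three of them (`0,2->1`, `0->1`, `2->1`) test on an egocentric view that was never seen in training.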
Figure 3 shows the confusion matrix for the above experiment in the case where the 3DConv classifier is trained on the egocentric view V1 and tested on V2. Notice that carrot (grating carrots) is almost always classified as cut (cutting bread). The motion of the two actions is very similar from these two perspectives. Also note that many actions are often confused with the eating action. This is probably because the face is not visible in either view, so the limited information available makes it very easy to confuse eating with actions like transporting (moving an object across the table).
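As a reminder of how such a confusion analysis can be carried out, the sketch below counts (true, predicted) label pairs and lists the most frequent off-diagonal cells. The label sequences are hypothetical toy values chosen to mimic the kind of confusion discussed above; they are not the paper's data, and the helper names are illustrative.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs: a sparse confusion matrix."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused(counts, top=3):
    """Off-diagonal cells, most frequent first (ties broken alphabetically)."""
    errors = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:top]


def recall(counts, cls):
    """Fraction of clips of class `cls` that were predicted as `cls`."""
    total = sum(n for (true, _), n in counts.items() if true == cls)
    return counts[(cls, cls)] / total if total else 0.0


# Hypothetical predictions, NOT the paper's data.
true = ["carrot"] * 4 + ["cut"] * 4 + ["transporting"] * 3
pred = (
    ["cut", "cut", "cut", "carrot"]
    + ["cut"] * 4
    + ["eating", "transporting", "transporting"]
)

counts = confusion_counts(true, pred)
print(most_confused(counts))
print(recall(counts, "carrot"))
```

Sorting the off-diagonal cells makes systematic confusions between pairs of actions stand out without having to inspect the full 20-by-20 matrix.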
We conclude by reporting a further set of experiments, carried out with the same protocol on the IXMAS benchmark (see Fig. 2). This dataset does not include an egocentric view, but incorporates instead a top view which is very different from the others. The results reported in Table 2 confirm the observation that the architecture exhibits a good amount of view invariance, in particular for views that are more likely to be observed. It is instead less robust on view 4, the top view, which is less common. Similarly to biological systems, our architecture appears to be better tuned for a set of more likely viewpoints.
4 Discussion
Our analysis suggests that the relationship between egocentric and individual allocentric viewpoints is significantly weaker than the relationship among allocentric viewpoints (even if they have been acquired from widely different perspectives). This could be explained by the reduced amount of information conveyed by egocentric data, which in biological vision is compensated by proprioception, i.e. the awareness of the position and movements of one’s own body.
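The size of this gap can be checked with simple arithmetic on the one-one columns of Table 1. The sketch below copies those accuracies, reading each column label as training view followed by test view, and averages them separately for pairs that involve the egocentric view V1 and pairs that do not; the grouping and the helper name are illustrative choices, not part of the original analysis.

```python
from statistics import mean

EGO = 1  # V1 is the egocentric view

# One-one accuracies (%) copied from Table 1, keyed by (train_view, test_view).
ONE_ONE = {
    "SLP": {
        (0, 1): 47.38, (0, 2): 68.33, (1, 0): 47.38,
        (1, 2): 32.86, (2, 0): 66.27, (2, 1): 34.84,
    },
    "3DConv": {
        (0, 1): 50.63, (0, 2): 64.84, (1, 0): 33.10,
        (1, 2): 36.35, (2, 0): 61.67, (2, 1): 54.92,
    },
}


def group_means(accuracies):
    """Mean accuracy for allocentric-only pairs and for pairs involving V1."""
    allo = [a for pair, a in accuracies.items() if EGO not in pair]
    ego = [a for pair, a in accuracies.items() if EGO in pair]
    return mean(allo), mean(ego)


for name, accuracies in ONE_ONE.items():
    allo, ego = group_means(accuracies)
    print(f"{name}: allo-allo {allo:.2f}, ego-involved {ego:.2f}, gap {allo - ego:.2f}")
```

With these values, the allocentric-only pairs average about 67.3 (SLP) and 63.3 (3DConv), against about 40.6 and 43.8 for pairs involving the egocentric view, i.e. a gap of roughly 20 to 27 points.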
However, the relationship still exists, as demonstrated by the ability of the classifiers to recognise actions to some extent despite not having any significant information about the egocentric view and how different the actions look from this viewpoint. It is a very interesting observation that the combination of two allocentric viewpoints was able to train the 3DConv classifier well enough to identify actions from the egocentric viewpoint with almost the same accuracy as in other scenarios with unseen viewpoints. This apparent ability deserves further investigation on wider multi-view datasets, to assess the generality of our observations.
Acknowledgment
Some results incorporated in this publication have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, G.A. No 804388.
References
- [1] M. Angelini, M. Fabbri-Destro, N. F. Lopomo, M. Gobbo, G. Rizzolatti, and P. Avanzini. Perspective-dependent reactivity of sensorimotor mu rhythm in alpha and beta ranges during action observation: An EEG study. Scientific Reports, 2018.
- [2] V. Caggiano, L. Fogassi, G. Rizzolatti, J. K. Pomper, P. Thier, M. A. Giese, and A. Casile. View-based encoding of actions in mirror neurons of area F5 in macaque premotor cortex. Current Biology, 21(2):144–148, 2011.
- [3] F. Campanella, G. Sandini, and M. C. Morrone. Visual information gleaned by observing grasping movement in allocentric and egocentric perspectives. Proceedings of the Royal Society B: Biological Sciences, 278(1715):2142–2149, 2010.
- [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [6] S. Ferri, K. Pauwels, G. Rizzolatti, and G. Orban. Stereoscopically observing manipulative actions. Cerebral Cortex, 26(8):3591–3610, 2016.
- [7] E. D. Grossman, N. L. Jardine, and J. A. Pyles. fMR-adaptation reveals invariant coding of biological motion on human STS. Frontiers in Human Neuroscience, 2010.
- [8] C.-H. Huang, Y.-R. Yeh, and Y.-C. F. Wang. Recognizing actions across cameras by exploring the correlated subspace. In ECCV, 2012.
