An Analysis of Deep Neural Networks with Attention for Action Recognition from a Neurophysiological Perspective
Swathikiran Sudhakaran, Oswald Lanz

TL;DR
This paper reviews three deep learning methods for action recognition, comparing them from a neurophysiological perspective to explore analogies with human brain hypotheses.
Contribution
It provides a comparative analysis linking deep learning methods for action recognition to neurophysiological theories of brain function.
Findings
Identifies analogies between deep learning models and brain hypotheses
Highlights similarities in processing mechanisms
Suggests neurophysiological insights for improving models
Abstract
We review three recent deep learning-based methods for action recognition and present a brief comparative analysis of the methods from a neurophysiological point of view. We posit that there is some analogy between the three presented methods and some of the existing hypotheses regarding the functioning of the human brain.
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Neural dynamics and brain function · Functional Brain Connectivity Studies
Swathikiran Sudhakaran¹,² and Oswald Lanz¹
¹ Fondazione Bruno Kessler, Trento, Italy
² University of Trento, Trento, Italy
{sudhakaran,lanz}@fbk.eu
1 Introduction
The human visual system has the remarkable capability to accurately recognize an object present in a scene within a very short span of time, on the order of milliseconds [14]. This is achieved even in the presence of a wide range of identity-preserving transformations such as rotation, shifts in spatial position, and changes in color, size and view. Several studies have been conducted to understand the mechanism underlying this achievement and have led to several hypotheses, some of which are yet to be proved.
Computer vision researchers have tried to develop systems that can emulate the performance of the human visual system. Some of these approaches are inspired by the hypotheses and understandings developed by neuroscientists based on their study of the primate visual system. The most notable among these is the neocognitron [4], based on the primate visual model proposed by Hubel and Wiesel [6]. The neocognitron inspired the development of Convolutional Neural Networks (CNNs) [8], which revolutionised the area of Deep Learning (DL) and resulted in the development of CNNs that can rival human performance in image recognition tasks [5, 13, 1].
Recently, neuroscientists have started to analyze Deep Neural Networks to obtain a more detailed understanding of the functioning of primate visual systems by studying the similarities between the representations generated by both systems. These studies have confirmed that the representations of the visual scene generated by CNNs are similar to the ones generated in the brain. Similar objects lie closer together, and dissimilar objects farther apart, in the representational space of both systems [1]. Further studies have also confirmed that the ventral stream of the visual system, which is responsible for object recognition, builds its representation of the visual scene from the light entering the eyes in a hierarchical manner, similar to the hierarchical structure of CNNs [3, 7, 18].
This extended abstract tries to continue this study from an analytical point of view by comparing existing hypotheses about the functioning of the visual system in primates to the improvements obtained by recent DL approaches after adopting these hypotheses. The contributions include a brief review of our recent works [12, 11, 10] for action recognition from videos; an analysis of the above papers from a neurophysiological point of view; and an attempt to compare them with some of the hypotheses developed by neuroscientists regarding the functioning of the brain.
2 Computer Vision Perspective
2.1 Object-centric Attention (Ego-RNN)
In our paper [11], we present a combined Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architecture that is trained in a weakly supervised setting to predict the raw video-level activity-class label associated with the clip. Our CNN backbone is pre-trained for generic image recognition and augmented on top with an attention mechanism that uses class activation maps for spatially selective feature extraction. The memory tensor of a Convolutional Long Short-Term Memory (ConvLSTM) then tracks the discriminative frame-based features distilled from the video for activity classification. Our design choices are grounded in fine-grained activity recognition because: (i) frame-based activation maps are not bound to reflect image recognition classes, as they develop their own representation classes implicitly while training for video-level classification; and (ii) the ConvLSTM maintains the spatial structure of the input sequence all the way up to the final video descriptor used by the activity classification layer, thus facilitating the spatio-temporal encoding of objects and their locations into the descriptor as they develop into the activity over time.
2.2 Long Short-Term Attention (LSTA)
In the method proposed in [11], the attention maps are generated independently for each frame. This can result in the network attending to different regions in adjacent frames. In order to address this limitation, we derive LSTA, a new recurrent neural unit that augments LSTM with built-in recurrent spatial attention and a revised output gating. The first enables LSTA to attend to the feature regions of interest, while the second constrains it to expose a distilled view of the internal memory. Our study also confirms that it is effective to improve the output gating of the recurrent unit, since it not only affects the overall prediction but also controls the recurrence, being responsible for a smooth and focused tracking of the latent memory state across the sequence. This output pooling applies attention on the RNN memory, thereby enabling the network to localize the relevant spatio-temporal patterns present in the video. Fig. 1 shows the attention maps generated by Ego-RNN and LSTA on a video sequence from the GTEA 61 dataset.
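The limitation noted above, per-frame attention maps that can jump between regions in adjacent frames, can be illustrated with a toy example. The sketch below is not the LSTA unit from [10]: it swaps in a much simpler stand-in, an exponential moving average over attention logits, only to show how carrying attention state across frames damps such jumps. The function names and the smoothing factor `beta` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def independent_attention(frame_logits):
    """Attention computed separately for every frame (no memory)."""
    return [softmax(logits) for logits in frame_logits]

def recurrent_attention(frame_logits, beta=0.7):
    """Toy stand-in for recurrent attention (NOT the LSTA equations):
    the logits are smoothed with an exponential moving average, so the
    attention at frame t depends on the attention state at frame t-1."""
    state, maps = None, []
    for logits in frame_logits:
        if state is None:
            state = list(logits)
        else:
            state = [beta * s + (1.0 - beta) * l for s, l in zip(state, logits)]
        maps.append(softmax(state))
    return maps

def peak(attn):
    """Index of the most attended region."""
    return max(range(len(attn)), key=attn.__getitem__)

# Attention logits over 3 regions for 4 frames; frame 3 is a noisy outlier.
frames = [[2.0, 0.0, 0.0], [2.0, 0.5, 0.0], [0.0, 2.2, 0.0], [2.0, 0.0, 0.0]]
print([peak(a) for a in independent_attention(frames)])
print([peak(a) for a in recurrent_attention(frames)])
```

In this toy sequence the independent maps switch region on the outlier frame, while the smoothed maps stay on the same region throughout.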
2.3 Top-Down Attention VLAD (TA-VLAD)
Our recently published paper [12] presents an end-to-end trainable deep architecture that integrates top-down spatial attention with temporally aggregated VLAD encoding for action recognition in videos. TA-VLAD uses (i) class-specific activation maps obtained from a deep CNN pre-trained for image recognition as the spatial attention mechanism, (ii) a latent cluster representation of the feature space, obtained using Vector of Locally Aggregated Descriptors (VLAD) encoding, and (iii) Gated Recurrent Units (GRUs) for temporal encoding in the cluster space. TA-VLAD can be trained end-to-end using video-level annotations; that is, the parameters of (i) and (iii), together with the compact representation of the feature space (ii), are learned from videos paired with action class labels. Fig. 2 shows the attention maps generated by the network on some frames from the HMDB51 dataset.
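As a rough illustration of component (ii), the sketch below computes a VLAD descriptor for one frame: each local descriptor contributes its residual to every cluster center, weighted by an assignment score. It assumes a soft assignment (softmax over negative squared distances, with a hypothetical sharpness parameter `alpha`), since a differentiable assignment is a common way to make VLAD trainable end-to-end; the exact formulation used in [12] may differ. All function names are hypothetical.

```python
import math

def soft_assign(x, centers, alpha=1.0):
    """Softmax over negative squared distances from x to each center."""
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vlad_encode(descriptors, centers, alpha=1.0):
    """Return a K x D matrix: row k is the assignment-weighted sum of
    residuals (x - c_k) over all local descriptors x."""
    K, D = len(centers), len(centers[0])
    vlad = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        a = soft_assign(x, centers, alpha)
        for k in range(K):
            for d in range(D):
                vlad[k][d] += a[k] * (x[d] - centers[k][d])
    return vlad

def l2_normalize_rows(mat, eps=1e-12):
    """Per-cluster (intra) L2 normalization of the VLAD matrix."""
    out = []
    for row in mat:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / (norm + eps) for v in row])
    return out

# Two well-separated centers and three local descriptors.
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 11.0]]
print(l2_normalize_rows(vlad_encode(descriptors, centers)))
```

Keeping the descriptor as a K x D matrix, one row per cluster, is what allows each cluster to be encoded over time separately.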
3 Neurophysiological Perspective
3.1 Top-down Attention
Studies on the human brain have shown that there is a limit to the number of objects that can be processed simultaneously [2]. As a result, the brain selects the relevant regions in the scene to generate an effective representation. This is achieved by the attention mechanism present in the brain. Studies have confirmed that the human brain employs two types of attention mechanisms to select relevant regions present in the scene, namely bottom-up attention and top-down attention [16]. Bottom-up attention is triggered by the salient features of the scene such as color and shape whereas top-down attention is based on the prior information present in the brain which results in a bias to select some regions over others.
In [11, 12, 10], we apply top-down attention on the CNN features obtained from each frame to weight the relevant regions present in the frame. The top-down attention is generated from Class Activation Maps (CAMs) obtained from a CNN pre-trained for object classification. In these networks, each frame is first passed through an ImageNet pre-trained CNN to obtain class-category scores. The CAM of the class category with the highest score is then used to generate the attention map. This has some analogy to the top-down attention mechanism in the primate brain, which selects regions in the scene based on internal bias and goals. Empirical studies have shown that weighting the regions present in the scene in this way improves the action recognition performance of the network.
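The attention step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from [11, 12, 10]: it assumes the standard CAM setup (global average pooling followed by a bias-free linear classifier), uses plain nested lists instead of tensors, and normalizes the CAM with a spatial softmax, which is one plausible choice. All function names are hypothetical.

```python
import math

def class_activation_map(features, class_weights):
    """CAM for one class: channel-weighted sum at every spatial location.
    features is C x H x W; class_weights has length C."""
    C, H, W = len(features), len(features[0]), len(features[0][0])
    return [[sum(class_weights[c] * features[c][i][j] for c in range(C))
             for j in range(W)] for i in range(H)]

def spatial_softmax(cam):
    """Turn a CAM into attention weights that sum to 1 (assumed choice)."""
    m = max(v for row in cam for v in row)
    exps = [[math.exp(v - m) for v in row] for row in cam]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def top_down_attention(features, fc_weights):
    """fc_weights is K x C (one row per object class).
    Returns (winning class, attention map, attended features)."""
    C, H, W = len(features), len(features[0]), len(features[0][0])
    # Class scores via global average pooling + linear classifier.
    pooled = [sum(sum(row) for row in features[c]) / (H * W) for c in range(C)]
    scores = [sum(w[c] * pooled[c] for c in range(C)) for w in fc_weights]
    top = max(range(len(scores)), key=scores.__getitem__)
    # The CAM of the highest-scoring class becomes the attention map.
    attn = spatial_softmax(class_activation_map(features, fc_weights[top]))
    attended = [[[features[c][i][j] * attn[i][j] for j in range(W)]
                 for i in range(H)] for c in range(C)]
    return top, attn, attended

# Toy frame: 2 channels on a 2 x 2 grid, 2 object classes.
features = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
fc_weights = [[1.0, 0.0], [0.0, 1.0]]
top, attn, attended = top_down_attention(features, fc_weights)
print(top, attn)
```

The attention here is "top-down" in the sense that it is driven by the classifier's prior knowledge of object classes rather than by low-level saliency in the frame.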
The majority of existing studies comparing the representational similarities of CNNs and the primate brain consider the object recognition task. Comparative studies of the representations generated by the brain on the action recognition task and by CNN-RNN architectures like ours could shed some light on how spatio-temporal information is processed in the brain and may assist in the further development of effective action recognition techniques. Such a study could also help reveal and explain the benefit of the output pooling introduced in Long Short-Term Attention (LSTA) [10].
3.2 Multiple Pathway Hypothesis
The multiple pathway hypothesis states that there are several parallel information streams in the brain that carry information from one region to another for further processing. It is assumed that these streams are weighted differently and that there might be complex interactions between them, which result in the final representation of the scene in the inferotemporal (IT) cortex of the brain [17, 9].
In TA-VLAD [12], we encode the temporal evolution of the features corresponding to each of the cluster centers separately, using a network of GRU layers. This is comparable to the multiple pathway hypothesis proposed for the primate visual system. Experiments with a single GRU layer that encodes the flattened feature descriptor obtained by combining all the cluster centers significantly reduced the performance of the network. On top of this, the top-down attention allows the network to focus on the relevant regions in the video, specifically the objects present in the scene. The same approach of encoding the cluster representation using multiple streams could be further investigated in the context of LSTA [10].
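The per-cluster temporal encoding can be sketched as a set of parallel recurrent streams, one per cluster center. The code below is a simplified illustration, not the TA-VLAD implementation: it assumes one independent GRU cell per cluster (whether [12] shares parameters across clusters is not stated here), omits biases, uses random untrained weights, and concatenates the final hidden states. The class and function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class GRUCell:
    """Minimal GRU cell without biases (random, untrained weights)."""
    def __init__(self, input_size, hidden_size, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        self.hidden_size = hidden_size
        # Input and hidden weights for update (z), reset (r), candidate (n).
        self.Wz, self.Uz = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wr, self.Ur = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wn, self.Un = mat(hidden_size, input_size), mat(hidden_size, hidden_size)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.Wn, x), matvec(self.Un, rh))]
        # Blend the previous state and the candidate state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def encode_per_cluster(sequence, cells):
    """sequence: T frames, each a K x D matrix (one row per cluster).
    cells: K GRU cells, one stream per cluster. Returns the K final
    hidden states concatenated into a single video descriptor."""
    hidden = [[0.0] * cell.hidden_size for cell in cells]
    for frame in sequence:
        hidden = [cell.step(frame[k], hidden[k]) for k, cell in enumerate(cells)]
    return [v for h in hidden for v in h]

rng = random.Random(0)
K, D, H, T = 3, 4, 2, 5
cells = [GRUCell(D, H, rng) for _ in range(K)]
sequence = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(K)] for _ in range(T)]
print(len(encode_per_cluster(sequence, cells)))
```

In this sketch the streams do not interact until their final states are concatenated, so a change in one cluster's features over time leaves the other streams' encodings untouched. The single-GRU baseline mentioned above would instead feed the flattened K·D vector into one cell.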
4 Conclusion
In this extended abstract, we presented three recent works based on deep learning for addressing the problem of action recognition. We also made an analytical comparison of the proposed methods with the existing hypotheses and understandings regarding the functioning of the human visual system. From the comparative study, it is seen that applying an attention mechanism is beneficial for improving action recognition performance. However, the presented works apply only top-down attention on the CNN features, while the primate brain makes use of both bottom-up and top-down attention mechanisms for focusing on the relevant objects or regions in the scene. Recently, Tu et al. [15] found that there is a dynamic switching between bottom-up and top-down attention during the dynamic decision-making process, which suggests that DNNs could also leverage both attention mechanisms to improve their performance.
References
- [1] C. F. Cadieu, H. Hong, D. Yamins, N. Pinto, D. Ardila, E. Solomon, N. Majaj, and J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12), 2014.
- [2] J. Duncan. Selective attention and the organization of visual information. Journal of Experimental Psychology: General, 113(4):501, 1984.
- [3] M. Eickenberg, A. Gramfort, G. Varoquaux, and B. Thirion. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage, 152:184–194, 2017.
- [4] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455–469, 1982.
- [5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
- [6] D. Hubel and T. Wiesel. Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proceedings of the Royal Society of London. Series B, Biological Sciences, pages 1–59, 1977.
- [7] S. Kheradpisheh, M. Ghodrati, M. Ganjtabesh, and T. Masquelier. Deep networks can resemble human feed-forward vision in invariant object recognition. Scientific Reports, 6:32672, 2016.
- [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
