Efficient Modelling Across Time of Human Actions and Interactions
Alexandros Stergiou

TL;DR
This thesis proposes spatio-temporal modeling techniques for video-based human action recognition that emphasize variable temporal receptive fields, stronger class distinction, and explainability, and it reports competitive results at reduced computational cost.
Contribution
The thesis introduces size-varying video segments for better temporal modeling, a feature amplification regularization for class distinction, and a human-understandable explanation method for learned spatio-temporal features.
Findings
Achieves competitive accuracy on benchmark datasets.
Reduces computational complexity compared to state-of-the-art methods.
Provides visual explanations for learned spatio-temporal features.
Abstract
This thesis focuses on video understanding for human action and interaction recognition. We start by identifying the main challenges related to action recognition from videos and review how they have been addressed by current methods. Based on these challenges, and by focusing on the temporal aspect of actions, we argue that current fixed-size spatio-temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input. Our contributions are based on the enlargement of the convolutional receptive fields through the introduction of spatio-temporal size-varying segments of videos, as well as the discovery of local feature relevance over the entire video sequence. The resulting extracted features encapsulate information that includes the importance of local features across multiple temporal durations, as well as the entire…
