Leveraging Local Temporal Information for Multimodal Scene Classification
Saurabh Sahu, Palash Goyal

TL;DR
This paper introduces a novel self-attention block that captures local and global temporal relationships in videos, improving scene classification by better understanding frame-level context, especially on large-scale datasets.
Contribution
It proposes a new self-attention mechanism that exploits local and global temporal information for enhanced video understanding in transformer models.
Findings
Improved accuracy on the YouTube-8M dataset.
Effective modeling of local and global temporal relationships.
Enhanced frame-level representations for video classification.
Abstract
Robust video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively. Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks. However, the use of Transformer-based models for video understanding is still relatively unexplored. Moreover, these models fail to exploit the strong temporal relationships between neighboring video frames to obtain potent frame-level representations. In this paper, we propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames. This enables the model to understand the video at various granularities.…
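The abstract does not specify how the proposed block combines local and global context, so the following is only an illustrative sketch of the general idea, not the authors' architecture. It assumes identity query/key/value projections, a single head, a hypothetical `window` parameter that limits local attention to neighboring frames, and a plain average of the local and global outputs; all of these are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_global_attention(frames, window=2):
    """Toy single-head attention over frame features.

    frames: array of shape (T, d), one feature vector per video frame.
    window: hypothetical parameter; each frame's local branch attends
            only to frames at most `window` steps away.
    Returns an array of shape (T, d).
    """
    T, d = frames.shape
    # Scaled dot-product scores (identity projections, an assumption).
    scores = frames @ frames.T / np.sqrt(d)

    # Global branch: every frame attends to every other frame.
    global_out = softmax(scores) @ frames

    # Local branch: mask out frames outside the temporal window.
    idx = np.arange(T)
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= window
    local_scores = np.where(local_mask, scores, -np.inf)
    local_out = softmax(local_scores) @ frames

    # Combine by simple averaging (an assumption, not the paper's rule).
    return 0.5 * (global_out + local_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(6, 4))  # 6 frames, 4-dim features
    print(local_global_attention(video, window=1).shape)
```

With `window` at least as large as the number of frames, the local branch reduces to the global one; with `window=0`, each frame's local output is just its own feature vector. A real model would add learned projections, multiple heads, and residual connections.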
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Softmax · Dense Connections · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing
