Leveraging Local Temporal Information for Multimodal Scene Classification

Saurabh Sahu; Palash Goyal

arXiv:2110.13992·cs.CV·October 28, 2021

TL;DR

This paper introduces a novel self-attention block that captures local and global temporal relationships in videos, improving scene classification by better understanding frame-level context, especially on large-scale datasets.

Contribution

It proposes a new self-attention mechanism that exploits local and global temporal information for enhanced video understanding in transformer models.

Findings

01. Improved accuracy on the YouTube-8M dataset.

02. Effective modeling of local and global temporal relationships.

03. Enhanced frame-level representations for video classification.

Abstract

Robust video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively. Transformer models with self-attention, which are designed to get contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks. However, the use of Transformer-based models for video understanding is still relatively unexplored. Moreover, these models fail to exploit the strong temporal relationships between neighboring video frames to get potent frame-level representations. In this paper, we propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames. This enables the model to understand the video at various granularities.…
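
To make the idea of combining local and global temporal context concrete, here is a minimal NumPy sketch. It is not the authors' block, whose details the truncated abstract does not give. It assumes a hypothetical design: one global self-attention pass over all frames, one pass restricted to a sliding window of neighboring frames, and a plain average of the two outputs. The function names, the `window` parameter, the averaging step, and the absence of learned query/key/value projections are all illustrative assumptions.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)


def local_mask(num_frames, window):
    """Boolean (T, T) mask: True where frames i and j are at most `window` apart."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def local_global_attention(frames, window=2):
    """Hypothetical local + global self-attention over frame features.

    frames: array of shape (T, d), one feature vector per video frame.
    Returns the combined output plus both attention matrices.
    Assumption: no learned projections; queries, keys and values are the
    frame features themselves, and the two contexts are simply averaged.
    """
    num_frames, dim = frames.shape
    scores = frames @ frames.T / np.sqrt(dim)  # scaled dot-product scores

    # Global pass: every frame attends to every other frame.
    global_weights = softmax(scores)

    # Local pass: scores outside the temporal window are set to -inf,
    # so they receive exactly zero weight after the softmax.
    masked = np.where(local_mask(num_frames, window), scores, -np.inf)
    local_weights = softmax(masked)

    local_ctx = local_weights @ frames
    global_ctx = global_weights @ frames
    return 0.5 * (local_ctx + global_ctx), local_weights, global_weights


rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))  # 8 frames, 4-dimensional features
combined, local_w, global_w = local_global_attention(frames, window=1)
print(combined.shape)
```

With `window=1`, each frame's local attention covers only itself and its immediate neighbors, while the global pass still spreads weight over the whole clip. A real model would add learned projections, multiple heads, and a learned way of fusing the two contexts.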

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Softmax · Dense Connections · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing