ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos

James Wensel; Hayat Ullah; Arslan Munir

arXiv:2208.07929 · cs.CV · August 26, 2022 · 5 cites

Open Access

TL;DR

This paper introduces ViT-ReT, a novel combination of vision and recurrent transformer neural networks designed to enhance human activity recognition in videos, offering improvements in speed and accuracy over traditional CNN and RNN models.

Contribution

The paper presents two transformer-based models tailored for human activity recognition, a recurrent transformer (ReT) and a vision transformer (ViT), as alternatives to conventional CNN and RNN approaches.

Findings

01 Transformer models outperform CNN/RNN in accuracy

02 ReT and ViT show improved speed and scalability

03 Extensive comparison validates effectiveness of proposed models

Abstract

Human activity recognition is an emerging and important area in computer vision that seeks to determine the activity an individual or group of individuals is performing. Applications of this field range from generating highlight videos in sports to intelligent surveillance and gesture recognition. Most activity recognition systems rely on a combination of convolutional neural networks (CNNs) to extract features from the data and recurrent neural networks (RNNs) to model the time-dependent nature of the data. This paper proposes and designs two transformer neural networks for human activity recognition: a recurrent transformer (ReT), a specialized neural network used to make predictions on sequences of data, and a vision transformer (ViT), a transformer optimized for extracting salient features from images, to improve speed and scalability of activity…
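To make the sequence-modeling idea in the abstract concrete, the following is a minimal sketch of scaled dot-product self-attention applied to a short sequence of per-frame feature vectors, followed by mean pooling into one clip-level vector. It is not the authors' ReT or ViT architecture: the function names, the toy feature values, the use of identity query/key/value projections, and the mean-pooling step are all assumptions chosen for illustration, and a real model would use learned projections, multiple heads, residual connections, and layer normalization.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the frame features
    themselves (identity projections), to keep the sketch short.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    attended = []
    for q in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in frames]
        weights = softmax(scores)
        # Weighted sum of the value vectors (here, the frames themselves).
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return attended


def mean_pool(seq):
    """Average a sequence of vectors into one clip-level vector."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


# Hypothetical per-frame features (3 frames, 2 dimensions each).
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(frame_features)
clip_vector = mean_pool(attended)
```

Unlike an RNN, which processes frames one step at a time, every frame here attends to every other frame in a single pass. In this sketch the clip-level vector would then be passed to a classifier over activity labels.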

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · Softmax · Layer Normalization · Dense Connections · Vision Transformer