ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos
James Wensel, Hayat Ullah, Arslan Munir

TL;DR
This paper introduces ViT-ReT, a novel combination of vision and recurrent transformer neural networks designed to enhance human activity recognition in videos, offering improvements in speed and accuracy over traditional CNN and RNN models.
Contribution
The paper presents two new transformer-based models, ReT and ViT, specifically tailored for human activity recognition, advancing beyond conventional CNN and RNN approaches.
Findings
Transformer models outperform CNN/RNN in accuracy
ReT and ViT show improved speed and scalability
Extensive comparison validates effectiveness of proposed models
Abstract
Human activity recognition is an emerging and important area in computer vision which seeks to determine the activity an individual or group of individuals is performing. The applications of this field range from generating highlight videos in sports to intelligent surveillance and gesture recognition. Most activity recognition systems rely on a combination of convolutional neural networks (CNNs) to perform feature extraction from the data and recurrent neural networks (RNNs) to determine the time-dependent nature of the data. This paper proposes and designs two transformer neural networks for human activity recognition: a recurrent transformer (ReT), a specialized neural network used to make predictions on sequences of data, as well as a vision transformer (ViT), a transformer optimized for extracting salient features from images, to improve speed and scalability of activity…
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · Softmax · Layer Normalization · Dense Connections · Vision Transformer
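
Several of the methods tagged above (attention, softmax, residual connection, layer normalization) typically compose into a single transformer encoder step. The following is a minimal sketch of that composition over a short sequence of per-frame feature vectors. It is not the ReT or ViT architecture from the paper: it assumes a single attention head, identity query/key/value projections, no learned parameters, and made-up feature values, and it uses plain Python only to stay dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean, unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections), which a trained model would replace with
    learned linear layers.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        # Weighted average of all frame vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)])
    return out


def encoder_step(frames):
    """Attention, then a residual connection, then layer normalization."""
    attended = self_attention(frames)
    return [
        layer_norm([x + a for x, a in zip(frame, att)])
        for frame, att in zip(frames, attended)
    ]


# Hypothetical per-frame features: 3 frames, 4 features each.
frames = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 2.0, 0.0, 1.0],
]
encoded = encoder_step(frames)
```

Each output vector has the same length as its input frame but now mixes in information from every other frame in the sequence, which is the property that lets attention-based models capture temporal context without a recurrent loop.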
