Human Action Recognition in Still Images Using ConViT
Seyed Rohollah Hosseyni, Sanaz Seyedin, Hasan Taheri

TL;DR
This paper introduces a hybrid model combining CNNs and Vision Transformers to improve human action recognition in still images by better capturing relationships between image parts, leading to higher accuracy.
Contribution
It proposes a novel module integrating ViT with CNNs, enhancing the extraction of meaningful image parts and relationships for human action recognition.
Findings
Achieved 95.5% mAP on the Stanford40 dataset.
Achieved 91.5% mAP on the PASCAL VOC 2012 dataset.
Outperformed several state-of-the-art methods on these benchmarks.
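For readers unfamiliar with the metric, mean average precision (mAP) averages a per-class average precision (AP) over all action classes. The sketch below uses the plain, non-interpolated AP definition. The exact evaluation protocol behind the numbers above is not stated here (PASCAL VOC evaluations often use an interpolated variant), so treat this as an illustration of the idea, not the authors' evaluation code. The function names are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: confidence per image for this class.
    labels: 1 if the image truly shows this class, else 0.
    """
    # Rank images from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            # Precision measured at the rank of each true positive.
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(per_class_scores, per_class_labels):
    """Mean of the per-class AP values."""
    aps = [
        average_precision(s, l)
        for s, l in zip(per_class_scores, per_class_labels)
    ]
    return sum(aps) / len(aps)


# Toy data (made up for illustration): two classes, a handful of images.
toy_scores = [[0.9, 0.8, 0.7, 0.6], [0.2, 0.9, 0.4]]
toy_labels = [[1, 0, 1, 0], [0, 1, 0]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

In the toy data, the first class has true positives at ranks 1 and 3, and the second class ranks its only positive first, so the second class scores a perfect AP.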
Abstract
Understanding the relationships between different parts of an image is crucial in a variety of applications, including object recognition, scene understanding, and image classification. Although Convolutional Neural Networks (CNNs) have demonstrated impressive results in classifying and detecting objects, they lack the capability to extract the relationships between different parts of an image, which is a crucial factor in Human Action Recognition (HAR). To address this problem, this paper proposes a new module that uses a Vision Transformer (ViT) and functions like a convolutional layer. In the proposed model, the Vision Transformer can complement a convolutional neural network in a variety of tasks by helping it extract the relationships among various parts of an image effectively. It is shown that the proposed model, compared to a simple CNN, can extract meaningful parts…
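To make the idea of relating image parts concrete, the following is a minimal sketch of self-attention applied across the spatial positions of a CNN feature map. It is a generic illustration, not the module proposed in the paper: it assumes a single attention head, identity query/key/value projections, and a residual connection, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this position to every position in the map.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens
        ]
        weights = softmax(scores)
        # Weighted mix of all positions' features.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return out


def attention_over_feature_map(fmap):
    """fmap: H x W x C nested lists. Returns a map of the same shape."""
    h, w = len(fmap), len(fmap[0])
    # Flatten the spatial grid into a sequence of H*W tokens.
    tokens = [fmap[i][j] for i in range(h) for j in range(w)]
    attended = self_attention(tokens)
    # Residual connection: keep the local CNN features and add global context.
    mixed = [[x + y for x, y in zip(t, a)] for t, a in zip(tokens, attended)]
    # Restore the H x W grid layout.
    return [mixed[i * w:(i + 1) * w] for i in range(h)]


# Toy 2 x 2 feature map with 2 channels (values made up for illustration).
toy_map = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
toy_out = attention_over_feature_map(toy_map)
```

Because the output has the same height, width, and channel count as the input, a block like this can sit between convolutional layers, which is one plausible reading of a module that "functions like a convolutional layer".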
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization · Label Smoothing
