Egocentric RGB+Depth Action Recognition in Industry-Like Settings
Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah

TL;DR
This paper introduces a multimodal action recognition framework that encodes RGB and Depth data with a 3D Video SWIN Transformer, outperforming prior methods on the MECCANO dataset of industry-like assembly actions and taking first place in the ICIAP 2023 challenge.
Contribution
The work combines RGB and Depth modalities using a 3D Video SWIN Transformer encoder and trains with an exponentially decaying variant of the focal loss modulating factor to cope with skewed action occurrences, advancing egocentric action recognition.
Findings
Outperforms prior methods on the MECCANO dataset
Secured first place in the ICIAP 2023 challenge
Effectively handles multimodal data with late fusion
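The late fusion mentioned in the findings can be illustrated with a minimal sketch. It assumes each modality has its own classifier producing per-class logits and that the two probability distributions are combined by a weighted average; the function names, the `rgb_weight` parameter, and the averaging rule are illustrative assumptions, not details confirmed by this summary.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, depth_logits, rgb_weight=0.5):
    """Fuse per-class probabilities from two independently run models.

    rgb_weight is a hypothetical mixing weight; 0.5 gives a plain average.
    """
    rgb_probs = softmax(rgb_logits)
    depth_probs = softmax(depth_logits)
    return [
        rgb_weight * r + (1.0 - rgb_weight) * d
        for r, d in zip(rgb_probs, depth_probs)
    ]


def predict(fused_probs):
    """Return the index of the most probable action class."""
    return max(range(len(fused_probs)), key=lambda i: fused_probs[i])


# Toy example with three action classes (made-up logits).
rgb_logits = [2.0, 1.0, 0.1]
depth_logits = [0.5, 2.5, 0.1]
fused = late_fusion(rgb_logits, depth_logits)
print(predict(fused))
```

In this toy case the RGB model alone prefers class 0, but the Depth model is more confident about class 1, so the fused prediction follows Depth. Fusing at the score level keeps the two encoders independent, which is the usual motivation for late fusion.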
Abstract
Action recognition from an egocentric viewpoint is a crucial perception task in robotics and enables a wide range of human-robot interactions. While most computer vision approaches prioritize the RGB camera, the Depth modality - which can further amplify the subtleties of actions from an egocentric perspective - remains underexplored. Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment. To study this problem, we consider the recent MECCANO dataset, which provides a wide range of assembling actions. Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively. To address the inherent skewness in real-world multimodal action occurrences, we propose a training strategy using an exponentially decaying variant of the focal loss modulating factor. Additionally, to leverage the…
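To make the loss idea concrete, here is a hedged sketch of a focal loss whose modulating exponent decays exponentially over training. The standard focal loss for the true class probability p_t is -(1 - p_t)^gamma * log(p_t); the decay schedule gamma(epoch) = gamma_init * exp(-decay_rate * epoch), along with the names `decayed_gamma`, `gamma_init`, and `decay_rate`, are assumptions for illustration, since the abstract does not give the exact form.

```python
import math


def decayed_gamma(epoch, gamma_init=2.0, decay_rate=0.1):
    """Hypothetical schedule: the focusing exponent shrinks as training proceeds."""
    return gamma_init * math.exp(-decay_rate * epoch)


def focal_loss(probs, target, gamma):
    """Focal loss for one sample.

    probs  : predicted class probabilities (should sum to 1)
    target : index of the true class
    gamma  : focusing exponent; gamma = 0 reduces to cross-entropy
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A well-classified sample (p_t = 0.7) is down-weighted strongly early on
# and progressively less as gamma decays toward 0.
probs = [0.7, 0.2, 0.1]
for epoch in (0, 10, 30):
    g = decayed_gamma(epoch)
    print(epoch, round(g, 3), round(focal_loss(probs, 0, g), 4))
```

Under this assumed schedule, early epochs emphasize hard, often rare-class, samples, while later epochs approach ordinary cross-entropy as gamma tends to zero.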
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Stochastic Depth · Focal Loss · Layer Normalization · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings · Dense Connections
