Egocentric RGB+Depth Action Recognition in Industry-Like Settings

Jyoti Kini; Sarah Fleischer; Ishan Dave; Mubarak Shah

arXiv:2309.13962 · cs.CV · September 26, 2023 · 1 citation

TL;DR

This paper introduces a multimodal action recognition framework that encodes RGB and Depth data with a 3D Video SWIN Transformer. It outperforms prior methods on the MECCANO dataset, an industry-like setting, and placed first in a challenge at ICIAP 2023.

Contribution

The work combines the RGB and Depth modalities in a transformer-based model and trains it with a focal loss strategy designed for skewed action distributions, advancing egocentric action recognition.

Findings

01 Outperforms prior methods on the MECCANO dataset

02 Secured first place in the ICIAP 2023 challenge

03 Effectively handles multimodal data with late fusion
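
As a rough illustration of the late fusion mentioned above, the sketch below combines the class scores of two independently trained streams. It assumes fusion by a weighted average of softmax probabilities; the function names, the `rgb_weight` parameter, and the example logits are hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, depth_logits, rgb_weight=0.5):
    """Fuse two streams by averaging their class probabilities.

    Assumption: a weighted average of softmax outputs. The paper's exact
    fusion rule is not given in the summary above.
    """
    rgb_probs = softmax(rgb_logits)
    depth_probs = softmax(depth_logits)
    return [
        rgb_weight * r + (1.0 - rgb_weight) * d
        for r, d in zip(rgb_probs, depth_probs)
    ]


def predict(fused_probs):
    """Return the index of the most probable class."""
    return max(range(len(fused_probs)), key=lambda i: fused_probs[i])


# Hypothetical logits for three action classes from each stream.
rgb_logits = [2.0, 1.0, 0.1]
depth_logits = [0.5, 2.5, 0.3]
fused = late_fusion(rgb_logits, depth_logits)
print(predict(fused))
```

Because each stream is scored separately and only the outputs are merged, either encoder can be retrained or replaced without touching the other.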

Abstract

Action recognition from an egocentric viewpoint is a crucial perception task in robotics and enables a wide range of human-robot interactions. While most computer vision approaches prioritize the RGB camera, the Depth modality - which can further amplify the subtleties of actions from an egocentric perspective - remains underexplored. Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment. To study this problem, we consider the recent MECCANO dataset, which provides a wide range of assembling actions. Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively. To address the inherent skewness in real-world multimodal action occurrences, we propose a training strategy using an exponentially decaying variant of the focal loss modulating factor. Additionally, to leverage the…
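
To make the loss idea in the abstract concrete, here is a minimal sketch of a focal loss whose modulating exponent shrinks exponentially over training. The standard focal loss is FL(p_t) = -(1 - p_t)^gamma * log(p_t). The schedule `gamma_start * exp(-decay_rate * epoch)` and its default values are assumptions chosen for illustration, not the authors' published settings.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decayed_gamma(epoch, gamma_start=2.0, decay_rate=0.1):
    """Hypothetical schedule: gamma decays exponentially with the epoch."""
    return gamma_start * math.exp(-decay_rate * epoch)


def focal_loss(logits, target, gamma):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Early epochs use a large gamma, which down-weights easy, well-classified
# samples; as gamma decays the loss approaches plain cross-entropy.
logits = [2.0, 1.0, 0.1]
for epoch in (0, 10, 50):
    gamma = decayed_gamma(epoch)
    print(epoch, round(gamma, 4), round(focal_loss(logits, 0, gamma), 4))
```

Under this assumed schedule, rare and hard examples dominate the gradient early on, while later epochs weight all samples more evenly.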

Code & Models

Repositories

jkini/Meccano (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Stochastic Depth · Focal Loss · Layer Normalization · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings · Dense Connections