TL;DR
This paper introduces a neural architecture search-based approach for multi-modal gesture recognition, leveraging enhanced temporal features and optimized multi-rate, multi-modal networks to achieve state-of-the-art results on benchmark datasets.
Contribution
It presents the first neural architecture search (NAS)-based method for RGB-D gesture recognition, combining the 3D Central Difference Convolution (3D-CDC) family for temporal enhancement with searched backbones for multi-rate, multi-modal learning.
Findings
Achieves state-of-the-art performance on IsoGD, NvGesture, and EgoGesture datasets.
Demonstrates effective multi-modal and multi-rate integration for gesture recognition.
Provides a new perspective on RGB and depth modality relationships.
Abstract
Gesture recognition has attracted considerable attention owing to its great potential in applications. Although great progress has been made recently in multi-modal learning methods, existing methods still lack effective integration to fully explore synergies among spatio-temporal modalities for gesture recognition. The problems are partially due to the fact that existing manually designed network architectures have low efficiency in the joint learning of multiple modalities. In this paper, we propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition. The proposed method includes two key components: 1) enhanced temporal representation via the proposed 3D Central Difference Convolution (3D-CDC) family, which is able to capture rich temporal context by aggregating temporal difference information; and 2) optimized backbones for…
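The temporal-difference idea behind 3D-CDC can be illustrated with a much smaller example. The abstract does not give the operator's formula, so the sketch below assumes the generic central difference convolution form, y[t] = sum_k w[k] * x[t + k - r] - theta * x[t] * sum_k w[k], reduced to a single temporal axis with scalar per-frame features. The function name `temporal_cdc_1d`, the zero padding, and the blending parameter `theta` are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Sequence


def temporal_cdc_1d(
    x: Sequence[float], w: Sequence[float], theta: float = 0.5
) -> List[float]:
    """Hypothetical 1D central difference convolution along the time axis.

    Assumed form (not taken from the paper):
        y[t] = sum_k w[k] * x[t + k - r]  -  theta * x[t] * sum_k w[k]
    With theta = 0 this is a plain (zero-padded) convolution; with theta = 1
    it aggregates only differences between neighbouring frames and frame t.
    """
    if len(w) % 2 == 0:
        raise ValueError("kernel length must be odd")
    r = len(w) // 2                      # temporal radius of the kernel
    w_sum = sum(w)
    # Zero-pad both ends so the output has the same length as the input.
    padded = [0.0] * r + list(x) + [0.0] * r
    out = []
    for t in range(len(x)):
        vanilla = sum(w[k] * padded[t + k] for k in range(len(w)))
        out.append(vanilla - theta * x[t] * w_sum)
    return out


if __name__ == "__main__":
    static = [2.0, 2.0, 2.0, 2.0, 2.0]   # no motion between frames
    moving = [0.0, 1.0, 4.0, 9.0, 16.0]  # accelerating change
    kernel = [1.0, 1.0, 1.0]
    print(temporal_cdc_1d(static, kernel, theta=1.0))
    print(temporal_cdc_1d(moving, kernel, theta=1.0))
```

Under these assumptions, a static sequence yields zero response away from the padded borders when `theta` is 1, while a changing sequence does not, which is the sense in which a difference term emphasizes temporal context over static appearance.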
Taxonomy
Methods: Convolution
