Multi-modal Fusion for Single-Stage Continuous Gesture Recognition
Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes

TL;DR
This paper introduces a single-stage multi-modal fusion framework for continuous gesture recognition that detects and classifies multiple gestures in a video without pre-segmentation, and reports results above prior state-of-the-art methods on three datasets.
Contribution
The paper presents a unified single-stage model with multi-modal fusion, feature mapping, and a mid-point loss for continuous gesture recognition, advancing beyond two-stage approaches.
Findings
Outperforms state-of-the-art on three datasets
Handles variable-length input videos effectively
Ablation studies show the importance of each component
Abstract
Gesture recognition is a much-studied research area with myriad real-world applications, including robotics and human-machine interaction. Current gesture recognition methods have focused on recognising isolated gestures, and existing continuous gesture recognition methods are limited to two-stage approaches where independent models are required for detection and classification, with the performance of the latter being constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that can detect and classify multiple gestures in a video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step to detect individual gestures. To achieve this, we introduce a multi-modal fusion mechanism to…
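To illustrate what detecting and classifying gestures with a single model can mean in practice, here is a minimal sketch that assumes the model emits one class label per frame, with a dedicated non-gesture label. That output format, the function name `frames_to_segments`, and the `background` parameter are illustrative assumptions, not details taken from the paper. Under that assumption, gesture boundaries fall out of the label sequence itself, so no separate detection stage is needed.

```python
from itertools import groupby


def frames_to_segments(frame_labels, background=0):
    """Collapse per-frame labels into (label, start, end) gesture segments.

    `end` is exclusive. Frames labelled `background` (assumed here to be
    the non-gesture class) are skipped.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        if label != background:
            segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical per-frame predictions for a 9-frame clip.
labels = [0, 0, 3, 3, 3, 0, 5, 5, 0]
print(frames_to_segments(labels))  # [(3, 2, 5), (5, 6, 8)]
```

Because the grouping works on whatever length of label sequence it receives, the same post-processing applies to variable-length videos.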
