ModDrop: adaptive multi-modal gesture recognition
Natalia Neverova, Christian Wolf, Graham W. Taylor, Florian Nebout

TL;DR
This paper introduces ModDrop, a multi-modal deep learning approach for gesture recognition that fuses spatial and temporal information across modalities and scales, achieving high accuracy and robustness to missing data.
Contribution
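The paper proposes ModDrop, a training strategy that combines careful initialization of each modality with gradual fusion in which separate channels are randomly dropped, so the network learns cross-modality correlations while keeping each modality-specific representation distinct.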
Findings
Placed first out of 17 teams in the gesture recognition track of the ChaLearn 2014 Looking at People Challenge.
Fusing multiple modalities at several spatial and temporal scales significantly increases recognition rates.
ModDrop enhances robustness to missing or noisy modality signals.
Abstract
We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed ModDrop) for learning cross-modality correlations while preserving uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate…
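The random channel dropping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mod_drop`, the modality names, the drop probability of 0.5, and the choice to replace a dropped modality with zeros are all assumptions made for illustration.

```python
import random


def mod_drop(inputs, drop_prob=0.5, rng=None):
    """Randomly blank out whole modalities in one training sample.

    inputs:    dict mapping a modality name to a flat list of floats.
    drop_prob: per-modality drop probability (hypothetical value).
    rng:       optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    dropped = {}
    for name, values in inputs.items():
        # One independent draw per modality: keep it intact or zero it entirely.
        keep = rng.random() >= drop_prob
        dropped[name] = list(values) if keep else [0.0] * len(values)
    return dropped


# Hypothetical sample with three modalities of different sizes.
sample = {
    "video": [0.2, 0.7, 0.1],
    "depth": [0.5, 0.4],
    "audio": [0.9],
}
augmented = mod_drop(sample, drop_prob=0.5, rng=random.Random(0))
```

In a training loop, a function like this would be applied to each sample before the fused layers, so that the shared representation cannot rely on any single modality always being present. At test time no dropping would be applied.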
