TL;DR
This paper introduces Motion Fused Frames (MFFs), a data-level fusion method that fuses motion information into static images for hand gesture recognition, and evaluates it on three video datasets.
Contribution
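The paper presents a data-level fusion strategy, Motion Fused Frames (MFFs), that fuses motion information into static images so that they better represent the spatio-temporal state of an action. MFFs can be used as input to existing deep learning architectures with very little modification to the network.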
Findings
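Achieved 96.28% classification accuracy on the Jester dataset.
Achieved 57.4% classification accuracy on the ChaLearn LAP IsoGD dataset.
Achieved 84.7% classification accuracy on the NVIDIA Dynamic Hand Gesture dataset, which the authors report as state-of-the-art.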
Abstract
Acquiring spatio-temporal states of an action is the most crucial step for action classification. In this paper, we propose a data level fusion strategy, Motion Fused Frames (MFFs), designed to fuse motion information into static images as better representatives of spatio-temporal states of an action. MFFs can be used as input to any deep learning architecture with very little modification on the network. We evaluate MFFs on hand gesture recognition tasks using three video datasets - Jester, ChaLearn LAP IsoGD and NVIDIA Dynamic Hand Gesture Datasets - which require capturing long-term temporal relations of hand movements. Our approach obtains very competitive performance on Jester and ChaLearn benchmarks with the classification accuracies of 96.28% and 57.4%, respectively, while achieving state-of-the-art performance with 84.7% accuracy on NVIDIA benchmark.
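The abstract does not spell out how the fusion is performed. As a minimal sketch, assuming that "data level fusion" means stacking optical-flow frames onto a static RGB frame along the channel axis, the idea could look like the following. The function name make_mff, the array layout (height, width, channels), the channel order, and the random placeholder data are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def make_mff(rgb, flows):
    """Stack optical-flow frames onto a static RGB frame (hypothetical helper).

    rgb:   (H, W, 3) array holding the static image.
    flows: list of (H, W, 2) arrays, each holding the horizontal and
           vertical optical-flow components for one frame.
    Returns an (H, W, 3 + 2 * len(flows)) array.
    """
    for flow in flows:
        if flow.shape[:2] != rgb.shape[:2] or flow.shape[2] != 2:
            raise ValueError("each flow frame must have shape (H, W, 2)")
    # Assumed channel order: RGB first, then the flow frames in sequence.
    return np.concatenate([rgb] + list(flows), axis=-1)


# Usage with random placeholder data standing in for real frames.
rng = np.random.default_rng(0)
rgb = rng.random((224, 224, 3), dtype=np.float32)
flows = [rng.random((224, 224, 2), dtype=np.float32) for _ in range(3)]

mff = make_mff(rgb, flows)
print(mff.shape)  # (224, 224, 9)
```

Under this assumption, the "very little modification" to the network would amount to widening the first convolutional layer so that it accepts 3 + 2N input channels instead of 3, where N is the number of flow frames; the rest of the architecture would stay unchanged.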
