TL;DR
This paper proposes an end-to-end trainable neural network architecture that combines temporal convolutions with bidirectional recurrence, and shows that it recognizes gestures in video more accurately than simple temporal feature pooling.
Contribution
It shows that recurrence is crucial for gesture recognition in video and that adding temporal convolutions significantly improves performance.
Findings
Recurrence is crucial for capturing temporal dynamics in gesture recognition.
Adding temporal convolutions significantly improves performance.
The proposed architecture achieves state-of-the-art results on the Montalbano dataset.
Abstract
Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning and speech recognition. For the task of capturing temporal structure in video, however, there still remain numerous open research questions. Current research suggests using a simple temporal feature pooling strategy to take into account the temporal aspect of video. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative compared to general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold; first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads…
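The abstract's argument that temporal feature pooling discards information a gesture recognizer needs can be illustrated with a toy sketch. This is not the paper's model: the one-dimensional "hand position" features, the fixed difference kernel, and the scalar weights `w_in` and `w_rec` are all assumptions chosen for illustration, where a real system would learn these parameters on top of CNN frame features. The sketch only shows that mean pooling over time cannot tell a gesture from its time-reversed counterpart, while a temporal convolution followed by a bidirectional recurrence can.

```python
import math

def mean_pool(frames):
    """Average per-frame feature vectors over time (ignores frame order)."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    return [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]

def temporal_conv(frames, kernel):
    """1D convolution along the time axis (no padding), applied per feature dim."""
    k = len(kernel)
    n_dims = len(frames[0])
    return [
        [sum(kernel[j] * frames[t + j][d] for j in range(k)) for d in range(n_dims)]
        for t in range(len(frames) - k + 1)
    ]

def rnn_pass(frames, w_in, w_rec):
    """Toy tanh recurrence with scalar weights; returns one hidden state per step."""
    h = [0.0] * len(frames[0])
    states = []
    for x in frames:
        h = [math.tanh(w_in * xd + w_rec * hd) for xd, hd in zip(x, h)]
        states.append(h)
    return states

def bidirectional_rnn(frames, w_in=1.0, w_rec=0.5):
    """Run the recurrence forward and backward in time, concatenating per step."""
    forward = rnn_pass(frames, w_in, w_rec)
    backward = rnn_pass(frames[::-1], w_in, w_rec)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def encode(frames, kernel=(-1.0, 1.0)):
    """Temporal convolution followed by bidirectional recurrence."""
    return bidirectional_rnn(temporal_conv(frames, list(kernel)))

# Hypothetical 1-D features: horizontal hand position in three frames.
swipe_right = [[0.0], [0.5], [1.0]]
swipe_left = swipe_right[::-1]

print("pooled:", mean_pool(swipe_right), mean_pool(swipe_left))
print("encoded right:", encode(swipe_right))
print("encoded left: ", encode(swipe_left))
```

Both swipes pool to the same vector, so a classifier on pooled features could not separate them. The difference kernel yields a per-step motion estimate with opposite signs for the two directions, and the bidirectional recurrence gives each time step a summary of both past and future motion.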
