Multimodal feature fusion for CNN-based gait recognition: an empirical comparison
Francisco Manuel Castro, Manuel Jesús Marín-Jiménez, Nicolás Guil, Nicolás Pérez de la Blanca

TL;DR
This paper compares CNN architectures and multimodal fusion methods for gait recognition using raw pixels, optical flow, and depth maps, demonstrating that simple inputs and effective fusion can achieve state-of-the-art results.
Contribution
It provides a comprehensive empirical comparison of CNN architectures, input modalities, and fusion strategies for gait recognition, highlighting the effectiveness of raw pixel inputs and multimodal fusion.
Findings
Raw pixel inputs are competitive with silhouette-based features.
Fusion of multiple modalities improves recognition accuracy.
Proper CNN architecture design is crucial for optimal performance.
Abstract
Identifying people in video by the way they walk (i.e. gait) is a relevant, non-invasive task in computer vision. Standard and current approaches typically derive gait signatures from sequences of binary energy maps of subjects extracted from images, but this process introduces a large amount of non-stationary noise, thus limiting their efficacy. In contrast, in this paper we focus on the raw pixels, or simple functions derived from them, letting advanced learning techniques extract relevant features. Therefore, we present a comparative study of different Convolutional Neural Network (CNN) architectures using three different modalities (i.e. gray pixels, optical flow channels and depth maps) on two widely adopted and challenging datasets: TUM-GAID and CASIA-B. In addition, we perform a comparative study between different early and late fusion methods…
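To make the early versus late fusion distinction concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the random stand-in scores, the feature sizes, and the fusion weights are all hypothetical, and product and weighted-sum rules are shown only as common examples of late fusion. Late fusion combines the per-modality class probabilities after each network has made its prediction, whereas early fusion merges per-modality feature vectors before a shared classifier.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis (class dimension).
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_product(probs_per_modality):
    # Multiply class probabilities across modalities, then renormalize.
    fused = np.prod(np.stack(probs_per_modality), axis=0)
    return fused / fused.sum(axis=-1, keepdims=True)

def late_fusion_weighted_sum(probs_per_modality, weights):
    # Convex combination of class probabilities; weights are normalized to sum to 1.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(probs_per_modality)   # shape (modalities, samples, classes)
    return np.tensordot(w, stacked, axes=1)  # shape (samples, classes)

def early_fusion_concat(features_per_modality):
    # Concatenate per-modality feature vectors into one descriptor per sample.
    return np.concatenate(features_per_modality, axis=-1)

# Hypothetical stand-ins for the outputs of three per-modality CNNs.
rng = np.random.default_rng(0)
n_samples, n_subjects = 4, 5
gray_probs = softmax(rng.normal(size=(n_samples, n_subjects)))
flow_probs = softmax(rng.normal(size=(n_samples, n_subjects)))
depth_probs = softmax(rng.normal(size=(n_samples, n_subjects)))

fused_prod = late_fusion_product([gray_probs, flow_probs, depth_probs])
fused_sum = late_fusion_weighted_sum(
    [gray_probs, flow_probs, depth_probs],
    weights=[0.2, 0.5, 0.3],  # assumed weights, purely illustrative
)
predicted_subject = fused_prod.argmax(axis=-1)  # one subject index per sample

# Early fusion: join assumed 128-dimensional features from each modality.
fused_features = early_fusion_concat(
    [rng.normal(size=(n_samples, 128)) for _ in range(3)]
)
```

In practice the weights of a weighted-sum rule would be chosen on validation data, and an early-fusion descriptor would be fed to further trainable layers rather than used directly.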
