Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition
Fangtao Li, Wenzhe Wang, Zihe Liu, Haoran Wang, Chenghao Yan, Bin Wu

TL;DR
This paper introduces a novel framework combining frame aggregation and multi-modal fusion to improve video-based person recognition, effectively handling occlusions, blurring, and angle variations.
Contribution
The paper proposes AttentionVLAD for adaptive frame aggregation and MLMA for multi-modal correlation learning, advancing video person recognition techniques.
Findings
Outperforms state-of-the-art methods on the iQIYI-VID-2019 dataset
Reduces the impact of low-quality frames
Enhances multi-modal feature integration
Abstract
Video-based person recognition is challenging because persons are often occluded or blurred and the shooting angle varies. Previous research has mostly focused on person recognition in still images, ignoring the similarity and continuity between video frames. To tackle the challenges above, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and incorporates them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD (named AttentionVLAD), which takes an arbitrary number of features as input and computes a fixed-length aggregation feature based on feature quality. We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames. For the multi-modal information of videos, we…
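The frame aggregation idea in the abstract (a variable number of per-frame features in, one fixed-length vector out, with low-quality frames down-weighted) can be illustrated with a minimal pure-Python sketch. This is not the authors' AttentionVLAD layer: the function name `attention_vlad`, the linear quality score `quality_w`, and the softmax over frames are assumptions made for illustration, layered on the standard NetVLAD soft-assignment and residual sum. In a trained model all parameters would be learned; here they are hand-picked toy values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def attention_vlad(frames, centers, assign_w, assign_b, quality_w):
    """Aggregate N frame features (each of length D) into one K*D vector."""
    # Assumption: frame quality is a linear score, turned into attention
    # weights by a softmax over the frames of the clip.
    attn = softmax([dot(quality_w, x) for x in frames])
    K, D = len(centers), len(centers[0])
    vlad = [[0.0] * D for _ in range(K)]
    for x, a in zip(frames, attn):
        # NetVLAD-style soft assignment of this frame to the K clusters.
        assign = softmax([dot(w, x) + b for w, b in zip(assign_w, assign_b)])
        for k in range(K):
            for d in range(D):
                # Attention-weighted residual to the cluster center.
                vlad[k][d] += a * assign[k] * (x[d] - centers[k][d])
    # Normalize each cluster's residual, flatten, then normalize the whole vector.
    flat = [v for row in vlad for v in l2_normalize(row)]
    return l2_normalize(flat)

# Toy parameters (hypothetical values): K = 2 clusters, D = 3 feature dims.
centers = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
assign_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
assign_b = [0.0, 0.0]
quality_w = [0.5, 0.5, 0.5]

short_clip = [[0.2, 0.1, 0.4], [0.9, 0.8, 1.1]]
long_clip = short_clip + [[0.5, 0.5, 0.5], [0.0, 1.0, 0.3], [1.2, 0.1, 0.7]]

out_short = attention_vlad(short_clip, centers, assign_w, assign_b, quality_w)
out_long = attention_vlad(long_clip, centers, assign_w, assign_b, quality_w)
print(len(out_short), len(out_long))  # both have length K * D = 6
```

Clips with 2 and 5 frames both produce a vector of length K * D, which is the property that lets a downstream classifier consume videos of any length.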
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face recognition and analysis · Gait Recognition and Analysis
