VILLS -- Video-Image Learning to Learn Semantics for Person Re-Identification
Siyuan Huang, Ram Prabhakar, Yuxiang Guo, Rama Chellappa, Cheng Peng

TL;DR
VILLS is a self-supervised approach that jointly learns spatial and temporal features from images and videos to improve person re-identification robustness in challenging real-world scenarios.
Contribution
It introduces a novel unified framework with semantic extraction and feature adaptation modules, achieving state-of-the-art results in person re-identification.
Findings
VILLS outperforms existing methods significantly.
The method effectively combines image and video modalities.
It demonstrates robustness in real-world, unconstrained environments.
Abstract
Person re-identification is a research area with significant real-world applications. Despite recent progress, existing methods face challenges in robust re-identification in the wild, e.g., by focusing only on a particular modality and on unreliable patterns such as clothing. A generalized method is highly desired, but remains elusive due to issues such as the trade-off between spatial and temporal resolution and imperfect feature extraction. We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos. VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features. Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space. By…
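Illustrative sketch
The idea of representing images and videos in one feature space can be pictured with a toy example. The sketch below is not the VILLS architecture. It assumes, purely for illustration, that per-frame feature vectors already exist, that a clip is summarized by mean-pooling its frames over time, and that an image is treated as a one-frame clip. All function names (embed_clip, embed_image, cosine_similarity) are hypothetical.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical per-frame feature vector


def embed_clip(frames: List[Frame]) -> List[float]:
    """Mean-pool per-frame features over time (assumed stand-in for temporal aggregation)."""
    if not frames:
        raise ValueError("clip must contain at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same feature dimension")
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def embed_image(frame: Frame) -> List[float]:
    """Treat a still image as a one-frame clip so it lands in the same space as videos."""
    return embed_clip([frame])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Image query matched against a video gallery entry in the shared space.
query = embed_image([1.0, 0.0])
gallery = embed_clip([[1.0, 0.0], [3.0, 0.0]])
score = cosine_similarity(query, gallery)
```

Because both modalities pass through the same pooling function, an image query and a video gallery entry can be compared directly with a single similarity measure. A learned model would replace the mean-pooling step with trained spatial and temporal modules.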
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face recognition and analysis · Gait Recognition and Analysis
