AXM-Net: Implicit Cross-Modal Feature Alignment for Person Re-identification
Ammarah Farooq, Muhammad Awais, Josef Kittler, Syed Safwan Khalid

TL;DR
AXM-Net introduces a novel CNN architecture with an implicit semantic alignment mechanism for cross-modal person re-identification, significantly improving accuracy in person search and cross-viewpoint scenarios.
Contribution
The paper proposes AXM-Block and a unified framework for implicit cross-modal semantic alignment, enhancing visual-textual feature coherence for person Re-ID.
Findings
Achieves 64.44% Rank@1 on CUHK-PEDES, surpassing the previous state of the art.
Outperforms competitors by over 10% in cross-viewpoint text-to-image Re-ID.
Effectively utilizes textual data as supervision for visual feature learning.
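For readers unfamiliar with the Rank@1 figure quoted above, the following is a minimal sketch of how Rank@k is commonly computed for text-to-image retrieval: each text query ranks all gallery images by similarity, and the query counts as a hit if an image of the correct identity appears among the top k. The function names (`cosine`, `rank_at_k`), the use of cosine similarity, and the toy feature vectors are illustrative assumptions, not the paper's evaluation code or data.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_at_k(query_feats, query_ids, gallery_feats, gallery_ids, k=1):
    """Fraction of queries whose top-k gallery matches contain the right identity."""
    hits = 0
    for q, qid in zip(query_feats, query_ids):
        # Sort gallery indices from most to least similar to this query.
        order = sorted(
            range(len(gallery_feats)),
            key=lambda i: cosine(q, gallery_feats[i]),
            reverse=True,
        )
        if any(gallery_ids[i] == qid for i in order[:k]):
            hits += 1
    return hits / len(query_feats)


# Toy data (hypothetical): three text queries and three gallery images,
# each labelled with a person identity.
text_feats = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]
text_ids = [0, 1, 2]
image_feats = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
image_ids = [0, 1, 2]

rank1 = rank_at_k(text_feats, text_ids, image_feats, image_ids, k=1)
rank2 = rank_at_k(text_feats, text_ids, image_feats, image_ids, k=2)
print(f"Rank@1 = {rank1:.4f}, Rank@2 = {rank2:.4f}")
```

In this toy setup the third query is most similar to an image of a different person, so it misses at k=1 but is recovered at k=2. Published benchmarks apply the same idea over thousands of queries, and their exact protocol may differ in details this sketch does not cover.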
Abstract
Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems. The key challenge is to align cross-modality representations induced by the semantic information present for a person and ignore background information. This work presents a novel convolutional neural network (CNN) based architecture designed to learn semantically aligned cross-modal visual and textual representations. The underlying building block, named AXM-Block, is a unified multi-layer network that dynamically exploits the multi-scale knowledge from both modalities and re-calibrates each modality according to shared semantics. To complement the convolutional design, contextual attention is applied in the text branch to manipulate long-term dependencies. Moreover, we propose a unique design to enhance visual part-based feature coherence and locality information. Our framework is novel in…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition
