ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language
Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang

TL;DR
ViTAA takes an attribute-alignment approach to person search by natural language: it grounds textual attribute phrases to the corresponding visual regions, which improves accuracy and yields state-of-the-art results.
Contribution
The paper proposes ViTAA, a novel model that aligns visual attributes with textual descriptions via disentangled feature spaces and contrastive learning, enhancing person search performance.
Findings
Achieves state-of-the-art results on person search by natural language.
Effectively grounds attribute phrases to visual regions.
Improves accuracy through attribute-based feature disentanglement.
Abstract
Person search by natural language aims to retrieve a specific person from a large-scale image pool who matches the given textual description. While most current methods treat the task as holistic visual and textual feature matching, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. This yields both successful grounding and a performance boost, through robust feature learning in which the referred identity is accurately characterized by multiple attribute-level visual cues. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes using a lightweight auxiliary attribute segmentation branch. It then aligns these visual features with the textual attributes parsed from the sentences by using…
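The attribute-wise alignment idea described in the abstract can be illustrated with a small sketch. This is not the loss used in the paper (the abstract is truncated before it names one); it is a generic hinge-style contrastive objective, assuming each person image and each description have already been split into per-attribute feature vectors. The attribute names, the function names, and the `margin` value are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def attribute_alignment_loss(visual, textual, same_identity, margin=0.2):
    """Generic contrastive loss averaged over attribute subspaces.

    visual, textual: dicts mapping an attribute name to a feature vector.
    Only attributes present in BOTH dicts contribute, so attributes the
    sentence never mentions are ignored (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for attr, text_feat in textual.items():
        if attr not in visual:
            continue
        sim = cosine(visual[attr], text_feat)
        if same_identity:
            total += 1.0 - sim               # pull matching pairs together
        else:
            total += max(0.0, sim - margin)  # push mismatched pairs below the margin
        count += 1
    return total / count if count else 0.0


# Hypothetical per-attribute features for one image and one description.
image_feats = {"upper_body": [1.0, 0.0], "lower_body": [0.0, 1.0]}
text_feats = {"upper_body": [1.0, 0.0]}  # the sentence mentions only the upper body

matched = attribute_alignment_loss(image_feats, text_feats, same_identity=True)
mismatched = attribute_alignment_loss(image_feats, text_feats, same_identity=False)
```

Computing the loss per attribute rather than on one holistic vector is what lets a mismatch in a single attribute (say, the wrong shirt color) be penalized on its own subspace, which is the intuition behind the attribute-aligning perspective.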
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
