ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

Zhe Wang; Zhiyuan Fang; Jun Wang; Yezhou Yang

arXiv:2005.07327 · cs.CV · July 31, 2020 · 21 cites

TL;DR

ViTAA introduces an attribute-aligning approach for person search using natural language, grounding textual attribute phrases to visual regions, leading to improved accuracy and state-of-the-art results.

Contribution

The paper proposes ViTAA, a novel model that aligns visual attributes with textual descriptions via disentangled feature spaces and contrastive learning, enhancing person search performance.

Findings

01. Achieves state-of-the-art results on person search by natural language.

02. Effectively grounds attribute phrases to visual regions.

03. Improves accuracy through attribute-based feature disentanglement.

Abstract

Person search by natural language aims to retrieve a specific person from a large-scale image pool who matches a given textual description. While most current methods treat the task as holistic visual and textual feature matching, we approach it from an attribute-aligning perspective that allows specific attribute phrases to be grounded to the corresponding visual regions. We achieve both this grounding and a performance boost through robust feature learning in which the referred identity is accurately bundled by multiple attribute visual cues. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes using a light auxiliary attribute segmentation computing branch. It then aligns these visual features with the textual attributes parsed from the sentences by using…
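
As a rough illustration of the attribute-aligning idea in the abstract, the sketch below scores an image against a description by averaging per-attribute cosine similarities and then applies a simple hinge-style contrastive loss. This is not the authors' formulation: the attribute names (`upper`, `lower`), the toy vectors, the margin value, and all function names are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute_similarity(visual, textual):
    """Mean cosine similarity over attributes present in both inputs.

    `visual` and `textual` map a (hypothetical) attribute name to a
    feature vector living in that attribute's subspace.
    """
    shared = sorted(set(visual) & set(textual))
    if not shared:
        return 0.0
    return sum(cosine(visual[k], textual[k]) for k in shared) / len(shared)


def contrastive_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge-style loss: a matched pair should beat a mismatched pair by `margin`.

    The margin of 0.2 is an arbitrary illustrative choice.
    """
    return max(0.0, margin - sim_pos + sim_neg)


# Toy example: one image, one matching description, one non-matching description.
image = {"upper": [1.0, 0.0], "lower": [0.0, 1.0]}
matching_text = {"upper": [1.0, 0.0], "lower": [0.0, 1.0]}
other_text = {"upper": [0.0, 1.0], "lower": [1.0, 0.0]}

sim_pos = attribute_similarity(image, matching_text)
sim_neg = attribute_similarity(image, other_text)
loss = contrastive_loss(sim_pos, sim_neg)
```

Scoring each attribute in its own subspace, rather than comparing one holistic vector per modality, is what lets an individual phrase be matched against the visual features for the corresponding attribute.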

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques