Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search

Chenyang Gao; Guanyu Cai; Xinyang Jiang; Feng Zheng; Jun Zhang; Yifei Gong; Pai Peng; Xiaowei Guo; Xing Sun

arXiv:2101.03036 · cs.CV · January 11, 2021 · 61 citations

TL;DR

This paper introduces NAFS, a novel method for text-based person search that adaptively aligns visual and textual features across all scales using a non-local attention mechanism, significantly improving retrieval accuracy.

Contribution

The paper proposes a full-scale, adaptive alignment approach with a staircase network and locality-constrained BERT, addressing limitations of scale-specific alignment methods.

Findings

01 Outperforms the state of the art by 5.53% in top-1 accuracy

02 Achieves a 5.35% improvement in top-5 accuracy

03 Demonstrates effective multi-scale feature alignment
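
The top-1 and top-5 figures above are retrieval accuracies: the fraction of text queries for which a correct image appears among the k highest-ranked gallery images. A minimal sketch of that metric follows, assuming a precomputed query-by-gallery similarity matrix and integer identity labels; the function name `top_k_accuracy` and the toy data are illustrative and do not come from the paper or its repositories.

```python
def top_k_accuracy(similarity, query_ids, gallery_ids, k):
    """Fraction of queries whose k best-scoring gallery items contain the query's identity.

    similarity[i][j] is the (assumed precomputed) score between text query i
    and gallery image j; higher means more similar.
    """
    hits = 0
    for scores, qid in zip(similarity, query_ids):
        # Rank gallery indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(gallery_ids[j] == qid for j in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy data (hypothetical): 3 text queries scored against 3 gallery images.
similarity = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.7, 0.6, 0.1],
]
query_ids = [0, 1, 2]
gallery_ids = [0, 1, 2]

top1 = top_k_accuracy(similarity, query_ids, gallery_ids, k=1)
top2 = top_k_accuracy(similarity, query_ids, gallery_ids, k=2)
```

In this toy case only the first query ranks its match first, so top-1 accuracy is one third, while top-2 accuracy rises to two thirds because the second query's match is ranked second.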

Abstract

Text-based person search aims to retrieve a target person from an image gallery using a descriptive sentence about that person. It is very challenging because the modal gap makes it harder to extract discriminative features. Moreover, the inter-class variance of both pedestrian images and descriptions is small, so comprehensive information is needed to align visual and textual clues across all scales. Most existing methods consider only the local alignment between images and texts within a single scale (e.g., only the global scale or only the partial scale) and then simply construct the alignment at each scale separately. To address this problem, we propose a method that adaptively aligns image and textual features across all scales, called NAFS (i.e., Non-local Alignment over Full-Scale representations). Firstly, a novel staircase network structure is proposed to extract full-scale…
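
To make the idea of aligning features across all scales rather than one scale at a time more concrete, here is a rough sketch. It is not the paper's mechanism, which the truncated abstract does not spell out. It substitutes generic scaled dot-product attention: text features from every scale are pooled into one list, image features from every scale into another, and each text feature attends over all image features regardless of scale. All function names and the toy vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def attend(query, keys):
    """Generic scaled dot-product attention: a weighted average of `keys` for one query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(len(query))]


def full_scale_similarity(text_feats, image_feats):
    """Mean cosine similarity between each text feature and its attended image context.

    Both lists mix features from every scale (assumption: e.g. sentence, phrase,
    word on the text side; whole image, region, patch on the image side), so a
    word-level feature is free to attend to a global image feature and vice versa.
    """
    sims = [cosine(t, attend(t, image_feats)) for t in text_feats]
    return sum(sims) / len(sims)


# Toy 2-D features (hypothetical values, one list entry per scale).
image_feats = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]
matching_text = [[1.0, 0.1], [0.7, 0.2]]
mismatched_text = [[-1.0, 0.0], [-0.5, -0.5]]

match_score = full_scale_similarity(matching_text, image_feats)
mismatch_score = full_scale_similarity(mismatched_text, image_feats)
```

With these toy vectors the matching description scores higher than the mismatched one. A trained system would learn projections for queries and keys, which this sketch omits.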

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Softmax · Dropout · Dense Connections · Multi-Head Attention · Attention Is All You Need · WordPiece · Attention Dropout