# Attention-based Natural Language Person Retrieval

**Authors:** Tao Zhou, Muhao Chen, Jie Yu, Demetri Terzopoulos

arXiv: 1705.08923 · 2017-05-26

## TL;DR

This paper introduces an attention-based system that localizes a person in an image from a natural language description. It pairs a new benchmark dataset with a model that scores Faster R-CNN region proposals using CNN visual features and BLSTM text features.

## Contribution

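The paper contributes a benchmark dataset for natural language person retrieval, in which person bounding boxes are derived from segmentation masks and annotated with descriptions and attributes via Amazon Mechanical Turk, and an attention-based deep learning model that reports a significant improvement over the state-of-the-art method for generic object retrieval.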
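
As a rough illustration of the dataset step described in the abstract (bounding boxes generated from segmentation masks), the sketch below computes the tightest axis-aligned box around a binary mask. The function name `mask_to_bbox`, the list-of-lists mask format, and the inclusive `(x_min, y_min, x_max, y_max)` convention are assumptions made for this example; the summary does not state how the authors implemented this step.

```python
def mask_to_bbox(mask):
    """Return the tightest box (x_min, y_min, x_max, y_max) around the
    truthy cells of a 2D mask, with inclusive pixel coordinates.

    `mask` is a list of rows (hypothetical format chosen for this sketch).
    Returns None when the mask contains no foreground pixel.
    """
    # Row indices that contain at least one foreground pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    # Column indices of every foreground pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 5x6 mask with a single "person" blob.
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
toy_bbox = mask_to_bbox(toy_mask)
empty_bbox = mask_to_bbox([[0, 0], [0, 0]])
```

For the toy mask, the foreground spans columns 1 to 4 and rows 1 to 3, so the box covers exactly that range.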

## Key findings

- Significant improvement over the state-of-the-art method for generic object retrieval
- Visual features (CNN) and text features (BLSTM) are integrated to score region proposals, with attention weights applied to the whole image
- Potential applications in surveillance video search

## Abstract

Following the recent progress in image classification and captioning using deep learning, we develop a novel natural language person retrieval system based on an attention mechanism. More specifically, given the description of a person, the goal is to localize the person in an image. To this end, we first construct a benchmark dataset for natural language person retrieval. To do so, we generate bounding boxes for persons in a public image dataset from the segmentation masks, which are then annotated with descriptions and attributes using the Amazon Mechanical Turk. We then adopt a region proposal network in Faster R-CNN as a candidate region generator. The cropped images based on the region proposals as well as the whole images with attention weights are fed into Convolutional Neural Networks for visual feature extraction, while the natural language expression and attributes are input to Bidirectional Long Short-Term Memory (BLSTM) models for text feature extraction. The visual and text features are integrated to score region proposals, and the one with the highest score is retrieved as the output of our system. The experimental results show significant improvement over the state-of-the-art method for generic object retrieval and this line of research promises to benefit search in surveillance video footage.
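
The final retrieval step in the abstract (score every region proposal against the text, return the highest-scoring one) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity stands in for the paper's learned scoring function, which this summary does not specify, and the feature vectors are made-up placeholders for what the CNN and BLSTM encoders would produce. The names `cosine` and `retrieve_best_region` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_best_region(region_features, text_feature):
    """Score each region proposal against the text feature and return
    (index_of_best_region, list_of_scores).

    Assumption: cosine similarity is a stand-in for the learned
    visual-text scoring described in the paper.
    """
    scores = [cosine(feature, text_feature) for feature in region_features]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Placeholder features: three region proposals and one query description.
region_features = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
text_feature = [0.5, 0.9, 0.0]

best_index, scores = retrieve_best_region(region_features, text_feature)
```

Here the second proposal points in nearly the same direction as the query vector, so it receives the highest score and is the one retrieved. In the described system the scores would come from a trained network rather than a fixed similarity measure.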

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1705.08923/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1705.08923/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/1705.08923/full.md

---
Source: https://tomesphere.com/paper/1705.08923