Learning Deep Representations of Fine-grained Visual Descriptions

Scott Reed; Zeynep Akata; Bernt Schiele; Honglak Lee

arXiv:1605.05395·cs.CV·May 19, 2016·146 cites


TL;DR

This paper trains neural language models from scratch to embed fine-grained visual descriptions jointly with images, enabling zero-shot image recognition and retrieval without relying on attributes and providing a natural language interface.

Contribution

The authors propose a novel end-to-end neural language model trained from scratch that aligns raw text with images for fine-grained recognition, surpassing attribute-based methods.

Findings

01 Outperforms attribute-based methods on zero-shot classification

02 Achieves strong results on zero-shot text-based image retrieval

03 Provides a natural language interface for image annotation and retrieval

Abstract

State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories. Despite good performance, attributes have limitations: (1) finer-grained recognition requires commensurately more attributes, and (2) attributes do not provide a natural language interface. We propose to overcome these limitations by training neural language models from scratch; i.e. without pre-training and only consuming words and characters. Our proposed models train end-to-end to align with the fine-grained and category-specific content of images. Natural language provides a flexible and compact way of encoding only the salient visual aspects for distinguishing categories. By training on…
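The joint-embedding formulation described in the abstract can be illustrated with a minimal sketch. It assumes that an image encoder and a text encoder (not shown) already map their inputs into a shared vector space, scores image-text compatibility with an inner product, and represents each class by the average of its description embeddings. The function names, the averaging rule, and the toy 3-dimensional vectors are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, used here as the image-text compatibility score."""
    if len(a) != len(b):
        raise ValueError("embeddings must share one dimensionality")
    return sum(x * y for x, y in zip(a, b))


def class_prototype(text_embeddings: List[Vector]) -> List[float]:
    """Average the embeddings of all descriptions written for one class."""
    n = len(text_embeddings)
    return [sum(col) / n for col in zip(*text_embeddings)]


def zero_shot_classify(image_embedding: Vector,
                       class_texts: Dict[str, List[Vector]]) -> str:
    """Return the class whose averaged text embedding scores highest."""
    prototypes = {name: class_prototype(embs)
                  for name, embs in class_texts.items()}
    return max(prototypes,
               key=lambda name: dot(image_embedding, prototypes[name]))


# Toy 3-d embeddings standing in for encoder outputs (made-up numbers).
# The classes are "unseen": only their text descriptions are available.
unseen_classes = {
    "cardinal": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "blue jay": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
query_image = [0.7, 0.2, 0.1]
prediction = zero_shot_classify(query_image, unseen_classes)
print(prediction)
```

With these toy numbers the query image is assigned to "cardinal". Ranking a set of image embeddings against a single text embedding with the same score would give the text-based retrieval direction.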

Code & Models

Videos

Learning Deep Representations of Fine-Grained Visual Descriptions · YouTube

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques