Learning Deep Representations of Fine-grained Visual Descriptions
Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee

TL;DR
This paper trains neural language models from scratch on fine-grained visual descriptions and aligns them with image content, enabling zero-shot image recognition and retrieval without attributes and providing a natural language interface.
Contribution
The authors propose neural language models, trained end-to-end from scratch on raw words and characters, that align text with the fine-grained, category-specific content of images and outperform attribute-based methods on zero-shot classification.
Findings
Outperforms attribute-based methods on zero-shot classification
Achieves strong results on zero-shot text-based image retrieval
Provides a natural language interface for image annotation and retrieval
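As a rough illustration of what text-based image retrieval in a joint embedding space can look like, the sketch below ranks images by an inner-product score against a query text embedding. The function name `rank_images`, the toy vectors, and the use of a plain inner product are assumptions made for this example; the encoders that would produce the embeddings are not shown, and this is not the authors' implementation.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_images(query_embedding: Vector,
                image_embeddings: Dict[str, Vector],
                top_k: int) -> List[str]:
    """Return the ids of the top_k images, highest score first.

    Hypothetical helper: both the text query and the images are assumed
    to be already embedded in the same vector space.
    """
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: dot(query_embedding, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in ranked[:top_k]]


# Toy data (made up): one text query and three image embeddings.
query = [1.0, 0.0]
images = {
    "img_a": [0.2, 0.9],
    "img_b": [0.9, 0.1],
    "img_c": [0.5, 0.5],
}
top_two = rank_images(query, images, top_k=2)
print(top_two)  # ['img_b', 'img_c']
```

Because the query is only a vector, the same ranking routine works for a description of a category that never appeared during training, which is the zero-shot setting the findings refer to.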
Abstract
State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories. Despite good performance, attributes have limitations: (1) finer-grained recognition requires commensurately more attributes, and (2) attributes do not provide a natural language interface. We propose to overcome these limitations by training neural language models from scratch; i.e. without pre-training and only consuming words and characters. Our proposed models train end-to-end to align with the fine-grained and category-specific content of images. Natural language provides a flexible and compact way of encoding only the salient visual aspects for distinguishing categories. By training on…
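The joint embedding formulation mentioned in the abstract can be sketched as follows: an image is assigned to the class whose text descriptions are most compatible with it. This is a minimal sketch under stated assumptions: compatibility is taken to be an inner product between an image embedding and a text embedding, each class is represented by the average of its description embeddings, and the names `class_text_embedding` and `zero_shot_classify` as well as all vectors are hypothetical. The image and text encoders themselves are not shown.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def class_text_embedding(description_embeddings: List[Vector]) -> List[float]:
    """Average several description embeddings into one class-level vector."""
    if not description_embeddings:
        raise ValueError("a class needs at least one description")
    count = len(description_embeddings)
    dim = len(description_embeddings[0])
    return [sum(e[i] for e in description_embeddings) / count for i in range(dim)]


def zero_shot_classify(image_embedding: Vector,
                       class_descriptions: Dict[str, List[Vector]]) -> str:
    """Pick the class whose averaged text embedding scores highest.

    Assumption: compatibility is an inner product in a shared space.
    No labeled images of the candidate classes are needed, only text.
    """
    scores = {
        name: dot(image_embedding, class_text_embedding(embeddings))
        for name, embeddings in class_descriptions.items()
    }
    return max(scores, key=scores.get)


# Toy data (made up): two unseen classes, two descriptions each.
image = [0.9, 0.1, 0.0]
unseen_classes = {
    "cardinal": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "blue jay": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]],
}
prediction = zero_shot_classify(image, unseen_classes)
print(prediction)  # cardinal
```

With an inner-product score, averaging the description embeddings first gives the same result as averaging the per-description scores, so each class can be reduced to a single vector before any images are compared.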
Videos
Learning Deep Representations of Fine-Grained Visual Descriptions (YouTube)
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
