Learning Visual N-Grams from Web Data

Ang Li; Allan Jabri; Armand Joulin; Laurens van der Maaten

arXiv:1612.09161·cs.CV·August 8, 2017

TL;DR

This paper introduces visual n-gram models trained on web data to improve large-scale image recognition, enabling phrase prediction and zero-shot transfer without extensive manual annotation.

Contribution

The paper trains feed-forward convolutional networks as visual n-gram models on webly supervised data, using new loss functions inspired by n-gram language models, and applies them to phrase prediction and phrase-based image retrieval.

Findings

01. Effective phrase prediction from images

02. Improved image retrieval using visual n-grams

03. Successful zero-shot transfer capabilities

Abstract

Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition