Predicting Visual Features from Text for Image and Video Caption Retrieval

Jianfeng Dong; Xirong Li; Cees G. M. Snoek

arXiv:1709.01362 · cs.CV · July 17, 2018

TL;DR

This paper introduces Word2VisualVec, a neural network that predicts visual features directly from text, enabling improved image and video caption retrieval without relying on joint embedding spaces.

Contribution

The paper proposes matching captions to images and videos exclusively in a visual space, rather than in a joint subspace, and contributes Word2VisualVec, a neural architecture that predicts visual features from text.

Findings

01. Achieves state-of-the-art results on multiple datasets

02. Demonstrates benefits over traditional textual embeddings

03. Shows potential for multimodal query composition

Abstract

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description…
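
The pipeline the abstract describes (encode a sentence, map it through a multi-layer perceptron into a visual feature space, then rank candidate sentences against an image feature in that space) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a plain bag-of-words encoder as a single-scale stand-in for the multi-scale sentence vectorization, random untrained weights, cosine similarity for ranking, and a toy image feature. The function names, dimensions, and example sentences are all hypothetical.

```python
import numpy as np

def bag_of_words(sentence, vocab):
    """Count vocabulary words in a sentence (assumed stand-in text encoder)."""
    vec = np.zeros(len(vocab))
    for word in sentence.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def mlp_forward(x, weights, biases):
    """Map a text vector to a predicted visual feature with a ReLU MLP."""
    h = x
    for W, b in zip(weights, biases):
        h = np.maximum(0.0, h @ W + b)
    return h

def rank_captions(image_feature, sentences, vocab, weights, biases):
    """Rank sentences by cosine similarity to the image in the visual space."""
    preds = np.stack(
        [mlp_forward(bag_of_words(s, vocab), weights, biases) for s in sentences]
    )
    preds = preds / (np.linalg.norm(preds, axis=1, keepdims=True) + 1e-12)
    img = image_feature / (np.linalg.norm(image_feature) + 1e-12)
    scores = preds @ img
    return np.argsort(-scores), scores

# Toy data: all sentences, sizes, and weights are hypothetical.
sentences = [
    "a dog runs on the beach",
    "two people ride bikes",
    "a cat sleeps on a sofa",
]
vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s.split()}))}

rng = np.random.default_rng(0)
hidden_dim, visual_dim = 32, 16
weights = [
    rng.normal(scale=0.1, size=(len(vocab), hidden_dim)),
    rng.normal(scale=0.1, size=(hidden_dim, visual_dim)),
]
biases = [np.zeros(hidden_dim), np.zeros(visual_dim)]

# Pretend the image feature coincides with the prediction for sentence 1,
# so that sentence should rank first.
image_feature = mlp_forward(bag_of_words(sentences[1], vocab), weights, biases)
order, scores = rank_captions(image_feature, sentences, vocab, weights, biases)
print(order, scores)
```

In a trained model the weights would be learned so that the predicted feature of a caption lies close to the feature of its image or video. Here the image feature is constructed by hand only to show how ranking in the visual space works.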

Code & Models

Repositories

danieljf24/w2vv (tf, official)
