Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction

Jianfeng Dong; Xirong Li; Cees G. M. Snoek

arXiv:1604.06838 · cs.CV · November 28, 2016 · 45 cites

TL;DR

This paper introduces Word2VisualVec, a deep neural network that predicts visual features from text, so that image-to-sentence and video-to-sentence matching is performed in the visual space rather than in a joint embedding space.

Contribution

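The paper proposes a deep neural network architecture that predicts a deep visual encoding directly from a sentence, using sentence vectorization followed by a multi-layer perceptron. It extends the approach to video by predicting 3-D ConvNet features and a visual-audio representation, and reports state-of-the-art results on four image and video benchmarks.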

Findings

01. Outperforms existing methods on four benchmarks.

02. Effective in matching images and videos to descriptive sentences.

03. Versatile architecture adaptable to different visual features.

Abstract

This paper strives to find the sentence best describing the content of an image or video. Different from existing works, which rely on a joint subspace for image / video to sentence matching, we propose to do so in a visual space only. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a deep visual encoding of textual input based on sentence vectorization and a multi-layer perceptron. We thoroughly analyze its architectural design, by varying the sentence vectorization strategy, network depth and the deep feature to predict for image to sentence matching. We also generalize Word2VisualVec for matching a video to a sentence, by extending the predictive abilities to 3-D ConvNet features as well as a visual-audio representation. Experiments on four challenging image and video benchmarks detail Word2VisualVec's properties, capabilities for image and…
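The pipeline described in the abstract (vectorize a sentence, map it through a multi-layer perceptron to a predicted visual feature, then match in the visual space) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the bag-of-words vectorization, the layer sizes, the ReLU activation, the random untrained weights, cosine similarity as the matching score, and the helper names `bag_of_words`, `init_mlp`, `predict_visual_feature`, and `rank_sentences`.

```python
import numpy as np


def bag_of_words(sentence, vocab):
    """Count-based sentence vector over a fixed vocabulary (one possible vectorization)."""
    vec = np.zeros(len(vocab))
    for token in sentence.lower().split():
        idx = vocab.get(token)
        if idx is not None:  # out-of-vocabulary words are ignored
            vec[idx] += 1.0
    return vec


def init_mlp(layer_sizes, seed=0):
    """Random (untrained) weights; a real model would learn these from sentence/visual pairs."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def predict_visual_feature(text_vec, layers):
    """Forward pass: ReLU on hidden layers, linear output in the visual feature space."""
    h = text_vec
    for i, (weights, bias) in enumerate(layers):
        h = h @ weights + bias
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return h


def rank_sentences(visual_feature, sentences, vocab, layers):
    """Rank candidate sentences by cosine similarity, computed in the visual space only."""
    preds = np.stack(
        [predict_visual_feature(bag_of_words(s, vocab), layers) for s in sentences]
    )
    denom = np.linalg.norm(preds, axis=1) * np.linalg.norm(visual_feature) + 1e-12
    sims = preds @ visual_feature / denom
    order = np.argsort(-sims)
    return [(sentences[i], float(sims[i])) for i in order]


if __name__ == "__main__":
    vocab = {"a": 0, "dog": 1, "cat": 2, "runs": 3}
    layers = init_mlp([len(vocab), 64, 5])  # 5-dim stand-in for a deep visual feature
    image_feature = np.ones(5)  # placeholder for a ConvNet feature of a query image
    for sentence, score in rank_sentences(
        image_feature, ["a dog runs", "a cat", "cat runs"], vocab, layers
    ):
        print(f"{score:+.3f}  {sentence}")
```

In this sketch the image or video side is never projected: its feature is used as-is, and only the text side is transformed, which is the sense in which matching happens "in a visual space only".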

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques