Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction
Jianfeng Dong, Xirong Li, Cees G. M. Snoek

TL;DR
This paper introduces Word2VisualVec, a neural network that predicts visual features from text, so that images and videos are matched to sentences in the visual feature space rather than in a joint embedding space.
Contribution
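The paper proposes a deep neural network that combines sentence vectorization with a multi-layer perceptron to predict visual features directly from text. For video, the prediction target is extended to 3-D ConvNet features and a visual-audio representation. The approach is reported to achieve state-of-the-art results.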
Findings
Outperforms existing methods on four benchmarks.
Effective in matching images and videos to descriptive sentences.
Versatile architecture adaptable to different visual features.
Abstract
This paper strives to find the sentence best describing the content of an image or video. Different from existing works, which rely on a joint subspace for image / video to sentence matching, we propose to do so in a visual space only. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a deep visual encoding of textual input based on sentence vectorization and a multi-layer perceptron. We thoroughly analyze its architectural design, by varying the sentence vectorization strategy, network depth and the deep feature to predict for image to sentence matching. We also generalize Word2VisualVec for matching a video to a sentence, by extending the predictive abilities to 3-D ConvNet features as well as a visual-audio representation. Experiments on four challenging image and video benchmarks detail Word2VisualVec's properties, capabilities for image and…
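The pipeline described in the abstract (vectorize a sentence, map it to a visual feature with a multi-layer perceptron, then match in the visual space) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words vectorization, the layer sizes, the random untrained weights, the cosine similarity, and all function names (`bag_of_words`, `mlp_forward`, `rank_sentences`) are assumptions made for the example.

```python
import math
import random


def bag_of_words(sentence, vocab):
    """Vectorize a sentence as word counts over a fixed vocabulary (assumed strategy)."""
    vec = [0.0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


def init_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases: an untrained placeholder."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def mlp_forward(x, layers):
    """Forward pass: ReLU on hidden layers, linear output (an assumption)."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(visual_feature, sentences, vocab, layers):
    """Score each sentence by similarity between its predicted and the actual visual feature."""
    scored = [
        (cosine(visual_feature, mlp_forward(bag_of_words(s, vocab), layers)), s)
        for s in sentences
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy data: three candidate sentences and a made-up 4-dimensional "visual feature".
rng = random.Random(0)
sentences = ["a dog runs on grass", "a man plays guitar", "two cats sleep"]
words = sorted({w for s in sentences for w in s.lower().split()})
vocab = {w: i for i, w in enumerate(words)}
layers = [init_layer(len(vocab), 8, rng), init_layer(8, 4, rng)]
image_feature = [rng.random() for _ in range(4)]

ranking = rank_sentences(image_feature, sentences, vocab, layers)
for score, sentence in ranking:
    print(f"{score:+.3f}  {sentence}")
```

Because the weights here are random, the resulting order carries no meaning. In a trained model the network would be fitted so that the predicted vector for a sentence lies close to the visual feature of the image or video it describes, and the top-ranked sentence would be returned as the best description.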
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
