Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath, James Glass

TL;DR
This paper introduces a deep learning model that aligns spoken captions with images by learning a shared semantic space, enabling cross-modal retrieval and annotation.
Contribution
It presents a novel multimodal embedding approach that jointly models speech and images at the word level using convolutional neural networks.
Findings
Effective cross-modal retrieval demonstrated on Flickr8k dataset
Successful alignment of spoken captions with corresponding images
Augmented dataset with 40,000 spoken captions for evaluation
Abstract
In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
