Deep Multimodal Semantic Embeddings for Speech and Images

David Harwath; James Glass

arXiv:1511.03690 · cs.CV · November 13, 2015

TL;DR

This paper introduces a deep learning model that aligns spoken captions with images by learning a shared semantic space, enabling cross-modal retrieval and annotation.

Contribution

The paper pairs two convolutional neural networks, one modeling visual objects and one modeling speech at the word level, and ties them together with an embedding and alignment model that learns a joint semantic space over both modalities.

Findings

01

Model evaluated on image search and annotation tasks on the Flickr8k dataset

02

Spoken captions and their corresponding images are aligned through a learned joint semantic space

03

Flickr8k augmented with 40,000 spoken captions collected using Amazon Mechanical Turk

Abstract

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.
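
The image search and annotation tasks mentioned in the abstract can be pictured as ranking in the joint semantic space. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each image and each spoken caption has already been mapped to a fixed-length vector, that similarity is cosine similarity, and that there is exactly one matching caption per image. The function names, the toy vectors, and the recall-at-k metric are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose matching candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, cand) for cand in candidates]
        # Candidate indices sorted from most to least similar to the query.
        ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy embeddings, made up for illustration: entry i of each list is a matched pair.
image_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
caption_vecs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.6, 0.1, 0.5]]

# Image search: a spoken caption is the query and images are ranked.
search_r1 = recall_at_k(caption_vecs, image_vecs, k=1)
# Annotation: an image is the query and spoken captions are ranked.
annotation_r1 = recall_at_k(image_vecs, caption_vecs, k=1)

print(f"image search recall@1: {search_r1:.2f}")
print(f"annotation recall@1:   {annotation_r1:.2f}")
```

The same ranking routine serves both tasks; only the roles of query and candidate are swapped. In a real setup the vectors would come from the two networks' outputs rather than being written by hand.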
