SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set

William Havard; Laurent Besacier; Olivier Rosec

arXiv:1707.08435·cs.CL·November 24, 2020

TL;DR

SPEECH-COCO is a dataset of 616,767 synthesized spoken captions (more than 600 hours) paired with MSCOCO images. The speech is generated with text-to-speech and made more natural with added disfluencies and speed perturbation, which supports research on multimodal speech and vision tasks.

Contribution

This work introduces a new large-scale dataset of spoken image captions with detailed time-aligned annotations, facilitating multimodal speech and vision research.

Findings

01. Created a dataset of 616,767 spoken captions (more than 600 hours)

02. Added disfluencies and speed perturbation so the synthesized speech sounds more natural

03. Demonstrated potential for unsupervised speech pattern discovery in a preliminary study on a 10-hour subset

Abstract

This paper presents an augmentation of the MSCOCO dataset in which speech is added to images and text. Speech captions are generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (more than 600h) paired with images. Disfluencies and speed perturbation are added to the signal so that it sounds more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode of each word, syllable, and phoneme in the spoken caption. Such a corpus could be used for Language and Vision (LaVi) tasks that take speech as input or output instead of text. Investigating multimodal learning schemes for unsupervised speech pattern discovery is also possible with this corpus, as demonstrated by a preliminary study conducted on a subset of the corpus (10h, 10k spoken captions). The dataset is available on Zenodo: https://zenodo.org/record/4282267
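The pairing of each WAV file with a JSON timecode file can be illustrated with a short sketch. The abstract does not specify the JSON layout, so the field names used here ("timecode", "words", "value", "begin", "end"), the sample content, and the 16 kHz sample rate are all hypothetical placeholders, not the dataset's documented schema. The sketch only shows how word-level timecodes, once loaded, could be turned into time intervals and sample offsets for slicing the audio.

```python
import json

# Hypothetical annotation content. The keys "timecode", "words", "value",
# "begin", and "end" are assumptions made for this sketch; check the actual
# files distributed with the dataset for the real layout.
SAMPLE_ANNOTATION = """
{
  "wav": "example.wav",
  "timecode": {
    "words": [
      {"value": "a", "begin": 0.00, "end": 0.12},
      {"value": "dog", "begin": 0.12, "end": 0.48},
      {"value": "runs", "begin": 0.55, "end": 0.97}
    ]
  }
}
"""


def word_intervals(annotation):
    """Return (word, begin, end) tuples in seconds, sorted by start time."""
    words = annotation["timecode"]["words"]
    triples = ((w["value"], w["begin"], w["end"]) for w in words)
    return sorted(triples, key=lambda item: item[1])


def to_sample_index(seconds, sample_rate):
    """Convert a time in seconds to the nearest sample index."""
    return round(seconds * sample_rate)


annotation = json.loads(SAMPLE_ANNOTATION)
intervals = word_intervals(annotation)

# Assumed sample rate for illustration only.
ASSUMED_SAMPLE_RATE = 16000
for word, begin, end in intervals:
    start = to_sample_index(begin, ASSUMED_SAMPLE_RATE)
    stop = to_sample_index(end, ASSUMED_SAMPLE_RATE)
    print(f"{word}: {begin:.2f}s to {end:.2f}s (samples {start} to {stop})")
```

The same pattern would apply to syllable-level and phoneme-level timecodes if they are stored as similar lists of timed units.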

Code & Models

Repositories

William-N-Havard/SpeechCoco (official)

Taxonomy

Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and Audio Processing
