SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set
William Havard, Laurent Besacier, Olivier Rosec

TL;DR
SPEECH-COCO is a large dataset of more than 600,000 visually grounded spoken captions, synthesized with text-to-speech from MSCOCO captions, for research on multimodal speech and vision tasks. Disfluencies and speed perturbation are added so that the synthetic speech sounds more natural.
Contribution
This work introduces a new large-scale dataset of spoken image captions with detailed time-aligned annotations, facilitating multimodal speech and vision research.
Findings
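Created a dataset of 616,767 spoken captions totaling more than 600 hours of speech
Added disfluencies and speed perturbation to make the synthetic speech sound more natural
Demonstrated potential for unsupervised speech pattern discovery in a preliminary study on a 10-hour subset (10k captions)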
Abstract
This paper presents an augmentation of the MSCOCO dataset in which speech is added to the images and text. Speech captions are generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (more than 600h) paired with images. Disfluencies and speed perturbation are added to the signal to make it sound more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode of each word, syllable, and phoneme in the spoken caption. Such a corpus could be used for Language and Vision (LaVi) tasks that take speech, rather than text, as input or output. The corpus also makes it possible to investigate multimodal learning schemes for unsupervised speech pattern discovery, as demonstrated by a preliminary study conducted on a subset of the corpus (10h, 10k spoken captions). The dataset is available on Zenodo: https://zenodo.org/record/4282267
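To illustrate how the per-caption timecode files described in the abstract might be used, here is a minimal Python sketch that converts word-level times into sample offsets for slicing the paired WAV signal. The abstract does not give the JSON schema, so the field names (`words`, `value`, `begin`, `end`), the use of seconds as the time unit, and the 16 kHz sample rate are all assumptions made for illustration. Check the files distributed on Zenodo for the actual layout.

```python
import json

# Hypothetical annotation for one spoken caption. The field names ("words",
# "value", "begin", "end") and the time unit (seconds) are assumptions for
# illustration only; they are not taken from the dataset's documentation.
EXAMPLE_JSON = """
{
  "caption": "a dog runs",
  "words": [
    {"value": "a",    "begin": 0.00, "end": 0.12},
    {"value": "dog",  "begin": 0.12, "end": 0.48},
    {"value": "runs", "begin": 0.48, "end": 0.95}
  ]
}
"""


def word_spans(annotation, sample_rate=16000):
    """Return (word, start_sample, end_sample) for each annotated word.

    sample_rate is an assumed value; read the real rate from the WAV header.
    """
    spans = []
    for word in annotation["words"]:
        start = round(word["begin"] * sample_rate)
        end = round(word["end"] * sample_rate)
        spans.append((word["value"], start, end))
    return spans


annotation = json.loads(EXAMPLE_JSON)
spans = word_spans(annotation)
for value, start, end in spans:
    print(f"{value}: samples {start} to {end}")
```

The same pattern would apply to syllable-level or phoneme-level entries if the files expose them as similar lists. The resulting sample offsets can be used to cut word-sized segments out of the waveform, for example when evaluating unsupervised speech pattern discovery.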
Taxonomy
Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and Audio Processing
