Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos
Benet Oriol, Jordi Luque, Ferran Diego, Xavier Giro-i-Nieto

TL;DR
This paper introduces a multi-modal embedding approach that combines speech, images, and text to improve retrieval, and shows that textual transcriptions improve embedding quality even when text is not used in the retrieval task itself.
Contribution
The study presents a novel tri-modal embedding training method that incorporates textual transcriptions alongside speech and images, improving representation quality.
Findings
Enhanced embedding performance on EPIC-Kitchen and Places Audio Caption datasets.
Improved image and speech retrieval accuracy.
Text transcriptions aid training even when text isn't used in retrieval tasks.
Abstract
In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: images, spoken narratives, and textual narratives. The proposed methodology starts from a baseline system that builds an embedding space trained with only spoken narratives and image cues. Our experiments on the EPIC-Kitchen and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps the training procedure and yields better embedding representations. The triad of speech, image, and words allows for a better estimate of the point embedding and improves performance on tasks such as image and speech retrieval, even when the third modality, text, is not present in the task.
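The abstract does not state the training objective, so the following is only an illustrative sketch of how a third modality might enter a joint-embedding loss. It assumes a margin-based ranking loss over cosine similarities, summed across the three modality pairs; the function names, the margin value, and the equal weighting of pairs are all hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pair_ranking_loss(a, b, margin=0.2):
    """Hinge ranking loss between two batches; row i of `a` matches row i of `b`.

    Assumed form for illustration only, not the authors' stated loss.
    """
    sim = l2_normalize(a) @ l2_normalize(b).T      # (N, N) cosine similarities
    pos = np.diag(sim)                             # similarities of matched pairs
    # Penalize mismatched pairs that score within `margin` of the matched pair.
    cost_ab = np.maximum(0.0, margin + sim - pos[:, None])  # a_i against wrong b_j
    cost_ba = np.maximum(0.0, margin + sim - pos[None, :])  # b_j against wrong a_i
    off_diagonal = 1.0 - np.eye(len(a))            # ignore the matched pairs
    return float(((cost_ab + cost_ba) * off_diagonal).sum() / len(a))

def trimodal_loss(image, speech, text, margin=0.2):
    # Hypothetical tri-modal objective: sum the pairwise loss over all three pairs.
    return (pair_ranking_loss(image, speech, margin)
            + pair_ranking_loss(image, text, margin)
            + pair_ranking_loss(speech, text, margin))

# Toy batch: four perfectly aligned, mutually orthogonal embeddings per modality.
aligned = np.eye(4)
loss_aligned = trimodal_loss(aligned, aligned, aligned)
# Reversing the speech rows breaks the image-speech and speech-text pairing.
loss_misaligned = trimodal_loss(aligned, aligned[::-1], aligned)
```

Under this sketch, text only contributes extra terms during training. At retrieval time, image and speech embeddings can still be compared directly, which is consistent with the abstract's claim that text helps even when it is absent from the task.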
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Music and Audio Processing
