Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Masood S. Mortazavi

arXiv:2010.15288·cs.LG·October 30, 2020

TL;DR

This paper demonstrates that semantic alignment between speech and images can be achieved without relying on pre-trained models, challenging the common assumption that transfer learning is necessary for high recall in cross-modal retrieval tasks.

Contribution

It shows that with appropriate neural architectures and large datasets, effective speech-image semantic alignment can be obtained without pre-trained features or initialization.

Findings

01

High recall rates achieved without transfer learning

02

Large datasets enable effective semantic alignment

03

Audio embedder size can be reduced with minimal performance loss

Abstract

Semantically-aligned $(\mathit{speech}, \mathit{image})$ datasets can be used to explore "visually-grounded speech". In a majority of existing investigations, features of an image signal are extracted using neural networks "pre-trained" on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without "transfer learning" through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low rates of recall in $\mathit{speech} \to \mathit{image}$ and $\mathit{image} \to \mathit{speech}$ queries. Choosing appropriate neural architectures for encoders in the speech and image branches and using large datasets, one can obtain competitive recall rates without any reliance on any pre-trained initialization or feature extraction: $(\mathit{speech}, \mathit{image})$ semantic alignment and $\mathit{speech} \to \mathit{image}$ …
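
The recall rates mentioned in the abstract are typically reported as recall@K over paired embeddings. The following is a minimal sketch of how such a metric could be computed, not the paper's own evaluation code. It assumes cosine similarity as the scoring function and that the i-th speech embedding is paired with the i-th image embedding; the function names `cosine` and `recall_at_k` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, targets, k):
    """Fraction of queries whose paired target (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, target) for target in targets]
        # Indices of targets sorted from most to least similar to the query.
        ranked = sorted(range(len(targets)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy 2-D embeddings (illustrative only); row i of each list forms a pair.
speech_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_emb = [[0.9, 0.1], [0.8, 0.9], [0.2, 0.9]]

# speech -> image retrieval: speech embeddings are queries, images are targets.
speech_to_image_r1 = recall_at_k(speech_emb, image_emb, k=1)
speech_to_image_r2 = recall_at_k(speech_emb, image_emb, k=2)

# image -> speech retrieval: swap the roles of the two modalities.
image_to_speech_r1 = recall_at_k(image_emb, speech_emb, k=1)

print(speech_to_image_r1, speech_to_image_r2, image_to_speech_r1)
```

Swapping the argument order is all that distinguishes the two query directions in this sketch; a real evaluation would use the encoders' output embeddings for a held-out test set instead of hand-written vectors.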
