Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
Masood S. Mortazavi

TL;DR
This paper demonstrates that semantic alignment between speech and images can be achieved without relying on pre-trained models, challenging the common assumption that transfer learning is necessary for high recall in cross-modal retrieval tasks.
Contribution
It shows that with appropriate neural architectures and large datasets, effective speech-image semantic alignment can be obtained without pre-trained features or initialization.
Findings
High recall rates achieved without transfer learning
Large datasets enable effective semantic alignment
Audio embedder size can be reduced with minimal performance loss
Abstract
Semantically-aligned datasets can be used to explore "visually-grounded speech". In a majority of existing investigations, features of an image signal are extracted using neural networks "pre-trained" on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without "transfer learning" through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low rates of recall in speech-to-image and image-to-speech queries. Choosing appropriate neural architectures for encoders in the speech and image branches and using large datasets, one can obtain competitive recall rates without relying on any pre-trained initialization or feature extraction: semantic alignment and …
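To make the recall figures mentioned in the abstract concrete, here is a minimal sketch of how recall at K is commonly computed for cross-modal retrieval in both query directions. It is not the paper's evaluation code: the use of cosine similarity, the names `cosine` and `recall_at_k`, and the toy embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [cosine(q, c) for c in candidates]
        # Candidate indices sorted from most to least similar to the query.
        ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Made-up toy embeddings (assumption): speech[i] is paired with image[i].
speech = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.6, 0.0, 0.8]]
image = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.9, 0.1, 0.3]]

speech_to_image_r1 = recall_at_k(speech, image, k=1)
image_to_speech_r1 = recall_at_k(image, speech, k=1)
```

On these toy values the two directions differ: every speech query retrieves its own image first, while the third image retrieves a different utterance first, so image-to-speech recall at 1 is lower. This is why the two directions are usually reported separately.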
