TL;DR
This paper evaluates and improves visually grounded speech models' ability to align spoken words with visual objects, introducing new metrics and a model variant with cross-modal attention that enhances alignment accuracy.
Contribution
It formalizes the audiovisual alignment problem, proposes systematic evaluation metrics, and introduces a new VGS model with cross-modal attention for better alignment performance.
Findings
Cross-modal attention improves alignment accuracy.
New metrics effectively evaluate audiovisual alignment.
Enhanced models show better performance in both retrieval and alignment tasks.
Abstract
Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever being explicitly trained for object localization or word recognition. As the main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task utilizing a cross-modal attention layer. We test our model and…
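The abstract does not spell out how the audiovisual alignment tensor is built. As a rough illustration only, the sketch below assumes one common formulation: a dot-product similarity between every audio frame embedding and every spatial location of an image feature map. The function name `alignment_tensor`, the shapes, and the random features are hypothetical and are not taken from the paper.

```python
import numpy as np

def alignment_tensor(audio_feats, image_feats):
    """Dot-product similarity between audio frames and image locations.

    audio_feats: (T, D) array, one D-dim embedding per audio frame.
    image_feats: (H, W, D) array, one D-dim embedding per spatial cell.
    Returns a (T, H, W) tensor of similarity scores.
    This is an assumed formulation, not the paper's exact definition.
    """
    return np.einsum("td,hwd->thw", audio_feats, image_feats)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Random stand-in features (hypothetical sizes).
rng = np.random.default_rng(0)
T, H, W, D = 6, 4, 4, 8
audio = rng.standard_normal((T, D))
image = rng.standard_normal((H, W, D))

A = alignment_tensor(audio, image)

# For each audio frame: a distribution over image locations.
spatial = softmax(A.reshape(T, -1), axis=1).reshape(T, H, W)
# For each image location: a distribution over audio frames.
temporal = softmax(A, axis=0)
```

Normalizing over space answers "where in the image does this stretch of speech point?", while normalizing over time answers "when in the utterance is this image region mentioned?". Both views come from the same tensor.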
