Towards visually prompted keyword localisation for zero-resource spoken languages

Leanne Nortje; Herman Kamper

arXiv:2210.06229·cs.CL·October 13, 2022

TL;DR

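This paper introduces a speech-vision model for visually prompted keyword localisation in zero-resource spoken languages: given an image of a keyword, the model detects whether the keyword occurs in a spoken utterance and predicts where. It outperforms both an existing speech-vision model and a visual bag-of-words model.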

Contribution

It proposes a new localising attention mechanism and keyword sampling scheme for improved zero-resource keyword localisation from visual cues.

Findings

1. 16% relative improvement in localisation F1 over the visual BoW model
2. Outperforms an existing speech-vision model in keyword detection and localisation
3. Effective in zero-resource settings, where no transcribed speech is available

Abstract

Imagine being able to show a system a visual depiction of a keyword and finding spoken utterances that contain this keyword from a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs. To do VPKL, we propose a speech-vision model with a novel localising attention mechanism which we train with a new keyword sampling scheme. We show that these innovations give improvements in VPKL over an existing speech-vision model. We also compare to a visual bag-of-words (BoW) model where images are automatically tagged with visual labels and paired with unlabelled speech. Although this visual BoW can be queried directly with a written keyword (while ours takes image queries), our new model still outperforms the visual BoW in both detection and…
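
The VPKL task described in the abstract has two outputs per utterance: a detection decision and a predicted location. The sketch below illustrates only that input/output shape and is not the paper's localising attention mechanism. It assumes an image query and each speech frame have already been embedded into a shared space, and it uses plain dot-product similarity with a maximum over frames. The function name `detect_and_localise`, the `threshold` value, and the toy embeddings are all hypothetical.

```python
from typing import List, Optional, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def detect_and_localise(
    image_embedding: List[float],
    frame_embeddings: List[List[float]],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> Tuple[bool, Optional[int], float]:
    """Score every speech frame against the image query.

    Returns (detected, frame_index, score). The utterance-level score is
    the maximum frame similarity; the keyword is predicted to occur at the
    frame that attains it. This is an assumed scoring rule, not the
    mechanism proposed in the paper.
    """
    if not frame_embeddings:
        return False, None, float("-inf")

    scores = [dot(image_embedding, frame) for frame in frame_embeddings]
    best_frame = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_frame]

    detected = best_score >= threshold
    return detected, (best_frame if detected else None), best_score


# Toy example: a 2-dimensional query and three speech frames.
query = [1.0, 0.0]
frames = [[0.1, 0.9], [0.9, 0.1], [0.2, 0.3]]
detected, frame_index, score = detect_and_localise(query, frames)
```

In this toy example the second frame is the most similar to the query, so the keyword is detected and localised there. Raising `threshold` above the best score turns the same utterance into a non-detection with no predicted location.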

Code & Models

Repositories

leannenortje/davenet_vpkl
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning