Visually grounded cross-lingual keyword spotting in speech
Herman Kamper, Michael Roth

TL;DR
This paper explores using visual grounding to enable cross-lingual keyword spotting in speech, allowing retrieval of spoken utterances in a low-resource language using high-resource language text queries, without requiring transcriptions.
Contribution
The paper introduces an approach that uses images paired with speech as the only supervision for cross-lingual keyword spotting, so that no parallel transcriptions or translations are required.
Findings
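The model achieves a precision at ten (P@10) of 58% in cross-lingual keyword retrieval, using English speech with German text queries.
Most erroneous retrievals contain equivalent or semantically related keywords, which suggests the model captures some semantics rather than only exact keyword matches.
Excluding these semantically related errors raises P@10 to 91%.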
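For readers unfamiliar with the metric, the following is a minimal sketch of how precision at ten could be computed for a single keyword: rank all utterances by the model's score for that keyword and count how many of the top ten actually contain it. The function name `precision_at_k`, the toy scores, and the `exact`/`related` labels are hypothetical illustrations and are not taken from the paper; the exact evaluation protocol used by the authors (for example, how scores are averaged over keywords) is not stated here and is assumed.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the k highest-scoring utterances that are marked relevant.

    Assumption: the denominator is always k, even if fewer than k
    utterances are available.
    """
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of utterances sorted from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if relevant[i])
    return hits / k


# Toy data (made up): model scores for one keyword over 12 utterances.
scores = [0.95, 0.91, 0.88, 0.80, 0.77, 0.70, 0.66, 0.61, 0.55, 0.52, 0.30, 0.10]
# True if the utterance contains an exact translation of the keyword.
exact = [True, True, False, True, False, True, True, False, True, False, True, False]
# True if the utterance contains only a semantically related word.
related = [False, False, True, False, True, False, False, True, False, False, False, False]

# Strict evaluation: only exact matches count as correct.
p_exact = precision_at_k(scores, exact, k=10)

# Lenient evaluation: semantically related retrievals are not counted as errors.
lenient = [e or r for e, r in zip(exact, related)]
p_lenient = precision_at_k(scores, lenient, k=10)

print(f"P@10 (exact): {p_exact:.2f}")
print(f"P@10 (exact or related): {p_lenient:.2f}")
```

The gap between the two numbers in this toy example mirrors the kind of comparison reported in the findings: the strict score penalizes retrievals that a human would judge as semantically reasonable, while the lenient score does not.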
Abstract
Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available. We ask whether visual grounding can be used for cross-lingual keyword spotting: given a text keyword in one language, the task is to retrieve spoken utterances containing that keyword in another language. This could enable searching through speech in a low-resource language using text queries in a high-resource language. As a proof-of-concept, we use English speech with German queries: we use a German visual tagger to add keyword labels to each training image, and then train a neural network to map English speech to German keywords. Without seeing parallel speech-transcriptions or translations, the model achieves a precision at ten of 58%. We show that most erroneous retrievals contain equivalent or semantically relevant keywords; excluding…
