Seeing wake words: Audio-visual Keyword Spotting

Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

arXiv:2009.01225 · cs.CV · September 7, 2020 · 22 cites

TL;DR

This paper introduces KWS-Net, a novel zero-shot audio-visual keyword spotting architecture that improves detection accuracy on in-the-wild videos, generalizes across languages, and outperforms previous state-of-the-art methods.

Contribution

The paper presents a new convolutional architecture, KWS-Net, that enhances visual keyword spotting by using a similarity map as an intermediate representation, and demonstrates cross-language generalization with less language-specific data than was needed for English.

Findings

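1. KWS-Net outperforms previous visual keyword spotting methods.
2. When audio is available, adding visual keyword spotting improves performance for both clean and noisy audio.
3. The method generalizes to French and German: fine-tuning the network pre-trained on English reaches performance comparable to English with less language-specific data.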

Abstract

The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for in-the-wild videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into (i) sequence matching, and (ii) pattern detection, to decide whether the word is there and when; (2) we demonstrate that if audio is available, visual keyword spotting improves the performance for both clean and noisy audio signals. Finally, (3) we show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language-specific data, by fine-tuning the network pre-trained on English. The method exceeds the performance of the previous…
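
The two-stage split described in the abstract (sequence matching, then pattern detection on a similarity map) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `similarity_map` and `detect_keyword`, the toy embeddings, and the 0.8 threshold are all assumptions made for illustration. In particular, the abstract describes a convolutional architecture; here a hand-written diagonal-scoring rule stands in for that learned detector so the example stays dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_map(keyword_emb, video_emb):
    """Stage (i), sequence matching: one row per keyword unit, one column per video frame."""
    return [[cosine(k, f) for f in video_emb] for k in keyword_emb]


def detect_keyword(sim, threshold=0.8):
    """Stage (ii), pattern detection (hypothetical stand-in for a learned detector).

    Slides a diagonal over the map and scores each start frame by the mean
    similarity along that diagonal. Returns (present, start_frame, best_score).
    """
    n_units, n_frames = len(sim), len(sim[0])
    best_score, best_start = -1.0, None
    for start in range(n_frames - n_units + 1):
        score = sum(sim[i][start + i] for i in range(n_units)) / n_units
        if score > best_score:
            best_score, best_start = score, start
    present = best_score >= threshold
    return present, (best_start if present else None), best_score


# Toy data (assumed): three keyword units and five video frames in a 3-d feature space.
keyword = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

present, start, score = detect_keyword(similarity_map(keyword, video))
print(present, start)  # the keyword pattern begins at frame index 1
```

The sketch assumes one frame per keyword unit, which real speech does not satisfy; it is only meant to show how a single map can answer both "whether" (a strong pattern exists) and "when" (where that pattern sits along the frame axis).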

Code & Models

Repositories

lilianemomeni/KWS-Net (PyTorch)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Face recognition and analysis