Visual Keyword Spotting with Attention

K R Prajwal; Liliane Momeni; Triantafyllos Afouras; Andrew Zisserman

arXiv:2110.15957 · cs.CV · November 1, 2021 · 6 citations

TL;DR

This paper introduces the Transpotter, a Transformer-based model that spots spoken keywords in silent video using full cross-modal attention between a visual encoding of the video and a phonetic encoding of the keyword. It outperforms prior visual keyword spotting and lip reading methods on LRW, LRS2, and LRS3, and can spot words in isolated mouthings in sign language videos.

Contribution

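We propose the Transpotter, an architecture for visual keyword spotting that uses full cross-modal attention between the visual and phonetic streams. It achieves state-of-the-art results on LRW, LRS2, and LRS3 and can spot words in isolated mouthings in sign language videos.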

Findings

01

Outperforms prior state-of-the-art visual keyword spotting and lip reading methods on the LRW, LRS2, and LRS3 datasets by a large margin.

02

Spots words under the extreme conditions of isolated mouthings in sign language videos.

03

Shows significant improvement in keyword localization accuracy.

Abstract

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
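
The two-stream idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, random placeholder weights, and a hypothetical per-frame linear scoring head (`w_frame`), and every function and parameter name here is invented for illustration. The phonetic tokens of the keyword and the visual tokens of the video are concatenated into one sequence, so every token can attend to every other token across both modalities; a per-frame probability is then read off the video positions to indicate where the keyword might occur.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(dim, rng):
    """Placeholder parameters; a real model would learn these."""
    return {
        "Wq": random_matrix(dim, dim, rng),
        "Wk": random_matrix(dim, dim, rng),
        "Wv": random_matrix(dim, dim, rng),
        "w_frame": [rng.uniform(-1.0, 1.0) for _ in range(dim)],
    }


def joint_attention(tokens, params):
    """Single-head self-attention over the joint (phoneme + video) sequence."""
    dim = len(tokens[0])
    queries = [[dot(row, t) for row in params["Wq"]] for t in tokens]
    keys = [[dot(row, t) for row in params["Wk"]] for t in tokens]
    values = [[dot(row, t) for row in params["Wv"]] for t in tokens]
    attended = []
    for q in queries:
        # Each token attends to all tokens of both modalities.
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys])
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return attended


def spot_keyword(phoneme_tokens, video_tokens, params):
    """Return one keyword-presence probability per video frame."""
    tokens = phoneme_tokens + video_tokens
    attended = joint_attention(tokens, params)
    video_out = attended[len(phoneme_tokens):]  # keep only the video positions
    return [sigmoid(dot(params["w_frame"], h)) for h in video_out]


rng = random.Random(0)
params = init_params(8, rng)
phonemes = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]   # 4 phoneme tokens
frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]    # 10 video frames
frame_probs = spot_keyword(phonemes, frames, params)
best_frame = max(range(len(frame_probs)), key=frame_probs.__getitem__)
print(len(frame_probs), best_frame)
```

With untrained weights the probabilities are meaningless; the sketch only shows the data flow. Concatenating the streams before attention is what distinguishes this from a design in which the keyword is compared with the video only after each stream has been encoded separately.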

Code & Models

Repositories

prajwalkr/transpotter (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications