AV Taris: Online Audio-Visual Speech Recognition

George Sterpu; Naomi Harte

arXiv:2012.07467·eess.AS·December 15, 2020

TL;DR

AV Taris is a real-time, fully differentiable neural network model that integrates audio and visual speech cues to improve recognition accuracy in challenging listening environments, surpassing audio-only methods.

Contribution

The paper introduces AV Taris, a novel online audio-visual speech recognition model combining AV Align and Taris, enabling real-time decoding with improved robustness over audio-only systems.

Findings

1. AV Taris outperforms audio-only Taris in real-time speech recognition.

2. AV Taris shows approximately 3% degradation compared to full-sentence AV Align.

3. The model demonstrates the utility of visual cues in noisy, real-world conditions.

Abstract

In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions. In more demanding situations involving distant microphones, overlapped speech, background noise, or natural dialogue structures, the ASR error rate is at least an order of magnitude higher. The visual modality of speech carries the potential to partially overcome these challenges and contribute to the sub-tasks of speaker diarisation, voice activity detection, and the recovery of the place of articulation, and can compensate for up to 15 dB of noise on average. This article develops AV Taris, a fully differentiable neural network model capable of decoding audio-visual speech in real time. We achieve this by connecting two recently proposed models for audio-visual speech integration and online speech recognition, namely…

Code & Models

Repositories

georgesterpu/Taris (TensorFlow, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis