AV Taris: Online Audio-Visual Speech Recognition
George Sterpu, Naomi Harte

TL;DR
AV Taris is a real-time, fully differentiable neural network model that integrates audio and visual speech cues to improve recognition accuracy in challenging listening environments, surpassing audio-only methods.
Contribution
The paper introduces AV Taris, a novel online audio-visual speech recognition model combining AV Align and Taris, enabling real-time decoding with improved robustness over audio-only systems.
Findings
AV Taris outperforms audio-only Taris in real-time speech recognition.
AV Taris performs approximately 3% worse than AV Align, which decodes only after seeing the full sentence.
The model demonstrates the utility of visual cues in noisy, real-world conditions.
Abstract
In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions. In more demanding situations involving distant microphones, overlapped speech, background noise, or natural dialogue structures, the ASR error rate is at least an order of magnitude higher. The visual modality of speech carries the potential to partially overcome these challenges and contribute to the sub-tasks of speaker diarisation, voice activity detection, and the recovery of the place of articulation, and can compensate for up to 15 dB of noise on average. This article develops AV Taris, a fully differentiable neural network model capable of decoding audio-visual speech in real time. We achieve this by connecting two recently proposed models for audio-visual speech integration and online speech recognition, namely…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
