Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

Alexandros Haliassos; Rodrigo Mira; Honglie Chen; Zoe Landgraf; Stavros Petridis; Maja Pantic

arXiv:2411.02256·cs.CV·November 5, 2024

TL;DR

This paper introduces a single unified model for auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR). By combining semi-supervised pseudo-labelling with self-supervised pre-training, it avoids the memory overhead and redundancy of separate per-task models and reports state-of-the-art results on LRS3, LRS2, and WildVSR.

Contribution

It presents a novel unified training framework for ASR, VSR, and AVSR, including a pseudo-labelling approach and self-supervised pre-training, enhancing performance and reducing redundancies.

Findings

01 Unified model outperforms recent methods on multiple datasets.

02 Semi-supervised and self-supervised strategies improve accuracy.

03 Achieves state-of-the-art results on LRS3, LRS2, and WildVSR.

Abstract

Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised…
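The abstract does not spell out how the greedy pseudo-labelling works. As a rough illustration only, the sketch below assumes that "greedy" means taking the highest-scoring token for each frame of an unlabelled sample and collapsing the result in the style of greedy CTC decoding. The function name `greedy_ctc_pseudo_label`, the blank index `BLANK = 0`, and the toy scores are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def greedy_ctc_pseudo_label(
    frame_scores: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Turn per-frame class scores into a pseudo-label token sequence.

    For every frame, pick the highest-scoring class (the greedy step),
    then merge consecutive repeats and drop blank symbols.
    """
    labels: List[int] = []
    prev: Optional[int] = None
    for scores in frame_scores:
        # Greedy choice: index of the largest score in this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Keep the token only if it is new and not the blank symbol.
        if best != prev and best != blank:
            labels.append(best)
        prev = best
    return labels


# Toy scores for five frames over three classes (class 0 is the blank).
toy_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
pseudo_label = greedy_ctc_pseudo_label(toy_scores)
```

In a semi-supervised setup of this kind, a sequence such as `pseudo_label` would stand in for the missing transcript when training on the unlabelled sample; how the paper's model actually produces and consumes its pseudo-labels is described in the full text, not in this summary.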

Code & Models

Repositories

ahaliassos/usr (PyTorch, official)

Videos

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs · SlidesLive

Taxonomy

Topics: Speech and Audio Processing