End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

Otavio Braga; Takaki Makino; Olivier Siohan; Hank Liao

arXiv:2205.05586 · eess.AS · May 12, 2022

Open Access

TL;DR

This paper introduces a fully differentiable multi-person audio-visual speech recognition model. An attention layer in the ASR encoder soft-selects the appropriate face track from multiple candidates, so no separate speaker face selection model is needed.

Contribution

It presents an attention-based model that combines speaker face selection and audio-visual speech recognition in a single differentiable model for multi-person scenarios, replacing the usual pipeline of two separate models.

Findings

01. Achieves near-oracle face selection accuracy with minimal WER increase.

02. Utilizes over 30,000 hours of YouTube videos for training.

03. Demonstrates benefits of visual signals over audio-only ASR.
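
The findings above are stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring pipeline is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer_example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A "minimal WER increase" in this sense means the hypothesis transcripts produced with automatic face selection need only slightly more edits than those produced when the correct face is given.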

Abstract

Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube…
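
The soft selection of a face track described in the abstract can be illustrated with a small sketch. It assumes scaled dot-product attention in which an audio feature vector acts as the query and each candidate face track contributes one visual feature vector serving as both key and value. That formulation, the function name `soft_select_face_track`, and the toy feature values are all assumptions for illustration; this page does not give the paper's exact layer definition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select_face_track(audio_query, track_features):
    """Blend candidate face-track features for one time step.

    audio_query:    audio feature vector (hypothetical, length D).
    track_features: one visual feature vector of length D per face track.
    Returns (weights, fused), where weights sum to 1 across tracks and
    fused is the attention-weighted average of the track features.
    """
    dim = len(audio_query)
    # Score each track by its scaled dot product with the audio query.
    scores = [
        sum(q * k for q, k in zip(audio_query, track)) / math.sqrt(dim)
        for track in track_features
    ]
    weights = softmax(scores)
    # Soft selection: a weighted sum instead of a hard argmax, so the
    # operation stays differentiable with respect to the scores.
    fused = [
        sum(w * track[i] for w, track in zip(weights, track_features))
        for i in range(dim)
    ]
    return weights, fused


# Toy example: three on-screen faces; the first one correlates with the audio.
audio_query = [1.0, 0.0, 1.0, 0.0]
track_features = [
    [0.9, 0.1, 1.1, 0.0],
    [-1.0, 0.5, -0.8, 0.2],
    [0.0, 1.0, 0.0, 1.0],
]
weights, fused = soft_select_face_track(audio_query, track_features)
```

Because the weights come from a softmax rather than a hard choice, gradients from the recognition loss can reach the selection scores, which is what allows face selection and recognition to be trained together as one model.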

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Blind Source Separation Techniques