End-to-End Multi-Person Audio/Visual Automatic Speech Recognition
Otavio Braga, Takaki Makino, Olivier Siohan, Hank Liao

TL;DR
This paper introduces a fully differentiable multi-person audio-visual speech recognition (A/V ASR) system that uses an attention layer to soft-select the speaking face track from multiple candidates, replacing the usual pipeline of a separate speaker face selection model followed by single-track A/V ASR.
Contribution
It presents a novel attention-based model that integrates face selection and speech recognition into a single differentiable framework for multi-person scenarios.
Findings
Selects face tracks nearly as well as an oracle selector, at the cost of only a small increase in word error rate (WER).
Trains on over 30,000 hours of YouTube video.
Shows that using the visual signal still improves on audio-only ASR.
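WER, the metric cited in the findings above, is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's scoring tool; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes simple whitespace tokenization (an illustrative choice).
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.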
Abstract
Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face in the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen, one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube…
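The soft selection described in the abstract can be illustrated with a small attention computation for a single time step. This is only a sketch under stated assumptions: the scaled dot-product score, the function name `soft_select_face`, and the feature shapes are hypothetical choices for illustration, since this summary does not specify the paper's scoring function or architecture details.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_select_face(
    audio_query: Vector, face_features: Sequence[Vector]
) -> Tuple[List[float], List[float]]:
    """Blend candidate face-track features for one time step.

    audio_query: audio-derived query vector (assumed to share the
        dimensionality of the face features).
    face_features: one visual feature vector per candidate face track.
    Returns the attention weights over tracks and their weighted average.
    """
    dim = len(audio_query)
    # Assumption: scaled dot-product scores between audio and each face track.
    scores = [dot(audio_query, f) / math.sqrt(dim) for f in face_features]
    weights = softmax(scores)
    # Weighted average of the face features: a "soft" choice of track that
    # stays differentiable, unlike a hard argmax selection.
    blended = [
        sum(w * f[j] for w, f in zip(weights, face_features))
        for j in range(dim)
    ]
    return weights, blended


# Toy example: the first track aligns with the audio query, so it should
# receive most of the attention weight.
weights, blended = soft_select_face(
    [1.0, 0.0],
    [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]],
)
```

Because the weights come from a softmax rather than a hard choice, gradients from the ASR loss can flow back into the scoring step, which is what lets face selection and recognition be trained as one model.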
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Blind Source Separation Techniques
