Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data
Luis Sanchez Tapia, Antonio Gomez, Mario Esparza, Venkatesh Jatla, Marios Pattichis, Sylvia Celedón-Pattichis, Carlos LópezLeiva

TL;DR
This paper introduces a bilingual speech recognition system that estimates speaker geometry from video to improve recognition accuracy in noisy, multi-speaker environments like classrooms, outperforming baseline and commercial systems.
Contribution
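An interactive video analysis system estimates 3D speaker geometry from video data; the geometry is used to generate realistic audio simulations with cross-talk and background noise, supporting speech recognition in challenging acoustic conditions.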
Findings
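Average error rate for speaker-to-microphone distance estimation reduced to 10.83%, from 33.12% for a baseline approach.
Recognition accuracy reached 27.92%, 1.5% better than Google Speech-to-text on the same dataset.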
Sensitivity increased from 24% to 38%.
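For readers unfamiliar with the distance metric, the following is a minimal sketch of how an average error rate for distance estimates could be computed, assuming it is a mean absolute percentage error. The summary does not give the paper's exact definition, so the formula, the function name mean_percent_error, and the sample distances are all illustrative assumptions.

```python
def mean_percent_error(estimated, actual):
    """Average of |estimate - actual| / actual, expressed as a percentage.

    Assumed metric: the paper's exact error definition is not given here.
    """
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - a) / a for e, a in zip(estimated, actual))
    return 100.0 * total / len(actual)


# Hypothetical speaker-to-microphone distances in meters (not from the paper).
actual_m = [1.0, 2.0, 4.0]
estimated_m = [1.1, 1.8, 4.4]

print(round(mean_percent_error(estimated_m, actual_m), 2))
```

Under this assumed definition, each estimate is normalized by its true distance, so errors for near and far speakers contribute equally to the average.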
Abstract
Speech recognition is very challenging in student learning environments that are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system in generating a complex audio dataset that contains significant cross-talk and background noise that approximate real-life classroom recordings. We then test our proposed system with real-life recordings. In terms of the distance of the speakers from the microphone, our interactive video analysis system obtained a better average error rate of 10.83% compared to 33.12% for a baseline approach. Our proposed system gave an accuracy of 27.92% that is 1.5% better than Google Speech-to-text on the same dataset. In terms of 9…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
