Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data

Luis Sanchez Tapia; Antonio Gomez; Mario Esparza; Venkatesh Jatla; Marios Pattichis; Sylvia Celedón-Pattichis; Carlos LópezLeiva

arXiv:2112.13463·cs.SD·December 28, 2021

TL;DR

This paper introduces a bilingual speech recognition system that estimates speaker geometry from video to improve recognition accuracy in noisy, multi-speaker environments like classrooms, outperforming baseline and commercial systems.

Contribution

The novel approach estimates 3D speaker geometry from video data to enhance speech recognition accuracy in challenging acoustic conditions.

Findings

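01. Average error rate in estimating speaker-to-microphone distance reduced to 10.83%, from 33.12% for a baseline approach.

02. Recognition accuracy reached 27.92%, which is 1.5% better than Google Speech-to-text on the same dataset.

03. Sensitivity increased from 24% to 38%.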

Abstract

Speech recognition is very challenging in student learning environments that are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system in generating a complex audio dataset that contains significant cross-talk and background noise that approximate real-life classroom recordings. We then test our proposed system with real-life recordings. In terms of the distance of the speakers from the microphone, our interactive video analysis system obtained a better average error rate of 10.83% compared to 33.12% for a baseline approach. Our proposed system gave an accuracy of 27.92% that is 1.5% better than Google Speech-to-text on the same dataset. In terms of 9…
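
The abstract describes using the estimated 3D speaker geometry to drive realistic audio simulations. As a rough illustration of why speaker-to-microphone distance matters, the sketch below mixes several speakers at a single microphone using only a propagation delay and 1/r attenuation. It is not the authors' simulation pipeline: the function name `simulate_mic_signal`, the free-field model (no reverberation or room effects), the 0.1 m gain clamp, and the example positions are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def simulate_mic_signal(sources, positions, mic_position, sample_rate):
    """Mix mono sources as heard at one microphone.

    Hypothetical free-field model: each source is delayed by its travel
    time to the microphone and scaled by 1/distance. Reverberation and
    background noise are deliberately left out.
    """
    # Straight-line distance from each speaker to the microphone, in metres.
    distances = [math.dist(p, mic_position) for p in positions]
    # Propagation delay, rounded to whole samples.
    delays = [round(d / SPEED_OF_SOUND * sample_rate) for d in distances]

    length = max(len(s) + k for s, k in zip(sources, delays))
    mix = [0.0] * length
    for samples, d, k in zip(sources, distances, delays):
        # 1/r spreading loss, clamped so a source at the mic does not blow up.
        gain = 1.0 / max(d, 0.1)
        for i, x in enumerate(samples):
            mix[k + i] += gain * x
    return mix


# Two illustrative speakers: one near the microphone, one across the table.
sample_rate = 16000
near = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(1600)]
far = [math.sin(2 * math.pi * 330 * n / sample_rate) for n in range(1600)]
mixture = simulate_mic_signal(
    [near, far],
    positions=[(0.5, 0.0, 0.0), (2.0, 1.0, 0.0)],
    mic_position=(0.0, 0.0, 0.0),
    sample_rate=sample_rate,
)
```

Under this simplified model, an error in the estimated distance changes both the arrival time and the relative loudness of each speaker in the mixture, which is one reason an accurate geometry estimate would matter when simulating cross-talk.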

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis