Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

Jeong Hun Yeo; Minsu Kim; Shinji Watanabe; Yong Man Ro

arXiv:2309.08535·cs.CV·January 15, 2024



Open Access · 1 Repo

TL;DR

This paper introduces a method for visual speech recognition in low-resource languages by automatically generating labels from a multilingual speech recognition model, achieving state-of-the-art results without human annotations.

Contribution

The study demonstrates that automatic labeling using Whisper can effectively replace human annotations, enabling high-performance VSR for low-resource languages.

Findings

01

Achieved VSR performance with automatic labels comparable to training on human-annotated labels

02

Generated 1,002 hours of labeled data for four low-resource languages

03

Set new state-of-the-art results on the mTEDx dataset

Abstract

This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited amount of labeled data. Unlike previous methods that tried to improve VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention. To this end, we employ a Whisper model, which can conduct both language identification and audio-based speech recognition. It serves to filter data of the desired languages and transcribe labels from the unannotated, multilingual audio-visual data pool. By comparing the performance of VSR models trained on automatic labels and on human-annotated labels, we show that we can achieve similar VSR performance to that of human-annotated labels even…
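The filter-then-transcribe step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `auto_label`, `stub_transcribe`, the clip names, and the language codes are all hypothetical, and the transcriber is passed in as a plain callable that returns a dict with `"language"` and `"text"` keys so the sketch runs without any model installed.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical interface: a transcriber maps a media path to a dict that
# holds at least "language" (a language code) and "text" (the transcript).
Transcriber = Callable[[str], Dict[str, str]]


def auto_label(
    paths: Iterable[str],
    transcribe: Transcriber,
    target_langs: Set[str],
) -> List[Dict[str, str]]:
    """Keep clips whose detected language is wanted and attach the transcript."""
    labeled = []
    for path in paths:
        result = transcribe(path)
        lang = result.get("language", "")
        text = result.get("text", "").strip()
        # Drop clips in other languages and clips with an empty transcript.
        if lang in target_langs and text:
            labeled.append({"path": path, "language": lang, "text": text})
    return labeled


# Stand-in for a real speech model (assumption: illustrative data only).
_FAKE_RESULTS = {
    "a.mp4": {"language": "fr", "text": " bonjour "},
    "b.mp4": {"language": "en", "text": "hello"},
    "c.mp4": {"language": "it", "text": "   "},
}


def stub_transcribe(path: str) -> Dict[str, str]:
    return _FAKE_RESULTS[path]


if __name__ == "__main__":
    for row in auto_label(_FAKE_RESULTS, stub_transcribe, {"fr", "it"}):
        print(row)
```

With the openai-whisper package installed, the `transcribe` method of a loaded model returns a dict that includes `"text"` and `"language"` entries, so it could likely be passed in place of the stub; the model size and any extra quality filtering are left open here.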


Code & Models

Repositories

jeonghun0716/visual-speech-recognition-for-low-resource-languages
PyTorch · Official


Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Indoor and Outdoor Localization Technologies