Leveraging Visemes for Better Visual Speech Representation and Lip Reading
Javad Peymanfard, Vahid Saeedi, Mohammad Reza Mohammadi, Hossein Zeinali, Nasser Mozayani

TL;DR
This paper introduces a viseme-based method for visual speech recognition that improves accuracy by extracting more discriminative features, outperforming existing techniques on multiple lip reading tasks.
Contribution
The paper presents a novel viseme-based feature extraction approach that enhances lip reading performance across various tasks and datasets.
Findings
Reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
Outperforms state-of-the-art methods on word-level lip reading, sentence-level lip reading, and audiovisual speech recognition.
Evaluated on Arman-AV, a large-scale Persian corpus.
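The 9.1% figure is a relative reduction, not a drop of 9.1 WER points. The sketch below shows the arithmetic only; the baseline and improved WER values are hypothetical numbers chosen for illustration and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, NOT taken from the paper: a baseline WER of 40.00%
# and an improved WER of 36.36% correspond to a 9.1% relative reduction,
# even though the absolute difference is only 3.64 points.
baseline = 0.40
improved = 0.3636
print(f"relative reduction: {relative_wer_reduction(baseline, improved):.1%}")
print(f"absolute reduction: {(baseline - improved) * 100:.2f} points")
```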
Abstract
Lip reading is a challenging task that has many potential applications in speech recognition, human-computer interaction, and security systems. However, existing lip reading systems often suffer from low accuracy due to the limitations of video features. In this paper, we propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading. We evaluate our approach on various tasks, including word-level and sentence-level lip reading, and audiovisual speech recognition using the Arman-AV dataset, a large-scale Persian corpus. Our experimental results show that our viseme-based approach consistently outperforms the state-of-the-art methods in all these tasks. The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
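To make the notion of a viseme concrete, here is a minimal sketch of a many-to-one phoneme-to-viseme mapping. The table `PHONEME_TO_VISEME`, the class labels, and the function `to_visemes` are illustrative assumptions; the abstract does not specify the viseme inventory or how the paper uses viseme labels, so this should not be read as the authors' method.

```python
# Illustrative, hypothetical mapping: several phonemes that look alike on the
# lips collapse into one viseme class. Real viseme inventories differ by
# language and by author; this table is NOT taken from the paper.
PHONEME_TO_VISEME = {
    "p": "BILABIAL", "b": "BILABIAL", "m": "BILABIAL",
    "f": "LABIODENTAL", "v": "LABIODENTAL",
    "t": "ALVEOLAR", "d": "ALVEOLAR", "n": "ALVEOLAR",
    "ae": "OPEN_VOWEL",
}


def to_visemes(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


# "bat", "mat", and "pat" differ in their first phoneme but share a single
# viseme sequence, which is why lip shapes alone leave words ambiguous.
for word, phones in [("bat", ["b", "ae", "t"]),
                     ("mat", ["m", "ae", "t"]),
                     ("pat", ["p", "ae", "t"])]:
    print(word, to_visemes(phones))
```

Because the mapping is many-to-one, viseme labels are a coarser target than phonemes but one that matches what is actually visible in the video.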
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Indoor and Outdoor Localization Technologies
