Leveraging Visemes for Better Visual Speech Representation and Lip Reading

Javad Peymanfard; Vahid Saeedi; Mohammad Reza Mohammadi; Hossein Zeinali; Nasser Mozayani

arXiv:2307.10157·cs.CV·July 20, 2023·1 cites

TL;DR

This paper introduces a viseme-based method for visual speech recognition that improves accuracy by extracting more discriminative features, outperforming existing techniques on multiple lip reading tasks.

Contribution

The paper presents a novel viseme-based feature extraction approach that enhances lip reading performance across various tasks and datasets.

Findings

01. Reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.

02. Outperforms state-of-the-art methods on word-level lip reading, sentence-level lip reading, and audiovisual speech recognition.

03. Evaluated on Arman-AV, a large-scale Persian corpus.
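A relative WER reduction is conventionally read as (baseline WER - new WER) / baseline WER. The following is a minimal Python sketch of that arithmetic, assuming WER is the word-level edit distance divided by the reference length. The function names `wer` and `relative_wer_reduction` and all numeric values are hypothetical illustrations, not the paper's code or results.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Fraction of the baseline WER removed by the proposed system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical values chosen only to illustrate the formula:
# a baseline WER of 0.50 and a new WER of 0.4545 give a 9.1% relative reduction.
example_reduction = relative_wer_reduction(0.50, 0.4545)
```

Note that a relative reduction is different from an absolute one: in the hypothetical example the WER drops by about 4.5 percentage points, which is 9.1% of the baseline.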

Abstract

Lip reading is a challenging task that has many potential applications in speech recognition, human-computer interaction, and security systems. However, existing lip reading systems often suffer from low accuracy due to the limitations of video features. In this paper, we propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading. We evaluate our approach on various tasks, including word-level and sentence-level lip reading, and audiovisual speech recognition using the Arman-AV dataset, a large-scale Persian corpus. Our experimental results show that our viseme-based approach consistently outperforms the state-of-the-art methods in all these tasks. The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
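To make the notion of a viseme concrete, here is a minimal Python sketch of collapsing a phoneme sequence into viseme classes. The table `PHONEME_TO_VISEME`, its class names, and the helper `to_visemes` are assumptions made for illustration only. The abstract does not specify the paper's mapping, and a real system would need a complete, language-specific table (for Persian, in the case of Arman-AV).

```python
# Illustrative, incomplete phoneme-to-viseme table (an assumption, NOT the
# paper's mapping). Phonemes that look alike on the lips share one class.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar",
    "ae": "open_vowel",
}


def to_visemes(phonemes: list[str]) -> list[str]:
    """Map each phoneme to its viseme class; unknown phonemes become 'other'."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]


# "pat", "bat", and "mat" differ in their first phoneme but collapse to the
# same viseme sequence, which is why they are hard to tell apart visually.
pat = to_visemes(["p", "ae", "t"])
bat = to_visemes(["b", "ae", "t"])
mat = to_visemes(["m", "ae", "t"])
```

Because several phonemes map to one viseme, the viseme sequence is a coarser target than the phoneme sequence; the sketch shows only the mapping, not how the paper uses visemes during feature extraction.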

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Indoor and Outdoor Localization Technologies