Lip reading using external viseme decoding
Javad Peymanfard, Mohammad Reza Mohammadi, Hossein Zeinali, Nasser Mozayani

TL;DR
This paper presents a two-stage lip-reading approach that splits video-to-character recognition into a video-to-viseme stage and a viseme-to-character stage, so that external text data can be used for the viseme-to-character mapping. On the LRS2 dataset, it improves word error rate by 4% over a standard sequence-to-sequence lip-reading model.
Contribution
It introduces a two-stage lip-reading method that uses external text data for viseme-to-character decoding, improving accuracy over a standard sequence-to-sequence lip-reading model.
Findings
Achieved a 4% reduction in word error rate on the LRS2 dataset.
Demonstrated the effectiveness of external viseme-to-character mapping.
Improved lip-reading performance over baseline models.
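Word error rate (WER), the metric behind the findings above, is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a generic implementation of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dp[-1][-1] / len(ref)


# Hypothetical example: one substitution ("cat" -> "bat") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the bat sat on mat")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.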
Abstract
Lip-reading is the task of recognizing speech from lip movements. It is difficult because different words can produce similar lip movements. A viseme is a unit that describes lip movements during speech. This paper shows how to use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters, using separate models. Our proposed method improves word error rate by 4% compared to the normal sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset.
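The role of external text in the second stage can be illustrated with a toy example. The sketch below is only a stand-in: the paper uses separate trained models for both stages, whereas this code uses a lookup table built from word counts. The letter-to-viseme classes in `LETTER_TO_VISEME`, the tiny corpus, and all function names are assumptions made for illustration, not the paper's viseme inventory or data.

```python
from collections import defaultdict

# Assumed, heavily simplified letter-to-viseme classes (not the paper's mapping).
# Letters in the same class are treated as visually indistinguishable.
LETTER_TO_VISEME = {
    "p": "P", "b": "P", "m": "P",
    "f": "F", "v": "F",
    "t": "T", "d": "T", "n": "T", "s": "T",
    "a": "A", "e": "A", "i": "A",
    "o": "O", "u": "O",
}


def to_visemes(word):
    """Collapse a word into its viseme sequence (many-to-one)."""
    return tuple(LETTER_TO_VISEME[ch] for ch in word)


def build_decoder(text_corpus):
    """Count, for each viseme sequence, how often each word in the text produces it."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in text_corpus.split():
        counts[to_visemes(word)][word] += 1
    return counts


def decode(visemes, decoder):
    """Return the most frequent word for a viseme sequence, or None if unseen."""
    candidates = decoder.get(tuple(visemes))
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Text-only data: no video is needed to build the viseme-to-word table.
corpus = "mat mat mat bat bat pat van van fan"
decoder = build_decoder(corpus)

# "mat", "bat", and "pat" share one viseme sequence; text frequency breaks the tie.
best_guess = decode(("P", "A", "T"), decoder)
```

The point of the split is visible even at this scale: the table is built from text alone, so it can draw on far more data than the transcripts of a video dataset provide.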
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis
