Lip reading using external viseme decoding

Javad Peymanfard; Mohammad Reza Mohammadi; Hossein Zeinali; Nasser Mozayani

arXiv:2104.04784 · cs.CV · November 9, 2021

Open Access

TL;DR

This paper presents a two-stage lip-reading approach that splits video-to-character recognition into a video-to-viseme stage and a viseme-to-character stage, so that the second stage can draw on external text data. On the LRS2 dataset, it improves word error rate by 4% over a standard sequence-to-sequence lip-reading model.

Contribution

The paper introduces a two-stage lip-reading method that uses external text data for viseme-to-character decoding, improving accuracy over a standard sequence-to-sequence model.

Findings

01 Achieved a 4% reduction in word error rate on the LRS2 dataset.

02 Demonstrated the effectiveness of an external viseme-to-character mapping.

03 Improved lip-reading performance over the baseline sequence-to-sequence model.

Abstract

Lip reading is the task of recognizing speech from lip movements. It is difficult because some words produce nearly identical lip movements when pronounced. A viseme is the unit used to describe lip movements during speech. This paper shows how to use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages handled by separate models: converting video to visemes, and then converting visemes to characters. Our proposed method improves word error rate by 4% compared to a standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset.
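
The ambiguity the abstract describes, and the reason text alone may be enough to train the second stage, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `TOY_VISEME_CLASSES` table is hypothetical and keyed by letters (real viseme inventories are usually defined over phonemes), and the helper names are not taken from the paper. The sketch converts plain words into viseme sequences, which yields (viseme sequence, text) pairs without any video, and shows that several words collapse onto the same sequence.

```python
from collections import defaultdict

# Hypothetical, heavily simplified viseme classes keyed by letter.
# Real systems typically map phonemes (not letters) to visemes; this
# table is an illustration only and is not taken from the paper.
TOY_VISEME_CLASSES = {
    "P": "pbm",    # lips pressed together
    "F": "fv",     # lower lip against upper teeth
    "T": "tdnsz",  # tongue near the upper teeth ridge
    "K": "kg",
    "A": "a",
    "E": "ei",
    "O": "ou",
}

# Invert the table: each character points at its viseme class label.
CHAR_TO_VISEME = {
    ch: viseme for viseme, chars in TOY_VISEME_CLASSES.items() for ch in chars
}


def text_to_visemes(text):
    """Map each known character to its toy viseme class, skipping the rest."""
    return tuple(CHAR_TO_VISEME[ch] for ch in text.lower() if ch in CHAR_TO_VISEME)


def build_training_pairs(corpus):
    """Turn text-only data into (viseme sequence, target text) pairs."""
    return [(text_to_visemes(word), word) for word in corpus]


def group_by_visemes(corpus):
    """Group words that become indistinguishable at the viseme level."""
    groups = defaultdict(list)
    for visemes, word in build_training_pairs(corpus):
        groups[visemes].append(word)
    return dict(groups)


words = ["pat", "bat", "mat", "fan", "van"]
groups = group_by_visemes(words)
for visemes, candidates in groups.items():
    print("-".join(visemes), "->", candidates)
```

In this toy setup "pat", "bat", and "mat" share one viseme sequence, so a viseme-to-character model has to rely on linguistic context to choose among them. That context can be learned from text that has no accompanying video, which is the motivation for decoding visemes with a separate model.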

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis