Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

Minsu Kim; Jeong Hun Yeo; Yong Man Ro

arXiv:2204.01725·cs.CV·April 6, 2022

TL;DR

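This paper introduces a Multi-head Visual-audio Memory (MVM) for lip reading. Trained on paired audio-visual data, the memory lets a model recall audio representations from visual input alone, which addresses two challenges: lip movement carries too little information to fully represent speech, and homophenes look alike on the lips despite being pronounced differently.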

Contribution

The paper proposes MVM, a memory that models the inter-relationships of paired audio-visual representations so that saved audio representations can be retrieved from visual input at inference, and that is designed to distinguish homophenes.

Findings

01. MVM improves lip reading accuracy significantly.
02. The method effectively distinguishes homophenes.
03. Experimental results validate the approach's effectiveness.

Abstract

Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two challenges in lip reading by proposing a Multi-head Visual-audio Memory (MVM). Firstly, MVM is trained with audio-visual datasets and remembers audio representations by modelling the inter-relationships of paired audio-visual representations. At the inference stage, visual input alone can extract the saved audio representation from the memory by examining the learned inter-relationships. Therefore, the lip reading model can complement the insufficient visual information with the extracted audio representations. Secondly, MVM is composed of…
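The recall step described in the abstract can be sketched as a key-value memory lookup. The following is a minimal illustration, not the authors' implementation. It assumes a single memory head, cosine-similarity addressing followed by a softmax, and hypothetical names (`key_slots`, `value_slots`, `recall_audio`, `scale`) with toy values that do not come from the paper. In training, the value slots would be learned from paired audio features; here they are fixed by hand.

```python
import math


def softmax(scores, scale=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def recall_audio(visual_feature, key_slots, value_slots, scale=10.0):
    """Address the memory with a visual feature and read out an audio vector.

    key_slots   : one key per slot, living in the visual feature space
    value_slots : one stored audio representation per slot
    scale       : assumed sharpening factor for the addressing softmax
    """
    # Addressing: how strongly the visual query matches each slot key.
    weights = softmax([cosine(visual_feature, k) for k in key_slots], scale)
    # Read-out: addressing-weighted sum of the stored audio representations.
    dim = len(value_slots[0])
    recalled = [
        sum(w * v[d] for w, v in zip(weights, value_slots)) for d in range(dim)
    ]
    return recalled, weights


# Toy memory with three slots (hand-picked values, for illustration only).
key_slots = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
value_slots = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.2, 0.8]]

# At inference only the visual feature is available.
visual_feature = [0.9, 0.1]
recalled, weights = recall_audio(visual_feature, key_slots, value_slots)
print("addressing weights:", weights)
print("recalled audio representation:", recalled)
```

In this toy setup the visual feature is closest to the first key, so nearly all of the addressing weight falls on the first slot and the recalled vector is close to that slot's stored audio representation. A lip reading model could then combine the recalled vector with its visual features, in line with the abstract's description of complementing insufficient visual information.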

Code & Models

Repositories

ms-dot-k/Multi-head-Visual-Audio-Memory
pytorch

Videos

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation