Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
Minsu Kim, Jeong Hun Yeo, Yong Man Ro

TL;DR
This paper introduces a Multi-head Visual-audio Memory (MVM) that improves lip reading by letting visual input recall audio representations learned from paired audio-visual data. It targets two challenges, the insufficiency of visual information and the existence of homophenes, and reports improved recognition accuracy.
Contribution
The paper proposes a novel MVM that models the inter-relationships of paired audio-visual representations and helps distinguish homophenes, that is, words with similar lip movement but different pronunciations.
Findings
MVM improves lip reading accuracy significantly by complementing visual features with audio representations recalled from memory.
The method effectively distinguishes homophenes.
Experimental results validate the approach's effectiveness.
Abstract
Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two challenges in lip reading by proposing a Multi-head Visual-audio Memory (MVM). Firstly, MVM is trained with audio-visual datasets and remembers audio representations by modelling the inter-relationships of paired audio-visual representations. At the inference stage, visual input alone can extract the saved audio representation from the memory by examining the learned inter-relationships. Therefore, the lip reading model can complement the insufficient visual information with the extracted audio representations. Secondly, MVM is composed of…
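The recall step described in the abstract, where visual input alone retrieves a saved audio representation, can be illustrated with a generic key-value memory. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the cosine-similarity addressing, the `sharpness` scaling factor, and the toy slot values are all hypothetical, and only a single memory is shown because the abstract text is cut off before it describes how the heads are composed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def recall_audio(visual_feat, key_memory, value_memory, sharpness=10.0):
    """Recall an audio-like vector from memory using only a visual query.

    key_memory:   slots compared against the visual feature (assumed learned).
    value_memory: slots holding saved audio representations (assumed learned).
    sharpness:    hypothetical scaling factor that peaks the addressing weights.
    """
    # Address every slot by its similarity to the visual query.
    weights = softmax([sharpness * cosine(visual_feat, k) for k in key_memory])
    # The recalled audio representation is the weighted sum of value slots.
    dim = len(value_memory[0])
    recalled = [
        sum(w * slot[d] for w, slot in zip(weights, value_memory))
        for d in range(dim)
    ]
    return recalled, weights

# Toy memory with two slots; in a trained model these would be learned
# from paired audio-visual data rather than written by hand.
key_memory = [[1.0, 0.0], [0.0, 1.0]]
value_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

visual_feat = [0.9, 0.1]  # query that resembles the first key slot
recalled, weights = recall_audio(visual_feat, key_memory, value_memory)
print(weights)   # addressing weights over the two slots
print(recalled)  # audio-like vector dominated by the first value slot
```

In this sketch the recalled vector would be combined with the visual feature before decoding, which is how the lip reading model could complement insufficient visual information at inference time without any audio input.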
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation
