Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin'ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann

TL;DR
This paper introduces the first multi-view speech reading system that uses multiple silent video feeds from different angles to improve speech reconstruction, addressing pose variations and enhancing intelligibility.
Contribution
It presents a novel multi-view approach for speech reconstruction from silent videos, including optimal camera placement and potential applications across multimedia fields.
Findings
Multi-view video improves speech reconstruction accuracy.
Optimal camera placement enhances speech intelligibility.
System shows promise for security and multimedia analytics.
Abstract
Speechreading, or lipreading, is the technique of understanding and extracting phonetic features from a speaker's visual features, such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most of the work in speechreading has been limited to text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaker are often available, they have not yet been used to deal with the different poses. To this end, this paper presents the world's first ever multi-view speech reading and reconstruction system. This…
