Designing Practical Models for Isolated Word Visual Speech Recognition
Iason Ioannis Panagos, Giorgos Sfikas, Christophoros Nikou

TL;DR
This paper develops lightweight, resource-efficient visual speech recognition models that maintain high accuracy, enabling practical deployment in resource-constrained environments by benchmarking and adapting efficient neural network architectures.
Contribution
It introduces novel low-resource VSR architectures built from efficient image classification models and lightweight temporal convolution blocks, with the aim of reducing the hardware cost of running VSR systems.
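To give a rough sense of why a lightweight temporal convolution block can cut hardware cost, the sketch below compares the parameter count of a standard 1D temporal convolution with a depthwise-separable one. Depthwise-separable factorization is an assumption chosen for illustration, since the summary does not say which lightweight blocks the paper uses; the function names and the channel and kernel sizes are likewise hypothetical.

```python
# Illustrative only: the summary does not specify the paper's temporal blocks.
# Depthwise-separable factorization is assumed here as one common way to make
# a temporal convolution lightweight.

def conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a standard 1D convolution over the time axis."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a depthwise conv followed by a 1x1 pointwise conv."""
    # Depthwise: one kernel of length kernel_size per input channel.
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    # Pointwise: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the paper.
    c_in, c_out, k = 512, 512, 3
    standard = conv1d_params(c_in, c_out, k)
    separable = depthwise_separable_conv1d_params(c_in, c_out, k)
    print(f"standard:  {standard} parameters")
    print(f"separable: {separable} parameters")
    print(f"reduction: {standard / separable:.2f}x")
```

With these hypothetical sizes the separable block needs roughly a third of the parameters of the standard one. Whether accuracy holds up under such a substitution is an empirical question that the paper's benchmarks address.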
Findings
Achieved strong recognition performance with low-resource models.
Demonstrated effectiveness on a large English word database.
Models are suitable for practical, resource-constrained applications.
Abstract
Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interaction. A VSR system is typically employed in a complementary role in cases where the audio is corrupt or not available. To accurately predict the spoken words, these architectures often rely on deep neural networks to extract meaningful representations from the input sequence. While deep architectures achieve impressive recognition performance, relying on such models incurs significant computation costs, which translate into increased hardware requirements and limit applicability in real-world scenarios where resources might be constrained. This factor prevents wider adoption and deployment of speech recognition systems in…
