TL;DR
LipLearner introduces a lipreading-based silent speech interface that enables customizable, private communication on mobile devices with high accuracy, robustness, and user-friendly on-device learning, supporting flexible vocabularies with minimal effort.
Contribution
The paper presents a contrastive learning approach for lipreading that allows few-shot command customization and robust performance in real-world conditions on mobile devices.
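To make the idea of contrastive learning concrete, here is a minimal sketch of an InfoNCE-style loss for a single anchor embedding. This is an illustration only: the summary does not state which contrastive objective the paper uses, and the function names, the temperature value, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (hypothetical helper, not from the paper).

    The loss is the negative log-softmax score of the positive pair among
    the positive and all negatives, so it shrinks as the anchor moves closer
    to the positive and further from the negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings (made up): two clips of the same command vs. other commands.
anchor = [1.0, 0.0]
same_command = [0.9, 0.1]
other_commands = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce(anchor, same_command, other_commands)
```

Training an encoder to lower this kind of loss pulls embeddings of the same utterance together, which is what makes a representation reusable for commands that were never seen during training.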
Findings
Achieves 0.8947 F1-score on 25-command classification with one-shot learning.
Supports on-device fine-tuning and visual keyword spotting for personalized commands.
User study confirms high usability and reliability of the system.
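One-shot command classification of the kind reported above can be pictured as nearest-neighbor matching in embedding space: each custom command is enrolled with a single embedding, and a new utterance is assigned to the most similar one. The sketch below assumes this matching scheme for illustration; the summary does not confirm that it is the paper's classifier, and the command labels and 3-D vectors are hypothetical stand-ins for real lipreading embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(query, prototypes):
    """Return the label whose single enrolled embedding is closest to the query.

    `prototypes` maps a command label to the one embedding recorded at
    enrollment time (the "one shot").
    """
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Hypothetical enrollment: one embedding per user-defined command.
prototypes = {
    "play music": [0.9, 0.1, 0.0],
    "take photo": [0.0, 1.0, 0.2],
    "call home": [0.1, 0.0, 1.0],
}

# A new utterance whose embedding lies near the "take photo" sample.
predicted = classify([0.1, 0.8, 0.3], prototypes)
```

Adding a command in this scheme only requires storing one more embedding, with no retraining, which is consistent with the minimal-effort customization the summary describes.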
Abstract
Silent speech interfaces are a promising technology that enables private communication in natural language. However, previous approaches support only a small and inflexible vocabulary, which limits expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable using only one shot, and performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high…
Taxonomy
Methods: Contrastive Learning
