TL;DR
NeuroLip is an event-based framework that captures fine-grained lip motion dynamics for robust visual speaker recognition across different scenes and lighting conditions, outperforming existing methods.
Contribution
The paper introduces NeuroLip, a novel event-driven spatiotemporal learning framework with new modules and a comprehensive dataset for cross-scene lip-motion-based speaker recognition.
Findings
NeuroLip achieves over 71% accuracy on unseen viewpoints.
It attains nearly 76% accuracy under low-light conditions.
It outperforms existing methods by at least 8.54%.
Abstract
Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while…
