Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras
Ander Arriandiaga, Giovanni Morrone, Luca Pasa, Leonardo Badino, Chiara Bartolozzi

TL;DR
This paper introduces an online, low-latency audio-visual speech enhancement method using event-driven cameras to extract facial motion features, enabling real-time multi-talker separation.
Contribution
It pioneers the use of event-driven cameras for low-latency, online audio-visual speech separation in multi-talker environments.
Findings
Performs nearly as well as offline, frame-based methods
Achieves low latency and computational efficiency
Enables real-time embedded audio-visual processing
Abstract
We propose a method to address audio-visual target speaker enhancement in multi-talker environments using event-driven cameras. State-of-the-art audio-visual speech separation methods show that the movement of the facial landmarks related to speech production carries crucial information. However, all approaches proposed so far work offline, using frame-based video input, which makes it difficult to process an audio-visual signal with low latency for online applications. To overcome this limitation, we propose the use of event-driven cameras and exploit their compression, high temporal resolution, and low latency for low-cost, low-latency motion feature extraction, moving towards online embedded audio-visual speech processing. We use the event-driven optical flow estimation of the facial landmarks as input to a stacked Bidirectional LSTM trained to predict an Ideal Amplitude Mask that is…
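As a rough illustration of the training target named in the abstract, an Ideal Amplitude Mask is commonly defined as the element-wise ratio between the magnitude spectrogram of the clean target speech and that of the mixture. The sketch below assumes that standard definition; the function names, the clipping value, and the toy numbers are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: function names, max_value, and eps are assumptions,
# not details from the paper.

def ideal_amplitude_mask(target_mag, mixture_mag, max_value=10.0, eps=1e-8):
    """Return the element-wise ratio |S| / |Y| for each time-frequency bin.

    target_mag and mixture_mag are magnitude spectrograms given as lists of
    frames, each frame a list of non-negative floats (one per frequency bin).
    """
    mask = []
    for s_frame, y_frame in zip(target_mag, mixture_mag):
        # eps avoids division by zero in silent bins; min() clips large ratios.
        mask.append([min(s / (y + eps), max_value)
                     for s, y in zip(s_frame, y_frame)])
    return mask


def apply_mask(mask, mixture_mag):
    """Estimate the target magnitude by multiplying the mask with the mixture."""
    return [[m * y for m, y in zip(m_frame, y_frame)]
            for m_frame, y_frame in zip(mask, mixture_mag)]


# Toy example: 2 frames x 3 frequency bins.
target = [[1.0, 0.5, 0.0], [2.0, 1.0, 0.5]]
mixture = [[2.0, 0.5, 1.0], [2.0, 4.0, 0.5]]

mask = ideal_amplitude_mask(target, mixture)
estimate = apply_mask(mask, mixture)
```

In a setup like the one described, a network would be trained to predict such a mask from the mixture and the visual motion features, and the predicted mask would then be multiplied with the mixture spectrogram as in `apply_mask`.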
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
