Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras

Ander Arriandiaga; Giovanni Morrone; Luca Pasa; Leonardo Badino; Chiara Bartolozzi

arXiv:1912.02671 · eess.AS · February 23, 2021

TL;DR

This paper introduces an online, low-latency audio-visual speech enhancement method using event-driven cameras to extract facial motion features, enabling real-time multi-talker separation.

Contribution

It pioneers the use of event-driven cameras for low-latency, online audio-visual speech separation in multi-talker environments.

Findings

01 The event-driven approach performs nearly as well as offline, frame-based methods

02 It achieves low latency at low computational cost

03 It enables real-time, embedded audio-visual speech processing

Abstract

We propose a method to address audio-visual target speaker enhancement in multi-talker environments using event-driven cameras. State-of-the-art audio-visual speech separation methods show that the crucial information is the movement of the facial landmarks related to speech production. However, all approaches proposed so far work offline on frame-based video input, which makes it difficult to process an audio-visual signal with low latency for online applications. To overcome this limitation, we propose using event-driven cameras and exploiting their compression, high temporal resolution, and low latency for low-cost, low-latency motion feature extraction, moving towards online embedded audio-visual speech processing. We use the event-driven optical flow estimation of the facial landmarks as input to a stacked Bidirectional LSTM trained to predict an Ideal Amplitude Mask that is…
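As a rough illustration of the training target named in the abstract, the sketch below computes an ideal amplitude mask in its common textbook form: the ratio of the target speaker's magnitude spectrogram to the mixture's magnitude spectrogram. It then applies the mask to the mixture. The function names, the `eps` guard, the clipping ceiling `max_value`, and the random arrays standing in for real spectrograms are all assumptions made for this example; the paper's exact mask definition and network are not reproduced here.

```python
import numpy as np


def ideal_amplitude_mask(target_mag, mixture_mag, eps=1e-8, max_value=10.0):
    """Ratio of target to mixture magnitudes, clipped to [0, max_value].

    eps and max_value are illustrative choices, not values from the paper.
    """
    mask = target_mag / (mixture_mag + eps)
    return np.clip(mask, 0.0, max_value)


def apply_mask(mask, mixture_mag):
    """Estimate the target magnitude by scaling each time-frequency bin."""
    return mask * mixture_mag


# Toy complex "spectrograms" (frequency bins x frames) standing in for STFTs.
rng = np.random.default_rng(0)
shape = (6, 8)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interferer = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interferer  # two talkers mixed in the complex domain

target_mag = np.abs(target)
mixture_mag = np.abs(mixture)

mask = ideal_amplitude_mask(target_mag, mixture_mag)
estimate = apply_mask(mask, mixture_mag)

# Mean absolute error between the masked mixture and the true target magnitude.
print(np.mean(np.abs(estimate - target_mag)))
```

In a system like the one described, a network would predict the mask from the mixture and the visual motion features instead of computing it from the clean target, which is available only at training time. Clipping bounds the mask in bins where the mixture magnitude is close to zero.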

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory