Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Junwen Xiong; Yu Zhou; Peng Zhang; Lei Xie; Wei Huang; Yufei Zha

arXiv:2203.02216 · cs.SD · July 8, 2022

TL;DR

This paper proposes a unified multi-modal framework that jointly learns audio-visual correlations to improve active speaker detection and speech enhancement, addressing the limitations of task-specific models.

Contribution

It introduces a novel multi-modal correlation learning framework that enhances generalization in audio-visual tasks by jointly modeling auditory and visual streams.

Findings

01. Improved accuracy in active speaker detection
02. Enhanced speech quality in noisy environments
03. Better cross-modal feature representations

Abstract

Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, independently designed architectures have been widely used for each one. This may make the learned representation task-specific and inevitably limits the generalization ability of features based on multi-modal modeling. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Therefore, as a motivation to bridge the multi-modal associations in audio-visual tasks, a unified framework is proposed to achieve target speaker detection and speech enhancement with joint learning of audio-visual modeling in this…

Code & Models

Repositories

overcautious/adenet
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing