Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

Naoki Makishima; Mana Ihori; Akihiko Takashima; Tomohiro Tanaka; Shota Orihashi; Ryo Masumura

arXiv:2103.01463·cs.SD·March 3, 2021

Open Access

TL;DR

This paper introduces a cross-modal correspondence (CMC) loss for audio-visual speech separation. The loss uses the speakers' visual signals during training so that the separated speech better preserves speech characteristics, improving separation quality over training with audio-only losses alone.

Contribution

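The paper proposes the cross-modal correspondence (CMC) loss, which uses visual signals to improve speech separation by capturing the co-occurrence of speech and visual signals. It addresses a limitation of conventional audio-only losses, which do not reflect speaker characteristics or phonetic information.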

Findings

01 CMC loss improves separation performance

02 Visual signals help preserve speech characteristics

03 Method reduces noise and distortion

Abstract

We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals to reflect the speech characteristics during training. Audio-visual speech separation is a technique to estimate the individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separation mainly train the separation model on the audio-only loss, which reflects the distance between the source signals and the separated signals. However, conventional losses do not reflect the characteristics of the speech signals, including the speaker's characteristics and phonetic information, which leads to distortion or remaining noise. To address this problem, we propose the cross-modal correspondence (CMC) loss, which is based on the cooccurrence of the speech signal and the visual signal.…
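To make the contrast in the abstract concrete, the following is a minimal sketch of how an audio-only distance term and a correspondence term might be combined into one training objective. It is an illustration under stated assumptions, not the authors' formulation: the mean-squared-error distance, the hinge-style form of the correspondence term, the cosine similarity, and the names `audio_only_loss`, `correspondence_loss`, `total_loss`, `margin`, and `weight` are all hypothetical. The embeddings are assumed to come from audio and visual encoders that the text above does not describe.

```python
import numpy as np


def audio_only_loss(source, separated):
    """Distance between source and separated signals (assumed here to be MSE)."""
    return float(np.mean((source - separated) ** 2))


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def correspondence_loss(audio_emb, visual_emb, other_visual_emb, margin=0.5):
    """Hypothetical hinge-style correspondence term.

    It is zero when the separated speech embedding is closer (by at least
    `margin`) to the matching speaker's visual embedding than to another
    speaker's visual embedding, and positive otherwise.
    """
    pos = cosine_similarity(audio_emb, visual_emb)
    neg = cosine_similarity(audio_emb, other_visual_emb)
    return max(0.0, margin - pos + neg)


def total_loss(source, separated, audio_emb, visual_emb, other_visual_emb,
               weight=0.1):
    """Audio-only term plus a weighted correspondence term (weight is assumed)."""
    return audio_only_loss(source, separated) + weight * correspondence_loss(
        audio_emb, visual_emb, other_visual_emb
    )
```

In this sketch a perfectly reconstructed signal whose embedding matches the correct speaker's visual embedding gives a total loss of zero, while a separated signal that resembles the wrong speaker is penalized even if its waveform distance is small.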

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation