Dual-Path Cross-Modal Attention for better Audio-Visual Speech Extraction

Zhongweiyang Xu; Xulin Fan; Mark Hasegawa-Johnson

arXiv:2207.04213 · cs.MM · March 7, 2023

TL;DR

This paper introduces a novel dual-path cross-modal attention mechanism that fuses audio and visual features at compatible time resolutions, improving audio-visual speech extraction performance by leveraging long-term lip movement information.

Contribution

The paper proposes a new audio-visual fusion method that avoids visual feature upsampling by integrating visual cues directly into inter-chunk attention, enhancing efficiency and accuracy.
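To make the idea of feeding visual cues into inter-chunk attention more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's architecture: the function name `inter_chunk_cross_attention`, the tensor shapes, the single attention head, the absence of learned projections, and the residual connection are all hypothetical choices made for brevity. It assumes the audio has already been split into S chunks of K frames each, and that there is one visual feature vector per chunk, so no visual upsampling is needed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inter_chunk_cross_attention(audio, visual):
    """Hypothetical sketch: audio queries attend to visual keys/values.

    audio:  (S, K, D) chunked audio features (S chunks, K frames per chunk)
    visual: (S, D)    one visual feature per chunk (assumed already aligned)
    """
    S, K, D = audio.shape
    q = audio.transpose(1, 0, 2)            # (K, S, D): sequence runs over chunks
    scores = q @ visual.T / np.sqrt(D)      # (K, S, S): scaled dot-product scores
    weights = softmax(scores, axis=-1)      # attention over the S visual steps
    fused = weights @ visual                # (K, S, D): visual context per audio step
    return (q + fused).transpose(1, 0, 2)   # residual add, back to (S, K, D)

rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 4, 16))    # illustrative sizes only
visual = rng.standard_normal((10, 16))
out = inter_chunk_cross_attention(audio, visual)
print(out.shape)
```

A real model would add learned query, key, and value projections, multiple heads, and normalization; they are left out here so the alignment between the chunk axis and the visual time axis stays visible.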

Findings

01 Achieved superior speech extraction results compared to existing models.

02 Efficient fusion by aligning time resolutions of audio and visual features.

03 Improved long-term lip movement utilization in audio-visual models.

Abstract

Audio-visual target speech extraction, which aims to extract a certain speaker's speech from a noisy mixture by looking at lip movements, has made significant progress by combining time-domain speech separation models and visual feature extractors (CNNs). One problem with fusing audio and video information is that they have different time resolutions. Most current research upsamples the visual features along the time dimension so that the audio and video features align in time. However, we believe that lip movement should mostly contain long-term, or phone-level, information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that for DPRNN \cite{dprnn}, the inter-chunk dimension's time resolution could be very close to the time resolution of video frames. Like \cite{sepformer}, the LSTM in DPRNN is replaced by intra-chunk and inter-chunk…
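The observation about matching time resolutions can be checked with simple arithmetic. The sketch below is illustrative only: the function name `inter_chunk_rate_hz` and every numeric value (sample rate, encoder stride, chunk hop, video frame rate) are assumptions picked to make the numbers line up, not settings taken from the paper.

```python
def inter_chunk_rate_hz(sample_rate, encoder_stride, chunk_hop):
    """Steps per second along the inter-chunk axis of a dual-path model.

    The encoder emits one frame every `encoder_stride` samples, and a new
    chunk starts every `chunk_hop` encoder frames.
    """
    return sample_rate / (encoder_stride * chunk_hop)

# Hypothetical values for illustration (not from the paper).
audio_rate = inter_chunk_rate_hz(sample_rate=16000, encoder_stride=8, chunk_hop=80)
video_fps = 25.0  # assumed video frame rate

print(audio_rate, video_fps)
```

With these assumed values, each inter-chunk step covers 640 samples, or 40 ms, which is the duration of one video frame at 25 frames per second. Under such a configuration the visual features could be consumed along the inter-chunk axis without upsampling.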

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · ALIGN