Dual-Path Cross-Modal Attention for Better Audio-Visual Speech Extraction
Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson

TL;DR
This paper introduces a novel dual-path cross-modal attention mechanism that fuses audio and visual features at compatible time resolutions, improving audio-visual speech extraction performance by leveraging long-term lip movement information.
Contribution
The paper proposes a new audio-visual fusion method that avoids visual feature upsampling by integrating visual cues directly into inter-chunk attention, enhancing efficiency and accuracy.
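To make the idea of attending to visual cues along the inter-chunk axis concrete, here is a minimal sketch of scaled dot-product cross-modal attention in plain Python. It is an illustration under stated assumptions, not the authors' layer: the function names are hypothetical, there are no learned query/key/value projections or multiple heads, and the audio and visual vectors are assumed to share the same dimension. Each audio chunk summary acts as a query over the visual frames, so the visual sequence is used at its native frame rate and never upsampled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(audio_chunks, visual_frames):
    """Fuse visual context into per-chunk audio vectors (hypothetical helper).

    audio_chunks:  list of S vectors, one summary vector per audio chunk
    visual_frames: list of V vectors, one per video frame (same dimension)
    Returns S fused vectors: audio query plus attended visual context.
    """
    dim = len(audio_chunks[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in audio_chunks:
        # Similarity between this audio chunk and every video frame.
        scores = [scale * sum(q * k for q, k in zip(query, frame))
                  for frame in visual_frames]
        weights = softmax(scores)
        # Weighted sum of visual frames (frames serve as keys and values).
        context = [sum(w * frame[d] for w, frame in zip(weights, visual_frames))
                   for d in range(dim)]
        # Residual connection keeps the original audio information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Two audio chunks attending over three video frames (toy values).
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_modal_attention(audio, video)
```

Because the number of audio chunks and the number of video frames only need to be comparable, not equal, attention of this kind can absorb small mismatches between the two time axes.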
Findings
Achieves better speech extraction results than existing models.
Fuses audio and visual features efficiently by aligning their time resolutions.
Makes better use of long-term lip movement information in audio-visual models.
Abstract
Audio-visual target speech extraction, which aims to extract a certain speaker's speech from the noisy mixture by looking at lip movements, has made significant progress by combining time-domain speech separation models and visual feature extractors (CNN). One problem in fusing audio and video information is that they have different time resolutions. Most current research upsamples the visual features along the time dimension so that audio and video features can be aligned in time. However, we believe that lip movement should mostly contain long-term, or phone-level, information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that for DPRNN \cite{dprnn}, the inter-chunk dimension's time resolution could be very close to the time resolution of video frames. Like \cite{sepformer}, the LSTM in DPRNN is replaced by intra-chunk and inter-chunk…
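The time-resolution observation in the abstract can be checked with simple arithmetic. The sketch below computes how many chunks per second a dual-path model produces along its inter-chunk axis. Every number in it (16 kHz audio, an encoder stride of 8 samples, a chunk hop of 80 encoder frames, 25 fps video) is an assumption chosen for illustration and is not taken from the paper; the function names are likewise hypothetical.

```python
def inter_chunk_rate(sample_rate_hz, encoder_stride, chunk_hop):
    """Return the number of chunks per second along the inter-chunk axis.

    sample_rate_hz: audio sampling rate (assumed value)
    encoder_stride: samples between consecutive encoder frames (assumed)
    chunk_hop:      encoder frames between consecutive chunk starts (assumed)
    """
    encoder_frames_per_second = sample_rate_hz / encoder_stride
    return encoder_frames_per_second / chunk_hop


def num_chunks(num_frames, chunk_size, chunk_hop):
    """Count overlapping chunks needed to cover num_frames (tail zero-padded)."""
    if num_frames <= chunk_size:
        return 1
    remaining = num_frames - chunk_size
    return 1 + -(-remaining // chunk_hop)  # ceiling division


VIDEO_FPS = 25  # assumed video frame rate

audio_rate = inter_chunk_rate(sample_rate_hz=16000, encoder_stride=8, chunk_hop=80)
print(audio_rate, VIDEO_FPS)  # 25.0 25
```

Under these assumed settings the inter-chunk axis advances at the same rate as the video, which is why visual features can be fused there without upsampling. The intra-chunk axis, by contrast, runs at 2000 encoder frames per second in this example, far finer than any video frame rate.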
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · ALIGN
