Multimodal Attention Fusion for Target Speaker Extraction
Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

TL;DR
This paper introduces a novel attention-based fusion mechanism for audio-visual target speaker extraction, enhancing robustness to clue corruption and demonstrating improved performance on simulated and real data.
Contribution
It proposes a new attention mechanism for multimodal fusion that effectively assesses clue reliability, advancing audio-visual speaker extraction in realistic scenarios.
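To make the idea of reliability-aware fusion concrete, here is a minimal sketch of attention over per-modality clue embeddings. It is not the paper's formulation, which this summary does not spell out: it assumes generic scaled dot-product attention, and the names `attention_fuse`, `query`, and `clues` are hypothetical. Each clue embedding is scored against a query vector derived from the mixture, the scores pass through a softmax, and the fused clue is the weighted sum.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(query, clues):
    """Fuse clue embeddings with softmax attention weights.

    query: list of floats, a hypothetical summary of the mixture.
    clues: dict mapping a modality name to its embedding (same length as query).
    Returns (fused_embedding, weights_by_modality).
    Assumption: scaled dot-product scoring, not the authors' exact method.
    """
    names = list(clues)
    scale = math.sqrt(len(query))
    scores = [dot(query, clues[name]) / scale for name in names]
    weights = softmax(scores)
    fused = [
        sum(w * clues[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Toy example: the visual clue is zeroed out to mimic an occluded face.
query = [1.0, 0.0, 0.0, 0.0]
clues = {"audio": [2.0, 0.0, 0.0, 0.0], "visual": [0.0, 0.0, 0.0, 0.0]}
fused, weights = attention_fuse(query, clues)
```

In this toy case the audio clue receives the larger weight because the zeroed visual embedding scores lower against the query. A trained system would learn the projections that produce such scores rather than use raw embeddings.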
Findings
Improved signal-to-distortion ratio (SDR) by 1.0 dB on simulated data.
Successful application on real recorded mixtures.
Enhanced robustness to visual clue corruption.
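For readers unfamiliar with the metric behind the first finding, the sketch below computes a basic signal-to-distortion ratio in decibels: the energy of the reference signal divided by the energy of the estimation error. This summary does not say which SDR variant the paper reports (evaluation toolkits often use a more elaborate decomposition), so treat this as the plain textbook ratio. The function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    reference: clean target samples (list of floats).
    estimate: extracted samples of the same length.
    Assumption: plain energy ratio, not necessarily the paper's variant.
    """
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_power == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_power / error_power)


# An estimate whose amplitude is 10% too small gives roughly 20 dB.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
score = sdr_db(reference, estimate)
```

On this scale, a 1.0 dB gain means the ratio of signal energy to error energy grew by a factor of about 1.26.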
Abstract
Target speaker extraction, which aims to extract a target speaker's voice from a mixture of voices using audio, visual, or locational clues, has received much interest. Recently, audio-visual target speaker extraction has been proposed, which extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers more stable performance than single-modality methods on simulated data, neither its adaptation to realistic situations nor its evaluation on real recorded mixtures has been fully explored. One of the major issues in handling realistic situations is how to make the system robust to clue corruption, because in real recordings the two clues may not be equally reliable, e.g., visual clues may be affected by occlusions. In this work, we propose a novel attention mechanism for multi-modal fusion and its training methods that…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
