Multi-View Based Audio Visual Target Speaker Extraction
Peijun Yang, Zhan Jin, Juan Liu, Ming Li

TL;DR
This paper introduces a Multi-View Tensor Fusion framework for audio-visual target speaker extraction that leverages multi-view lip videos during training to improve speech separation robustness and performance in both single-view and multi-view scenarios.
Contribution
The novel MVTF framework transforms multi-view learning into single-view gains, explicitly modeling cross-view correlations for improved speaker extraction.
Findings
Significant performance improvements with single-view inputs.
Enhanced robustness and accuracy in multi-view mode.
Effective cross-view correlation modeling through pairwise outer products.
Abstract
Audio-Visual Target Speaker Extraction (AVTSE) aims to separate a target speaker's voice from a mixed audio signal using the corresponding visual cues. Most existing AVTSE methods rely exclusively on frontal-view videos, which limits their robustness in real-world scenarios where non-frontal views are prevalent. Such non-frontal perspectives often contain complementary articulatory information that could enhance speech extraction. In this work, we propose Multi-View Tensor Fusion (MVTF), a novel framework that transforms multi-view learning into single-view performance gains. During the training stage, we leverage synchronized multi-perspective lip videos to learn cross-view correlations through MVTF, where pairwise outer products explicitly model multiplicative interactions between different views of input lip embeddings. At the inference stage, the system supports both…
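The pairwise outer-product idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `pairwise_outer_fusion`, the embedding shape `(T, D)`, the number of views, and the choice to flatten and concatenate the per-pair products are all assumptions made for illustration. The abstract only states that pairwise outer products model multiplicative interactions between views of lip embeddings.

```python
import itertools
import numpy as np


def pairwise_outer_fusion(view_embeddings):
    """Fuse per-view lip embeddings with pairwise outer products.

    view_embeddings: list of arrays, each of shape (T, D), one per camera view
    (hypothetical layout: T frames, D embedding dimensions).
    Returns an array of shape (T, n_pairs * D * D).
    """
    fused = []
    for a, b in itertools.combinations(view_embeddings, 2):
        # Frame-wise outer product: outer[t, d, e] = a[t, d] * b[t, e]
        outer = np.einsum("td,te->tde", a, b)
        # Flatten each D x D interaction matrix into a vector (an assumption;
        # a real system would likely project this to a smaller dimension).
        fused.append(outer.reshape(outer.shape[0], -1))
    return np.concatenate(fused, axis=-1)


rng = np.random.default_rng(0)
# Three hypothetical views, 5 frames each, 4-dimensional embeddings.
views = [rng.standard_normal((5, 4)) for _ in range(3)]
fused = pairwise_outer_fusion(views)
print(fused.shape)  # (5, 48): 3 view pairs x 4 x 4 interaction terms
```

Each entry of an outer product multiplies one feature from one view with one feature from another, which is what makes the interaction multiplicative rather than additive. The feature size grows quadratically in `D` and with the number of view pairs, so a practical system would need some form of dimensionality reduction after this step.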
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
