Multi-View Based Audio Visual Target Speaker Extraction

Peijun Yang; Zhan Jin; Juan Liu; Ming Li

arXiv:2603.07696·eess.AS·March 12, 2026

TL;DR

This paper introduces a Multi-View Tensor Fusion framework for audio-visual target speaker extraction that leverages multi-view lip videos during training to improve speech separation robustness and performance in both single-view and multi-view scenarios.

Contribution

The novel MVTF framework transforms multi-view learning into single-view gains, explicitly modeling cross-view correlations for improved speaker extraction.

Findings

01. Significant performance improvements with single-view inputs.

02. Enhanced robustness and accuracy in multi-view mode.

03. Effective cross-view correlation modeling through pairwise outer products.

Abstract

Audio-Visual Target Speaker Extraction (AVTSE) aims to separate a target speaker's voice from a mixed audio signal using the corresponding visual cues. While most existing AVTSE methods rely exclusively on frontal-view videos, this limitation restricts their robustness in real-world scenarios where non-frontal views are prevalent. Such visual perspectives often contain complementary articulatory information that could enhance speech extraction. In this work, we propose Multi-View Tensor Fusion (MVTF), a novel framework that transforms multi-view learning into single-view performance gains. During the training stage, we leverage synchronized multi-perspective lip videos to learn cross-view correlations through MVTF, where pairwise outer products explicitly model multiplicative interactions between different views of input lip embeddings. At the inference stage, the system supports both…
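The pairwise outer-product idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `pairwise_outer_fusion`, the embedding shapes, and the choice to flatten and concatenate the per-pair products are all assumptions made for illustration, since the summary does not specify how MVTF combines or projects the resulting tensors.

```python
import itertools

import numpy as np


def pairwise_outer_fusion(view_embeddings):
    """Fuse per-view lip embeddings with pairwise outer products.

    view_embeddings: list of arrays, each of shape (T, D), one per camera
    view (hypothetical layout: T frames, D embedding dimensions).
    Returns an array of shape (T, P * D * D), where P is the number of
    unordered view pairs. Flattening and concatenating is an assumption.
    """
    if len(view_embeddings) < 2:
        raise ValueError("need at least two views to form a pair")

    fused = []
    for a, b in itertools.combinations(view_embeddings, 2):
        # Per-frame outer product: (T, D) and (T, D) -> (T, D, D).
        # Each entry a[t, i] * b[t, j] is a multiplicative interaction
        # between one feature of view a and one feature of view b.
        outer = np.einsum("td,te->tde", a, b)
        fused.append(outer.reshape(outer.shape[0], -1))
    return np.concatenate(fused, axis=1)


# Toy usage: 3 views, 5 frames, 4-dimensional embeddings.
rng = np.random.default_rng(0)
views = [rng.standard_normal((5, 4)) for _ in range(3)]
fused = pairwise_outer_fusion(views)
print(fused.shape)  # (5, 48): 3 pairs, each contributing 4 * 4 features
```

The fused dimension grows quadratically with the embedding size, so a practical system would presumably project or pool these tensors before further processing; how the paper does so is not stated in this summary.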

Taxonomy

Topics: Speech and Audio Processing · Face Recognition and Analysis · Speech Recognition and Synthesis