AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Jiuxin Lin; Xinyu Cai; Heinrich Dinkel; Jun Chen; Zhiyong Yan; Yongqing Wang; Junbo Zhang; Zhiyong Wu; Yujun Wang; Helen Meng

arXiv:2306.14170·cs.MM·June 27, 2023

TL;DR

AV-SepFormer is a novel multi-modal attention model that effectively fuses audio and visual cues for target speaker extraction, improving performance by aligning features and unifying the extraction process.

Contribution

The paper introduces AV-SepFormer, a cross-attention based model with a new 2D positional encoding for synchronized audio-visual feature fusion in speaker extraction.
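The idea of a 2D positional encoding over chunked features can be illustrated with a minimal sketch. The summary here does not give the paper's formula, so the scheme below is an assumption: a standard sinusoidal encoding is computed twice, once over the chunk index (position between chunks) and once over the frame index inside a chunk (position within chunks), and the two are concatenated along the channel axis. The function names and sizes are hypothetical.

```python
import numpy as np

def sinusoidal_pe(length, dim):
    """Standard sinusoidal positional encoding of shape (length, dim); dim must be even."""
    pos = np.arange(length)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

def positional_encoding_2d(num_chunks, chunk_len, dim):
    """Assumed 2D scheme (not taken from the paper): the first half of the
    channels encodes the chunk index, the second half encodes the position
    inside the chunk. dim must be divisible by 4."""
    half = dim // 2
    inter = sinusoidal_pe(num_chunks, half)  # (num_chunks, half)
    intra = sinusoidal_pe(chunk_len, half)   # (chunk_len, half)
    pe = np.zeros((num_chunks, chunk_len, dim))
    pe[:, :, :half] = inter[:, None, :]      # same for every frame in a chunk
    pe[:, :, half:] = intra[None, :, :]      # same for every chunk
    return pe

pe = positional_encoding_2d(num_chunks=4, chunk_len=6, dim=16)
print(pe.shape)
```

Under this assumption, two frames in the same chunk share the inter-chunk half of the encoding, and frames at the same offset in different chunks share the intra-chunk half, so both kinds of position remain distinguishable after chunking.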

Findings

1. Significantly outperforms existing methods in target speaker extraction.

2. Uses a novel 2D positional encoding to improve feature fusion.

3. Aligns audio and visual features to reduce sampling rate mismatch issues.

Abstract

Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based dual-scale attention model that uses cross- and self-attention to fuse and model audio and visual features. AV-SepFormer splits the audio feature into a number of chunks equal to the length of the visual feature. Self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information between and within chunks and provides significant gains over the traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio feature is synchronized to the visual feature, which alleviates the harm caused by the inconsistency of audio and video…
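The chunking step described in the abstract can be sketched as follows, under stated assumptions. The audio feature sequence is zero-padded and split into as many chunks as there are visual frames, so that chunk i lines up with video frame i. The cross-attention shown is plain scaled dot-product attention without learned projections, and averaging each chunk to form the query is an illustrative choice only. None of this is taken from the authors' repository, and the shapes (98 audio frames, 10 video frames, 8 channels) are hypothetical.

```python
import numpy as np

def chunk_audio(audio, num_chunks):
    """Zero-pad (T_a, D) audio features and reshape to (num_chunks, chunk_len, D)."""
    t_a, d = audio.shape
    chunk_len = -(-t_a // num_chunks)          # ceiling division
    pad = num_chunks * chunk_len - t_a
    padded = np.pad(audio, ((0, pad), (0, 0)))  # zeros appended at the end
    return padded.reshape(num_chunks, chunk_len, d)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Scaled dot-product attention; query comes from one modality, key/value from the other."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

rng = np.random.default_rng(0)
audio = rng.standard_normal((98, 8))    # hypothetical audio features (T_a, D)
visual = rng.standard_normal((10, 8))   # hypothetical visual features (T_v, D)

# One audio chunk per visual frame, so both sequences have length T_v.
chunks = chunk_audio(audio, num_chunks=visual.shape[0])
chunk_summary = chunks.mean(axis=1)     # assumed per-chunk summary, (T_v, D)
fused = cross_attention(chunk_summary, visual, visual)
print(chunks.shape, fused.shape)
```

Because the number of chunks equals the number of video frames, the cross-attention operates on two sequences of the same length, which is the synchronization of time granularity that the abstract refers to.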

Code & Models

Repositories

lin9x/av-sepformer
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis