Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Guinan Li; Jiajun Deng; Youjun Chen; Mengzhe Geng; Shujie Hu; Zhe Li; Zengrui Jin; Tianzi Wang; Xurong Xie; Helen Meng; Xunying Liu

arXiv:2406.10152·cs.SD·June 17, 2024

TL;DR

This paper introduces joint speaker feature learning methods that enhance audio-visual multichannel speech separation and recognition, achieving significant WER reductions through integrated speaker encoders and fusion techniques.

Contribution

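The paper presents a joint speaker feature learning approach for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected through purpose-built fusion blocks and trained together with the complete system.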

Findings

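01. Joint speaker feature learning consistently improves speech separation and recognition performance over baselines without joint speaker feature estimation.

02. The best-performing adapted system achieves statistically significant absolute WER reductions of 21.6% and 25.3% over the baseline fine-tuned WavLM model.

03. Performance improvements are strongly correlated with increased inter-speaker discrimination, measured using cosine similarity.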

Abstract

This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on multichannel overlapped speech simulated from LRS3-TED data suggest that joint speaker feature learning consistently improves speech separation and recognition performance over the baselines without joint speaker feature estimation. Further analyses reveal that the performance improvements are strongly correlated with increased inter-speaker discrimination, measured using cosine similarity. The best-performing joint speaker feature learning adapted system outperformed the baseline fine-tuned WavLM model by statistically significant WER reductions of 21.6% and 25.3% absolute (67.5%…
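The abstract says inter-speaker discrimination is measured using cosine similarity but does not state how the measure is aggregated. The sketch below is one plausible reading, not the authors' procedure: it averages the cosine similarity over all pairs of distinct speakers, so a lower value means the speaker features are easier to tell apart. The function names, the dictionary layout, and the toy two-dimensional vectors are all assumptions made for illustration; real xVector or ECAPA-TDNN embeddings have far more dimensions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_inter_speaker_similarity(embeddings):
    """Average cosine similarity over all pairs of distinct speakers.

    `embeddings` maps a speaker id to one feature vector per speaker
    (a hypothetical layout, assumed here for illustration).
    Lower values indicate stronger inter-speaker discrimination.
    """
    pairs = list(combinations(embeddings.values(), 2))
    if not pairs:
        raise ValueError("need at least two speakers")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy vectors, invented for this example only (not data from the paper).
less_discriminative = {"spk_a": [1.0, 0.0], "spk_b": [1.0, 1.0]}
more_discriminative = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}

print(mean_inter_speaker_similarity(less_discriminative))
print(mean_inter_speaker_similarity(more_discriminative))
```

In this toy case the second set of vectors is orthogonal, so its mean similarity is lower than that of the first set. Under the assumed aggregation, that is the direction the abstract associates with better separation and recognition results.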

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques