Attention-based multi-channel speaker verification with ad-hoc microphone arrays
Chengdong Liang, Junqi Chen, Shanzheng Guan, Xiao-Lei Zhang

TL;DR
This paper introduces an attention-based multi-channel speaker verification system designed for ad-hoc microphone arrays with unknown configurations, employing residual self-attention and sparsemax to improve robustness and accuracy.
Contribution
It proposes a novel neural network architecture with inter-channel processing and global fusion layers, incorporating sparsemax for better noise handling in ad-hoc microphone arrays.
Findings
Achieves over 20% reduction in equal error rate (EER) on semi-real data
Achieves over 30% EER reduction on simulated data
Effective in scenarios with varying and mismatched channel numbers
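To make the EER figures above concrete, here is a minimal sketch of how an equal error rate and a reduction between two systems might be computed. It assumes the reported reductions are relative to a baseline, which this summary does not state; the function names `equal_error_rate` and `relative_reduction` and the toy scores are hypothetical and do not come from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where false acceptance and false rejection rates are
    closest, and reported as their average.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance rate: impostor trials scored at or above t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: genuine trials scored below t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction (assumed interpretation of 'X% EER reduction')."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores, invented purely for illustration.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.5, 0.3, 0.2]
print(equal_error_rate(targets, nontargets))
print(relative_reduction(0.10, 0.08))
```

Under this reading, a system that lowers the EER from 10% to 8% would count as a 20% reduction.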
Abstract
Recently, ad-hoc microphone arrays have been widely studied. Unlike traditional microphone array settings, the spatial arrangement and number of microphones of ad-hoc microphone arrays are not known in advance, which hinders the adaptation of traditional speaker verification technologies to ad-hoc microphone arrays. To overcome this weakness, in this paper, we propose attention-based multi-channel speaker verification with ad-hoc microphone arrays. Specifically, we add an inter-channel processing layer and a global fusion layer after the pooling layer of a single-channel speaker verification system. The inter-channel processing layer applies a so-called residual self-attention along the channel dimension for allocating weights to different microphones. The global fusion layer integrates all channels in a way that is independent of the number of input channels. We further replace the…
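The two layers described in the abstract can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: it assumes each channel has already been reduced to one pooled embedding, that sparsemax takes the place of softmax over the attention scores, and that global fusion is a plain mean over channels. The function names and the weight matrices `w_q`, `w_k`, `w_v` are hypothetical.

```python
import numpy as np


def sparsemax(z, axis=-1):
    """Sparsemax: Euclidean projection of z onto the probability simplex.

    Unlike softmax, it can assign exactly zero weight to some entries.
    """
    z = np.moveaxis(np.asarray(z, dtype=float), axis, -1)
    z_sorted = -np.sort(-z, axis=-1)               # descending order
    k = np.arange(1, z.shape[-1] + 1)
    cumsum = np.cumsum(z_sorted, axis=-1)
    support = 1 + k * z_sorted > cumsum            # True for the first k(z) entries
    k_z = support.sum(axis=-1, keepdims=True)      # support size, always >= 1
    tau = (np.take_along_axis(cumsum, k_z - 1, axis=-1) - 1) / k_z
    p = np.maximum(z - tau, 0.0)
    return np.moveaxis(p, -1, axis)


def fuse_channels(x, w_q, w_k, w_v):
    """x: (num_channels, dim) pooled embeddings, one row per microphone.

    Applies self-attention across channels with a residual connection,
    then averages over channels (assumed fusion) to get one (dim,) vector.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (channels, channels)
    weights = sparsemax(scores, axis=-1)           # each row sums to 1
    attended = x + weights @ v                     # residual self-attention
    return attended.mean(axis=0)                   # output size independent of channel count


rng = np.random.default_rng(0)
dim = 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
for num_channels in (2, 5):
    embedding = fuse_channels(rng.standard_normal((num_channels, dim)), w_q, w_k, w_v)
    print(num_channels, embedding.shape)
```

Because attention is computed across whatever channels are present and the fusion step averages them, the same weights accept any number of microphones, which matches the abstract's claim that fusion is independent of the input channel count.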
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
