Bidirectional Multiscale Feature Aggregation for Speaker Verification
Jiajun Qi, Wu Guo, Bin Gu

TL;DR
This paper introduces a bidirectional multiscale feature aggregation network with attentional fusion modules for text-independent speaker verification, improving feature integration and verification accuracy.
Contribution
It presents a novel bidirectional aggregation framework with attentional fusion modules, enhancing feature combination for speaker verification tasks.
Findings
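Experiments on the NIST SRE16 and VoxCeleb1 datasets show improved verification accuracy.
Combining stage features in both bottom-up and top-down directions is shown to be effective.
Replacing simple concatenation or element-wise addition with attentional fusion modules improves performance further.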
Abstract
In this paper, we propose a novel bidirectional multiscale feature aggregation (BMFA) network with attentional fusion modules for text-independent speaker verification. The feature maps from different stages of the backbone network are iteratively combined and refined in both a bottom-up and top-down manner. Furthermore, instead of simple concatenation or element-wise addition of feature maps from different stages, an attentional fusion module is designed to compute the fusion weights. Experiments are conducted on the NIST SRE16 and VoxCeleb1 datasets. The experimental results demonstrate the effectiveness of the bidirectional aggregation strategy and show that the proposed attentional fusion module can further improve the performance.
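To make the aggregation idea concrete, here is a minimal sketch in plain Python of a top-down pass followed by a bottom-up pass, with each merge done by a weighted sum instead of concatenation or addition. It is not the authors' implementation. Everything specific is an assumption for illustration: the function names, the factor-2 resampling between stages, the equal channel count across stages, the order of the two passes, and the gate itself (a per-channel sigmoid over the time-averaged sum of both inputs, with hypothetical parameters `scale` and `bias` standing in for learned weights).

```python
import math

def upsample2(x):
    # Repeat every frame twice along time: (C, T) -> (C, 2T).
    return [[v for v in row for _ in range(2)] for row in x]

def downsample2(x):
    # Average adjacent frame pairs along time: (C, 2T) -> (C, T).
    return [[(row[i] + row[i + 1]) / 2.0 for i in range(0, len(row) - 1, 2)]
            for row in x]

def attentional_fuse(x, y, scale, bias):
    # Fuse two same-shaped maps with one weight per channel.
    # Assumed gate: w = sigmoid(scale[c] * mean_t(x + y) + bias[c]).
    fused = []
    for c, (xr, yr) in enumerate(zip(x, y)):
        pooled = sum(a + b for a, b in zip(xr, yr)) / len(xr)
        w = 1.0 / (1.0 + math.exp(-(scale[c] * pooled + bias[c])))
        fused.append([w * a + (1.0 - w) * b for a, b in zip(xr, yr)])
    return fused

def bidirectional_aggregate(feats, scale, bias):
    # feats[0] is the finest map; each later map has half as many frames.
    n = len(feats)
    # Top-down pass: push coarse information into finer maps.
    top_down = [None] * n
    top_down[-1] = feats[-1]
    for i in range(n - 2, -1, -1):
        top_down[i] = attentional_fuse(
            feats[i], upsample2(top_down[i + 1]), scale, bias)
    # Bottom-up pass: push the refined fine information back to coarser maps.
    bottom_up = [top_down[0]]
    for i in range(1, n):
        bottom_up.append(attentional_fuse(
            top_down[i], downsample2(bottom_up[-1]), scale, bias))
    return bottom_up

# Toy input: three stages, two channels each, with 4, 2 and 1 frames.
feats = [
    [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[3.0, 3.0], [4.0, 4.0]],
    [[5.0], [6.0]],
]
zeros = [0.0, 0.0]
out = bidirectional_aggregate(feats, zeros, zeros)
```

With `scale` and `bias` set to zero, every gate equals 0.5 and each fusion reduces to a plain average, which gives a fixed-weight baseline to compare against. Non-zero parameters let the weight depend on the inputs, which is the property the abstract attributes to the attentional fusion module.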
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
