Phoneme-aware and Channel-wise Attentive Learning for Text Dependent Speaker Verification
Yan Liu, Zheng Li, Lin Li, Qingyang Hong

TL;DR
This paper introduces a multi-task learning framework with phoneme-aware attention and channel-wise recalibration to enhance text-dependent speaker verification, demonstrating superior performance on the RSR2015 dataset.
Contribution
It presents a novel combination of phoneme-aware attentive pooling and SE-blocks within a multi-task learning network for improved speaker verification accuracy.
Findings
Achieved state-of-the-art results on the RSR2015 Part 1 database.
Demonstrated the effectiveness of phoneme-aware and channel-wise attention strategies.
Improved speaker embedding discriminability for text-dependent SV.
Abstract
This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV). In the proposed structure, the frame-level multi-task learning along with the segment-level adversarial learning is adopted for speaker embedding extraction. The phoneme-aware attentive pooling is exploited on frame-level features in the main network for speaker classifier, with the corresponding posterior probability for the phoneme distribution in the auxiliary subnet. Further, the introduction of Squeeze and Excitation (SE-block) performs dynamic channel-wise feature recalibration, which improves the representational ability. The proposed method exploits speaker idiosyncrasies associated with pass-phrases, and is further improved by the phoneme-aware attentive pooling and SE-block from temporal and channel-wise aspects,…
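The two attention mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify layer sizes, how the phoneme posteriors enter the attention, or the pooling statistics. The sketch assumes a standard squeeze-and-excitation gate computed from the time-averaged features, and a single-head attention whose scores are computed from each frame's features concatenated with its phoneme posterior vector. All names (`se_block`, `phoneme_aware_pooling`, `w1`, `w2`, `w_att`, `v_att`) and all dimensions are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def se_block(feats, w1, w2):
    """Channel-wise recalibration (assumed standard SE form).

    feats: (T, C) frame-level features.
    w1: (C, C // r) bottleneck weights, w2: (C // r, C) expansion weights.
    Returns the rescaled features and the per-channel gates.
    """
    squeezed = feats.mean(axis=0)                   # squeeze: average over time, (C,)
    hidden = np.maximum(0.0, squeezed @ w1)         # ReLU bottleneck, (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # sigmoid gates in (0, 1), (C,)
    return feats * gates, gates                     # broadcast gates over frames

def phoneme_aware_pooling(feats, phone_post, w_att, v_att):
    """Attentive pooling over time (assumed form, not taken from the paper).

    feats: (T, C) frame features, phone_post: (T, P) phoneme posteriors.
    Scores are computed from [feats, phone_post], so the weight given to a
    frame can depend on which phoneme it is likely to contain.
    """
    att_in = np.concatenate([feats, phone_post], axis=1)   # (T, C + P)
    scores = np.tanh(att_in @ w_att) @ v_att               # (T,)
    weights = softmax(scores, axis=0)                      # sums to 1 over frames
    embedding = weights @ feats                            # weighted mean, (C,)
    return embedding, weights

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T, C, P, r, H = 50, 16, 8, 4, 12
feats = rng.standard_normal((T, C))
phone_post = softmax(rng.standard_normal((T, P)), axis=1)  # stand-in for the auxiliary subnet output
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
w_att = rng.standard_normal((C + P, H)) * 0.1
v_att = rng.standard_normal(H) * 0.1

recalibrated, gates = se_block(feats, w1, w2)
embedding, weights = phoneme_aware_pooling(recalibrated, phone_post, w_att, v_att)
print(recalibrated.shape, embedding.shape)
```

In this reading, the SE gate acts along the channel axis and the attention weights act along the time axis, which matches the abstract's description of improvements "from temporal and channel-wise aspects". In a trained system the posteriors would come from the auxiliary phoneme subnet rather than from random numbers.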
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
