Phoneme-aware and Channel-wise Attentive Learning for Text Dependent Speaker Verification

Yan Liu; Zheng Li; Lin Li; Qingyang Hong

arXiv:2106.13514 · cs.SD · June 28, 2021 · 1 citation

TL;DR

This paper introduces a multi-task learning framework with phoneme-aware attention and channel-wise recalibration to enhance text-dependent speaker verification, demonstrating superior performance on the RSR2015 dataset.

Contribution

It presents a novel combination of phoneme-aware attentive pooling and SE-blocks within a multi-task learning network for improved speaker verification accuracy.
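To make the channel-wise recalibration concrete, here is a minimal pure-Python sketch of the standard Squeeze-and-Excitation recipe (squeeze by global average pooling, excite with a small bottleneck and a sigmoid gate, then rescale each channel). It assumes a 1D C × T frame-level feature map, a reduction ratio r = 4, and random weights; the function name `se_block` and all sizes are hypothetical, and the sketch does not reproduce the placement or configuration of the SE-blocks in the paper's network.

```python
import math
import random

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def se_block(features, w1, b1, w2, b2):
    """Squeeze-and-Excitation recalibration for a C x T feature map (hypothetical helper).

    features: list of C channels, each a list of T frame values.
    w1: (C // r) x C weights, b1: C // r biases  (channel descriptor -> bottleneck)
    w2: C x (C // r) weights, b2: C biases       (bottleneck -> one gate per channel)
    Returns (recalibrated features, per-channel gates).
    """
    # Squeeze: global average pooling over time gives one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in features]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields a gate in (0, 1) per channel.
    h = [max(0.0, sum(w * zi for w, zi in zip(row, z)) + b) for row, b in zip(w1, b1)]
    s = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b) for row, b in zip(w2, b2)]
    # Scale: multiply every frame of channel c by its gate s[c].
    out = [[s_c * x for x in ch] for s_c, ch in zip(s, features)]
    return out, s

# Usage with toy sizes (assumed, not from the paper).
random.seed(0)
C, T, r = 8, 5, 4
feats = [[random.gauss(0.0, 1.0) for _ in range(T)] for _ in range(C)]
w1 = [[random.gauss(0.0, 0.1) for _ in range(C)] for _ in range(C // r)]
b1 = [0.0] * (C // r)
w2 = [[random.gauss(0.0, 0.1) for _ in range(C // r)] for _ in range(C)]
b2 = [0.0] * C
out, gates = se_block(feats, w1, b1, w2, b2)
print([round(g, 3) for g in gates])
```

Because each gate depends on the whole utterance's channel averages, the same channel can be amplified for one input and suppressed for another, which is what "dynamic" recalibration refers to.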

Findings

1. Achieved state-of-the-art results on the RSR2015 Part 1 database.
2. Demonstrated the effectiveness of the phoneme-aware and channel-wise attention strategies.
3. Improved the discriminability of speaker embeddings for text-dependent SV.

Abstract

This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV). In the proposed structure, frame-level multi-task learning is combined with segment-level adversarial learning for speaker embedding extraction. Phoneme-aware attentive pooling is applied to the frame-level features of the main network for the speaker classifier, using the corresponding posterior probabilities of the phoneme distribution from the auxiliary subnet. Further, Squeeze-and-Excitation blocks (SE-blocks) perform dynamic channel-wise feature recalibration, which improves the representational ability. The proposed method exploits speaker idiosyncrasies associated with pass-phrases, and is further improved by the phoneme-aware attentive pooling and SE-block from temporal and channel-wise aspects,…
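The abstract says the attentive pooling on the main network's frame-level features uses the phoneme posteriors from the auxiliary subnet, but not exactly how. The pure-Python sketch below shows one plausible reading under stated assumptions: each frame's feature vector is concatenated with its phoneme posterior, scored with a tanh projection in the style of attentive statistics pooling, and the softmax-normalized scores weight a mean and standard deviation over frames. The function name, the parameters `W` and `v`, and the mean-plus-std output are hypothetical; the paper's formulation may differ.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def phoneme_aware_attentive_pooling(frames, posteriors, W, v):
    """Pool T frame-level vectors into one segment-level vector (hypothetical helper).

    frames:     T x D frame-level speaker features h_t (main network).
    posteriors: T x P phoneme posteriors p_t (auxiliary subnet).
    W:          K x (D + P) projection matrix (assumed learned parameter).
    v:          length-K scoring vector (assumed learned parameter).
    Returns (attention-weighted mean concatenated with weighted std, attention weights).
    """
    scores = []
    for h, p in zip(frames, posteriors):
        x = h + p  # list concatenation: [h_t ; p_t]
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    alpha = softmax(scores)  # one weight per frame, summing to 1
    dim = len(frames[0])
    mean = [sum(a * h[d] for a, h in zip(alpha, frames)) for d in range(dim)]
    var = [sum(a * h[d] ** 2 for a, h in zip(alpha, frames)) - mean[d] ** 2
           for d in range(dim)]
    std = [math.sqrt(max(s, 1e-12)) for s in var]  # clamp tiny negatives from rounding
    return mean + std, alpha

# Usage: two 1-D frames with 2-class phoneme posteriors (toy values).
frames = [[1.0], [3.0]]
posteriors = [[0.5, 0.5], [0.2, 0.8]]

# With an all-zero projection every frame scores equally, giving plain mean/std pooling.
emb_uniform, alpha_uniform = phoneme_aware_attentive_pooling(
    frames, posteriors, [[0.0, 0.0, 0.0]], [1.0])
print(emb_uniform)  # -> [2.0, 1.0]

# Weighting the first phoneme's posterior shifts attention toward frame 0.
emb_ph, alpha_ph = phoneme_aware_attentive_pooling(
    frames, posteriors, [[0.0, 5.0, 0.0]], [1.0])
print(alpha_ph)
```

Feeding the posteriors into the scorer lets frames associated with particular phonemes receive more weight, which is one way pooling could exploit the known pass-phrase content in text-dependent SV.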

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing