Decomposed Temporal Dynamic CNN: Efficient Time-Adaptive Network for Text-Independent Speaker Verification Explained with Speaker Activation Map

Seong-Hu Kim; Hyeonuk Nam; Yong-Hwa Park

arXiv:2203.15277·eess.AS·October 28, 2022·6 cites

TL;DR

This paper introduces decomposed temporal dynamic CNNs (DTDY-CNNs) for text-independent speaker verification. By combining a static kernel with a dynamic residual, they reduce model size by 64% relative to TDY-CNNs while improving accuracy. The paper also analyzes in detail how the network extracts speaker information.

Contribution

Proposes DTDY-CNNs, which use matrix decomposition to build efficient time-adaptive kernels from a static kernel plus a dynamic residual, improving on TDY-CNNs in both speaker verification accuracy and model size.

Findings

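01 Achieved 0.96% EER with DTDY-ResNet-34(x0.50), using attentive statistical pooling and no data augmentation

02 Reduced model size by 64% compared to TDY-CNNs

03 Effectively extracts speaker information from formant frequencies and from the high-frequency content of unvoiced phonemes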
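The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a number can be estimated from trial scores. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not taken from the paper or its repository, and evaluation toolkits typically interpolate the ROC curve rather than sweep observed scores as done here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores of same-speaker trials.
    nontarget_scores: similarity scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one target trial scores below one non-target trial.
toy_eer = equal_error_rate([0.9, 0.4], [0.1, 0.6])
```

An EER of 0.96% therefore means that, at the threshold where both error types are balanced, roughly one trial in a hundred is misclassified in each direction.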

Abstract

To extract accurate speaker information for text-independent speaker verification, temporal dynamic CNNs (TDY-CNNs), which adapt kernels to each time bin, were proposed. However, the model size of TDY-CNNs is too large, and the adaptive kernel's degrees of freedom are limited. To address these limitations, we propose decomposed temporal dynamic CNNs (DTDY-CNNs), which form a time-adaptive kernel by combining a static kernel with a dynamic residual based on matrix decomposition. The proposed DTDY-ResNet-34(x0.50), using attentive statistical pooling without data augmentation, shows an EER of 0.96%, which is better than other state-of-the-art methods. DTDY-CNNs are a successful upgrade of TDY-CNNs, reducing the model size by 64% and enhancing performance. We also show that DTDY-CNNs extract more accurate frame-level speaker embeddings than TDY-CNNs. Detailed behaviors of DTDY-ResNet-34(x0.50) on…
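The idea of a time-adaptive kernel built from a static kernel plus a dynamic residual can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: it assumes a pointwise (1x1) kernel, a residual of the form P Φ(t) Qᵀ with a small latent size, and a hypothetical linear map `a` that produces Φ(t) from the input frame. All names (`time_adaptive_kernels`, `apply_pointwise`, `w_static`, `p`, `q`, `phi`, `a`) are invented for this example; the official repository should be consulted for the actual layer.

```python
import numpy as np


def time_adaptive_kernels(w_static, p, q, phi):
    """Build one kernel per time bin: W(t) = W_static + P @ Phi(t) @ Q.T.

    w_static: (C_out, C_in) static kernel shared by all time bins.
    p: (C_out, L) and q: (C_in, L) static projection matrices (assumed form).
    phi: (T, L, L) dynamic matrices, one per time bin.
    Returns an array of shape (T, C_out, C_in).
    """
    residual = np.einsum("ol,tlm,im->toi", p, phi, q)
    return w_static[None, :, :] + residual


def apply_pointwise(kernels, x):
    """Apply the kernel of time bin t to frame t; x is (C_in, T), output (C_out, T)."""
    return np.einsum("toi,it->ot", kernels, x)


rng = np.random.default_rng(0)
c_in, c_out, latent, frames = 8, 16, 2, 5

w_static = rng.standard_normal((c_out, c_in))
p = rng.standard_normal((c_out, latent))
q = rng.standard_normal((c_in, latent))
x = rng.standard_normal((c_in, frames))

# Hypothetical generator of Phi(t): a linear map of each input frame.
a = rng.standard_normal((latent * latent, c_in))
phi = (a @ x).T.reshape(frames, latent, latent)

kernels = time_adaptive_kernels(w_static, p, q, phi)
y = apply_pointwise(kernels, x)
```

In this sketch the residual for each time bin has rank at most `latent`, so the time-dependent part is described by only `latent * latent` numbers per frame while the static kernel carries the shared structure.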

Code & Models

Repositories

shkim816/decomposed_temporal_dynamic_cnn
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing