Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

Sobhan Soleymani, Ali Dabouei, Seyed Mehdi Iranmanesh, Hadi Kazemi, Jeremy Dawson, and Nasser M. Nasrabadi

arXiv:1808.01026·eess.AS·August 6, 2018·5 cites

Open Access

TL;DR

This paper introduces a cross-device, text-independent speaker verification system that combines Mel-spectrogram features with prosodic and voice quality features using a Siamese CNN and a multilayer perceptron, and reports improved accuracy over classical approaches.

Contribution

It proposes a new end-to-end architecture integrating spectro-temporal and prosodic features for enhanced cross-device speaker verification.

Findings

01. Significant improvement over classical approaches

02. Effective use of prosodic and voice quality features

03. Robust performance in forensic scenarios

Abstract

In this paper, a novel cross-device text-independent speaker verification architecture is proposed. The majority of state-of-the-art deep architectures used for speaker verification tasks consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency of adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved to be highly reliable in speaker verification models, they represent only some aspects of the short-term acoustic-level traits of the speaker's voice. The human voice, however, consists of several linguistic levels, such as acoustic, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherent shortcomings of spectro-temporal features, we propose to…
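The Siamese idea named in the abstract can be illustrated with a minimal NumPy sketch: two spectrogram-like inputs pass through one shared branch, and the resulting embeddings are compared by distance. Everything here is an assumption made for illustration, not the authors' network: the function names `embed` and `siamese_distance`, the single convolution kernel, the time-averaged pooling, and the linear projection are all hypothetical, and random arrays stand in for Mel-frequency spectrograms.

```python
import numpy as np


def embed(spectrogram, kernel, projection):
    """Toy shared branch (hypothetical, not the paper's architecture).

    spectrogram: (n_freq, n_frames) array standing in for a Mel spectrogram.
    kernel:      (kh, kw) 2-D filter applied over adjacent frequency/time bins.
    projection:  (dim, n_freq - kh + 1) matrix mapping pooled features to an embedding.
    """
    kh, kw = kernel.shape
    # All (kh, kw) patches of the input: shape (n_freq-kh+1, n_frames-kw+1, kh, kw).
    windows = np.lib.stride_tricks.sliding_window_view(spectrogram, (kh, kw))
    # Valid 2-D cross-correlation followed by ReLU.
    feature_map = np.maximum(np.einsum("ijkl,kl->ij", windows, kernel), 0.0)
    # Average over time so utterances of different lengths give same-size vectors.
    pooled = feature_map.mean(axis=1)
    return projection @ pooled


def siamese_distance(spec_a, spec_b, kernel, projection):
    """Embed both inputs with the SAME weights and return their Euclidean distance."""
    emb_a = embed(spec_a, kernel, projection)
    emb_b = embed(spec_b, kernel, projection)
    return float(np.linalg.norm(emb_a - emb_b))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_freq, kh, kw, dim = 40, 3, 3, 8
    kernel = rng.normal(size=(kh, kw))
    projection = rng.normal(size=(dim, n_freq - kh + 1))
    utterance_a = rng.random((n_freq, 120))  # stand-in for one recording
    utterance_b = rng.random((n_freq, 95))   # stand-in for another, different length
    print(siamese_distance(utterance_a, utterance_b, kernel, projection))
```

Because both inputs share `kernel` and `projection`, the distance is symmetric and is zero for identical inputs. In a trained system a small distance would be read as "same speaker", with the decision threshold chosen on held-out data.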

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing