Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models

Borodin Kirill Nikolayevich; Kudryavtsev Vasiliy Dmitrievich; Mkrtchian Grach Maratovich; Gorodnichev Mikhail Genadievich; Korzh Dmitrii Sergeevich

arXiv:2406.19243 · cs.SD · June 28, 2024

Open Access

TL;DR

This paper explores an automatic speaker verification (ASV) system that extracts voice features for use in a multi-voice TTS pipeline, and reports an EER of 20.669 in the SSTC challenge when verifying converted voices.

Contribution

It introduces a novel speaker verification approach built on embeddings that capture pitch, energy, and phoneme duration, aimed at improving voice authentication in TTS systems.

Findings

01. Achieved an EER of 20.669 in the SSTC challenge for voice conversion verification.

02. Demonstrated the potential of embedding-based features in voice biometric security.

03. Enhanced verification accuracy for manipulated voice data.
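
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. The following is a minimal sketch of how it can be estimated from trial scores, not the evaluation code used by the authors or by the SSTC challenge. The function name `equal_error_rate` and the toy scores are hypothetical, and the sketch assumes that a higher score means "more likely the same speaker".

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER from two lists of similarity scores.

    Assumption: higher score = more likely the same speaker.
    Returns (eer, threshold), where eer is a fraction in [0, 1].
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.3f} at threshold {threshold}")
```

The function returns a fraction; multiply by 100 to express it as a percentage. With few trials the two error rates may never be exactly equal, so this sketch takes their average at the closest point; production toolkits typically interpolate along the ROC curve instead.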

Abstract

One of the most crucial components of biometric security is the automatic speaker verification (ASV) system, which identifies a speaker by their voice. ASV systems can be used in isolation or in conjunction with other AI models. The quality and number of neural networks are growing rapidly, and so is the number of systems that aim to manipulate data using voice conversion (VC) and text-to-speech (TTS) models. Research on voice biometrics forgery is supported by a number of challenges, including SSTC, ASVspoof, and SingFake. This paper presents a system for automatic speaker verification. The primary objective of our model is to extract embeddings from the target speaker's audio in order to capture important characteristics of their voice, such as pitch, energy, and phoneme duration. This…
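
As a rough illustration of how speaker embeddings are commonly used for verification, the sketch below compares an enrollment embedding with a test embedding using cosine similarity and a fixed threshold. This is an assumption for illustration only: the abstract does not state which scoring back-end the authors use, and the names `cosine_similarity` and `verify`, the threshold of 0.5, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(enroll_embedding, test_embedding, threshold=0.5):
    """Return (score, accepted) for one verification trial.

    Assumption: the threshold is a placeholder; in practice it would be
    tuned on held-out trials.
    """
    score = cosine_similarity(enroll_embedding, test_embedding)
    return score, score >= threshold


# Toy 3-dimensional embeddings, invented purely for illustration.
enrolled = [0.2, 0.9, 0.1]
same_speaker = [0.25, 0.85, 0.05]
other_speaker = [0.9, -0.1, 0.4]

print(verify(enrolled, same_speaker))
print(verify(enrolled, other_speaker))
```

Real speaker embeddings usually have hundreds of dimensions and come from a trained neural encoder; only the scoring step is shown here.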

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis