
TL;DR
This paper investigates whether the original (source) speakers can be identified from voice-converted audio using a deep learning-based recognition system, and reports promising results despite the acoustic alterations introduced by voice conversion.
Contribution
It introduces a hierarchical VLAD-based deep neural network model for robust speaker recognition from converted voices, addressing a key challenge in speaker verification.
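For readers unfamiliar with VLAD (vector of locally aggregated descriptors), the basic pooling step can be sketched as follows. This is a minimal illustration of plain hard-assignment VLAD in NumPy, not the paper's model: the summary does not describe the hierarchical variant or how it is integrated into the network, so the function name `vlad_pool`, the array shapes, and the normalisation choices are all assumptions made for illustration.

```python
import numpy as np

def vlad_pool(frames, centers):
    """Aggregate frame-level features (T, D) into one utterance vector (K*D,).

    Hypothetical helper: hard-assignment VLAD, assumed here for illustration.
    """
    # Squared Euclidean distance from every frame to every center: (T, K).
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)  # index of the nearest center per frame

    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = frames[assign == k]
        if len(members):
            # Sum of residuals between assigned frames and their center.
            vlad[k] = (members - centers[k]).sum(axis=0)

    # Per-cluster L2 normalisation, then global L2 normalisation.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    vlad = vlad / np.maximum(norms, 1e-12)
    flat = vlad.ravel()
    return flat / max(np.linalg.norm(flat), 1e-12)
```

Given `T` frame-level feature vectors of dimension `D` and `K` cluster centers, the function returns a fixed-length vector of size `K * D` regardless of utterance length, which is what makes this kind of pooling convenient for utterance-level speaker embeddings. In neural variants the hard assignment is typically replaced by a learned soft assignment.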
Findings
High recognition accuracy on converted voices
Robustness against voice quality variations
Effective use of hierarchical VLAD in DNNs
Abstract
Voice conversion (VC) using deep learning technologies can now generate high-quality one-to-many voices and has therefore been used in practical applications such as entertainment and healthcare. However, voice conversion can raise social issues when manipulated voices are used for deceptive purposes. Moreover, identifying the real speakers behind converted voices is a major challenge, because the acoustic characteristics of the source speakers are changed greatly. In this paper, we explore the feasibility of identifying authentic speakers from converted voices. The study assumes that certain information from the source speakers persists even when their voices are converted into different target voices. Our experiments are therefore geared towards recognising the source speakers given the converted voices, which are generated by…
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · AI in Service Interactions
