TL;DR
DeepTalk is a prosody encoding network that extracts vocal style features directly from raw audio. It improves speaker recognition accuracy and, when integrated into a speech synthesizer, captures the F0 contours that vocal style modeling depends on.
Contribution
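The paper presents DeepTalk, a prosody encoding network that extracts vocal style features directly from raw audio. It outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets, and it is integrated into a state-of-the-art speech synthesizer to produce more natural synthetic speech.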
Findings
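DeepTalk outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets.
Combining DeepTalk with a state-of-the-art speaker recognition system based on physiological speech features further improves recognition accuracy.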
DeepTalk captures F0 contours crucial for vocal style modeling.
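For readers unfamiliar with the term, an F0 contour is the frame-by-frame track of a voice's fundamental frequency (pitch) over time. The sketch below is not DeepTalk, which learns its features with a network; it is a classical autocorrelation pitch tracker, included only to illustrate what an F0 contour is. The function names, the 60 to 400 Hz search range, and the synthetic test tone are all assumptions made for this illustration.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame by picking the autocorrelation peak.

    Returns 0.0 when no positive peak is found (e.g. a silent frame).
    The search range f0_min..f0_max is an illustrative assumption.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def f0_contour(signal, sample_rate, frame_len, hop):
    """Slide a window over the signal and return one F0 estimate per frame."""
    return [
        estimate_f0(signal[start:start + frame_len], sample_rate)
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def tone(freq, sample_rate, n_samples):
    """Synthetic sine tone standing in for voiced speech (hypothetical helper)."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n_samples)]


sample_rate = 8000
# A toy "utterance": 50 ms at 200 Hz followed by 50 ms at 100 Hz.
signal = tone(200.0, sample_rate, 400) + tone(100.0, sample_rate, 400)
contour = f0_contour(signal, sample_rate, frame_len=400, hop=400)
print(contour)
```

The printed list has one pitch estimate per frame, which is the kind of trajectory the finding above refers to. Real speech would need voicing detection and smoothing, which this sketch omits.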
Abstract
Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. The DeepTalk method outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets. The speaker recognition performance is further improved by combining DeepTalk with a state-of-the-art physiological speech feature-based speaker recognition system. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that the…
