Techniques and Challenges in Speech Synthesis
David Ferris

TL;DR
This paper presents an English text-to-speech system built on diphone synthesis. It covers diphone database creation, pronunciation prediction, and pitch and duration modification, and evaluates the output for naturalness and intelligibility.
Contribution
It introduces a novel diphone-based speech synthesis system with automatic diphone extraction, a combined pitch and duration modification method, and a text processing pipeline for improved naturalness.
Findings
Diphone database creation in under 40 minutes
Enhanced voice naturalness through pitch and duration modulation
System tested for intelligibility and naturalness
Abstract
The aim of this project was to develop and implement an English language Text-to-Speech synthesis system. This involved a study of mechanisms of human speech production, a review of techniques in speech synthesis, and analysis of tests used to evaluate the effectiveness of synthesized speech. It was determined that a diphone synthesis system was the most effective choice for the scope of this project. A method of automatically identifying and extracting diphones from prompted speech was designed, allowing for the creation of a diphone database by a speaker in less than 40 minutes. CMUdict was used to determine the pronunciation of known words. A system for smoothing the transitions between diphone recordings was designed and implemented. CMUdict was then used to train a maximum-likelihood prediction system to determine the correct pronunciation of unknown English language alphabetic…
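As a rough illustration of the lookup-then-concatenate flow the abstract describes, the sketch below parses entries in CMUdict's plain-text layout, converts a phoneme sequence into diphone names, and joins two recordings with a linear crossfade. It is a minimal sketch under stated assumptions, not the paper's implementation: the two-entry `SAMPLE_DICT`, the `_` silence symbol, the `A-B` diphone naming, and the function names are all hypothetical, and the linear crossfade is a common stand-in because the smoothing method itself is not specified in this summary.

```python
import re

# Hypothetical two-entry excerpt in CMUdict's "WORD  PHONEMES" layout.
SAMPLE_DICT = """\
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
"""

SILENCE = "_"  # assumed boundary symbol, not taken from the paper


def parse_dict(text):
    """Map each word to its phoneme list, dropping the stress digits."""
    entries = {}
    for line in text.splitlines():
        # Skip blank lines and CMUdict comment lines (prefixed with ';;;').
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        entries[word] = [re.sub(r"\d", "", p) for p in phones]
    return entries


def to_diphones(phones):
    """Pair each phoneme with its successor, padding both ends with silence."""
    padded = [SILENCE] + list(phones) + [SILENCE]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def crossfade(left, right, overlap):
    """Join two sample lists, linearly blending `overlap` samples at the seam."""
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter input length")
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps from left to right
        blended.append((1 - w) * left[start + i] + w * right[i])
    return left[:start] + blended + right[overlap:]


pronunciations = parse_dict(SAMPLE_DICT)
print(to_diphones(pronunciations["HELLO"]))
# prints ['_-HH', 'HH-AH', 'AH-L', 'L-OW', 'OW-_']
```

In a full system each diphone name would index a recorded waveform, and a call such as `crossfade(unit_a, unit_b, overlap)` would be applied at every join. Words missing from the dictionary would go to a separate pronunciation predictor, which this sketch does not attempt.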
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
