Speech Synthesis with Neural Networks
Orhan Karaali, Gerald Corrigan, Ira Gerson

TL;DR
This paper presents a neural network-based text-to-speech system that needs less memory than concatenation-based synthesis and performed well in tests against commercial systems built on other technologies.
Contribution
It introduces a neural network approach in which a time-delay neural network (TDNN) performs the phonetic-to-acoustic mapping and a second neural network controls the timing of the generated speech.
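A time-delay neural network applies the same weights to a sliding window of input frames, so each output frame depends on its neighbors as well as on the current frame. The sketch below shows one such layer in plain Python. The function name `tdnn_layer`, the tanh activation, and the zero padding at the sequence edges are illustrative assumptions. This page does not describe the paper's network at that level of detail.

```python
import math
from typing import List, Sequence

Vector = List[float]


def tdnn_layer(frames: Sequence[Vector],
               weights: Sequence[Sequence[Vector]],
               bias: Vector,
               offsets: Sequence[int]) -> List[Vector]:
    """One time-delay layer (hypothetical sketch).

    output[t] = tanh(bias + sum_k weights[k] @ frames[t + offsets[k]])

    frames:  T input vectors, each of length n_in
    weights: one (n_out x n_in) matrix per offset
    bias:    vector of length n_out
    offsets: frame offsets such as (-2, -1, 0, 1, 2); frames that fall
             outside the sequence are treated as all zeros
    """
    n_out = len(bias)
    T = len(frames)
    out = []
    for t in range(T):
        acc = list(bias)
        # The same weight matrices are reused at every time step t.
        for W, off in zip(weights, offsets):
            s = t + off
            if 0 <= s < T:
                x = frames[s]
                for i in range(n_out):
                    acc[i] += sum(w * xi for w, xi in zip(W[i], x))
        out.append([math.tanh(a) for a in acc])
    return out
```

For example, a one-dimensional input `[[1.0], [0.0], [0.0]]` with offsets `(-1, 0)` lets the impulse at frame 0 affect both frame 0 (through the offset-0 weights) and frame 1 (through the offset -1 weights). Stacking several such layers widens the window of context each output frame sees.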
Findings
Requires less memory than concatenation systems
Performed well in tests comparing it to commercial systems that use other technologies
Demonstrates a working neural network approach to both the phonetic-to-acoustic mapping and the timing of synthesized speech
Abstract
Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. This paper describes a system that uses a time-delay neural network (TDNN) to perform this phonetic-to-acoustic mapping, with another neural network to control the timing of the generated speech. The neural network system requires less memory than a concatenation system, and performed well in tests comparing it to commercial systems using other technologies.
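The abstract divides the work between a timing network and the phonetic-to-acoustic TDNN. One plausible way to connect the two stages is sketched below. This is an assumption, not something the abstract states. The timing model predicts a duration in frames for each phone, and the phone sequence is then expanded into one input per frame. `predict_frames` is a hypothetical stand-in for the timing network. Tagging each frame with its relative position inside the phone is also an illustrative choice.

```python
from typing import Callable, List, Sequence, Tuple


def expand_to_frames(phones: Sequence[str],
                     predict_frames: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Repeat each phone for its predicted number of frames (hypothetical sketch).

    predict_frames stands in for the timing network: it returns a
    (possibly fractional) duration in frames for a phone.
    Each output frame is (phone, fraction of the way through the phone),
    an assumed frame-level input for the phonetic-to-acoustic network.
    """
    frames = []
    for p in phones:
        n = max(1, round(predict_frames(p)))  # every phone gets at least one frame
        for k in range(n):
            frames.append((p, k / n))
    return frames
```

With a toy lookup table standing in for the timing network, for example `{"k": 2.0, "ae": 4.4, "t": 1.6}`, the phones of "cat" expand to 2 + 4 + 2 = 8 frames. Those frames would then be fed, frame by frame, to the acoustic network.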
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
