Zero resource speech synthesis using transcripts derived from perceptual acoustic units
Karthik Pandia D S, Hema A Murthy

TL;DR
This paper presents a method for zero-resource speech synthesis by discovering and modeling perceptual acoustic units from untranscribed speech, enabling vocabulary-independent synthesis with low bit rate encoding.
Contribution
It introduces a novel approach to identify and model perceptual acoustic units from raw speech data for zero-resource synthesis, combining clustering and HMM-GMM modeling.
Findings
Achieves good synthesis quality with low bit rate encoding.
Uses clustering of CVC-like segments to discover acoustic units.
Demonstrates effectiveness on the ZeroSpeech 2019 dataset.
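
The connected-components clustering named in the findings can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed-length feature vectors, the Euclidean distance, and the `threshold` value are all assumptions made for illustration, and the paper's actual segment representation and distance measure may differ. Two segments are linked when their distance falls below the threshold, and each connected component of the resulting graph becomes one cluster.

```python
import math
from itertools import combinations


def connected_component_clusters(features, threshold):
    """Cluster items via connected components of a thresholded distance graph.

    `features` is a list of equal-length numeric tuples (hypothetical
    stand-ins for segment representations); `threshold` is an assumed
    linking distance, not a value taken from the paper.
    """
    parent = list(range(len(features)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (merge two components) for every sufficiently close pair.
    for i, j in combinations(range(len(features)), 2):
        if math.dist(features[i], features[j]) < threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Collect member indices per component root.
    clusters = {}
    for i in range(len(features)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


toy_segments = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 10.0), (10.4, 10.0)]
toy_clusters = connected_component_clusters(toy_segments, threshold=0.6)
```

In the toy data, items 0 and 2 are farther apart than the threshold but still share a cluster because both link to item 1. This chaining is the defining behaviour of connected-components clustering, and it is why the choice of threshold matters.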
Abstract
Zerospeech synthesis is the task of building vocabulary-independent speech synthesis systems when transcriptions are not available for the training data. It is, therefore, necessary to convert the training data into a sequence of fundamental acoustic units that can be used for synthesis at test time. This paper attempts to discover and model perceptual acoustic units consisting of steady-state and transient regions in speech. The transients roughly correspond to CV and VC units, while the steady-state regions correspond to sonorants and fricatives. The speech signal is first preprocessed by segmenting it into CVC-like units using a short-term energy-like contour. These CVC segments are clustered using a connected components-based graph clustering technique. The clustered CVC segments are initialized such that the onset (CV) and decays (VC) correspond to transients, and the rhyme corresponds…
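
The preprocessing step described in the abstract, segmenting speech with a short-term energy-like contour, can be sketched as follows. This is a simplified illustration under stated assumptions: plain frame energy stands in for the paper's "energy-like" contour, boundaries are placed at local minima of that contour, and the function names, `frame_len`, and `hop` are hypothetical choices rather than details from the paper.

```python
def short_term_energy(signal, frame_len, hop):
    """Sum of squared samples per frame (a stand-in for an energy-like contour)."""
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def segment_at_energy_minima(energy):
    """Split a contour into (start, end) frame ranges at its local minima.

    A frame is treated as a boundary when its energy is lower than the
    previous frame and not higher than the next one (an assumed rule).
    """
    minima = [
        i for i in range(1, len(energy) - 1)
        if energy[i] < energy[i - 1] and energy[i] <= energy[i + 1]
    ]
    bounds = [0] + minima + [len(energy)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


toy_energy = short_term_energy([1, 1, 2, 2, 0, 0], frame_len=2, hop=2)
toy_segments = segment_at_energy_minima([1, 4, 9, 3, 1, 5, 8, 2, 6, 7])
```

Each returned range covers one energy hump between two dips, which is a rough proxy for a syllable-like (CVC-like) unit. A practical system would normally smooth the contour first so that small fluctuations do not produce spurious boundaries.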
