Fine-Grained and Interpretable Neural Speech Editing
Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo

TL;DR
This paper introduces a disentangled and interpretable speech representation that enables fine-grained editing of speech attributes such as prosody, pronunciation, speaker identity, and formants, with vocoding reconstruction accuracy comparable to Mel spectrograms.
Contribution
It presents the first disentangled and interpretable speech representation with competitive reconstruction accuracy, enabling fine-grained editing of multiple speech attributes.
Findings
Achieved subjective and objective vocoding reconstruction accuracy comparable to Mel spectrograms.
Enabled fast, accurate, and high-quality editing of multiple speech attributes.
Demonstrated effectiveness of data augmentation in training neural vocoders for speech editing.
Abstract
Fine-grained editing of speech attributes, such as prosody (i.e., the pitch, loudness, and phoneme durations), pronunciation, speaker identity, and formants, is useful for fine-tuning and fixing imperfections in human and AI-generated speech recordings for creation of podcasts, film dialogue, and video game dialogue. Existing speech synthesis systems use representations that entangle two or more of these attributes, prohibiting their use in fine-grained, disentangled editing. In this paper, we demonstrate the first disentangled and interpretable representation of speech with comparable subjective and objective vocoding reconstruction accuracy to Mel spectrograms. Our interpretable representation, combined with our proposed data augmentation method, enables training an existing neural vocoder to perform fast, accurate, and high-quality editing of pitch,…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
