A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings
Anindita Mondal, Rangavajjala Sankara Bharadwaj, Jhansi Mallela, Anil Kumar Vuppala, Chiranjeevi Yarra

TL;DR
This study evaluates the effectiveness of prosody embeddings from state-of-the-art TTS systems in detecting word and syllable prominence in native and non-native speech, revealing notable improvements over heuristic and self-supervised features.
Contribution
The paper presents a comparative analysis of TTS-derived prosody embeddings for prominence detection in non-native speech, including a novel method that extracts the embeddings in TTS training mode.
Findings
TTS embeddings improve prominence detection accuracy by up to 16.2%.
Embeddings extracted in TTS training mode outperform those extracted in inference mode.
Prosody embeddings from TTS can match natural speech prominence patterns.
Abstract
Automatic detection of prominence at the word and syllable levels is critical for building computer-assisted language learning systems. It has been shown that prosody embeddings learned by current state-of-the-art (SOTA) text-to-speech (TTS) systems can generate word- and syllable-level prominence in synthesized speech that is as natural as in native speech. To understand the effectiveness of TTS prosody embeddings for prominence detection in a non-native context, a comparative analysis is conducted on embeddings extracted from native and non-native speech, considering the prominence-related embeddings (duration, energy, and pitch) from a SOTA TTS system, FastSpeech2. These embeddings are extracted under two conditions: 1) only text, and 2) both speech and text. For the first condition, the embeddings are extracted directly from the TTS inference mode, whereas for the…
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems
