A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings

Anindita Mondal; Rangavajjala Sankara Bharadwaj; Jhansi Mallela; Anil Kumar Vuppala; Chiranjeevi Yarra

arXiv:2412.08283 · cs.CL · December 12, 2024

Open Access

TL;DR

This study evaluates the effectiveness of prosody embeddings from state-of-the-art TTS systems in detecting word and syllable prominence in native and non-native speech, revealing notable improvements over heuristic and self-supervised features.

Contribution

It introduces a comparative analysis of TTS-derived prosody embeddings for prominence detection in non-native speech, including a novel extraction method during TTS training mode.

Findings

01 TTS embeddings improve prominence detection accuracy by up to 16.2%.

02 Embeddings extracted during TTS training mode outperform inference mode.

03 Prosody embeddings from TTS can match natural speech prominence patterns.

Abstract

Automatic detection of prominence at the word and syllable levels is critical for building computer-assisted language learning systems. It has been shown that prosody embeddings learned by current state-of-the-art (SOTA) text-to-speech (TTS) systems can generate word- and syllable-level prominence in synthesized speech that is as natural as in native speech. To understand the effectiveness of TTS prosody embeddings for prominence detection in a non-native context, a comparative analysis is conducted on embeddings extracted from native and non-native speech, considering the prominence-related embeddings (duration, energy, and pitch) from a SOTA TTS system named FastSpeech2. These embeddings are extracted under two conditions: 1) text only, and 2) both speech and text. For the first condition, the embeddings are extracted directly from the TTS inference mode, whereas for the…

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems