Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning

Siqi Sun; Korin Richmond

arXiv:2409.09891·cs.CL·September 17, 2024

Open Access

TL;DR

This paper proposes a multi-task learning (MTL) method that lets a Seq2Seq text-to-speech frontend acquire pronunciation knowledge from transcribed speech audio. It reaches accuracy similar to earlier auxiliary-ASR approaches with a simpler implementation flow.

Contribution

It presents a novel MTL-based method that leverages transcribed speech audio for pronunciation learning, reducing complexity compared to previous auxiliary ASR-based approaches.

Findings

01 PER reduced from 2.5% to 1.6%, relative to a baseline Seq2Seq frontend, for word types covered exclusively in the transcribed speech audio

02 Achieves performance similar to previous methods

03 Simplifies the implementation flow
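To put the first finding in perspective, the short Python sketch below works out the absolute and relative PER reduction from the two reported figures. The function and variable names are illustrative only and are not taken from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Return the fraction by which `proposed` lowers `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Figures reported on this page, in percent.
baseline_per = 2.5  # baseline Seq2Seq frontend
mtl_per = 1.6       # proposed MTL-based method

absolute_drop = baseline_per - mtl_per
rel = relative_reduction(baseline_per, mtl_per)

print(f"absolute: {absolute_drop:.1f} points, relative: {rel:.0%}")
# prints "absolute: 0.9 points, relative: 36%"
```

Both numbers apply only to the word types covered exclusively in the transcribed speech audio, not to the full test vocabulary.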

Abstract

Recent work has shown the feasibility and benefit of bootstrapping an integrated sequence-to-sequence (Seq2Seq) linguistic frontend from a traditional pipeline-based frontend for text-to-speech (TTS). To overcome the fixed lexical coverage of bootstrapping training data, previous work has proposed to leverage easily accessible transcribed speech audio as an additional training source for acquiring novel pronunciation knowledge for uncovered words, which relies on an auxiliary ASR model as part of a cumbersome implementation flow. In this work, we propose an alternative method to leverage transcribed speech audio as an additional training source, based on multi-task learning (MTL). Experiments show that, compared to a baseline Seq2Seq frontend, the proposed MTL-based method reduces PER from 2.5% to 1.6% for those word types covered exclusively in transcribed speech audio, achieving a…
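PER (phone error rate) is the metric quoted in the abstract. A minimal sketch of how such a rate is commonly computed follows, assuming the usual definition: total Levenshtein edit distance between predicted and reference phone sequences, divided by the total number of reference phones. The abstract does not spell out the paper's exact scoring setup, so the helper names and the toy phone sequences here are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(refs: Sequence[Sequence[str]],
                     hyps: Sequence[Sequence[str]]) -> float:
    """Corpus-level PER: total edits / total reference phones."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference is empty")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total


# Toy example with made-up phone sequences (not data from the paper).
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
hyps = [["k", "ae", "t"], ["d", "aa", "g", "z"]]
per = phone_error_rate(refs, hyps)
print(f"PER = {per:.1%}")
```

In the toy example the second word has one substitution and one insertion, so two errors are counted over six reference phones.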

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence