Universal Automatic Phonetic Transcription into the International Phonetic Alphabet

Chihiro Taguchi; Yusuke Sakai; Parisa Haghani; David Chiang

arXiv:2308.03917 · cs.CL · August 9, 2023

TL;DR

This paper introduces a language-agnostic speech-to-IPA transcription model based on wav2vec 2.0. Its transcription quality is close to that of human annotators, so it could help automate phonetic transcription in language documentation.

Contribution

The paper presents a universal speech-to-IPA model trained on high-quality IPA transcriptions of seven languages from CommonVoice 11.0. It achieves results comparable to or better than the previous best model (Wav2Vec2Phoneme) despite using a much smaller training dataset.

Findings

01. Model achieves near-human transcription quality.

02. Uses smaller, higher-quality training data.

03. Performs well across multiple languages.

Abstract

This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use training data from seven languages from CommonVoice 11.0, transcribed into IPA semi-automatically. Although this training dataset is much smaller than Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or better results. Furthermore, we show that the quality of our universal speech-to-IPA models is close to that of human annotators.
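As a rough illustration of the last step in such a pipeline, the sketch below shows greedy CTC-style decoding, which is how fine-tuned wav2vec 2.0 models are commonly turned from per-frame label scores into a symbol string. This is an assumption about the decoding step, not a description of the authors' code: the function name `ctc_greedy_decode`, the blank token name `<pad>`, the toy vocabulary, and the toy scores are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical name for the CTC blank token; real vocabularies may differ.
BLANK = "<pad>"


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank: str = BLANK,
) -> str:
    """Pick the best label per frame, collapse repeats, then drop blanks."""
    # Best-scoring label for each audio frame.
    best: List[str] = []
    for row in frame_scores:
        best_index = max(range(len(row)), key=lambda i: row[i])
        best.append(vocab[best_index])

    # Collapse runs of the same label and remove blank tokens.
    decoded: List[str] = []
    previous = None
    for label in best:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Toy example: a hypothetical 4-symbol vocabulary and 8 frames of scores.
toy_vocab = ["<pad>", "k", "a", "t"]
toy_scores = [
    [0, 1, 0, 0],  # k
    [0, 1, 0, 0],  # k (repeat, collapsed)
    [1, 0, 0, 0],  # blank
    [0, 0, 1, 0],  # a
    [0, 0, 1, 0],  # a (repeat, collapsed)
    [1, 0, 0, 0],  # blank separates two distinct "a" symbols
    [0, 0, 1, 0],  # a
    [0, 0, 0, 1],  # t
]
print(ctc_greedy_decode(toy_scores, toy_vocab))  # prints "kaat"
```

The blank token is what lets the decoder distinguish a held sound (repeated frames of one label) from two consecutive instances of the same symbol. In practice the vocabulary would consist of IPA symbols rather than the placeholder letters used here.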

Code & Models

Repositories

ctaguchi/multipa (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings