Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?
Jialu Li, Mark Hasegawa-Johnson

TL;DR
This study compares synchronous and asynchronous CTC-based neural network models of phones and tones for multilingual and cross-lingual speech recognition. Synchronous models achieve lower error rates on the joint phone+tone tier, while asynchronous models achieve lower error rates on the tone tier.
Contribution
The paper tests four CTC-based acoustic models that differ in the degree of synchrony they impose between phones and tones, evaluating them in both multilingual and cross-lingual speech recognition.
Findings
Synchronous models have lower joint phone+tone error rates.
Asynchronous models achieve lower tone error rates.
Both synchronous and asynchronous models are effective in multilingual and cross-lingual settings.
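To make the two metrics in the findings concrete, here is a minimal sketch of how an error rate could be computed separately on the joint phone+tone tier, the phone tier, and the tone tier. It assumes the usual edit-distance definition of error rate (edits divided by reference length); the toy reference and hypothesis sequences, the (phone, tone) pair encoding, and the function names are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Toy data (assumption): each token is a (phone, tone) pair.
ref = [("a", "H"), ("i", "L"), ("u", "H")]
hyp = [("a", "H"), ("e", "L"), ("u", "L")]

joint_er = error_rate(ref, hyp)                                   # pairs must match exactly
phone_er = error_rate([p for p, _ in ref], [p for p, _ in hyp])   # phone tier only
tone_er = error_rate([t for _, t in ref], [t for _, t in hyp])    # tone tier only

print(joint_er, phone_er, tone_er)
```

In this toy case one phone error and one tone error fall on different tokens, so the joint tier counts both of them while each single tier counts only one. This is why joint and per-tier error rates can rank models differently.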
Abstract
Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models, differing in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings. Synchronous models achieve a lower error rate in the joint phone+tone tier, but…
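As a rough illustration of what synchrony means for a CTC output, the sketch below applies greedy CTC collapsing (merge consecutive repeats, then drop blanks) to hand-written per-frame labels. It contrasts a single stream of joint phone+tone labels with two independently decoded streams. The frame labels, the "a+H" label format, and the blank symbol are assumptions for illustration; this is not a reconstruction of any of the paper's four models.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_collapse(frames):
    """Greedy CTC post-processing: merge consecutive repeats, then drop blanks."""
    return [label for label, _ in groupby(frames) if label != BLANK]


# Synchronous sketch: one output stream whose labels are joint phone+tone units,
# so every phone is emitted together with its tone.
joint_frames = ["_", "a+H", "a+H", "_", "i+L", "i+L", "_"]

# Asynchronous sketch: two output streams decoded independently, so tone
# labels need not fall on the same frames as the phone labels.
phone_frames = ["_", "a", "a", "_", "i", "i", "_"]
tone_frames = ["H", "H", "_", "_", "_", "L", "_"]

joint_seq = ctc_collapse(joint_frames)
phone_seq = ctc_collapse(phone_frames)
tone_seq = ctc_collapse(tone_frames)

print(joint_seq)  # joint labels, aligned by construction
print(phone_seq)  # phone tier
print(tone_seq)   # tone tier, timed independently of the phone tier
```

In the joint stream the pairing of phone and tone is fixed by the label itself, whereas the two separate streams each yield their own sequence and any pairing between them has to be established afterwards.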
