Differentiable Allophone Graphs for Language-Universal Speech Recognition
Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

TL;DR
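This paper introduces a framework for building universal phone-based speech recognition models. It derives phone-level supervision from phonemic transcriptions alone, using differentiable allophone graphs: phone-to-phoneme mappings with learnable weights, represented as weighted finite-state transducers. Multilingual training yields an interpretable probabilistic mapping for each language.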
Contribution
The work presents a novel differentiable allophone graph approach that learns language-specific phoneme-to-allophone mappings from phonemic transcriptions, facilitating universal and interpretable speech recognition.
Findings
Trained on 7 diverse languages, the system effectively models pronunciation variations.
The approach enables linguists to document languages and build lexicons with rich pronunciation data.
The model provides interpretable probabilistic mappings for each language.
Abstract
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that…
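The core idea in the abstract, phone-to-phoneme mappings with learnable weights, can be illustrated with a small sketch. This is not the authors' implementation: it replaces the weighted finite-state transducer with a plain dictionary of arcs, and the phone inventory, the `ALLOWED` arcs, the softmax normalization over each phone's outgoing arcs, and all function names are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical arcs for one language: each universal phone lists the
# language-specific phonemes it may realize (an assumption, not real data).
ALLOWED = {
    "t": ["/t/"],
    "tʰ": ["/t/"],
    "ɾ": ["/t/", "/d/"],  # a flap that could belong to either phoneme
    "d": ["/d/"],
}

def phoneme_posteriors(phone_probs, arc_logits):
    """Map a phone distribution to a phoneme distribution.

    Each phone's probability mass is split across its allowed phonemes
    using softmax-normalized arc weights. In a trainable model the
    arc_logits would be parameters updated by gradient descent; here
    they are plain floats.
    """
    out = {}
    for phone, p in phone_probs.items():
        phonemes = ALLOWED[phone]
        weights = softmax([arc_logits[(phone, ph)] for ph in phonemes])
        for ph, w in zip(phonemes, weights):
            out[ph] = out.get(ph, 0.0) + w * p
    return out

# Uniform initial logits: the ambiguous flap splits evenly between phonemes.
arc_logits = {(phone, ph): 0.0 for phone, phs in ALLOWED.items() for ph in phs}
phone_probs = {"t": 0.1, "tʰ": 0.2, "ɾ": 0.6, "d": 0.1}
example = phoneme_posteriors(phone_probs, arc_logits)
```

Because the phoneme distribution is a differentiable function of the arc logits, a loss computed against phonemic transcriptions could in principle propagate gradients back to both the arc weights and the phone predictions. After training, the normalized weights can be read off directly as a per-language probabilistic mapping.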
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
