Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation
Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou

TL;DR
This paper introduces a novel phoneme-based crosslingual speech recognition method that eliminates the need for pronunciation lexicons by jointly training speech-to-phoneme, phoneme-to-grapheme, and G2P models using joint stochastic approximation, achieving significant error reductions.
Contribution
The study proposes a pronunciation-lexicon free training approach for crosslingual ASR using a joint stochastic approximation algorithm to train multiple models simultaneously.
Findings
Achieves 5% error rate reduction with minimal phoneme supervision.
Outperforms traditional language model fusion in domain adaptation.
Open-sourced code for reproducibility and further research.
Abstract
Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance…
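The sampling step at the heart of a JSA-style update can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation. JSA is commonly described as drawing latent samples with a Metropolis independence sampler that uses the inference model as its proposal. Here the S2P, P2G, and G2P models are replaced by hypothetical lookup tables over three candidate phoneme strings for a single utterance. Every name and number is an assumption made for illustration, as is conditioning the G2P proposal on the text alone.

```python
import math
import random
from collections import Counter

# Hypothetical toy tables for ONE utterance x with transcript y = "cat".
# None of these values come from the paper; they only make the sketch runnable.
S2P = {"k ae t": 0.5, "k ah t": 0.3, "c a t": 0.2}  # stands in for p(h | x)
P2G = {"k ae t": 0.9, "k ah t": 0.4, "c a t": 0.1}  # stands in for p(y | h)
G2P = {"k ae t": 0.4, "k ah t": 0.4, "c a t": 0.2}  # stands in for q(h | y)


def log_weight(h):
    """Log importance weight: log p(h | x) + log p(y | h) - log q(h | y)."""
    return math.log(S2P[h]) + math.log(P2G[h]) - math.log(G2P[h])


def propose(rng):
    """Draw a candidate phoneme string from the G2P proposal."""
    candidates = list(G2P)
    return rng.choices(candidates, weights=[G2P[h] for h in candidates])[0]


def mis_step(h_prev, rng):
    """One Metropolis independence sampler step."""
    h_new = propose(rng)
    if h_prev is None:
        return h_new  # the first proposal initializes the chain
    # Accept with probability min(1, w(h_new) / w(h_prev)).
    accept_prob = math.exp(min(0.0, log_weight(h_new) - log_weight(h_prev)))
    return h_new if rng.random() < accept_prob else h_prev


def run_chain(num_steps, seed=0):
    """Run the sampler and return empirical frequencies of each latent value."""
    rng = random.Random(seed)
    h = None
    counts = Counter()
    for _ in range(num_steps):
        h = mis_step(h, rng)
        counts[h] += 1
        # In JSA-style training, h would be used here as the phoneme target
        # for gradient steps on all three models (omitted in this sketch).
    return {k: v / num_steps for k, v in counts.items()}


def exact_posterior():
    """Exact p(h | x, y), proportional to p(h | x) * p(y | h)."""
    z = sum(S2P[h] * P2G[h] for h in S2P)
    return {h: S2P[h] * P2G[h] / z for h in S2P}


if __name__ == "__main__":
    print(run_chain(20000))
    print(exact_posterior())
```

With enough steps, the empirical frequencies from run_chain should approach exact_posterior, which is the property that lets the sampled phoneme strings serve as training targets without a pronunciation lexicon.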
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The proposed SPG-JSA method efficiently achieves cross-lingual transfer.
- The paper addresses the multilingual speech recognition problem but does not compare its approach to advanced models like Whisper and Seamless, either in terms of speed or performance.
- The contribution is limited, as the backbone of the work is the S2P model (Whistle), with JSA simply combining existing components.
- While the paper focuses on multilingual capabilities, the experiments are conducted only on Polish and Indonesian.
The paper is mostly well-written and easy to understand. The technique is described well with the necessary details for reproduction. The theory behind the paper isn't new, though the authors carefully design a set of algorithms for the particular ASR problem, making the novelty OK. The results seem quite impressive considering the amount of the phoneme-labeled data was drastically reduced with the proposed method.
The evaluation was done on two languages, and only with a CTC model, which isn't enough to fully convince me of the usefulness of the method. And, given the set of experiments reported, I'm not fully convinced that JSA + beam-search + augmentation + multiple-hypotheses reranking is really necessary. I understand that initially, due to the small amount of phoneme-labeled data, training a separate S2P model might be necessary to bootstrap something useful. And it seems those search techniques and
1. The SPG-JSA algorithm to train S2P, P2G and G2P models is novel.
2. Experiments are solid.
3. The paper is well-written. Experimental setups are detailed and clear.
1. I think the main weakness is that the algorithm is rather complicated. First, it requires some pre-trained models to initialize the S2P, P2G and G2P models separately. As the authors stated “we first fine-tuned the Whistle model on 10 minutes of phoneme labels to initialize the S2P model. Subsequently, this S2P model was utilized to generate phoneme pseudo-labels on the training set, which were then used to train the P2G and G2P models for initialization.” Second, training the entire model re
I am not able to fully assess the paper due to the copy-paste issue mentioned in the weakness section below.
Many paragraphs in the Background section are copied from https://arxiv.org/pdf/2005.14001 (a paper the authors do cite). For example, the Expectation-Maximization (EM) algorithm subsection in this paper is identical to the first paragraph at the top right of the fourth page of that paper. The other paragraphs in the Background section also copy a non-trivial number of sentences from that paper. Authors should rewrite them in their own words if the
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
