Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR
Yusuke Fujita, Tatsuya Komatsu, Yusuke Kida

TL;DR
This paper proposes a novel Japanese ASR method that uses intermediate syllable and character predictions to improve recognition accuracy, addressing pronunciation ambiguities inherent in Japanese kanji characters.
Contribution
It introduces an explicit interaction mechanism between characters and syllables using Self-conditioned CTC with intermediate predictions as conditioning features.
Findings
Outperformed conventional multi-task methods
Improved recognition accuracy on the Corpus of Spontaneous Japanese
Effective handling of pronunciation ambiguities
Abstract
End-to-end automatic speech recognition directly maps input speech to characters. However, the mapping can be problematic when several different pronunciations should be mapped into one character or when one pronunciation is shared among many different characters. Japanese ASR suffers the most from such many-to-one and one-to-many mapping problems due to Japanese kanji characters. To alleviate the problems, we introduce explicit interaction between characters and syllables using Self-conditioned connectionist temporal classification (CTC), in which the upper layers are "self-conditioned" on the intermediate predictions from the lower layers. The proposed method utilizes character-level and syllable-level intermediate predictions as conditioning features to deal with mutual dependency between characters and syllables. Experimental results on Corpus of Spontaneous Japanese show that the…
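The conditioning idea in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the toy encoder block, the layer count, the vocabulary sizes, the strict syllable/character alternation, and every name here (`ConditioningHead`, `forward`, `schedule`) are assumptions made for illustration. After selected encoder layers, the hidden frames are projected to a posterior over one unit type (syllables or characters), and that posterior is projected back and added to the hidden frames before the next layer. The CTC losses that would be applied to each intermediate posterior during training are omitted.

```python
import math
import random


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ConditioningHead:
    """Hypothetical head: hidden -> unit posterior -> back to hidden size."""

    def __init__(self, hidden, vocab, rng):
        self.to_vocab = rand_matrix(vocab, hidden, rng)
        self.to_hidden = rand_matrix(hidden, vocab, rng)

    def __call__(self, frames):
        # Per-frame posterior over this head's units (syllables or characters).
        posteriors = [softmax(matvec(self.to_vocab, h)) for h in frames]
        # Self-conditioning: add the projected posterior back to the hidden frame.
        conditioned = [
            [h_i + c_i for h_i, c_i in zip(h, matvec(self.to_hidden, p))]
            for h, p in zip(frames, posteriors)
        ]
        return conditioned, posteriors


def encoder_layer(W, frames):
    # Stand-in for a real encoder block (assumption): residual tanh projection.
    return [[h_i + math.tanh(y_i) for h_i, y_i in zip(h, matvec(W, h))] for h in frames]


def forward(frames, layers, heads, schedule):
    """schedule maps a layer index to 'syllable' or 'character' (assumed layout)."""
    intermediates = []
    for i, W in enumerate(layers):
        frames = encoder_layer(W, frames)
        unit = schedule.get(i)
        if unit is not None:
            frames, post = heads[unit](frames)
            intermediates.append((i, unit, post))
    return frames, intermediates


rng = random.Random(0)
HIDDEN, N_SYL, N_CHAR, T = 8, 5, 12, 4  # toy sizes, not from the paper
layers = [rand_matrix(HIDDEN, HIDDEN, rng) for _ in range(6)]
heads = {
    "syllable": ConditioningHead(HIDDEN, N_SYL, rng),
    "character": ConditioningHead(HIDDEN, N_CHAR, rng),
}
# Alternate the conditioning target across intermediate layers.
schedule = {0: "syllable", 1: "character", 2: "syllable", 3: "character", 4: "syllable"}
frames = rand_matrix(T, HIDDEN, rng, scale=1.0)
final, inter = forward(frames, layers, heads, schedule)
# In this sketch the last layer is scored with the character head (assumption).
final_posteriors = [softmax(matvec(heads["character"].to_vocab, h)) for h in final]
```

Because each head feeds its prediction back into the shared hidden state, later character predictions can depend on earlier syllable predictions and vice versa, which is the mutual dependency the abstract refers to.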
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Phonetics and Phonology Research
