ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition

Khoa Anh Nguyen; Long Minh Hoang; Nghia Hieu Nguyen; Luan Thanh Nguyen; and Ngan Luu-Thuy Nguyen

arXiv:2602.10003·cs.CL·February 11, 2026

TL;DR

ViSpeechFormer introduces a phoneme-based Vietnamese ASR framework leveraging the language's high grapheme-phoneme transparency, achieving improved performance and generalization, and potentially benefiting other phonetic orthographies.

Contribution

To the authors' knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations; the reported advantages are better generalization to out-of-vocabulary words and lower sensitivity to training bias.

Findings

01. Achieves strong performance on Vietnamese ASR datasets

02. Generalizes better to out-of-vocabulary words

03. Less affected by training bias

Abstract

Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (Vietnamese Speech TransFormer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
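To make the idea of grapheme-phoneme transparency concrete, here is a minimal sketch of how a written Vietnamese syllable can be split into an onset, a rime, and a tone using only its spelling. This is an illustration, not the paper's method: the abstract does not describe ViSpeechFormer's phoneme inventory or segmentation rules, so the function name `split_syllable`, the `ONSETS` list, and the ASCII tone labels are all assumptions introduced here, and the output is an orthographic decomposition rather than a true phonemic transcription.

```python
import unicodedata

# Combining marks that encode the five marked tones in Vietnamese orthography.
# The ASCII labels are illustrative; the unmarked tone is labelled "ngang".
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0309": "hoi",    # hook above
    "\u0303": "nga",    # tilde
    "\u0323": "nang",   # dot below
}

# Assumed list of written onsets, longest first so that "ngh" wins over "ng".
ONSETS = sorted(
    ["ngh", "ng", "nh", "ch", "gh", "gi", "kh", "ph", "th", "tr", "qu",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s", "t", "v", "x"],
    key=len,
    reverse=True,
)


def split_syllable(syllable):
    """Split one written syllable into (onset, rime, tone).

    Hypothetical helper: a spelling-based simplification that ignores
    edge cases such as "gi" or "qu" followed by an empty rime.
    """
    # NFD separates base letters from their combining diacritics.
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "ngang"
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]  # record the tone and drop the mark
        else:
            kept.append(ch)        # keep letters and vowel-quality marks
    # NFC recomposes vowel-quality marks (circumflex, breve, horn) with their letters.
    toneless = unicodedata.normalize("NFC", "".join(kept))
    onset = next((o for o in ONSETS if toneless.startswith(o)), "")
    return onset, toneless[len(onset):], tone


if __name__ == "__main__":
    for word in ["tiếng", "việt", "nghĩa", "an"]:
        print(word, split_syllable(word))
```

Under these assumptions, `split_syllable("tiếng")` yields the onset `t`, the rime `iêng`, and the tone label `sac`. A real phoneme-based system would still need a mapping from these written units to phonemes, which the abstract does not specify.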

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition