Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English

Nguyen Huu Nhat Minh; Tran Nguyen Anh; Truong Dinh Dung; Vo Van Nam; and Le Pham Tuyen

arXiv:2508.19270·cs.CL·August 28, 2025

TL;DR

This paper introduces a bilingual speech recognition system that uses the PhoWhisper pre-trained encoder to improve cross-lingual phoneme recognition between Vietnamese and English, addressing the mismatch between Vietnamese tones and English stress patterns.

Contribution

It proposes a bilingual phoneme set that bridges the Vietnamese and English phonetic systems, and an end-to-end recognition system built on the PhoWhisper pre-trained encoder.

Findings

01. Improved recognition accuracy for Vietnamese-English bilingual speech.

02. Robust framework for tonal and stress-based phoneme recognition.

03. Effective bridging of phonetic differences between Vietnamese and English.

Abstract

Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition (ASR) when mixing Vietnamese and English pronunciations. Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. Our extensive experiments demonstrate that the proposed approach not only improves recognition accuracy…
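
To make the idea of a shared bilingual phoneme set more concrete, the sketch below merges two small phoneme inventories into a single symbol table. It is only an illustration under stated assumptions: the inventories, the tone labels, the `tone_` prefix, the `<blank>` entry, and the function names `build_shared_vocab` and `encode` are all invented here and are not taken from the paper, whose actual phoneme set and decoder are not described in this listing.

```python
# Toy inventories invented for illustration; NOT the phoneme set from the paper.
VI_PHONEMES = ["a", "m", "t", "ŋ"]
VI_TONES = ["1", "2", "3"]  # hypothetical tone labels
EN_PHONEMES = ["a", "m", "t", "θ", "ɹ"]


def build_shared_vocab(vi_phonemes, vi_tones, en_phonemes):
    """Merge two inventories into one symbol table.

    Symbols that appear in both languages get a single shared index,
    and Vietnamese tones are added as separate tokens (an assumption).
    """
    # Index 0 is reserved for a blank symbol, as a CTC-style decoder would
    # need; whether the paper uses such a decoder is not stated here.
    symbols = ["<blank>"]
    tone_tokens = [f"tone_{t}" for t in vi_tones]
    for sym in list(vi_phonemes) + list(en_phonemes) + tone_tokens:
        if sym not in symbols:
            symbols.append(sym)
    return {sym: idx for idx, sym in enumerate(symbols)}


def encode(sequence, vocab):
    """Map a sequence of phoneme and tone symbols to integer ids."""
    return [vocab[sym] for sym in sequence]


vocab = build_shared_vocab(VI_PHONEMES, VI_TONES, EN_PHONEMES)
ids = encode(["m", "a", "tone_1"], vocab)
```

In this toy setup the phonemes common to both inventories (`a`, `m`, `t`) share one index each, so a single output layer could cover both languages, while language-specific symbols such as `ŋ` and `θ` keep their own entries.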
