VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining

Jianheng Zhuo; Yifan Yang; Yiwen Shao; Yong Xu; Dong Yu; Kai Yu; Xie Chen

arXiv:2505.21527 · eess.AS · May 30, 2025

TL;DR

VietASR introduces a cost-effective, high-performance Vietnamese ASR system trained on large-scale unlabeled data and minimal labeled data, outperforming existing models and reducing resource requirements.

Contribution

The paper presents a novel multi-iteration self-supervised training pipeline for low-resource Vietnamese ASR using vast unlabeled data and limited labeled data.

Findings

01. Pre-training on 70,000 hours of unlabeled speech, followed by fine-tuning, yields a lightweight but powerful ASR model.

02. Only 50 hours of labeled data are needed for the fine-tuning stage.

03. VietASR outperforms Whisper Large-v3 and commercial ASR systems.

Abstract

Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000-hour unlabeled data and fine-tuning on merely 50-hour labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR…
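The abstract describes the pipeline only at a high level, so the following is a minimal control-flow sketch of one way a multi-iteration "pre-train on unlabeled data, fine-tune on labeled data" loop could be organized, not the authors' implementation. Every name here (`Model`, `make_targets`, `pretrain`, `finetune`, `run_pipeline`) is hypothetical, the training functions are stubs that only record what happened, and the choice to bootstrap the first iteration from a model trained on the labeled set alone is an assumption about what "ASR-biased" means.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Model:
    """Stand-in for an ASR model; it only carries a descriptive name."""
    name: str


def make_targets(labeler: Model, unlabeled: Sequence[str]) -> List[int]:
    # Hypothetical: produce one discrete pseudo-target per unlabeled utterance.
    # A real system would derive targets from the labeler's representations;
    # this stub uses the utterance length purely for illustration.
    return [len(utt) % 4 for utt in unlabeled]


def pretrain(unlabeled: Sequence[str], targets: Sequence[int], iteration: int) -> Model:
    # Stub for self-supervised pre-training on (utterance, pseudo-target) pairs.
    assert len(unlabeled) == len(targets)
    return Model(name=f"pretrained-{iteration}")


def finetune(model: Model, labeled: Sequence[Tuple[str, str]], iteration: int) -> Model:
    # Stub for supervised fine-tuning on the small labeled set.
    return Model(name=f"finetuned-{iteration}")


def run_pipeline(
    unlabeled: Sequence[str],
    labeled: Sequence[Tuple[str, str]],
    num_iterations: int,
) -> Tuple[Model, List[str]]:
    log: List[str] = []
    # Assumption: the first labeler is trained on the labeled data only.
    labeler = finetune(Model(name="init"), labeled, iteration=0)
    log.append(labeler.name)
    for it in range(1, num_iterations + 1):
        targets = make_targets(labeler, unlabeled)   # relabel the unlabeled data
        pretrained = pretrain(unlabeled, targets, it)  # self-supervised stage
        log.append(pretrained.name)
        labeler = finetune(pretrained, labeled, it)    # supervised stage
        log.append(labeler.name)
    return labeler, log


final_model, stages = run_pipeline(
    unlabeled=["utt_a", "utt_bb", "utt_ccc"],
    labeled=[("utt_x", "xin chào")],
    num_iterations=2,
)
```

Here `stages` records the order in which the stubbed stages ran, which makes the alternation between relabeling, pre-training, and fine-tuning explicit. In the setting the abstract reports, `unlabeled` would correspond to the 70,000-hour corpus and `labeled` to the 50-hour set.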

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition

Methods: Sparse Evolutionary Training