CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang, Zeying Xie, and Yi R. (May) Fung

arXiv:2511.04139 · cs.CL · November 7, 2025

Open Access

TL;DR

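CantoASR is a collaborative ASR-LALM error correction framework that combines forced-alignment acoustic feature extraction, tone-aware LoRA fine-tuning of Whisper, and prosody-aware correction with an instruction-tuned Qwen-Audio to reduce the character error rate (CER) of Cantonese speech recognition in low-resource settings.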

Contribution

The paper introduces a collaborative error correction approach that feeds explicit tonal and prosodic acoustic cues to a large audio-language model for low-resource tonal ASR.

Findings

01. Substantial CER reduction over the Whisper-Large-V3 baseline on spontaneous Cantonese data

02. Integrating prosodic cues into the correction step improves recognition accuracy

03. The approach is presented as a scalable strategy for low-resource tonal language ASR

Abstract

Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy…
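The abstract reports its gains in character error rate (CER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: character-level edit distance between hypothesis and reference, divided by the reference length. The function names `edit_distance` and `cer`, the choice to ignore whitespace, and the example sentences are illustrative assumptions and are not taken from the paper, whose exact text normalization is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion all cost 1)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length.

    Assumption: whitespace is dropped before scoring, a common convention
    for Chinese-script text; the paper's own normalization may differ.
    """
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character out of five.
print(cer("我哋去食飯", "我地去食飯"))
```

Because the score is normalized by the reference length, a hypothesis with many insertions can yield a CER above 1.0.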

Taxonomy

Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Speech and Audio Processing