Unlocking Large Audio-Language Models for Interactive Language Learning
Hongfu Liu, Zhouying Cui, Xiangming Gu, Ye Wang

TL;DR
This paper explores audio-language models (ALMs) for interactive, chat-based pronunciation training. It introduces the L2-Arctic-plus dataset and instruction-tunes ALMs on it to improve mispronunciation detection and feedback generation.
Contribution
The paper introduces L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for pronunciation improvement, and demonstrates that instruction-tuning ALMs on this dataset enhances mispronunciation detection and feedback generation.
Findings
Instruction-tuned ALMs outperform baselines in detection accuracy.
The dataset enables more actionable and human-like feedback.
Experimental results show significant improvements in both objective metrics and human evaluations.
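Detection accuracy of the kind mentioned in the findings above is often summarized with precision, recall, and F1 over per-phoneme labels. The sketch below illustrates that general scoring idea only. The binary label format, the function name `detection_scores`, and the sample data are assumptions made for illustration and are not taken from the paper, which may use different metrics or granularity.

```python
from typing import Sequence


def detection_scores(reference: Sequence[int], predicted: Sequence[int]) -> dict:
    """Precision, recall and F1 for the 'mispronounced' class (label 1).

    Hypothetical helper: assumes one binary label per phoneme,
    with 1 = mispronounced and 0 = correctly pronounced.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # errors correctly flagged
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # correct phonemes wrongly flagged
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # errors that were missed

    # Guard against division by zero when a class never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: six phonemes from one utterance.
reference = [0, 1, 0, 1, 1, 0]  # annotator labels
predicted = [0, 1, 1, 1, 0, 0]  # model output
scores = detection_scores(reference, predicted)
print(scores)
```

In this made-up example the model flags two of the three annotated errors and raises one false alarm, so precision and recall are both 2/3. Scoring the feedback text itself would need a separate measure, such as human ratings.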
Abstract
Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting its effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically in detecting mispronunciation and generating actionable feedback. To improve the performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that…
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · AI in Service Interactions
