Unlocking Large Audio-Language Models for Interactive Language Learning

Hongfu Liu; Zhouying Cui; Xiangming Gu; Ye Wang

arXiv:2601.14744 · cs.SD · January 22, 2026

TL;DR

This paper investigates audio-language models (ALMs) for chat-based pronunciation training, introducing a new dataset and an instruction-tuning approach that improve mispronunciation detection and feedback quality.

Contribution

The paper introduces L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement, and demonstrates that instruction-tuning ALMs on this dataset improves mispronunciation detection and feedback generation.

Findings

01. Instruction-tuned ALMs outperform baselines in detection accuracy.

02. The dataset enables more actionable and human-like feedback.

03. Experimental results show significant improvements in both objective metrics and human evaluations.

Abstract

Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting their effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically in detecting mispronunciation and generating actionable feedback. To improve performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that…

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · AI in Service Interactions