A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater's Shadowing and Sequence-to-sequence Voice Conversion

Haopeng Geng; Daisuke Saito; Nobuaki Minematsu

arXiv:2505.24304·eess.AS·June 2, 2025

Open Access

TL;DR

This paper presents a perception-based L2 speech intelligibility indicator that uses native raters' shadowing data within a seq2seq voice conversion framework to better reflect human listener comprehension than traditional methods.

Contribution

It introduces a novel approach combining shadowing data and voice conversion to more accurately assess L2 speech intelligibility aligned with human perception.

Findings

01. Correlates more closely with native listeners' judgments than traditional ASR-based metrics.

02. Identifies the segments of L2 speech that are likely to cause comprehension difficulties.

03. Shows promise for improving CALL systems globally.

Abstract

Evaluating L2 speech intelligibility is crucial for effective computer-assisted language learning (CALL). Conventional ASR-based methods often focus on native-likeness, which may fail to capture the actual intelligibility perceived by human listeners. In contrast, our work introduces a novel, perception-based L2 speech intelligibility indicator that leverages a native rater's shadowing data within a sequence-to-sequence (seq2seq) voice conversion framework. By integrating an alignment mechanism and acoustic feature reconstruction, our approach simulates the auditory perception of native listeners, identifying segments in L2 speech that are likely to cause comprehension difficulties. Both objective and subjective evaluations indicate that our method aligns more closely with native judgments than traditional ASR-based metrics, offering a promising new direction for CALL systems in a…
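The claim that an indicator "aligns more closely with native judgments" is typically checked by correlating per-utterance indicator values with human ratings. The sketch below illustrates that kind of comparison with a Spearman rank correlation. It is an assumption for illustration only: the abstract does not say which statistic the authors use, and the scores, the names `human_scores`, `indicator_a`, and `indicator_b`, and the helper functions are all hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data, not taken from the paper: one value per utterance.
human_scores = [4.5, 3.0, 2.0, 5.0, 1.5]       # native listeners' ratings
indicator_a = [0.90, 0.55, 0.40, 0.95, 0.20]   # a perception-based indicator
indicator_b = [0.80, 0.85, 0.30, 0.70, 0.60]   # an ASR-based baseline

print("indicator_a:", round(spearman(human_scores, indicator_a), 3))
print("indicator_b:", round(spearman(human_scores, indicator_b), 3))
```

Under this setup, the indicator whose ranking of utterances agrees more with the human ranking gets the higher coefficient. A rank correlation is used here because human ratings are ordinal, but that choice is part of the sketch, not a statement about the paper's evaluation protocol.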

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Focus