Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations
Nayan Anand, Meenakshi Sirigiraju, Chiranjeevi Yarra

TL;DR
This paper introduces an unsupervised method for speech intelligibility detection that uses DTW alignment distances between teacher and learner Wav2Vec-2.0 representations, reaching detection accuracies of 90.37% to 96.58% without manual annotations.
Contribution
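It proposes an unsupervised speech intelligibility detection (SID) approach that uses the dynamic time warping (DTW) alignment distance between teacher and learner Wav2Vec-2.0 feature sequences to separate intelligible from non-intelligible speech, eliminating the need for manual labels.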
Findings
Achieved detection accuracies of 90.37%, 92.57%, and 96.58% across the three alignment distance measures.
Used three alignment distance measures: mean absolute error (MAE), mean squared error (MSE), and cosine distance.
Showed that an unsupervised approach, with no manual annotations, can separate intelligible from non-intelligible speech.
Abstract
Speech intelligibility is crucial in language learning for effective communication. Thus, to develop computer-assisted language learning systems, automatic speech intelligibility detection (SID) is necessary. Most prior work has assessed intelligibility in a supervised manner using manual annotations, which cost time and money and therefore limit scalability. To overcome this, this work proposes an unsupervised approach for SID. The proposed approach uses the alignment distance computed with dynamic time warping (DTW) between teacher and learner representation sequences as a measure to separate intelligible from non-intelligible speech. We obtain the feature sequences using current state-of-the-art self-supervised representations from Wav2Vec-2.0. We found detection accuracies of 90.37%, 92.57% and 96.58%, respectively, with three alignment distance measures --…
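To make the idea in the abstract concrete, here is a minimal sketch of an utterance-level DTW alignment distance between a teacher and a learner feature sequence, with MAE, MSE, or cosine distance as the frame-level cost. It is an illustration under assumptions, not the authors' implementation: the function names, the classic three-move step pattern, the normalization by warping-path length, and the random arrays standing in for Wav2Vec-2.0 features are all hypothetical choices made for this example.

```python
import numpy as np


def frame_distances(teacher, learner, metric="cosine"):
    """Pairwise frame-level costs between two sequences.

    teacher: array of shape (T, D); learner: array of shape (L, D).
    Returns a (T, L) cost matrix. Metric names are illustrative.
    """
    if metric == "mae":
        return np.abs(teacher[:, None, :] - learner[None, :, :]).mean(axis=-1)
    if metric == "mse":
        return ((teacher[:, None, :] - learner[None, :, :]) ** 2).mean(axis=-1)
    if metric == "cosine":
        # Guard against zero-norm frames before normalizing.
        t_norm = np.maximum(np.linalg.norm(teacher, axis=1, keepdims=True), 1e-12)
        l_norm = np.maximum(np.linalg.norm(learner, axis=1, keepdims=True), 1e-12)
        return 1.0 - (teacher / t_norm) @ (learner / l_norm).T
    raise ValueError(f"unknown metric: {metric}")


def dtw_alignment_distance(teacher, learner, metric="cosine"):
    """DTW cost between two sequences, averaged over the warping path.

    Assumption: path-length normalization is used here so that long and
    short utterances are comparable; the paper may normalize differently.
    """
    cost = frame_distances(teacher, learner, metric)
    n_t, n_l = cost.shape
    acc = np.full((n_t + 1, n_l + 1), np.inf)    # accumulated cost
    steps = np.zeros((n_t + 1, n_l + 1), dtype=int)  # path length so far
    acc[0, 0] = 0.0
    for i in range(1, n_t + 1):
        for j in range(1, n_l + 1):
            # Allowed moves: diagonal match, skip in teacher, skip in learner.
            best_cost, best_steps = min(
                (acc[i - 1, j - 1], steps[i - 1, j - 1]),
                (acc[i - 1, j], steps[i - 1, j]),
                (acc[i, j - 1], steps[i, j - 1]),
            )
            acc[i, j] = cost[i - 1, j - 1] + best_cost
            steps[i, j] = best_steps + 1
    return acc[n_t, n_l] / steps[n_t, n_l]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for frame-level features (frames x dimensions).
    teacher_feats = rng.normal(size=(50, 16))
    learner_feats = rng.normal(size=(65, 16))
    for name in ("mae", "mse", "cosine"):
        print(name, dtw_alignment_distance(teacher_feats, learner_feats, name))
```

In a detection setting, a distance like this would be compared against a decision threshold (a hypothetical parameter here) so that utterances closer to the teacher are labeled intelligible. Because DTW absorbs differences in speaking rate, a learner utterance that is simply a slower copy of the teacher's should score near zero.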
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
