Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations

Nayan Anand; Meenakshi Sirigiraju; Chiranjeevi Yarra

arXiv:2306.08845 · cs.SD · June 16, 2023 · 1 citation

Open Access

TL;DR

This paper introduces an unsupervised method for speech intelligibility detection that uses alignment distances between teacher and learner Wav2Vec-2.0 representations, reaching detection accuracies of up to 96.58% without manual annotations.

Contribution

It proposes a novel unsupervised speech intelligibility detection (SID) approach that uses dynamic time warping (DTW) alignment distances between teacher and learner Wav2Vec-2.0 feature sequences, eliminating the need for manual labels.
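
The alignment-distance idea can be illustrated with a textbook DTW recursion over frame-level distances between a teacher and a learner feature sequence. This is a minimal sketch, assuming each utterance has already been converted into a (frames × dims) array of Wav2Vec-2.0 features. The helper names, the cosine frame cost, the three-move step pattern and the normalization by path length are illustrative assumptions, not necessarily the authors' exact choices.

```python
import numpy as np


def cosine_cost_matrix(teacher: np.ndarray, learner: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance (1 - cosine similarity) between all frame pairs."""
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    l = learner / np.linalg.norm(learner, axis=1, keepdims=True)
    return 1.0 - t @ l.T


def dtw_alignment_distance(teacher: np.ndarray, learner: np.ndarray) -> float:
    """Accumulated DTW cost along the optimal path, divided by the path length.

    Hypothetical helper: dividing by the path length is an assumption made so
    that utterances of different durations yield comparable distances.
    """
    cost = cosine_cost_matrix(teacher, learner)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)      # accumulated cost table
    steps = np.zeros((n + 1, m + 1), dtype=int)  # path length reaching each cell
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard step pattern: diagonal match, vertical, horizontal move.
            candidates = [
                (acc[i - 1, j - 1], steps[i - 1, j - 1]),
                (acc[i - 1, j], steps[i - 1, j]),
                (acc[i, j - 1], steps[i, j - 1]),
            ]
            best_cost, best_len = min(candidates, key=lambda c: c[0])
            acc[i, j] = cost[i - 1, j - 1] + best_cost
            steps[i, j] = best_len + 1
    return float(acc[n, m] / steps[n, m])
```

Because DTW lets one teacher frame absorb several learner frames, a slower but otherwise faithful rendition stays close to the teacher, while frames that match nothing in the teacher sequence add cost. The double loop is O(n·m) and written for clarity rather than speed.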

Findings

01

Achieved detection accuracies of 90.37%, 92.57%, and 96.58%.

02

Used three alignment distance measures: mean absolute error (MAE), mean squared error (MSE), and cosine distance.

03

Demonstrated that the unsupervised approach is effective for speech intelligibility assessment without manual annotations.
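
The three measures listed under Findings (MAE, MSE and cosine distance) can be read as frame-level costs that fill the matrix over which a DTW path is searched. A minimal sketch, assuming MAE and MSE are averaged over the feature dimensions of one teacher frame and one learner frame, and that cosine distance means one minus cosine similarity. The helper names are hypothetical, and how the paper aggregates these costs along the path is not stated here.

```python
import numpy as np


def mae(x: np.ndarray, y: np.ndarray) -> float:
    """Mean absolute error between two feature frames."""
    return float(np.mean(np.abs(x - y)))


def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between two feature frames."""
    return float(np.mean((x - y) ** 2))


def cosine_distance(x: np.ndarray, y: np.ndarray) -> float:
    """One minus the cosine similarity of two feature frames."""
    return float(1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def cost_matrix(teacher: np.ndarray, learner: np.ndarray, frame_distance) -> np.ndarray:
    """Build the (teacher frames x learner frames) cost matrix for one measure."""
    return np.array([[frame_distance(t, l) for l in learner] for t in teacher])
```

Swapping `frame_distance` is the only change needed to compare the three measures under an otherwise identical alignment. Note that MAE and MSE are sensitive to feature magnitude, whereas cosine distance depends only on the direction of each frame vector.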

Abstract

Speech intelligibility is crucial in language learning for effective communication. Thus, automatic speech intelligibility detection (SID) is necessary for developing computer-assisted language learning systems. Most prior works have assessed intelligibility in a supervised manner using manual annotations, which are costly and time-consuming to obtain; hence, their scalability is limited. To overcome this, this work proposes an unsupervised approach for SID. The proposed approach uses the alignment distance, computed with dynamic time warping (DTW) between the teacher and learner representation sequences, as a measure to separate intelligible from non-intelligible speech. We obtain the feature sequences using current state-of-the-art self-supervised representations from Wav2Vec-2.0. We obtained detection accuracies of 90.37%, 92.57% and 96.58%, respectively, with three alignment distance measures: …
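
The abstract treats the utterance-level alignment distance as a score that separates intelligible from non-intelligible speech, and then reports a detection accuracy. The paper's own decision rule is not given here, so the following is only a sketch under an assumed substitute technique: a label-free 1-D two-means split (k-means with k = 2) on the distances picks a threshold, and reference labels are used solely to score the resulting decisions. The function names and toy values are hypothetical.

```python
import numpy as np


def two_means_threshold(scores: np.ndarray, iters: int = 100) -> float:
    """Split 1-D scores into two clusters without labels (1-D k-means, k=2).

    Returns the midpoint between the two cluster means. This is an assumed
    decision rule for illustration, not the paper's.
    """
    low, high = float(scores.min()), float(scores.max())
    for _ in range(iters):
        threshold = (low + high) / 2.0
        lower = scores[scores <= threshold]
        upper = scores[scores > threshold]
        if len(lower) == 0 or len(upper) == 0:
            break  # all scores identical: nothing to split
        new_low, new_high = float(lower.mean()), float(upper.mean())
        if new_low == low and new_high == high:
            break  # cluster means have converged
        low, high = new_low, new_high
    return (low + high) / 2.0


def detection_accuracy(distances, is_intelligible, threshold: float) -> float:
    """Predict 'intelligible' when the teacher-learner distance is small.

    Reference labels are used here only for evaluation, not for choosing
    the threshold.
    """
    predictions = np.asarray(distances) <= threshold
    return float(np.mean(predictions == np.asarray(is_intelligible)))


# Usage with hypothetical utterance-level distances and reference labels.
distances = np.array([0.10, 0.12, 0.15, 0.60, 0.70, 0.65])
labels = np.array([True, True, True, False, False, False])
threshold = two_means_threshold(distances)
accuracy = detection_accuracy(distances, labels, threshold)
```

Keeping the threshold choice independent of the labels preserves the unsupervised character of the decision step. A label-tuned threshold would instead report an upper bound on what a single distance cut-off can achieve.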

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems