A Study on Incorporating Whisper for Robust Speech Assessment

Ryandhimas E. Zezario; Yu-Wen Chen; Szu-Wei Fu; Yu Tsao; Hsin-Min Wang; Chiou-Shann Fuh

arXiv:2309.12766 · eess.AS · April 30, 2024 · 2 cites

TL;DR

This paper presents MOSA-Net+, an improved speech assessment model that uses Whisper's acoustic features to make predictions of subjective speech quality and intelligibility more accurate and robust, outperforming existing models on TMHINT-QI and in the VoiceMOS Challenge 2023.

Contribution

The study introduces MOSA-Net+, which incorporates Whisper's embedding features and shows notable gains over prior models in noisy and challenging conditions.

Findings

01  Whisper's embedding features improve prediction accuracy.

02  Combining Whisper with SSL models yields only marginal further gains.

03  MOSA-Net+ outperforms existing models on TMHINT-QI and in the VoiceMOS Challenge 2023.

Abstract

This research introduces MOSA-Net+, an enhanced version of the multi-objective speech assessment model, which leverages acoustic features from Whisper, a large-scale weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. We then explore combining representations from Whisper and SSL models. The experimental results reveal that Whisper's embedding features contribute to more accurate prediction performance, whereas combining the embedding features from Whisper and SSL models leads to only marginal improvement. Compared with intrusive methods, MOSA-Net, and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics in Taiwan Mandarin Hearing In Noise test - Quality & Intelligibility…
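
The idea of combining Whisper and SSL representations to predict both a quality and an intelligibility score can be illustrated with a small sketch. This is not the authors' architecture: the random arrays stand in for real Whisper and SSL embeddings, and the frame alignment, mean pooling, linear heads, and all function names below are assumptions made only for illustration.

```python
import numpy as np

def align_frames(feats, target_len):
    """Resample a (frames, dim) sequence to target_len frames by nearest-index selection."""
    idx = np.linspace(0, feats.shape[0] - 1, target_len).round().astype(int)
    return feats[idx]

def fuse_embeddings(whisper_feats, ssl_feats):
    """Bring both sequences to a common length, then concatenate along the feature axis."""
    target_len = min(whisper_feats.shape[0], ssl_feats.shape[0])
    a = align_frames(whisper_feats, target_len)
    b = align_frames(ssl_feats, target_len)
    return np.concatenate([a, b], axis=1)

def predict_scores(fused, w_quality, w_intell):
    """Mean-pool over time, then apply one linear head per objective (hypothetical heads)."""
    pooled = fused.mean(axis=0)
    return float(pooled @ w_quality), float(pooled @ w_intell)

# Stand-in data: real embeddings would come from pretrained encoders.
rng = np.random.default_rng(0)
whisper_feats = rng.standard_normal((150, 8))  # (frames, dim), assumed shape
ssl_feats = rng.standard_normal((100, 6))      # (frames, dim), assumed shape

fused = fuse_embeddings(whisper_feats, ssl_feats)
w_quality = rng.standard_normal(fused.shape[1])  # untrained, random weights
w_intell = rng.standard_normal(fused.shape[1])
quality, intelligibility = predict_scores(fused, w_quality, w_intell)
print(fused.shape, quality, intelligibility)
```

With untrained random weights the two outputs carry no meaning; the sketch only shows how two feature streams with different frame rates and dimensions could feed a shared multi-objective predictor.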

Code & Models

Repositories

dhimasryan/tmhint-qi_voicemos2023 (Official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis