TL;DR
This paper introduces an updated version of NISQA, a deep learning model that combines a CNN with self-attention to predict overall speech quality along with four quality dimensions. The model is trained on extensive crowdsourced datasets and gives reliable predictions on recordings of real telephone calls.
Contribution
The paper presents an updated NISQA model that is trained end-to-end and uses self-attention for time-dependency modelling and time-pooling. It is trained and validated on new datasets with over 13,000 speech files and tested on a live-talking dataset of real telephone calls.
Findings
NISQA predicts overall speech quality and four quality dimensions: Noisiness, Coloration, Discontinuity, and Loudness.
The model generalizes well to unseen speech samples from diverse datasets.
Open-sourced code and datasets facilitate further research.
Abstract
In this paper, we present an update to the NISQA speech quality prediction model that is focused on distortions that occur in communication networks. In contrast to the previous version, the model is trained end-to-end, and the time-dependency modelling and time-pooling are achieved through a Self-Attention mechanism. Besides overall speech quality, the model also predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness, and in this way gives more insight into the cause of a quality degradation. Furthermore, new datasets with over 13,000 speech files were created for training and validation of the model. The model was finally tested on a new, live-talking test dataset that contains recordings of real telephone calls. Overall, NISQA was trained and evaluated on 81 datasets from different sources and was shown to provide reliable predictions also for…
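The time-pooling step mentioned in the abstract can be illustrated with a generic attention-pooling sketch. This is not the NISQA implementation: the function names, the single scoring vector `query`, and the toy frame values are all assumptions made for illustration. It only shows the general idea of scoring each time frame, normalizing the scores with a softmax, and averaging the frames with those weights so that a recording of any length collapses to one fixed-size vector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, query):
    """Collapse a variable-length sequence of frame vectors into one vector.

    frames: list of T feature vectors (each a list of D floats)
    query:  hypothetical learned scoring vector of length D (an assumption,
            not a parameter taken from the paper)
    """
    # One relevance score per frame: dot product with the scoring vector.
    scores = [sum(q * x for q, x in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted average of the frames, one feature dimension at a time.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights

# Toy input: three frames with two features each (illustrative values only).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
query = [1.0, 0.0]
pooled, weights = attention_pool(frames, query)
```

In a multidimensional predictor of the kind described above, a pooled vector like this would typically feed separate output layers for overall quality and for each dimension; how NISQA does this in detail is not stated in the abstract.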
