CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment
Yuchen Liu, Li-Chia Yang, Alex Pawlicki, Marko Stamenovic

TL;DR
This paper introduces CCAT, a non-intrusive speech quality assessment model that combines convolutional and transformer architectures and achieves higher correlation with human ratings than the baseline across multiple datasets.
Contribution
The paper presents a new end-to-end convolutional transformer model for non-intrusive speech quality prediction, outperforming existing models in correlation and error metrics.
Findings
CCAT achieves a higher Pearson correlation (0.697) than the baseline (0.530).
CCAT reduces RMSE from 0.768 to 0.570.
The model performs well across multiple languages and distortion types.
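For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Pearson correlation and RMSE are commonly computed between predicted scores and human MOS labels. It is not the authors' evaluation code: the function names and the sample scores are hypothetical and chosen only for illustration.

```python
import math


def pearson_correlation(predicted, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    # Square roots of the summed squared deviations for each sequence.
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    dev_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    return cov / (dev_p * dev_t)


def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical MOS labels and model predictions on a 1-5 scale.
    human_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
    predicted_mos = [1.5, 2.0, 2.5, 4.5, 4.5]
    print("Pearson:", pearson_correlation(predicted_mos, human_mos))
    print("RMSE:", rmse(predicted_mos, human_mos))
```

A higher Pearson correlation means the predictions track the ranking and spread of the human ratings more closely, while a lower RMSE means the predicted scores are closer to the human scores in absolute terms.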
Abstract
Speech quality assessment has been a critical component in many voice communication related applications such as telephony and online conferencing. Traditional intrusive speech quality assessment requires the clean reference of the degraded utterance to provide an accurate quality measurement. This requirement limits the usability of these methods in real-world scenarios. On the other hand, non-intrusive subjective measurement is the "golden standard" in evaluating speech quality as human listeners can intrinsically evaluate the quality of any degraded speech with ease. In this paper, we propose a novel end-to-end model structure called Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters. We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the…
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Softmax · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings
