JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs
Junyi Fan, Donald Williamson

TL;DR
This paper introduces JSQA, a speech quality assessment framework that pretrains an audio encoder with perceptually guided contrastive learning on just noticeable difference (JND) audio pairs to improve mean opinion score (MOS) prediction, addressing the high inherent variance of perceptual scores.
Contribution
The paper proposes a novel two-stage framework combining contrastive pretraining with fine-tuning for speech quality assessment, incorporating perceptual factors into the learning process.
Findings
Contrastive pretraining improves MOS prediction performance.
Perceptually-guided pretraining outperforms training from scratch.
Incorporating perceptual factors enhances speech quality assessment accuracy.
Abstract
Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. Learning such a mapping is challenging for many reasons, but largely because MOS exhibits high levels of inherent variance due to perceptual and experimental-design differences. Many solutions have been proposed, but many approaches do not properly incorporate perceptual factors into their learning algorithms (beyond the MOS label), which could lead to unsatisfactory results. To this end, we propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction. We first generate pairs of audio data within JND levels, which are then used to pretrain an encoder to…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing
