Stereo InSE-NET: Stereo Audio Quality Predictor Transfer Learned from Mono InSE-NET
Arijit Biswas, Guanxin Jiang

TL;DR
Stereo InSE-NET extends the mono InSE-NET model to predict stereo audio quality by incorporating spatial cues, achieving significant correlation improvements over existing metrics through transfer learning and augmented training.
Contribution
The paper introduces a stereo-aware extension of InSE-NET that transfers selected weights from the pre-trained mono model and is retrained on both real and synthetically augmented listening tests to improve stereo audio quality prediction.
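The idea of transferring only selected weights can be illustrated with a minimal sketch. It is not the authors' implementation: the layer names, shapes, and the rule "copy a layer only when its shape matches" are assumptions made for illustration, and parameters are represented as plain nested lists rather than framework tensors.

```python
import copy

def shape(w):
    """Return the shape of a (rectangular) nested list as a tuple."""
    dims = []
    while isinstance(w, list):
        dims.append(len(w))
        if not w:
            break
        w = w[0]
    return tuple(dims)

def transfer_selected(mono, stereo, selected):
    """Copy the `selected` layers from `mono` into a copy of `stereo`.

    A layer is copied only if it exists in both parameter dicts and the
    shapes agree; every other layer keeps its stereo initialisation.
    Returns (new_stereo_params, names_of_copied_layers).
    """
    out = copy.deepcopy(stereo)
    copied = []
    for name in selected:
        if name in mono and name in out and shape(mono[name]) == shape(out[name]):
            out[name] = copy.deepcopy(mono[name])
            copied.append(name)
    return out, copied

# Hypothetical parameters: the first layer differs in shape because the
# stereo model sees more input channels, so it cannot be reused as-is.
mono_params = {"conv1": [[0.1, 0.2], [0.3, 0.4]], "dense": [[1.0, 2.0]]}
stereo_params = {"conv1": [[0.0] * 4, [0.0] * 4], "dense": [[0.0, 0.0]]}

new_params, copied = transfer_selected(mono_params, stereo_params, ["conv1", "dense"])
print(copied)
```

In this toy case only the shape-compatible layer is transferred; the input layer would be trained from its fresh initialisation during retraining.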
Findings
12% improvement in Pearson correlation over existing metrics
6% improvement in Spearman correlation over existing metrics
Effective transfer learning from mono to stereo models
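For readers unfamiliar with the two figures of merit, the sketch below shows how Pearson and Spearman correlations between subjective scores and model predictions could be computed. The score values are made up for illustration and are not results from the paper; the function names are likewise hypothetical.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(v):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up listening-test scores and model predictions.
subjective = [20, 40, 60, 80, 95]
predicted = [25, 35, 70, 75, 90]

rp = pearson(subjective, predicted)
rs = spearman(subjective, predicted)
print(round(rp, 3), round(rs, 3))
```

Pearson measures how linearly the predictions track the subjective scores, while Spearman only checks that the ordering of items is preserved, which is why the two numbers can differ for the same predictor.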
Abstract
Automatic coded audio quality predictors are typically designed for evaluating single channels without considering any spatial aspects. With InSE-NET [1], we demonstrated mimicking a state-of-the-art coded audio quality metric (ViSQOL-v3 [2]) with deep neural networks (DNN) and subsequently improving it, entirely with programmatically generated data. In this study, we take steps towards building a DNN-based coded stereo audio quality predictor and propose an extension of InSE-NET for handling stereo signals. The design considers stereo/spatial aspects by conditioning the model on left, right, mid, and side channels; we name our model Stereo InSE-NET. By transferring selected weights from the pre-trained mono InSE-NET and retraining with both real and synthetically augmented listening tests, we demonstrate a significant improvement of 12% and 6% of Pearson and Spearman…
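The four conditioning channels named in the abstract can be derived from a stereo pair as sketched below. The abstract does not state the scaling convention, so the common definition M = (L + R) / 2 and S = (L - R) / 2 is assumed here, and the function name `to_lrms` is hypothetical. Any further feature extraction applied before the network is outside the scope of this sketch.

```python
def to_lrms(left, right):
    """Return (L, R, M, S) as per-sample lists.

    Assumed convention: M = (L + R) / 2, S = (L - R) / 2, so that
    L = M + S and R = M - S reconstruct the original channels.
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same number of samples")
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return list(left), list(right), mid, side

# Tiny made-up stereo excerpt.
left = [0.5, -0.25, 1.0]
right = [0.5, 0.25, 0.0]

L, R, M, S = to_lrms(left, right)
print(M, S)
```

The mid channel carries what the two channels have in common, while the side channel carries their difference; a sample that is identical in both channels therefore has a side value of zero.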
Taxonomy
Topics: Speech and Audio Processing · Acoustic Wave Phenomena Research · Music and Audio Processing
