Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task
Jozef Coldenhoff, Milos Cernak

TL;DR
This paper introduces semi-intrusive audio evaluation, which emulates human selective listening by framing assessment as a multi-modal text prediction task with audio-text inputs. It improves MOS Pearson correlation over the MOSRA and PAM baselines.
Contribution
The work extends the multi-modal PENGI model with instruction fine-tuning for mean opinion score (MOS) and signal-to-noise ratio (SNR) estimation, enabling source-focused audio assessment.
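To make the text-prediction framing concrete, here is a minimal sketch of how MOS and SNR labels could be turned into (audio, prompt, target text) examples and how a predicted string could be mapped back to a number. The prompt wording, the `InstructionExample` class, and the helper names are hypothetical illustrations, not the authors' actual templates or code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstructionExample:
    audio_path: str  # path to the audio clip (hypothetical field)
    prompt: str      # text instruction supplied alongside the audio
    target: str      # text the model is trained to predict


def make_mos_example(audio_path: str, mos: float) -> InstructionExample:
    # Assumed prompt wording; the paper's real template is not given here.
    return InstructionExample(audio_path, "rate the quality of the speech", f"{mos:.1f}")


def make_snr_example(audio_path: str, source: str, snr_db: float) -> InstructionExample:
    # The source name in the prompt tells the model which sound to attend to.
    return InstructionExample(
        audio_path, f"estimate the snr of the {source}", f"{snr_db:.0f} db"
    )


def parse_number(text: str) -> Optional[float]:
    # Recover the first numeric value from generated text, or None if absent.
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None
```

Under these assumptions, changing only the source name in the prompt changes what the same audio clip is assessed for, which is the sense in which the text input steers the evaluation.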
Findings
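Achieves absolute Pearson correlation gains in MOS estimation of 0.06 over the re-trained MOSRA model and 0.20 over the pre-trained PAM model.
Proposes a novel SNR estimator that focuses on a specific audio source in a mixture and outperforms both a random baseline and its fixed-prompt counterpart.
Suggests that semi-intrusive assessment can capture human-like selective listening.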
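The MOS results are reported as Pearson correlation between predicted and human-rated scores. As a reference for how that metric is computed, here is a small standard-library sketch; the function name `pearson` and the example scores are illustrative assumptions, not data from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for illustration only.
human_mos = [1.5, 2.0, 3.5, 4.0, 4.5]
predicted_mos = [1.8, 2.4, 3.1, 3.9, 4.6]
r = pearson(human_mos, predicted_mos)
```

An absolute gain means the coefficient itself rises by that amount (for example, from some value r to r + 0.06), not that it improves by 6 percent.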
Abstract
Human perception has the unique ability to focus on specific events in a mixture of signals--a challenging task for existing non-intrusive assessment methods. In this work, we introduce semi-intrusive assessment that emulates human attention by framing audio assessment as a text-prediction task with audio-text inputs. To this end, we extend the multi-modal PENGI model through instruction fine-tuning for MOS and SNR estimation. For MOS, our approach achieves absolute Pearson correlation gains of 0.06 and 0.20 over the re-trained MOSRA model and the pre-trained PAM model, respectively. We further propose a novel SNR estimator that can focus on a specific audio source in a mixture, outperforming a random baseline and the fixed-prompt counterpart. Our findings suggest that semi-intrusive assessment can effectively capture human-like selective listening capabilities. Samples are available at…
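One way to read "SNR of a specific source in a mixture" is to treat the chosen source as the signal and the sum of all other sources as the noise. The sketch below computes that quantity from separately available source signals. This definition, the function names, and the toy signals are assumptions made for illustration; the abstract does not spell out how the paper defines or labels its SNR targets.

```python
import math
from typing import Dict, Sequence


def power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def source_snr_db(sources: Dict[str, Sequence[float]], target: str) -> float:
    """SNR in dB of `target` against the sum of every other source."""
    target_signal = sources[target]
    n = len(target_signal)
    interference = [0.0] * n
    for name, signal in sources.items():
        if name == target:
            continue
        if len(signal) != n:
            raise ValueError("all sources must have the same length")
        for i, sample in enumerate(signal):
            interference[i] += sample
    p_target = power(target_signal)
    p_interference = power(interference)
    if p_target == 0 or p_interference == 0:
        raise ValueError("SNR is undefined when either power is zero")
    return 10.0 * math.log10(p_target / p_interference)


# Toy mixture: the same audio yields a different SNR per source of interest.
mixture = {
    "speech": [1.0, -1.0, 1.0, -1.0],
    "noise": [0.1, -0.1, 0.1, -0.1],
}
speech_snr = source_snr_db(mixture, "speech")  # about 20 dB
noise_snr = source_snr_db(mixture, "noise")    # about -20 dB
```

The point of the toy example is that a single mixture has no single SNR: the answer depends on which source the listener, or the text prompt, designates as the signal.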
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need · Focus
