NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment
Alessandro Ragano, Jan Skoglund, Andrew Hines

TL;DR
NOMAD introduces an unsupervised, differentiable perceptual similarity metric for audio that effectively assesses quality and degradation without human labels, outperforming existing non-matching reference methods.
Contribution
The paper proposes NOMAD, a novel unsupervised deep embedding method for perceptual audio similarity, guided by the Neurogram Similarity Index Measure (NSIM) and applicable to quality assessment and speech enhancement.
Findings
Outperforms other non-matching reference methods in ranking degradation and quality assessment
Achieves competitive results with full-reference audio metrics
Demonstrates effectiveness in speech enhancement and synthesis tasks
Abstract
This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis (e.g. quality assessment) and for generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate NOMAD outperforms other non-matching reference…
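The two mechanisms named in the abstract, a triplet loss during training and a Euclidean distance between embeddings at inference, can be illustrated with a minimal sketch. It uses plain Python lists in place of learned embeddings. The function names, the margin value, and the averaging over several references are assumptions made for illustration and are not taken from the paper. The abstract does not say how NSIM selects the positive and negative samples, so that step is left out.

```python
import math


def embedding_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.dist(a, b)


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    Assumption: this is the standard form of the loss. How the paper uses
    NSIM to decide which sample is the positive and which is the negative
    is not specified in the abstract and is not modelled here.
    """
    d_pos = embedding_distance(anchor, positive)
    d_neg = embedding_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def non_matching_distance(test_embedding, reference_embeddings):
    """Mean distance from a test embedding to non-matching clean references.

    Assumption: averaging over references is an illustrative choice, not
    necessarily the aggregation used by the authors.
    """
    distances = [embedding_distance(test_embedding, r) for r in reference_embeddings]
    return sum(distances) / len(distances)
```

For example, `non_matching_distance(test, refs)` could be called with the embedding of a degraded utterance and the embeddings of a few clean utterances with different content; under the assumptions above, a larger value would indicate a more heavily degraded signal.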
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing
