Discriminating real and synthetic super-resolved audio samples using embedding-based classifiers

Mikhail Silaev; Konstantinos Drossos; Tuomas Virtanen

arXiv:2601.03443·eess.AS·January 8, 2026

Open Access

TL;DR

This paper uses embedding-based classifiers to test how closely the output of current audio super-resolution models matches real wideband audio, and finds a persistent gap between perceptual quality and distributional fidelity.

Contribution

The paper introduces a method for discriminating real from synthetic super-resolved audio with embedding-based classifiers, exposing the limitations of perceptual metrics for evaluating audio authenticity.

Findings

01. Embedding classifiers achieve near-perfect separation of real and synthetic audio.

02. Perceptual quality metrics do not reliably indicate distributional fidelity.

03. The gap persists across datasets, models, and audio types, including recent diffusion approaches.

Abstract

Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band ($4 \to 16$ kHz) and full-band ($16 \to 48$ kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect…
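
The separability test described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: random Gaussian vectors stand in for audio embeddings, the "synthetic" class differs from the "real" class only by an assumed mean shift, and a plain logistic regression plays the role of the linear classifier. All function names, dimensions, and hyperparameters below are hypothetical choices made for illustration.

```python
import math
import random


def make_embeddings(n, dim, shift, rng):
    """Draw n toy 'embedding' vectors; every dimension is Gaussian with mean `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def center(train_X, test_X):
    """Subtract the per-dimension training mean from both splits."""
    dim = len(train_X[0])
    mean = [sum(x[j] for x in train_X) / len(train_X) for j in range(dim)]

    def apply(X):
        return [[xj - mj for xj, mj in zip(x, mean)] for x in X]

    return apply(train_X), apply(test_X)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_classifier(X, y, epochs=100, lr=0.5):
    """Fit logistic regression (a linear classifier) by full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of samples whose predicted label (score > 0 means synthetic) is correct."""
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(t)
        for x, t in zip(X, y)
    )
    return hits / len(X)


def separability(shift, n=200, dim=16, seed=0):
    """Held-out accuracy of a linear probe separating 'real' (0) from 'synthetic' (1)."""
    rng = random.Random(seed)
    real = make_embeddings(2 * n, dim, 0.0, rng)     # stand-in for real wideband audio
    synth = make_embeddings(2 * n, dim, shift, rng)  # stand-in for super-resolved audio
    train_X, test_X = center(real[:n] + synth[:n], real[n:] + synth[n:])
    labels = [0] * n + [1] * n
    w, b = train_linear_classifier(train_X, labels)
    return accuracy(w, b, test_X, labels)


if __name__ == "__main__":
    print("shifted distributions:", separability(1.0))
    print("matched distributions:", separability(0.0))
```

In this toy setting, held-out accuracy near 0.5 means the two sets of embeddings are indistinguishable to a linear classifier, while accuracy approaching 1.0 indicates a distributional gap of the kind the abstract reports. In practice the vectors would come from pretrained audio embedding models rather than a random number generator.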

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing