Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
Wonwoo Jeong

TL;DR
This paper introduces OTAD, an audio distance that replaces both primitives of FAD: the frozen-embedding cost is corrected with a learned residual Riemannian ground metric, and the Gaussian coupling is replaced with entropic Sinkhorn optimal transport. It reports higher correlation with audio-quality MOS than baseline metrics.
Contribution
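It proposes OTAD, an audio distance with one dedicated mechanism per primitive: a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. The resulting discrete transport plan also enables per-sample diagnostics.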
Findings
In coupling-only comparisons across eight encoders, Sinkhorn's sensitivity to rank-1 contamination exceeds FAD's by a factor of 1.9 to 3.6.
OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics.
OTAD's discrete transport plan enables per-sample diagnostics with AUROC ≥ 0.86.
Abstract
In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism -- a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete…
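To make the coupling primitive concrete, the following is a minimal sketch of entropic Sinkhorn optimal transport between two sets of embeddings. It is not the paper's implementation. The squared Euclidean cost stands in for the learned Riemannian ground metric, and the uniform sample weights, the regularization strength `eps`, the iteration count, and the helper names `sinkhorn_plan` and `per_sample_cost` are all assumptions made for illustration. The per-row cost is one hypothetical way a discrete plan could be read sample by sample, not the diagnostic the paper defines.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def sinkhorn_plan(x, y, eps=1.0, n_iters=500):
    """Entropic OT plan between point sets x (n, d) and y (m, d).

    Assumptions (not from the paper): squared Euclidean ground cost,
    uniform weights on both sets, fixed number of log-domain iterations.
    """
    n, m = x.shape[0], y.shape[0]
    # Pairwise squared Euclidean cost matrix, shape (n, m).
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f = np.zeros(n)  # dual potential for x
    g = np.zeros(m)  # dual potential for y
    for _ in range(n_iters):
        # Alternate updates that enforce the row, then column, marginals.
        f = eps * (log_a - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - cost) / eps, axis=0))
    plan = np.exp((f[:, None] + g[None, :] - cost) / eps)
    return plan, cost


def per_sample_cost(plan, cost):
    """Hypothetical per-sample score: mass-weighted mean cost of each row."""
    return np.sum(plan * cost, axis=1) / np.sum(plan, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(64, 4))
    generated = rng.normal(size=(64, 4))
    generated[0] = 10.0  # one contaminated sample, far from the reference set
    plan, cost = sinkhorn_plan(generated, reference)
    print("transport cost:", float(np.sum(plan * cost)))
    print("most suspicious sample:", int(np.argmax(per_sample_cost(plan, cost))))
```

Because the plan is an explicit matrix over sample pairs, a single outlier keeps its own row and its own cost. A Gaussian fit instead absorbs that sample into a mean and covariance shared by the whole set.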
