TL;DR
The Kernel Audio Distance (KAD) is a distribution-free, efficient, and perceptually aligned evaluation metric for audio generation. It addresses the limitations of the Fréchet Audio Distance (FAD) and enables reliable assessment with smaller sample sets and lower computational cost.
Contribution
The paper proposes KAD, a novel evaluation metric based on Maximum Mean Discrepancy (MMD). KAD is unbiased, scalable, and better aligned with human perception, addressing key limitations of FAD.
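For orientation, the textbook unbiased estimator of squared MMD between two sample sets can be written as below. This is a sketch of the standard form from the kernel two-sample testing literature, not necessarily the exact definition used in the paper, which may add a scaling factor or fix a particular kernel. The symbols are assumptions introduced here: $X = \{x_i\}_{i=1}^{m}$ are reference embeddings, $Y = \{y_j\}_{j=1}^{n}$ are generated embeddings, and $k$ is a characteristic kernel.

```latex
\widehat{\mathrm{MMD}}^2_u(X, Y)
  = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j)
  + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j)
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)
```

Excluding the diagonal terms $i = j$ from the two within-set sums is what makes the estimator unbiased. No Gaussian assumption on the embeddings is required.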
Findings
KAD converges faster than FAD at smaller sample sizes, so evaluation remains reliable with limited data.
KAD has a lower computational cost than FAD and supports scalable GPU acceleration.
KAD aligns more closely with human perceptual judgments than FAD does.
Abstract
Although widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD…
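To make the idea of an unbiased, MMD-based distance concrete, here is a minimal NumPy sketch. It is not the kadtk implementation and does not use its API. The function names, the Gaussian (RBF) kernel, the median-distance bandwidth heuristic, and the random vectors standing in for audio embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between every row of a and every row of b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off


def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix; the RBF kernel is characteristic."""
    return np.exp(-pairwise_sq_dists(a, b) / (2.0 * bandwidth ** 2))


def median_bandwidth(x):
    """Median pairwise distance within x (a common heuristic, assumed here)."""
    sq = pairwise_sq_dists(x, x)
    upper = np.triu_indices(len(x), k=1)  # distinct pairs only
    return float(np.sqrt(np.median(sq[upper])))


def mmd2_unbiased(x, y, bandwidth):
    """Unbiased estimate of squared MMD between sample sets x and y."""
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Within-set terms exclude the diagonal (i == j), which removes the bias.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())


# Hypothetical stand-ins for embedding sets (real use would embed audio clips).
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 16))
same_dist = rng.normal(size=(200, 16))
shifted = rng.normal(loc=0.5, size=(200, 16))

bw = median_bandwidth(reference)
score_same = mmd2_unbiased(reference, same_dist, bw)
score_shifted = mmd2_unbiased(reference, shifted, bw)
print(f"same distribution:    {score_same:.4f}")
print(f"shifted distribution: {score_shifted:.4f}")
```

Because the estimator is unbiased, the score for two sets drawn from the same distribution fluctuates around zero and can be slightly negative, while the shifted set yields a clearly larger value. The computation reduces to dense kernel matrices, which is the kind of workload that maps naturally onto GPU matrix operations.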
