An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance
Wonwoo Jeong

TL;DR
This paper investigates how the choice of audio encoder influences Fréchet Audio Distance (FAD) scores. It finds systematic biases that follow from each encoder's training task and argues for evaluation-native encoders aligned with human perception.
Contribution
It analyzes encoder-induced biases in FAD by decomposing evaluation into Recall, Precision, and Alignment (semantic and structural), and it highlights the trade-offs among encoders trained on different tasks.
Findings
AudioMAE leads in precision sensitivity.
Whisper excels in structural detection but ignores signal degradation.
VGGish maximizes semantic detection but penalizes intra-class variation.
Abstract
Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must…
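To make the encoder dependence concrete, the sketch below computes the standard Fréchet distance between Gaussians fitted to two sets of embeddings, which is the quantity FAD reports once an encoder has mapped audio clips to vectors. It is a minimal illustration, not the paper's code: the function name `frechet_distance` is hypothetical, the embeddings are synthetic random vectors standing in for encoder outputs, and the paper's log-scale normalization is not reproduced because the summary does not specify it.

```python
import numpy as np


def frechet_distance(emb_ref, emb_gen):
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input is an (n_clips, dim) array with dim >= 2. In a real FAD
    pipeline these rows would come from an audio encoder; here they are
    placeholders.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    # tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs. Clipping guards against round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Synthetic stand-ins for encoder embeddings (assumption: 8-dim, 500 clips).
rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 8))
same = frechet_distance(ref, ref)           # identical sets
shifted = frechet_distance(ref, ref + 0.5)  # same covariance, shifted mean
```

Because the distance depends only on the mean and covariance of the embeddings, any acoustic property that an encoder discards cannot move the score, which is how the training task shapes what FAD can and cannot detect.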
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
