An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

Wonwoo Jeong

arXiv:2602.23958 · eess.AS · March 2, 2026

TL;DR

This paper investigates how the choice of audio encoder shapes Fréchet Audio Distance (FAD) scores. It shows that each encoder's training task introduces systematic biases and argues for evaluation-native encoders aligned with human perception.

Contribution

The paper analyzes encoder-induced bias in FAD by decomposing evaluation into Recall, Precision, and Alignment (semantic and structural), and it characterizes the trade-offs among six encoders across two datasets.

Findings

01

Reconstruction-based AudioMAE leads in precision sensitivity.

02

Whisper excels in structural detection but ignores signal degradation.

03

VGGish maximizes semantic detection but penalizes intra-class variation.

Abstract

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must…
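
To make the encoder dependence concrete, the following is a minimal sketch of the standard Fréchet distance between Gaussians fitted to two sets of embeddings, which is the quantity FAD reports once an encoder has mapped audio clips to vectors. It is not the paper's implementation: the function name `frechet_distance`, the use of NumPy, and the sample-covariance estimator are assumptions made for illustration, and the paper's log-scale normalization is not reproduced here because the abstract does not specify it.

```python
import numpy as np


def frechet_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (n_clips, dim); rows are per-clip embeddings
    produced by whichever encoder is being evaluated (hypothetical here).
    """
    mu_r = emb_ref.mean(axis=0)
    mu_g = emb_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(emb_ref, rowvar=False))
    cov_g = np.atleast_2d(np.cov(emb_gen, rowvar=False))

    # Tr((cov_r cov_g)^{1/2}) is the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs. Clipping guards against round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))
shifted = reference + np.array([1.0, 2.0, 0.0, 0.0])
score_same = frechet_distance(reference, reference)
score_shifted = frechet_distance(reference, shifted)
```

Because only the means and covariances of the embeddings enter the formula, any acoustic property the encoder discards cannot affect the score, which is the mechanism behind the task-induced bias described in the abstract.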


Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing