A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion
Pu Cao, Yiyang Ma, Feng Zhou, Xuedan Yin, Qing Song, Lu Yang

TL;DR
This paper critically examines how autoencoders are evaluated in latent diffusion models and finds that reconstruction fidelity predicts controllability better than generative metrics such as gFID, which matters when scaling to controllable diffusion tasks.
Contribution
It provides a theoretical and empirical analysis showing the limitations of gFID-focused evaluation and proposes a more reliable multi-dimensional assessment for controllability in diffusion models.
Findings
gFID is weakly predictive of condition preservation
Reconstruction metrics better indicate controllability
Autoencoder evaluation bias affects controllable diffusion performance
Abstract
In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder…
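To make the contrast between instance-level reconstruction fidelity and condition drift concrete, the following is a minimal sketch and not the paper's evaluation protocol. It assumes per-image PSNR as the instance-level measure and a finite-difference edge map as the condition; `edge_map`, `condition_drift`, and `toy_autoencoder` are hypothetical names introduced here, and the "autoencoder" is only a pooling-based stand-in for a lossy AE.

```python
import numpy as np

def psnr_per_image(x, x_hat, max_val=1.0):
    """Instance-level PSNR: one score per image rather than one per dataset."""
    mse = np.mean((x - x_hat) ** 2, axis=(1, 2))
    return 10.0 * np.log10(max_val ** 2 / np.maximum(mse, 1e-12))

def edge_map(x):
    """Hypothetical condition extractor: finite-difference gradient magnitude."""
    gy = np.diff(x, axis=1, prepend=x[:, :1, :])
    gx = np.diff(x, axis=2, prepend=x[:, :, :1])
    return np.sqrt(gx ** 2 + gy ** 2)

def condition_drift(x, x_hat, extract=edge_map):
    """Mean absolute gap between the condition of x and of its reconstruction."""
    return np.mean(np.abs(extract(x) - extract(x_hat)), axis=(1, 2))

def toy_autoencoder(x, factor):
    """Stand-in for a lossy AE (assumption): average-pool, then upsample."""
    n, h, w = x.shape
    pooled = x.reshape(n, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    return pooled.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(0)
images = rng.random((4, 32, 32))  # synthetic grayscale images in [0, 1]

for factor in (1, 2, 4):
    recon = toy_autoencoder(images, factor)
    print(
        f"factor={factor}",
        f"mean PSNR={psnr_per_image(images, recon).mean():.2f}",
        f"mean drift={condition_drift(images, recon).mean():.4f}",
    )
```

In this toy setting, stronger compression lowers per-image PSNR and moves the extracted condition further from the original. Neither quantity depends on how samples from the latent space would score on a distribution-level metric such as gFID.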
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning
