High-dimensional Asymptotics of VAEs: Threshold of Posterior Collapse and Dataset-Size Dependence of Rate-Distortion Curve
Yuma Ichikawa, Koji Hukushima

TL;DR
This paper analyzes the conditions leading to posterior collapse in high-dimensional VAEs, revealing a threshold effect of the hyperparameter beta and the influence of dataset size on the rate-distortion trade-off.
Contribution
It provides a theoretical analysis of posterior collapse thresholds and dataset-size effects in high-dimensional VAEs, offering insights into their generalization behavior.
Findings
Posterior collapse occurs beyond a certain beta threshold, regardless of dataset size.
Relatively large datasets are required to achieve rate-distortion curves with high rates.
The analysis explains observed behaviors in real-world non-linear VAEs.
Abstract
In variational autoencoders (VAEs), the variational posterior often collapses to the prior, known as posterior collapse, which leads to poor representation learning quality. An adjustable hyperparameter beta has been introduced in VAEs to address this issue. This study sharply evaluates the conditions under which the posterior collapse occurs with respect to beta and dataset size by analyzing a minimal VAE in a high-dimensional limit. Additionally, this setting enables the evaluation of the rate-distortion curve of the VAE. Our results show that, unlike typical regularization parameters, VAEs face "inevitable posterior collapse" beyond a certain beta threshold, regardless of dataset size. Moreover, the dataset-size dependence of the derived rate-distortion curve suggests that relatively large datasets are required to achieve a rate-distortion curve with high rates. These findings…
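To make the quantities named in the abstract concrete, here is a minimal sketch of the rate and distortion terms of a beta-weighted VAE objective. It is not the authors' code. It assumes a diagonal Gaussian variational posterior, a standard normal prior, and squared-error distortion; the function names (`gaussian_rate`, `distortion`, `beta_vae_loss`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_rate(mu, log_var):
    """Rate: KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    kl_per_sample = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return float(np.mean(kl_per_sample))

def distortion(x, x_hat):
    """Distortion: squared reconstruction error per sample, averaged over the batch.

    Assumption: squared error stands in for the reconstruction term.
    """
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

def beta_vae_loss(x, x_hat, mu, log_var, beta):
    """Beta-weighted objective: distortion + beta * rate."""
    return distortion(x, x_hat) + beta * gaussian_rate(mu, log_var)

# A collapsed posterior equals the prior (mu = 0, variance = 1), so its rate is zero.
batch, latent_dim = 4, 2
collapsed_rate = gaussian_rate(np.zeros((batch, latent_dim)), np.zeros((batch, latent_dim)))

# A posterior whose mean depends on the input carries information, so its rate is positive.
informative_rate = gaussian_rate(np.ones((batch, latent_dim)), np.zeros((batch, latent_dim)))
```

In this sketch, posterior collapse corresponds to the rate term reaching zero: the latent code then carries no information about the input, and increasing `beta` penalizes any nonzero rate more heavily.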
Peer Reviews
Decision · Submitted to ICLR 2025
- To the best of my knowledge, this is the first paper that studied RD curves in VAEs as a function of dataset size and data dimensions. I think this is a valuable topic of study and will indeed be of interest to the ICLR community.
- The theory in the paper, to the best of my understanding, is sound.
- The paper for the most part reads well.
- There is no study of the network capacity in this work. While I understand that this is theoretical work, the authors do make a claim that the same results hold for more complex networks. However, there are prior works that suggest that RD curves for different network capacities behave differently [1,2]. Could the authors comment on this?
- It is also not clear to me what the message of the paper is. It of course makes sense that when you don't have a lot of data in high dimensions, you want to
- The paper studied an important aspect of VAEs and how the different parameters and choices can affect the performance.
- The empirical findings on the relation between generalisation error and the sample complexity, as well as the beta parameter, are interesting.
The paper discussed a list of different behaviours of VAEs, but they feel like rather loosely connected findings (i.e., the subsections in Section 6). The findings themselves are interesting, but it is not surprising that changing one variable, such as beta or the amount of training data, will lead to various changes in aspects like RD curves and posterior collapse. Therefore, I believe a more coherent story is important to connect the dots and make these findings more insightful.
**Technical strengths**:
- The paper sharply characterizes high-dimensional asymptotics for learning the linear VAE (Eq (5)) under the spiked covariance model (Eq (4)) with the regularized $\beta$-VAE objective (Eq (6)).
- This is used to show interesting observations about the VAE learning process in Sections 6.1 and 6.2. In particular, (1) Figure 2 shows a double-descent phenomenon w.r.t. the sample complexity $\alpha$, with the reconstruction error (Eq (9)) peaking at $\alpha = 1$, and (2) Fig
**Technical Weaknesses**:
- The main weakness is the fact that the theoretical results are not exact, since they have been developed using the replica method, which is a heuristic to get around intractable calculations.
- The authors work in the simple setting of $k = k^\star = 1$. If I understand correctly, this means the true latent space is $1$-dimensional. It would have been nice to see the synthetic experiments with $k^\star$ varying, say in $[1, 2, 4]$. In particular, what would the trend
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Cancer-related molecular mechanisms research
