TL;DR
AlphaVAE introduces a unified end-to-end VAE model for RGBA image reconstruction and generation, leveraging alpha-aware learning to improve transparency handling with significantly less training data.
Contribution
It presents the first comprehensive RGBA benchmark and a novel alpha-aware VAE that extends pretrained RGB VAEs for transparent image synthesis.
Findings
Achieves +4.9 dB PSNR improvement over prior methods
Enables superior transparent image generation
Trains effectively on only 8K images
Abstract
Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA images) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha…
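The idea of adapting an RGB metric to RGBA via alpha blending over canonical backgrounds can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the background set (`BACKGROUNDS`), the averaging across backgrounds, the straight (non-premultiplied) alpha convention, and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical "canonical" backgrounds as RGB in [0, 1]; the set the
# benchmark really uses is not stated in this summary.
BACKGROUNDS = {
    "black": (0.0, 0.0, 0.0),
    "white": (1.0, 1.0, 1.0),
}


def composite(rgba_pixels, background):
    """Alpha-blend straight (non-premultiplied) RGBA pixels over a solid RGB background."""
    blended = []
    for r, g, b, a in rgba_pixels:
        blended.append(
            tuple(a * c + (1.0 - a) * bg for c, bg in zip((r, g, b), background))
        )
    return blended


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equally sized lists of RGB pixels."""
    squared_errors = [
        (x - y) ** 2 for p, q in zip(reference, estimate) for x, y in zip(p, q)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def blended_psnr(reference_rgba, estimate_rgba, backgrounds=BACKGROUNDS):
    """Average the PSNR of the two images after compositing both over each background.

    Averaging is an assumption; a benchmark could also report per-background scores.
    """
    scores = [
        psnr(composite(reference_rgba, bg), composite(estimate_rgba, bg))
        for bg in backgrounds.values()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two-pixel toy images: the estimate differs only in the alpha of pixel 2.
    reference = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 0.5)]
    estimate = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 0.4)]
    print(blended_psnr(reference, estimate))
```

One consequence of compositing before scoring is that colour differences hidden under fully transparent pixels do not affect the result, while errors in the alpha channel itself show up as colour errors against at least one background.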
Peer Reviews
Decision: Submitted to ICLR 2026
1. Writing: The paper is clearly written and easy to understand. The organization is logical, and the figures are well-designed and intuitive, effectively supporting the main arguments.
2. Ablation studies: The ablation studies are comprehensive and well executed. In particular, Table 3 provides detailed analyses across multiple objective functions, offering a thorough understanding of each component’s contribution.
1. Metrics: The proposed RGBA evaluation metric lacks originality. The method essentially applies conventional image quality metrics only to the non-transparent regions. While this may be acceptable for pixel-wise metrics such as PSNR, it is questionable for perceptual or structural metrics like SSIM and LPIPS, which rely on local context and structural consistency.
2. Discriminator: The choice of a patch-based discriminator warrants further justification. Considering the role of transparency
The paper defines an underexplored problem by introducing alpha-aware learning into generative modeling. Its methodologically sound design—combining dual-KL and patch-level fidelity—effectively bridges RGB and alpha representations. The work demonstrates clear empirical improvements on a new benchmark.
1. Dataset Size & Diversity: The ALPHA dataset (8K images) is relatively small, raising concerns about the model’s scalability and generalization to larger or more diverse real-world datasets. In addition, the limited dataset size may lead to potential overfitting, and it remains unclear whether the 8K samples provide sufficient diversity to support robust model training.
2. Generative Task Evaluation: Although the authors claim that the fine-tuned model can generate transparent images, the qual
1. Clear Motivation: The paper identifies an underexplored but important gap in current generative modeling: handling transparency (alpha channel) in VAEs and diffusion pipelines. This is a concrete and practically relevant problem, especially for compositing, editing, and transparent-object generation tasks.
2. Novel Method Design: The proposed AlphaVAE integrates alpha-channel modeling into the standard RGB VAE pipeline in a simple yet principled way. It avoids complex architectural changes wh
I am not very familiar with this area, but the training objective seems overly complex, combining four different losses. I wonder whether all of these losses are useful.
