AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning

Zile Wang; Hao Yu; Jiabo Zhan; Chun Yuan

arXiv:2507.09308·cs.CV·July 15, 2025

TL;DR

AlphaVAE introduces a unified end-to-end VAE model for RGBA image reconstruction and generation, leveraging alpha-aware learning to improve transparency handling with significantly less training data.

Contribution

It presents the first comprehensive RGBA benchmark and a novel alpha-aware VAE that extends pretrained RGB VAEs for transparent image synthesis.

Findings

01

Achieves +4.9 dB PSNR improvement over prior methods

02

Enables superior transparent image generation

03

Trains effectively on only 8K images
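For a sense of scale, the reported PSNR gain can be converted into a mean-squared-error ratio. This is a back-of-the-envelope sketch, assuming both methods are scored with the same peak value MAX and treating the gain as if it applied to a single image pair (a dataset-averaged PSNR only approximately follows this relation):

```latex
\begin{aligned}
\mathrm{PSNR} &= 10 \log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}
\;\Longrightarrow\;
\Delta\mathrm{PSNR} = 10 \log_{10}\frac{\mathrm{MSE}_{\text{prior}}}{\mathrm{MSE}_{\text{new}}} \\
\Delta\mathrm{PSNR} = 4.9\ \mathrm{dB}
&\;\Longrightarrow\;
\frac{\mathrm{MSE}_{\text{prior}}}{\mathrm{MSE}_{\text{new}}} = 10^{0.49} \approx 3.09
\end{aligned}
```

Under those assumptions, a +4.9 dB gain corresponds to roughly a threefold reduction in mean squared error.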

Abstract

Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA images) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Writing: The paper is clearly written and easy to understand. The organization is logical, and the figures are well-designed and intuitive, effectively supporting the main arguments.
2. Ablation studies: The ablation studies are comprehensive and well executed. In particular, Table 3 provides detailed analyses across multiple objective functions, offering a thorough understanding of each component’s contribution.

Weaknesses

1. Metrics: The proposed RGBA evaluation metric lacks originality. The method essentially applies conventional image quality metrics only to the non-transparent regions. While this may be acceptable for pixel-wise metrics such as PSNR, it is questionable for perceptual or structural metrics like SSIM and LPIPS, which rely on local context and structural consistency.
2. Discriminator: The choice of a patch-based discriminator warrants further justification. Considering the role of transparency

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper defines an underexplored problem by introducing alpha-aware learning into generative modeling. Its methodologically sound design—combining dual-KL and patch-level fidelity—effectively bridges RGB and alpha representations. The work demonstrates clear empirical improvements on a new benchmark.

Weaknesses

1. Dataset Size & Diversity: The ALPHA dataset (8K images) is relatively small, raising concerns about the model’s scalability and generalization to larger or more diverse real-world datasets. In addition, the limited dataset size may lead to potential overfitting, and it remains unclear whether the 8K samples provide sufficient diversity to support robust model training.
2. Generative Task Evaluation: Although the authors claim that the fine-tuned model can generate transparent images, the qual

Reviewer 03 · Rating 8 · Confidence 2

Strengths

1. Clear Motivation: The paper identifies an underexplored but important gap in current generative modeling—handling transparency (alpha channel) in VAEs and diffusion pipelines. This is a concrete and practically relevant problem, especially for compositing, editing, and transparent-object generation tasks.
2. Novel Method Design: The proposed AlphaVAE integrates alpha-channel modeling into the standard RGB VAE pipeline in a simple yet principled way. It avoids complex architectural changes wh

Weaknesses

I am not very familiar with this area, but the training objective seems overloaded, with four different losses. I wonder whether all of these losses are useful.
