Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

Sebastian Cajas; Ashaba Judith; Rahul Gorijavolu; Sahil Kapadia; Hillary Clinton Kasimbazi; Leo Kinyera; Emmanuel Paul Kwesiga; Sri Sri Jaithra Varma Manthena; Luis Filipe Nakayama; Ninsiima Doreen; Leo Anthony Celi

arXiv:2604.12152·cs.CV·April 15, 2026

TL;DR

Replacing the generic Stable Diffusion VAE with a domain-specific autoencoder (MedVAE) improves diffusion-based medical image super-resolution by +2.91 to +3.29 dB PSNR while hallucination rates remain stable, which makes autoencoder choice a primary design decision.

Contribution

Demonstrates that replacing generic VAEs with domain-specific autoencoders improves super-resolution performance and provides a practical criterion for autoencoder selection.

Findings

01. Replacing the generic Stable Diffusion VAE with MedVAE yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray.

02. Wavelet decomposition localizes the advantage to the finest spatial frequency bands.

03. Autoencoder quality predicts downstream super-resolution performance with R^2 = 0.67.
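To make the R^2 figure in finding 03 concrete, here is a minimal sketch of how a coefficient of determination could be computed for a straight-line fit between an autoencoder quality score and a downstream super-resolution score. The function name `r_squared`, the choice of a simple least-squares line, and the toy numbers are all assumptions for illustration; this page does not specify the regression the authors used, and the numbers below are not from the paper.

```python
from statistics import mean


def r_squared(x, y):
    """Coefficient of determination for a least-squares line y ~ a + b*x.

    x: hypothetical autoencoder quality scores (e.g. reconstruction PSNR).
    y: hypothetical downstream super-resolution scores.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy


# Toy values only (not results from the paper).
toy_autoencoder_score = [1.0, 2.0, 3.0, 4.0]
toy_sr_score = [1.0, 3.0, 2.0, 4.0]
print(round(r_squared(toy_autoencoder_score, toy_sr_score), 2))
```

An R^2 of 0.67 under this reading would mean that roughly two thirds of the variance in super-resolution quality is explained by a linear function of autoencoder quality.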

Abstract

Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within plus or minus…
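The abstract reports its headline result as a per-image PSNR gain with a paired effect size. As a rough sketch of what those two quantities measure, the snippet below computes PSNR between a reference and a reconstruction, and a paired Cohen's d (mean of per-image differences divided by their standard deviation). The helper names `psnr` and `paired_cohens_d`, the flat-list image representation, the `data_range` of 1.0, and the paired variant of Cohen's d are assumptions; the excerpt here does not state which conventions the authors used.

```python
import math
from statistics import mean, stdev


def psnr(reference, estimate, data_range=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def paired_cohens_d(scores_a, scores_b):
    """Paired effect size: mean(a - b) / sample std of (a - b).

    Assumed variant; other definitions of Cohen's d exist.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples of at least 2 items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / stdev(diffs)
```

In a comparison like the one described, `scores_a` and `scores_b` would hold per-image PSNR values for the two autoencoders evaluated on the same test images, so the effect size reflects within-image differences rather than differences between separate groups.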

Code & Models

Repositories

sebasmos/latent-sr (GitHub)
