Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology
Morgan Himes, Samiksha Krishnamurthy, Andrew Lizarraga, Srinath Saikrishnan, Vikram Seenivasan, Jonathan Soriano, Ying Nian Wu, Tuan Do

TL;DR
This paper introduces a multi-modal masked autoencoder that learns shared representations of galaxy images and spectra, enabling reconstruction and redshift prediction with promising results for large-scale astronomical surveys.
Contribution
It adapts a transformer-based masked autoencoder to jointly embed galaxy images and spectra, demonstrating its effectiveness in reconstruction and redshift estimation tasks.
Findings
Successfully reconstructs galaxy features from heavily masked data.
Performs redshift regression with accuracy comparable or superior to existing models.
Highlights limitations in fine detail reconstruction and spectral line strength recovery.
Abstract
Upcoming surveys will produce billions of galaxy images but comparatively few spectra, motivating models that learn cross-modal representations. We build a dataset of 134,533 galaxy images (HSC-PDR2) and spectra (DESI-DR1) and adapt a Multi-Modal Masked Autoencoder (MMAE) to embed both images and spectra in a shared representation. The MMAE is a transformer-based architecture, which we train by masking 75% of the data and reconstructing missing image and spectral tokens. We use this model to test three applications: spectral and image reconstruction from heavily masked data and redshift regression from images alone. It recovers key physical features, such as galaxy shapes, atomic emission line peaks, and broad continuum slopes, though it struggles with fine image details and line strengths. For redshift regression, the MMAE performs comparably or better than prior multi-modal models in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
