Semantic search for 100M+ galaxy images using AI-generated captions

Nolan Koblischke; Liam Parker; Francois Lanusse; Irina Espejo Morales; Jo Bovy; Shirley Ho

arXiv:2512.11982 · astro-ph.IM · December 16, 2025

Open Access · 1 Model · 5 Datasets

TL;DR

This paper presents AION-Search, a scalable semantic search engine for over 140 million galaxy images using AI-generated captions and contrastive alignment, enabling discovery of rare phenomena without manual labeling.

Contribution

The paper introduces a novel pipeline that combines vision-language models and contrastive learning to enable large-scale, zero-shot semantic search over unlabeled scientific image datasets.

Findings

1. Outperforms direct image similarity search in identifying rare phenomena.
2. Achieves state-of-the-art zero-shot performance on galaxy image search.
3. Nearly doubles recall for challenging targets with VLM-based re-ranking.
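
The search-then-re-rank idea in the findings can be illustrated with a small sketch. It assumes embeddings are plain lists of floats compared by cosine similarity, and that a re-ranker is any callable returning a relevance score for an image ID. The names `search`, `rerank`, and `toy_vlm_score` are hypothetical and are not taken from the paper or its released model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, index, top_k=3):
    """Return the top_k image IDs whose embeddings best match the query."""
    scored = [(cosine(query_emb, emb), image_id) for image_id, emb in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:top_k]]

def rerank(candidates, score_fn):
    """Re-order a short candidate list with a slower, more careful scorer."""
    return sorted(candidates, key=score_fn, reverse=True)

# Toy index: image ID -> embedding (illustrative values only).
index = {"galaxy_a": [1.0, 0.0], "galaxy_b": [0.9, 0.1], "galaxy_c": [0.0, 1.0]}
query = [1.0, 0.0]  # stands in for an embedded text query

candidates = search(query, index, top_k=2)

# Hypothetical stand-in for a VLM that inspects each candidate image.
toy_vlm_score = {"galaxy_a": 0.2, "galaxy_b": 0.9}.get
reranked = rerank(candidates, toy_vlm_score)
```

The design point is that the expensive scorer only sees the short candidate list, not the full index.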

Abstract

Finding scientifically interesting phenomena through slow, manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained multimodal astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore,…
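
The contrastive alignment step mentioned in the abstract can be sketched with a symmetric, CLIP-style InfoNCE loss. This is an assumption for illustration: the abstract does not state the exact loss, temperature, or batch construction, and the function names below are hypothetical. Row `k` of the image embeddings is treated as the positive match for row `k` of the caption embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    The temperature value is an assumed default, not one from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity of image i and caption j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Matched pairs should score a lower loss than mismatched pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each image embedding toward the embedding of its own caption and away from the other captions in the batch, which is what makes text queries usable against image embeddings afterward.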

Code & Models

Models

🤗 astronolan/aion-search · model · 2 dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications