CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging
Raza Imam, Mohammed Talha Alam, Umaima Rahman, Mohsen Guizani, Fakhri Karray

TL;DR
CosmoCLIP is a specialized astronomical image-text contrastive learning framework that fine-tunes a pre-trained CLIP model using SpaceNet and BLIP captions, achieving superior zero-shot performance in astronomical tasks.
Contribution
We introduce CosmoCLIP, a novel astronomical contrastive learning framework that leverages SpaceNet and BLIP captions to enhance generalization of vision-language models in astronomy.
Findings
Outperforms CLIP in zero-shot classification
Achieves superior image-text retrieval accuracy
Demonstrates strong generalization across tasks
Abstract
Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by pulling paired image and caption embeddings together while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller than the general image-label datasets available from the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework fine-tuned from the pre-trained CLIP model using SpaceNet and BLIP-based captions. SpaceNet, obtained via FLARE, comprises ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from these SpaceNet images and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a…
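The contrastive objective described in the abstract (matching paired image and caption embeddings while pushing unrelated pairs apart) can be illustrated with a minimal NumPy sketch of a CLIP-style symmetric loss. This is not the authors' training code: the function names, the toy embeddings, and the temperature value of 0.07 are assumptions chosen for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: row i of image_emb is assumed to be paired with
    # row i of text_emb; every other row in the batch acts as a negative.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick its own caption among all captions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick its own image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of 4 pairs (assumed data): perfectly aligned vs. mismatched captions.
image_emb = np.eye(4)
aligned_loss = symmetric_contrastive_loss(image_emb, np.eye(4))
shuffled_loss = symmetric_contrastive_loss(image_emb, np.roll(np.eye(4), 1, axis=0))
```

In this toy batch the aligned pairs yield a much lower loss than the mismatched ones, which is the behaviour that fine-tuning on image-caption pairs tries to reinforce.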
Taxonomy
Topics: Astronomical Observations and Instrumentation
Methods: Contrastive Language-Image Pre-training · BLIP: Bootstrapping Language-Image Pre-training · Contrastive Learning
