Self-supervised similarity search for large scientific datasets
George Stein, Peter Harrington, Jacqueline Blaum, Tomislav Medan, Zarija Lukic

TL;DR
This paper introduces a self-supervised learning approach to analyze large unlabeled scientific datasets, specifically galaxy images, enabling efficient similarity search and discovery of rare objects, with broad applicability across scientific domains.
Contribution
The paper presents a novel self-supervised model for extracting robust low-dimensional representations from large unlabeled datasets, facilitating similarity search and data exploration.
Findings
Applied to 42 million galaxy images from the DESI Legacy Imaging Surveys
Enabled rapid discovery of rare objects from single examples
Improved data curation and supervised training set construction
Abstract
We present the use of self-supervised learning to explore and exploit large unlabeled datasets. Focusing on 42 million galaxy images from the latest data release of the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys, we first train a self-supervised model to distill low-dimensional representations that are robust to symmetries, uncertainties, and noise in each image. We then use the representations to construct and publicly release an interactive semantic similarity search tool. We demonstrate how our tool can be used to rapidly discover rare objects given only a single example, increase the speed of crowd-sourcing campaigns, and construct and improve training sets for supervised applications. While we focus on images from sky surveys, the technique is straightforward to apply to any scientific dataset of any dimensionality. The similarity search web app can be found…
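The search step described in the abstract can be illustrated with a minimal sketch: given one query representation, rank every stored representation by similarity and return the closest matches. This is not the authors' implementation. The cosine-similarity metric, the function names `normalize` and `most_similar`, and the toy three-dimensional vectors are all assumptions made for illustration, and a real index over 42 million images would use a vectorized or approximate nearest-neighbor library rather than a Python loop.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def most_similar(query, representations, k=3):
    """Return the k (index, score) pairs most similar to `query`.

    Assumption: cosine similarity is the metric; the source does not name one.
    """
    q = normalize(query)
    scored = []
    for idx, rep in enumerate(representations):
        r = normalize(rep)
        score = sum(a * b for a, b in zip(q, r))
        scored.append((score, idx))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Hypothetical low-dimensional representations standing in for model outputs.
representations = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Query with a single example (item 0) and retrieve its two nearest items.
results = most_similar(representations[0], representations, k=2)
print([idx for idx, _ in results])
```

The query item itself ranks first because its similarity to itself is 1, so in practice the first hit is usually dropped when searching for new objects that resemble a known example.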
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Image Retrieval and Classification Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
