ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery
Youran Sun, Yixin Wen, Haizhao Yang

TL;DR
ReSearch is a multi-stage, reasoning-enhanced search framework that improves Earth Science data discovery by interpreting the user's scientific intent, retrieving candidate datasets with high recall, and ranking them with context-aware large language model reranking.
Contribution
The paper introduces a multi-stage search architecture that combines lexical search, semantic embeddings, and LLM-based reranking, along with a new benchmark for evaluating intent-aware data discovery.
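The retrieve-then-rerank idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abbreviation table, the toy catalog, the scoring functions, and every function name below are assumptions made for illustration. In particular, term-count cosine similarity stands in for learned semantic embeddings, and the `scorer` argument stands in for an LLM reranker.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation table; the paper's actual expansion source is not described here.
ABBREVIATIONS = {
    "sst": "sea surface temperature",
    "aod": "aerosol optical depth",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_abbreviations(query):
    """Append the long form of every known abbreviation found in the query."""
    extra = [ABBREVIATIONS[t] for t in tokenize(query) if t in ABBREVIATIONS]
    return " ".join([query] + extra)


def lexical_score(query_tokens, doc_tokens):
    """Fraction of distinct query tokens that occur in the document."""
    q = set(query_tokens)
    return len(q & set(doc_tokens)) / len(q) if q else 0.0


def cosine_score(query_tokens, doc_tokens):
    """Cosine similarity of term counts; a stand-in for embedding similarity."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, catalog, k=3):
    """Recall stage: keep the top-k datasets that match by either signal."""
    q_tokens = tokenize(expand_abbreviations(query))
    scored = []
    for doc_id, text in catalog.items():
        d_tokens = tokenize(text)
        score = max(lexical_score(q_tokens, d_tokens), cosine_score(q_tokens, d_tokens))
        if score > 0:
            scored.append((score, doc_id))
    # Sort by descending score, breaking ties by dataset id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def rerank(query, candidates, catalog, scorer):
    """Precision stage: reorder candidates with a costlier scorer (e.g. an LLM call)."""
    return sorted(candidates, key=lambda doc_id: -scorer(query, catalog[doc_id]))


# Toy metadata catalog, invented for this example.
catalog = {
    "ds1": "Daily sea surface temperature from satellite infrared observations",
    "ds2": "Monthly aerosol optical depth reanalysis product",
    "ds3": "Global land surface temperature from numerical simulation",
    "ds4": "Ocean chlorophyll concentration climatology",
}

candidates = retrieve("SST satellite", catalog)
# Placeholder scorer: prefers metadata that mentions satellite data.
ranked = rerank("SST satellite", candidates, catalog,
                scorer=lambda q, text: 1.0 if "satellite" in text.lower() else 0.0)
print(candidates, ranked)
```

Keeping `retrieve` and `rerank` as separate functions mirrors the separation of recall and precision objectives: the first stage can be tuned to over-include cheaply, while the second spends more computation on a short candidate list.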
Findings
ReSearch outperforms baseline methods in recall and ranking.
It effectively handles abstract scientific queries.
The framework enhances reproducibility and scalability in Earth Science research.
Abstract
The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce ReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a…
Taxonomy
Topics: Scientific Computing and Data Management · Data Visualization and Analytics · Topic Modeling
