Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts
Zhiyin Tan, Changxu Duan

TL;DR
This paper presents a novel literature-driven framework that leverages citation contexts and large language models to improve dataset discovery in scientific literature, surpassing existing search engines in recall and uncovering additional valuable datasets.
Contribution
It introduces a new citation-context mining approach combined with schema-guided dataset recognition and entity resolution, enabling more effective dataset discovery from scientific texts.
Findings
Achieves higher recall than Google Dataset Search and DataCite Commons
Surfaces additional datasets not documented in surveys
Expert assessments indicate high utility and novelty of discovered datasets
Abstract
Identifying suitable datasets for a research question remains challenging because existing dataset search engines rely heavily on metadata quality and keyword overlap, which often fail to capture the semantic intent of scientific investigation. We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers, enabling retrieval grounded in actual research use rather than metadata availability. Our approach combines large-scale citation-context extraction, schema-guided dataset recognition with Large Language Models, and provenance-preserving entity resolution. We evaluate the system on eight survey-derived computer science queries and find that it achieves substantially higher recall than Google Dataset Search and DataCite Commons, with an average normalized recall of 47.47% and a maximum of 81.82%. Beyond recovering…
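The abstract does not say how the provenance-preserving entity resolution step is implemented. As a minimal illustrative sketch only, assuming simple string normalization stands in for the paper's actual matching logic, the idea of merging dataset mentions while keeping track of which papers cited them might look like this. The function names, the normalization rule, and the example mentions are all hypothetical.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Hypothetical key: lowercase and drop non-alphanumerics,
    so 'MS-COCO' and 'MS COCO' map to the same key."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def resolve(mentions):
    """Group (surface_name, citing_paper_id) pairs by normalized key.

    Every surface form and every citing paper is retained, so each
    merged dataset entry can be traced back to its citation contexts.
    """
    clusters = defaultdict(lambda: {"names": set(), "sources": set()})
    for name, paper_id in mentions:
        key = normalize(name)
        clusters[key]["names"].add(name)
        clusters[key]["sources"].add(paper_id)
    return dict(clusters)


# Made-up mentions, as they might come out of a recognition step.
mentions = [
    ("MS-COCO", "paper_A"),
    ("MS COCO", "paper_B"),
    ("ImageNet", "paper_A"),
]
resolved = resolve(mentions)
```

Here the two spellings of the same dataset collapse into one entry whose `sources` set lists both citing papers. A real system would likely need fuzzier matching (aliases, versions, abbreviations) than this exact-key comparison.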
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Advanced Graph Neural Networks
