TL;DR
This paper introduces UrbanDataMiner, a large-scale platform for discovering urban datasets extracted from scientific literature, enabled by Paper2Data, a novel LLM-driven pipeline with high recall and precision.
Contribution
The paper presents Paper2Data, a new large-scale LLM-based pipeline for automatically extracting and structuring urban datasets from scientific papers, supporting a comprehensive urban data portal.
Findings
Paper2Data achieves approximately 90% recall in dataset identification.
UrbanDataMiner retrieves over 9% of datasets not easily found via general search engines.
The infrastructure supports systematic, reusable urban data discovery across disciplines.
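To make the recall figure concrete, here is a minimal sketch of how recall and precision for dataset identification could be computed against human annotations. The matching rule (exact match after lowercasing and whitespace normalization), the function name `recall_precision`, and the placeholder dataset names are all assumptions for illustration; the paper's actual evaluation protocol is not described in this summary.

```python
def recall_precision(predicted, gold):
    """Set-based recall and precision over normalized dataset names.

    Assumption: a prediction counts as correct when its normalized name
    exactly matches a human-annotated name. The real pipeline may use a
    looser or stricter matching rule.
    """
    def norm(name):
        # Lowercase and collapse runs of whitespace.
        return " ".join(name.lower().split())

    pred = {norm(p) for p in predicted}
    ref = {norm(g) for g in gold}
    hits = pred & ref
    recall = len(hits) / len(ref) if ref else 0.0
    precision = len(hits) / len(pred) if pred else 0.0
    return recall, precision


# Hypothetical annotations for one paper: ten datasets marked by a human.
gold = [f"Dataset {i}" for i in range(10)]
# Hypothetical pipeline output: nine of them found, plus one spurious mention.
predicted = [f"dataset  {i}" for i in range(9)] + ["Unrelated Table"]

recall, precision = recall_precision(predicted, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case nine of ten annotated datasets are recovered, which corresponds to the roughly 90% recall level reported above; the precision value here is purely illustrative.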
Abstract
Urban data support a wide range of applications across multiple disciplines. However, at the global scale, there is no unified platform for urban data discovery. As a result, researchers often have to manually search through websites or scientific literature to identify relevant datasets. To address this problem, we curate an open urban data discovery portal, UrbanDataMiner, which supports dataset-level search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature-affiliated publications. UrbanDataMiner is enabled by Paper2Data, a novel large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. Human-annotated evaluation demonstrates that Paper2Data achieves high recall (approximately 90%) in dataset…
