Paper2Data: Large-Scale LLM Extraction and Metadata Structuring of Global Urban Data from Scientific Literature

Runwen You; Tong Xia; Jingzhi Wang; Jiankun Zhang; Tengyao Tu; Jinghua Piao; Yi Chang; Yong Li

arXiv:2604.16317·cs.IR·April 21, 2026

TL;DR

This paper introduces UrbanDataMiner, a large-scale platform for discovering urban datasets extracted from scientific literature, enabled by Paper2Data, a novel LLM-driven pipeline with high recall and precision.

Contribution

The paper presents Paper2Data, a new large-scale LLM-based pipeline for automatically extracting and structuring urban datasets from scientific papers, supporting a comprehensive urban data portal.

Findings

01

Paper2Data achieves approximately 90% recall in dataset identification.

02

UrbanDataMiner retrieves over 9% of datasets not easily found via general search engines.

03

The infrastructure supports systematic, reusable urban data discovery across disciplines.
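
The recall figure in finding 01 comes from a human-annotated evaluation. As a minimal sketch of how such a metric is typically computed, the snippet below compares the dataset mentions a pipeline extracts from one paper against the mentions a human annotator marked. The function name `recall_precision` and the toy labels are assumptions made for illustration; they are not taken from the Paper2Data code or its evaluation data.

```python
def recall_precision(predicted, annotated):
    """Return (recall, precision) for extracted vs. human-annotated mentions.

    Hypothetical helper for illustration only; not part of Paper2Data.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    # Recall: share of annotated datasets that the pipeline found.
    recall = true_positives / len(annotated) if annotated else 0.0
    # Precision: share of extracted datasets that the annotator confirmed.
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


# Toy, made-up labels for a single paper (not real evaluation data).
annotated = {"OpenStreetMap", "Landsat 8", "WorldPop", "GHSL"}
predicted = {"OpenStreetMap", "Landsat 8", "WorldPop", "Sentinel-2"}

recall, precision = recall_precision(predicted, annotated)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Exact string matching is the simplest possible comparison; a real evaluation would likely need to normalize dataset names or match them by judgment, since the same dataset can be cited under several names.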

Abstract

Urban data support a wide range of applications across multiple disciplines. However, at the global scale, there is no unified platform for urban data discovery. As a result, researchers often have to manually search through websites or scientific literature to identify relevant datasets. To address this problem, we curate an open urban data discovery portal, UrbanDataMiner, which supports dataset-level search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature-affiliated publications. UrbanDataMiner is enabled by Paper2Data, a novel large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. Human-annotated evaluation demonstrates that Paper2Data achieves high recall (approximately 90%) in dataset…

Code & Models

Repositories

Yourunwen/Paper2Data (GitHub)
