The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment
Jonas Wilinski

TL;DR
This paper introduces the Science Data Lake, an open, unified infrastructure integrating 293 million scholarly papers from eight sources, using embedding-based ontology alignment to enhance cross-source data linkage and analysis.
Contribution
The paper presents a scalable, open-source infrastructure that unifies multiple scholarly data sources and employs embedding-based ontology alignment for improved interoperability.
Findings
The ontology alignment yields 16,150 mappings covering 99.8% of the 4,516 OpenAlex topics, with high mapping accuracy.
Validated the resource through multiple automated and manual quality checks.
Demonstrated novel cross-source analyses enabled by the unified data lake.
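To make the coverage finding concrete, here is a minimal sketch of threshold-based embedding alignment and of how a coverage figure can be computed from it. It is not the paper's pipeline: the three-dimensional vectors, the topic and term labels, and the 0.8 threshold are all illustrative assumptions, and real vectors would come from a sentence-embedding model such as BGE-large.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(topic_vecs, term_vecs, threshold):
    """Return (topic, term, score) for every pair scoring at or above threshold."""
    mappings = []
    for topic, topic_vec in topic_vecs.items():
        for term, term_vec in term_vecs.items():
            score = cosine(topic_vec, term_vec)
            if score >= threshold:
                mappings.append((topic, term, score))
    return mappings


def coverage(topic_vecs, mappings):
    """Fraction of topics that received at least one mapping."""
    if not topic_vecs:
        return 0.0
    mapped = {topic for topic, _, _ in mappings}
    return len(mapped) / len(topic_vecs)


# Toy vectors standing in for real embeddings (assumption, not paper data).
TOPICS = {
    "gene regulation": [1.0, 0.0, 0.0],
    "graph theory": [0.0, 1.0, 0.0],
    "folk music": [0.0, 0.0, 1.0],
}
TERMS = {
    "regulation of gene expression": [0.9, 0.1, 0.0],
    "graph": [0.1, 0.9, 0.0],
}
THRESHOLD = 0.8  # hypothetical cut-off, chosen only for this example

MAPPINGS = align(TOPICS, TERMS, THRESHOLD)
COVERAGE = coverage(TOPICS, MAPPINGS)
```

In this toy setup two of the three topics find a term above the cut-off, so one topic stays unmapped. Raising the threshold trades coverage for precision, which is why a coverage figure is only meaningful together with the threshold it was computed at.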
Abstract
Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources (Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref) via DOI normalization while preserving source-level schemas. The resource comprises approximately 960GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics ( threshold) with at the recommended …
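The abstract's idea of linking sources through DOI normalization while keeping each source's own schema can be sketched as follows. The normalization rules here (strip resolver prefixes, percent-decode, lowercase) are common conventions assumed for illustration, not the rules documented by the paper, and the function names, field names, and sample records are hypothetical.

```python
import re
from urllib.parse import unquote

# Resolver URL or "doi:" label that often precedes the bare identifier.
_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if the value does not look like one."""
    if not raw:
        return None
    doi = unquote(raw.strip())
    doi = _PREFIX.sub("", doi).strip().lower()
    return doi if doi.startswith("10.") and "/" in doi else None


def link_by_doi(sources):
    """Group records from each source under their normalized DOI.

    Each record is stored unchanged under its source name, so source-level
    fields are preserved rather than merged into one schema.
    """
    linked = {}
    for source_name, records in sources.items():
        for record in records:
            doi = normalize_doi(record.get("doi"))
            if doi is None:
                continue  # records without a usable DOI stay unlinked
            linked.setdefault(doi, {})[source_name] = record
    return linked


# Hypothetical records with differently formatted DOIs for the same paper.
SOURCES = {
    "source_a": [{"doi": "https://doi.org/10.1000/ABC.123", "cited_by": 12}],
    "source_b": [{"doi": "DOI: 10.1000/abc.123", "is_retracted": False}],
}
LINKED = link_by_doi(SOURCES)
```

Keeping each record nested under its source name mirrors the design choice stated in the abstract: the join key is shared, but no source's fields are rewritten to fit another's schema.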
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Research Data Management Practices · Scientific Computing and Data Management
