Making the complete OpenAIRE citation graph easily accessible through compact data representation
Joakim Skarding, Pavel Sanda

TL;DR
This paper presents a compact, accessible version of the large OpenAIRE citation graph, enabling easier processing and analysis on standard computers while preserving the full network structure.
Contribution
The authors provide a downscaled, simplified dataset and tools to process future releases, improving accessibility of the OpenAIRE citation graph.
Findings
Processed dataset fits in 16 GB RAM
Full graph structure preserved in the compact format
Python pipeline for dataset processing provided
Abstract
The OpenAIRE graph contains a large citation graph dataset, with over 200 million publications and over 2 billion citations. The current graph is available as a dump with metadata which, when uncompressed, totals 2.5 TB. This makes it hard to process on conventional computers. To make this network more accessible to the community, we provide a processed OpenAIRE graph that is downscaled to fit in 16 GB of RAM while preserving the full graph structure. In addition, we offer the processed data in a very simple format, which allows for straightforward further manipulation. We also provide (1) a Python pipeline, which can be used to process future releases of the OpenAIRE graph, and (2) a larger version of the dataset that includes more publication fields, such as the title and the list of authors.
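To give a feel for why a graph of this size can fit in 16 GB of RAM, here is a minimal sketch of one common compact layout: publications are mapped to dense integer ids and citations are stored in compressed sparse row (CSR) form with NumPy. This is an illustration under stated assumptions, not the authors' actual format or pipeline; the function names (`build_csr`, `cited_by_node`, `estimated_bytes`) and the choice of 32-bit ids and 64-bit offsets are hypothetical.

```python
import numpy as np


def build_csr(edges, num_nodes):
    """Build a CSR (offsets, targets) pair from (citing, cited) id pairs.

    Assumption: publications were already mapped to integer ids in
    [0, num_nodes). 32-bit ids suffice while num_nodes < 2**32.
    """
    edges = np.asarray(edges, dtype=np.uint32).reshape(-1, 2)
    # Sort by citing id so each node's outgoing citations are contiguous.
    order = np.argsort(edges[:, 0], kind="stable")
    sources = edges[order, 0].astype(np.int64)
    targets = edges[order, 1]
    # offsets[i]:offsets[i + 1] delimits the citations made by node i.
    counts = np.bincount(sources, minlength=num_nodes)
    offsets = np.zeros(num_nodes + 1, dtype=np.int64)
    np.cumsum(counts, out=offsets[1:])
    return offsets, targets


def cited_by_node(offsets, targets, node):
    """Return the ids of the publications that `node` cites."""
    return targets[offsets[node]:offsets[node + 1]]


def estimated_bytes(num_nodes, num_edges):
    """Memory for the CSR arrays: 8-byte offsets plus 4-byte targets."""
    return (num_nodes + 1) * 8 + num_edges * 4
```

With the round figures from the abstract, `estimated_bytes(200_000_000, 2_000_000_000)` evaluates to about 9.6 GB. Since the abstract states "over" 200 million publications and "over" 2 billion citations, the real footprint of such a layout would be somewhat higher, but the estimate suggests that a structure-only representation of this kind is plausible within a 16 GB budget.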
