Making the complete OpenAIRE citation graph easily accessible through compact data representation

Joakim Skarding; Pavel Sanda

arXiv:2602.12206·cs.SI·May 19, 2026

TL;DR

This paper presents a compact, accessible version of the large OpenAIRE citation graph, enabling easier processing and analysis on standard computers while preserving the full network structure.

Contribution

The authors provide a downscaled, simplified dataset and tools to process future releases, improving accessibility of the OpenAIRE citation graph.

Findings

1. Processed dataset fits in 16 GB of RAM
2. Full graph structure preserved in the compact format
3. Python pipeline for dataset processing provided

Abstract

The OpenAIRE graph contains a large citation graph dataset, with over 200 million publications and over 2 billion citations. The current graph is available as a dump with metadata which, when uncompressed, totals $\sim$2.5 TB. This makes it hard to process on conventional computers. To make this network more accessible to the community, we provide a processed OpenAIRE graph that is downscaled to fit in 16 GB of RAM while preserving the full graph structure. In addition, we offer the processed data in a very simple format that allows straightforward further manipulation. We also provide (1) a Python pipeline, which can be used to process future releases of the OpenAIRE graph, and (2) a larger version of the dataset that includes more publication fields, such as the title and the list of authors.
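To illustrate why a compact representation helps at this scale, the following is a minimal sketch of one common approach: mapping string identifiers to dense integers and storing the citations in compressed sparse row (CSR) form. This is not the authors' format or pipeline, which the abstract does not specify; the function names `build_csr`, `references_of`, and `estimate_bytes`, the sample identifiers, and the chosen integer widths are all assumptions made for illustration.

```python
from array import array


def build_csr(edges):
    """Build a CSR-style structure from (citing_id, cited_id) string pairs.

    Returns (ids, offsets, targets), where ids[i] is the original identifier
    of node i and targets[offsets[i]:offsets[i + 1]] lists the nodes that
    node i cites. Hypothetical helper, not part of the authors' pipeline.
    """
    index = {}  # original identifier -> dense integer index
    ids = []    # dense integer index -> original identifier

    def intern(node_id):
        if node_id not in index:
            index[node_id] = len(ids)
            ids.append(node_id)
        return index[node_id]

    pairs = [(intern(src), intern(dst)) for src, dst in edges]
    n = len(ids)

    # Count the out-degree of each node, then prefix-sum into offsets.
    offsets = array("Q", [0] * (n + 1))
    for src, _ in pairs:
        offsets[src + 1] += 1
    for i in range(n):
        offsets[i + 1] += offsets[i]

    # Fill the target array, using a per-node write cursor.
    cursor = offsets[:n]  # slicing copies the array
    targets = array("I", [0] * len(pairs))  # "I" is typically 4 bytes
    for src, dst in pairs:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return ids, offsets, targets


def references_of(node, offsets, targets):
    """Return the dense indices of the publications cited by `node`."""
    return list(targets[offsets[node]:offsets[node + 1]])


def estimate_bytes(n_nodes, n_edges, offset_width=8, target_width=4):
    """Rough size of the CSR arrays alone, under the assumed integer widths."""
    return offset_width * (n_nodes + 1) + target_width * n_edges


# Toy example with made-up identifiers.
edges = [("doi:a", "doi:b"), ("doi:a", "doi:c"), ("doi:b", "doi:c")]
ids, offsets, targets = build_csr(edges)
print([ids[t] for t in references_of(0, offsets, targets)])
print(estimate_bytes(200_000_000, 2_000_000_000) / 1e9, "GB")
```

Under these assumed widths, and using the abstract's lower bounds of 200 million publications and 2 billion citations, the two arrays alone would take roughly 9.6 GB. This only indicates that a structure-only encoding can plausibly fit in the stated memory budget; the actual layout and size of the released dataset may differ.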
