Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML
Sahar Vahdati, Farah Karim, Jyun-Yao Huang, and Christoph Lange

TL;DR
This paper compares the performance of three methods (an HBase MapReduce job, CSV mapping, and XML mapping) for converting large-scale research metadata into Linked Open Data, so that the export can fit into a daily data processing workflow.
Contribution
It provides a performance evaluation of three conversion approaches for exporting research metadata to Linked Data, aiding in selecting efficient methods for large-scale data processing.
Findings
HBase MapReduce outperforms CSV and XML in conversion speed
CSV-based conversion offers a good balance of simplicity and performance
XML conversion is the slowest among the three methods
Abstract
OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a database of all EC FP7 and H2020 funded research projects, including metadata of their results (publications and datasets). These data are stored in an HBase NoSQL database, post-processed, and exposed as HTML for human consumption, and as XML through a web service interface. As an intermediate format to facilitate statistical computations, CSV is generated internally. To interlink the OpenAIRE data with related data on the Web, we aim at exporting them as Linked Open Data (LOD). The LOD export is required to integrate into the overall data processing workflow, where derived data are regenerated from the base data every day. We thus faced the challenge of identifying the best-performing conversion approach. We evaluated the performance of creating LOD by a MapReduce job on top of HBase, by mapping the…
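To make the CSV route mentioned in the abstract more concrete, the following is a minimal sketch of mapping CSV rows to N-Triples. It is not the authors' implementation: the column names (`id`, `title`, `year`), the base URI `http://example.org/publication/`, and the choice of Dublin Core Terms properties are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote

# Dublin Core Terms properties; choosing them here is an assumption,
# not the vocabulary used by the paper.
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_ISSUED = "http://purl.org/dc/terms/issued"


def escape_literal(value):
    """Escape characters that must not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, base_uri="http://example.org/publication/"):
    """Map each CSV row to one or two N-Triples lines.

    Assumes hypothetical columns 'id', 'title', and 'year'.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode the identifier so it is safe inside a URI.
        subject = "<" + base_uri + quote(row["id"], safe="") + ">"
        title = escape_literal(row["title"])
        triples.append(f'{subject} <{DCT_TITLE}> "{title}" .')
        if row.get("year"):  # skip empty or missing year values
            year = escape_literal(row["year"])
            triples.append(f'{subject} <{DCT_ISSUED}> "{year}" .')
    return triples


if __name__ == "__main__":
    sample = "id,title,year\np1,Linked Data,2015\np2,No Year,\n"
    for line in csv_to_ntriples(sample):
        print(line)
```

A row-by-row mapping like this needs no cluster and is easy to run as part of a nightly regeneration job. Whether it is faster than the other two routes is the question the paper's evaluation addresses.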
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Database Systems and Queries · Biomedical Text Mining and Ontologies
