ORCA: a Benchmark for Data Web Crawlers

Michael Röder; Geraldo de Souza; Denis Kuchelev; Abdelmoneim Amer Desouki; Axel-Cyrille Ngonga Ngomo

arXiv:1912.08026 · cs.DB · October 30, 2020

TL;DR

Orca is a new benchmark that creates a synthetic Data Web to fairly evaluate and compare the performance of Data Web crawlers, addressing the lack of standardized evaluation tools.

Contribution

This work introduces Orca, the first benchmark for Data Web crawlers, enabling fair, repeatable, and comprehensive performance assessments.

Findings

01

Orca effectively differentiates crawler performance.

02

It reveals strengths and weaknesses of existing crawlers.

03

The benchmark is open-source and publicly available.

Abstract

The number of RDF knowledge graphs available on the Web grows constantly. Gathering these graphs at large scale for downstream applications hence requires the use of crawlers. Although Data Web crawlers exist, and general Web crawlers could be adapted to focus on the Data Web, there is currently no benchmark to fairly evaluate their performance. Our work closes this gap by presenting the Orca benchmark. Orca generates a synthetic Data Web, which is decoupled from the original Web and enables a fair and repeatable comparison of Data Web crawlers. Our evaluations show that Orca can be used to reveal the different advantages and disadvantages of existing crawlers. The benchmark is open-source and available at https://github.com/dice-group/orca.

Code & Models

Repositories

dice-group/orca (official)
