FastWARC: Optimizing Large-Scale Web Archive Analytics

Janek Bevendorff; Martin Potthast; Benno Stein

arXiv:2112.03103·cs.IR·December 7, 2021

TL;DR

FastWARC is a high-performance Python library for processing large web archives, significantly reducing computation time by optimizing WARC file handling with C++/Cython, enabling more efficient large-scale web data analytics.

Contribution

The paper introduces FastWARC, a novel C++/Cython-based library that accelerates WARC processing, addressing inefficiencies in existing tools for large-scale web archive analysis.

Findings

01 Achieves a 1.6-8x speedup over existing WARC tools

02 Reduces processing time for large web archives significantly

03 Improves efficiency of large-scale web data analytics

Abstract

Web search and other large-scale web data analytics rely on processing archives of web pages stored in a standardized and efficient format. Since its introduction in 2008, the IIPC's Web ARChive (WARC) format has become the standard format for this purpose. As a list of individually compressed records of HTTP requests and responses, it allows for constant-time random access to all kinds of web data via off-the-shelf open source parsers in many programming languages, such as WARCIO, the de facto standard for Python. When processing web archives at the terabyte or petabyte scale, however, even small inefficiencies in these tools add up quickly, resulting in hours, days, or even weeks of wasted compute time. Reviewing the basic components of WARCIO and analyzing its bottlenecks, we proceed to build FastWARC, a new high-performance WARC processing library for Python, written in C++/Cython,…
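
The abstract's point about individually compressed records can be illustrated with a minimal sketch that uses only the Python standard library. It is not FastWARC's or WARCIO's implementation: the helper names `write_records` and `read_record_at` are hypothetical, the records are toy byte strings rather than complete WARC records, and the byte-offset list stands in for whatever external index a real archive would use. Each record is written as its own gzip member, so a reader that knows a record's byte offset can decompress that record alone without touching the ones before it.

```python
import gzip
import io
import zlib


def write_records(records):
    """Compress each record as a separate gzip member (hypothetical helper).

    Returns the concatenated bytes and the byte offset at which each
    member starts. The offset list plays the role of an external index.
    """
    buf = io.BytesIO()
    offsets = []
    for payload in records:
        offsets.append(buf.tell())         # where this member begins
        buf.write(gzip.compress(payload))  # one gzip member per record
    return buf.getvalue(), offsets


def read_record_at(blob, offset):
    """Decompress exactly one gzip member starting at `offset` (hypothetical helper)."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the first member; any following
    # bytes are left untouched in decompressor.unused_data.
    return decompressor.decompress(blob[offset:])


# Toy records that only imitate the shape of a WARC record.
toy_records = [
    b"WARC/1.1\r\nWARC-Type: request\r\n\r\nGET / HTTP/1.1",
    b"WARC/1.1\r\nWARC-Type: response\r\n\r\nHTTP/1.1 200 OK",
    b"WARC/1.1\r\nWARC-Type: metadata\r\n\r\nfetched-by: example",
]

archive, index = write_records(toy_records)
second = read_record_at(archive, index[1])  # jump straight to record 2
```

Because the members are independent, the cost of fetching one record depends on that record's size and not on its position in the file, provided the offset is already known. Without an index, a reader still has to scan the file sequentially, which is the workload the abstract describes as adding up at terabyte or petabyte scale.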

Code & Models

Repositories

chatnoir-eu/chatnoir-resiliparse (official)

Taxonomy

Topics: Web Data Mining and Analysis · Algorithms and Data Compression · Advanced Data Storage Technologies