FastWARC: Optimizing Large-Scale Web Archive Analytics
Janek Bevendorff, Martin Potthast, Benno Stein

TL;DR
FastWARC is a high-performance WARC processing library for Python, written in C++/Cython. By optimizing WARC file handling, it cuts the compute time of large-scale web archive analytics, with reported speedups of 1.6-8x over existing WARC tools.
Contribution
The paper introduces FastWARC, a novel C++/Cython-based library that accelerates WARC processing, addressing inefficiencies in existing tools for large-scale web archive analysis.
Findings
Achieves 1.6-8x speedup over existing WARC tools
Reduces processing time for large web archives significantly
Improves efficiency of large-scale web data analytics
Abstract
Web search and other large-scale web data analytics rely on processing archives of web pages stored in a standardized and efficient format. Since its introduction in 2008, the IIPC's Web ARChive (WARC) format has become the standard format for this purpose. As a list of individually compressed records of HTTP requests and responses, it allows for constant-time random access to all kinds of web data via off-the-shelf open source parsers in many programming languages, such as WARCIO, the de facto standard for Python. When processing web archives at the terabyte or petabyte scale, however, even small inefficiencies in these tools add up quickly, resulting in hours, days, or even weeks of wasted compute time. Reviewing the basic components of WARCIO and analyzing its bottlenecks, we proceed to build FastWARC, a new high-performance WARC processing library for Python, written in C++/Cython,…
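The abstract's point about individually compressed records enabling constant-time random access can be illustrated with a minimal sketch that uses only the Python standard library. It is not FastWARC or WARCIO code: the helper names `write_records` and `read_record_at` are hypothetical, and the record contents are simplified placeholders rather than complete, valid WARC records. The assumption is that each record is written as its own gzip member, so a reader that knows a record's byte offset can decompress that record alone.

```python
import gzip
import io
import zlib


def write_records(records):
    """Write each record as a separate gzip member.

    Returns the archive bytes and the byte offset at which each member starts.
    """
    buf = io.BytesIO()
    offsets = []
    for payload in records:
        offsets.append(buf.tell())          # remember where this member begins
        buf.write(gzip.compress(payload))   # one independent gzip member per record
    return buf.getvalue(), offsets


def read_record_at(archive, offset):
    """Decompress exactly one gzip member starting at `offset`.

    wbits = MAX_WBITS | 16 tells zlib to expect a gzip header. The decompressor
    stops at the end of that member; any bytes after it are left in
    `unused_data` and are never decompressed.
    """
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    return decompressor.decompress(archive[offset:])


# Placeholder records (illustrative only, not complete WARC records).
RECORDS = [
    b"WARC/1.1\r\nWARC-Type: request\r\n\r\nGET / HTTP/1.1",
    b"WARC/1.1\r\nWARC-Type: response\r\n\r\nHTTP/1.1 200 OK",
    b"WARC/1.1\r\nWARC-Type: metadata\r\n\r\nfetch-time: 12",
]

ARCHIVE, OFFSETS = write_records(RECORDS)

# Jump straight to the third record without touching the first two.
third = read_record_at(ARCHIVE, OFFSETS[2])
```

In a real archive the offsets would typically come from an external index rather than from the writer, and the slice would be replaced by a file seek. The sketch only shows why per-record compression makes such a seek sufficient.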
Taxonomy
Topics: Web Data Mining and Analysis · Algorithms and Data Compression · Advanced Data Storage Technologies
