WCXB: A Multi-Type Web Content Extraction Benchmark
Murrough Foley

TL;DR
The paper introduces WCXB, a benchmark of 2,008 web pages from 1,613 domains, spanning seven structurally distinct page types, for evaluating and improving web content extraction methods.
Contribution
It provides an annotated dataset of 2,008 pages, split into a 1,497-page development set and a 511-page held-out test set, as a benchmark for evaluating extraction systems across seven page types rather than articles alone.
Findings
Top extraction systems perform well on articles (F1=0.93)
Performance varies significantly on structured page types (F1=0.41-0.84)
Existing benchmarks overlook challenges in non-article page types
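The F1 figures above compare extracted text against a ground truth annotation. This summary does not say how WCXB computes the score, so the following is only a minimal sketch of one common choice, a word-level F1 over whitespace-separated tokens. The function name `token_f1` and the tokenization are assumptions made for illustration, not the benchmark's definition.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Word-level F1 between extracted text and a reference annotation.

    Assumption: tokens are whitespace-separated words, and overlap is
    counted as a multiset intersection. The benchmark's own scoring
    may differ.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()

    # Two empty texts agree perfectly; one empty text matches nothing.
    if not pred_tokens and not ref_tokens:
        return 1.0
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an extractor that returns the main content plus an equal amount of boilerplate has full recall but only half precision, which gives an F1 of about 0.67.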
Abstract
Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated…
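The abstract states that the development and test sets have matched page type distributions (1,497 + 511 = 2,008 pages). How the authors produced that split is not described here. One way to get matched distributions is a stratified split, sketched below under that assumption. The function `stratified_split`, its parameters, and the `(page_id, page_type)` input format are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(pages, test_fraction, seed=0):
    """Split (page_id, page_type) pairs into dev and test lists so that
    each page type contributes roughly the same fraction to the test set.

    Hypothetical helper for illustration; not the authors' procedure.
    """
    rng = random.Random(seed)

    # Group page ids by their page type.
    by_type = defaultdict(list)
    for page_id, page_type in pages:
        by_type[page_type].append(page_id)

    dev, test = [], []
    for page_type in sorted(by_type):
        ids = by_type[page_type]
        rng.shuffle(ids)
        # Take the same fraction of every type for the test set.
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        dev.extend(ids[n_test:])
    return dev, test
```

With the sizes given in the abstract, the test fraction would be 511 / 2,008, or roughly 25% of each page type.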
