WCXB: A Multi-Type Web Content Extraction Benchmark

Murrough Foley

arXiv:2605.21097 · cs.CL · May 21, 2026

TL;DR

The paper introduces WCXB, a benchmark dataset of 2,008 web pages spanning seven structurally distinct page types, built to evaluate and improve web content extraction methods.

Contribution

It provides 2,008 annotated pages from 1,613 domains, split into a 1,497-page development set and a 511-page held-out test set, as a benchmark for evaluating extraction systems across multiple web page types.

Findings

01  Top extraction systems perform well on articles (F1=0.93)

02  Performance varies significantly on structured page types (F1=0.41-0.84)

03  Existing benchmarks overlook challenges in non-article page types
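
The findings above are reported as F1 scores. This page does not say how WCXB computes F1, so the following is only a minimal sketch of one common convention for extraction benchmarks: bag-of-words F1 between the extracted text and the ground-truth text, averaged per page type. The function names `token_f1` and `mean_f1_by_type`, the whitespace tokenization, and the lowercasing are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter, defaultdict


def token_f1(predicted: str, reference: str) -> float:
    """Bag-of-words F1 between extracted text and ground-truth text.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # both empty: treat as a perfect match
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most
    # as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_type(rows):
    """Average token_f1 per page type.

    rows: iterable of (page_type, predicted_text, reference_text).
    """
    scores = defaultdict(list)
    for page_type, predicted, reference in rows:
        scores[page_type].append(token_f1(predicted, reference))
    return {t: sum(v) / len(v) for t, v in scores.items()}
```

Under this convention, an extractor that keeps navigation text alongside the main content loses precision but not recall, which is one way boilerplate leakage on structured pages would lower the score.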

Abstract

Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated…
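
The abstract describes a development set and a held-out test set with matched page type distributions (1,497 + 511 = 2,008 pages). How the authors produced that split is not stated here; a stratified split is one standard way to get matched distributions, sketched below. The function `stratified_split`, its `test_fraction` and `seed` parameters, and the `(page_id, page_type)` record shape are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(pages, test_fraction, seed=0):
    """Split pages so each page type keeps roughly the same share in both sets.

    pages: list of (page_id, page_type) tuples (hypothetical record shape).
    Returns (dev, test) lists.
    """
    by_type = defaultdict(list)
    for page in pages:
        by_type[page[1]].append(page)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    dev, test = [], []
    for page_type in sorted(by_type):
        group = by_type[page_type]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        dev.extend(group[n_test:])
    return dev, test
```

For the sizes in the abstract, `test_fraction` would be about 511 / 2008, roughly 0.25, applied within each of the seven page types rather than to the pool as a whole.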
