ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information
Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan

TL;DR
ClueWeb22 is a large-scale, high-quality web dataset with 10 billion pages, enriched with visual, structural, and textual information, designed to support advanced research in information retrieval and AI.
Contribution
It introduces a significantly larger, more diverse, and higher-quality web corpus with multi-modal signals, available for research at an unprecedented scale.
Findings
Provides rich visual and structural data for each web page.
Enables new research in retrieval-augmented AI and model pretraining.
Offers a resource aligned with commercial web search distributions.
Abstract
ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercial web search. Besides raw HTML, ClueWeb22 includes rich information about the web pages provided by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, parsed HTML structure information from a neural network parser, and pre-processed cleaned document text to lower the barrier to entry. Many of these signals have…
Taxonomy
Topics: Web Data Mining and Analysis · Topic Modeling · Natural Language Processing Techniques
