Web2Wiki: Characterizing Wikipedia Linking Across the Web
Veniamin Veselovsky, Tiziano Piccardi, Ashton Anderson, Robert West, Akhil Arora

TL;DR
This study provides a large-scale analysis of how Wikipedia is referenced across the Web, revealing its primary roles and contexts in various online domains, and introduces the Web2Wiki dataset for further research.
Contribution
First large-scale analysis of Wikipedia links across the Web, characterizing their distribution, context, and function, and releasing a comprehensive dataset for future studies.
Findings
Wikipedia is mainly cited by news and science sites for information.
Most links appear within the main content rather than in boilerplate or user-generated sections.
95% of links are explanatory references, not evidence or attribution.
Abstract
Wikipedia is one of the most visited websites globally, yet its role beyond its own platform remains largely unexplored. In this paper, we present the first large-scale analysis of how Wikipedia is referenced across the Web. Using a dataset from Common Crawl, we identify over 90 million Wikipedia links spanning 1.68% of Web domains and examine their distribution, context, and function. Our analysis of English Wikipedia reveals three key findings: (1) Wikipedia is most frequently cited by news and science websites for informational purposes, while commercial websites reference it less often. (2) The majority of Wikipedia links appear within the main content rather than in boilerplate or user-generated sections, highlighting their role in structured knowledge presentation. (3) Most links (95%) serve as explanatory references rather than as evidence or attribution, reinforcing Wikipedia's…
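The abstract's link-identification step can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes the input is plain HTML already pulled out of a crawl record, and the names `WikipediaLinkParser`, `is_wikipedia_url`, and `extract_wikipedia_links` are hypothetical helpers introduced here for illustration. It uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_wikipedia_url(url):
    """Return True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


class WikipediaLinkParser(HTMLParser):
    """Collects href values of <a> tags that point to a Wikipedia host."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links to non-Wikipedia hosts.
        if href and is_wikipedia_url(href):
            self.links.append(href)


def extract_wikipedia_links(html):
    """Hypothetical helper: list the Wikipedia links found in one HTML page."""
    parser = WikipediaLinkParser()
    parser.feed(html)
    return parser.links
```

Matching on the parsed hostname rather than on a substring of the URL avoids counting pages that merely mention "wikipedia.org" in a path or that live on a look-alike domain.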
Taxonomy
Topics: Wikis in Education and Collaboration · Knowledge Management and Sharing · Information Retrieval and Search Behavior
