Wikipedia Reader Navigation: When Synthetic Data Is Enough
Akhil Arora, Martin Gerlach, Tiziano Piccardi, Alberto García-Durán, Robert West

TL;DR
This study evaluates how effectively Wikipedia's publicly available clickstream data can approximate actual reader navigation patterns, demonstrating its utility for research while respecting user privacy.
Contribution
The paper systematically compares real Wikipedia navigation sequences with synthetic ones generated from clickstream data, showing close approximation with small effect sizes.
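As a rough illustration of what "synthetic sequences generated from clickstream data" can mean, the sketch below samples a navigation sequence by treating clickstream counts as transition weights in a first-order random walk. This is a minimal sketch under stated assumptions, not necessarily the paper's generation procedure: the article names and counts in `TOY_CLICKSTREAM` are invented, the helper names `build_transitions` and `sample_sequence` are hypothetical, and the walk ignores the possibility that a reader stops before reaching the length cap.

```python
import random
from collections import defaultdict

# Hypothetical toy clickstream rows: (source article, target article, click count).
# These values are invented for illustration and are not real Wikipedia data.
TOY_CLICKSTREAM = [
    ("Zurich", "Switzerland", 80),
    ("Zurich", "ETH_Zurich", 20),
    ("Switzerland", "Alps", 50),
    ("Switzerland", "Zurich", 50),
    ("ETH_Zurich", "Albert_Einstein", 100),
]


def build_transitions(rows):
    """Group targets and click counts by source article."""
    transitions = defaultdict(lambda: ([], []))
    for source, target, count in rows:
        targets, counts = transitions[source]
        targets.append(target)
        counts.append(count)
    return dict(transitions)


def sample_sequence(transitions, start, max_length, rng):
    """Sample a random walk, choosing each next article in proportion to its clicks.

    The walk ends at max_length articles or at an article with no outgoing clicks.
    """
    sequence = [start]
    while len(sequence) < max_length and sequence[-1] in transitions:
        targets, counts = transitions[sequence[-1]]
        sequence.append(rng.choices(targets, weights=counts, k=1)[0])
    return sequence


rng = random.Random(0)  # fixed seed so the sketch is reproducible
transitions = build_transitions(TOY_CLICKSTREAM)
sequence = sample_sequence(transitions, "Zurich", 5, rng)
print(" -> ".join(sequence))
```

Because each step depends only on the current article, a generator like this cannot reproduce longer-range dependencies in real navigation, which is the kind of gap a comparison between real and synthetic sequences would need to measure.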
Findings
Synthetic sequences generated from clickstream data closely approximate real navigation, with differences of less than 10%.
Synthetic sequences can be used for practical downstream applications.
Clickstream data enables privacy-preserving research on user navigation.
Abstract
Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data due to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated by using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the…
