Automatic Generation of Web Censorship Probe Lists
Jenny Tang, Leo Alvarez, Arjun Brar, Nguyen Phong Hoang, Nicolas Christin

TL;DR
This paper presents an automated method for generating and updating web censorship probe lists by analyzing URL content, expanding topics, and testing accessibility from multiple locations, improving the scalability and accuracy of censorship measurement.
Contribution
The paper introduces an automated approach to generating and updating web censorship probe lists, combining content analysis of known URLs with search-engine expansion to reduce manual effort and increase coverage.
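As a rough illustration of the content-analysis and search-expansion idea, the sketch below derives query terms from the text of seed pages, passes them to a search function, and keeps results not already in the seed list. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`top_terms`, `expand_probe_list`, `new_domains`), the stopword list, and the term-frequency heuristic are all hypothetical stand-ins, and the `search` callable is a placeholder for whatever search backend is used. The accessibility-testing stage is not shown.

```python
import re
from collections import Counter
from typing import Callable, Iterable
from urllib.parse import urlsplit

# Hypothetical, tiny stopword list; a real pipeline would need per-language handling.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was"}


def top_terms(text: str, k: int = 3) -> list[str]:
    """Return the k most frequent non-stopword terms.

    Plain term frequency is a stand-in for real topic extraction, and the
    ASCII-only regex is a simplification (the seed pages are multilingual).
    """
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]


def domain_of(url: str) -> str:
    """Lowercased hostname of a URL, or an empty string if it has none."""
    return (urlsplit(url).hostname or "").lower()


def expand_probe_list(
    seed_pages: dict[str, str],
    search: Callable[[str], Iterable[str]],
    k: int = 3,
) -> list[str]:
    """Generate candidate URLs by searching for each seed page's top terms.

    seed_pages maps a seed URL to its already-fetched page text.
    search is an assumed callable that maps a query string to result URLs.
    """
    seen = set(seed_pages)
    candidates: list[str] = []
    for text in seed_pages.values():
        query = " ".join(top_terms(text, k))
        if not query:
            continue
        for hit in search(query):
            if hit not in seen:  # skip seed URLs and duplicates
                seen.add(hit)
                candidates.append(hit)
    return candidates


def new_domains(seed_urls: Iterable[str], candidates: Iterable[str]) -> set[str]:
    """Domains that appear among the candidates but not among the seeds."""
    known = {domain_of(u) for u in seed_urls}
    return {domain_of(u) for u in candidates} - known - {""}
```

In this sketch the candidate list would then be handed to a separate measurement step that tests each URL from multiple vantage points; keeping `search` as a parameter makes the expansion logic testable without network access.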
Findings
Discovered over 1,400 new potentially censored domains
Generated 119,255 new URLs from initial seed URLs
Demonstrated the feasibility of automated, scalable censorship measurement
Abstract
Domain probe lists--used to determine which URLs to probe for Web censorship--play a critical role in Internet censorship measurement studies. Indeed, the size and accuracy of the domain probe list limit the set of censored pages that can be detected; inaccurate lists can lead to an incomplete view of the censorship landscape or biased results. Previous efforts to generate domain probe lists have been mostly manual or crowdsourced. This approach is time-consuming, prone to errors, and does not scale well to the ever-changing censorship landscape. In this paper, we explore methods for automatically generating probe lists that are both comprehensive and up-to-date for Web censorship measurement. We start from an initial set of 139,957 unique URLs from various existing test lists consisting of pages from a variety of languages to generate new candidate pages. By analyzing content from…
