Identifying AI Web Scrapers Using Canary Tokens
Steven Seiden, Triss Ren, Caroline Zhang, Taein Kim, Enze Liu, Emily Wenger

TL;DR
This paper introduces a method that uses Canary Tokens to automatically identify which web scrapers feed data into large language models (LLMs), so that site owners know which scrapers to restrict.
Contribution
It presents a new technique for reliably and automatically inferring which web scrapers are associated with an LLM, using Canary Tokens and LLM prompt responses.
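The general idea of pairing canary tokens with prompt responses can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the site serves a distinct token to each visiting User-Agent and later looks for those tokens in an LLM's answer. The names `SECRET`, `canary_for`, `render_page`, and `attribute` are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

# Hypothetical per-site secret (an assumption, not from the paper).
SECRET = b"hypothetical-site-secret"


def canary_for(user_agent: str) -> str:
    """Derive a unique, hard-to-guess token for one User-Agent string."""
    digest = hmac.new(SECRET, user_agent.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"canary-{digest[:16]}"


def render_page(user_agent: str) -> str:
    """Serve page content that embeds the token for the requesting User-Agent."""
    return f"<p>The reference code for this page is {canary_for(user_agent)}.</p>"


def attribute(llm_response: str, seen_user_agents: list[str]) -> list[str]:
    """Return the User-Agents whose tokens appear in an LLM's response."""
    return [ua for ua in seen_user_agents if canary_for(ua) in llm_response]
```

Under these assumptions, if an LLM asked about the page repeats the token that was served only to one User-Agent, that User-Agent is a plausible candidate for the scraper feeding that LLM.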
Findings
Successfully identified scrapers for 22 LLM systems.
Detected several scrapers not publicly disclosed.
The method is reliable and scalable, and third parties can run the inference themselves.
Abstract
From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e.g., via User-Agent strings). Existing mechanisms to identify LLM-related scrapers rely on voluntary disclosure by companies, one-off experiments by researchers, or crowd-sourced reports -- methods that are neither reliable nor scalable. This paper proposes a novel technique for accurately and…
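To make concrete why scraper identification matters for the Robots Exclusion Protocol, the sketch below uses Python's standard `urllib.robotparser` to evaluate a robots.txt policy keyed on a User-Agent name. The scraper name `ExampleLLMBot` and the URL are hypothetical placeholders, not scrapers identified in the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one named scraper, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Check whether the policy permits this User-Agent to fetch the URL."""
    return bool(parser.can_fetch(user_agent, url))
```

A rule like this only restricts a scraper whose User-Agent name the site owner already knows, which is the identification gap described above.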
