Identifying AI Web Scrapers Using Canary Tokens

Steven Seiden; Triss Ren; Caroline Zhang; Taein Kim; Enze Liu; Emily Wenger

arXiv:2605.13706·cs.CR·May 14, 2026

TL;DR

This paper introduces a method that uses Canary Tokens to automatically identify which web scrapers feed data into large language models, giving site owners the information they need to restrict those scrapers.

Contribution

It presents a new technique for reliably and automatically inferring LLM-related web scrapers using Canary Tokens and LLM prompt responses.
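The summary above does not spell out the mechanics, so the following is only a minimal sketch of one way a canary-token scheme could work, not the authors' implementation. It assumes a site serves a distinct random token to each visiting User-Agent and later checks whether any token surfaces in an LLM's response to a prompt. The class `CanaryRegistry`, its methods, and the `canary-` token format are all hypothetical names introduced for illustration.

```python
import secrets


class CanaryRegistry:
    """Hypothetical helper: maps unique canary tokens to the User-Agent served each one."""

    def __init__(self):
        self._token_to_agent = {}
        self._agent_to_token = {}

    def token_for(self, user_agent):
        # Reuse the same token for repeat visits from the same User-Agent,
        # so each User-Agent string is tied to exactly one token.
        if user_agent not in self._agent_to_token:
            token = "canary-" + secrets.token_hex(8)  # assumed token format
            self._agent_to_token[user_agent] = token
            self._token_to_agent[token] = user_agent
        return self._agent_to_token[user_agent]

    def agents_in(self, llm_response):
        # Return every User-Agent whose token appears verbatim in the response text.
        return [
            agent
            for token, agent in self._token_to_agent.items()
            if token in llm_response
        ]


registry = CanaryRegistry()
token = registry.token_for("ExampleBot/1.0")  # embed this token in the page served to ExampleBot
print(registry.agents_in(f"The page mentions {token}."))
```

In this sketch, a token that reappears in model output links that output back to the User-Agent that fetched the page. A real deployment would also need to handle tokens that a model paraphrases or truncates, which this exact-match lookup does not attempt.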

Findings

01. Successfully identified scrapers for 22 LLM systems.
02. Detected several scrapers not publicly disclosed.
03. Method is reliable and scalable for third-party inference.

Abstract

From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e.g., via User-Agent strings). Existing mechanisms to identify LLM-related scrapers rely on voluntary disclosure by companies, one-off experiments by researchers, or crowd-sourced reports -- methods that are neither reliable nor scalable. This paper proposes a novel technique for accurately and…
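To illustrate why the Robots Exclusion Protocol depends on knowing a scraper's User-Agent string, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleLLMBot` agent name, the `example.com` URL, and the `is_allowed` helper are made up for this example and do not come from the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named scraper, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleLLMBot/1.0", "https://example.com/page"))
print(is_allowed(ROBOTS_TXT, "SomeBrowser/5.0", "https://example.com/page"))
```

The rule only takes effect for a scraper whose User-Agent the site owner has already named, and only if that scraper chooses to honor robots.txt, since the protocol is advisory. A scraper using an undisclosed User-Agent falls through to the wildcard rule, which is the identification gap the abstract describes.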
