Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study

Taein Kim; Karstan Bock; Claire Luo; Amanda Liswood; Chloe Poroslay; Emily Wenger

arXiv:2505.21733·cs.NI·October 24, 2025

TL;DR

This large-scale empirical study finds that many web scrapers, especially AI search bots, often ignore robots.txt directives, which calls into question the file's effectiveness for controlling unwanted scraping.

Contribution

The study provides the first large-scale analysis of scraper compliance with robots.txt, highlighting the protocol's limitations and the need for alternative anti-scraping measures.

Findings

01

Bots are less compliant with stricter robots.txt directives.

02

AI search crawlers rarely check robots.txt.

03

Relying solely on robots.txt is risky for preventing unwanted scraping.

Abstract

Online data scraping has taken on new dimensions in recent years, as traditional scrapers have been joined by new AI-specific bots. To counteract unwanted scraping, many sites use tools like the Robots Exclusion Protocol (REP), which places a robots.txt file at the site root to dictate scraper behavior. Yet, the efficacy of the REP is not well-understood. Anecdotal evidence suggests some bots comply poorly with it, but no rigorous study exists to support (or refute) this claim. To understand the merits and limits of the REP, we conduct the first large-scale study of web scraper compliance with robots.txt directives using anonymized web logs from our institution. We analyze the behavior of 130 self-declared bots (and many anonymous ones) over 40 days, using a series of controlled robots.txt experiments. We find that bots are less likely to comply with stricter robots.txt…
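
To make the mechanism concrete, here is a minimal sketch of the check a compliant scraper would perform before fetching a page, using Python's standard-library urllib.robotparser. The robots.txt content, the bot name ExampleBot, the example.org URLs, and the helper is_allowed are all hypothetical illustrations; none of them come from the paper, and this is not the authors' measurement method.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot gets a narrow restriction,
# every other bot is disallowed from the whole site.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Illustrative helper (an assumption, not from the paper): it parses the
    file contents directly instead of downloading them from a site root.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.org/public/page.html"),
        ("ExampleBot", "https://example.org/private/data.html"),
        ("OtherBot", "https://example.org/public/page.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in the protocol enforces this check. A scraper that never requests robots.txt, or requests it and ignores the answer, faces no technical barrier, which is why compliance has to be measured empirically.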
