This is an open list of web crawlers associated with AI companies and the training of LLMs, which you may wish to block. We encourage you to contribute to and implement this list on your own site. See information about the listed crawlers and the FAQ.
A number of these crawlers have been sourced from Dark Visitors, and we appreciate their ongoing effort to track these crawlers.
If you'd like to add information about a crawler to the list, please make a pull request with the bot name added to `robots.txt` and `ai.txt`, and any relevant details in `table-of-bot-metrics.md` to help people understand what's crawling.
This repository provides the following files:

- `robots.txt`
- `.htaccess`
- `nginx-block-ai-bots.conf`
- `Caddyfile`
- `haproxy-block-ai-bots.txt`
`robots.txt` implements the Robots Exclusion Protocol (RFC 9309).
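As an abridged sketch of what the generated file looks like (the distributed `robots.txt` names every crawler in the list; GPTBot and CCBot appear here only as examples of listed agents), RFC 9309 lets several `User-agent` lines share one group-wide rule:

```
# Excerpt for illustration; the real file lists many more agents.
User-agent: GPTBot
User-agent: CCBot
Disallow: /
```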
`.htaccess` may be used to configure web servers such as Apache httpd to return an error page when one of the listed AI crawlers sends a request. Note that, as stated in the httpd documentation, more performant methods than an `.htaccess` file exist.
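A minimal sketch of the technique, assuming mod_rewrite is enabled and naming only two agents for brevity (the distributed file covers the full list):

```
# Match the User-Agent header case-insensitively and answer 403 Forbidden.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot) [NC]
RewriteRule .* - [F,L]
```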
`nginx-block-ai-bots.conf` implements an Nginx configuration snippet that can be included in any virtual host's `server {}` block via the `include` directive.
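A minimal sketch of such an include; the snippet path and site details are assumptions to adapt to your setup:

```
server {
    listen 80;
    server_name example.com;

    # Evaluate the blocklist's User-Agent checks for this virtual host.
    include /etc/nginx/snippets/nginx-block-ai-bots.conf;

    location / {
        root /var/www/html;
    }
}
```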
`Caddyfile` includes a Header Regex matcher group you can copy or import into your Caddyfile; the rejection can then be handled as follows: `abort @aibots`.
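A minimal sketch, assuming you saved this repository's matcher file at the hypothetical path `/etc/caddy/ai-bots-matcher`:

```
example.com {
    # Bring the @aibots matcher definition into this site block.
    import /etc/caddy/ai-bots-matcher

    # Close the connection for any request the matcher flags.
    abort @aibots

    file_server
}
```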
`haproxy-block-ai-bots.txt` may be used to configure HAProxy to block AI bots. To implement it:

- Add the file to the config directory of HAProxy.
- Add the following lines in the `frontend` section:

  ```
  acl ai_robot hdr_sub(user-agent) -i -f /etc/haproxy/haproxy-block-ai-bots.txt
  http-request deny if ai_robot
  ```

(Note that the path of the `haproxy-block-ai-bots.txt` file may be different in your environment.)
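Whichever server you use, you can smoke-test the block by sending a request with a listed crawler's User-Agent (GPTBot is used here illustratively; HAProxy's `http-request deny`, for instance, answers 403 by default):

```
# Expect a denial for a listed agent...
curl -I -A "GPTBot" https://example.com/

# ...while an ordinary browser User-Agent should still get through.
curl -I -A "Mozilla/5.0" https://example.com/
```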
A note about contributing: updates should be made to `robots.json`. A GitHub action will then generate the updated `robots.txt`, `table-of-bot-metrics.md`, `.htaccess`, and `nginx-block-ai-bots.conf`.
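For illustration, entries in `robots.json` are keyed by the crawler's name and carry descriptive metadata; the field names below are assumptions inferred from the columns of `table-of-bot-metrics.md`, so mirror an existing entry before submitting:

```
{
  "ExampleBot": {
    "operator": "Example AI Corp",
    "respect": "Unclear at this time.",
    "function": "Scrapes data to train LLMs.",
    "frequency": "No information provided.",
    "description": "Hypothetical entry shown for illustration only."
  }
}
```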
You can run the tests by installing Python 3 and issuing:

```
code/tests.py
```
You can subscribe to list updates via RSS/Atom with the releases feed:
https://github.com/ai-robots-txt/ai.robots.txt/releases.atom
You can subscribe with Feedly, Inoreader, The Old Reader, Feedbin, or any other reader app.
Alternatively, you can subscribe to new releases with your GitHub account by clicking the ⬇️ on the "Watch" button at the top of this page, clicking "Custom", and selecting "Releases".
If you use Cloudflare's hard block alongside this list, you can report abusive crawlers that don't respect `robots.txt` here.
But even if you don't use Cloudflare's hard block, their list of verified bots may come in handy.
- Blocking Bots with Nginx by Robb Knight
- Blockin' bots. by Ethan Marcotte
- Blocking Bots With 11ty And Apache by fLaMEd fury
- Blockin' bots on Netlify by Jeremia Kimelman
- Blocking AI web crawlers by Glyn Normington
- Block AI Bots from Crawling Websites Using Robots.txt by Jonathan Gillham, Originality.AI