Blocking AI bots is the implementation half of a question with two halves. The strategic side, whether you should block them at all, lives in Should You Block AI Bots?. This article is the technical reference: who the bots are, what each one does, and the exact rules to keep them out, at every layer from robots.txt to your CDN.
The short version: most major AI crawlers respect robots.txt, so the first rule belongs there. A few do not, so the second rule belongs at the web server or at the CDN. And the bot you want to think hardest about is rarely the famous one but the one nobody has heard of yet, which is why the right setup is layered and verifiable, not a single line in a single file.
How do I block AI bots?
There are five places to block AI bots; the first three run in increasing order of force. First, robots.txt opts you out of every well-behaved crawler with one User-agent group and a Disallow: / line per bot. Second, your web server (Nginx or Apache) rejects the request when the User-Agent header matches, which catches bots that ignore robots.txt but still send their real UA. Third, your CDN (Cloudflare is the common case) blocks at the edge before the request ever reaches your origin. Fourth, WordPress can layer plugin-based or theme-level filters on top. Fifth, special vendor tokens like Google-Extended and Applebot-Extended opt you out of model training without blocking the search crawler. Combine robots.txt for the well-behaved bots, server or CDN rules for Bytespider, and vendor tokens for Google and Apple.
Jump to:
- The bots that matter
- Vendor reference: official bot docs and IP sources
- Block AI bots with robots.txt
- Per-bot robots.txt snippets
- Block AI bots with Nginx
- Block AI bots with Apache
- Block AI bots with Cloudflare
- Block AI bots in WordPress
- Verify the blocks actually work
- The bots that ignore robots.txt
- FAQ
The bots that matter
There are dozens of AI-related User-Agents in the wild, but a handful drive almost all the traffic and almost all the policy debate. The table below is the working set I keep updated. The "Respects robots.txt" column is the one most other guides get wrong; verified per-vendor as of mid-2026.
| Bot (User-agent) | Owner | Purpose | Respects robots.txt | Sends traffic | Default action |
|---|---|---|---|---|---|
| GPTBot | OpenAI | LLM training | Yes | No | Block if anti-training |
| ChatGPT-User | OpenAI | User-triggered fetch from ChatGPT | Yes | Yes (when a user clicks a citation) | Allow |
| OAI-SearchBot | OpenAI | ChatGPT search index | Yes | Yes | Allow |
| ClaudeBot | Anthropic | LLM training | Yes | No | Block if anti-training |
| Claude-User | Anthropic | User-triggered fetch from Claude | Yes | Yes | Allow |
| Claude-SearchBot | Anthropic | Claude citation index | Yes | Yes | Allow |
| Google-Extended | Google | Token for opting out of Gemini training | Yes (token only) | None directly | Disallow if anti-training |
| Googlebot | Google | Google Search index (also informs AI Overviews) | Yes | Yes | Allow |
| PerplexityBot | Perplexity | Citation index | Yes | Yes | Allow |
| Perplexity-User | Perplexity | User-triggered fetch | May ignore robots.txt | Yes | Allow, or block at the CDN |
| Applebot | Apple | Siri / Spotlight search | Yes | Yes | Allow |
| Applebot-Extended | Apple | Token for opting out of Apple Intelligence training | Yes (token only) | None directly | Disallow if anti-training |
| Bytespider | ByteDance | Aggressive training crawl | No | Minimal | Block at server or CDN |
| CCBot | Common Crawl | Open dataset used by many AI vendors | Yes | No | Block if anti-training |
| Meta-ExternalAgent | Meta | LLM training | Yes (with some delay) | No | Block if anti-training |
| Amazonbot | Amazon | Alexa and AI training | Yes | No direct | Block if anti-training |
| cohere-ai | Cohere | LLM training | Yes | No | Block if anti-training |
| MistralAI-User | Mistral | User-triggered fetch | Yes | Yes | Allow |
| DuckAssistBot | DuckDuckGo | DuckAssist citation | Yes | Yes | Allow |
Two distinctions in this table do more work than any single block rule.
Training versus retrieval. Most vendors run separate crawlers for the two jobs. GPTBot and OAI-SearchBot are different bots from the same company; blocking the training one has zero effect on the search one, and the search one is where citation traffic comes from. The same split holds for Anthropic and is the reason a "block all AI bots" rule usually blocks the wrong half.
Crawler versus token. Google-Extended and Applebot-Extended are not crawlers. They are robots.txt tokens. Disallowing Google-Extended does not block Googlebot; it just opts your content out of Gemini training while Google Search keeps indexing you. The same applies to Apple. These are the only two entries in the table where "block" means "do not train on" and not "do not crawl".
Vendor reference: official bot docs and IP sources
This is the canonical lookup. Every operator who blocks at the server or CDN layer should be working from the vendor's own published User-Agent list and IP source, not a third-party summary that goes stale the moment a vendor rotates a CIDR.
| Bot / User-Agent | Vendor | Purpose | Respects robots.txt | Official docs | IP source |
|---|---|---|---|---|---|
| GPTBot, OAI-SearchBot, ChatGPT-User | OpenAI | Training, search index, user fetch | Yes | platform.openai.com/docs/bots | gptbot.json, searchbot.json, chatgpt-user.json |
| ClaudeBot, Claude-User, Claude-SearchBot | Anthropic | Training, user fetch, citation index | Yes | support.claude.com (crawler article) | claude.com/crawling/bots.json |
| PerplexityBot, Perplexity-User | Perplexity | Citation index, user fetch | Yes for PerplexityBot, no for Perplexity-User (per vendor) | docs.perplexity.ai/guides/bots | perplexitybot.json, perplexity-user.json |
| Googlebot, Google-Extended, GoogleOther | Google | Search index, training opt-out token, generic | Yes | developers.google.com (common crawlers) | common-crawlers.json, reverse DNS *.googlebot.com |
| CCBot | Common Crawl | Open dataset (consumed by many AI vendors) | Yes | commoncrawl.org/ccbot | index.commoncrawl.org/ccbot.json, reverse DNS |
| Applebot, Applebot-Extended | Apple | Siri / Spotlight, training opt-out token | Yes | support.apple.com/en-us/119829 | Applebot IP CIDR JSON referenced on the support page, plus reverse DNS |
| bingbot | Microsoft | Bing index (and Copilot grounding) | Yes | bing.com/webmasters (which crawlers), verify bingbot guide | No published JSON. Reverse DNS to *.search.msn.com, then forward-confirm |
| Meta-ExternalAgent, Meta-ExternalFetcher, FacebookExternalHit | Meta | Training, user-triggered fetch, link unfurl | Partial (Meta documents 24h cache; some agents skip) | developers.facebook.com (sharing bot) | Not published on the bot page. Meta refers operators to its general IP allow-list documentation |
| Bytespider | ByteDance | Aggressive training crawl | No (widely reported to ignore) | No canonical vendor page exists. Third-party references like darkvisitors.com/agents/bytespider are the best available | None published. Block at server or CDN by User-Agent |
| DuckAssistBot, DuckDuckBot | DuckDuckGo | DuckAssist citation, search index | Yes | duckduckgo.com/duckduckbot | IP list published on the page and as JSON via the linked endpoint |
| Amazonbot, Amzn-SearchBot, Amzn-User | Amazon | Alexa, search index, user fetch | Yes | developer.amazon.com/amazonbot | Per-bot IP lists linked from the Amazonbot page |
Two notes before you copy any of these into a config.
Vendor IP lists change. Re-pull the JSON at deploy time from a small build step or cron job. Do not hardcode IPs from a copy-paste, or your allow-list will silently drift.
Bytespider and the handful of others that ignore robots.txt need enforcement at the server or CDN layer. See the Nginx, Apache, and Cloudflare sections below for the actual rules.
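As a rough illustration of working from a vendor IP list rather than hardcoded addresses, the sketch below parses a list and checks whether a client IP falls inside it. The JSON shape (a "prefixes" array with "ipv4Prefix" / "ipv6Prefix" keys) is an assumption, so check the actual file you download before relying on it. The function names are hypothetical, and the sample addresses are reserved documentation ranges, not real crawler IPs.

```python
import ipaddress
import json

# Made-up sample in the ASSUMED shape of a vendor IP list.
# The ranges are reserved documentation blocks, not real crawler IPs.
SAMPLE_IP_LIST = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_prefixes(doc: str) -> list:
    """Return the networks listed in a vendor IP list (assumed shape)."""
    networks = []
    for entry in json.loads(doc).get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_networks(ip: str, networks) -> bool:
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

In a build step, the string passed to load_prefixes would come from a fresh download of the vendor's file, so the allow-list is regenerated on every deploy instead of drifting.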
Block AI bots with robots.txt
robots.txt is the first and easiest layer, and for most well-behaved bots, the only one you need. The file lives at the document root, served from /robots.txt. Each User-Agent block applies to one bot; Disallow: / means "do not crawl anything".
A minimal "block all training, allow search and citation" robots.txt:
# Training crawlers: block
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
# Training opt-out tokens (these are not crawlers)
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Search / citation / user-triggered: allow
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Applebot
Allow: /
User-agent: MistralAI-User
Allow: /
User-agent: DuckAssistBot
Allow: /
# Everything else: default
User-agent: *
Allow: /
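If the bot list changes often, a file like the one above could be generated from a small policy table instead of edited by hand. This is a minimal sketch under that assumption; the POLICY mapping and render_robots function are hypothetical names, and the output mirrors the per-bot groups shown above.

```python
# Hypothetical policy table: bot token -> True to block, False to allow.
POLICY = {
    "GPTBot": True,
    "ClaudeBot": True,
    "CCBot": True,
    "Google-Extended": True,
    "OAI-SearchBot": False,
    "PerplexityBot": False,
}


def render_robots(policy: dict) -> str:
    """Render one User-agent group per bot, then a permissive fallback."""
    lines = []
    for agent, blocked in policy.items():
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /" if blocked else "Allow: /")
        lines.append("")  # blank line between groups for readability
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_robots(POLICY), end="")
```

Keeping the policy in one table also makes it easier to reuse the same list for server-level rules.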
A few rules of the road.
- Order does not matter for robots.txt. The most specific User-Agent block wins, then the * fallback.
- Case sensitivity: the User-Agent name is case-insensitive in practice. Match the vendor's documented spelling anyway.
- Per-path control works the same way: Disallow: /private/ blocks one folder. You can mix and match.
- Sitemap and Host directives go at the bottom, outside any per-agent block.
- Test, do not assume. A typo in a User-Agent name silently disables the rule. The "Verify the blocks actually work" section below covers the checks.
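One cheap way to test a draft file before deploying it is Python's standard-library parser. The sketch below is only an approximation: urllib.robotparser has its own matching behavior and is not a full RFC 9309 implementation, so treat its answer as a sanity check and not as proof of how a given vendor's crawler will read the file. The helper name is_allowed and the sample files are illustrative.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


GOOD = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

# Same file with the bot name misspelled: the rule no longer applies.
TYPO = GOOD.replace("GPTBot", "GTPBot")

if __name__ == "__main__":
    for label, text in (("good", GOOD), ("typo", TYPO)):
        print(label, "GPTBot allowed:", is_allowed(text, "GPTBot"))
```

The TYPO case shows the failure mode from the last bullet: the misspelled group matches nothing, so GPTBot falls through to the * group and is allowed.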
For Next.js sites, robots.ts (in app/) generates the file from code. Add the per-agent rules there in TypeScript. For static sites, put a literal robots.txt in public/. The two approaches produce the same output.
If you would rather click than type, the DNS Checker robots.txt tool builds the file for you. Pick the bots you want to block, switch the toggle for each, copy the live preview into robots.txt. The Analyzer tab also validates an existing file against RFC 9309 and runs a per-bot access check, which is the cheapest way to confirm a block is actually in effect before you wait on log data.

Per-bot robots.txt snippets
Standalone snippets for when you only want to block one specific crawler. Each one is the entire block needed; drop it into the file and the rest of your robots.txt is unaffected.
# Block GPTBot (OpenAI training)
User-agent: GPTBot
Disallow: /
# Block ClaudeBot (Anthropic training)
User-agent: ClaudeBot
Disallow: /
# Opt out of Gemini training (does not block Googlebot)
User-agent: Google-Extended
Disallow: /
# Opt out of Apple Intelligence training (does not block Applebot)
User-agent: Applebot-Extended
Disallow: /
# Block PerplexityBot (the citation crawler; user fetches still come via Perplexity-User)
User-agent: PerplexityBot
Disallow: /
# Block Bytespider (ByteDance / TikTok training). Note: Bytespider ignores
# robots.txt in practice. This line is for completeness; pair with a
# server-level block.
User-agent: Bytespider
Disallow: /
# Block Common Crawl (CCBot). Common Crawl publishes an open dataset that
# many AI vendors train on, so blocking CCBot is one of the higher-leverage
# anti-training moves.
User-agent: CCBot
Disallow: /
# Block Meta-ExternalAgent (Meta AI training)
User-agent: Meta-ExternalAgent
Disallow: /
# Block Amazonbot (Alexa and Amazon AI)
User-agent: Amazonbot
Disallow: /
# Block cohere-ai (Cohere training)
User-agent: cohere-ai
Disallow: /
Block AI bots with Nginx
robots.txt is a polite request; an Nginx rule is a closed door. Use it for the bots that ignore robots.txt (Bytespider is the famous case), for cases where you want to be sure a block is enforced, and for path-level rules that go beyond what robots.txt cleanly expresses.
Drop this into your server block. Nginx accepts the if directive only in the server and location contexts, so it cannot go directly in the http block:
# Block AI training crawlers at the web server.
if ($http_user_agent ~* "GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent|anthropic-ai|cohere-ai|Amazonbot") {
    return 403;
}
Notes:
- ~* is a case-insensitive regex match. The | separates alternatives. Add or remove bots as your policy changes.
- return 403 is the cleanest response. Some operators prefer 444 (a non-standard Nginx code that closes the connection without sending any response).
- Avoid putting this inside a location block that is overridden elsewhere; site-wide rules belong at the server level.
- if in Nginx has well-documented quirks inside location blocks, but for a simple server-level UA check that only returns a status it is the standard, correct approach.
A more robust shape that scales as your bot list grows uses a map:
map $http_user_agent $ai_bot {
    default 0;
    "~*GPTBot" 1;
    "~*ClaudeBot" 1;
    "~*CCBot" 1;
    "~*Bytespider" 1;
    "~*Meta-ExternalAgent" 1;
    "~*anthropic-ai" 1;
    "~*cohere-ai" 1;
    "~*Amazonbot" 1;
}
server {
    # ... your normal config ...
    if ($ai_bot) {
        return 403;
    }
}
The map block goes in the http context. Adding a bot is one line, and the per-server if stays trivial.
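Before deploying a User-Agent pattern, it can help to run it against a few sample strings and confirm it catches the training crawlers without also catching the citation bots you want to keep. The sketch below approximates the Nginx ~* match with Python's re module; the two engines differ in general, but for a plain alternation like this one the behavior should be the same. Apart from the GPTBot string, the sample User-Agents are illustrative, not copied from vendor documentation.

```python
import re

# Same alternation as the Nginx rule; IGNORECASE mirrors the ~* operator.
BLOCK_PATTERN = re.compile(
    r"GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent"
    r"|anthropic-ai|cohere-ai|Amazonbot",
    re.IGNORECASE,
)


def would_block(user_agent: str) -> bool:
    """True if the pattern matches anywhere in the User-Agent string."""
    return BLOCK_PATTERN.search(user_agent) is not None


# Illustrative User-Agent strings (only the GPTBot one appears in this article).
SAMPLES = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "bytespider",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; Claude-User/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
]

if __name__ == "__main__":
    for ua in SAMPLES:
        print("BLOCK" if would_block(ua) else "allow", ua)
```

Because the match is an unanchored substring search, a short token can catch more than intended, which is worth checking whenever a new name is added to the list.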
Block AI bots with Apache
The Apache equivalent lives in .htaccess (or the main vhost). There are two clean shapes, depending on whether you prefer mod_rewrite or mod_setenvif with the Require directive (provided by mod_authz_core). Both ship with Apache 2.4, though mod_rewrite is not always enabled by default.
mod_rewrite (drop this into .htaccess at the document root):
RewriteEngine On
# Match any of the AI training User-Agents and return 403.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent|anthropic-ai|cohere-ai|Amazonbot) [NC]
RewriteRule ^ - [F,L]
- [NC] is case-insensitive matching. [F] returns 403 Forbidden. [L] stops processing further rules.
- The ^ and - mean "match anything, do nothing else"; the rule's only effect is the F flag.
mod_setenvif plus Require, which some operators prefer because it keeps the rule outside the rewrite pipeline:
SetEnvIfNoCase User-Agent "GPTBot" ai_bot
SetEnvIfNoCase User-Agent "ClaudeBot" ai_bot
SetEnvIfNoCase User-Agent "CCBot" ai_bot
SetEnvIfNoCase User-Agent "Bytespider" ai_bot
SetEnvIfNoCase User-Agent "Meta-ExternalAgent" ai_bot
SetEnvIfNoCase User-Agent "anthropic-ai" ai_bot
SetEnvIfNoCase User-Agent "cohere-ai" ai_bot
SetEnvIfNoCase User-Agent "Amazonbot" ai_bot
<RequireAll>
    Require all granted
    Require not env ai_bot
</RequireAll>
This produces 403 for any request whose User-Agent contains one of the listed substrings.
Block AI bots with Cloudflare
When the site sits behind Cloudflare, the highest-leverage layer is Cloudflare itself. Requests are judged at the edge, blocked ones never reach your origin, and Cloudflare keeps the bot list updated for you.
Cloudflare ships a managed "Block AI bots" feature in two related places.
Managed robots.txt (per-zone setting). Cloudflare can append AI-bot disallow rules to your robots.txt automatically. Find it under Security → Settings → Manage robots.txt (or the equivalent in newer dashboards), toggle it on, and Cloudflare injects the appropriate blocks based on its bot directory. This is convenient and mostly correct, but it relies on the bot honoring robots.txt.
Block AI Bots WAF rule. Cloudflare exposes a one-click WAF rule that blocks known AI crawlers at the request level, regardless of robots.txt. Go to Security → Bots → Configure and enable the "Block AI bots" toggle. This is the right tool for Bytespider and for any case where you want enforcement rather than a request.
For a custom WAF rule (if you want path exceptions or fine control), build it from Security → WAF → Custom rules:
(http.user_agent contains "Bytespider") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "Amazonbot")
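Action: Block. Cloudflare matches the string anywhere in the User-Agent, which is what you want here. The contains operator is case-sensitive, so if you want case-insensitive matching, compare against lower(http.user_agent) and write the bot names in lowercase.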
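If the same bot list feeds several layers, the expression above could be generated instead of typed by hand. The sketch below simply joins one contains clause per bot name; build_expression is a hypothetical helper, and the output is meant to be pasted into the custom rule's expression editor, not sent to any Cloudflare API.

```python
def build_expression(bots) -> str:
    """Join one 'contains' clause per bot name with 'or'."""
    clauses = []
    for bot in bots:
        if '"' in bot or "\\" in bot:
            # Refuse names that would break out of the quoted string.
            raise ValueError(f"unsafe bot name: {bot!r}")
        clauses.append(f'(http.user_agent contains "{bot}")')
    return " or\n".join(clauses)


if __name__ == "__main__":
    print(build_expression(["Bytespider", "GPTBot", "ClaudeBot", "CCBot"]))
```

The quote check is a small guard so that a stray character in the list cannot change the meaning of the rule.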
One caution. Cloudflare's "Block AI bots" includes the citation bots by default in some plans, not just the training ones. If you want AI Overviews citations or ChatGPT search to keep finding you, audit the managed list and exclude ChatGPT-User, OAI-SearchBot, PerplexityBot, Claude-SearchBot, and Claude-User from the block.
Block AI bots in WordPress
WordPress sits on top of Apache or Nginx, so the rules in those sections apply. Two WordPress-specific paths are also useful.
Edit robots.txt from WordPress. WordPress generates a virtual robots.txt at /robots.txt if no static file exists. You can replace it with a real file in the document root, or you can filter the generated one from the theme or a small plugin:
add_filter( 'robots_txt', function ( $output, $public ) {
    if ( ! $public ) {
        // Site is set to "Discourage search engines"; don't add more.
        return $output;
    }
    // Rules to append after whatever WordPress core already emits.
    $extra = <<<RULES
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
RULES;
    // Keep the appended rules on their own lines, separate from core output.
    return $output . "\n" . $extra . "\n";
}, 10, 2 );
Drop that into mu-plugins/block-ai-bots.php (the file needs an opening <?php tag), or into your theme's functions.php. The filter appends your rules to whatever the core code already emits.
SEO plugins handle this too. Yoast SEO and Rank Math both expose a "robots.txt editor" in their dashboards. If you already run one, edit there to keep one source of truth; do not also filter robots_txt in code or you will end up with duplicate blocks.
Plugin enforcement. A plugin like "Blackhole for Bad Bots" or a custom mu-plugin can drop requests whose User-Agent matches a list, before WordPress finishes booting. That is the WordPress equivalent of the Nginx/Apache block, and useful on managed hosts where you do not control the web-server config directly.
Verify the blocks actually work
A robots.txt rule you cannot prove is enforced is a robots.txt rule that is not enforced. Three checks.
Curl as the bot. Pretend to be the bot and see what your stack returns:
curl -A "Bytespider" -I https://example.com/
curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://example.com/
For a server- or CDN-level block, expect a 403 status line (HTTP/1.1 403 or HTTP/2 403, depending on the protocol negotiated). For a robots.txt-only block, the request still returns 200; what you are testing there is robots.txt itself.
Fetch robots.txt directly. Curl your /robots.txt and verify your blocks appear and parse correctly. AI vendor crawlers fetch this file before crawling.
curl https://example.com/robots.txt
Watch your access log. A real block shows up in the log as a 403 from the bot's IP and User-Agent. If you see 200s for Bytespider after enabling a Cloudflare rule, the rule is not catching it. Common causes: the rule scopes to a path that does not include the URL the bot is hitting, or Cloudflare is in "log only" mode for that rule.
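The access-log check can be scripted. The sketch below tallies response codes per bot from log lines, assuming the common "combined" log format; adjust the regex if your server logs differently. The function name, bot list, and sample lines are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumes the "combined" access-log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)


def tally_bot_statuses(lines, bots):
    """Count HTTP status codes per bot, matching names case-insensitively."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        status = match.group(2)
        user_agent = match.group(3).lower()
        for bot in bots:
            if bot.lower() in user_agent:
                counts[bot][status] += 1
    return {bot: dict(statuses) for bot, statuses in counts.items()}


# Made-up sample lines using reserved documentation IP ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2025:13:55:36 +0000] "GET / HTTP/1.1" 403 153 "-" "Bytespider"',
    '203.0.113.8 - - [10/Oct/2025:13:55:40 +0000] "GET /post HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '198.51.100.4 - - [10/Oct/2025:13:56:02 +0000] "GET / HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    print(tally_bot_statuses(SAMPLE_LOG, ["Bytespider", "GPTBot"]))
```

Any 200 counted against a bot that should be blocked at the server or CDN is the signal to go back and check the rule's scope.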
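For Google's tokens, start with the robots.txt report in Search Console, which shows the robots.txt file Google last fetched for the site along with any parse errors or warnings. Google retired its older interactive robots.txt Tester, so to check a specific token (Google-Extended, Googlebot) against a URL, run the file through a parser that follows RFC 9309. The same approach works for any bot whose vendor does not publish a tester.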
The bots that ignore robots.txt
Some bots fetch robots.txt and then crawl anyway. This is not a robots.txt bug; robots.txt is a request, not an enforcement mechanism. There are three categories worth knowing.
Documented non-compliant. Bytespider is the well-known case. Independent measurements have repeatedly caught it fetching content from paths it was explicitly told to skip. Treat any robots.txt block on Bytespider as a hint only and enforce at the server or CDN.
User-triggered fetches. Several vendors run a separate crawler for "the user just asked the model to read this URL" cases. Perplexity's Perplexity-User, ChatGPT's ChatGPT-User, and Claude's Claude-User are the main examples. Vendor policy varies: most of them say they respect robots.txt for these too, but Perplexity has explicitly stated Perplexity-User may ignore the file because the user, not the bot, initiated the fetch. If you want a hard block, enforce at the server.
Spoofed User-Agents. Some scrapers send a real browser's User-Agent string and crawl as if they were a person. robots.txt cannot help here, and neither can any User-Agent-only rule. The defense is rate limiting, behavior analysis, and Cloudflare-style managed-bots scoring, which look at IP reputation and request patterns rather than the UA header alone.
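To make the rate-limiting idea concrete, here is a minimal in-memory sketch of a sliding-window limiter keyed by client IP. It is an illustration of the technique only, not how Cloudflare or any particular server implements it; the class name and thresholds are hypothetical, and a real deployment would normally use the rate-limiting features of the web server or CDN.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent timestamps

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject without recording the hit
        hits.append(now)
        return True
```

Because the limiter looks only at request volume per IP, it works the same whether the client sends a bot User-Agent or a spoofed browser one.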
The takeaway is the same as the opening: layer the controls. robots.txt covers most of the population in one cheap step. Server or CDN rules cover the bots that ignore the polite request. The two together cost very little and stop everything except spoofers, which need their own treatment.





