Blocking AI bots is the implementation half of a question with two halves. The strategic side, whether you should block them at all, is covered in "Should You Block AI Bots?"; this article is the technical reference, covering who the bots are, what each one does, and the exact rules to keep them out, at every layer from robots.txt to your CDN. (If your goal is the opposite, making your pages easy for AI agents to read and act on, that now has its own scorecard: the Lighthouse Agentic Browsing audit.)

The short version: most major AI crawlers respect robots.txt, so the first rule belongs there. A few do not, so the second rule belongs at the web server or at the CDN. And the bot you want to think hardest about is rarely the famous one but the one nobody has heard of yet, which is why the right setup is layered and verifiable, not a single line in a single file.

How do I block AI bots?

There are five places to block AI bots; the first three run in increasing order of force. First, robots.txt opts you out of every well-behaved crawler with a single Disallow: / group per bot. Second, your web server (Nginx or Apache) drops the request when the User-Agent header matches, which catches bots that ignore robots.txt but still send their real UA. Third, your CDN (Cloudflare is the common case) blocks at the edge before the request ever reaches your origin. Fourth, WordPress can layer plugin-based or theme-level filters on top. Fifth, vendor tokens like Google-Extended and Applebot-Extended, which also live in robots.txt, opt you out of model training without blocking the search crawler. Combine robots.txt for the well-behaved bots, server or CDN rules for Bytespider, and vendor tokens for Google and Apple.

Jump to:

The bots that matter
Vendor reference: official bot docs and IP sources
Block AI bots with robots.txt
Per-bot robots.txt snippets
Block AI bots with Nginx
Block AI bots with Apache
Block AI bots with Cloudflare
Block AI bots in WordPress
Verify the blocks actually work
The bots that ignore robots.txt
FAQ

The bots that matter

There are dozens of AI-related User-Agents in the wild, but a handful drive almost all the traffic and almost all the policy debate. The table below is the working set I keep updated. The "Respects robots.txt" column is the one most other guides get wrong; verified per-vendor as of mid-2026.

Bot (User-agent)	Owner	Purpose	Respects robots.txt	Sends traffic	Default action
`GPTBot`	OpenAI	LLM training	Yes	No	Block if anti-training
`ChatGPT-User`	OpenAI	User-triggered fetch from ChatGPT	Yes	Yes (when a user clicks a citation)	Allow
`OAI-SearchBot`	OpenAI	ChatGPT search index	Yes	Yes	Allow
`ClaudeBot`	Anthropic	LLM training	Yes	No	Block if anti-training
`Claude-User`	Anthropic	User-triggered fetch from Claude	Yes	Yes	Allow
`Claude-SearchBot`	Anthropic	Claude citation index	Yes	Yes	Allow
`Google-Extended`	Google	Token for opting out of Gemini training	Yes (token only)	None directly	Disallow if anti-training
`Googlebot`	Google	Google Search index (also informs AI Overviews)	Yes	Yes	Allow
`PerplexityBot`	Perplexity	Citation index	Yes	Yes	Allow
`Perplexity-User`	Perplexity	User-triggered fetch	May ignore robots.txt	Yes	Allow, or block at the CDN
`Applebot`	Apple	Siri / Spotlight search	Yes	Yes	Allow
`Applebot-Extended`	Apple	Token for opting out of Apple Intelligence training	Yes (token only)	None directly	Disallow if anti-training
`Bytespider`	ByteDance	Aggressive training crawl	No	Minimal	Block at server or CDN
`CCBot`	Common Crawl	Open dataset used by many AI vendors	Yes	No	Block if anti-training
`Meta-ExternalAgent`	Meta	LLM training	Yes (with some delay)	No	Block if anti-training
`Amazonbot`	Amazon	Alexa and AI training	Yes	No direct	Block if anti-training
`cohere-ai`	Cohere	LLM training	Yes	No	Block if anti-training
`MistralAI-User`	Mistral	User-triggered fetch	Yes	Yes	Allow
`DuckAssistBot`	DuckDuckGo	DuckAssist citation	Yes	Yes	Allow

Two distinctions in this table do more work than any single block rule.

Training versus retrieval. Most vendors run separate crawlers for the two jobs. GPTBot and OAI-SearchBot are different bots from the same company; blocking the training one has zero effect on the search one, and the search one is where citation traffic comes from. The same split holds for Anthropic and is the reason a "block all AI bots" rule usually blocks the wrong half.

Crawler versus token. Google-Extended and Applebot-Extended are not crawlers. They are robots.txt tokens. Disallowing Google-Extended does not block Googlebot; it just opts your content out of Gemini training while Google Search keeps indexing you. Same shape for Apple. These are the only two entries in the table where "block" means "do not train on" and not "do not crawl".

Vendor reference: official bot docs and IP sources

This is the canonical lookup. Every operator who blocks at the server or CDN layer should be working from the vendor's own published User-Agent list and IP source, not a third-party summary that goes stale the moment a vendor rotates a CIDR.

Bot / User-Agent	Vendor	Purpose	Respects robots.txt	Official docs	IP source
`GPTBot`, `OAI-SearchBot`, `ChatGPT-User`	OpenAI	Training, search index, user fetch	Yes	platform.openai.com/docs/bots	gptbot.json, searchbot.json, chatgpt-user.json
`ClaudeBot`, `Claude-User`, `Claude-SearchBot`	Anthropic	Training, user fetch, citation index	Yes	support.claude.com (crawler article)	claude.com/crawling/bots.json
`PerplexityBot`, `Perplexity-User`	Perplexity	Citation index, user fetch	Yes for PerplexityBot, no for Perplexity-User (per vendor)	docs.perplexity.ai/guides/bots	perplexitybot.json, perplexity-user.json
`Googlebot`, `Google-Extended`, `GoogleOther`	Google	Search index, training opt-out token, generic	Yes	developers.google.com (common crawlers)	common-crawlers.json, reverse DNS `*.googlebot.com`
`CCBot`	Common Crawl	Open dataset (consumed by many AI vendors)	Yes	commoncrawl.org/ccbot	index.commoncrawl.org/ccbot.json, reverse DNS
`Applebot`, `Applebot-Extended`	Apple	Siri / Spotlight, training opt-out token	Yes	support.apple.com/en-us/119829	Applebot IP CIDR JSON referenced on the support page, plus reverse DNS
`bingbot`	Microsoft	Bing index (and Copilot grounding)	Yes	bing.com/webmasters (which crawlers), verify bingbot guide	No published JSON. Reverse DNS to `*.search.msn.com`, then forward-confirm
`Meta-ExternalAgent`, `Meta-ExternalFetcher`, `FacebookExternalHit`	Meta	Training, user-triggered fetch, link unfurl	Partial (Meta documents 24h cache; some agents skip)	developers.facebook.com (sharing bot)	Not published on the bot page. Meta refers operators to its general IP allow-list documentation
`Bytespider`	ByteDance	Aggressive training crawl	No (widely reported to ignore)	No canonical vendor page exists. Third-party references like darkvisitors.com/agents/bytespider are the best available	None published. Block at server or CDN by User-Agent
`DuckAssistBot`, `DuckDuckBot`	DuckDuckGo	DuckAssist citation, search index	Yes	duckduckgo.com/duckduckbot	IP list published on the page and as JSON via the linked endpoint
`Amazonbot`, `Amzn-SearchBot`, `Amzn-User`	Amazon	Alexa, search index, user fetch	Yes	developer.amazon.com/amazonbot	Per-bot IP lists linked from the Amazonbot page

Two notes before you copy any of these into a config.

Vendor IP lists change. Re-pull the JSON at deploy time from a small build step or cron. Do not hardcode IPs from a copy-paste, or your allow-list will silently drift.

Bytespider and the handful of others that ignore robots.txt need enforcement at the server or CDN layer. See the Nginx, Apache, and Cloudflare sections below for the actual rules.
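As a minimal sketch of the refresh-and-check step from the first note, the Python below parses a vendor prefix file and tests whether a client address falls inside it. The JSON shape (a top-level `prefixes` array whose items carry `ipv4Prefix` or `ipv6Prefix`) is an assumption modeled on the format Google uses for its crawler ranges; confirm each vendor's actual schema before relying on it. The sample prefixes are documentation ranges, not real crawler IPs, and `load_networks` and `is_vendor_ip` are hypothetical helper names.

```python
import ipaddress
import json

# Stand-in for a freshly downloaded vendor file. The schema is an assumption;
# the prefixes are reserved documentation ranges, not real bot addresses.
SAMPLE_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(raw_json):
    """Return the CIDR blocks listed in a vendor prefix file."""
    networks = []
    for item in json.loads(raw_json).get("prefixes", []):
        cidr = item.get("ipv4Prefix") or item.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_vendor_ip(address, networks):
    """True if the address sits inside any published block."""
    ip = ipaddress.ip_address(address)
    # Only compare against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in networks)


networks = load_networks(SAMPLE_JSON)
print(is_vendor_ip("192.0.2.44", networks))    # inside the sample IPv4 block
print(is_vendor_ip("198.51.100.7", networks))  # outside every sample block
```

In a real deployment the string would come from the vendor's URL at build time, and the resulting networks would feed an allow-list or deny-list in your server or CDN config.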

Block AI bots with robots.txt

robots.txt is the first and easiest layer and, for most well-behaved bots, the only one you need. The file lives at the document root, served from /robots.txt. In the examples here, each User-agent group targets one bot (a group may also list several User-agent lines that share the same rules); Disallow: / means "do not crawl anything".

A minimal "block all training, allow search and citation" robots.txt:

code

# Training crawlers: block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Amazonbot
Disallow: /

# Training opt-out tokens (these are not crawlers)
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Search / citation / user-triggered: allow
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Applebot
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: DuckAssistBot
Allow: /

# Everything else: default
User-agent: *
Allow: /

A few rules of the road.

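Order of groups does not matter for robots.txt. A crawler obeys the most specific User-agent group that matches it and falls back to the * group only when no named group matches; the two are not merged.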
Case sensitivity: the User-Agent name is case-insensitive in practice. Match the vendor's documented spelling anyway.
Per-path control works the same way: Disallow: /private/ blocks one folder. You can mix and match.
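Sitemap lines are independent of any User-agent group; by convention they go at the bottom. Host is a non-standard directive that most crawlers ignore, so leave it out unless a crawler you care about documents it.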
Test, do not assume. A typo in a User-Agent name silently disables the rule. The "Verify the blocks actually work" section below covers the checks.
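The group-selection and typo rules above can be checked offline. The sketch below uses Python's standard-library `urllib.robotparser`, which approximates how crawlers read the file rather than fully implementing RFC 9309, so treat it as a quick sanity check. The robots.txt content and the `can_crawl` helper are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt, user_agent, url="https://example.com/article"):
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl(ROBOTS_TXT, "GPTBot"))         # named group with Disallow: /
print(can_crawl(ROBOTS_TXT, "OAI-SearchBot"))  # named group with Allow: /
print(can_crawl(ROBOTS_TXT, "UnlistedBot"))    # no named group, falls back to *

# A misspelled name never matches, so the real bot falls back to the * group.
TYPO = ROBOTS_TXT.replace("User-agent: GPTBot", "User-agent: GTPBot")
print(can_crawl(TYPO, "GPTBot"))
```

The last call is the failure mode to watch for: the file still looks like it blocks GPTBot, but the parser reports the crawl as allowed.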

For Next.js sites, robots.ts (in app/) generates the file from code. Add the per-agent rules there in TypeScript. For static sites, put a literal robots.txt in public/. The two approaches produce the same output.
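If your stack has no framework helper, the same generate-from-code idea is easy to sketch in a few lines. The Python below renders one group per bot from a policy table and then a permissive * fallback. The `POLICY` table, the `build_robots_txt` name, and the sitemap URL are all hypothetical; adapt them to your own bot list.

```python
# Hypothetical policy table: user-agent token -> "block" or "allow".
POLICY = {
    "GPTBot": "block",
    "ClaudeBot": "block",
    "Google-Extended": "block",
    "OAI-SearchBot": "allow",
    "PerplexityBot": "allow",
}


def build_robots_txt(policy, sitemap_url=None):
    """Render one group per bot, then a permissive * fallback."""
    lines = []
    for agent, action in policy.items():
        if action not in ("block", "allow"):
            raise ValueError(f"unknown action for {agent}: {action}")
        rule = "Disallow: /" if action == "block" else "Allow: /"
        lines += [f"User-agent: {agent}", rule, ""]
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


print(build_robots_txt(POLICY, sitemap_url="https://example.com/sitemap.xml"))
```

Keeping the policy in one table means the robots.txt file and any server-level block list can be generated from the same source.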

If you would rather click than type, the DNS Checker robots.txt tool builds the file for you. Pick the bots you want to block, switch the toggle for each, copy the live preview into robots.txt. The Analyzer tab also validates an existing file against RFC 9309 and runs a per-bot access check, which is the cheapest way to confirm a block is actually in effect before you wait on log data.

Figure: the DNS Checker robots.txt tool on the Generator tab, with the AI Crawlers preset expanded and Block selected for GPTBot, CCBot, and Bytespider. The live preview on the right shows the generated robots.txt with User-agent and Disallow lines for each blocked bot. The tool builds the file from a bot picker, validates an existing one, and tests per-bot access without leaving the page.

Per-bot robots.txt snippets

Standalone snippets for when you only want to block one specific crawler. Each one is the entire block needed; drop it into the file and the rest of your robots.txt is unaffected.

code

# Block GPTBot (OpenAI training)
User-agent: GPTBot
Disallow: /

code

# Block ClaudeBot (Anthropic training)
User-agent: ClaudeBot
Disallow: /

code

# Opt out of Gemini training (does not block Googlebot)
User-agent: Google-Extended
Disallow: /

code

# Opt out of Apple Intelligence training (does not block Applebot)
User-agent: Applebot-Extended
Disallow: /

code

# Block PerplexityBot (the citation crawler; user fetches still come via Perplexity-User)
User-agent: PerplexityBot
Disallow: /

code

# Block Bytespider (ByteDance / TikTok training). Note: Bytespider ignores
# robots.txt in practice. This line is for completeness; pair with a
# server-level block.
User-agent: Bytespider
Disallow: /

code

# Block Common Crawl (CCBot). Common Crawl publishes an open dataset that
# many AI vendors train on, so blocking CCBot is one of the higher-leverage
# anti-training moves.
User-agent: CCBot
Disallow: /

code

# Block Meta-ExternalAgent (Meta AI training)
User-agent: Meta-ExternalAgent
Disallow: /

code

# Block Amazonbot (Alexa and Amazon AI)
User-agent: Amazonbot
Disallow: /

code

# Block cohere-ai (Cohere training)
User-agent: cohere-ai
Disallow: /

Block AI bots with Nginx

robots.txt is a polite request; an Nginx rule is a closed door. Use it for the bots that ignore robots.txt (Bytespider is the famous case), for cases where you want to be sure a block is enforced, and for path-level rules that go beyond what robots.txt cleanly expresses.

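Drop this into your server block. Nginx accepts if only in the server and location contexts, so it cannot sit directly in http; for rules shared across several servers, use the map variant further down or an include file: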

nginx

# Block AI training crawlers at the web server.
if ($http_user_agent ~* "GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent|anthropic-ai|cohere-ai|Amazonbot") {
    return 403;
}

Notes:

~* is a case-insensitive regex match. The | separates alternatives. Add or remove bots as your policy changes.
return 403 is the cleanest response. Some operators prefer 444 (Nginx's "drop the connection" code) so no body is sent at all.
Avoid putting this inside a location block that is overridden elsewhere; site-wide rules belong at the server level.
if in Nginx has well-documented quirks inside location blocks, but for a simple top-level UA check it is the standard, correct approach.
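Before deploying a pattern like this, it can help to run it against a few User-Agent strings and confirm it catches what you intend and nothing else. The sketch below mirrors the unanchored, case-insensitive alternation in Python's `re` module; Nginx uses PCRE, so this is an approximation that holds for plain literal names like these. The sample User-Agent strings other than the GPTBot one are illustrative, and `would_block` is a hypothetical helper.

```python
import re

BLOCKED = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider",
           "Meta-ExternalAgent", "anthropic-ai", "cohere-ai", "Amazonbot"]

# Same idea as Nginx's `~*`: unanchored, case-insensitive alternation.
PATTERN = re.compile("|".join(re.escape(name) for name in BLOCKED), re.IGNORECASE)


def would_block(user_agent):
    """True if the User-Agent contains any blocked name, ignoring case."""
    return PATTERN.search(user_agent) is not None


samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "bytespider",                                   # lowercase still matches
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",  # search bot, not in the list
    "Mozilla/5.0 (compatible; Claude-User/1.0)",    # user fetch, not in the list
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0",
]

for ua in samples:
    print(would_block(ua), ua)
```

The two middle samples are the useful ones: they confirm that blocking the training crawlers does not accidentally catch the search and user-fetch agents from the same vendors.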

A more robust shape that scales as your bot list grows uses a map:

nginx

map $http_user_agent $ai_bot {
    default                 0;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*CCBot"               1;
    "~*Bytespider"          1;
    "~*Meta-ExternalAgent"  1;
    "~*anthropic-ai"        1;
    "~*cohere-ai"           1;
    "~*Amazonbot"           1;
}

server {
    # ... your normal config ...

    if ($ai_bot) {
        return 403;
    }
}

The map block goes in the http context. Adding a bot is one line, and the per-server if stays trivial.

Block AI bots with Apache

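The Apache equivalent lives in .htaccess (or the main vhost). Two clean shapes, depending on whether you prefer mod_rewrite or mod_setenvif with a Require directive. Both modules ship with Apache 2.4, though mod_rewrite is not enabled by default on every distribution, and .htaccess rules only take effect where AllowOverride permits them.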

mod_rewrite (drop this into .htaccess at the document root):

apache

RewriteEngine On

# Match any of the AI training User-Agents and return 403.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent|anthropic-ai|cohere-ai|Amazonbot) [NC]
RewriteRule ^ - [F,L]

[NC] is case-insensitive matching.
[F] returns 403 Forbidden. [L] stops processing further rules.
The ^ and - mean "match anything, do nothing else"; the rule's only effect is the F flag.

mod_setenvif plus Require, which some operators prefer because it keeps the rule outside the rewrite pipeline:

apache

SetEnvIfNoCase User-Agent "GPTBot"             ai_bot
SetEnvIfNoCase User-Agent "ClaudeBot"          ai_bot
SetEnvIfNoCase User-Agent "CCBot"              ai_bot
SetEnvIfNoCase User-Agent "Bytespider"         ai_bot
SetEnvIfNoCase User-Agent "Meta-ExternalAgent" ai_bot
SetEnvIfNoCase User-Agent "anthropic-ai"       ai_bot
SetEnvIfNoCase User-Agent "cohere-ai"          ai_bot
SetEnvIfNoCase User-Agent "Amazonbot"          ai_bot

<RequireAll>
    Require all granted
    Require not env ai_bot
</RequireAll>

This produces 403 for any request whose User-Agent contains one of the listed substrings.

Block AI bots with Cloudflare

When the site sits behind Cloudflare, the highest-leverage layer is Cloudflare itself. Requests are judged at the edge, blocked ones never reach your origin, and Cloudflare keeps the bot list updated for you.

Cloudflare ships a managed "Block AI bots" feature in two related places.

Managed robots.txt (per-zone setting). Cloudflare can append AI-bot disallow rules to your robots.txt automatically. Find it under Security → Settings → Manage robots.txt (or the equivalent in newer dashboards), toggle it on, and Cloudflare injects the appropriate blocks based on its bot directory. This is convenient and mostly correct, but it relies on the bot honoring robots.txt.

Block AI Bots WAF rule. Cloudflare exposes a one-click WAF rule that blocks known AI crawlers at the request level, regardless of robots.txt. Go to Security → Bots → Configure and enable the "Block AI bots" toggle. This is the right tool for Bytespider and for any case where you want enforcement rather than a request.

For a custom WAF rule (if you want path exceptions or fine control), build it from Security → WAF → Custom rules:

code

(http.user_agent contains "Bytespider") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "Amazonbot")

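Action: Block. The contains operator matches the string anywhere in the User-Agent, which is what you want here. It is case-sensitive, so match the vendor's documented spelling, or wrap the field in lower() and compare against a lowercase string.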
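A custom expression like the one above is tedious to maintain by hand once the list grows. The sketch below generates it from a list of names, using the case-insensitive form, and assumes the `lower()` transformation function is available to your rule. The `build_expression` helper is hypothetical; paste its output into the expression editor rather than treating it as an official Cloudflare tool.

```python
BOTS = ["Bytespider", "GPTBot", "ClaudeBot", "CCBot",
        "Meta-ExternalAgent", "Amazonbot"]


def build_expression(bots):
    """Join one case-insensitive `contains` clause per bot with `or`."""
    clauses = []
    for name in bots:
        # Refuse characters that would break out of the quoted string.
        if '"' in name or "\\" in name:
            raise ValueError(f"unsafe character in bot name: {name!r}")
        clauses.append(f'(lower(http.user_agent) contains "{name.lower()}")')
    return " or\n".join(clauses)


print(build_expression(BOTS))
```

Generating the expression from the same list that feeds your Nginx or Apache rules keeps the layers from drifting apart.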

One caution. Cloudflare's "Block AI bots" includes the citation bots by default in some plans, not just the training ones. If you want ChatGPT search, Perplexity, and Claude to keep finding and citing you, audit the managed list and exclude ChatGPT-User, OAI-SearchBot, PerplexityBot, Claude-SearchBot, and Claude-User from the block. (AI Overviews depends on Googlebot, which is a separate crawler.)

Block AI bots in WordPress

WordPress sits on top of Apache or Nginx, so the rules in those sections apply. Two WordPress-specific paths are also useful.

Edit robots.txt from WordPress. WordPress generates a virtual robots.txt at /robots.txt if no static file exists. You can replace it with a real file in the document root, or you can filter the generated one from the theme or a small plugin:

php

add_filter( 'robots_txt', function ( $output, $public ) {
    if ( ! $public ) {
        // Site is set to "Discourage search engines"; don't add more.
        return $output;
    }
    $extra = <<<RULES

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
RULES;

    return $output . $extra;
}, 10, 2 );

Drop that into wp-content/mu-plugins/block-ai-bots.php (with an opening <?php tag on the first line), or into your theme's functions.php. WordPress appends your rules to whatever the core code already emits.

SEO plugins handle this too. Yoast SEO and Rank Math both expose a "robots.txt editor" in their dashboards. If you already run one, edit there to keep one source of truth; do not also filter robots_txt in code or you will end up with duplicate blocks.

Plugin enforcement. A plugin like "Blackhole for Bad Bots" or a custom mu-plugin can drop requests whose User-Agent matches a list, before WordPress finishes booting. That is the WordPress equivalent of the Nginx/Apache block, and useful on managed hosts where you do not control the web-server config directly.

Verify the blocks actually work

A robots.txt rule you cannot prove is enforced is a robots.txt rule that is not enforced. Three checks.

Curl as the bot. Pretend to be the bot and see what your stack returns:

bash

curl -A "Bytespider" -I https://example.com/
curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://example.com/

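For a server- or CDN-level block, expect a 403 status line (HTTP/1.1 403 or HTTP/2 403, depending on the protocol negotiated). For a robots.txt-only block, the request still returns 200, because robots.txt is advisory; what you are testing there is the robots.txt file itself.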
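The same check can be scripted so it runs after every deploy. This is a minimal sketch using only the standard library: `fetch_status` sends a HEAD request with a chosen User-Agent, and `audit` turns the status codes into a verdict per bot. Both names are hypothetical, and the demonstration at the bottom uses a stub instead of the network so the example runs anywhere.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=10):
    """Send a HEAD request with the given User-Agent; return the status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we are after.
        return err.code
    except (urllib.error.URLError, ConnectionError):
        # Connection dropped (for example Nginx 444) or host unreachable.
        return None


def audit(url, user_agents, fetch=fetch_status):
    """Map each User-Agent to a verdict based on the response status."""
    report = {}
    for ua in user_agents:
        status = fetch(url, ua)
        if status == 403:
            report[ua] = "blocked"
        elif status is None:
            report[ua] = "no response"
        else:
            report[ua] = f"not blocked ({status})"
    return report


# Offline demonstration: a stub stands in for the real network call.
def fake_fetch(url, user_agent):
    return 403 if "Bytespider" in user_agent else 200


print(audit("https://example.com/", ["Bytespider", "Mozilla/5.0"], fetch=fake_fetch))
```

To test a live site, call `audit("https://example.com/", ["Bytespider", "GPTBot"])` with your own URL and leave the `fetch` argument at its default.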

Fetch robots.txt directly. Curl your /robots.txt and verify your blocks appear and parse correctly. AI vendor crawlers fetch this file before crawling.

bash

curl https://example.com/robots.txt

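Watch your access log. A server-level block shows up in the origin log as a 403 against the bot's IP and User-Agent. A CDN-level block never reaches the origin, so look for it in the CDN's own event log (Security Events in Cloudflare) and expect the bot to disappear from the origin log. If you still see 200s for Bytespider after enabling a Cloudflare rule, the rule is not catching it. Common causes: the rule scopes to a path that does not include the URL the bot is hitting, or Cloudflare is in "log only" mode for that rule.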
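Reading the log by eye does not scale, so here is a small sketch that tallies status codes per bot. It assumes the common "combined" access log format used by default in both Nginx and Apache; the sample lines, the bot list, and the `status_counts` helper are illustrative.

```python
import re
from collections import Counter

# Combined log format: ip - user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]

SAMPLE_LOG = """\
192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 403 153 "-" "Bytespider"
192.0.2.10 - - [01/Jan/2026:00:00:02 +0000] "GET /a HTTP/1.1" 200 5120 "-" "Bytespider"
192.0.2.20 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
192.0.2.30 - - [01/Jan/2026:00:00:04 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"
"""


def status_counts(log_text, bots=BOTS):
    """Count (bot, status) pairs for log lines whose User-Agent names a bot."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        ua = match["ua"].lower()
        for bot in bots:
            if bot.lower() in ua:
                counts[(bot, match["status"])] += 1
                break
    return counts


for (bot, status), n in sorted(status_counts(SAMPLE_LOG).items()):
    print(f"{bot} {status}: {n}")
```

Any 200 next to a bot you meant to block, like the second Bytespider line in the sample, is a request that slipped past the rule.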

For Google's tokens, start with the robots.txt report in Search Console, which replaced the older robots.txt tester. It shows the robots.txt files Google found for your property, when each was last fetched, and any parse errors or warnings. It does not evaluate a rule for a specific token such as Google-Extended, so test per-token behavior with a parser that follows RFC 9309.

The bots that ignore robots.txt

Some bots fetch robots.txt and then crawl anyway. This is not a robots.txt bug; robots.txt is a request, not an enforcement mechanism. There are three categories worth knowing.

Documented non-compliant. Bytespider is the well-known case. Independent measurements have repeatedly caught it fetching content from paths it was explicitly told to skip. Treat any robots.txt block on Bytespider as a hint only and enforce at the server or CDN.

User-triggered fetches. Several vendors run a separate crawler for "the user just asked the model to read this URL" cases. Perplexity's Perplexity-User, ChatGPT's ChatGPT-User, and Claude's Claude-User are the main examples. Vendor policy varies: most of them say they respect robots.txt for these too, but Perplexity has explicitly stated Perplexity-User may ignore the file because the user, not the bot, initiated the fetch. If you want a hard block, enforce at the server.

Spoofed User-Agents. Some scrapers send a real browser's User-Agent string and crawl as if they were a person. robots.txt cannot help here, and neither can any User-Agent-only rule. The defense is rate limiting, behavior analysis, and Cloudflare-style managed-bots scoring, which look at IP reputation and request patterns rather than the UA header alone.
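To make the rate-limiting idea concrete, here is a minimal sliding-window limiter keyed by client IP. It is a sketch of the technique only: in practice this job usually belongs to the web server or CDN rather than application code, and the limit, window, and class name here are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # client id -> recent request timestamps

    def allow(self, client, now):
        hits = self._hits[client]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window=10)
decisions = [limiter.allow("198.51.100.7", t) for t in (0, 1, 2, 3, 11)]
print(decisions)  # [True, True, True, False, True]
```

The fourth request is refused because three requests already sit inside the 10-second window; by the fifth, the oldest two have expired. Because the key is the client address rather than the User-Agent, a spoofed browser string does not help the scraper.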

The takeaway is the same as the opening: layer the controls. robots.txt covers most of the population in one cheap step. Server or CDN rules cover the bots that ignore the polite request. The two together cost very little and stop everything except spoofers, which need their own treatment.

FAQ

Will blocking AI bots affect my SEO?

It depends on which bots you block. Blocking pure training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider) has no direct SEO effect, because those bots do not surface your site in any search interface. Blocking the citation and user-fetch bots (OAI-SearchBot, PerplexityBot, ChatGPT-User, Claude-SearchBot) does reduce visibility, because those are exactly the bots that put your URL into ChatGPT search, Perplexity answers, and Claude citations.

Block the training crawlers if your policy is anti-training. Allow the citation crawlers if you want AI search visibility. They are different bots.

Does blocking Google-Extended hurt my Google Search rankings?

No. Google-Extended is a token, not a separate crawler. Googlebot still indexes your site for Search and AI Overviews regardless of what you say to Google-Extended. The token only controls whether Google may use the data it has crawled to train Gemini and similar generative products. Apple's Applebot-Extended follows the same pattern relative to Applebot.

Why does Bytespider ignore robots.txt?

ByteDance has not published a clear answer, but independent measurements (notably Cloudflare's and several academic studies) have repeatedly found Bytespider fetching pages from paths it was explicitly told to skip. Whether by design or by bug, the practical result is the same: blocking Bytespider in robots.txt is unreliable, and you should enforce it at the server or CDN. Cloudflare's "Block AI bots" toggle treats Bytespider as a hard block by default.

What is the difference between blocking GPTBot and blocking OpenAI?

OpenAI runs three distinct bots and you block them independently. GPTBot is the training crawler. OAI-SearchBot builds the search index ChatGPT queries to cite sources. ChatGPT-User fetches a specific URL when a user clicks a link or asks ChatGPT to browse to it. Blocking GPTBot opts you out of training but keeps you in ChatGPT search; blocking all three opts you out entirely.

Can I block bots by IP instead of by User-Agent?

For some vendors, yes. OpenAI, Anthropic, Perplexity, and a few others publish their crawler IP ranges so you can verify a request is really from them, or block them at the network layer. IP-based blocks survive User-Agent spoofing, which is their main advantage. The downside is that the lists change and you have to keep them updated; for most sites, User-Agent rules in robots.txt plus a CDN are enough.

Does WordPress have a setting to block AI bots?

Not in core. WordPress has a "Discourage search engines" toggle in Settings → Reading, but that is a blanket, site-wide signal (a Disallow: / for User-agent: * in older versions, a noindex robots meta tag in current ones) that hits Google, Bing, and everyone else as well as the AI bots. For targeted AI-bot blocking, edit robots.txt (directly or via Yoast / Rank Math), filter robots_txt in code, or run a plugin that maintains the rules for you. The code snippet in the WordPress section above is the lightest-weight path.

How do I block the Wayback Machine?

The Internet Archive is a different problem with different mechanics. Its main crawler is archive.org_bot (and the older ia_archiver), and the Archive's relationship with robots.txt changed in 2017. For the full procedure, including retroactive removal, see How to Block the Wayback Machine from Archiving Your Site.

How to Block AI Bots (robots.txt, Nginx, Apache, Cloudflare, WordPress)

Ishan Karunaratne
