Back around 2023, when AI started becoming a thing, I was already managing (and still manage) a portfolio of large sites at the company I work for. Thousands of pages each, thousands of daily visitors, every site on its own dedicated box. One week I started seeing the servers get slow at unpredictable times, even though these were dedicated boxes built for far more traffic than any one site was throwing at them. The pattern in the access logs was the same on every site: AI bots, hitting around the clock, peak and off-peak, sustained, every day. Nothing about it felt accidental.
The reading at the time was simple. AI vendors were scraping our content, training models on it, and the only downstream artifact would be a chatbot answer somewhere with no link, no attribution, no traffic back. The company decided to block every AI bot at the robots.txt level. I made the same call on my personal sites. That decision is what prompted the original How to Block AI Bots reference I wrote, first published in April 2024.
A few months later the company direction shifted: unblock the citation-class bots so search engines and chat-style search could consume our content. Around that time, while researching some of my own content on ChatGPT in my own niches, I realised it was quoting my original articles back to me, citing the links to my site. At that point I decided to remove all the blocks I had added in the robots.txt on my personal sites too, so the content could be cited and referenced from chatbots as well.
So this piece is not an argument that you should always block, or always allow. It is the decision framework I wish I'd had on day one, with the practical robots.txt posture for whichever way you decide. The "should I block AI bots?" question used to be one question because there was effectively one AI bot. It is two questions now: do you mind being trained on, and are you willing to give up citation traffic to enforce it. Most sites get the answer wrong because they treat them as one.
This article is the strategy half. The implementation half, the actual robots.txt, Nginx, Apache, Cloudflare, and WordPress rules, lives in How to Block AI Bots.
The short answer
For most sites, do not block the citation crawlers, and decide separately whether to block the training crawlers. Citation crawlers (OpenAI's OAI-SearchBot, Perplexity's PerplexityBot, Anthropic's Claude-SearchBot, and their user-triggered cousins) are the bots that put your URL into ChatGPT, Perplexity, and Claude; Google AI Overviews depends on plain Googlebot in the same way. They send traffic. Training crawlers (GPTBot, ClaudeBot, Bytespider, CCBot, Meta-ExternalAgent) do not put you anywhere users see; they feed the next model. Blocking the citation crawlers costs you visibility in the channel that is growing fastest. Blocking the training crawlers costs you almost nothing in traffic terms; whether to do it is a values call, not a business call.
There are exceptions in both directions. Some sites should block everything. A handful should genuinely allow everything. The rest of this article is the framework for figuring out where you sit.
Update: 2026, AI Overviews changed the stakes
By 2026 this question matters far more than it did in 2023 or 2024, because the surface where your content gets discovered has shifted. Google's Search Generative Experience (SGE), now widely referred to as AI Overviews, is Google's AI-powered search feature. Instead of just returning a list of blue links, it uses advanced language models (Gemini) to synthesize a multi-source answer to a query on the spot, and cites the sources it drew from at the very top of the results page.
The practical consequence is direct. If your content cannot be consumed by the relevant crawlers, you do not appear in the AI Overview block at the top of the page, and you are not cited as a source there. Sites that do allow consumption can end up pulling more traffic from that block than they used to get from the traditional blue-link results sitting underneath it. The position that used to be "rank in the top three" is now "be one of the cited sources in the Overview".
The same dynamic plays out across the chat-style search products. ChatGPT, Claude, Perplexity, and the other LLM search experiences cite articles in their answers. For an article of yours to be cited in any of those answers, the corresponding crawler has to be allowed to consume it. Block the crawler and you remove yourself from the citation pool entirely, regardless of how good the article is.
None of this changes the values question about training. The training crawlers are still a separate decision from the citation crawlers, and the framework in the rest of this article still applies. What 2026 changed is the cost side of the citation calculation: blocking the citation bots in 2024 cost you a little visibility in a small new channel; blocking them in 2026 costs you visibility in what is rapidly becoming the primary surface for search itself.
I am not arguing here that you should or should not allow them. I am laying out what is at stake in 2026 so the decision is informed. The reasons you might still want to block are below; the reasons letting them consume now carries real upside are above.
Why people block AI bots
When I went back and audited why the original block decision felt obvious at the time, four reasons covered almost all of it. Naming them clearly is what makes the rest of the decision tree work.
1. Anti-training principle. "I do not want my writing used to train a commercial model without permission, full stop." This is a values position and it deserves a values answer. If that is your reason, the cost of acting on it is low: blocking the training crawlers in robots.txt has essentially no visibility downside, and the major ones (GPTBot, ClaudeBot, CCBot) respect the file. Do it.
2. Bandwidth and cost. Aggressive AI crawlers can hit a site harder than the old generation of search bots, and a handful (Bytespider is the famous one) crawl heavily and ignore polite limits. If your hosting bill or your origin's load is the actual problem, the right fix is at the CDN or web server, not robots.txt. The "How to Block AI Bots" implementation guide has the Cloudflare and Nginx patterns.
3. Real-user performance during the crawl. A serious AI crawl is not a polite trickle. It is hundreds of URLs in parallel, from many IPs, sustained for hours. Every one of those requests goes through the same origin your readers do: the same PHP-FPM pool, the same database connections, the same cache backend. The URLs bots reach for tend to be the long tail you almost never serve, so they miss the page cache and force a full render, which is the most expensive path you have.
While that is happening, real visitors land in the queue behind it. TTFB stretches, some requests time out, and the site feels slow even when your dashboard looks fine, because average response time hides it. The pain shows up in p95 and p99. The hosting-bill argument in bullet 2 is a separate concern: bandwidth and CPU can be bought, but the perception of a slow site loses you visitors. If this is your problem, robots.txt will not save you from the worst offenders. Rate-limit or block at the CDN or web-server layer using the patterns in the "How to Block AI Bots" implementation guide.
4. Control of how content surfaces. A subset of operators want to be in Google but not in ChatGPT, or in Perplexity but not in Claude. That is a real preference and the tools support it: each vendor has its own User-Agent and you can pick which to allow. The cost is operational complexity, and the benefit is being able to express a more nuanced policy than "allow all" or "block all".
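Reason 3 above leans on the claim that average response time hides crawl pain while p95 and p99 expose it. A minimal Python sketch of that arithmetic follows. The sample numbers are invented for illustration (92 cached responses at 80 ms, 8 uncached renders at 3000 ms), and `latency_summary` is a hypothetical helper name, not something from a monitoring product.

```python
import statistics


def latency_summary(samples_ms):
    """Return mean, median, p95 and p99 for response times in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "median": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical sample: most requests hit the page cache, a few queue
# behind a crawl and pay for a full uncached render.
samples = [80.0] * 92 + [3000.0] * 8
summary = latency_summary(samples)

for name, value in summary.items():
    print(f"{name}: {value:.1f} ms")
```

In this made-up sample the mean stays in the low hundreds of milliseconds while p95 and p99 sit at the full three seconds, which is the gap a dashboard built on averages will not show you.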
What is rarely a reason in practice, despite often being cited: "AI bots are stealing my content". They are not stealing it in a legal sense (the courts are still working that out and several major cases are pending), and for citation crawlers they are not even taking it permanently; they are reading it to decide whether to point users at it. The "training" complaint is the legitimate one. "Stealing" is usually a rhetorical version of the same thing.
Why blocking the citation crawlers is increasingly the wrong move
Back in 2023, AI bots were almost all training bots and blocking them all was a coherent stance. Today the citation half of the ecosystem is a real referral channel, and the math has flipped for most public-facing sites.
Citation traffic is now a measurable share. ChatGPT, Perplexity, Claude, Gemini, and Bing Copilot all surface citations for the sources they use. Click-through on those citations is non-trivial, and for several content categories (developer documentation, how-to content, comparison content) it is meaningfully bigger than the equivalent Bing referral. The AI Overviews row in Google Search Console also feeds off whether you are crawlable by Googlebot; blocking Google-Extended is fine, but blocking Googlebot to "stop AI" also opts you out of Google Search entirely.
Brand visibility in AI answers is the new search ranking. When someone asks ChatGPT "how do I generate an MD5 hash on Linux", the model cites a small set of pages. Being one of those pages now plays the role that being in the top three Google results played in 2015. Blocking the citation crawler is the AI equivalent of noindex. Most operators do not want that, even if they did want anti-training.
Most training-corpus dynamics are already locked in. Common Crawl (CCBot) has been operating openly for years and the major models trained on snapshots that predate any block you put in place today. Blocking CCBot going forward changes future corpus inclusion, not past. That does not make the block pointless (future models matter), but it does take some of the urgency out of "block today or it is too late". It is not too late, and a thoughtful block is fine, but the catastrophizing framing is wrong.
Enforcement is partial. Even with a perfectly configured robots.txt, some bots ignore it (Bytespider) and some scrapers spoof real browsers and crawl anyway. A block is a meaningful signal of intent and most well-behaved bots respect it, but treating "I blocked them" as "they cannot have it" overstates the technical reality. If you really need enforcement, you need server-level and CDN rules, and you need to accept that some leakage is irreducible.
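To make "server-level rules" concrete, here is a small Python sketch of the logic such a rule applies: match a substring in the User-Agent header and answer 403 before any expensive rendering happens. It is written as WSGI middleware purely for illustration. In practice this check belongs in Nginx, Apache, or the CDN, and the names `block_user_agents`, `demo_app`, and `status_for` are hypothetical. The User-Agent strings in the usage lines are simplified stand-ins, not the vendors' exact strings.

```python
BLOCKED_TOKENS = ("bytespider",)  # lower-case substrings; extend as needed


def block_user_agents(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the User-Agent contains a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(app, user_agent):
    """Call a WSGI app with a minimal fake request and return its status line."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/",
               "HTTP_USER_AGENT": user_agent}
    app(environ, start_response)
    return captured["status"]


guarded = block_user_agents(demo_app)
print(status_for(guarded, "Mozilla/5.0 (compatible; Bytespider)"))
print(status_for(guarded, "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
```

A check like this only catches bots that announce themselves. A scraper spoofing a browser User-Agent passes straight through, which is the irreducible leakage described above.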
Who should block everything
Two profiles fit. If you match one, block both training and citation bots.
Paywalled or proprietary content. A subscription publication whose content lives behind a paywall has nothing to gain from AI citations (the user lands on a paywall) and a lot to lose from AI summarisation (the user gets the gist without ever paying). Many of the major news publishers have settled on a "block GPTBot, block all of it" posture for exactly this reason, and several have separately negotiated paid licensing deals with OpenAI and Anthropic. If you are paid for the content directly, defaulting to block is rational.
Brand-sensitive or controlled content. Internal documentation, legal opinions, medical guidance, anything where you cannot accept the model paraphrasing your words and presenting the paraphrase as fact. The citation provides minimal protection here; the model still summarises in its own words above the citation link. If misrepresentation is a real risk, blocking the citation crawler is sometimes the only mitigation worth the effort.
Who should not block
The other tail of the distribution is publishers, indie developers, marketers, and anyone whose business model is "people find my content and use the product or service the content points to". For this audience:
Allow citation crawlers always. This is where the new visibility comes from. OAI-SearchBot, PerplexityBot, Claude-SearchBot, Applebot, MistralAI-User, DuckAssistBot: allow them all.
The training-crawler call is yours. If you do not feel strongly, allow them too. The marginal "your content gets used in training" change from one indie site is genuinely small (one drop in a corpus measured in trillions of tokens), the operational cost of allowing is zero, and there is a non-trivial chance that being widely cited in training data correlates with being well-known in the models' eventual answers, although this is hard to measure cleanly. If you do feel strongly about training, block the training bots and move on; the visibility cost is genuinely close to zero.
This blog (TechEarl) sits firmly in this camp, which is why the public robots.txt explicitly allows every legitimate AI crawler. Once I saw my own articles cited back to me in chat answers, the attribution value clearly outweighed any marginal training concern.
The middle path most sites should take
For everyone in between, the layered "block training, allow citation" stance is the right default. Concretely:
- Block in robots.txt: GPTBot, ClaudeBot, CCBot, Meta-ExternalAgent, cohere-ai, Amazonbot, Bytespider.
- Disallow (the token-only opt-outs): Google-Extended, Applebot-Extended. This keeps Googlebot and Applebot indexing you but opts your content out of Gemini and Apple Intelligence training.
- Allow: OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, Claude-User, Applebot, MistralAI-User, DuckAssistBot, Googlebot, Bingbot.
- Block at the CDN or web server: Bytespider (because it ignores robots.txt), and any aggressive scraper you can identify by IP or rate.
That posture preserves visibility, declares an explicit anti-training position, and is enforceable for the part that needs enforcement. It is what most thoughtful sites are converging on in 2026.
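As a rough sketch, the posture above can be generated and sanity-checked with nothing but the Python standard library. The token lists are copied from this section; `build_robots_txt` and `can_fetch` are hypothetical helper names, and example.com is a placeholder. Treat this as a quick self-check, not as a guarantee of how any given crawler will read the file.

```python
from urllib.robotparser import RobotFileParser

# Tokens copied from the lists above; verify them against each vendor's docs.
BLOCKED = ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent", "cohere-ai",
           "Amazonbot", "Bytespider", "Google-Extended", "Applebot-Extended"]
ALLOWED = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Perplexity-User",
           "Claude-SearchBot", "Claude-User", "Applebot", "MistralAI-User",
           "DuckAssistBot", "Googlebot", "Bingbot"]


def build_robots_txt(blocked, allowed):
    """Return robots.txt text with one group per user-agent token."""
    # Blocked groups go first on purpose; see the note after the code.
    groups = [f"User-agent: {ua}\nDisallow: /" for ua in blocked]
    groups += [f"User-agent: {ua}\nAllow: /" for ua in allowed]
    return "\n\n".join(groups) + "\n"


def can_fetch(robots_txt, user_agent, url="https://example.com/article"):
    """Ask the stdlib parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(BLOCKED, ALLOWED)
for agent in ("GPTBot", "Google-Extended", "Googlebot", "OAI-SearchBot"):
    verdict = "allowed" if can_fetch(robots_txt, agent) else "blocked"
    print(f"{agent}: {verdict}")
```

One design note: Python's `urllib.robotparser` matches user-agent tokens by substring and stops at the first matching group, so an `Applebot` group placed before `Applebot-Extended` would shadow the opt-out in this checker. Emitting the blocked groups first sidesteps that. Real crawlers apply their own matching rules, so the vendor documentation remains the authority.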
Vendor source-of-truth links
Once you have decided what to block and what to allow, implement against the vendor's own published User-Agent and IP documentation, not against a third-party summary (including this one) that may already be stale. Each vendor below maintains the canonical list of crawler names, the exact User-Agent strings, and where applicable a separate JSON file of IP ranges you can verify hits against.
- OpenAI (GPTBot for training, OAI-SearchBot for citation, ChatGPT-User for user-triggered fetches): developers.openai.com/api/docs/bots. IP ranges are published per-bot on the same page.
- Anthropic (ClaudeBot for training, Claude-SearchBot for citation, Claude-User for user-triggered fetches): support.claude.com/en/articles/8896518.
- Perplexity (PerplexityBot for citation, Perplexity-User for user-triggered fetches): docs.perplexity.ai/guides/bots. IP ranges are documented inline with WAF guidance.
- Google (Googlebot for search, Google-Extended as the Gemini training opt-out token, GoogleOther as a generic product crawler): developers.google.com/search/docs/crawling-indexing/google-common-crawlers. IPs are at google.com/static/crawling/ipranges/common-crawlers.json.
- Apple (Applebot for search, Applebot-Extended as the Apple Intelligence training opt-out token): support.apple.com/en-us/119829.
- Microsoft Bing (bingbot for search and Copilot): bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0. IPs at bingbot.json, linked from that page.
- Meta (Meta-ExternalAgent for training, Meta-ExternalFetcher for user-triggered fetches, FacebookExternalHit for link unfurls): developers.facebook.com/docs/sharing/bot.
- Common Crawl (CCBot, the open corpus feeding most commercial models): commoncrawl.org/ccbot. IPs at index.commoncrawl.org/ccbot.json.
- ByteDance (Bytespider, for TikTok / Doubao training): no official cooperation page exists. Bytespider is widely reported to ignore robots.txt, so a robots.txt rule alone is not sufficient. Block at the CDN or web-server layer by User-Agent.
- DuckDuckGo (DuckAssistBot for AI-assisted answers): duckduckgo.com/duckduckgo-help-pages/results/duckassistbot.
- Amazon (Amazonbot for training, plus Amzn-SearchBot and Amzn-User): developer.amazon.com/amazonbot.
Last verified: 2026-05-24.
The full implementation reference (complete User-Agent inventory, per-vendor IP-list JSON URLs, and a robots.txt compliance matrix showing which bots actually honour the file) lives in the sibling article How to Block AI Bots, which is kept in sync with this list.
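Verifying a hit against a published IP list comes down to a CIDR membership test. The Python sketch below assumes the JSON shape Google uses for its crawler ranges (a `prefixes` array with `ipv4Prefix` or `ipv6Prefix` keys); other vendors may publish a different shape, so check each file before reusing this. The payload is an inline sample built from documentation-only address ranges, not real crawler IPs, and the helper names are hypothetical.

```python
import ipaddress
import json

# Hypothetical payload: same shape as Google's crawler range files, but the
# networks are reserved documentation ranges, not real crawler addresses.
SAMPLE_RANGES = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
              {"ipv6Prefix": "2001:db8::/32"}]}
""")


def load_networks(payload):
    """Turn a vendor range payload into a list of ip_network objects."""
    networks = []
    for entry in payload.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip, networks):
    """Return True if ip falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    # Membership is simply False when the IP versions differ.
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_RANGES)
print(ip_in_ranges("192.0.2.44", networks))
print(ip_in_ranges("198.51.100.7", networks))
```

In a real check you would download the vendor's current file rather than hard-code it, and compare it against the client IP your server or CDN logged for a request claiming that crawler's User-Agent. A claimed User-Agent from an address outside the published ranges is a strong hint that the hit is spoofed.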
Why blocking CCBot is the highest-leverage anti-training move
One source deserves a specific note. Common Crawl publishes a free, open dataset of web pages and almost every commercial AI model has trained on some Common Crawl snapshot, either directly or indirectly. CCBot is the crawler that builds it. Blocking CCBot is therefore higher-leverage than blocking any single vendor: a single block keeps your content out of the open corpus that funnels into many models, not just one.
Common Crawl respects robots.txt, so the block is reliable. It only works going forward, though: snapshots already published continue to exist and continue to be used. If your goal is "future models are not trained on me", blocking CCBot is the single highest-leverage line you can add to robots.txt.
What blocking actually achieves and what it does not
To set expectations correctly:
- A robots.txt block does opt you out of crawling by every well-behaved vendor that reads the file. That is the large majority by volume.
- It does not stop bots that ignore robots.txt (Bytespider, some scrapers). For those, you need the web-server or CDN block.
- It does not retroactively remove your content from corpora already published. The training data already collected stays collected.
- It does not prevent users from copy-pasting your content into a model as a prompt. Nothing technical can.
- It does create a clear, machine-readable record of your stated policy, which matters legally and practically as the landscape continues to evolve.
In short: a block is a policy declaration that most legitimate bots honour, and a partial technical control on top of that. It is worth doing if you want the policy on the record, but it is not a content fortress.
What to do next
Once you have decided what your policy is, the implementation is the easy part:
- How to Block AI Bots (robots.txt, Nginx, Apache, Cloudflare, WordPress) is the complete reference with copy-paste rules for every layer.
- How to Block the Wayback Machine from Archiving Your Site covers the separate, related question of historical archiving.





