TechEarl

How to Block the Wayback Machine from Archiving Your Site

The Internet Archive plays by different rules than AI or search bots. Here is how to keep new pages out of the Wayback Machine, how to remove pages that are already archived, and what to do when robots.txt is not enough.

By Ishan Karunaratne · 11 min read

The Wayback Machine is not an AI bot, a search engine, or a competitor's scraper. It is the Internet Archive's project to save snapshots of the public web, and it plays by a different set of rules from everything else you might want to block. robots.txt works less reliably here than for other crawlers, the bot has two names (one legacy, one current), and removing a page that is already archived is a different process from preventing future archiving. This article walks both directions: keeping new content out, and getting existing snapshots taken down.

If you are looking at the broader question of AI crawlers, the Wayback Machine is genuinely separate. See Should You Block AI Bots? and How to Block AI Bots for that side.

How do I block the Wayback Machine?

The block has two parts because the problem has two parts. Going forward, add a User-agent: ia_archiver and User-agent: archive.org_bot block to your robots.txt with Disallow: /. The Wayback Machine respects this for most sites in practice, although a 2017 policy change means there is no longer an absolute guarantee. For pages already archived, send a removal request to info@archive.org from an email address on the domain in question; the Archive's policy is to honor these. For stubborn cases involving copyrighted material, the DMCA process is the escalation path. Cloudflare and server-level blocks of the User-Agent strings are a useful belt-and-braces layer; the Archive's IP ranges change over time but its User-Agent strings are stable.

What the Wayback Machine is and why people block it

The Wayback Machine (web.archive.org) is the public face of the Internet Archive's crawl. It periodically saves snapshots of pages it can reach and makes them browsable forever, so anyone can look up what a URL said on a given date in the past. For researchers, journalists, and lawyers, this is a public good. For site owners, it is occasionally a problem.

The legitimate reasons to block come down to three.

Old content you have outgrown. A draft post you published, decided was wrong, and pulled down still exists in the Archive long after it left your site. The same applies to opinion pieces you no longer agree with, an old design you have replaced, or a previous employer's logo on a now-personal domain.

Privacy and personal-data exposure. A page that briefly showed an email address, a phone number, or a name that should not have been public stays public in the Archive until you act. This is the most common reason indie sites contact info@archive.org.

Commercial control. Paywalled or subscription content that briefly appeared on the open web (a free preview window, a leak, a misconfiguration) can be cached by the Archive and stay readable indefinitely. The newspaper and publishing world has a long history of these requests.

What is not a good reason: trying to rewrite history because you do not like what was previously published. The Archive is generally cooperative with privacy and ownership concerns and generally not cooperative with reputational ones, which is the right policy and worth knowing before you write.

The bot user-agents

The Internet Archive's crawler answers to two User-Agent strings.

  • ia_archiver — the legacy name, used in robots.txt rules for decades. Many older robots.txt files still use this name and most sites still match on it. The bot fetches robots.txt and applies the per-User-Agent rule the standard way.
  • archive.org_bot — the modern name, used by the current crawl infrastructure. Some operators report seeing this UA in their logs more often than ia_archiver in 2026. Match on both to be safe.

There is no separate "AI" User-Agent for the Archive. The Wayback Machine does not train models on what it crawls; its purpose is preservation and public access. The bot is not in the AI bot list because it is not an AI bot.

The robots.txt approach (and its limits since 2017)

The standard, polite block is straightforward:

code
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /

Drop that into your robots.txt. The Archive's crawler reads the file before crawling and, in most cases, will honor the rule for future crawls.
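
Before deploying the rules, it can help to confirm that a standards-following parser reads them the way you intend. The sketch below uses Python's standard-library urllib.robotparser. The function name blocked_agents and the example.com URL are illustrative, and the result only shows how a conventional parser interprets the file, not how the Archive's crawler will actually behave.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, held in a string so the check runs offline.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False}, where True means the agent is disallowed for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # Googlebot is included as a control: it has no matching rule here.
    print(blocked_agents(ROBOTS_TXT, ["ia_archiver", "archive.org_bot", "Googlebot"]))
```

If the control agent also shows up as blocked, a wildcard rule elsewhere in your real robots.txt is probably catching more than you meant it to.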

The 2017 caveat is worth understanding. In April 2017, the Internet Archive published a policy statement saying that robots.txt was designed for search engines, not for archives, and that strict compliance was not the right policy for a historical record. As a first step, they stopped honoring robots.txt for US government and military sites for both crawling and display. They signalled that broader changes might follow, and in practice the Archive's compliance with robots.txt has become more case-by-case than absolute.

What this means for you:

  • A robots.txt block is still a strong signal and the Archive honors it for most sites in most cases.
  • A robots.txt block does not guarantee future non-crawling, and it does not retroactively remove what is already archived.
  • For a hard guarantee, the email removal path below is the official mechanism.

Removing pages already archived

The Archive's policy for removal requests is documented publicly and they actively encourage them. The process:

  1. Send a plain-text email to info@archive.org. No specific template is required, but the more specific you are, the faster the request moves.
  2. Send it from an email address on the domain whose snapshots you want removed. A request from you@example.com to remove example.com snapshots is faster to process than the same request from an arbitrary address; the on-domain email is the Archive's proof of authority.
  3. Identify exactly what you want removed. Either the specific URLs (https://web.archive.org/web/*/example.com/page-to-remove) or "all of example.com". Both are valid, but a specific list of URLs is faster to process.
  4. State the reason briefly. "Privacy: this page exposed my full name and address" is enough. The reason matters less for clearly legitimate requests but helps the team prioritize.
  5. Wait. Turnaround in 2026 is typically a few days to a couple of weeks. The Archive's team is small but responsive.

What you get back is removal of the snapshots from the Wayback Machine's public interface and, in most cases, exclusion of the domain or path from future crawls. The Archive's response email confirms the change.

A working request looks something like:

Subject: Removal request: example.com

Hi,

I am the owner of example.com (sending from this on-domain address as confirmation). Please remove all snapshots of example.com/private-page/ from the Wayback Machine and exclude that URL from future crawls. The page briefly exposed personal information that should not be archived.

Thank you.

That is the whole process for most requests.
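
If you handle these requests for several sites, the message can be assembled programmatically. This is a minimal sketch using Python's standard-library email package. The function name build_removal_request and its parameters are hypothetical, and the on-domain check is stricter than the Archive's process requires (an off-domain request is slower, not invalid). The sketch only builds the message and does not send it.

```python
from email.message import EmailMessage

ARCHIVE_CONTACT = "info@archive.org"  # address given in the steps above


def build_removal_request(sender, domain, paths, reason):
    """Build (but do not send) a plain-text removal request for the given paths."""
    # Assumption: we insist on an on-domain sender, since that is the
    # Archive's proof of authority and speeds up processing.
    sender_domain = sender.rsplit("@", 1)[-1].lower()
    if sender_domain != domain.lower():
        raise ValueError(f"{sender} is not an address on {domain}")

    targets = "\n".join(
        f"  https://web.archive.org/web/*/{domain}{path}" for path in paths
    )
    body = (
        "Hi,\n\n"
        f"I am the owner of {domain} (sending from this on-domain address as "
        "confirmation). Please remove all snapshots of the following URLs from "
        "the Wayback Machine and exclude them from future crawls:\n\n"
        f"{targets}\n\n"
        f"Reason: {reason}\n\n"
        "Thank you.\n"
    )

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ARCHIVE_CONTACT
    msg["Subject"] = f"Removal request: {domain}"
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    print(build_removal_request(
        "you@example.com", "example.com", ["/private-page/"],
        "Privacy: the page briefly exposed personal information.",
    ))
```

Sending is deliberately left out. Hand the message to whatever mail client or SMTP relay you already use, so that it really does leave from the on-domain address.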

When the email path is not enough

The Archive is cooperative for legitimate privacy and ownership requests, but it pushes back on requests that look like reputation management or attempts to erase published history. If you have a harder case, two escalations exist.

DMCA takedown. If the archived content is copyrighted material that you own and did not authorize the Archive to host, the DMCA process applies. The Archive maintains a DMCA agent (publicly listed on archive.org/about) and processes valid takedown notices according to standard DMCA rules. The classic use case is a publisher whose paywalled article ended up in the Archive after a brief misconfiguration; the DMCA notice is the formal path.

Legal counsel. For high-stakes situations (defamation claims, court-ordered redactions, ongoing litigation), routing the request through your attorney is standard. The Archive responds to legal correspondence the way most online services do.

For the vast majority of operators reading this, the plain email to info@archive.org from an on-domain address is the path. The escalations exist; they rarely need to be used.

Belt-and-braces server-level blocks

If you want to enforce the block in real time rather than wait for the Archive to honor robots.txt, you can drop requests from the Archive's User-Agent at the web server. This stops snapshots from being captured at all, not just from being displayed.

For Nginx:

nginx
if ($http_user_agent ~* "ia_archiver|archive\.org_bot") {
    return 403;
}

For Apache .htaccess:

apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (ia_archiver|archive\.org_bot) [NC]
RewriteRule ^ - [F,L]

For Cloudflare WAF custom rules, use the expression (http.user_agent contains "archive.org_bot") or (http.user_agent contains "ia_archiver") and set the rule's action to Block.

Two notes on this approach.

  • It only stops the Archive's scheduled crawler. If a person or a third-party tool submits your URL through the Archive's "Save Page Now" feature, the fetch may come from different infrastructure and carry a different User-Agent, so these rules may not match it. The email removal path remains the only way to handle "Save Page Now" abuse.
  • The Archive sometimes crawls from networks that look like ordinary fetches, particularly when a user manually triggers a save. Treat the User-Agent block as a strong reduction, not a guarantee.
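
To see whether the server-level rule is actually catching anything, you can count responses by status code for requests whose User-Agent matches the same pattern. The sketch below assumes the common "combined" access-log format. The sample lines are made up for illustration, and the function name archive_hits_by_status is hypothetical.

```python
import re
from collections import Counter

# Same pattern as the Nginx and Apache rules above, matched case-insensitively.
ARCHIVE_UA = re.compile(r"ia_archiver|archive\.org_bot", re.IGNORECASE)

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"$'
)

# Illustrative sample lines only; not real traffic.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; archive.org_bot)"',
    '203.0.113.6 - - [01/Jan/2026:00:00:02 +0000] "GET /about HTTP/1.1" 200 5120 "-" "ia_archiver"',
    '198.51.100.7 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def archive_hits_by_status(lines):
    """Count HTTP status codes for requests whose User-Agent matches the Archive pattern."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line.strip())
        if match and ARCHIVE_UA.search(match.group("ua")):
            counts[match.group("status")] += 1
    return counts


if __name__ == "__main__":
    print(dict(archive_hits_by_status(SAMPLE_LOG)))
```

Any 200 in the result means a matching request got through, which usually points to a path or virtual host the rule does not cover. Requests that arrive under a different User-Agent will not show up in this count at all.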

After the block: what to expect

Once the Archive has processed a removal request (typically a few days to a couple of weeks after you send it), the snapshots become inaccessible on web.archive.org. Requests for the affected URLs return "Page cannot be displayed due to robots.txt" or a similar message. In most cases the data is not deleted from the Archive's internal storage; it is suppressed from the public interface. For the rare cases where data deletion is required (legal orders, certain privacy regimes), state that explicitly in the request and the Archive will route it to the team that handles it.

Future crawls of the affected URLs or domain are skipped. If you change your mind later, the same email channel is the way to re-enable.
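
A rough way to spot-check the result is the Wayback Machine's availability endpoint, which returns JSON describing the closest public snapshot for a URL. The sketch below is written under assumptions: the response shape (an archived_snapshots object with an optional closest entry) is as I understand the API, the function names are hypothetical, and an empty result is a hint rather than proof that the removal has been applied. Loading the snapshot URL in a browser remains the definitive check.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest public snapshot URL from an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_url(url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the availability endpoint for url (performs a network request)."""
    query = urlencode({"url": url})
    with urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=timeout) as response:
        return closest_snapshot(json.load(response))


if __name__ == "__main__":
    snapshot = check_url("example.com/private-page/")
    print(snapshot or "No public snapshot reported")
```

Keeping the parsing in closest_snapshot separate from the network call makes it easy to run against saved responses while you wait for the Archive to process the request.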

What this does not do

  • It does not remove your content from Google's cache, Bing's cache, or any search engine's snapshots. Each of those has its own removal mechanism (Google Search Console's "Removals" tool is the right entry point for Google).
  • It does not affect mirrors of your content on other archive services (archive.today, freezepage, regional archives). Each of those is a separate request.
  • It does not remove screenshots that other users have personally saved. Anything that left your control is no longer in your control.
  • It does not unwind training-data inclusion if an AI model was already trained on the snapshots before removal. See Should You Block AI Bots? for that side of the problem.

