How to Block the Wayback Machine (and Remove Archived Pages)

The Wayback Machine is not an AI bot, a search engine, or a competitor's scraper. It is the Internet Archive's project to save snapshots of the public web, and it plays by different rules from the other crawlers you might want to block. robots.txt works less reliably here than it does for other crawlers, the bot has two names (one legacy, one current), and removing a page that is already archived is a different process from preventing future archiving. This article covers both directions: keeping new content out, and getting existing snapshots taken down.

If you are looking at the broader question of AI crawlers, the Wayback Machine is genuinely separate. See Should You Block AI Bots? and How to Block AI Bots for that side.

How do I block the Wayback Machine?

Going forward, add User-agent: ia_archiver and User-agent: archive.org_bot blocks to your robots.txt with Disallow: /. The Wayback Machine respects this for most sites in practice, although a 2017 policy change means there is no longer an absolute guarantee. For pages already archived, email a removal request to info@archive.org; that opens a review by the Archive's team, which states up front that it makes no guarantees about the outcome. For copyrighted material, the DMCA process is the separate escalation path. Cloudflare and server-level blocks of the User-Agent strings are a useful belt-and-braces layer; the Archive's IP ranges change over time but its User-Agent strings are stable.

What the Wayback Machine is and why people block it

The Wayback Machine (web.archive.org) is the public face of the Internet Archive's crawl. It periodically saves snapshots of pages it can reach and keeps them browsable indefinitely, so anyone can look up what a URL said on a given date in the past. For researchers, journalists, and lawyers, this is a public good. For site owners, it is occasionally a problem.

The legitimate reasons to block come down to three.

Old content you have outgrown. A draft post you published, decided was wrong, and pulled down still exists in the Archive long after it left your site. The same applies to opinion pieces you no longer agree with, an old design you have replaced, or a previous employer's logo on a now-personal domain.

Privacy and personal-data exposure. A page that briefly showed an email address, a phone number, or a name that should not have been public stays public in the Archive until you act. This is the most common reason indie sites contact info@archive.org.

Commercial control. Paywalled or subscription content that briefly appeared on the open web (a free preview window, a leak, a misconfiguration) can be cached by the Archive and stay readable indefinitely. The newspaper and publishing world has a long history of these requests.

What is not a good reason: trying to rewrite history because you do not like what was previously published. The Archive is generally cooperative with privacy and ownership concerns and generally not cooperative with reputational ones, which is the right policy and worth knowing before you write.

The bot user-agents

The Internet Archive's crawler answers to two User-Agent strings.

ia_archiver: the legacy name, inherited from the Alexa-era crawler that historically fed crawl data into the Archive and used in robots.txt rules for decades. Many older robots.txt files still use this name and most sites still match on it.
archive.org_bot: the modern name, used by the current crawl infrastructure. Some operators report seeing this UA in their logs more often than ia_archiver in 2026.

Under either name, the bot fetches robots.txt and applies the per-User-Agent rule the standard way. Match on both to be safe.
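To see which of the two names actually shows up in your own traffic, a short script can tally them. The following is a minimal sketch, assuming you have already extracted the User-Agent field from your access logs into a list of strings; the function names are hypothetical, and only the two tokens come from this article.

```python
from collections import Counter
from typing import Iterable, Optional

# The two tokens named in this article. Matching is done case-insensitively.
ARCHIVE_UA_TOKENS = ("ia_archiver", "archive.org_bot")


def archive_token(user_agent: str) -> Optional[str]:
    """Return the Archive token contained in a User-Agent string, or None."""
    lowered = user_agent.lower()
    for token in ARCHIVE_UA_TOKENS:
        if token in lowered:
            return token
    return None


def count_archive_hits(user_agents: Iterable[str]) -> Counter:
    """Tally how many User-Agent strings contain each Archive token."""
    hits: Counter = Counter()
    for user_agent in user_agents:
        token = archive_token(user_agent)
        if token is not None:
            hits[token] += 1
    return hits
```

Feeding it a day of User-Agent values gives a quick picture of which name is active on your site, which is useful before deciding whether a rule that matches only one of them is leaving a gap.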

There is no separate "AI" User-Agent for the Archive. The Wayback Machine does not train models on what it crawls; its purpose is preservation and public access. The bot is not in the AI bot list because it is not an AI bot.

The robots.txt approach (and its limits since 2017)

The standard, polite block is straightforward:

code

User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /

Drop that into your robots.txt. The Archive's crawler reads the file before crawling and, in most cases, will honor the rule for future crawls.
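Before deploying, it can be worth checking that the rules parse the way you intend. The sketch below uses Python's standard urllib.robotparser on the same rules; the helper name blocked_agents and the example.com URL are illustrative. It only shows that the file is well-formed under Python's reading of robots.txt, not how the Archive's crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, held in a string for checking.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that the given robots.txt text disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]
```

Calling blocked_agents(ROBOTS_TXT, ["ia_archiver", "archive.org_bot", "Googlebot"]) should list the two Archive names and leave Googlebot out, which confirms that the block is scoped to the Archive and does not spill over to search crawlers.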

The 2017 caveat is worth understanding. In April 2017, the Internet Archive published a policy statement saying that robots.txt was designed for search engines, not for archives, and that strict compliance was not the right policy for a historical record. The statement noted that the Archive had already stopped honoring robots.txt on US government and military sites, for both crawling and display, a few months earlier, and said it was looking to apply the same approach more broadly. In practice the Archive's compliance with robots.txt has since become more case-by-case than absolute.

What this means for you:

A robots.txt block is still a strong signal and the Archive honors it for most sites in most cases.
A robots.txt block does not guarantee future non-crawling, and it does not retroactively remove what is already archived.
For anything already archived, the email request below is the official channel. It is a review, not a guaranteed takedown, but it is the only mechanism that can remove existing snapshots.

How do I remove my website from the Wayback Machine?

To remove your website from the Wayback Machine, email info@archive.org and give the URL or URLs you want excluded, the time period to exclude, and the period during which you controlled the site or account. That opens a review by the Archive's team, which states up front that it makes no guarantees about the outcome. In practice the Archive cooperates with legitimate privacy and ownership requests and pushes back on ones that look like reputation management. Removal is a separate action from a robots.txt block: it takes down snapshots that already exist on web.archive.org, whereas robots.txt only affects future crawls. The Archive does not publish a turnaround target, so allow anywhere from a few days to several weeks for the review to complete. Sending from an address on the domain itself is not required, but it makes the request easier to verify.

The Internet Archive documents the request process on its help center. What to include:

Email info@archive.org. Plain text is fine; no specific template is required.
List the URL or URLs. Either specific pages (the wildcard form https://web.archive.org/web/*/example.com/page-to-remove covers every snapshot of one URL) or the whole site ("all of example.com").
State the time period to exclude, and, where relevant, the period during which you controlled the site or account. This is how the Archive gauges that the request is legitimate.
Explain why, briefly. "Privacy: this page exposed my full name and address" is enough; clear privacy and ownership reasons are treated more readily than vague ones.
Expect a review, not an instant takedown. The Archive publishes no turnaround target and does not promise an outcome in advance, and a small team reads info@archive.org; allow a few days to several weeks.

If the request is approved, the snapshots are removed from the Wayback Machine's public interface and, in most cases, the domain or path is excluded from future crawls. The Archive's reply confirms the change.

A working request looks something like:

Subject: Exclusion request: example.com

Hi,

I am the owner of example.com and would like to request exclusion of example.com/private-page/ from the Wayback Machine, for all dates. I have controlled this site from 2019 to the present. The page briefly exposed personal information that should not be archived. Please let me know if you need anything else to process this.

Thank you.

That covers what the Archive asks for in a request.
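If you need to file several requests, or want a repeatable draft, the message can be assembled programmatically. This is a minimal sketch using Python's standard email.message module; the function name and its parameters are hypothetical, and the sender address and field values in the usage note are placeholders. It only builds the message and sends nothing.

```python
from email.message import EmailMessage


def build_exclusion_request(sender, domain, urls, exclude_period, control_period, reason):
    """Draft a plain-text exclusion request. Nothing is sent."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "info@archive.org"
    msg["Subject"] = f"Exclusion request: {domain}"

    # One URL per line so the reviewer can scan the list quickly.
    url_lines = "\n".join(f"  {url}" for url in urls)
    body = (
        "Hi,\n\n"
        f"I am the owner of {domain} and would like to request\n"
        "exclusion of the following from the Wayback Machine:\n\n"
        f"{url_lines}\n\n"
        f"Time period to exclude: {exclude_period}\n"
        f"Period I have controlled the site: {control_period}\n"
        f"Reason: {reason}\n\n"
        "Please let me know if you need anything else to process this.\n\n"
        "Thank you.\n"
    )
    msg.set_content(body)
    return msg
```

For the example above, build_exclusion_request("owner@example.com", "example.com", ["example.com/private-page/"], "all dates", "2019 to the present", "Privacy: the page briefly exposed personal information") produces a draft you can print with str(msg), review by hand, and paste into your mail client.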

"This URL has been excluded from the Wayback Machine"

If you open a snapshot and see "Sorry. This URL has been excluded from the Wayback Machine," an exclusion is already in effect: the page has been pulled from public view, usually because a site owner asked for it or because the site's robots.txt excludes the Archive. The exclusion is persistent, not a temporary error or a cache miss, so refreshing or trying again later will not bring the page back. If you own the domain and did not request it, the same info@archive.org channel is where to ask why, or to request that the snapshots be restored, again as a review with no guaranteed outcome. If you are a visitor hoping to read an excluded page, there is no workaround on the Archive's side.

When the email path is not enough

The Archive is cooperative for legitimate privacy and ownership requests, but it pushes back on requests that look like reputation management or attempts to erase published history. If you have a harder case, two escalations exist.

DMCA takedown. If the archived content is copyrighted material that you own and did not authorize the Archive to host, the DMCA process applies. The Archive maintains a DMCA agent (publicly listed on archive.org/about) and processes valid takedown notices according to standard DMCA rules. The classic use case is a publisher whose paywalled article ended up in the Archive after a brief misconfiguration; the DMCA notice is the formal path.

Legal counsel. For high-stakes situations (defamation claims, court-ordered redactions, ongoing litigation), routing the request through your attorney is standard. The Archive responds to legal correspondence the way most online services do.

For the vast majority of operators reading this, the plain email to info@archive.org from an on-domain address is the path. The escalations exist; they rarely need to be used.

Belt-and-braces server-level blocks

If you want to enforce the block in real time rather than wait for the Archive to honor robots.txt, you can reject requests carrying the Archive's User-Agent at the web server. This keeps the crawler from capturing the page in the first place, rather than only keeping an existing snapshot from being displayed.

For Nginx:

nginx

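# Place inside the relevant server { } or location { } block.
# ~* is a case-insensitive regex match on the User-Agent header.
if ($http_user_agent ~* "ia_archiver|archive\.org_bot") {
    return 403;
}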

For Apache .htaccess:

apache

# Requires mod_rewrite. [NC] makes the match case-insensitive; [F] returns 403 Forbidden.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (ia_archiver|archive\.org_bot) [NC]
RewriteRule ^ - [F,L]

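For Cloudflare WAF custom rules, match http.user_agent contains "archive.org_bot" or http.user_agent contains "ia_archiver" and choose Block. The contains operator is case-sensitive, so compare against lower(http.user_agent) if you want the same case-insensitive behavior as the Nginx and Apache rules.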

Two notes on this approach.

It only stops the Archive's own scheduled crawler. If a third-party tool submits your URL to the Archive's "Save Page Now" feature, the fetch goes through a different part of the Archive's infrastructure and the User-Agent may differ. The email removal path remains the only way to handle "Save Page Now" abuse.
The Archive sometimes fetches from networks that look like ordinary visitors, particularly when a user manually triggers a save. Treat the User-Agent block as a strong reduction, not a guarantee.
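If your site runs behind a Python application server rather than behind Nginx or Apache rules you control, the same User-Agent match can be applied in the application itself. The following is a minimal sketch written as WSGI middleware; the class name BlockArchiveCrawler and the demo_app stand-in are hypothetical, and the pattern is the same one used in the server rules above. The same caveats apply: it only catches requests that announce one of the two names.

```python
import re

# Same pattern as the Nginx and Apache rules, applied case-insensitively.
ARCHIVE_UA = re.compile(r"ia_archiver|archive\.org_bot", re.IGNORECASE)


class BlockArchiveCrawler:
    """WSGI middleware that answers 403 when the User-Agent matches ARCHIVE_UA."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if ARCHIVE_UA.search(user_agent):
            body = b"Forbidden\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Any other client is passed through to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]
```

Wrapping is a one-liner, application = BlockArchiveCrawler(demo_app), with demo_app replaced by your real WSGI callable. Doing the check in middleware keeps the rule next to the application code, at the cost of the request reaching Python before it is rejected.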

After the block: what to expect

Once a request is approved, the snapshots become inaccessible on web.archive.org. The URL then returns the "This URL has been excluded from the Wayback Machine" message described above. The data is not deleted from the Archive's internal storage in most cases; it is suppressed from the public interface. For the rare cases where outright data deletion is required (legal orders, certain privacy regimes), state that explicitly in the request and the Archive will route it to the team that handles those.

Future crawls of the affected URLs or domain are skipped. If you change your mind later, the same email channel is the way to re-enable.
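To check programmatically whether a URL still has a public snapshot, one option is the Archive's Wayback availability endpoint, which returns JSON describing the closest capture. The sketch below is written under assumptions: the helper names are hypothetical, and how the endpoint reports an excluded URL is not something this article documents, so treat an empty result as consistent with an exclusion rather than proof of one.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query URL for one page."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(payload):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(page_url, timeout=10.0):
    """Query the endpoint over the network and return the closest snapshot URL, if any."""
    with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Running check("example.com/private-page/") before and after a request is approved gives a simple before-and-after record. Opening the URL on web.archive.org in a browser remains the direct way to confirm that the exclusion message is shown.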

What this does not do

It does not remove your content from Google's cache, Bing's cache, or any search engine's snapshots. Each of those has its own removal mechanism (Google Search Console's "Removals" tool is the right entry point for Google).
It does not affect mirrors of your content on other archive services (archive.today, freezepage, regional archives). Each of those is a separate request.
It does not remove screenshots that other users have personally saved. Anything that left your control is no longer in your control.
It does not unwind training-data inclusion if an AI model was already trained on the snapshots before removal. See Should You Block AI Bots? for that side of the problem.

FAQ

Does the Wayback Machine respect robots.txt?

Mostly, but not absolutely. The Archive's official policy since 2017 is that robots.txt is designed for search engines and is not always the right control for an archive of historical record. In practice the Wayback Machine still honors robots.txt blocks for the vast majority of sites for future crawls. To remove snapshots that already exist, though, the only channel is an email request to info@archive.org, which the Archive reviews case by case.

What is the difference between ia_archiver and archive.org_bot?

ia_archiver is the legacy Alexa-era user agent that historically fed crawl data into the Archive; many older robots.txt files still match it. archive.org_bot is the Archive's current crawler. Match on both so your rule covers the legacy name and the modern one.

How long does archive.org take to process a removal request?

The Archive does not publish a turnaround target, so treat it as a review rather than a fixed timeline; allow anywhere from a few days to several weeks. A small team reads info@archive.org. A clear list of URLs, the time period to exclude, and the dates you controlled the site help the review move along, but the Archive states up front that it makes no guarantee about the outcome.

What information does archive.org need to remove my snapshots?

List the URL or URLs you want excluded, the time period to exclude, and the period during which you controlled the site or account, plus a brief reason for the request. The Archive uses the control-period detail to confirm the request is legitimate, then reviews it. Sending from an address on the domain is not required but makes the request easier to verify.

Can I prevent a single page from being archived without blocking my whole site?

Yes. Use a path-specific robots.txt rule: User-agent: archive.org_bot followed by Disallow: /private/ applies only to that path. Repeat the same Disallow line under User-agent: ia_archiver so both names are covered. The same applies to the email removal: request specific URLs or paths rather than the whole domain, and the Archive will scope its action accordingly.

Does blocking the Wayback Machine affect SEO?

No. The Wayback Machine is not a search engine and Google does not use Archive snapshots as a ranking signal. Blocking ia_archiver and archive.org_bot has no effect on Google Search, Bing, AI search engines, or anything else. It only affects the Wayback Machine itself.

Is it legal to block the Internet Archive?

Yes. You decide what crawlers you allow on your own site, and robots.txt, server-level blocks, and CDN rules are all standard ways to express that. The Archive generally honors a clear request to be excluded, although it makes no guarantees. The harder legal questions appear in the other direction (whether the Archive's existing snapshots can be required to be deleted), and those are handled case by case through DMCA or court process. For everyday operator-side blocking, no special legal cover is required.

Ishan Karunaratne