Static XML Sitemap for Large WordPress Sites with WP-CLI

If your WordPress site has hundreds of thousands of URLs and the sitemap is being built on every request (core's /wp-sitemap.xml or an SEO plugin assembling it on the fly), it will be slow and it will sometimes time out. The fix is to stop generating it per request and write static sitemap files to disk with a WP-CLI command, chunked to the 50,000-URL limit, with a sitemap index on top, regenerated on a schedule. Here is the command that does it:

bash

wp te sitemap generate --dir=/var/www/html/sitemaps --base-url=https://example.com

That walks every published post in fixed-size batches, writes sitemap-1.xml, sitemap-2.xml, and so on, then writes a sitemap_index.xml that points at all of them. The rest of this article is the command's implementation, why the per-request approach falls over at scale, and how to serve and schedule the static files.

Why dynamic sitemaps choke at scale

WordPress core added XML Sitemaps in 5.5 (August 2020). They are fine for a normal site: core paginates at 2,000 URLs per file by default (the value wp_sitemaps_get_max_urls() returns; it is adjustable per object type through the wp_sitemaps_max_urls filter rather than being a single global cap), and a request to /wp-sitemap-posts-post-1.xml runs a bounded query and returns. SEO plugins do roughly the same thing, often with their own caching layer.

The problem starts when the URL count climbs into the hundreds of thousands and the sitemap is regenerated at request time. Every hit to a sitemap URL becomes a database query, an object hydration pass, and an XML serialization, all inside the PHP request that has to finish before max_execution_time. A few things go wrong together:

Search engine crawlers fetch sitemaps aggressively. Googlebot, Bingbot, and the rest will pull every sitemap in your index, sometimes in parallel, sometimes repeatedly. Each fetch is real DB and CPU work if it is computed live.
Cold caches hurt the most. The first request after a cache purge or a deploy pays the full cost. On a big site that can blow past the PHP timeout and return a partial or 504 response, which a crawler then records as a broken sitemap.
The work is redundant. The post set barely changes minute to minute, but a per-request sitemap recomputes the same XML for every fetch. You are paying to produce identical bytes over and over.

A static file has none of that cost at read time. Nginx or Apache hands back a flat .xml off disk in microseconds, no PHP, no MySQL. The expensive generation runs once, out of band, on your schedule rather than the crawler's.

The 50,000-URL / 50MB limit and why you chunk

The sitemap protocol is strict about size. From sitemaps.org: a single sitemap file may contain no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes) uncompressed. The same ceilings apply to a sitemap index: no more than 50,000 sitemaps listed, and 50MB.

So a site with 600,000 URLs cannot live in one file. You split the URLs across multiple sitemap files, each holding at most 50,000 entries, and write a sitemap index that lists every chunk. The index is what you submit to search engines; they read it and fetch each child sitemap.

50,000 is the hard ceiling, not a recommendation. In practice I chunk smaller (often 25,000 to 45,000) so I stay clear of the 50MB byte limit too. A URL with a long path plus <lastmod> runs well under a kilobyte, so 50,000 of them is nowhere near 50MB, but if you add <image:image> blocks or long query strings the byte budget tightens fast. Pick a per-file cap you are comfortable with (TE_SITEMAP_MAX_URLS in the code below) and let the code do the math.
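
To make that arithmetic concrete, here is a minimal sketch in plain PHP. te_estimate_chunks(), TE_EXAMPLE_MAX_BYTES, and the 200-byte average entry size are illustrative assumptions, not part of the plugin; measure a sample of your own entries before trusting the byte estimate.

```php
<?php
// 50MB uncompressed ceiling from the sitemaps.org protocol.
const TE_EXAMPLE_MAX_BYTES = 52428800;

/**
 * Hypothetical helper: how many chunk files a URL count needs, and roughly
 * how large a full chunk is. Ignores the few bytes of XML header and footer.
 */
function te_estimate_chunks( int $total_urls, int $max_per_file, int $avg_entry_bytes ): array {
    $files      = (int) ceil( $total_urls / $max_per_file );
    $full_chunk = min( $total_urls, $max_per_file ) * $avg_entry_bytes;

    return array(
        'files'        => $files,
        'chunk_bytes'  => $full_chunk,
        'within_limit' => $full_chunk <= TE_EXAMPLE_MAX_BYTES,
    );
}

// 600,000 URLs at 45,000 per file, assuming about 200 bytes per <url> entry.
$plan = te_estimate_chunks( 600000, 45000, 200 );
printf( "%d files, about %d bytes per full chunk\n", $plan['files'], $plan['chunk_bytes'] );
```

With those inputs the plan is 14 files of roughly 9MB each, comfortably inside the byte limit. Raise the assumed entry size to model image blocks and the margin shrinks quickly.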

The WP-CLI command

The command registers under the te namespace and exposes a generate subcommand, so the full invocation is wp te sitemap generate. The class method approach is the standard WP-CLI pattern: register a class with WP_CLI::add_command() and its public methods become subcommands.

php

<?php
/**
 * Plugin Name: TE Static Sitemap
 * Plugin URI:  https://techearl.com/wordpress-static-xml-sitemap-wp-cli
 * Description: WP-CLI command that writes static, chunked XML sitemap files plus an index to disk for large WordPress sites.
 * Version:     1.0.0
 * Author:      Ishan Karunaratne
 * Author URI:  https://techearl.com
 * License:     GPL-2.0-or-later
 * Text Domain: te-static-sitemap
 */

if ( ! defined( 'ABSPATH' ) ) {
    exit;
}

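// The constants are declared before the WP-CLI guard so that
// te_generate_sitemap() can also run under WP-Cron, where WP_CLI is not defined.
const TE_SITEMAP_MAX_URLS   = 45000;
const TE_SITEMAP_BATCH_SIZE = 5000;

// Everything below this guard (the command class) is only needed under WP-CLI.
if ( ! defined( 'WP_CLI' ) || ! WP_CLI ) {
    return;
}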

class TE_Sitemap_Command {

    /**
     * Generate static sitemap files and an index.
     *
     * ## OPTIONS
     *
     * [--dir=<path>]
     * : Directory to write the files into. Must be writable and web-served.
     *
     * [--base-url=<url>]
     * : Site base URL. Defaults to home_url().
     *
     * @when after_wp_load
     */
    public function generate( $args, $assoc_args ) {
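        // Default to <WordPress root>/sitemaps, because the index writer builds
        // child URLs as <base-url>/sitemaps/<file>.
        $dir      = rtrim( $assoc_args['dir'] ?? ABSPATH . 'sitemaps', '/' );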
        $base_url = rtrim( $assoc_args['base-url'] ?? home_url(), '/' );

        if ( ! is_dir( $dir ) || ! is_writable( $dir ) ) {
            WP_CLI::error( "Directory not writable: {$dir}" );
        }

        $files = te_generate_sitemap( $dir, $base_url );

        WP_CLI::success(
            sprintf( 'Wrote %d sitemap file(s) and a sitemap_index.xml.', count( $files ) )
        );
    }
}

WP_CLI::add_command( 'te sitemap', 'TE_Sitemap_Command' );

The work itself lives in te_generate_sitemap(), kept as a plain function so it can also be called from a scheduled event (more on that below). It pages through post IDs, opens a new chunk file each time it crosses TE_SITEMAP_MAX_URLS, and writes the index at the end:

php

function te_generate_sitemap( $dir, $base_url ) {
    $files       = array();
    $file_index  = 1;
    $in_file     = 0;
    $handle      = null;
    $paged       = 1;

    $open_file = static function () use ( &$handle, &$file_index, &$files, $dir ) {
        $name = "sitemap-{$file_index}.xml";
        $path = "{$dir}/{$name}";
        $handle = fopen( $path, 'w' );
        if ( false === $handle ) {
            throw new RuntimeException( "Cannot open {$path} for writing" );
        }
        fwrite( $handle, '<?xml version="1.0" encoding="UTF-8"?>' . "\n" );
        fwrite( $handle, '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n" );
        $files[] = $name;
    };

    $close_file = static function () use ( &$handle ) {
        if ( $handle ) {
            fwrite( $handle, '</urlset>' . "\n" );
            fclose( $handle );
            $handle = null;
        }
    };

    $open_file();

    do {
        $query = new WP_Query( array(
            'post_type'              => 'post',
            'post_status'            => 'publish',
            'posts_per_page'         => TE_SITEMAP_BATCH_SIZE,
            'paged'                  => $paged,
            'fields'                 => 'ids',
            'no_found_rows'          => true,
            'update_post_meta_cache' => false,
            'update_post_term_cache' => false,
            'orderby'                => 'ID',
            'order'                  => 'ASC',
        ) );

        if ( empty( $query->posts ) ) {
            break;
        }

        foreach ( $query->posts as $post_id ) {
            if ( $in_file >= TE_SITEMAP_MAX_URLS ) {
                $close_file();
                $file_index++;
                $in_file = 0;
                $open_file();
            }

            $loc     = esc_url( get_permalink( $post_id ) );
            $lastmod = get_post_modified_time( 'c', true, $post_id );

            fwrite(
                $handle,
                "  <url><loc>{$loc}</loc><lastmod>{$lastmod}</lastmod></url>\n"
            );

            $in_file++;
        }

        $paged++;
        unset( $query );
        wp_cache_flush();

    } while ( true );

    $close_file();

    te_write_sitemap_index( $dir, $base_url, $files );

    return $files;
}

And the index writer, which lists every chunk file:

php

function te_write_sitemap_index( $dir, $base_url, $files ) {
    $now    = gmdate( 'c' );
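    $handle = fopen( "{$dir}/sitemap_index.xml", 'w' );
    if ( false === $handle ) {
        throw new RuntimeException( "Cannot open {$dir}/sitemap_index.xml for writing" );
    }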

    fwrite( $handle, '<?xml version="1.0" encoding="UTF-8"?>' . "\n" );
    fwrite( $handle, '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n" );

    foreach ( $files as $name ) {
        $loc = esc_url( "{$base_url}/sitemaps/{$name}" );
        fwrite( $handle, "  <sitemap><loc>{$loc}</loc><lastmod>{$now}</lastmod></sitemap>\n" );
    }

    fwrite( $handle, '</sitemapindex>' . "\n" );
    fclose( $handle );
}
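
One refinement worth considering, since crawlers can request the index while a regeneration is running: write to a temporary file and rename it into place, so a reader sees either the old file or the new one, never a half-written one. The sketch below is plain PHP. te_atomic_write() is a hypothetical helper, not something the command above calls, and it assumes the temporary file and the target sit on the same filesystem, which is what lets rename() replace the target in one step on POSIX systems.

```php
<?php
/**
 * Hypothetical helper: write $contents to $path via a temporary file in the
 * same directory, then rename it over the target.
 */
function te_atomic_write( string $path, string $contents ): void {
    // Same directory as the target, so the rename never crosses filesystems.
    $tmp = $path . '.tmp.' . getmypid();

    if ( false === file_put_contents( $tmp, $contents ) ) {
        throw new RuntimeException( "Cannot write {$tmp}" );
    }

    if ( ! rename( $tmp, $path ) ) {
        unlink( $tmp ); // Do not leave the temporary file behind on failure.
        throw new RuntimeException( "Cannot move {$tmp} to {$path}" );
    }
}
```

The index is small enough to build as a string and hand to a helper like this. For the chunk files you would keep the streaming fwrite() loop, but open a temporary name and rename() it when the chunk is closed.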

This is deliberately a posts-only example. A real large site usually has pages, products, and custom post types too: extend the WP_Query to loop over each post_type you need (and terms, if you index category and tag archives), keeping the same batch-and-chunk structure. Building your own WP-CLI command around this is the natural way to package it; see my notes on writing a custom WP-CLI command for the registration and argument-parsing details.
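
As a sketch of that extension, the outer loop can be a generator that walks each post type in turn and pages through it in fixed-size batches. To keep the example runnable outside WordPress, the batch fetch is passed in as a callable. te_iterate_ids() and $fetch_batch are hypothetical names; in the plugin the callable would wrap the WP_Query call from te_generate_sitemap() with 'post_type' set to the current type.

```php
<?php
/**
 * Yield array( $post_type, $id ) pairs for several post types, one batch at a time.
 *
 * $fetch_batch is a hypothetical callable( string $post_type, int $page, int $size ): int[]
 * that returns an empty array once a post type is exhausted.
 */
function te_iterate_ids( array $post_types, int $batch_size, callable $fetch_batch ): Generator {
    foreach ( $post_types as $post_type ) {
        $page = 1;
        while ( true ) {
            $ids = $fetch_batch( $post_type, $page, $batch_size );
            if ( empty( $ids ) ) {
                break; // This post type is done; move on to the next one.
            }
            foreach ( $ids as $id ) {
                yield array( $post_type, $id );
            }
            $page++;
        }
    }
}

// Stand-in data source for the demo: slices a fixed array instead of querying.
$fake_db = array(
    'post' => array( 1, 2, 3, 4, 5 ),
    'page' => array( 10, 11 ),
);
$fetch = static function ( string $type, int $page, int $size ) use ( $fake_db ): array {
    return array_slice( $fake_db[ $type ] ?? array(), ( $page - 1 ) * $size, $size );
};

foreach ( te_iterate_ids( array( 'post', 'page' ), 2, $fetch ) as list( $type, $id ) ) {
    echo "{$type}:{$id}\n";
}
```

The chunk-writing loop can consume this generator the same way it consumes $query->posts, so the per-file rollover keeps working across post types and only one batch of IDs is held at a time.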

Batching and memory on a huge dataset

The thing that breaks naive sitemap scripts on big sites is memory, not speed. Load 600,000 full WP_Post objects into PHP and you are out of memory long before you finish. Every flag in that WP_Query is there to keep the footprint flat:

'fields' => 'ids' returns an array of integers, not hydrated post objects. You only need the ID to call get_permalink() and get_post_modified_time().
'no_found_rows' => true skips the SQL_CALC_FOUND_ROWS pass. You are paginating with paged and stopping when a batch comes back empty, so you never need the total count, and that count is one of the most expensive parts of a WP_Query on a large table.
'update_post_meta_cache' => false and 'update_post_term_cache' => false stop WordPress eagerly priming the meta and term caches for every post in the batch. You are not reading meta or terms here, so priming them is pure waste and pure memory.
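unset( $query ) drops the query object and its array of IDs so PHP can reclaim them.
wp_cache_flush() at the end of each batch is the important one. WordPress accumulates objects in its in-memory cache as you query (each get_permalink() call loads and caches its post), and over hundreds of batches that growth is what tips you into the memory limit. Flushing per batch keeps usage bounded no matter how many posts you have. One caveat: with a persistent object cache (Redis, Memcached), wp_cache_flush() also clears the shared cache the live site depends on; on WordPress 6.0+ wp_cache_flush_runtime() clears only the in-memory copy.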

The result is constant memory regardless of dataset size: you hold one batch (here 5,000 IDs) at a time, stream each URL straight to the open file handle with fwrite(), and never build the whole XML string in memory. A 600,000-URL site processes in the same memory envelope as a 6,000-URL one, it just takes longer.

If even a single batch is heavy, drop TE_SITEMAP_BATCH_SIZE. Smaller batches mean more queries but lower peak memory. On a constrained box I have run this at 1,000 and it is fine; the file handle does the streaming, so batch size only affects how much you hold in RAM at once, not the output.
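
A sketch of a per-batch flush that prefers wp_cache_flush_runtime() (WordPress 6.0+, in-memory runtime cache only) and falls back to wp_cache_flush() on older installs. That is the gentler choice when a persistent object cache is in use. te_flush_batch_cache() is a hypothetical name, and whether the runtime flush is honored depends on your object-cache drop-in, so treat this as a starting point rather than a guarantee.

```php
<?php
/**
 * Hypothetical wrapper: free per-process cache memory between batches.
 * Returns the name of the function it called, which is handy for logging.
 */
function te_flush_batch_cache(): string {
    if ( function_exists( 'wp_cache_flush_runtime' ) ) {
        wp_cache_flush_runtime(); // WordPress 6.0+: in-memory runtime cache only.
        return 'wp_cache_flush_runtime';
    }

    wp_cache_flush(); // Older WordPress: flushes the whole object cache.
    return 'wp_cache_flush';
}
```

In te_generate_sitemap() this would replace the bare wp_cache_flush() call at the end of each batch.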

Serving the files and pointing crawlers at them

Once the files exist on disk, you serve them like any other static asset and tell search engines where the index is.

If you write them under the web root (for example /var/www/html/sitemaps/), they are already reachable at https://example.com/sitemaps/sitemap_index.xml with no extra config: the web server hands back the flat file, PHP never runs. Then:

Point robots.txt at the index so any crawler discovers it:

text

Sitemap: https://example.com/sitemaps/sitemap_index.xml

Submit the index URL in Google Search Console and Bing Webmaster Tools. You submit the one index file; the crawler reads it and fetches each child sitemap itself.
Turn off the dynamic sitemap so you do not have two competing sources. For core, disable it with the wp_sitemaps_enabled filter returning false; for an SEO plugin, switch its XML sitemap feature off. You want exactly one sitemap of record, and on a large site that is the static index.

If you would rather keep the canonical /sitemap_index.xml path at the root instead of under /sitemaps/, write the index file to the web root directly, or add a small rewrite mapping the path to the static file. Just make sure the rewrite resolves to the flat file and does not fall through to WordPress's PHP, otherwise you are back to a dynamic response.
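
A sketch of the two WordPress-side steps as plugin code. It assumes WordPress is generating a virtual robots.txt (a physical robots.txt file on disk bypasses the robots_txt filter and has to be edited by hand). te_robots_sitemap_line() and TE_EXAMPLE_INDEX_URL are hypothetical names, and the URL is the example one used throughout this article.

```php
<?php
// Assumed location of the static index; adjust to your own URL.
const TE_EXAMPLE_INDEX_URL = 'https://example.com/sitemaps/sitemap_index.xml';

/**
 * Append a Sitemap: line to the virtual robots.txt, unless it is already there.
 * The second argument mirrors the robots_txt filter signature and is unused.
 */
function te_robots_sitemap_line( string $output, $public = true ): string {
    $line = 'Sitemap: ' . TE_EXAMPLE_INDEX_URL;

    if ( false !== strpos( $output, $line ) ) {
        return $output;
    }

    return rtrim( $output ) . "\n" . $line . "\n";
}

// Only wire the hooks when WordPress is loaded, so the file can be run standalone.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'robots_txt', 'te_robots_sitemap_line', 10, 2 );
    add_filter( 'wp_sitemaps_enabled', '__return_false' ); // Turn off core's dynamic sitemaps.
}
```

Submitting the index in Google Search Console and Bing Webmaster Tools is still a manual step; nothing in this snippet does that for you.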

Scheduling regeneration

Static files are a snapshot, so they go stale as you publish. Regenerate them on a cadence that matches how often your content changes. Two ways:

System cron (preferred for large sites). A real cron entry calling WP-CLI runs reliably regardless of site traffic, which WP-Cron does not, since WP-Cron only fires when someone hits the site. A nightly run:

bash

0 3 * * * cd /var/www/html && wp te sitemap generate --dir=/var/www/html/sitemaps --base-url=https://example.com >> /var/log/te-sitemap.log 2>&1

That is the most robust option on a busy production box: the generation runs once a night on the server's schedule, completely decoupled from request traffic.

WP-Cron, if you cannot add a system cron entry. Schedule a recurring event that calls the same te_generate_sitemap() function the CLI command uses:

php

add_action( 'init', function () {
    if ( ! wp_next_scheduled( 'te_sitemap_rebuild' ) ) {
        wp_schedule_event( time(), 'daily', 'te_sitemap_rebuild' );
    }
} );

add_action( 'te_sitemap_rebuild', function () {
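    // te_write_sitemap_index() builds child URLs as <base-url>/sitemaps/<file>,
    // so write to the directory served at /sitemaps/ (this assumes WordPress
    // lives in the web root).
    $dir = ABSPATH . 'sitemaps';
    wp_mkdir_p( $dir );
    te_generate_sitemap( $dir, home_url() );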
} );

Be honest about WP-Cron's limits on a heavy job, though: it runs inside a web request and is subject to the same PHP timeout that made the per-request sitemap a problem in the first place. For a site big enough to need static sitemaps, a real system cron is the right tool. If you do go the WP-Cron route, my walkthrough on scheduling a recurring task with WP-Cron covers wp_schedule_event() and the gotchas in detail.

Verify it worked

Figure: real output of the WP-CLI command writing static sitemap chunks and an index, followed by curl fetching the index.

Run the command and confirm the files landed:

bash

wp te sitemap generate --dir=/var/www/html/sitemaps --base-url=https://example.com

You should see something like Success: Wrote 13 sitemap file(s) and a sitemap_index.xml. Then list the directory and fetch one chunk to eyeball the XML:

bash

ls -la /var/www/html/sitemaps/
curl -s https://example.com/sitemaps/sitemap-1.xml | head -n 20
curl -s https://example.com/sitemaps/sitemap_index.xml

Check three things: each chunk opens with the <urlset> declaration and contains <url><loc> entries, no single file exceeds 50,000 URLs (grep -c '<loc>' sitemap-1.xml), and the index lists every chunk you generated. If you are serving from the web root, both curl calls should return the file in milliseconds with no PHP involvement. Confirm in your access log that the request did not hit index.php.
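
The same checks can be scripted. A rough sketch in plain PHP, assuming the file names the command writes (sitemap-N.xml and sitemap_index.xml). te_count_locs() and te_check_sitemaps() are hypothetical helpers, and the <loc> counting is a string match like the grep above rather than a real XML parse.

```php
<?php
/**
 * Count <loc> entries in a sitemap file by streaming it line by line.
 */
function te_count_locs( string $path ): int {
    $handle = fopen( $path, 'r' );
    if ( false === $handle ) {
        throw new RuntimeException( "Cannot read {$path}" );
    }

    $count = 0;
    while ( false !== ( $line = fgets( $handle ) ) ) {
        $count += substr_count( $line, '<loc>' );
    }
    fclose( $handle );

    return $count;
}

/**
 * Return a list of problems found in the sitemap directory; empty means OK.
 */
function te_check_sitemaps( string $dir, int $limit = 50000 ): array {
    $problems = array();
    $chunks   = glob( "{$dir}/sitemap-*.xml" ) ?: array();
    $index    = (string) file_get_contents( "{$dir}/sitemap_index.xml" );

    foreach ( $chunks as $chunk ) {
        $name  = basename( $chunk );
        $count = te_count_locs( $chunk );

        if ( $count > $limit ) {
            $problems[] = "{$name} has {$count} URLs (limit {$limit})";
        }
        // Each chunk on disk should appear as a <loc> in the index.
        if ( false === strpos( $index, "/{$name}</loc>" ) ) {
            $problems[] = "{$name} is missing from sitemap_index.xml";
        }
    }

    return $problems;
}

// Usage, against the directory from this article:
// foreach ( te_check_sitemaps( '/var/www/html/sitemaps' ) as $problem ) { echo $problem, "\n"; }
```

A chunk that exists on disk but is missing from the index is usually a leftover from an earlier, larger run; it is harmless to crawlers but worth deleting.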

When should I use static sitemaps instead of the WordPress core sitemap?

When the per-request cost of generating the sitemap is hurting you. Core's XML Sitemaps (since WordPress 5.5) and most SEO plugins build the sitemap when the URL is requested, which is fine until you have hundreds of thousands of URLs and aggressive crawlers fetching them repeatedly.

At that scale each fetch is a live database query and XML serialization inside the PHP timeout, and cold-cache requests can 504. A static file written once and served off disk removes that cost entirely. For a normal-sized site, leave the dynamic sitemap alone.

How many URLs can one sitemap file hold?

The sitemaps.org protocol caps a single sitemap file at 50,000 URLs and 50MB (52,428,800 bytes) uncompressed. A sitemap index has the same limits: at most 50,000 child sitemaps and 50MB.

Once you cross 50,000 URLs you must split into multiple files and list them in a sitemap index. I chunk below the ceiling (often 25,000 to 45,000) to stay clear of the 50MB byte limit too, especially if entries carry image blocks or long URLs.

Why does the WP_Query use fields=ids and no_found_rows?

To keep memory and query cost flat on a huge dataset. 'fields' => 'ids' returns plain integers instead of hydrating a full post object for every row, which you do not need just to build a URL.

'no_found_rows' => true skips the total-row count, one of the most expensive parts of a query on a large table, which you do not need because you paginate until a batch comes back empty. Disabling the meta and term cache priming saves the rest.

How do I keep the static sitemap from going stale?

Regenerate it on a schedule. The reliable option on a large site is a system cron entry that calls wp te sitemap generate at a fixed time, independent of site traffic.

If you cannot add a system cron, schedule a WP-Cron event that calls the same generation function. Be aware WP-Cron runs inside a web request and is subject to the same PHP timeout, so for a genuinely large site a real cron job is the better choice.

How do I tell search engines to use the static index?

Add a Sitemap: line to robots.txt pointing at the index, for example https://example.com/sitemaps/sitemap_index.xml, and submit that one index URL in Google Search Console and Bing Webmaster Tools. The crawler reads the index and fetches each child sitemap itself.

Also disable the dynamic sitemap (the wp_sitemaps_enabled filter for core, or the plugin's XML sitemap setting) so there is exactly one sitemap of record rather than two competing sources.

Ishan Karunaratne
