TechEarl

WordPress: Moderate Comments Using Regular Expressions

Use the built-in WordPress comment-moderation regex fields and the pre_comment_approved filter to approve, hold, spam, or trash comments based on PCRE patterns.

Ishan Karunaratne · 10 min read

WordPress ships two regex-aware textareas in Settings → Discussion ("Comment Moderation" and "Disallowed Comment Keys") plus a PHP filter, pre_comment_approved, that lets you run arbitrary PCRE against every incoming comment before it hits the database. Used together they cover most non-AI spam without an external service.

How do I moderate WordPress comments with regular expressions?

WordPress has two built-in regex-friendly fields under Settings → Discussion:

  1. Comment Moderation — any comment matching a line here is held in the moderation queue.
  2. Disallowed Comment Keys (called "Comment Blacklist" before WordPress 5.5): matches are sent to the Trash, or to spam when the trash is disabled (EMPTY_TRASH_DAYS set to 0).

Each line is applied as a case-insensitive match with preg_match(): check_comment() handles the Comment Moderation list and wp_check_comment_disallowed_list() handles Disallowed Comment Keys. Core escapes each line with preg_quote() first, so confirm on your install that \b word boundaries, character classes, lookaheads, or alternation behave the way you expect before relying on them in these fields. For anything beyond list-of-strings filtering, hook a function onto the pre_comment_approved filter and run preg_match() against $commentdata['comment_content'], returning 0 to hold, 'spam' to spam, 'trash' to discard, or 1 to auto-approve.


The two built-in regex fields

Both live in /wp-admin/options-discussion.php:

| Field | Behavior when matched | Stored in option |
| --- | --- | --- |
| Comment Moderation | Comment held for manual review | moderation_keys |
| Disallowed Comment Keys | Comment goes to the Trash (spam if the trash is disabled) | disallowed_keys |

Internally WordPress reads each textarea line-by-line, trims it, and runs the equivalent of:

php
if ( preg_match( '#' . preg_quote( $word, '#' ) . '#i', $comment_content ) ) {
    // hold or spam
}

The preg_quote() call escapes regex metacharacters ('?', '.', '+', and also '\', '(', '|', '['), so adding viagra just matches the substring viagra. The same escaping applies to a line that looks like a regex: with the wrapper shown above, \b(viagra|cialis)\b is searched for as literal text, not evaluated as a pattern. There is no documented "regex mode" toggle for these fields. Test any pattern against a sample comment before trusting it here, and move anything that needs real PCRE semantics (anchors, classes, \b, lookaheads) into the pre_comment_approved filter covered below.

The fields check the comment author name, email, URL, body, IP, and user agent: six values, each tested separately against every line. Useful when a spammer rotates the body but reuses the same throwaway email domain.
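A quick way to check how a given line behaves under the wrapper shown above is to reproduce it outside WordPress. The sketch below is standalone PHP; te_line_matches() is a hypothetical helper name, not a WordPress function, and it copies only the quote-and-wrap step from the snippet above.

```php
<?php
// Hypothetical helper that mirrors the quote-and-wrap step from the snippet above.
function te_line_matches( string $line, string $value ): bool {
    $line = trim( $line );
    if ( '' === $line ) {
        return false; // blank textarea lines never match anything
    }
    // Escape metacharacters, then wrap in #...#i exactly as shown earlier.
    $pattern = '#' . preg_quote( $line, '#' ) . '#i';
    return 1 === preg_match( $pattern, $value );
}

// A plain keyword matches as a case-insensitive substring.
var_dump( te_line_matches( 'viagra', 'Cheap VIAGRA here' ) ); // bool(true)

// A regex-looking line is escaped first, so it does not act as a pattern...
var_dump( te_line_matches( '\b(viagra|cialis)\b', 'Cheap viagra here' ) ); // bool(false)

// ...and only matches its own literal characters.
var_dump( te_line_matches( '\b(viagra|cialis)\b', 'literal \b(viagra|cialis)\b text' ) ); // bool(true)
```

Running a candidate line through a helper like this before pasting it into Settings → Discussion shows whether it will behave as a keyword or as a pattern.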

Real-world moderation patterns

A short list of patterns I actually use, paired with what they catch.

Block common spam phrases with word boundaries

code
\b(viagra|cialis|payday loan|crypto pump)\b

\b is a word boundary. It stops specialist from matching on its cialis substring and stops replayday from matching payday. Word boundaries are the single biggest accuracy upgrade you can make to a keyword list.
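The effect of the boundaries is easy to see in isolation. This is a minimal sketch in plain PHP; te_has_spam_phrase() is a made-up name used only for illustration.

```php
<?php
// Hypothetical helper: true when $text contains one of the phrases as a whole word.
function te_has_spam_phrase( string $text ): bool {
    return 1 === preg_match( '#\b(viagra|cialis|payday loan|crypto pump)\b#i', $text );
}

// "specialist" contains the letters c-i-a-l-i-s, but not as a standalone word.
var_dump( te_has_spam_phrase( 'Ask a specialist about it' ) ); // bool(false)

// The standalone word still matches, regardless of case.
var_dump( te_has_spam_phrase( 'Buy CIALIS today' ) ); // bool(true)

// Without \b the same keyword list flags the innocent sentence.
var_dump( preg_match( '#(viagra|cialis)#i', 'Ask a specialist about it' ) ); // int(1)
```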

Limit URL count per comment

code
(https?://[^\s]+.*?){3,}

Matches any comment with three or more HTTP/HTTPS URLs. Three is the threshold I default to; legitimate comments rarely include more than two links. Drop in Comment Moderation (hold for review) rather than Disallowed Comment Keys so a real reader doesn't get silently spammed for over-citing.
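In PHP code, counting the links directly is an alternative to a single repeated group. Without the s modifier the dot does not cross newlines, so the single-pattern version can miss URLs placed on separate lines. The sketch below uses preg_match_all(); the function names and the default threshold of two allowed links are assumptions for illustration.

```php
<?php
// Hypothetical helper: count http/https URLs in a comment body.
function te_count_urls( string $text ): int {
    // \S+ runs to the next whitespace, so each whitespace-separated URL counts once.
    $count = preg_match_all( '#https?://\S+#i', $text );
    return false === $count ? 0 : $count;
}

// Hypothetical helper: true when the comment has more links than allowed.
function te_too_many_links( string $text, int $max_links = 2 ): bool {
    return te_count_urls( $text ) > $max_links;
}

$comment = "Nice post!\nhttp://a.example\nhttps://b.example\nhttp://c.example";
var_dump( te_count_urls( $comment ) );     // int(3)
var_dump( te_too_many_links( $comment ) ); // bool(true)
```

A callback on pre_comment_approved could return 0 when te_too_many_links() is true, which keeps the hold-for-review behavior described above.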

Match suspicious email domains

code
@(mail\.ru|yandex\.ru|protonmail\.com|tutanota\.com)$

Anchored to $ so the domain has to sit at the very end of the value being tested, which is the email address when the pattern runs against the email field. This is blunt: a legitimate Proton user gets caught. I use it in Comment Moderation, never in Disallowed Comment Keys, and review the queue daily. For a proper email pattern reference see regex match email address.
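Run against the email field alone, the anchor does what it is meant to. The following sketch is plain PHP with a hypothetical helper name; it reuses the domain list from the pattern above and trims the input so stray whitespace does not defeat the anchor.

```php
<?php
// Hypothetical helper: true when the address ends in one of the watched domains.
function te_is_watched_email_domain( string $email ): bool {
    $pattern = '#@(mail\.ru|yandex\.ru|protonmail\.com|tutanota\.com)$#i';
    return 1 === preg_match( $pattern, trim( $email ) );
}

var_dump( te_is_watched_email_domain( 'someone@mail.ru' ) );             // bool(true)
var_dump( te_is_watched_email_domain( 'someone@gmail.ru' ) );            // bool(false)
var_dump( te_is_watched_email_domain( 'someone@mail.ru.example.com' ) ); // bool(false)
```

Because the @ sits directly before the alternation, subdomains such as user@sub.mail.ru are not matched; widen the pattern if that matters on your site.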

Detect mass non-Latin script spam

code
[\x{0400}-\x{04FF}\x{0590}-\x{05FF}\x{0600}-\x{06FF}]{20,}

Twenty or more consecutive Cyrillic, Hebrew, or Arabic characters. This is a controversial filter. A site whose audience legitimately writes in those scripts will lose real readers to it. I only use the pattern on English-only sites and only after seeing the same script-spam vector hit me three or more times in a week. Discussion threads on this exact trade-off show up regularly in the WordPress.org support forums and on the WordPress Stack Exchange — read the counter-arguments before deploying.
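When this pattern is used from PHP code, the \x{...} escapes above FF need the u (UTF-8) modifier; without it PCRE rejects the pattern at compile time. A minimal sketch, with a hypothetical function name and a configurable run length:

```php
<?php
// Hypothetical helper: true when $text has $min_run consecutive Cyrillic,
// Hebrew, or Arabic characters. The u modifier is required for \x{0400} etc.
function te_has_long_script_run( string $text, int $min_run = 20 ): bool {
    $pattern = '#[\x{0400}-\x{04FF}\x{0590}-\x{05FF}\x{0600}-\x{06FF}]{' . $min_run . ',}#u';
    return 1 === preg_match( $pattern, $text );
}

$ya = "\u{044F}"; // Cyrillic small letter ya, written as an escape to avoid encoding issues

var_dump( te_has_long_script_run( str_repeat( $ya, 20 ) ) ); // bool(true)
var_dump( te_has_long_script_run( str_repeat( $ya, 19 ) ) ); // bool(false)

// A space breaks the run, because whitespace is not in the character class.
var_dump( te_has_long_script_run( str_repeat( $ya, 10 ) . ' ' . str_repeat( $ya, 10 ) ) ); // bool(false)
```

The last case is worth noticing: ordinary sentences are split by spaces and punctuation, so the pattern as written fires only on unbroken runs.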

Catch BBCode link injection

code
\[url=.*?\].*?\[/url\]

BBCode-style link injection. The [url=...] syntax isn't rendered by WordPress, but spambots paste it anyway because the same payload gets reused across vBulletin / phpBB targets. Match-and-spam.
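As a PHP check, the same pattern might look like the sketch below. The function name is hypothetical, and the s modifier is an addition of mine so the match still works when the link text wraps across lines.

```php
<?php
// Hypothetical helper: true when the text contains a [url=...]...[/url] pair.
function te_has_bbcode_link( string $text ): bool {
    // i: case-insensitive tags; s: let the dot cross newlines (an added assumption).
    return 1 === preg_match( '#\[url=.*?\].*?\[/url\]#is', $text );
}

var_dump( te_has_bbcode_link( '[url=http://x.example]cheap meds[/url]' ) );  // bool(true)
var_dump( te_has_bbcode_link( "[URL=http://x.example]cheap\nmeds[/URL]" ) ); // bool(true)
var_dump( te_has_bbcode_link( 'Paste the [url] into your browser' ) );       // bool(false)
```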

The pre_comment_approved filter

The textareas in the admin cover keyword-and-pattern matching. For anything stateful (rate limits, dynamic blocklists, IP geolocation), use the pre_comment_approved filter in code:

php
function te_check_comment( $approved, $commentdata ) {
    $content = $commentdata['comment_content'];
    $pattern = '#\b(buy now|click here|free trial)\b#i';

    if ( preg_match( $pattern, $content ) ) {
        $approved = 0; // hold for moderation
    }

    return $approved;
}
add_filter( 'pre_comment_approved', 'te_check_comment', 10, 2 );

The filter receives the approval status WordPress already computed (0, 1, 'spam', 'trash', or WP_Error) plus the full $commentdata array. Whatever you return becomes the final status.
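Because the return value is final, a callback that always assigns 0 on a match can also soften a comment that WordPress or another plugin had already marked 'spam' or 'trash'. One way to avoid that, sketched under the assumption that you only want to act on comments that would otherwise be approved, is to return early for every other status. The function name is hypothetical, and the add_filter() call is guarded so the file also runs outside WordPress.

```php
<?php
// Hypothetical callback: hold matching comments, but never override a stricter status.
function te_hold_on_match( $approved, array $commentdata ) {
    // Anything other than "approved" (0, 'spam', 'trash', a WP_Error object) passes through untouched.
    if ( 1 !== $approved && '1' !== $approved ) {
        return $approved;
    }

    $content = $commentdata['comment_content'] ?? '';
    if ( 1 === preg_match( '#\b(buy now|click here|free trial)\b#i', $content ) ) {
        return 0; // hold for moderation
    }

    return $approved;
}

// Register only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'pre_comment_approved', 'te_hold_on_match', 10, 2 );
}
```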

Drop this in wp-content/mu-plugins/comment-moderation.php (must-use plugin — loads automatically, can't be deactivated by a wp-admin user) or in your theme's functions.php if the site is theme-locked.

Approve, hold, spam, or trash

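The four return values map to four queue destinations. Same regex, different $approved. Each variant below reuses the name te_check_comment, so install only one at a time or give each a unique name; PHP raises a fatal error when the same function is declared twice.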

Hold for moderation

php
function te_check_comment( $approved, $commentdata ) {
    if ( preg_match( '#\b(viagra|cialis)\b#i', $commentdata['comment_content'] ) ) {
        $approved = 0;
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_check_comment', 10, 2 );

Mark as spam

php
function te_check_comment( $approved, $commentdata ) {
    if ( preg_match( '#\b(viagra|cialis)\b#i', $commentdata['comment_content'] ) ) {
        $approved = 'spam';
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_check_comment', 10, 2 );

Send straight to trash

php
function te_check_comment( $approved, $commentdata ) {
    if ( preg_match( '#\b(viagra|cialis)\b#i', $commentdata['comment_content'] ) ) {
        $approved = 'trash';
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_check_comment', 10, 2 );

Force approve (whitelist a known-good pattern)

php
function te_check_comment( $approved, $commentdata ) {
    if ( preg_match( '#@mycompany\.com$#i', $commentdata['comment_author_email'] ) ) {
        $approved = 1;
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_check_comment', 10, 2 );

Useful for a corporate site where employees commenting from the company domain shouldn't get caught by Akismet false positives.

Every comment still hits wp_comments with the resolved status, so you can audit later via WP_Comment_Query even for trashed entries.
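The four variants above can also be folded into a single callback that walks an ordered rule list, which avoids declaring several functions with the same name. This is a sketch, not a drop-in: the function name, the rule order, and the mycompany.com allowlist are illustrative assumptions.

```php
<?php
// Hypothetical combined callback: first matching rule decides the status.
function te_apply_rules( $approved, array $commentdata ) {
    $email   = $commentdata['comment_author_email'] ?? '';
    $content = $commentdata['comment_content'] ?? '';

    // Allowlist first. Note: the email is whatever a logged-out commenter typed.
    if ( 1 === preg_match( '#@mycompany\.com$#i', $email ) ) {
        return 1;
    }

    // Pattern => status. Order matters: the first match wins.
    $rules = array(
        '#\[url=.*?\].*?\[/url\]#is' => 'spam',
        '#\b(viagra|cialis)\b#i'     => 0,
    );

    foreach ( $rules as $pattern => $status ) {
        if ( 1 === preg_match( $pattern, $content ) ) {
            return $status;
        }
    }

    return $approved; // no rule matched: keep what WordPress decided
}

// Register only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'pre_comment_approved', 'te_apply_rules', 10, 2 );
}
```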

Testing patterns before they go live

Never deploy a regex to a production pre_comment_approved filter without testing it against real comment content first. The cost of a bad pattern is silent suppression of legitimate comments — readers don't email you to say "my comment never appeared", they just stop visiting.

The workflow I use:

  1. Export the last 500 approved comments from a staging or local copy: wp comment list --status=approve --number=500 --format=csv --fields=comment_ID,comment_content > approved.csv.
  2. Paste a representative comment into regex101.com (set the flavor to PCRE / PCRE2).
  3. Iterate the pattern until it matches the spam variants and zero legitimate samples.
  4. Stage the pattern in a pre_comment_approved filter that logs rather than blocks: error_log( 'Would have moderated: ' . $commentdata['comment_content'] );.
  5. Tail the log for a week. If the false-positive rate is acceptable, flip from logging to blocking.
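Steps 1 to 3 can be partly automated by running a candidate pattern over the exported file and listing which known-good comments it would have caught. The sketch below assumes the two-column CSV from step 1 (comment_ID, comment_content) with a header row; the function name is hypothetical.

```php
<?php
// Hypothetical helper: return the IDs of approved comments that $pattern would flag.
function te_false_positives( string $pattern, string $csv_path ): array {
    $hits = array();
    if ( ! is_readable( $csv_path ) ) {
        return $hits;
    }

    $handle = fopen( $csv_path, 'r' );
    if ( false === $handle ) {
        return $hits;
    }

    // Skip the header row (comment_ID,comment_content).
    fgetcsv( $handle, 0, ',', '"', '\\' );

    while ( false !== ( $row = fgetcsv( $handle, 0, ',', '"', '\\' ) ) ) {
        if ( ! is_array( $row ) || count( $row ) < 2 ) {
            continue; // blank or malformed line
        }
        if ( 1 === preg_match( $pattern, $row[1] ) ) {
            $hits[] = $row[0]; // a legitimate comment the pattern would have caught
        }
    }

    fclose( $handle );
    return $hits;
}

// Usage: an empty array means no approved comment matched.
// print_r( te_false_positives( '#\b(viagra|cialis)\b#i', 'approved.csv' ) );
```

Any ID it returns is a comment to read by hand before the pattern goes anywhere near the logging stage in step 4.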

For broader pattern reference I keep the regex cheat sheet and URL matching reference open in tabs while iterating.

Plugin alternatives

Regex moderation handles 80% of low-effort spam, but it doesn't scale to AI-generated comments that read like real prose. Tier the defense:

| Tool | Catches | When to use |
| --- | --- | --- |
| Built-in regex fields | Substring + keyword spam | Always; it's free and runs before plugins |
| pre_comment_approved filter | Custom logic (rate limits, geo, signed-in users) | When the regex fields aren't expressive enough |
| Akismet (Automattic) | Statistical / ML spam, including AI-generated | Default for any non-trivial WordPress site; free for personal use |
| Antispam Bee | Honeypot, BBCode, language heuristics | GDPR-friendly Akismet alternative |
| Cloudflare Turnstile / hCaptcha | Bot signatures, no user friction | When bots are submitting comments faster than humans could type |

Akismet's hosted classification API is what catches the modern "this comment praises your article in fluent English but the URL is a casino affiliate" pattern. Regex can't beat that — but regex IS what catches the 1990s-style payload spam that still makes up most of the queue.

For administrative recovery if a bad rule locks out genuine commenters, see how to change a WordPress password for getting back into admin to disable the filter, and how to increase the PHP memory limit if a runaway regex pattern is timing out PHP on long comment bodies.
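A runaway pattern usually shows up as preg_match() returning false instead of 0 or 1, for example when the backtrack limit is hit or the subject is not valid UTF-8 under the u modifier. A defensive wrapper can make that case explicit so the filter does not silently treat an engine error as "no match". The helper name and the choice to return null on error are assumptions for this sketch.

```php
<?php
// Hypothetical wrapper: true/false for match/no match, null when PCRE itself failed.
function te_safe_match( string $pattern, string $text ): ?bool {
    $result = preg_match( $pattern, $text );

    if ( false === $result ) {
        // preg_last_error() returns a PREG_* constant such as PREG_BACKTRACK_LIMIT_ERROR.
        error_log( 'Comment regex failed, PCRE error code: ' . preg_last_error() );
        return null;
    }

    return 1 === $result;
}

var_dump( te_safe_match( '#\bfoo\b#', 'a foo b' ) ); // bool(true)
var_dump( te_safe_match( '#\bfoo\b#', 'food' ) );    // bool(false)

// Invalid UTF-8 with the u modifier makes preg_match() return false.
var_dump( te_safe_match( '#.#u', "\xFF" ) );         // NULL
```

In a pre_comment_approved callback, a null result is a reasonable cue to hold the comment for moderation, not to approve it.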

