WordPress ships two keyword textareas in Settings → Discussion ("Comment Moderation" and "Disallowed Comment Keys") plus a PHP filter, pre_comment_approved, that lets you run arbitrary PCRE against every incoming comment before it hits the database. Used together they cover most non-AI spam without an external service.
How do I moderate WordPress comments with regular expressions?
WordPress has two built-in keyword fields under Settings → Discussion:
- Comment Moderation: any comment matching a line here is held in the moderation queue.
- Disallowed Comment Keys (called "Comment Blacklist" before WordPress 5.5): matches are sent to the Trash, or marked as spam if the Trash is disabled.
Each line is matched as a case-insensitive literal substring. Both check_comment() and wp_check_comment_disallowed_list() pass the line through preg_quote() before calling preg_match(), so regex syntax such as \b word boundaries, character classes, lookaheads, and alternation is escaped and treated as plain text. For real PCRE, hook a function onto the pre_comment_approved filter and run preg_match() against $commentdata['comment_content'], returning 0 to hold, 'spam' to spam, 'trash' to discard, or 1 to auto-approve.
Jump to:
- The two built-in regex fields
- Real-world moderation patterns
- The pre_comment_approved filter
- Approve, hold, spam, or trash
- Testing patterns before they go live
- Plugin alternatives
- What to do next
- FAQ
The two built-in regex fields
Both live in /wp-admin/options-discussion.php:
| Field | Behavior when matched | Stored in option |
|---|---|---|
| Comment Moderation | Comment held for manual review | moderation_keys |
| Disallowed Comment Keys | Comment goes to the Trash (spam if the Trash is disabled) | disallowed_keys |
Internally WordPress reads each textarea line by line, trims it, and runs the equivalent of:
if ( preg_match( '#' . preg_quote( $word, '#' ) . '#i', $comment_content ) ) {
    // hold or trash
}

The preg_quote() call escapes every regex metacharacter ('?', '.', '+', '\', '(', '|' and so on), so adding viagra matches the substring viagra, and a line such as \b(viagra|cialis)\b only matches that exact run of characters. There is no regex mode for these fields: anchors, classes, \b, and lookaheads are all treated as literal text. Patterns that rely on them belong in the pre_comment_approved filter.
Each line is tested against the comment author name, email, URL, body, IP, and user agent, one field at a time. Useful when a spammer rotates the body but reuses the same throwaway email domain.
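As a rough illustration of that matching step, the sketch below quotes a line the same way and tests it against some text. The helper name te_line_matches() is hypothetical and only mirrors the preg_quote() wrapping shown above; it is not WordPress's own function.

```php
<?php
/**
 * Hypothetical helper: mimic how one textarea line is matched once
 * preg_quote() has escaped every regex metacharacter in it.
 */
function te_line_matches( string $line, string $text ): bool {
    $line = trim( $line );
    if ( '' === $line ) {
        return false; // Blank lines are skipped.
    }
    $pattern = '#' . preg_quote( $line, '#' ) . '#i';
    return 1 === preg_match( $pattern, $text );
}

// A plain keyword matches as a case-insensitive substring.
var_dump( te_line_matches( 'viagra', 'Cheap VIAGRA here' ) );
// Regex syntax is escaped, so \b is looked for as two literal characters.
var_dump( te_line_matches( '\bviagra\b', 'Cheap VIAGRA here' ) );
var_dump( te_line_matches( '\bviagra\b', 'literal \bviagra\b text' ) );
```

Running a candidate line through a helper like this is a quick way to confirm whether it behaves as a keyword or needs to move into a filter callback.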
Real-world moderation patterns
A short list of patterns I actually use, paired with what they catch.
Block common spam phrases with word boundaries
\b(viagra|cialis|payday loan|crypto pump)\b
\b is a word boundary: it stops specialist from matching on its embedded cialis, and likewise keeps the other terms from firing inside longer words. Word boundaries are the single biggest accuracy upgrade you can make to a keyword list.
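A small sketch of the difference, using the same alternation with and without \b. The function name te_has_spam_phrase() and the sample sentences are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: test a comment body against the phrase list,
 * optionally wrapping the alternation in \b word boundaries.
 */
function te_has_spam_phrase( string $text, bool $use_boundaries = true ): bool {
    $phrases = '(viagra|cialis|payday loan|crypto pump)';
    $pattern = $use_boundaries
        ? '#\b' . $phrases . '\b#i'
        : '#' . $phrases . '#i';
    return 1 === preg_match( $pattern, $text );
}

$legit = 'Ask a specialist before refinancing.';
$spam  = 'Cheap Cialis, no prescription!';

var_dump( te_has_spam_phrase( $legit, false ) ); // substring match fires inside "specialist"
var_dump( te_has_spam_phrase( $legit ) );        // word boundaries prevent that
var_dump( te_has_spam_phrase( $spam ) );         // the real spam phrase still matches
```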
Limit URL count per comment
(https?://[^\s]+.*?){3,}
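Matches any comment with three or more HTTP/HTTPS URLs, provided the pattern runs with the s modifier so that .*? can cross line breaks. Three is the threshold I default to; legitimate comments rarely include more than two links. Return 0 (hold for review) rather than 'spam' or 'trash' so a real reader doesn't get silently discarded for over-citing. Settings → Discussion also has a built-in "Hold a comment in the queue if it contains N or more links" option that covers the simple case without code.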
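An alternative to a single quantified pattern is to count URL matches with preg_match_all(), which sidesteps the line-break issue. This is a minimal sketch, not the only way to do it; te_count_urls(), te_hold_link_heavy(), and the default limit of two links are assumptions chosen for illustration.

```php
<?php
/**
 * Hypothetical helper: count http(s) URLs in a comment body.
 */
function te_count_urls( string $content ): int {
    // preg_match_all() returns the number of full matches, or false on error.
    $count = preg_match_all( '#https?://\S+#i', $content );
    return false === $count ? 0 : $count;
}

/**
 * Hypothetical pre_comment_approved callback: hold link-heavy comments.
 */
function te_hold_link_heavy( $approved, array $commentdata, int $max_links = 2 ) {
    if ( te_count_urls( $commentdata['comment_content'] ) > $max_links ) {
        return 0; // Hold for moderation.
    }
    return $approved;
}

// Register only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'pre_comment_approved', 'te_hold_link_heavy', 10, 2 );
}
```

Counting matches also makes the threshold a plain integer that is easy to tune later.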
Match suspicious email domains
@(mail\.ru|yandex\.ru|protonmail\.com|tutanota\.com)$
Anchored to $ so the domain has to end the address; run it against $commentdata['comment_author_email'] in a pre_comment_approved callback rather than against the comment body. This is blunt: a legitimate Proton user gets caught. I only ever hold these for review (return 0), never spam or trash them, and I review the queue daily. For a proper email pattern reference see regex match email address.
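A minimal sketch of that domain check against an email string. te_is_flagged_email() is a hypothetical name, and the domain list is simply the one from the pattern above.

```php
<?php
/**
 * Hypothetical helper: does the address end in one of the flagged domains?
 */
function te_is_flagged_email( string $email ): bool {
    $pattern = '#@(mail\.ru|yandex\.ru|protonmail\.com|tutanota\.com)$#i';
    // trim() removes stray whitespace so the $ anchor sees the real end.
    return 1 === preg_match( $pattern, trim( $email ) );
}

var_dump( te_is_flagged_email( 'ivan@mail.ru' ) );             // flagged
var_dump( te_is_flagged_email( 'a@mail.ru.example.org' ) );    // not flagged: domain does not end the address
var_dump( te_is_flagged_email( 'bob@gmail.com' ) );            // not flagged
```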
Detect mass non-Latin script spam
[\x{0400}-\x{04FF}\x{0590}-\x{05FF}\x{0600}-\x{06FF}]{20,}
Twenty or more consecutive Cyrillic, Hebrew, or Arabic characters. The \x{...} escapes only compile under the u (UTF-8) modifier, and spaces or punctuation break the run because they fall outside the character class, so widen the class if whole sentences should count. This is a controversial filter. A site whose audience legitimately writes in those scripts will lose real readers to it. I only use the pattern on English-only sites and only after seeing the same script-spam vector hit me three or more times in a week. Discussion threads on this exact trade-off show up regularly in the WordPress.org support forums and on the WordPress Stack Exchange; read the counter-arguments before deploying.
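A short sketch of the same character class compiled with the u modifier, which PCRE needs for code points above U+00FF. The function name te_has_long_script_run() and its $min_run parameter are assumptions for illustration.

```php
<?php
/**
 * Hypothetical helper: detect an unbroken run of Cyrillic, Hebrew,
 * or Arabic characters at least $min_run long.
 */
function te_has_long_script_run( string $content, int $min_run = 20 ): bool {
    $pattern = '/[\x{0400}-\x{04FF}\x{0590}-\x{05FF}\x{0600}-\x{06FF}]{' . $min_run . ',}/u';
    // preg_match() returns false on invalid UTF-8 input; treat that as "no match".
    return 1 === preg_match( $pattern, $content );
}

$run   = str_repeat( "\u{044F}", 25 );              // 25 Cyrillic letters in a row
$short = "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}, thanks for the post"; // one short word

var_dump( te_has_long_script_run( $run ) );
var_dump( te_has_long_script_run( $short ) );
```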
Match a known link-injection pattern
\[url=.*?\].*?\[/url\]
BBCode-style link injection. The [url=...] syntax isn't rendered by WordPress, but spambots paste it anyway because the same payload gets reused across vBulletin / phpBB targets. Match-and-spam.
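As a sketch, the BBCode pattern can be wired into a pre_comment_approved callback like this. The callback name te_spam_bbcode_links() is hypothetical, and the s modifier is added here so that a payload spanning several lines still matches.

```php
<?php
/**
 * Hypothetical pre_comment_approved callback: mark BBCode link payloads as spam.
 */
function te_spam_bbcode_links( $approved, array $commentdata ) {
    $pattern = '#\[url=.*?\].*?\[/url\]#is';
    if ( 1 === preg_match( $pattern, $commentdata['comment_content'] ) ) {
        return 'spam';
    }
    return $approved;
}

// Register only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'pre_comment_approved', 'te_spam_bbcode_links', 10, 2 );
}
```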
The pre_comment_approved filter
The textareas in the admin cover literal keyword matching. For real regex patterns and anything stateful (rate limits, dynamic blocklists, IP geolocation), use the pre_comment_approved filter in code:
function te_check_comment( $approved, $commentdata ) {
    $content = $commentdata['comment_content'];
    $pattern = '#\b(buy now|click here|free trial)\b#i';
    if ( preg_match( $pattern, $content ) ) {
        $approved = 0; // hold for moderation
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_check_comment', 10, 2 );

The filter receives the approval status WordPress already computed (0, 1, 'spam', 'trash', or WP_Error) plus the full $commentdata array. Whatever you return becomes the final status.
Drop this in wp-content/mu-plugins/comment-moderation.php (must-use plugin — loads automatically, can't be deactivated by a wp-admin user) or in your theme's functions.php if the site is theme-locked.
Approve, hold, spam, or trash
The four return values map to four queue destinations. Same regex, different $approved:
Hold for moderation
// Each variant has its own function name so they can share one file.
function te_hold_comment( $approved, $commentdata ) {
    if ( preg_match( '#\b(viagra|cialis)\b#i', $commentdata['comment_content'] ) ) {
        $approved = 0;
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_hold_comment', 10, 2 );

Mark as spam
function te_spam_comment( $approved, $commentdata ) {
    if ( preg_match( '#\b(viagra|cialis)\b#i', $commentdata['comment_content'] ) ) {
        $approved = 'spam';
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_spam_comment', 10, 2 );

Send straight to trash
function te_trash_comment( $approved, $commentdata ) {
    if ( preg_match( '#\b(viagra|cialis)\b#i', $commentdata['comment_content'] ) ) {
        $approved = 'trash';
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_trash_comment', 10, 2 );

Force approve (whitelist a known-good pattern)
function te_approve_comment( $approved, $commentdata ) {
    if ( preg_match( '#@mycompany\.com$#i', $commentdata['comment_author_email'] ) ) {
        $approved = 1;
    }
    return $approved;
}
add_filter( 'pre_comment_approved', 'te_approve_comment', 10, 2 );

Useful for a corporate site where employees commenting from the company domain shouldn't get caught by Akismet false positives. Bear in mind that the email field is user-supplied for logged-out commenters, so a spammer can type in the same domain.
Every comment still hits wp_comments with the resolved status, so you can audit later via WP_Comment_Query even for trashed entries.
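When several of these rules run side by side, one option is a single callback that walks an ordered rule list and returns the status of the first match. This is a sketch under assumptions: te_classify_comment() and the rule list are hypothetical, and the choice to leave 'spam', 'trash', and WP_Error results from upstream untouched is a design decision of this example, not something WordPress requires.

```php
<?php
/**
 * Hypothetical pre_comment_approved callback: first matching rule wins.
 */
function te_classify_comment( $approved, array $commentdata ) {
    // Only reconsider comments currently approved (1) or pending (0).
    if ( 0 !== $approved && 1 !== $approved ) {
        return $approved;
    }

    // Each rule: field to inspect, pattern, status to return on a match.
    $rules = array(
        array( 'comment_author_email', '#@mycompany\.com$#i', 1 ),
        array( 'comment_content', '#\[url=.*?\].*?\[/url\]#is', 'spam' ),
        array( 'comment_content', '#\b(viagra|cialis)\b#i', 0 ),
    );

    foreach ( $rules as list( $field, $pattern, $status ) ) {
        $value = isset( $commentdata[ $field ] ) ? (string) $commentdata[ $field ] : '';
        if ( 1 === preg_match( $pattern, $value ) ) {
            return $status;
        }
    }
    return $approved;
}

// Register only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'pre_comment_approved', 'te_classify_comment', 10, 2 );
}
```

Putting the allow rule first means a match on the company domain short-circuits the spam rules, which mirrors the force-approve example above.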
Testing patterns before they go live
Never deploy a regex to a production pre_comment_approved filter without testing it against real comment content first. The cost of a bad pattern is silent suppression of legitimate comments — readers don't email you to say "my comment never appeared", they just stop visiting.
The workflow I use:
- Export the last 500 approved comments from a staging or local copy: wp comment list --status=approve --number=500 --format=csv --fields=comment_ID,comment_content > approved.csv.
- Paste a representative comment into regex101.com (set the flavor to PCRE / PCRE2).
- Iterate the pattern until it matches the spam variants and zero legitimate samples.
- Stage the pattern in a pre_comment_approved filter that logs rather than blocks: error_log( 'Would have moderated: ' . $commentdata['comment_content'] );.
- Tail the log for a week. If the false-positive rate is acceptable, flip from logging to blocking.
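The dry-run and log-only steps above can be sketched as two small functions. Both names, te_false_positives() and te_log_only(), are hypothetical, and loading the exported CSV into an array keyed by comment ID is left out to keep the example short.

```php
<?php
/**
 * Hypothetical helper: return the approved comments a pattern would have caught.
 *
 * @param string $pattern           Full PCRE pattern, delimiters included.
 * @param array  $approved_comments Comment bodies keyed by comment ID.
 */
function te_false_positives( string $pattern, array $approved_comments ): array {
    $hits = array();
    foreach ( $approved_comments as $id => $content ) {
        if ( 1 === preg_match( $pattern, (string) $content ) ) {
            $hits[ $id ] = $content;
        }
    }
    return $hits;
}

/**
 * Hypothetical log-only callback: records what would be moderated
 * but always returns the status unchanged.
 */
function te_log_only( $approved, array $commentdata ) {
    $pattern = '#\b(buy now|click here|free trial)\b#i';
    if ( 1 === preg_match( $pattern, $commentdata['comment_content'] ) ) {
        error_log( 'Would have moderated: ' . $commentdata['comment_content'] );
    }
    return $approved;
}

// Register only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'pre_comment_approved', 'te_log_only', 10, 2 );
}
```

An empty array from te_false_positives() on the approved sample is the signal that a pattern is ready for the log-only stage.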
For broader pattern reference I keep the regex cheat sheet and URL matching reference open in tabs while iterating.
Plugin alternatives
Regex moderation handles 80% of low-effort spam, but it doesn't scale to AI-generated comments that read like real prose. Tier the defense:
| Tool | Catches | When to use |
|---|---|---|
| Built-in keyword fields | Substring + keyword spam | Always; it's free and built into core |
| pre_comment_approved filter | Regex patterns and custom logic (rate limits, geo, signed-in users) | When the keyword fields aren't expressive enough |
| Akismet (Automattic) | Statistical / ML spam, including AI-generated | Default for any non-trivial WordPress site; free for personal use |
| Antispam Bee | Honeypot, BBCode, language heuristics | GDPR-friendly Akismet alternative |
| Cloudflare Turnstile / hCaptcha | Bot signatures, no user friction | When bots are submitting comments faster than humans could type |
Akismet's hosted classification API is what catches the modern "this comment praises your article in fluent English but the URL is a casino affiliate" pattern. Regex can't beat that — but regex IS what catches the 1990s-style payload spam that still makes up most of the queue.
For administrative recovery if a bad rule locks out genuine commenters, see how to change a WordPress password for getting back into admin to disable the filter, and how to increase the PHP memory limit if a runaway regex pattern is timing out PHP on long comment bodies.
What to do next
If you're building a comment-moderation system, the regex cluster on this site covers the patterns you'll reach for most often:
- Regex cheat sheet: single-page PCRE reference for quick lookups while writing moderation rules.
- Regex word boundaries: \b is the single biggest accuracy upgrade for keyword lists.
- Regex match email address: for filtering by author email patterns.
- Regex match URL: for limiting URL counts or matching link-injection payloads.
WordPress-side, two adjacent references:
- Change a WordPress password — admin recovery if a bad rule locks you out.
- WordPress wp_insert_post memory deep dive — if you're importing historical comments at scale.





