TechEarl

How to Use Capturing Groups and Backreferences in Regex

Capturing groups, named groups, non-capturing groups, and backreferences in regex. JavaScript / Python / PHP examples, engine notes, common mistakes, and the duplicate-word and swap-fields use cases.

Ishan Karunaratne · 11 min read

Parentheses in regex do two things at once. They group a sub-pattern so a quantifier applies to the whole group, and they capture the matched substring so you can pull it out after the match. (\d+) matches one or more digits and lets you grab those digits via match[1] (JavaScript), m.group(1) (Python), or $1 in a replacement string. The variants from there: named groups (?<name>...), non-capturing groups (?:...), and backreferences \1 or \k<name> that match the same text the group captured. Below I walk through all four with the most common practical use cases (duplicate-word detection, swapping fields with a replacement, structured parsing), engine notes per language, and the bugs I've shipped.

The reason this feature is everywhere in real regex code: most useful patterns aren't just matching, they're extracting. You match a URL to extract the domain. You match a log line to pull the timestamp. You match a date to capture the month. Capturing groups are how.

Quick reference

Syntax                 Purpose
(pattern)              Capturing group, numbered left-to-right starting at 1
(?:pattern)            Non-capturing group (grouping only)
(?<name>pattern)       Named capturing group
\1, \2                 Backreference to a numbered group
\k<name>               Backreference to a named group
$1, $2 (replacement)   Insert the captured text into a replacement string

Basic capturing groups

code
(\d{4})-(\d{2})-(\d{2})

Three groups: year, month, day. After a match against 2025-10-29, the groups contain:

  • Group 0 (the full match): 2025-10-29
  • Group 1: 2025
  • Group 2: 10
  • Group 3: 29

Groups are numbered left-to-right by their opening parenthesis, starting at 1.
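
The "numbered by opening parenthesis" rule matters most when groups are nested. The sketch below is an illustration only: it wraps year and month in an extra outer group (not part of the pattern above) to show how Python's re module assigns the numbers.

```python
import re

# The outer group opens first, so it becomes group 1;
# the groups inside it are numbered 2 and 3, and the day is group 4.
pattern = re.compile(r"((\d{4})-(\d{2}))-(\d{2})")
m = pattern.search("2025-10-29")

full = m.group(0)        # the whole match
year_month = m.group(1)  # outer group
year = m.group(2)
month = m.group(3)
day = m.group(4)

print(full, year_month, year, month, day)
# 2025-10-29 2025-10 2025 10 29
```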

Named groups

For complex patterns with many groups, names are easier to read than numbers:

code
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})

In code, access via match.groups.year (JavaScript), m.group("year") (Python), or $matches['year'] (PHP). The numbered access (match[1], etc.) still works in parallel.

Different engines use different syntax for named groups:

Engine        Named group syntax              Named backreference
JavaScript    (?<name>...)                    \k<name>
Python (re)   (?P<name>...)                   (?P=name)
PHP (PCRE)    (?<name>...) or (?P<name>...)   \k<name> or (?P=name)
Ruby          (?<name>...)                    \k<name>
.NET          (?<name>...)                    \k<name>

For cross-engine portability, prefer (?<name>...), which most engines accept. The main exception is Python's standard re module, which needs (?P<name>...) for the group and (?P=name) for the backreference.
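
As a rough sketch of named groups and a named backreference in Python, using the (?P<name>...) and (?P=name) spellings: the quoted-string pattern and the names quote and body are my own illustration, not taken from the examples above.

```python
import re

# Named groups: groupdict() returns every named capture in one dictionary.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_re.search("released 2025-10-29")
parts = m.groupdict()
print(parts)  # {'year': '2025', 'month': '10', 'day': '29'}

# Numbered access still works alongside the names.
print(m.group(1) == m.group("year"))  # True

# Named backreference: the closing quote must be the same character
# that the "quote" group captured.
quoted_re = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")
q = quoted_re.search("say 'hello' now")
body = q.group("body")
print(body)  # hello
```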

Non-capturing groups

If you want grouping (for | alternation or to apply a quantifier) without capturing, put ?: right after the opening parenthesis:

code
(?:https?|ftp):\/\/

The group is needed because of the alternation, but you don't care about the captured value separately. Without ?:, this would use up group 1 for https / http / ftp and shift all your subsequent groups by one.

Use non-capturing groups whenever you don't actually need the captured value. It makes intent clearer and avoids polluting the numbered groups.
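
To see the numbering effect, here is a small Python comparison. The URL and the extra host group ([^/]+) are assumptions added for the demonstration.

```python
import re

url = "https://example.com/docs"

# With a capturing scheme group, the host ends up in group 2.
capturing = re.match(r"(https?|ftp)://([^/]+)", url)
print(capturing.groups())      # ('https', 'example.com')

# With a non-capturing scheme group, the host is group 1.
non_capturing = re.match(r"(?:https?|ftp)://([^/]+)", url)
print(non_capturing.groups())  # ('example.com',)
```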

Backreferences: match the same text again

A backreference matches the same text that an earlier capturing group matched. Syntax: \1, \2, etc. for numbered groups; \k<name> for named groups.

Duplicate-word detection:

code
\b(\w+)\s+\1\b

This matches "the the" in a sentence. (\w+) captures a word, and \s+\1 requires whitespace followed by the same word again. The \b anchors prevent partial hits such as "the then". Useful for editorial scanning.
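
A minimal Python sketch of an editorial scan built on this pattern. The helper name find_duplicate_words and the re.IGNORECASE flag (so that "The the" is also caught) are my additions.

```python
import re

# Same pattern as above; IGNORECASE is an assumption so "The the" also counts.
DUPLICATE_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_duplicate_words(text):
    """Return (word, start_offset) for each doubled word in text."""
    return [(m.group(1), m.start()) for m in DUPLICATE_WORD.finditer(text)]

sample = "This is is a test of the the pattern."
hits = find_duplicate_words(sample)
print(hits)  # [('is', 5), ('the', 21)]
```

Note that "This is" does not count as a hit: the leading \b stops the pattern from starting in the middle of "This".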

Same-tag HTML matching:

code
<(h[1-6])([^>]*)>(.*?)<\/\1>

The \1 ensures the closing tag has the same heading level as the opening tag. If the closing tag were written as <\/h[1-6]> instead, <h1>...</h3> would match (which is invalid HTML).
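
A quick Python check of that behaviour, as a sketch for short, well-formed snippets only (regex is not a general HTML parser). The sample strings and the re.DOTALL flag are assumptions.

```python
import re

# DOTALL lets the heading text span several lines (an assumption for this sketch).
HEADING = re.compile(r"<(h[1-6])([^>]*)>(.*?)</\1>", re.DOTALL)

good = HEADING.search('<h2 class="title">Groups</h2>')
print(good.groups())  # ('h2', ' class="title"', 'Groups')

# Mismatched levels: \1 is "h1", so "</h3>" cannot close the match.
bad = HEADING.search("<h1>Groups</h3>")
print(bad)  # None
```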

Backreferences in replacement strings

The same backreferences work in replacement strings for find-and-replace operations:

code
Find: (\w+) (\w+)
Replace: $2 $1

This swaps two space-separated words. Against Hello World, it produces World Hello. Replacement syntax varies by engine: JavaScript uses $1 and $2, Python's re.sub uses \1 and \2, and PHP's preg_replace accepts either form. Named-group replacements use $<name>, ${name}, or \g<name> depending on the engine.
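
In Python specifically, the replacement forms look roughly like this. The group names first and last are hypothetical, chosen only for the example.

```python
import re

name_re = re.compile(r"(?P<first>\w+) (?P<last>\w+)")

# Numbered references in the replacement string.
swapped_numbered = name_re.sub(r"\2 \1", "Hello World")
print(swapped_numbered)  # World Hello

# Named references use \g<name>.
swapped_named = name_re.sub(r"\g<last> \g<first>", "Hello World")
print(swapped_named)  # World Hello

# \g<1>0 means "group 1, then a literal 0"; a bare \10 would be read as group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)  # a10b20
```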

Practical use cases

Find duplicate consecutive words:

code
\b(\w+)\s+\1\b

Swap "Lastname, Firstname" to "Firstname Lastname":

code
Find:    (\w+),\s+(\w+)
Replace: $2 $1

Parse "name=value" pairs:

code
(\w+)=("[^"]*"|'[^']*'|[^\s]+)

The two groups give you the key and the value (with quotes still attached if present).
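
One way to turn those two groups into a dictionary in Python, as a sketch: the parse_pairs helper and its quote-stripping step are my additions, not part of the pattern itself.

```python
import re

# Same pattern as above, written as a triple-quoted raw string
# so both quote characters can appear unescaped.
PAIR = re.compile(r"""(\w+)=("[^"]*"|'[^']*'|[^\s]+)""")

def parse_pairs(line):
    """Return a dict of key/value pairs, with surrounding quotes removed."""
    result = {}
    for key, value in PAIR.findall(line):
        # Strip one pair of matching quotes if the value is quoted.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        result[key] = value
    return result

parsed = parse_pairs("user=alice title=\"Senior Engineer\" note='on call'")
print(parsed)
# {'user': 'alice', 'title': 'Senior Engineer', 'note': 'on call'}
```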

Wrap HTML hex colour values in <code> tags:

code
Find:    #([0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})\b
Replace: <code>#$1</code>

Convert dates from MM/DD/YYYY to YYYY-MM-DD:

code
Find:    (\d{2})\/(\d{2})\/(\d{4})
Replace: $3-$1-$2
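
The same conversion in Python uses \3-\1-\2 in the replacement. This is a sketch; the to_iso helper and the added \b anchors (so the pattern does not fire inside longer digit runs) are assumptions.

```python
import re

# \b on both sides is an addition to the pattern above.
US_DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")

def to_iso(text):
    """Rewrite every MM/DD/YYYY date in text as YYYY-MM-DD."""
    return US_DATE.sub(r"\3-\1-\2", text)

converted = to_iso("Due 10/29/2025, paid 04/15/2010")
print(converted)  # Due 2025-10-29, paid 2010-04-15
```

The pattern only reorders digits. It does not check that the month and day are valid.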

Examples in JavaScript, Python, and PHP

JavaScript:

javascript
const text = "Born 1985-10-29, registered 2010-04-15";

// Numbered groups
const isoDate = /(\d{4})-(\d{2})-(\d{2})/g;
let m;
while ((m = isoDate.exec(text)) !== null) {
  console.log(`year=${m[1]}, month=${m[2]}, day=${m[3]}`);
}

// Named groups
const named = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
for (const match of text.matchAll(named)) {
  console.log(`year=${match.groups.year}, day=${match.groups.day}`);
}

// Swap two words
"Hello World".replace(/(\w+) (\w+)/, "$2 $1");  // 'World Hello'

Python:

python
import re

text = "Born 1985-10-29, registered 2010-04-15"

# Numbered groups
for m in re.finditer(r"(\d{4})-(\d{2})-(\d{2})", text):
    print(f"year={m.group(1)}, month={m.group(2)}, day={m.group(3)}")

# Named groups
for m in re.finditer(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text):
    print(f"year={m.group('year')}, day={m.group('day')}")

# Swap two words
re.sub(r"(\w+) (\w+)", r"\2 \1", "Hello World")  # 'World Hello'

PHP:

php
$text = "Born 1985-10-29, registered 2010-04-15";

// Numbered groups
preg_match_all('/(\d{4})-(\d{2})-(\d{2})/', $text, $matches);
// $matches[1] = ['1985', '2010'], $matches[2] = ['10', '04'], $matches[3] = ['29', '15']

// Named groups
preg_match_all('/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/', $text, $matches);
// $matches['year'] = ['1985', '2010']

// Swap two words
preg_replace('/(\w+) (\w+)/', '$2 $1', "Hello World");  // 'World Hello'

Engine compatibility

Numbered groups work everywhere. Named groups and backreferences have meaningful per-engine quirks.

Engine                     Numbered groups   Named groups                    Backreferences                           Replacement syntax
JavaScript                 Works             (?<name>...) (ES2018+)          \1 in pattern, $1 in replacement         $1, $<name>
Python (re)                Works             (?P<name>...)                   \1 in pattern, \1 in replacement         \1, \g<name>
Python (regex pkg)         Works             All forms                       Works                                    All forms
PHP (PCRE)                 Works             (?<name>...) or (?P<name>...)   \1 in pattern, $1 or \1 in replacement   $1, \1, ${1} (numbered only)
Java                       Works             (?<name>...)                    \1 in pattern, $1 in replacement         $1, ${name}
.NET                       Works             (?<name>...)                    \1 in pattern, $1 in replacement         $1, ${name}
Go (RE2)                   Works             (?P<name>...) or (?<name>...)   Not supported                            $1, ${name}
Rust (regex crate)         Works             (?P<name>...) or (?<name>...)   Not supported                            $1, ${name}
Ruby                       Works             (?<name>...)                    \1 in pattern, \1 in replacement         \1, \k<name>
POSIX BRE (sed, grep)      \(...\) only      Not supported                   \1 in pattern, \1 in replacement         \1
POSIX ERE (grep -E, awk)   (...)             Not supported                   Varies (BSD vs GNU)                      \1

Two notes on the table. Go accepts (?<name>...) from Go 1.22 and the Rust regex crate accepts it in recent releases; older versions of both need (?P<name>...). PHP's preg_replace has no named references in the replacement string, so use numbered references or preg_replace_callback there.

The most important caveat: Go and Rust do not support backreferences in the pattern. So \b(\w+)\s+\1\b does not work in those engines. The workaround is to find all words with a non-backreference pattern, then check adjacent pairs in code.
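
The workaround needs no backreference, so it ports to any engine. Below is a sketch of the logic in Python (the language used for the added examples here, not Go or Rust). The function name adjacent_duplicates is hypothetical.

```python
import re

WORD = re.compile(r"\w+")  # plain word matcher, no backreference

def adjacent_duplicates(text):
    """Return (word, start_offset) for each word repeated right after itself."""
    words = list(WORD.finditer(text))
    hits = []
    for prev, cur in zip(words, words[1:]):
        gap = text[prev.end():cur.start()]
        # Mirror \s+ from the regex version: only whitespace may separate the pair.
        if prev.group() == cur.group() and gap.isspace():
            hits.append((cur.group(), prev.start()))
    return hits

found = adjacent_duplicates("the the cat sat sat.")
print(found)  # [('the', 0), ('sat', 12)]
```

One difference from the regex version: on a triple such as "the the the" this reports two overlapping pairs, where the backreference pattern reports a single non-overlapping match.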

Common mistakes

The bugs I see most often.

Forgetting that adding a new group renumbers everything to its right. If your replacement was $3-$1-$2 and you add a new capturing group earlier in the pattern, $3 now points at something else. Either use non-capturing groups (?:...) for everything you don't need to extract, or switch to named groups so additions don't reorder anything.

Backslash escape confusion in the replacement string. Python uses \1 in replacements, JavaScript uses $1, and PHP accepts both. If you copy a Python re.sub call to JavaScript and forget to convert, JavaScript does not treat \1 as a group reference, so the captured text never appears in the output. Match the syntax to the engine.

Using a backreference where none is captured. (?:foo)\1 does not work because (?:...) is non-capturing, so \1 has no group to refer to. Use a capturing group (foo)\1 if you need the backreference.

Greedy capture across a delimiter. (.*),(.*) against a,b,c captures group 1 as a,b and group 2 as c because .* is greedy. Use ([^,]*),(.*) to capture up to the first comma instead.
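
The difference is easy to confirm in Python. The lazy (.*?) variant is an extra option I've added for comparison.

```python
import re

line = "a,b,c"

# Greedy: the first group swallows everything up to the LAST comma.
greedy = re.match(r"(.*),(.*)", line).groups()
print(greedy)       # ('a,b', 'c')

# Negated class: the first group stops at the FIRST comma.
first_comma = re.match(r"([^,]*),(.*)", line).groups()
print(first_comma)  # ('a', 'b,c')

# Lazy quantifier: same result here, but it gets there by backtracking.
lazy = re.match(r"(.*?),(.*)", line).groups()
print(lazy)         # ('a', 'b,c')
```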

Trying to use variable-width groups or backreferences in lookbehinds in engines that restrict lookbehind width. Python's stdlib re requires fixed-width lookbehinds and Java requires a bounded maximum width, while .NET accepts variable-width lookbehinds. In Python, a pattern like (?<=(\w+)) fails to compile because the width of the group depends on the input. Use the regex package in Python for variable-width lookbehinds, or restructure the pattern.

Capturing inside a quantifier and expecting to get all matches. A repeated group keeps only the last iteration's text: (\d+,?)+ against 1,2,3 leaves just 3 in group 1. To get all matches, run findall / matchAll / preg_match_all over the unrolled pattern, or split on a delimiter first.
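
A short Python illustration of the "last iteration wins" behaviour. The comma-separated input and the two patterns are examples I've chosen for the demonstration.

```python
import re

data = "1,2,3"

# The group repeats three times, but group 1 keeps only the final iteration.
m = re.fullmatch(r"(?:(\d+),?)+", data)
last_only = m.group(1)
print(last_only)    # 3

# Unrolled: match one item at a time and collect every hit.
all_numbers = re.findall(r"\d+", data)
print(all_numbers)  # ['1', '2', '3']
```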

Test cases

Pattern                   Input                 Groups
(\d{4})-(\d{2})-(\d{2})   2025-10-29            2025, 10, 29
(?<y>\d{4})               2025                  groups.y = 2025
(\w+),\s*(\w+)            Smith, Alice          Smith, Alice
\b(\w+)\s+\1\b            the the cat           Matches "the the", group 1 is "the"
(?:https?|ftp):\/\/       https://example.com   No groups (non-capturing)
(.)(.)(.)\3\2\1           abccba                a, b, c (palindrome of length 6)
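
These rows can be checked mechanically. The sketch below runs them through Python's re module, with the named-group row rewritten as (?P<y>...) because that is the spelling re expects. The run_cases helper is hypothetical.

```python
import re

# (pattern, input, expected groups) taken from the table above.
CASES = [
    (r"(\d{4})-(\d{2})-(\d{2})", "2025-10-29", ("2025", "10", "29")),
    (r"(?P<y>\d{4})", "2025", ("2025",)),
    (r"(\w+),\s*(\w+)", "Smith, Alice", ("Smith", "Alice")),
    (r"\b(\w+)\s+\1\b", "the the cat", ("the",)),
    (r"(?:https?|ftp):\/\/", "https://example.com", ()),
    (r"(.)(.)(.)\3\2\1", "abccba", ("a", "b", "c")),
]

def run_cases(cases):
    """Return the cases whose captured groups differ from the expectation."""
    failures = []
    for pattern, text, expected in cases:
        m = re.search(pattern, text)
        actual = m.groups() if m else None
        if actual != expected:
            failures.append((pattern, text, actual))
    return failures

failures = run_cases(CASES)
print("all passed" if not failures else failures)
```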

See also

External reference: the MDN regex groups reference covers JavaScript specifics; test at regex101.com.


Ishan Karunaratne

Software Systems Architect · Senior Software Engineer · Engineering Leadership

Software systems architect and senior software engineer with more than two decades designing, building, and running production software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Now a CTO, though what I write here is drawn from the full arc of that work, across architecture, engineering, and operations, not any single job.
