TechEarl

Regex Word Boundaries: \b, \B, and Lookaround Equivalents

Regex word boundaries (\b and \B) match positions between word and non-word characters with zero width. The full reference with engine differences, Unicode handling, lookaround alternatives, and worked examples for whole-word replace, search highlighting, and log parsing.

Ishan Karunaratne · 13 min read

\b is the zero-width assertion that matches the position between a word character and a non-word character: the start of "cat" in "the cat" or the end of "cat" in "cat food", but not anywhere inside "scatter". \B is its negation, a position where there is no word boundary. Both are zero-width. They consume no characters, they only assert position. Below is the full reference: definition, the exact positions that match, per-engine support (JavaScript, Python, Java, PCRE, .NET, POSIX, Go, Rust), Unicode caveats, the lookaround alternative when you need finer control, and worked examples for the cases I use word boundaries for most often.

What are regex word boundaries?

A regex word boundary is a zero-width assertion that matches at a position where a word character (\w, meaning a letter, digit, or underscore) sits next to a non-word character (\W, meaning anything else), or at the start or end of the string. The \b metacharacter matches that position; \B matches every position that is not a word boundary. Because they're zero-width, \b and \B consume no characters. They only assert that the current position satisfies the condition. The classic use is "match a whole word, not the substring inside another word": \bcat\b matches "cat" in "the cat sat" but not "cat" in "scatter" or "category". The same idea extends to log parsing (token-precise matches), refactor tools (rename only whole identifiers), and search-and-replace (don't mangle "Java" into "Python" inside "JavaScript").
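As a quick illustration of the whole-word idea, here is a minimal Python sketch using the standard `re` module. The helper name `has_whole_word` is made up for this example and is not part of any library.

```python
import re

def has_whole_word(word: str, text: str) -> bool:
    """Return True if `word` occurs in `text` as a standalone word."""
    # re.escape keeps any regex metacharacters in `word` literal.
    # \b only behaves as expected here when `word` itself starts and
    # ends with a word character.
    return re.search(rf"\b{re.escape(word)}\b", text) is not None

print(has_whole_word("cat", "the cat sat"))  # True
print(has_whole_word("cat", "scatter"))      # False
print(has_whole_word("cat", "category"))     # False
```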


Word characters: what counts as \w

Before word boundaries make sense, we need to be precise about what the engine considers a "word character". In ASCII mode, which is the default in several engines, \w is shorthand for [a-zA-Z0-9_]:

| Character class | Members |
| --- | --- |
| Letters | a-z, A-Z |
| Digits | 0-9 |
| Underscore | _ |

Yes, the underscore is a word character. This catches people off guard: in snake_case the _ does not create a boundary, so \bsnake\b finds nothing there. In snake-case the - is a non-word character, so the engine sees two separate words and \bsnake\b matches.
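The underscore behaviour is easy to confirm with a short Python check. This is only a sketch with made-up variable names:

```python
import re

# "_" is a word character, so snake_case is a single word and there is
# no boundary on either side of the underscore.
underscore_hits = re.findall(r"\bsnake\b", "snake_case")

# "-" is a non-word character, so snake-case is two separate words.
hyphen_hits = re.findall(r"\bsnake\b", "snake-case")

words = re.findall(r"\b\w+\b", "snake_case snake-case")

print(underscore_hits)  # []
print(hyphen_hits)      # ['snake']
print(words)            # ['snake_case', 'snake', 'case']
```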

Anything not in \w is in \W: spaces, tabs, punctuation, hyphens, and every non-ASCII letter or symbol depending on the engine and its Unicode mode.

Where \b matches: the four positions

A word boundary exists at any position where one side is a word character and the other side is not (including string start/end). The four cases:

| Position | Example | Result |
| --- | --- | --- |
| Start of string before a word char | \bcat against "cat sat" | matches at index 0 |
| End of string after a word char | cat\b against "the cat" | matches at index 7 |
| After non-word, before word | \bcat\b against "the cat sat" | matches the standalone cat |
| After word, before non-word | same as above, the trailing edge of cat | matches |

What \b does NOT match: positions where both sides are word characters (\bcat\b against "scatter" finds no match) or both sides are non-word characters ("!@#" has no word boundaries).
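One way to see these positions directly is to ask the engine for every index where \b matches. A minimal Python sketch (the function name `boundary_positions` is hypothetical):

```python
import re

def boundary_positions(text):
    """Return every index in `text` at which a word boundary exists."""
    # \b is zero-width, so each match is an empty string at one position.
    return [m.start() for m in re.finditer(r"\b", text)]

print(boundary_positions("the cat"))  # [0, 3, 4, 7]
print(boundary_positions("scatter"))  # [0, 7]
print(boundary_positions("!@#"))      # []
```

Index 7 in "the cat" is the end of the string, which counts as a boundary because the last character is a word character.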

| Pattern | Input | Matches | Doesn't match |
| --- | --- | --- | --- |
| \bbook\b | the book! | book | |
| \bbook\b | notebook | | the substring is inside a word |
| \bbook\b | bookstore | | trailing edge is word-to-word |
| \bdog\b | bulldog ran | | leading edge is word-to-word |
| \b\d+\b | room 42 floor 3 | 42, 3 | |
| \b[A-Z]+\b | using SQL today | SQL | |

Negated word boundary: \B

\B matches every position that \b does NOT, namely positions where both sides are word characters, or both sides are non-word characters. It's the assertion for "match this substring only when it is inside a word":

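| Pattern | Input | Matches | Why |
| --- | --- | --- | --- |
| \Bis\B | this island | no match | is ends this and starts island, so one edge is always a boundary |
| \Boo\B | book | oo inside book | both sides are word chars |
| \Bing\B | singing | the first ing (middle) | both sides are word chars; the final ing ends at a boundary |
| ing\B | singing | the first ing, not the one at the end | end of string after a word char is a boundary, so \B fails there |
| \Bed | bedding | ed inside bedding | left side is a word char |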

The classic use: highlight every occurrence of a sequence as a substring but skip the standalone form. \Bing\b matches ing at the end of a word but not the standalone word "ing".
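The \B cases can be checked the same way. The sketch below prints match spans rather than text so it is clear which occurrence matched; `spans` is a made-up helper:

```python
import re

def spans(pattern, text):
    """Return (start, end) for each non-overlapping match."""
    return [m.span() for m in re.finditer(pattern, text)]

print(spans(r"\Boo\B", "book"))          # [(1, 3)]
print(spans(r"\Bing\B", "singing"))      # [(1, 4)]  the first "ing" only
print(spans(r"\Bis\B", "this island"))   # []  one edge is always a boundary
print(spans(r"\Bing\b", "singing ing"))  # [(4, 7)]  word-final, not standalone
```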

Engine compatibility

Word boundary support varies by engine, especially around Unicode. The differences matter when you're writing patterns that need to work in more than one runtime.

| Engine | \b and \B | Unicode-aware by default | Notes |
| --- | --- | --- | --- |
| JavaScript | yes | no | \w, \b, and \B are ASCII-only, and the u and v flags do not change that. For Unicode-aware whole-word matching, combine the u flag with property escapes in lookarounds, e.g. (?<![\p{L}\p{N}_])ä(?![\p{L}\p{N}_]). |
| Python (re) | yes | yes for str, no for bytes | In Python 3, re.UNICODE is the default for str patterns; explicit re.A (re.ASCII) reverts to ASCII-only. For bytes patterns the default is ASCII. |
| Python (regex package) | yes | yes by default | The third-party regex package is Unicode-aware by default and exposes finer control with the (?V1) version flag. |
| Java (java.util.regex) | yes | no by default | \w and \b are ASCII unless you set Pattern.UNICODE_CHARACTER_CLASS (or use (?U) inline). The UNICODE_CASE flag is separate and only affects case-insensitive matching. |
| PCRE / PCRE2 / PHP | yes | optional | Enable with the start-of-pattern option (*UCP), the PHP pattern modifier u (e.g., /pattern/u), or the PCRE2_UCP compile option. There is no (?UCP) inline flag. |
| .NET (System.Text.RegularExpressions) | yes | yes by default | Word characters and boundaries are Unicode-aware out of the box. RegexOptions.ECMAScript restricts both to ASCII. |
| Ruby (Onigmo) | yes | partly | \w matches only [a-zA-Z0-9_] by default, even on UTF-8 strings. Use [[:word:]] or \p{Word} for Unicode word characters, and test \b against non-ASCII input on your Ruby version, since it does not necessarily follow \w. |
| Go (regexp, RE2) | yes | no | Go's regexp is ASCII for \w and \b and has no Unicode-word flag. RE2 has no lookarounds either, so for Unicode word matching build the class explicitly with [\p{L}\p{N}_] and match the neighbours with groups such as (?:^\|[^\p{L}\p{N}_]), or use the regexp2 third-party package. |
| Rust (regex crate) | yes | yes when Unicode is enabled (default) | The default regex crate build uses Unicode \w and Unicode-aware \b. The regex-lite and no-default-features builds drop back to ASCII. |
| POSIX BRE / ERE | no | n/a | Not part of the standard. Use \< and \> (GNU tools, vim) or [[:<:]] and [[:>:]] (BSD/macOS) for whole-word matching. GNU grep and GNU sed also accept \b as an extension. |

For the broader cross-engine reference, see the Regex Cheat Sheet.

Unicode word boundaries

ASCII-only \b produces surprising results on non-English text. Examples:

javascript
// JavaScript's \b is ASCII-only, with or without the 'u' flag
"naïve".match(/\bna\b/);         // ["na"] - "ï" counts as non-word, so a boundary follows "na"
"naïve".match(/\bna\b/u);        // ["na"] - the 'u' flag does not change \w or \b

// "café" full word
"café break".match(/\bcafé\b/);  // null - "é" and " " are both non-word, so no boundary
"café break".match(/\bcafé\b/u); // null - same reason

// Unicode-aware alternative: property escapes inside lookarounds
"café break".match(/(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u); // ["café"]

Python 3 defaults to Unicode for str patterns:

python
import re
re.search(r"\bcafé\b", "café break")        # match, Unicode by default for str
re.search(r"\bcafé\b", "café break", re.A)  # None, "é" is non-word in ASCII mode

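Java is the opposite default; you have to opt in. The results below are for matching against "café break" on JDK 19 and later; earlier JDKs applied a looser, partly Unicode-aware rule to \b, so test on the version you ship:

java
Pattern p1 = Pattern.compile("\\bcafé\\b");                              // ASCII, find() fails
Pattern p2 = Pattern.compile("\\bcafé\\b", Pattern.UNICODE_CHARACTER_CLASS); // Unicode, find() matches "café"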

Perl 5.22 introduced an explicit Unicode word boundary, \b{wb}, that follows the word segmentation rules in Unicode UAX #29 (formerly TR29), which handle cases such as contractions and emoji clusters. Most other engines don't implement it yet.

For applications dealing with multilingual content, default to the Unicode flag for your engine and test specifically against accented characters, CJK, and emoji.
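A small harness makes that kind of testing concrete. The sketch below uses Python's `re` module and compares its default Unicode mode with `re.ASCII` on a few sample strings; the helper name `word_tokens` and the sample strings are made up for illustration.

```python
import re

WORD = r"\b\w+\b"

def word_tokens(text, ascii_only=False):
    """Split `text` into whole words under Unicode or ASCII rules."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(WORD, text, flags)

for sample in ["café break", "naïve", "東京 大阪"]:
    print(repr(sample), word_tokens(sample), word_tokens(sample, ascii_only=True))
```

In Unicode mode each sample splits into its natural words. In ASCII mode "café" is cut to "caf", "naïve" splits into "na" and "ve", and the CJK sample yields no words at all.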

Lookaround alternative for finer control

\b is a yes/no assertion. When you need to assert which kind of character comes before or after, for example "preceded by whitespace specifically, not punctuation", use lookarounds instead:

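| Goal | \b pattern | Lookaround alternative |
| --- | --- | --- |
| Standalone whole word | \bcat\b | (?<!\w)cat(?!\w) |
| Preceded only by space (not punctuation) | not possible | (?<=\s)cat\b |
| Followed only by space or end-of-string | cat\b (too broad) | cat(?!\S) (not followed by a non-space character, i.e. followed by whitespace or end-of-string) |
| Whole token, treating underscore as a separator | \bid\b (misses the id in user_id) | (?<![A-Za-z0-9])id(?![A-Za-z0-9]) |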

The last case is particularly useful when working with identifiers: \b treats _ as a word character (so \bid\b won't match the id inside user_id). If you specifically want to find id as a standalone token even when adjacent to underscore, use the explicit lookaround.
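Here is a short Python sketch of the identifier case, assuming ASCII identifiers. It prints match positions so the difference between the two patterns is visible:

```python
import re

text = "user_id id valid"

# \b treats "_" as a word character, so only the standalone id matches.
plain = [m.start() for m in re.finditer(r"\bid\b", text)]

# Lookarounds that leave "_" out of the word set also match the id in
# user_id, but still skip the one inside "valid".
token = re.compile(r"(?<![A-Za-z0-9])id(?![A-Za-z0-9])")
explicit = [m.start() for m in token.finditer(text)]

print(plain)     # [8]
print(explicit)  # [5, 8]

# (?<!\w)cat(?!\w) is the lookaround spelling of \bcat\b.
for s in ["the cat sat", "scatter", "cat", "cat_food"]:
    assert re.findall(r"\bcat\b", s) == re.findall(r"(?<!\w)cat(?!\w)", s)
```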

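Lookbehind support varies by engine. JavaScript had no lookbehind at all before ES2018; engines that implement ES2018 (V8 first) support variable-width lookbehind. Python's re module still requires fixed-width lookbehind; the third-party regex package lifts that restriction. .NET supports fully variable-width lookbehind, Java requires the lookbehind to have a bounded length, and PCRE2 has traditionally required each alternative to be fixed-length. Every lookbehind in the table above is a single fixed-width character class, so it works in all of these. Go's regexp (RE2) does not support lookarounds at all, neither fixed nor variable.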

Common use cases with examples

1. Whole-word find and replace

The most common use of \b. Rename a variable in code without mangling longer identifiers:

bash
# GNU sed: rename 'count' to 'total' but skip 'counter', 'counterpart'
sed -E 's/\bcount\b/total/g' file.txt

python
# Python: same
import re
source = "count = counter + count"
new = re.sub(r"\bcount\b", "total", source)  # "total = counter + total"
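When the old and new names come from user input or a tool's arguments, it helps to escape them. A minimal sketch, assuming the old name starts and ends with a word character; `rename_identifier` is a hypothetical helper, not an existing API:

```python
import re

def rename_identifier(source: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of `old` with `new`."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    # A function replacement avoids backslash and group-reference
    # processing in `new`.
    return pattern.sub(lambda _m: new, source)

code = "count = counter + count_total + count"
print(rename_identifier(code, "count", "total"))
# total = counter + count_total + total
```

Note that count_total is left alone: the underscore is a word character, so there is no boundary after count there.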

2. Highlight search terms

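javascript
// Escape regex metacharacters so the search term is matched literally
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const highlight = (text, term) =>
  text.replace(new RegExp(`\\b${escapeRegExp(term)}\\b`, "giu"), (m) => `<mark>${m}</mark>`);
highlight("React beats reaction", "react"); // "<mark>React</mark> beats reaction"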

The \b anchors prevent highlighting "react" inside "reaction" or "reactor".
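The same approach extends to several search terms at once. Below is a Python sketch under a few assumptions: terms start and end with word characters, the text is already HTML-safe, and longer terms should win over shorter ones. The function name `highlight` and its signature are illustrative only.

```python
import re
from typing import Iterable

def highlight(text: str, terms: Iterable[str]) -> str:
    """Wrap whole-word, case-insensitive occurrences of any term in <mark>."""
    # Longest first, so "react native" is tried before "react".
    ordered = sorted(set(terms), key=len, reverse=True)
    if not ordered:
        return text
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)

print(highlight("React and reaction in a reactor", ["react"]))
# <mark>React</mark> and reaction in a reactor
```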

3. Extract whole numbers from text

python
import re
re.findall(r"\b\d+\b", "room 42 on floor 3 has 1024 widgets")
# ['42', '3', '1024']

4. Match log tokens precisely

python
import re
# Match LEVEL tokens in a log line: INFO, WARN, ERROR
log_line = "ERROR disk full (WARNING threshold was 90%)"
re.findall(r"\b(?:INFO|WARN|ERROR)\b", log_line)  # ['ERROR']

Without \b, the pattern would match WARN inside WARNING or INFO inside INFORMATION.
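Applied across a whole file, the same pattern can feed a simple tally. This is a sketch only: the log lines are invented sample data and `count_levels` is a hypothetical helper.

```python
import re
from collections import Counter

LEVEL = re.compile(r"\b(?:INFO|WARN|ERROR)\b")

def count_levels(lines):
    """Count whole-token log levels across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(LEVEL.findall(line))
    return counts

# Invented sample lines; WARNING and ERRORS must not be counted.
sample_log = [
    "2024-01-01 INFO service started",
    "2024-01-01 WARN disk at 91%",
    "2024-01-01 ERROR write failed: WARNING threshold exceeded",
    "2024-01-01 INFO ERRORS=0 after retry",
]
print(count_levels(sample_log))
```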

5. Validate identifier-like strings

javascript
// Match identifiers that are at least 3 chars, alphanumeric + underscore
const isIdentifier = (s) => /^\w{3,}$/.test(s);
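For comparison, a Python sketch of the same check. It uses `fullmatch` instead of ^ and $, because in Python $ also matches just before a trailing newline, and `re.ASCII` to mirror JavaScript's ASCII-only \w. The names here are illustrative.

```python
import re

IDENTIFIER = re.compile(r"\w{3,}", re.ASCII)

def is_identifier(s: str) -> bool:
    """True if the whole string is 3+ ASCII letters, digits, or underscores."""
    return IDENTIFIER.fullmatch(s) is not None

print(is_identifier("user_id"))  # True
print(is_identifier("ab"))       # False
print(is_identifier("user-id"))  # False
print(is_identifier("abc\n"))    # False, although ^\w{3,}$ with re.search would accept it
```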

For input-validation patterns specifically, the Regex Anchors article covers ^ and $ which often pair with \b for full-string assertions.

Common mistakes

Mistake 1: assuming \b keeps hyphenated words together. \b treats - as a non-word character, so \bword\b matches word in word-art. If you don't want that, use the explicit lookaround (?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-]).
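A quick Python check of the difference, using match positions on a made-up sample string:

```python
import re

text = "word word-art sword"

# \b sees a boundary between "d" and "-", so word-art also matches.
plain = [m.start() for m in re.finditer(r"\bword\b", text)]

# Treating "-" as part of the word rejects the hyphenated form.
hyphen_aware = re.compile(r"(?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-])")
strict = [m.start() for m in hyphen_aware.finditer(text)]

print(plain)   # [0, 5]
print(strict)  # [0]
```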

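Mistake 2: assuming \b is Unicode-aware. In JavaScript, \b is ASCII-only even with the u flag, so every non-ASCII character is treated as a non-word character. \bcafé\b won't match café in "café break" because é and the following space are both non-word characters, and \bna\b unexpectedly matches inside naïve because ï looks like a word boundary in ASCII mode. The same trap applies in Java with the default flags, and in Go regardless.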

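Mistake 3: using \b in POSIX BRE/ERE. It is not part of the POSIX standard. GNU grep and GNU sed accept \b as an extension, but portable scripts can't rely on it. Use \< and \> (GNU tools, vim) or [[:<:]] and [[:>:]] (BSD/macOS) for whole-word matching.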

Mistake 4: thinking \B is "the opposite end" of \b. \B is "no boundary HERE", not "boundary on the other side". \Bbook\B matches book only when both edges are inside a word, as in notebooks; it never matches the standalone word book.

Mistake 5: confusing \b with ^ and $. ^ and $ anchor to the start and end of the string (or of each line in multiline mode); \b anchors to the start or end of a word. Both are zero-width but operate at different scales. See the Regex Anchors guide for the line/string anchors.

Mistake 6: copying a Python \b pattern to Go. Go's regexp will compile the pattern fine, but \w and \b are ASCII-only and there is no flag to flip them. Go's regexp has no lookarounds either, so if you need Unicode word matching in Go, either rewrite using a [\p{L}\p{N}_] character class with explicit groups for the neighbouring characters, such as (?:^|[^\p{L}\p{N}_]), or pull in the regexp2 package, a port of the .NET regex engine that supports lookarounds and Unicode-aware word boundaries.

What to do next

For the one-page reference with every regex shortcut on one page, see the Regex Cheat Sheet.


Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years building software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Currently Chief Technology Officer at a healthcare tech startup, which is where most of these field notes come from.
