TechEarl

Regex Word Boundaries: \b, \B, and Lookaround Equivalents

Regex word boundaries (\b and \B) match positions between word and non-word characters with zero width. The full reference with engine differences, Unicode handling, lookaround alternatives, and worked examples for whole-word replace, search highlighting, and log parsing.

Ishan Karunaratne · 14 min read

\b is the zero-width assertion that matches the position between a word character and a non-word character: the start of "cat" in "the cat" or the end of "cat" in "cat food", but not anywhere inside "scatter". \B is its negation, a position where there is no word boundary. Both are zero-width: they consume no characters, they only assert position.

What is a word boundary (\b) in regex?

A word boundary is the zero-width position between a word character (\w: a letter, digit, or underscore) and a non-word character (\W: anything else). The very start and end of the string count as non-word neighbours, so a string that begins or ends with a word character has a boundary there too. The \b metacharacter matches that position. It is a position, not a character: \b does not match a literal b, and it consumes nothing, so it never "uses up" any of the input.

In one line: \b says "a word starts or ends right here". The classic use is "match a whole word, not the substring inside another word": \bcat\b matches "cat" in "the cat sat" but not "cat" in "scatter" or "category". \B is the negation, every position that is not a word boundary.

That covers the short answer. The rest of this page is the full reference: the exact positions that match, per-engine support (JavaScript, Python, Java, PCRE, .NET, POSIX, Go, Rust), Unicode caveats, the lookaround alternative when you need finer control, and worked examples in Python and JavaScript. The same idea extends to log parsing (token-precise matches), refactor tools (rename only whole identifiers), and search-and-replace (don't mangle "Java" into "Python" inside "JavaScript").


\b is a boundary, not the letter b

This trips up a lot of people searching for "b in regex" or "regex b", so it is worth being explicit. The escape \b (backslash-b) is a word boundary, a zero-width position. It is not the literal character b. To match an actual lowercase b you just write b with no backslash. So:

  • b matches the letter b.
  • \b matches a position (a word boundary), and matches no character at all.
  • \bb\b matches a standalone single-letter word "b" (the boundary, then the letter, then the boundary).

There is one more wrinkle that catches people out: inside a character class, \b changes meaning. In most engines [\b] is the backspace control character (U+0008), not a word boundary. A word boundary is meaningless inside a [...] class, so the engine reuses the escape for backspace there, the same way it does in string literals. If you meant "word boundary", keep \b outside the brackets.

python
import re
re.findall(r"\bb\b", "a b c bb")   # ['b']  - the standalone "b", not the "bb"
re.search(r"[\b]", "a\bc")         # match  - [\b] is backspace inside a class

Word characters: what counts as \w

Before word boundaries make sense, we need to be precise about what the engine considers a "word character". In most modern regex engines, \w is shorthand for [a-zA-Z0-9_]:

| Character class | Members |
| --- | --- |
| Letters | a-z, A-Z |
| Digits | 0-9 |
| Underscore | _ |

Yes, the underscore is a word character. This catches people off-guard: \b\w+\b matches snake_case as a single word because _ doesn't form a boundary, but it splits snake-case into two matches, snake and case (the - is a non-word character, so the engine sees two separate words).

Anything not in \w is in \W: spaces, tabs, punctuation, hyphens, and, depending on the engine and its Unicode mode, non-ASCII letters and symbols.
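
A quick way to see the underscore rule in action is to tokenize a string with \b\w+\b. The sketch below uses Python's built-in re module; the sample string is made up for illustration.

```python
import re

# Sample text (illustrative only): one underscore-joined and one hyphen-joined name.
sample = "snake_case snake-case"

# "_" is a word character, so snake_case stays in one piece;
# "-" is a non-word character, so snake-case splits at the hyphen.
words = re.findall(r"\b\w+\b", sample)
print(words)  # ['snake_case', 'snake', 'case']
```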

Where \b matches: the four positions

A word boundary exists at any position where one side is a word character and the other side is not (including string start/end). The four cases:

| Position | Example | Result |
| --- | --- | --- |
| Start of string before a word char | \bcat against "cat sat" | matches at index 0 |
| End of string after a word char | cat\b against "the cat" | matches at index 7 |
| After non-word, before word | \bcat\b against "the cat sat" | matches the standalone cat |
| After word, before non-word | same as above, the trailing edge of cat | matches |

What \b does NOT match: positions where both sides are word characters (\bcat\b against "scatter" finds no match) or both sides are non-word characters ("!@#" has no word boundaries).

| Pattern | Input | Matches | Doesn't match |
| --- | --- | --- | --- |
| \bbook\b | the book! | book | |
| \bbook\b | notebook | | the substring is inside a word |
| \bbook\b | bookstore | | trailing edge is word-to-word |
| \bdog\b | bulldog ran | | leading edge is word-to-word |
| \b\d+\b | room 42 floor 3 | 42, 3 | |
| \b[A-Z]+\b | using SQL today | SQL | |
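
To check these positions yourself, you can ask the engine for every place where \b matches. This is a minimal sketch using Python's re module; boundary_positions is a hypothetical helper name, not a library function.

```python
import re

def boundary_positions(text):
    """Return every index in text where the word-boundary assertion matches."""
    # Each match is empty (zero-width), so start() and end() are equal.
    return [m.start() for m in re.finditer(r"\b", text)]

print(boundary_positions("the cat"))  # [0, 3, 4, 7]
print(boundary_positions("scatter"))  # [0, 7]  - nothing inside the word
print(boundary_positions("!@#"))      # []      - no word characters at all
```

The empty list for "!@#" is the "both sides are non-word" case: with no word character anywhere, there is nothing for a boundary to sit next to.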

Negated word boundary: \B

\B matches every position that \b does NOT, namely positions where both sides are word characters, or both sides are non-word characters. It's the assertion for "match this substring only when it is inside a word":

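| Pattern | Input | Matches | Why |
| --- | --- | --- | --- |
| \Bis\B | this island | nothing | is ends this and starts island, so one edge is always a boundary |
| \Boo\B | book | oo inside book | both sides are word chars |
| \Bing\B | singing | the first ing (middle) | both sides are word chars; the final ing fails at the end of the string |
| ing\B | singing | the first ing, not the one at the end | the final ing is followed by the end of the string, which is a boundary |
| \Bed | bedding | ed inside bedding | left side is a word char |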

The classic use: highlight every occurrence of a sequence as a substring but skip the standalone form. \Bing\b matches ing at the end of a word but not the standalone word "ing".
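
The following sketch, again assuming Python's re module and made-up sample strings, shows \B picking out only the in-word occurrences.

```python
import re

text = "singing ing ring"

# \Bing\b: "ing" that ends a word but is not the whole word.
suffix_spans = [m.span() for m in re.finditer(r"\Bing\b", text)]
print(suffix_spans)  # [(4, 7), (13, 16)]

# \Bcat\B: "cat" only when both edges are inside a word.
inner = re.search(r"\Bcat\B", "scatter cat")
print(inner.span())  # (1, 4)
```

The standalone "ing" at index 8 is skipped because the space before it creates a boundary, which \B rejects.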

Engine compatibility

Word boundary support varies by engine, especially around Unicode. The differences matter when you're writing patterns that need to work in more than one runtime.

| Engine | \b and \B | Unicode-aware by default | Notes |
| --- | --- | --- | --- |
| JavaScript | yes | no | \w and \b are ASCII-only, and the u and v flags do not change that. For Unicode whole-word matching, use lookarounds with property escapes under the u flag, e.g. (?<![\p{L}\p{N}_])word(?![\p{L}\p{N}_]). |
| Python (re) | yes | yes for str patterns | In Python 3, re.UNICODE is the default for str patterns; explicit re.A (re.ASCII) reverts to ASCII-only. For bytes patterns the default is ASCII. |
| Python (regex package) | yes | yes by default | The third-party regex package is Unicode-aware by default and exposes finer control with the (?V1) version flag. |
| Java (java.util.regex) | yes | no by default | \w is ASCII unless you set Pattern.UNICODE_CHARACTER_CLASS (or use (?U) inline). \b follows the same ASCII default on JDK 19 and later; earlier JDKs treated Unicode letters and digits as word characters for \b. The UNICODE_CASE flag is separate and only affects case-insensitive matching. |
| PCRE / PCRE2 / PHP | yes | optional | Enable with the start-of-pattern option (*UCP), the PHP pattern modifier u (e.g., /pattern/u), or the PCRE2_UCP compile option. There is no (?UCP) inline flag. |
| .NET (System.Text.RegularExpressions) | yes | yes by default | Word characters and boundaries are Unicode-aware out of the box. RegexOptions.ECMAScript restricts both to ASCII. |
| Ruby (Onigmo) | yes | \b yes, \w no | Ruby source files default to UTF-8 since 2.0. On UTF-8 strings \b and \B follow the encoding's word rules, but \w, \d, and \s match ASCII only by default; use (?u), \p{Word}, or [[:word:]] for a Unicode-aware \w. |
| Go (regexp, RE2) | yes | no | Go's regexp is ASCII for \w and \b and has no Unicode-word flag. For Unicode word matching, build the class explicitly with [\p{L}\p{N}_] and match the neighbouring delimiter yourself (RE2 has no lookarounds), or use the regexp2 third-party package. |
| Rust (regex crate) | yes | yes when Unicode is enabled (default) | The default regex crate build uses Unicode \w and Unicode-aware \b. The regex-lite and no-default-features builds drop back to ASCII. |
| POSIX BRE / ERE | no | n/a | \b is not part of POSIX. Use [[:<:]] and [[:>:]] (BSD and macOS regex libraries) or \< and \> (GNU grep and sed, vim) for whole-word matching. GNU tools also accept \b as an extension. |

For the broader cross-engine reference, see the Regex Cheat Sheet.

Unicode word boundaries

ASCII-only \b produces surprising results on non-English text. Examples:

javascript
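// JavaScript's \w and \b are ASCII-only, with or without the 'u' flag
"naïve".match(/\bna\b/);         // ["na"] - "ï" is a non-word char, so a boundary follows "na"
"naïve".match(/\bna\b/u);        // ["na"] - the 'u' flag does not change \b

// "café" as a whole word
"café break".match(/\bcafé\b/);   // null - "é" and " " are both non-word, so no boundary after "é"
"café break".match(/\bcafé\b/u);  // null - same result with 'u'

// Unicode-aware whole word: lookarounds with property escapes (these need 'u' or 'v')
"café break".match(/(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u); // ["café"]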

Python 3 defaults to Unicode for str patterns:

python
import re
re.search(r"\bcafé\b", "café break")        # match, Unicode by default for str
re.search(r"\bcafé\b", "café break", re.A)  # None, ASCII mode: no boundary after "é"

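Java is the opposite default; you have to opt in (this describes JDK 19 and later; earlier JDKs already treated Unicode letters and digits as word characters for \b, though not for \w):

java
Pattern p1 = Pattern.compile("\\bcafé\\b");                                  // ASCII: no match in "café break"
Pattern p2 = Pattern.compile("\\bcafé\\b", Pattern.UNICODE_CHARACTER_CLASS); // Unicode: matches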

PCRE2 also introduced an explicit Unicode word boundary \b{wb} (Perl 5.22 and PCRE2) that uses Unicode TR29 segmentation rules (handles CJK, emoji clusters, contractions). Most engines don't implement it yet.

For applications dealing with multilingual content, default to the Unicode flag for your engine and test specifically against accented characters, CJK, and emoji.
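
As a starting point for that kind of test, the sketch below tokenizes one made-up sample with Python's re module in both modes. The NFC normalisation step is an assumption added here so that accented letters are single code points; decomposed input (a base letter plus a combining mark) behaves differently.

```python
import re
import unicodedata

# Illustrative sample; NFC keeps "é" and "ï" as single precomposed code points.
sample = unicodedata.normalize("NFC", "café 東京 naïve")

# str patterns are Unicode-aware by default; re.A forces ASCII-only \w and \b.
unicode_words = re.findall(r"\b\w+\b", sample)
ascii_words = re.findall(r"\b\w+\b", sample, re.A)

print(unicode_words)  # ['café', '東京', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```

In ASCII mode the accented letters act as separators and the CJK word disappears entirely, which is the failure to look for in multilingual test data.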

Lookaround alternative for finer control

\b is a yes/no assertion. When you need to assert which kind of character comes before or after, for example "preceded by whitespace specifically, not punctuation", use lookarounds instead:

| Goal | \b pattern | Lookaround alternative |
| --- | --- | --- |
| Standalone whole word | \bcat\b | (?<!\w)cat(?!\w) |
| Preceded only by space (not punctuation) | not possible | (?<=\s)cat\b |
| Followed only by space or end-of-string | cat\b (too broad) | `cat(?=\s\|$)` |
| Whole word, treating underscore as a separator | \bid\b (misses the id in user_id) | (?<![A-Za-z0-9])id(?![A-Za-z0-9]) |

The last case is particularly useful when working with identifiers: \b treats _ as a word character (so \bid\b won't match the id inside user_id). If you specifically want to find id as a standalone token even when adjacent to underscore, use the explicit lookaround.
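
The difference is easy to see side by side. This is a small sketch with Python's re module and a made-up input string:

```python
import re

text = "user_id id valid"

# \b treats "_" as part of the word, so the id in user_id is skipped.
with_b = [m.start() for m in re.finditer(r"\bid\b", text)]

# Explicit lookarounds that leave "_" out of the "word" set also find user_id's id.
with_lookaround = [
    m.start() for m in re.finditer(r"(?<![A-Za-z0-9])id(?![A-Za-z0-9])", text)
]

print(with_b)           # [8]
print(with_lookaround)  # [5, 8]
```

Neither pattern matches the id at the end of "valid", because a letter precedes it in both definitions of "word".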

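Lookbehind support varies by engine. JavaScript had no lookbehind at all before ES2018; since then it accepts variable-width lookbehind, as does .NET. Python's built-in re module still requires fixed-width lookbehind (the third-party regex package lifts that limit), PCRE2 has traditionally required each lookbehind alternative to have a fixed length, and Java requires the lookbehind to have a bounded maximum length. The single-character lookbehinds in the table above are fixed-width, so they work in all of these. Go's regexp (RE2) does not support lookarounds at all, neither fixed nor variable.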

Word boundaries in Python and JavaScript

These two dominate, so here is the canonical whole-word match in each. The pattern is the same; what differs is how you express it and how Unicode behaves.

In Python, use a raw string (r"...") so the backslash reaches the regex engine instead of being eaten as a string escape. r"\bword\b" is the idiomatic form:

python
import re

re.findall(r"\bword\b", "word wordsmith password word.")  # ['word', 'word']
re.sub(r"\bcat\b", "dog", "cat scatter cat")              # 'dog scatter dog'
bool(re.search(r"\bcat\b", "the cat sat"))                # True

Python 3 is Unicode-aware by default for str patterns, so r"\bcafé\b" matches café in "café break" with no extra flag. Pass re.A (re.ASCII) if you specifically want ASCII-only boundaries.

In JavaScript, write the literal with a single backslash (/\bword\b/), or double it when building from a string with new RegExp("\\bword\\b"):

javascript
"word wordsmith password word.".match(/\bword\b/g);  // ['word', 'word']
"cat scatter cat".replace(/\bcat\b/g, "dog");        // 'dog scatter dog'
/\bcat\b/.test("the cat sat");                        // true

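JavaScript is ASCII-only for \w and \b, and the u and v flags do not change that. For words that start or end with a non-ASCII letter, use Unicode property lookarounds instead, such as /(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u. More on that in Unicode word boundaries above.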

Common use cases with examples

1. Whole-word find and replace

The most common use of \b. Rename a variable in code without mangling longer identifiers:

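bash
# GNU sed: rename 'count' to 'total' but skip 'counter', 'counterpart'
# (\b is a GNU extension; BSD/macOS sed needs [[:<:]] and [[:>:]] instead)
sed -E 's/\bcount\b/total/g' file.txt

python
# Python: same
import re
source = "count = counter.count + 1"         # sample input
new = re.sub(r"\bcount\b", "total", source)  # 'total = counter.total + 1'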
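
When the name to replace comes from a variable rather than a literal, it is worth escaping it before building the pattern. The helper below is a minimal sketch with Python's re module; rename_identifier is a hypothetical name, and it assumes the old name starts and ends with a word character (otherwise \b will not sit where you expect).

```python
import re

def rename_identifier(source, old, new):
    """Replace whole-word occurrences of old with new (hypothetical helper)."""
    # re.escape guards against regex metacharacters in old;
    # a function replacement avoids backslash processing in new.
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return pattern.sub(lambda _m: new, source)

print(rename_identifier("count = counter.count + 1", "count", "total"))
# total = counter.total + 1
```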

2. Highlight search terms

javascript
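const highlight = (text, term) => {
  const safe = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
  const re = new RegExp(`\\b${safe}\\b`, "giu");
  return text.replace(re, (m) => `<mark>${m}</mark>`);
};
highlight("React and reaction", "react"); // '<mark>React</mark> and reaction'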

The \b anchors prevent highlighting "react" inside "reaction" or "reactor".
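
The same idea extends to several search terms at once. The sketch below is one possible Python version using the re module; highlight and the sample terms are illustrative. It does not HTML-escape the input text, which a real page would need.

```python
import re

def highlight(text, terms):
    """Wrap whole-word, case-insensitive matches of any term in <mark> tags."""
    # Longest term first so "react native" wins over its prefix "react".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)

print(highlight("React Native and reaction vs React", ["react", "react native"]))
# <mark>React Native</mark> and reaction vs <mark>React</mark>
```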

3. Extract whole numbers from text

python
import re
re.findall(r"\b\d+\b", "room 42 on floor 3 has 1024 widgets")
# ['42', '3', '1024']
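
One caveat: the dot is a non-word character, so \b\d+\b splits decimals, and it skips digits glued to letters. The sketch below (Python's re module, made-up input) shows that behaviour next to one possible decimal-aware alternative built from lookarounds.

```python
import re

text = "v2 costs 3.14 and id abc123 ships in 10 days"

# \b\d+\b: "3.14" splits at the dot; "v2" and "abc123" have no boundary before the digits.
whole_numbers = re.findall(r"\b\d+\b", text)
print(whole_numbers)  # ['3', '14', '10']

# Treat "." as part of the token as well, and allow an optional fractional part.
decimals = re.findall(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])", text)
print(decimals)  # ['3.14', '10']
```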

4. Match log tokens precisely

python
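# Match LEVEL tokens in a log file: INFO, WARN, ERROR
import re
log_line = "ERROR db timeout, INFO retrying"      # sample line for illustration
re.findall(r"\b(?:INFO|WARN|ERROR)\b", log_line)  # ['ERROR', 'INFO']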

Without \b, the pattern would match INFO inside INFORMATION, WARN inside WARNING, or similar collisions.
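
To make the effect concrete, here is a small sketch that counts levels across a few made-up log lines with Python's re module and collections.Counter. The line contents are invented for illustration.

```python
import re
from collections import Counter

LEVEL = re.compile(r"\b(?:INFO|WARN|ERROR)\b")

# Made-up log lines for illustration only.
lines = [
    "INFO service started",
    "WARNING disk usage high",        # contains WARN, but not as a whole token
    "ERROR code=5 INFORMATION lost",  # INFORMATION must not count as INFO
]

counts = Counter(level for line in lines for level in LEVEL.findall(line))
print(counts)  # Counter({'INFO': 1, 'ERROR': 1})
```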

5. Validate identifier-like strings

javascript
// Match identifiers that are at least 3 chars, alphanumeric + underscore
const isIdentifier = (s) => /^\w{3,}$/.test(s);

For input-validation patterns specifically, the Regex Anchors article covers ^ and $ which often pair with \b for full-string assertions.

Common mistakes

Mistake 1: assuming a hyphenated term counts as one word. \b treats - as a non-word character, so \bword\b matches word in word-art. If you don't want that, use the explicit lookaround (?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-]).
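
A short sketch of the hyphen case, using Python's re module and a made-up string:

```python
import re

text = "word-art word"

# Plain \b: the hyphen is a non-word character, so both occurrences match.
plain = [m.start() for m in re.finditer(r"\bword\b", text)]

# Hyphen added to the "word" set: only the standalone occurrence is left.
strict = [
    m.start()
    for m in re.finditer(r"(?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-])", text)
]

print(plain)   # [0, 9]
print(strict)  # [9]
```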

Mistake 2: assuming \b is Unicode-aware. In ASCII mode every non-ASCII character is a non-word character, so boundaries land in the wrong places: \bcafé\b fails on "café break" because é and the space are both non-word characters, and \bna\b matches the na inside naïve. JavaScript behaves this way even with the u flag. The same trap applies in Java with the default flags, and in Go regardless.

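Mistake 3: using \b in POSIX BRE/ERE. It is not part of the POSIX standard. Use [[:<:]] and [[:>:]] (BSD and macOS tools) or \< and \> (GNU grep and sed, vim) for whole-word matching. GNU tools accept \b as an extension, which is why a script that works on Linux can behave differently on macOS.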

Mistake 4: thinking \B is "the opposite end" of \b. \B is "no boundary HERE", not "boundary on the other side". \Bbook\B matches book only when both edges are inside a word, as in notebooks; it never matches the standalone word book.

Mistake 5: confusing \b with ^ and $. ^ and $ anchor to the start and end of the line (or string with the appropriate flag); \b anchors to the start or end of a word. Both are zero-width but operate at different scales. See the Regex Anchors guide for the line/string anchors.

Mistake 6: copying a Python \b pattern to Go. Go's regexp will compile the pattern fine, but \w and \b are ASCII-only and there is no flag to flip that. If you need Unicode word matching in Go, either build the class explicitly with [\p{L}\p{N}_] and match the neighbouring delimiter yourself (RE2 has no lookarounds, so put something like (?:^|[^\p{L}\p{N}_]) on each side and capture the word in a group), or pull in the third-party regexp2 package, a port of the .NET engine that supports lookarounds and Unicode word boundaries.

What to do next

For a one-page reference covering every regex shortcut, see the Regex Cheat Sheet.

FAQ

What does \b mean in regex?

\b is a zero-width assertion that matches the position between a word character (\w: letter, digit, or underscore) and a non-word character (\W: anything else), with the start and end of the string counting as non-word neighbours. It consumes no characters; it only asserts that the current position is a word boundary.

The classic use is \bword\b to match a standalone occurrence of word without matching the substring inside a longer word like password or wordsmith.

What is the difference between \b and \B?

\b matches at a word boundary; \B matches at every position that is NOT a word boundary. Use \b to find standalone words (\bcat\b matches cat in the cat sat); use \B to find substrings inside words (\Bcat\B matches cat in scatter but not the standalone cat).

Both are zero-width and consume no characters.

Is the underscore a word character?

Yes, in every major engine that follows the \w = [A-Za-z0-9_] convention (JavaScript, Python, Java, PCRE, .NET, Ruby, Go, Rust). That means \b does NOT match between a letter and an underscore: \bid\b won't match id inside user_id because there's no boundary between _ and i.

If you want to treat underscore as a separator, use an explicit lookaround: (?<![A-Za-z0-9])id(?![A-Za-z0-9]).

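Is \b Unicode-aware in JavaScript?

No, and no flag changes that. JavaScript defines \b in terms of the ASCII \w set, so every non-ASCII character counts as a non-word character, with or without the u or v flag. \bcafé\b fails on "café break" because é and the space are both non-word characters, so there is no boundary after é.

For Unicode-aware whole-word matching, use lookarounds with Unicode property escapes under the u flag: /(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u. The v flag (Unicode-sets mode, ES2024) is a superset of u mode that adds set notation inside character classes, but it does not change \b either.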

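Does POSIX regex support \b?

Not as part of the standard: POSIX BRE and ERE define neither \b nor \B. Implementations add their own word-edge extensions instead: [[:<:]] (start of word) and [[:>:]] (end of word) in BSD and macOS regex libraries, or \< and \> in GNU grep, GNU sed, and vim. GNU tools also accept \b and \B as extensions.

In shell scripts, grep -P (PCRE mode) gives you \b where GNU grep is available; elsewhere, fall back to the word-edge extension for the target platform.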

How do I match a whole word with regex?

Wrap the word in \b anchors: \bword\b. That matches word as a standalone token (preceded and followed by either a non-word character or the start/end of the string) but not as a substring inside a longer word.

For case-insensitive matching add the i flag; for Unicode word handling add the appropriate flag for your engine ((*UCP) in PCRE, UNICODE_CHARACTER_CLASS in Java). JavaScript has no such flag, so use Unicode property lookarounds there.

Can I use lookarounds instead of \b?

Yes. (?<!\w)cat(?!\w) is equivalent to \bcat\b and gives you finer control. For example, to treat underscore as a separator (which \b doesn't): (?<![A-Za-z0-9])cat(?![A-Za-z0-9]).

Lookarounds require an engine that supports them (most do; Go's regexp / RE2 does NOT, and the Rust regex crate does not). See Regex Lookaheads and Lookbehinds for the full reference.

Does \bcat\b match the cat in cat-food?

It does. The hyphen - is a non-word character, so there IS a word boundary between cat and -. \bcat\b matches the cat in cat-food and the standalone cat equally.

If you want cat-food to be treated as one token (no boundary at the hyphen), use the lookaround form (?<![A-Za-z0-9_-])cat(?![A-Za-z0-9_-]) to include hyphen as a "word" character for boundary purposes.

See also

Word boundaries are one of the most misunderstood zero-width assertions. If you want the concepts properly grounded:

  • Learning Regular Expressions (Ben Forta). The gentlest on-ramp: short, current, and example-driven. A good first book if you are still finding your feet.
  • Regular Expressions Cookbook (Jan Goyvaerts and Steven Levithan, 2nd edition). Problem-then-solution recipes across eight languages (JavaScript, Python, PHP, Java, .NET, Ruby, Perl, VB). The one to keep next to the keyboard.
  • Mastering Regular Expressions (Jeffrey Friedl, 3rd edition). The definitive deep-dive on how regex engines actually work: backtracking, NFA versus DFA, and the optimisation that makes a pattern fast or catastrophic. Dense, and unmatched once you are past the basics.