\b is the zero-width assertion that matches the position between a word character and a non-word character: the start of "cat" in "the cat" or the end of "cat" in "cat food", but not anywhere inside "scatter". \B is its negation, a position where there is no word boundary. Both are zero-width: they consume no characters and only assert a position. Below is the full reference: definition, the exact positions that match, per-engine support (JavaScript, Python, Java, PCRE, .NET, Ruby, POSIX, Go, Rust), Unicode caveats, the lookaround alternative when you need finer control, and worked examples for the cases I use word boundaries for most often.
What are regex word boundaries?
A regex word boundary is a zero-width assertion that matches at a position where a word character (\w, meaning a letter, digit, or underscore) sits next to a non-word character (\W, meaning anything else), or at the start or end of the string. The \b metacharacter matches that position; \B matches every position that is not a word boundary. Because they're zero-width, \b and \B consume no characters. They only assert that the current position satisfies the condition. The classic use is "match a whole word, not the substring inside another word": \bcat\b matches "cat" in "the cat sat" but not "cat" in "scatter" or "category". The same idea extends to log parsing (token-precise matches), refactor tools (rename only whole identifiers), and search-and-replace (don't mangle "Java" into "Python" inside "JavaScript").
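As a minimal sketch of the "Java" versus "JavaScript" case, the snippet below contrasts a plain substring replacement with a \b-anchored one. The sample sentence and variable names are illustrative assumptions, not taken from any particular tool.

```python
import re

text = "Java and JavaScript are different languages"

# Plain substring replacement also rewrites the "Java" inside "JavaScript".
naive = text.replace("Java", "Python")

# \b on both sides restricts the match to the standalone word.
whole_word = re.sub(r"\bJava\b", "Python", text)

print(naive)       # Python and PythonScript are different languages
print(whole_word)  # Python and JavaScript are different languages
```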
Jump to:
- Word characters: what counts as \w
- Where \b matches: the four positions
- Negated word boundary: \B
- Engine compatibility
- Unicode word boundaries
- Lookaround alternative for finer control
- Common use cases with examples
- Common mistakes
- FAQ
Word characters: what counts as \w
Before word boundaries make sense, we need to be precise about what the engine considers a "word character". In ASCII mode, \w is shorthand for [a-zA-Z0-9_]; Unicode-aware engines and modes extend it to letters and digits from other scripts:
| Character class | Members |
|---|---|
| Letters | a-z, A-Z |
| Digits | 0-9 |
| Underscore | _ |
Yes, the underscore is a word character. This catches people off-guard: \bsnake\b does not match inside snake_case because _ doesn't form a boundary, but it does match inside snake-case (the - is a non-word character, so the engine sees two separate words).
Anything not in \w is in \W: spaces, tabs, punctuation, hyphens, and every non-ASCII letter or symbol depending on the engine and its Unicode mode.
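A quick way to check which characters your engine counts as word characters is to test them one at a time. The sketch below does this in Python with re.ASCII forced, so \w is exactly [a-zA-Z0-9_]; the helper name is_word_char is a hypothetical one chosen for illustration.

```python
import re

def is_word_char(ch):
    """True if ch matches \\w in ASCII mode, i.e. [a-zA-Z0-9_]."""
    return re.fullmatch(r"\w", ch, re.ASCII) is not None

# "\u00e9" is the precomposed letter e-acute.
for ch in ["a", "Z", "7", "_", "-", " ", "\u00e9"]:
    print(repr(ch), is_word_char(ch))

# Underscore joins, hyphen splits:
in_snake = re.search(r"\bsnake\b", "snake_case", re.ASCII)  # None: "_" is a word char
in_kebab = re.search(r"\bsnake\b", "snake-case", re.ASCII)  # match: "-" is not
print(in_snake, in_kebab)
```

Dropping re.ASCII switches Python back to its Unicode default for str patterns, where the accented letter would also count as a word character.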
Where \b matches: the four positions
A word boundary exists at any position where one side is a word character and the other side is not (including string start/end). The four cases:
| Position | When it matches | Example |
|---|---|---|
| Start of string before a word char | \bcat against "cat sat" | \b matches at position 0 |
| End of string after a word char | cat\b against "the cat" | \b matches at position 7 (end of string) |
| After non-word, before word | \bcat\b against "the cat sat" | matches the standalone cat |
| After word, before non-word | same as above, the trailing edge of cat | matches |
What \b does NOT match: positions where both sides are word characters (\bcat\b against "scatter" finds no match) or both sides are non-word characters ("!@#" has no word boundaries).
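To see the boundary positions directly, you can ask the engine for every empty match of \b. This is a small sketch in Python (it relies on re.finditer returning empty matches); boundary_positions is a hypothetical helper name.

```python
import re

def boundary_positions(s):
    """Return every index in s where \\b matches."""
    return [m.start() for m in re.finditer(r"\b", s)]

print(boundary_positions("the cat"))  # [0, 3, 4, 7]
print(boundary_positions("scatter"))  # [0, 7]
print(boundary_positions("!@#"))      # []
```

The positions are gaps between characters, not characters themselves: 3 is the gap between "e" and the space, and 7 is the end of the string.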
| Pattern | Input | Matches | Doesn't match |
|---|---|---|---|
| \bbook\b | the book! | book | |
| \bbook\b | notebook | | the substring is inside a word |
| \bbook\b | bookstore | | trailing edge is word-to-word |
| \bdog\b | bulldog ran | | leading edge is word-to-word |
| \b\d+\b | room 42 floor 3 | 42, 3 | |
| \b[A-Z]+\b | using SQL today | SQL | |
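The rows above can be checked mechanically. The following is a small sketch using Python's re.findall; the cases list simply restates the pattern and input columns of the table.

```python
import re

cases = [
    (r"\bbook\b", "the book!"),
    (r"\bbook\b", "notebook"),
    (r"\bbook\b", "bookstore"),
    (r"\bdog\b", "bulldog ran"),
    (r"\b\d+\b", "room 42 floor 3"),
    (r"\b[A-Z]+\b", "using SQL today"),
]

# One list of matches per (pattern, input) pair, in table order.
results = [re.findall(pattern, text) for pattern, text in cases]

for (pattern, text), found in zip(cases, results):
    print(f"{pattern!r:16} {text!r:20} -> {found}")
```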
Negated word boundary: \B
\B matches every position that \b does NOT, namely positions where both sides are word characters, or both sides are non-word characters. It's the assertion for "match this substring only when it is inside a word":
| Pattern | Input | Matches | Why |
|---|---|---|---|
| \Bis\B | thistle island | is inside thistle | both sides are word chars; the is in island starts at a boundary |
| \Boo\B | book | oo inside book | both sides are word chars |
| \Bing\B | singing | the first ing (middle of the word) | word chars on both sides; the final ing ends at a boundary |
| ing\B | singing | the first ing, not the one at the end | the final ing is followed by end-of-string, which is a boundary |
| \Bed | bedding | ed inside bedding | left side is a word char |
The classic use: highlight every occurrence of a sequence as a substring but skip the standalone form. \Bing\b matches ing at the end of a word but not the standalone word "ing".
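The difference between \B on one side and \b on the other is easiest to see from match spans. A minimal sketch in Python, where spans is a hypothetical helper that returns (start, end) pairs:

```python
import re

def spans(pattern, text):
    """Return (start, end) for every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

print(spans(r"\Boo\B", "book"))      # [(1, 3)]
print(spans(r"\Bing\B", "singing"))  # [(1, 4)]  first "ing": word chars on both sides
print(spans(r"\Bing\b", "singing"))  # [(4, 7)]  final "ing": ends at a boundary
print(spans(r"\Bing\b", "ing"))      # []        standalone: starts at a boundary
```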
Engine compatibility
Word boundary support varies by engine, especially around Unicode. The differences matter when you're writing patterns that need to work in more than one runtime.
| Engine | \b and \B | Unicode-aware by default | Notes |
|---|---|---|---|
| JavaScript | yes | no | \w and \b are ASCII-only, and the u and v flags do not change that. For Unicode-aware whole-word matching, use lookarounds with property escapes under the u flag, e.g. (?<![\p{L}\p{N}_])ä(?![\p{L}\p{N}_]). |
| Python (re) | yes | yes for str patterns | In Python 3, re.UNICODE is the default for str patterns; explicit re.A (re.ASCII) reverts to ASCII-only. For bytes patterns the default is ASCII. |
| Python (regex package) | yes | yes by default | The third-party regex package is Unicode-aware by default and exposes finer control with the (?V1) version flag. |
| Java (java.util.regex) | yes | no by default | \w and \b are ASCII unless you set Pattern.UNICODE_CHARACTER_CLASS (or use (?U) inline). Before JDK 19, \b was partly Unicode-aware even though \w was ASCII. The UNICODE_CASE flag is separate and only affects case-insensitive matching. |
| PCRE / PCRE2 / PHP | yes | optional | Enable with the start-of-pattern option (*UCP), the PHP pattern modifier u (e.g., /pattern/u), or the PCRE2_UCP compile option. There is no (?UCP) inline flag. |
| .NET (System.Text.RegularExpressions) | yes | yes by default | Word characters and boundaries are Unicode-aware out of the box. RegexOptions.ECMAScript restricts both to ASCII. |
| Ruby (Onigmo) | yes | partly | Ruby source files default to UTF-8 since 2.0. For UTF-8 strings \b is Unicode-aware, but \w itself matches only ASCII; use [[:word:]] or \p{Word} for Unicode word characters. |
| Go (regexp, RE2) | yes | no | Go's regexp is ASCII for \w and \b and has no Unicode-word flag. For Unicode word matching, build the class explicitly with [\p{L}\p{N}_] and match the neighbouring characters yourself (RE2 has no lookarounds), or use the regexp2 third-party package. |
| Rust (regex crate) | yes | yes when Unicode is enabled (default) | The default regex crate build uses Unicode \w and Unicode-aware \b. The regex-lite crate and builds without the Unicode features drop back to ASCII. |
| POSIX BRE / ERE | no | n/a | \b is not part of the standard. GNU grep and sed accept \b, \< and \> as extensions (vim also uses \< and \>); BSD tools use [[:<:]] and [[:>:]] for whole-word matching. |
For the broader cross-engine reference, see the Regex Cheat Sheet.
Unicode word boundaries
ASCII-only \b produces surprising results on non-English text. Examples:
// JavaScript: \w and \b are ASCII-only, with or without the 'u' flag
"naïve".match(/\bna\b/); // ["na"] - "ï" is a non-word char, so a boundary follows "na"
"naïve".match(/\bna\b/u); // ["na"] - the 'u' flag does not make \b Unicode-aware
// "café" as a full word
"café break".match(/\bcafé\b/u); // null - "é" and " " are both non-word, so no boundary
// Unicode-aware alternative: property escapes inside lookarounds
"café break".match(/(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u); // ["café"]
Python 3 defaults to Unicode for str patterns:
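import re
re.search(r"\bcafé\b", "café") # match, Unicode by default for str
re.search(r"\bcafé\b", "café", re.A) # None, ASCII mode: no boundary after "é"
re.search(r"\bnaïve\b", "naïve", re.A) # match even in ASCII mode: both edges are ASCII letters
Java is the opposite default; you have to opt in: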
Pattern p1 = Pattern.compile("\\bcafé\\b"); // ASCII \b (JDK 19+): no match in "café"
Pattern p2 = Pattern.compile("\\bcafé\\b", Pattern.UNICODE_CHARACTER_CLASS); // Unicode, matches
There is also an explicit Unicode word boundary, \b{wb} (Perl 5.22 and PCRE2), that uses Unicode TR29 segmentation rules (it handles CJK, emoji clusters, and contractions). Most engines don't implement it yet.
For applications dealing with multilingual content, default to the Unicode flag for your engine and test specifically against accented characters, CJK, and emoji.
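One way to test a pattern against non-ASCII text is to run the same whole-word check in both modes and compare. The sketch below does that in Python; the sample words are arbitrary choices for illustration, and they are written with escapes so the result does not depend on how the source file encodes accented letters.

```python
import re

samples = [
    "cafe",                # plain ASCII
    "caf\u00e9",           # ends with a precomposed e-acute
    "\u00e9lan",           # starts with a precomposed e-acute
    "\u65e5\u672c\u8a9e",  # three CJK characters
]

hits = {}
for word in samples:
    pattern = r"\b" + re.escape(word) + r"\b"
    unicode_hit = re.search(pattern, word) is not None          # default for str patterns
    ascii_hit = re.search(pattern, word, re.ASCII) is not None  # ASCII-only \w and \b
    hits[word] = (unicode_hit, ascii_hit)
    print(f"{word}: unicode={unicode_hit} ascii={ascii_hit}")
```

Only the characters at the two edges of the pattern decide whether \b holds, which is why a non-ASCII letter at the start or end of the word is what breaks ASCII mode.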
Lookaround alternative for finer control
\b is a yes/no assertion. When you need to assert which kind of character comes before or after, for example "preceded by whitespace specifically, not punctuation", use lookarounds instead:
| Goal | \b pattern | Lookaround alternative |
|---|---|---|
| Standalone whole word | \bcat\b | (?<!\w)cat(?!\w) |
| Preceded only by space (not punctuation) | not possible | (?<=\s)cat\b |
| Followed only by space or end-of-string | cat\b (too broad) | `cat(?=\s\|$)` |
| Whole word, treating underscore as a boundary | \bid\b (misses the id in user_id) | (?<![A-Za-z0-9])id(?![A-Za-z0-9]) |
The last case is particularly useful when working with identifiers: \b treats _ as a word character (so \bid\b won't match the id inside user_id). If you specifically want to find id as a standalone token even when adjacent to underscore, use the explicit lookaround.
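Lookbehind support varies by engine. JavaScript had no lookbehind at all before ES2018 and has supported variable-width lookbehind since. .NET supports variable-width lookbehind. Python's re module still requires fixed-width lookbehind; the third-party regex package lifts that limit. PCRE2 has traditionally required each lookbehind alternative to have a fixed length, and Java accepts only bounded-length lookbehind. Go's regexp (RE2) does not support lookarounds at all, neither fixed nor variable. The single-character lookbehinds in the table above are fixed-width, so they work in every engine that has lookbehind.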
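The lookaround variants can be compared side by side with \b. This is a sketch in Python, where every lookbehind used is one character wide and therefore fixed-width; the sample strings and the spans helper are assumptions made for illustration.

```python
import re

def spans(pattern, text):
    """Return (start, end) for every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# Equivalent to \bcat\b when the pattern starts and ends with word characters.
same_as_b = spans(r"(?<!\w)cat(?!\w)", "cat scatter cat.")
plain_b = spans(r"\bcat\b", "cat scatter cat.")

# Preceded by whitespace specifically: "(cat)" is skipped, " cat" is kept.
after_space = spans(r"(?<=\s)cat\b", "(cat) a cat")

# Followed by whitespace or end of string: "cat," is skipped.
before_space_or_end = spans(r"cat(?=\s|$)", "cat, cat")

# Treat "_" as a separator: \b misses the id in user_id, the lookaround finds it.
with_b = re.findall(r"\bid\b", "user_id")
with_lookaround = re.findall(r"(?<![A-Za-z0-9])id(?![A-Za-z0-9])", "user_id")

print(same_as_b, plain_b, after_space, before_space_or_end, with_b, with_lookaround)
```

Note that (?<=\s) requires an actual whitespace character, so it does not match at the very start of the string the way \b does.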
Common use cases with examples
1. Whole-word find and replace
The most common use of \b. Rename a variable in code without mangling longer identifiers:
# sed (GNU; \b is a GNU extension): rename 'count' to 'total' but skip 'counter', 'counterpart'
sed -E 's/\bcount\b/total/g' file.txt
# Python: same
import re
new = re.sub(r"\bcount\b", "total", source)
2. Highlight search terms
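// Escape regex metacharacters so the term is matched literally
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
// Wrap whole-word, case-insensitive matches of term in <mark> tags
const highlight = (text, term) =>
  text.replace(new RegExp(`\\b${escapeRegExp(term)}\\b`, "giu"), (m) => `<mark>${m}</mark>`);
highlight("React and reaction", "react"); // "<mark>React</mark> and reaction"
The \b anchors prevent highlighting "react" inside "reaction" or "reactor".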
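When the search term comes from user input or a variable, it should be escaped before it is wrapped in \b. The following is a minimal Python sketch of that idea; replace_whole_word is a hypothetical helper name, and it assumes the term starts and ends with a word character (otherwise \b would not sit where you expect).

```python
import re

def replace_whole_word(text, old, new):
    """Replace old with new only where old stands alone as a word."""
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function replacement keeps backslashes in `new` from being read as escapes.
    return re.sub(pattern, lambda _match: new, text)

source = "count = counter.count + recount(count)"
renamed = replace_whole_word(source, "count", "total")
print(renamed)  # total = counter.total + recount(total)
```

The identifiers counter and recount are left alone because the "count" inside them has a word character on one side.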
3. Extract whole numbers from text
import re
re.findall(r"\b\d+\b", "room 42 on floor 3 has 1024 widgets")
# ['42', '3', '1024']
4. Match log tokens precisely
# Match LEVEL tokens in a log file: INFO, WARN, ERROR
re.findall(r"\b(?:INFO|WARN|ERROR)\b", log_line)
Without \b, the pattern would also match INFO inside INFORMATION or WARN inside WARNING.
5. Validate identifier-like strings
// Match identifiers that are at least 3 chars, alphanumeric + underscore
const isIdentifier = (s) => /^\w{3,}$/.test(s);
For input-validation patterns specifically, the Regex Anchors article covers ^ and $, which often pair with \b for full-string assertions.
Common mistakes
Mistake 1: assuming \b treats a hyphenated term as one word. \b treats - as a non-word character, so \bword\b matches word in word-art. If you don't want that, use the explicit lookaround (?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-]).
Mistake 2: assuming \b is Unicode-aware. In JavaScript, \b is ASCII-only with or without the u flag, so every non-ASCII character is treated as a non-word character. \bcafé\b won't match café because there is no boundary between é and the end of the string, while \bnaïve\b still matches naïve because both of its edges are ASCII letters. The same trap applies in Java with the default flags, and in Go regardless.
Mistake 3: using \b in POSIX BRE/ERE. It is not part of the standard. GNU grep and sed accept \b, \< and \> as extensions (vim also uses \< and \>); BSD grep and sed use [[:<:]] and [[:>:]] for whole-word matching.
Mistake 4: thinking \B is "the opposite end" of \b. \B is "no boundary HERE", not "boundary on the other side". \Bbook\B matches book only when both edges are inside a word, as in notebooks; it never matches the standalone word book.
Mistake 5: confusing \b with ^ and $. ^ and $ anchor to the start and end of the line (or string with the appropriate flag); \b anchors to the start or end of a word. Both are zero-width but operate at different scales. See the Regex Anchors guide for the line/string anchors.
Mistake 6: copying a Python \b pattern to Go. Go's regexp will compile the pattern fine, but \w and \b are ASCII-only and there is no flag to flip that. If you need Unicode word matching in Go, either rewrite the pattern around an explicit [\p{L}\p{N}_] class and match the neighbouring characters yourself (RE2 has no lookarounds), or pull in the regexp2 package, a port of the .NET engine that supports lookarounds and Unicode word boundaries.
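In an engine without lookarounds, the usual workaround is to match the neighbouring character (or the string edge) with an ordinary group and capture only the word. The sketch below shows the shape of that technique in Python, kept in the same language as the other examples; Python's Unicode-aware \W stands in here for the negated [\p{L}\p{N}_] class a Go pattern would need, and find_whole_word is a hypothetical helper name.

```python
import re

def find_whole_word(word, text):
    """Find word as a whole token without \\b or lookarounds.

    The neighbouring character (or string edge) is matched by ordinary
    non-capturing groups; group 1 holds the word itself.
    """
    pattern = r"(?:^|\W)(" + re.escape(word) + r")(?:\W|$)"
    return [m.span(1) for m in re.finditer(pattern, text)]

# "caf\u00e9" is "café" with a precomposed e-acute.
print(find_whole_word("caf\u00e9", "caf\u00e9 break, caf\u00e9s"))  # [(0, 4)]

# Caveat: the separators are consumed, so of two occurrences that share a
# single separator only the first is reported.
print(find_whole_word("go", "go go"))  # [(0, 2)]
```

The consumed separator is the price of dropping lookarounds: callers that need every occurrence have to restart the search from the end of the captured word rather than the end of the whole match.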
What to do next
For more advanced word-matching patterns:
- Regex Lookaheads and Lookbehinds: the variable-width alternative to \b when you need finer control over what comes before or after.
- Regex Anchors: ^ and $ for line and string boundaries, the natural complement to \b.
- Regex Capturing Groups and Backreferences: when you need to reuse a matched word elsewhere in the pattern.
For specific real-world patterns that build on word boundaries:
- Match Email Address: \b is essential for picking emails out of prose.
- Match URLs: same idea, anchored to whole-token extraction.
- Match Domain Name: domain extraction from logs and mixed text.
- Match Numbers: \b\d+\b is the canonical whole-number pattern.
- Match HTML Tags: token-level matching against markup.
For a one-page reference covering every regex shortcut, see the Regex Cheat Sheet.
FAQ
See also
- Regex Anchors: ^ and $ are the full-string equivalents; word boundaries are the per-token version
- Regex Lookaheads and Lookbehinds: the variable-width replacement for \b when you need custom boundary characters
- Validate Password Strength with Regex: a counterexample, anchored validation that uses ^/$ rather than \b
- Regex Capturing Groups and Backreferences: \b(\w+)\b is the standard whole-word capture pattern
- Regex Cheat Sheet: the wider syntax and engine compatibility reference



