\b is the zero-width assertion that matches the position between a word character and a non-word character: the start of "cat" in "the cat" or the end of "cat" in "cat food", but not anywhere inside "scatter". \B is its negation, a position where there is no word boundary. Both are zero-width: they consume no characters, they only assert position.
What is a word boundary (\b) in regex?
A word boundary is the zero-width position between a word character (\w: a letter, digit, or underscore) and a non-word character (\W: anything else), or at the very start or end of the string. The \b metacharacter matches that position. It is a position, not a character: \b does not match a literal b, and it consumes nothing, so it never "uses up" any of the input.
In one line: \b says "a word starts or ends right here". The classic use is "match a whole word, not the substring inside another word": \bcat\b matches "cat" in "the cat sat" but not "cat" in "scatter" or "category". \B is the negation, every position that is not a word boundary.
That covers the short answer. The rest of this page is the full reference: the exact positions that match, per-engine support (JavaScript, Python, Java, PCRE, .NET, POSIX, Go, Rust), Unicode caveats, the lookaround alternative when you need finer control, and worked examples in Python and JavaScript. The same idea extends to log parsing (token-precise matches), refactor tools (rename only whole identifiers), and search-and-replace (don't mangle "Java" into "Python" inside "JavaScript").
Jump to:
- \b is a boundary, not the letter b
- Word characters: what counts as \w
- Where \b matches: the four positions
- Negated word boundary: \B
- Engine compatibility
- Unicode word boundaries
- Lookaround alternative for finer control
- Common use cases with examples
- Common mistakes
- FAQ
\b is a boundary, not the letter b
This trips up a lot of people searching for "b in regex" or "regex b", so it is worth being explicit. The escape \b (backslash-b) is a word boundary, a zero-width position. It is not the literal character b. To match an actual lowercase b you just write b with no backslash. So:
- b matches the letter b.
- \b matches a position (a word boundary), and matches no character at all.
- \bb\b matches a standalone single-letter word "b" (the boundary, then the letter, then the boundary).
There is one more wrinkle that catches people out: inside a character class, \b changes meaning. In most engines [\b] is the backspace control character (U+0008), not a word boundary. A word boundary is meaningless inside a [...] class, so the engine reuses the escape for backspace there, the same way it does in string literals. If you meant "word boundary", keep \b outside the brackets.
import re
re.findall(r"\bb\b", "a b c bb") # ['b'] - the standalone "b", not the "bb"
re.search(r"[\b]", "a\bc") # match - [\b] is backspace inside a class
Word characters: what counts as \w
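Before word boundaries make sense, we need to be precise about what the engine considers a "word character". In ASCII mode, which is the default in several engines, \w is shorthand for [a-zA-Z0-9_]; Unicode-aware engines and modes extend it to letters and digits from other scripts: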
| Character class | Members |
|---|---|
| Letters | a-z, A-Z |
| Digits | 0-9 |
| Underscore | _ |
Yes, the underscore is a word character. This catches people off-guard: \b\w+\b matches snake_case as a single word because _ doesn't form a boundary, but it splits snake-case into snake and case (the - is a non-word character, so the engine sees two separate words).
Anything not in \w is in \W: spaces, tabs, punctuation, hyphens, and every non-ASCII letter or symbol depending on the engine and its Unicode mode.
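The underscore and hyphen behaviour described above is easy to check. A minimal sketch using Python's re module; the sample string is made up for illustration:

```python
import re

text = "snake_case snake-case v2_final"

# \w+ between two boundaries: the underscore stays inside a word,
# while the hyphen splits "snake-case" into two words.
words = re.findall(r"\b\w+\b", text)
print(words)  # ['snake_case', 'snake', 'case', 'v2_final']
```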
Where \b matches: the four positions
A word boundary exists at any position where one side is a word character and the other side is not (including string start/end). The four cases:
| Position | Pattern and input | Result |
|---|---|---|
| Start of string before a word char | \bcat against "cat sat" | boundary at index 0 |
| End of string after a word char | cat\b against "the cat" | boundary at index 7, after cat |
| After non-word, before word | \bcat\b against "the cat sat" | leading edge of the standalone cat |
| After word, before non-word | \bcat\b against "the cat sat" | trailing edge of the standalone cat |
What \b does NOT match: positions where both sides are word characters (\bcat\b against "scatter" finds no match) or both sides are non-word characters ("!@#" has no word boundaries).
| Pattern | Input | Matches | Doesn't match |
|---|---|---|---|
| \bbook\b | the book! | book | |
| \bbook\b | notebook | | the substring is inside a word |
| \bbook\b | bookstore | | trailing edge is word-to-word |
| \bdog\b | bulldog ran | | leading edge is word-to-word |
| \b\d+\b | room 42 floor 3 | 42, 3 | |
| \b[A-Z]+\b | using SQL today | SQL | |
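To see exactly where the boundaries fall, one option is to ask the engine for every zero-width \b match. The sketch below uses Python's re module; boundary_positions is a hypothetical helper name chosen for this example, not part of any library.

```python
import re

def boundary_positions(text: str) -> list:
    """Return every index at which the word-boundary assertion matches."""
    return [m.start() for m in re.finditer(r"\b", text)]

print(boundary_positions("the cat!"))  # [0, 3, 4, 7]
print(boundary_positions("scatter"))   # [0, 7]
print(boundary_positions("!@#"))       # []
```

Index 8 in "the cat!" is not a boundary: the character before it is "!" and nothing follows, so neither side is a word character.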
Negated word boundary: \B
\B matches every position that \b does NOT, namely positions where both sides are word characters, or both sides are non-word characters. It's the assertion for "match this substring only when it is inside a word":
| Pattern | Input | Matches | Why |
|---|---|---|---|
| \Bis | this island | is inside this | left side is a word char; the is that starts island has a boundary before it |
| \Boo\B | book | oo inside book | both sides are word chars |
| \Bing\B | singing | ing (middle) | both sides are word chars |
| ing\B | singing | ing not at end | the final ing is followed by the end of the string, which is a boundary |
| \Bed | bedding | ed inside bedding | left side is a word char |
The classic use: highlight every occurrence of a sequence as a substring but skip the standalone form. \Bing\b matches ing at the end of a word but not the standalone word "ing".
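The \B cases can be verified by printing match spans. This is a small sketch in Python; spans is a hypothetical helper defined only for this example.

```python
import re

def spans(pattern: str, text: str) -> list:
    """Return (start, end) for each match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

print(spans(r"\Bing\B", "singing"))  # [(1, 4)] - only the middle "ing"
print(spans(r"ing\B", "singing"))    # [(1, 4)] - the final "ing" is followed by a boundary
print(spans(r"\Bing\b", "singing"))  # [(4, 7)] - only the word-final "ing"
print(spans(r"\Boo\B", "book"))      # [(1, 3)]
print(spans(r"\Bed", "bedding"))     # [(1, 3)]
```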
Engine compatibility
Word boundary support varies by engine, especially around Unicode. The differences matter when you're writing patterns that need to work in more than one runtime.
| Engine | \b and \B | Unicode-aware by default | Notes |
|---|---|---|---|
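| JavaScript | yes | no (ASCII only, even with the u or v flag) | \w and \b always use [A-Za-z0-9_]; the u and v flags do not make them Unicode-aware. For Unicode word edges, use lookarounds with property escapes, e.g. (?<![\p{L}\p{N}_]) and (?![\p{L}\p{N}_]) under the u flag. |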
| Python (re) | yes | depends on flag | In Python 3, re.UNICODE is the default for str patterns; explicit re.A (re.ASCII) reverts to ASCII-only. For bytes patterns the default is ASCII. |
| Python (regex package) | yes | yes by default | The third-party regex package is Unicode-aware by default and exposes finer control with the (?V1) version flag. |
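| Java (java.util.regex) | yes | no by default | \w is ASCII unless you set Pattern.UNICODE_CHARACTER_CLASS (or use (?U) inline). Since JDK 19, \b follows \w and is ASCII-only by default; earlier JDKs treated letters and digits from any script as word characters for \b even though \w was ASCII. The UNICODE_CASE flag is separate and only affects case-insensitive matching. |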
| PCRE / PCRE2 / PHP | yes | optional | Enable with the start-of-pattern option (*UCP), the PHP pattern modifier u (e.g., /pattern/u), or the PCRE2_UCP compile option. There is no (?UCP) inline flag. |
| .NET (System.Text.RegularExpressions) | yes | yes by default | Word characters and boundaries are Unicode-aware out of the box. RegexOptions.ECMAScript restricts both to ASCII. |
| Ruby (Onigmo) | yes | partly | In the default mode, \w matches only ASCII word characters, while \b and \B follow the string's encoding and so are Unicode-aware on UTF-8 strings. The (?u) option makes \w Unicode-aware as well; (?a) forces ASCII for both. |
| Go (regexp, RE2) | yes | no | Go's regexp is ASCII for \w and \b and has no Unicode-word flag. RE2 has no lookarounds either, so for Unicode word matching, build the class explicitly with [\p{L}\p{N}_], match the neighbouring non-word character or string edge as part of the pattern and capture the word in a group, or use the regexp2 third-party package. |
| Rust (regex crate) | yes | yes when Unicode is enabled (default) | The default regex crate build uses Unicode \w and Unicode-aware \b. The regex-lite and no-default-features builds drop back to ASCII. |
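| POSIX BRE / ERE | no | n/a | \b is not part of POSIX. Use \< and \> (GNU grep and sed, vim) or [[:<:]] and [[:>:]] (BSD regex libraries, e.g. macOS and FreeBSD) for whole-word matching. GNU grep and sed also accept \b as an extension. |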
For the broader cross-engine reference, see the Regex Cheat Sheet.
Unicode word boundaries
ASCII-only \b produces surprising results on non-English text. Examples:
// JavaScript's \w and \b are ASCII-only, with or without the u flag
"naïve".match(/\bna\b/); // ["na"] - "ï" is a non-word char, so a boundary follows "na"
"naïve".match(/\bna\b/u); // ["na"] - the u flag does not change \b
// "café" full word
"café break".match(/\bcafé\b/); // null - no boundary between "é" and " "
"café break".match(/\bcafé\b/u); // null - same result with u
// Unicode-aware alternative: lookarounds with property escapes
"café break".match(/(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u); // ["café"]
Python 3 defaults to Unicode for str patterns:
import re
re.search(r"\bcafé\b", "café") # match, Unicode by default for str
re.search(r"\bcafé\b", "café", re.A) # None, "é" is a non-word char in ASCII mode
Java is the opposite default; you have to opt in:
Pattern p1 = Pattern.compile("\\bcafé\\b"); // default flags: ASCII \b on JDK 19+, no match against "café"
Pattern p2 = Pattern.compile("\\bcafé\\b", Pattern.UNICODE_CHARACTER_CLASS); // Unicode, matches
Perl 5.22 introduced an explicit Unicode word boundary, \b{wb}, that uses the Unicode TR29 segmentation rules (handles CJK, emoji clusters, contractions). Most other engines don't implement it yet.
For applications dealing with multilingual content, default to your engine's Unicode-aware mode (or to explicit Unicode character classes where \b has no such mode, as in JavaScript and Go) and test specifically against accented characters, CJK, and emoji.
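The reason a word ending in an accented letter loses its trailing boundary in ASCII mode becomes clearer when the boundary positions are listed in both modes. A minimal Python sketch, assuming the é is the precomposed code point U+00E9:

```python
import re

word = "caf\u00e9"  # "café" with a precomposed é (U+00E9)

# Unicode mode (the default for str patterns): é is a word character.
unicode_bounds = [m.start() for m in re.finditer(r"\b", word)]
# ASCII mode: é is a non-word character, so the boundary sits before it.
ascii_bounds = [m.start() for m in re.finditer(r"\b", word, re.ASCII)]

print(unicode_bounds)  # [0, 4]
print(ascii_bounds)    # [0, 3]

# No boundary after é in ASCII mode, so the trailing \b cannot match.
ascii_match = re.search(rf"\b{word}\b", word, re.ASCII)
print(ascii_match)  # None
```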
Lookaround alternative for finer control
\b is a yes/no assertion. When you need to assert which kind of character comes before or after, for example "preceded by whitespace specifically, not punctuation", use lookarounds instead:
| Goal | \b pattern | Lookaround alternative |
|---|---|---|
| Standalone whole word | \bcat\b | (?<!\w)cat(?!\w) |
| Preceded only by space (not punctuation) | not possible | (?<=\s)cat\b |
| Followed only by space or end-of-string | cat\b (too broad) | `cat(?=\s\|$)` |
| Whole word, treating underscore as a separator | \bid\b (misses the id in user_id) | (?<![A-Za-z0-9])id(?![A-Za-z0-9]) |
The last case is particularly useful when working with identifiers: \b treats _ as a word character (so \bid\b won't match the id inside user_id). If you specifically want to find id as a standalone token even when adjacent to underscore, use the explicit lookaround.
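The identifier case can be checked side by side. A short Python sketch; the sample text is invented for illustration:

```python
import re

text = "user_id id grid"

# \b treats "_" as a word character, so the id in user_id is skipped.
with_b = [m.span() for m in re.finditer(r"\bid\b", text)]

# Explicit lookarounds: only ASCII letters and digits count as "word"
# characters here, so "_" acts as a separator.
with_lookaround = [
    m.span() for m in re.finditer(r"(?<![A-Za-z0-9])id(?![A-Za-z0-9])", text)
]

print(with_b)           # [(8, 10)]
print(with_lookaround)  # [(5, 7), (8, 10)]
```

Neither pattern matches the id at the end of grid, because a letter precedes it.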
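Lookbehind support varies by engine. JavaScript had no lookbehind at all before ES2018; engines that implement ES2018 (V8 since 2018) support variable-width lookbehind, as does .NET. Python's built-in re module still requires fixed-width lookbehind; the third-party regex package lifts that restriction. PCRE2 and Java have traditionally required lookbehind of fixed or bounded length. Every lookbehind on this page is a single character wide, so it works in all of these. Go's regexp (RE2) does not support lookarounds at all, neither fixed nor variable.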
Word boundaries in Python and JavaScript
These two dominate, so here is the canonical whole-word match in each. The pattern is the same; what differs is how you express it and how Unicode behaves.
In Python, use a raw string (r"...") so the backslash reaches the regex engine instead of being eaten as a string escape. r"\bword\b" is the idiomatic form:
import re
re.findall(r"\bword\b", "word wordsmith password word.") # ['word', 'word']
re.sub(r"\bcat\b", "dog", "cat scatter cat") # 'dog scatter dog'
bool(re.search(r"\bcat\b", "the cat sat")) # True
Python 3 is Unicode-aware by default for str patterns, so r"\bcafé\b" matches café with no extra flag. Pass re.A (re.ASCII) if you specifically want ASCII-only boundaries.
In JavaScript, write the literal with a single backslash (/\bword\b/), or double it when building from a string with new RegExp("\\bword\\b"):
"word wordsmith password word.".match(/\bword\b/g); // ['word', 'word']
"cat scatter cat".replace(/\bcat\b/g, "dog"); // 'dog scatter dog'
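/\bcat\b/.test("the cat sat"); // true
JavaScript is ASCII-only for \w and \b, and the u and v flags do not change that, so on non-ASCII text replace \b with lookarounds built from Unicode property escapes, such as /(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u. More on that in Unicode word boundaries above.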
Common use cases with examples
1. Whole-word find and replace
The most common use of \b. Rename a variable in code without mangling longer identifiers:
# GNU sed: rename 'count' to 'total' but skip 'counter', 'counterpart'
sed -E 's/\bcount\b/total/g' file.txt
# Python: same
import re
new = re.sub(r"\bcount\b", "total", source)
2. Highlight search terms
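// Helper: escape regex metacharacters so the term is matched literally
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const highlight = (text, term) =>
  text.replace(new RegExp(`\\b${escapeRegExp(term)}\\b`, "gi"), (m) => `<mark>${m}</mark>`);
highlight("React reaction react", "react"); // '<mark>React</mark> reaction <mark>react</mark>'
The \b anchors prevent highlighting "react" inside "reaction" or "reactor".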
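When the search term comes from a user, \b only works if the term starts and ends with a word character. The following Python sketch shows one way around that, using lookarounds instead of \b. The whole_word helper is hypothetical (it is not part of the re module), and the sample sentence is invented for illustration.

```python
import re

def whole_word(term: str, flags: int = 0) -> re.Pattern:
    """Compile a whole-word pattern for an arbitrary literal term.

    Lookarounds replace the word-boundary escape so that terms starting
    or ending with a non-word character (such as "C++") still work.
    """
    return re.compile(rf"(?<!\w){re.escape(term)}(?!\w)", flags)

text = "I like C++ and C++11"

# Lookaround version: finds the standalone "C++" only.
print([m.span() for m in whole_word("C++").finditer(text)])  # [(7, 10)]

# \b version: misses the standalone "C++" and hits the one inside "C++11",
# because the only boundary after "++" is the one before the digit.
print([m.span() for m in re.finditer(r"\bC\+\+\b", text)])   # [(15, 18)]

print(whole_word("react", re.IGNORECASE).findall("React reaction"))  # ['React']
```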
3. Extract whole numbers from text
import re
re.findall(r"\b\d+\b", "room 42 on floor 3 has 1024 widgets")
# ['42', '3', '1024']
4. Match log tokens precisely
# Match LEVEL tokens in a log file: INFO, WARN, ERROR
log_line = "WARN INFORMATION cache miss"  # sample input for illustration
re.findall(r"\b(?:INFO|WARN|ERROR)\b", log_line)  # ['WARN']
Without \b, the pattern would also match INFO inside INFORMATION or WARN inside WARNING.
5. Validate identifier-like strings
// Match identifiers that are at least 3 chars, alphanumeric + underscore
const isIdentifier = (s) => /^\w{3,}$/.test(s);
For input-validation patterns specifically, the Regex Anchors article covers ^ and $ which often pair with \b for full-string assertions.
Common mistakes
Mistake 1: assuming \b keeps hyphenated words together. \b treats - as a non-word character, so \bword\b matches word in word-art. If you don't want that, use the explicit lookaround (?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-]).
Mistake 2: assuming \b is Unicode-aware. In JavaScript, with or without the u flag, every non-ASCII character is treated as a non-word character. \bcafé\b won't match café in "café break" because é is a non-word character in ASCII mode, so there is no boundary between it and the following space. The same trap applies in Java with the default flags, and in Go regardless.
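Mistake 3: assuming \b works in POSIX BRE/ERE. It is not part of the standard. GNU grep and sed accept \b as an extension, but for whole-word matching elsewhere use \< and \> (GNU tools, vim) or [[:<:]] and [[:>:]] (BSD regex libraries, e.g. macOS and FreeBSD).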
Mistake 4: thinking \B is "the opposite end" of \b. \B is "no boundary HERE", not "boundary on the other side". \Bbook\B matches book only when both edges are inside a word, which is rare for the whole word book (it would need to be inside something like notebooks).
Mistake 5: confusing \b with ^ and $. ^ and $ anchor to the start and end of the line (or string with the appropriate flag); \b anchors to the start or end of a word. Both are zero-width but operate at different scales. See the Regex Anchors guide for the line/string anchors.
Mistake 6: copying a Python \b pattern to Go. Go's regexp will compile the pattern fine, but \w and \b are ASCII-only and there is no flag to flip them. RE2 has no lookarounds either, so if you need Unicode word matching in Go, either match the neighbouring characters explicitly with a [\p{L}\p{N}_] class and capture the word in a group, or pull in the regexp2 package, which supports a fuller backtracking feature set including lookarounds and Unicode word boundaries.
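Mistakes 1 and 4 are quick to reproduce. A minimal Python sketch with invented sample strings:

```python
import re

text = "word-art word"

# Mistake 1: the hyphen is a non-word character, so \b sees a boundary there.
plain = [m.span() for m in re.finditer(r"\bword\b", text)]
print(plain)  # [(0, 4), (9, 13)]

# Treat "-" as part of the token by excluding it in the lookarounds.
hyphen_aware = [
    m.span()
    for m in re.finditer(r"(?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-])", text)
]
print(hyphen_aware)  # [(9, 13)]

# Mistake 4: \B on both sides needs word characters on both sides,
# so only the "book" inside "notebooks" qualifies.
inner = [m.span() for m in re.finditer(r"\Bbook\B", "book notebook notebooks")]
print(inner)  # [(18, 22)]
```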
What to do next
For more advanced word-matching patterns:
- Regex Lookaheads and Lookbehinds: the variable-width alternative to \b when you need finer control over what comes before or after.
- Regex Anchors: ^ and $ for line and string boundaries, the natural complement to \b.
- Regex Capturing Groups and Backreferences: when you need to reuse a matched word elsewhere in the pattern.
For specific real-world patterns that build on word boundaries:
- Match Email Address: \b is essential for picking emails out of prose.
- Match URLs: same idea, anchored to whole-token extraction.
- Match Domain Name: domain extraction from logs and mixed text.
- Match Numbers: \b\d+\b is the canonical whole-number pattern.
- Match HTML Tags: token-level matching against markup.
For a one-page reference covering every regex shortcut, see the Regex Cheat Sheet.
FAQ
What does \b mean in regex?
\b is a zero-width assertion that matches the position between a word character (\w: letter, digit, or underscore) and a non-word character (\W: anything else), or at the start or end of the string. It consumes no characters; it only asserts that the current position is a word boundary.
The classic use is \bword\b to match a standalone occurrence of word without matching the substring inside a longer word like password or wordsmith.
What is the difference between \b and \B?
\b matches at a word boundary; \B matches at every position that is NOT a word boundary. Use \b to find standalone words (\bcat\b matches cat in the cat sat); use \B to find substrings inside words (\Bcat\B matches cat in scatter but not the standalone cat).
Both are zero-width and consume no characters.
Is the underscore a word character?
Yes, in every major engine that follows the \w = [A-Za-z0-9_] convention (JavaScript, Python, Java, PCRE, .NET, Ruby, Go, Rust). That means \b does NOT match between a letter and an underscore: \bid\b won't match id inside user_id because there's no boundary between _ and i.
If you want to treat underscore as a separator, use an explicit lookaround: (?<![A-Za-z0-9])id(?![A-Za-z0-9]).
Does \b handle Unicode in JavaScript?
No. JavaScript's \w and \b are ASCII-only, so every non-ASCII character counts as a non-word character and \bcafé\b won't match café in "café break": there is no boundary between é and the space.
The u flag does not change this. It does enable Unicode property escapes, so you can write the boundary as lookarounds instead: /(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u. The v flag (Unicode-sets mode, ES2024) is a superset of u that adds set notation inside character classes; the two flags cannot be combined.
Does POSIX regex support \b?
Not in the standard. POSIX BRE and ERE define neither \b nor \B. GNU grep and sed accept \b, \< and \> as extensions, and BSD regex libraries (macOS, FreeBSD) offer the bracket expressions [[:<:]] (start-of-word) and [[:>:]] (end-of-word).
For shell scripting, use grep -P (PCRE mode, GNU grep only) when available, or fall back to the extension your target platform supports.
How do I match a whole word only?
Wrap the word in \b anchors: \bword\b. That matches word as a standalone token (preceded and followed by either a non-word character or the start/end of the string) but not as a substring inside a longer word.
For case-insensitive matching add the i flag; for Unicode word handling use the appropriate mechanism for your engine ((*UCP) in PCRE, UNICODE_CHARACTER_CLASS in Java, lookarounds with \p{L} under the u flag in JavaScript).
Can I use lookarounds instead of \b?
Yes. (?<!\w)cat(?!\w) is equivalent to \bcat\b and gives you finer control. For example, to treat underscore as a separator (which \b doesn't): (?<![A-Za-z0-9])cat(?![A-Za-z0-9]).
Lookarounds require an engine that supports them (most do; Go's regexp / RE2 does NOT, and the Rust regex crate does not). See Regex Lookaheads and Lookbehinds for the full reference.
Does \bcat\b match the cat in cat-food?
It does. The hyphen - is a non-word character, so there IS a word boundary between cat and -. \bcat\b matches the cat in cat-food and the standalone cat equally.
If you want cat-food to be treated as one token (no boundary at the hyphen), use the lookaround form (?<![A-Za-z0-9-])cat(?![A-Za-z0-9-]) to include hyphen as a "word" character for boundary purposes.
See also
- Regex Anchors: ^ and $ are the full-string equivalents; word boundaries are the per-token version
- Regex Lookaheads and Lookbehinds: the variable-width replacement for \b when you need custom boundary characters
- Validate Password Strength with Regex: a counterexample, anchored validation that uses ^/$ rather than \b
- Regex Capturing Groups and Backreferences: \b(\w+)\b is the standard whole-word capture pattern
- Regex Cheat Sheet: the wider syntax and engine compatibility reference
Recommended books
Word boundaries are one of the most misunderstood zero-width assertions. If you want the concepts properly grounded:
- Learning Regular Expressions (Ben Forta). The gentlest on-ramp: short, current, and example-driven. A good first book if you are still finding your feet.
- Regular Expressions Cookbook (Jan Goyvaerts and Steven Levithan, 2nd edition). Problem-then-solution recipes across eight languages (JavaScript, Python, PHP, Java, .NET, Ruby, Perl, VB). The one to keep next to the keyboard.
- Mastering Regular Expressions (Jeffrey Friedl, 3rd edition). The definitive deep-dive on how regex engines actually work: backtracking, NFA versus DFA, and the optimisation that makes a pattern fast or catastrophic. Dense, and unmatched once you are past the basics.





