Regex Word Boundaries: \b, \B, and Lookaround Reference (2026)

\b is the zero-width assertion that matches the position between a word character and a non-word character: the start of "cat" in "the cat" or the end of "cat" in "cat food", but not anywhere inside "scatter". \B is its negation, a position where there is no word boundary. Both are zero-width: they consume no characters, they only assert position.

What is a word boundary (`\b`) in regex?

A word boundary is the zero-width position between a word character (\w: a letter, digit, or underscore) and a non-word character (\W: anything else), or at the very start or end of the string. The \b metacharacter matches that position. It is a position, not a character: \b does not match a literal b, and it consumes nothing, so it never "uses up" any of the input.

In one line: \b says "a word starts or ends right here". The classic use is "match a whole word, not the substring inside another word": \bcat\b matches "cat" in "the cat sat" but not "cat" in "scatter" or "category". \B is the negation, every position that is not a word boundary.
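The whole-word behaviour described above can be checked in a few lines. This is a minimal sketch using Python's standard `re` module; the variable names are illustrative only.

```python
import re

whole_cat = re.compile(r"\bcat\b")

# Standalone word: there is a boundary on both sides of "cat"
standalone = whole_cat.findall("the cat sat")
# Embedded: "cat" sits inside longer words, so at least one edge is word-to-word
embedded = whole_cat.findall("scatter category")

print(standalone)  # ['cat']
print(embedded)    # []
```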

That covers the short answer. The rest of this page is the full reference: the exact positions that match, per-engine support (JavaScript, Python, Java, PCRE, .NET, POSIX, Go, Rust), Unicode caveats, the lookaround alternative when you need finer control, and worked examples in Python and JavaScript. The same idea extends to log parsing (token-precise matches), refactor tools (rename only whole identifiers), and search-and-replace (don't mangle "Java" into "Python" inside "JavaScript").

Jump to:

\b is a boundary, not the letter b
Word characters: what counts as \w
Where \b matches: the four positions
Negated word boundary: \B
Engine compatibility
Unicode word boundaries
Lookaround alternative for finer control
Common use cases with examples
Common mistakes
FAQ

`\b` is a boundary, not the letter `b`

This trips up a lot of people searching for "b in regex" or "regex b", so it is worth being explicit. The escape \b (backslash-b) is a word boundary, a zero-width position. It is not the literal character b. To match an actual lowercase b you just write b with no backslash. So:

b matches the letter b.
\b matches a position (a word boundary), and matches no character at all.
\bb\b matches a standalone single-letter word "b" (the boundary, then the letter, then the boundary).

There is one more wrinkle that catches people out: inside a character class, \b changes meaning. In most engines [\b] is the backspace control character (U+0008), not a word boundary. A word boundary is meaningless inside a [...] class, so the engine reuses the escape for backspace there, the same way it does in string literals. If you meant "word boundary", keep \b outside the brackets.

python

import re
re.findall(r"\bb\b", "a b c bb")   # ['b']  - the standalone "b", not the "bb"
re.search(r"[\b]", "a\bc")         # match  - [\b] is backspace inside a class

Word characters: what counts as `\w`

Before word boundaries make sense, we need to be precise about what the engine considers a "word character". In ASCII mode, \w is shorthand for [a-zA-Z0-9_]; Unicode-aware engines extend it to letters and digits from other scripts:

Character class	Members
Letters	`a-z`, `A-Z`
Digits	`0-9`
Underscore	`_`

Yes, the underscore is a word character. This catches people off-guard: \bsnake\b does not match inside snake_case, because _ is a word character and forms no boundary with the e before it, but it does match inside snake-case (the - is a non-word character, so the engine sees two separate words).

Anything not in \w is in \W: spaces, tabs, punctuation, hyphens, and every non-ASCII letter or symbol depending on the engine and its Unicode mode.
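A short sketch of the \w membership rules, assuming Python's `re` module with `re.ASCII` so that the result mirrors the [a-zA-Z0-9_] definition. The candidate list is an arbitrary sample.

```python
import re

# Which of these single characters does ASCII-mode \w accept?
candidates = ["a", "Z", "7", "_", "-", " ", "!"]
word_chars = [c for c in candidates if re.fullmatch(r"\w", c, re.ASCII)]
print(word_chars)  # ['a', 'Z', '7', '_']

# The underscore keeps snake_case in one piece; the hyphen splits snake-case
tokens = re.findall(r"\b\w+\b", "snake_case snake-case")
print(tokens)  # ['snake_case', 'snake', 'case']
```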

Where `\b` matches: the four positions

A word boundary exists at any position where one side is a word character and the other side is not (including string start/end). The four cases:

Position	Example	Result
Start of string before a word char	`\bcat` against `"cat sat"`	matches at index 0
End of string after a word char	`cat\b` against `"the cat"`	matches `cat` at 4-7; the boundary is at index 7
After non-word, before word	`\bcat\b` against `"the cat sat"`	matches the standalone `cat`
After word, before non-word	same as above, the trailing edge of `cat`	matches

What \b does NOT match: positions where both sides are word characters (\bcat\b against "scatter" finds no match) or both sides are non-word characters ("!@#" has no word boundaries).
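One way to see these positions directly is to list every index where \b matches. The helper below is a hypothetical name introduced for illustration; it relies only on `re.finditer`, which reports zero-width matches.

```python
import re

def boundary_positions(text):
    # Every index in text at which the zero-width \b assertion succeeds
    return [m.start() for m in re.finditer(r"\b", text)]

print(boundary_positions("the cat"))  # [0, 3, 4, 7]
print(boundary_positions("scatter"))  # [0, 7]
print(boundary_positions("!@#"))      # []
```

In "the cat" the four indices are the two edges of each word; "scatter" has edges only at its ends, which is why \bcat\b cannot match inside it.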

Pattern	Input	Matches	Doesn't match
`\bbook\b`	`the book!`	`book`
`\bbook\b`	`notebook`		the substring is inside a word
`\bbook\b`	`bookstore`		trailing edge is word-to-word
`\bdog\b`	`bulldog ran`		leading edge is word-to-word
`\b\d+\b`	`room 42 floor 3`	`42`, `3`
`\b[A-Z]+\b`	`using SQL today`	`SQL`

Negated word boundary: `\B`

\B matches every position that \b does NOT, namely positions where both sides are word characters, or both sides are non-word characters. It's the assertion for "match this substring only when it is inside a word":

Pattern	Input	Matches	Why
`\Bis\B`	`this whisker`	`is` inside `whisker`	both sides are word chars; the `is` in `this` ends at a boundary
`\Boo\B`	`book`	`oo` inside `book`	both sides are word chars
`\Bing\B`	`singing`	`ing` (middle)	not standalone
`ing\B`	`singing`	the first `ing` only	the final `ing` is followed by the end of the string, which is a boundary
`\Bed`	`bedding`	matches inside `bedding`	left side is word char

The classic use: highlight every occurrence of a sequence as a substring but skip the standalone form. \Bing\b matches ing at the end of a word but not the standalone word "ing".
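The \B cases can be verified by looking at match spans rather than matched text. A small sketch with Python's `re` module:

```python
import re

text = "singing"
# \Bing\B: "ing" with word characters on both sides (the middle occurrence)
inside = [m.span() for m in re.finditer(r"\Bing\B", text)]
# \Bing\b: "ing" that ends a word but does not start one
suffix = [m.span() for m in re.finditer(r"\Bing\b", text)]
print(inside)  # [(1, 4)]
print(suffix)  # [(4, 7)]

# The standalone word "ing" is skipped; the suffix of "sing" is found
mixed = [m.span() for m in re.finditer(r"\Bing\b", "ing sing")]
print(mixed)   # [(5, 8)]
```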

Engine compatibility

Word boundary support varies by engine, especially around Unicode. The differences matter when you're writing patterns that need to work in more than one runtime.

Engine	`\b` and `\B`	Unicode-aware by default	Notes
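JavaScript	yes	no (ASCII only, even with the `u` or `v` flag)	`\w` and `\b` stay ASCII-based in every mode, so `\bé` never finds a boundary before a word-initial `é`. For Unicode word edges, use property escapes in lookarounds, e.g. `(?<![\p{L}\p{N}_])`, which need the `u` flag.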
Python (`re`)	yes	depends on flag	In Python 3, `re.UNICODE` is the default for `str` patterns; explicit `re.A` (`re.ASCII`) reverts to ASCII-only. For `bytes` patterns the default is ASCII.
Python (`regex` package)	yes	yes by default	The third-party `regex` package is Unicode-aware by default and exposes finer control with the `(?V1)` version flag.
Java (`java.util.regex`)	yes	no by default	`\w` and `\b` are ASCII unless you set `Pattern.UNICODE_CHARACTER_CLASS` (or use `(?U)` inline). The `UNICODE_CASE` flag is separate and only affects case-insensitive matching.
PCRE / PCRE2 / PHP	yes	optional	Enable with the start-of-pattern option `(*UCP)`, the PHP pattern modifier `u` (e.g., `/pattern/u`), or the PCRE2_UCP compile option. There is no `(?UCP)` inline flag.
.NET (`System.Text.RegularExpressions`)	yes	yes by default	Word characters and boundaries are Unicode-aware out of the box. `RegexOptions.ECMAScript` restricts both to ASCII.
Ruby (Onigmo)	yes	partly	In Ruby's default syntax `\w` is ASCII-only (`[a-zA-Z0-9_]`), while `\b` generally follows Unicode letters in UTF-8 strings. Use `[[:word:]]` or `\p{Word}` when you need Unicode word characters, and test `\b` against your own data.
Go (`regexp`, RE2)	yes	no	Go's `regexp` is ASCII for `\w` and `\b`, has no Unicode-word flag, and has no lookarounds. For Unicode word matching, match the neighbouring character explicitly with a class such as `[^\p{L}\p{N}_]` and capture the word in a group, or use the `regexp2` third-party package.
Rust (`regex` crate)	yes	yes when Unicode is enabled (default)	The default `regex` crate build uses Unicode `\w` and Unicode-aware `\b`. The `regex-lite` and `no-default-features` builds drop back to ASCII.
POSIX BRE / ERE	no (not in the standard)	n/a	GNU tools accept `\b`, `\<`, and `\>` as extensions (vim also uses `\<` and `\>`); BSD regex libraries, including macOS, use `[[:<:]]` and `[[:>:]]` for whole-word matching.
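The Python row is easy to confirm locally. The sketch below compares a `str` pattern, the same pattern with `re.ASCII`, and a `bytes` pattern; the sample text is an assumption chosen so that a word ends in a non-ASCII letter.

```python
import re

text = "caf\u00e9 au lait"  # "café au lait", with é written as an escape

# str pattern: Unicode-aware \w by default, so "café" is one word
unicode_words = re.findall(r"\b\w+\b", text)
# re.ASCII: é is a non-word character, so the word stops at "caf"
ascii_words = re.findall(r"\b\w+\b", text, re.ASCII)
# bytes pattern: ASCII-only by default
byte_words = re.findall(rb"\b\w+\b", text.encode("utf-8"))

print(unicode_words)  # ['café', 'au', 'lait']
print(ascii_words)    # ['caf', 'au', 'lait']
print(byte_words)     # [b'caf', b'au', b'lait']
```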

For the broader cross-engine reference, see the Regex Cheat Sheet.

Unicode word boundaries

ASCII-only \b produces surprising results on non-English text. Examples:

javascript

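// JavaScript's \w and \b are ASCII-based, with or without the u/v flag
"naïve".match(/\bna\b/);         // ["na"] - "ï" counts as non-word, so a boundary follows "na"
"naïve".match(/\bna\b/u);        // ["na"] - the u flag does not change \b

// "café" as a whole word
"café break".match(/\bcafé\b/);  // null - "é" and the space are both non-word, so no boundary
"café break".match(/\bcafé\b/u); // null - same result with u

// Unicode-aware alternative: property escapes in lookarounds (these need the u flag)
"café break".match(/(?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_])/u); // ["café"]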

Python 3 defaults to Unicode for str patterns:

python

import re
re.search(r"\bcafé\b", "café break")        # match, Unicode by default for str
re.search(r"\bcafé\b", "café break", re.A)  # None, "é" is non-word in ASCII mode so no boundary follows it

Java is the opposite default; you have to opt in:

java

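import java.util.regex.Pattern;
Pattern p1 = Pattern.compile("\\bcafé\\b");                                  // ASCII: no match in "café break" on JDK 19+ (older JDKs had a partly Unicode-aware \b)
Pattern p2 = Pattern.compile("\\bcafé\\b", Pattern.UNICODE_CHARACTER_CLASS); // Unicode: finds "café"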

Perl 5.22 also introduced an explicit Unicode word boundary, \b{wb}, that uses the Unicode UAX #29 (TR29) segmentation rules (it handles CJK, emoji clusters, and contractions). Most other engines don't implement it, so check your engine's documentation before relying on it.

For applications dealing with multilingual content, default to the Unicode flag for your engine and test specifically against accented characters, CJK, and emoji.
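A test along those lines might look like the sketch below. `has_whole_word` is a hypothetical helper, and the sample strings are assumptions picked to cover one accented, one CJK, and one emoji case.

```python
import re

def has_whole_word(word, text, flags=0):
    # re.escape keeps any metacharacters in word literal
    return re.search(rf"\b{re.escape(word)}\b", text, flags) is not None

checks = {
    # "café": é is a word character in Unicode mode, so "caf" is not a whole word
    "accented": has_whole_word("caf", "caf\u00e9"),
    # same input in ASCII mode: é is non-word, so "caf" is wrongly accepted
    "accented_ascii": has_whole_word("caf", "caf\u00e9", re.ASCII),
    # CJK text has no spaces: there is no boundary inside "東京都"
    "cjk": has_whole_word("\u6771\u4eac", "\u6771\u4eac\u90fd"),
    # an emoji is a non-word character, so a boundary follows "ok"
    "emoji": has_whole_word("ok", "ok\U0001F44D"),
}
print(checks)
```

The CJK case shows a limit of \b itself: even a Unicode-aware boundary cannot segment scripts that do not separate words with non-word characters.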

Lookaround alternative for finer control

\b is a yes/no assertion. When you need to assert which kind of character comes before or after, for example "preceded by whitespace specifically, not punctuation", use lookarounds instead:

Goal	`\b` pattern	Lookaround alternative
Standalone whole word	`\bcat\b`	`(?<!\w)cat(?!\w)`
Preceded only by space (not punctuation)	not possible	`(?<=\s)cat\b`
Followed only by space or end-of-string	`cat\b` (too broad)	`cat(?=\s|$)`
Whole word, treating underscore as a separator	`\bid\b` (misses `id` in `user_id`)	`(?<![A-Za-z0-9])id(?![A-Za-z0-9])`

The last case is particularly useful when working with identifiers: \b treats _ as a word character (so \bid\b won't match the id inside user_id). If you specifically want to find id as a standalone token even when adjacent to underscore, use the explicit lookaround.
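The difference is easiest to see by comparing match offsets. A minimal sketch with Python's `re` module; the sample identifiers are made up for illustration.

```python
import re

text = "id user_id idx grid_id"

# \b treats "_" as a word character, so only the first, free-standing "id" matches
plain = [m.start() for m in re.finditer(r"\bid\b", text)]
# Explicit lookarounds that leave "_" out of the class also accept "_id"
loose = [m.start() for m in re.finditer(r"(?<![A-Za-z0-9])id(?![A-Za-z0-9])", text)]

print(plain)  # [0]
print(loose)  # [0, 8, 20]
```

Neither form matches the "id" in "idx" or inside "grid", because a letter sits directly next to it.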

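Lookbehind support varies by engine. JavaScript had no lookbehind at all before ES2018 and has allowed variable-width lookbehind since. Python's re module still requires fixed-width lookbehind (the third-party regex package lifts that limit). .NET supports variable-width lookbehind, while Java and PCRE2 accept lookbehind with restrictions on its length. Go's regexp (RE2) does not support lookarounds at all, neither fixed nor variable. The single-character lookbehinds used on this page are fixed-width, so they work wherever lookbehind exists.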
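In Python's standard `re` module the fixed-width restriction shows up at compile time. A short sketch; the exact wording of the error message varies between Python versions, so it is printed rather than compared.

```python
import re

# Fixed-width lookbehind: accepted by Python's re module
fixed = re.search(r"(?<=\s)cat\b", "the cat sat")
print(fixed.span())  # (4, 7)

# Variable-width lookbehind: rejected when the pattern is compiled
try:
    re.compile(r"(?<=\s+)cat\b")
    variable_ok = True
except re.error as exc:
    variable_ok = False
    print(exc)  # wording depends on the Python version
```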

Word boundaries in Python and JavaScript

These two dominate, so here is the canonical whole-word match in each. The pattern is the same; what differs is how you express it and how Unicode behaves.

In Python, use a raw string (r"...") so the backslash reaches the regex engine instead of being eaten as a string escape. r"\bword\b" is the idiomatic form:

python

import re

re.findall(r"\bword\b", "word wordsmith password word.")  # ['word', 'word']
re.sub(r"\bcat\b", "dog", "cat scatter cat")              # 'dog scatter dog'
bool(re.search(r"\bcat\b", "the cat sat"))                # True
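To see why the raw string matters, compare it with an ordinary string literal, where Python converts \b to a backspace character before the regex engine ever sees it. A minimal sketch:

```python
import re

raw = r"\bcat\b"    # backslash-b reaches the regex engine: word boundary
cooked = "\bcat\b"  # Python turns each \b into a backspace (U+0008) first

print(len(raw), len(cooked))  # 7 5
raw_hit = bool(re.search(raw, "the cat sat"))
cooked_hit = bool(re.search(cooked, "the cat sat"))
print(raw_hit)     # True
print(cooked_hit)  # False: the pattern looks for backspace, "cat", backspace
```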

Python 3 is Unicode-aware by default for str patterns, so r"\bcafé\b" matches café in "café break" with no extra flag. Pass re.A (re.ASCII) if you specifically want ASCII-only boundaries.

In JavaScript, write the literal with a single backslash (/\bword\b/), or double it when building from a string with new RegExp("\\bword\\b"):

javascript

"word wordsmith password word.".match(/\bword\b/g);  // ['word', 'word']
"cat scatter cat".replace(/\bcat\b/g, "dog");        // 'dog scatter dog'
/\bcat\b/.test("the cat sat");                        // true

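JavaScript keeps \w and \b ASCII-only even with the u (or v) flag, so /\bcafé\b/u still fails on "café break". On non-ASCII text, replace \b with lookarounds built on Unicode property escapes, such as (?<![\p{L}\p{N}_]) and (?![\p{L}\p{N}_]), which require the u flag. More on that in Unicode word boundaries above.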

Common use cases with examples

1. Whole-word find and replace

The most common use of \b. Rename a variable in code without mangling longer identifiers:

bash

# GNU sed: rename 'count' to 'total' but skip 'counter', 'counterpart'
# (\b is a GNU extension; BSD/macOS sed uses [[:<:]] and [[:>:]] instead)
sed -E 's/\bcount\b/total/g' file.txt

python

# Python: same
import re
source = "count = counter + count"  # sample input
new = re.sub(r"\bcount\b", "total", source)  # 'total = counter + total'
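When the word to rename comes from user input rather than a literal, it is safer to escape it first. The helper below is a hypothetical sketch: `re.escape` keeps metacharacters in the old name literal, and the replacement is passed through a function so that backslashes in the new name are not interpreted.

```python
import re

def rename_identifier(source, old, new):
    # Replace old only where it stands as a whole word
    pattern = rf"\b{re.escape(old)}\b"
    return re.sub(pattern, lambda _match: new, source)

code = "count = counter + count_total + count"
renamed = rename_identifier(code, "count", "total")
print(renamed)  # total = counter + count_total + total
```

Note that count_total is left alone: the underscore is a word character, so there is no boundary after count there.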

2. Highlight search terms

javascript

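// Escape regex metacharacters so the search term is matched literally
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const highlight = (text, term) =>
  text.replace(new RegExp(`\\b${escapeRegExp(term)}\\b`, "giu"), (m) => `<mark>${m}</mark>`);
highlight("React reaction", "react"); // '<mark>React</mark> reaction'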

The \b anchors prevent highlighting "react" inside "reaction" or "reactor".

3. Extract whole numbers from text

python

import re
re.findall(r"\b\d+\b", "room 42 on floor 3 has 1024 widgets")
# ['42', '3', '1024']
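Two edge cases are worth knowing before relying on this pattern: a decimal point is a non-word character, and digits glued to letters or underscores have no boundary. The sketch below shows both, plus one possible tightening with lookarounds. The stricter pattern is an assumption about what "whole number" should mean, not a canonical form.

```python
import re

pattern = r"\b\d+\b"

decimal_parts = re.findall(pattern, "pi is 3.14")
glued = re.findall(pattern, "v2 build 2024_01")
print(decimal_parts)  # ['3', '14'] - the "." creates a boundary on each side
print(glued)          # [] - digits touch a letter or underscore, so no boundary

# One option: also refuse digits that touch a "." on either side
strict = re.findall(r"(?<![\w.])\d+(?![\w.])", "pi is 3.14 and 42")
print(strict)         # ['42']
```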

4. Match log tokens precisely

python

import re
# Match LEVEL tokens in a log file: INFO, WARN, ERROR
log_line = "12:00:01 ERROR disk full"  # sample line for illustration
re.findall(r"\b(?:INFO|WARN|ERROR)\b", log_line)  # ['ERROR']

Without \b, the pattern would also match INFO inside INFORMATION or ERROR inside ERRORS.
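The effect of the anchors is visible when the same alternation is run with and without them. The log line below is a made-up sample for illustration.

```python
import re

line = "2024-05-01 INFO INFORMATION_SCHEMA loaded, 0 ERRORS"

# Unanchored: also picks up INFO inside INFORMATION_SCHEMA and ERROR inside ERRORS
loose = re.findall(r"(?:INFO|WARN|ERROR)", line)
# Anchored: only the standalone level token survives
strict = re.findall(r"\b(?:INFO|WARN|ERROR)\b", line)

print(loose)   # ['INFO', 'INFO', 'ERROR']
print(strict)  # ['INFO']
```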

5. Validate identifier-like strings

javascript

// Match identifiers that are at least 3 chars, alphanumeric + underscore
const isIdentifier = (s) => /^\w{3,}$/.test(s);

For input-validation patterns specifically, the Regex Anchors article covers ^ and $ which often pair with \b for full-string assertions.
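A Python counterpart of the validator above, as a hedged sketch: `is_identifier` is a hypothetical name, `re.ASCII` is an assumption that identifiers should be ASCII-only, and `re.fullmatch` is used instead of ^ and $ because Python's $ also matches just before a trailing newline.

```python
import re

def is_identifier(s):
    # fullmatch anchors both ends, so no ^ or $ is needed
    return re.fullmatch(r"\w{3,}", s, re.ASCII) is not None

print(is_identifier("user_id"))    # True
print(is_identifier("ab"))         # False: shorter than 3 characters
print(is_identifier("has space"))  # False: space is not a word character
print(is_identifier("abc\n"))      # False: trailing newline is rejected

# For comparison, ^...$ with search accepts the trailing newline
dollar_accepts_newline = re.search(r"^\w{3,}$", "abc\n") is not None
print(dollar_accepts_newline)      # True
```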

Common mistakes

Mistake 1: assuming \b treats a hyphenated word as one token. \b treats - as a non-word character, so \bword\b matches word in word-art. If you don't want that, use the explicit lookaround (?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-]).

Mistake 2: assuming \b is Unicode-aware. In ASCII mode every non-ASCII character is a non-word character, so ï creates boundaries inside naïve: \bna\b matches its first two letters, and \bcafé\b fails in "café break" because é and the space are both non-word. This applies to JavaScript (with or without the u flag), to Java with the default flags, and to Go regardless.

Mistake 3: using \b in POSIX BRE/ERE. It isn't part of the standard. GNU grep and sed accept \b, \< and \> as extensions (vim also uses \< and \>); BSD tools, including those on macOS, use [[:<:]] and [[:>:]] for whole-word matching.

Mistake 4: thinking \B is "the opposite end" of \b. \B is "no boundary HERE", not "boundary on the other side". \Bbook\B matches book only when both edges are inside a word, as in notebooks; it never matches the standalone word book, nor the book in bookstore or notebook.

Mistake 5: confusing \b with ^ and $. ^ and $ anchor to the start and end of the string (or of each line in multiline mode); \b anchors to the start or end of a word. Both are zero-width but operate at different scales. See the Regex Anchors guide for the line/string anchors.

Mistake 6: copying a Python \b pattern to Go. Go's regexp will compile the pattern fine, but \w and \b are ASCII-only and there is no flag to flip them. Go also has no lookarounds, so if you need Unicode word matching, either match the neighbouring character explicitly with a class such as [^\p{L}\p{N}_] and capture the word in a group, or pull in the regexp2 package, a backtracking engine modelled on .NET's regex that supports lookarounds and Unicode-aware word characters.
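Mistakes 1, 4, and 5 are quick to reproduce. The sketch below uses Python's `re` module with sample strings chosen for illustration, and reports match offsets so the differences are explicit.

```python
import re

# Mistake 1: the hyphen is a non-word character, so \b sees an edge in "word-art"
text = "word-art word"
plain_hits = [m.start() for m in re.finditer(r"\bword\b", text)]
strict_hits = [m.start() for m in
               re.finditer(r"(?<![A-Za-z0-9_-])word(?![A-Za-z0-9_-])", text)]
print(plain_hits, strict_hits)  # [0, 9] [9]

# Mistake 4: \Bbook\B needs word characters on both sides of "book"
inner = {w: bool(re.search(r"\Bbook\B", w))
         for w in ["book", "bookstore", "notebook", "notebooks"]}
print(inner)  # only 'notebooks' maps to True

# Mistake 5: ^ anchors to the string (or to each line with re.M); \b anchors to words
sample = "cat nap\ncat and cat, tomcat"
string_start = [m.start() for m in re.finditer(r"^cat", sample)]
line_start = [m.start() for m in re.finditer(r"^cat", sample, re.M)]
word_edges = [m.start() for m in re.finditer(r"\bcat\b", sample)]
print(string_start, line_start, word_edges)  # [0] [0, 8] [0, 8, 16]
```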

What to do next

For more advanced word-matching patterns:

Regex Lookaheads and Lookbehinds: the more flexible alternative to \b when you need finer control over what comes before or after.
Regex Anchors: ^ and $ for line and string boundaries, the natural complement to \b.
Regex Capturing Groups and Backreferences: when you need to reuse a matched word elsewhere in the pattern.

For specific real-world patterns that build on word boundaries:

Match Email Address: \b is essential for picking emails out of prose.
Match URLs: same idea, anchored to whole-token extraction.
Match Domain Name: domain extraction from logs and mixed text.
Match Numbers: \b\d+\b is the canonical whole-number pattern.
Match HTML Tags: token-level matching against markup.

For the one-page reference with every regex shortcut on one page, see the Regex Cheat Sheet.

FAQ

What does \b mean in regex?

\b is a zero-width assertion that matches the position between a word character (\w: letter, digit, or underscore) and a non-word character (\W: anything else), or at the start or end of the string. It consumes no characters; it only asserts that the current position is a word boundary.

The classic use is \bword\b to match a standalone occurrence of word without matching the substring inside a longer word like password or wordsmith.

What is the difference between \b and \B in regex?

\b matches at a word boundary; \B matches at every position that is NOT a word boundary. Use \b to find standalone words (\bcat\b matches cat in the cat sat); use \B to find substrings inside words (\Bcat\B matches cat in scatter but not the standalone cat).

Both are zero-width and consume no characters.

Is underscore a word character in regex?

Yes, in every major engine that follows the \w = [A-Za-z0-9_] convention (JavaScript, Python, Java, PCRE, .NET, Ruby, Go, Rust). That means \b does NOT match between a letter and an underscore: \bid\b won't match id inside user_id because there's no boundary between _ and i.

If you want to treat underscore as a separator, use an explicit lookaround: (?<![A-Za-z0-9])id(?![A-Za-z0-9]).

Does \b work in JavaScript with Unicode characters?

Not reliably. JavaScript's \w and \b are ASCII-based, so every non-ASCII character counts as a non-word character: \bna\b matches inside naïve, and \bcafé\b fails in "café break" because no boundary follows the é.

The u flag does not change this. What it does enable is Unicode property escapes, so you can build Unicode-aware edges with lookarounds such as (?<![\p{L}\p{N}_])café(?![\p{L}\p{N}_]). The v flag (Unicode-sets mode, ES2024) enables the same Unicode features and adds set notation inside character classes.

Does POSIX regex support word boundaries?

Not as standard \b or \B. GNU tools accept \b, \< and \> as extensions, while BSD regex libraries (including macOS) use the bracket forms [[:<:]] (start-of-word) and [[:>:]] (end-of-word).

For shell scripts, grep -w handles simple whole-word searches on both GNU and BSD grep; grep -P (PCRE mode) is a GNU-only option, so fall back to the extensions for the target platform when it is unavailable.

How do I match a whole word in regex?

Wrap the word in \b anchors: \bword\b. That matches word as a standalone token (preceded and followed by either a non-word character or the start/end of the string) but not as a substring inside a longer word.

For case-insensitive matching add the i flag; for Unicode word handling add the appropriate flag for your engine ((*UCP) in PCRE, UNICODE_CHARACTER_CLASS in Java). JavaScript has no such flag for \b, so use \p{L}-based lookarounds with the u flag there.

Can I use lookarounds instead of \b for word boundaries?

Yes. (?<!\w)cat(?!\w) is equivalent to \bcat\b and gives you finer control. For example, to treat underscore as a separator (which \b doesn't): (?<![A-Za-z0-9])cat(?![A-Za-z0-9]).

Lookarounds require an engine that supports them (most do; Go's regexp / RE2 does NOT, and the Rust regex crate does not). See Regex Lookaheads and Lookbehinds for the full reference.

Why doesn't \bcat\b match 'cat' in 'cat-food'?

It does. The hyphen - is a non-word character, so there IS a word boundary between cat and -. \bcat\b matches the cat in cat-food and the standalone cat equally.

If you want cat-food to be treated as one token (no boundary at the hyphen), use the lookaround form (?<![A-Za-z0-9-])cat(?![A-Za-z0-9-]) to include hyphen as a "word" character for boundary purposes.

Recommended books

Word boundaries are one of the most misunderstood zero-width assertions. If you want the concepts properly grounded:

Learning Regular Expressions (Ben Forta). The gentlest on-ramp: short, current, and example-driven. A good first book if you are still finding your feet.
Regular Expressions Cookbook (Jan Goyvaerts and Steven Levithan, 2nd edition). Problem-then-solution recipes across eight languages (JavaScript, Python, PHP, Java, .NET, Ruby, Perl, VB). The one to keep next to the keyboard.
Mastering Regular Expressions (Jeffrey Friedl, 3rd edition). The definitive deep-dive on how regex engines actually work: backtracking, NFA versus DFA, and the optimisation that makes a pattern fast or catastrophic. Dense, and unmatched once you are past the basics.


Ishan Karunaratne
