The shortest regex that matches a simple HTML opening tag: <[a-zA-Z][a-zA-Z0-9]*>. It accepts <div>, <p>, <h1>, and rejects <123> (numeric first character) and <!doctype html> (starts with !, not a letter). For tags with attributes, closing tags, self-closing tags, or attribute capture, the pattern gets longer. Below I walk through every variant with runnable code in JavaScript, Python, and PHP, engine notes per language, and the bugs I've shipped, and I'll be honest about when this is the wrong approach (which is most of the time).
HTML is famously not a regular language. A parser handles nested structure, attribute quoting, void elements, character entities, comments, and CDATA correctly; regex handles none of those. The patterns below are for the narrow cases where a parser is overkill (log scanning, quick search-replace in a single file, stripping known tags from trusted text) and never for parsing untrusted HTML.
Quick reference
Simple tag (no attributes):
<[a-zA-Z][a-zA-Z0-9]*>
Opening tag with attributes:
<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>
Closing tag:
</[a-zA-Z][a-zA-Z0-9]*\s*>
Self-closing (XHTML/JSX):
<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>
Simple tag (no attributes)
<[a-zA-Z][a-zA-Z0-9]*>
Matches: <div>, <p>, <h1>, <MyComponent>. Rejects: <123>, </div>, <br/>, <!doctype html>.
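The first character has to be a letter; subsequent characters can be letters or digits. This covers every standard HTML element name, but not custom elements such as <my-widget>, whose names contain a hyphen; add - to the second character class if you need those.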
Opening tag with attributes
<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>
The pattern after the element name handles attributes: zero or more of (\s+name(=value)?). The attribute value is one of:
- "double-quoted"
- 'single-quoted'
- unquoted (no whitespace, no >)
This matches <a href="https://example.com">, <input type="text" required>, and <div class='card primary' data-id="42">.
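As a rough illustration of the pattern in use, here is a minimal Python sketch that scans a string for opening tags. The helper name `find_opening_tags` and the sample string are assumptions made for this example, and the inner groups are switched to non-capturing so that each match is returned as a whole tag.

```python
import re

# Opening-tag pattern from this section. The inner groups are made
# non-capturing (?:...) so each match comes back as one whole tag.
OPEN_TAG = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(?:\s+[a-zA-Z-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)


def find_opening_tags(text: str) -> list[str]:
    """Return every opening tag found in text, in document order (hypothetical helper)."""
    return [m.group(0) for m in OPEN_TAG.finditer(text)]


sample = '<p>See <a href="https://example.com">this</a> and <input type="text" required></p>'
print(find_opening_tags(sample))
# ['<p>', '<a href="https://example.com">', '<input type="text" required>']
```

Closing tags such as </a> and </p> are skipped because the character after < must be a letter.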
Closing tag
</[a-zA-Z][a-zA-Z0-9]*\s*>
Closing tags have a leading </ and cannot have attributes. The optional whitespace before > allows </div > (rare but legal).
Self-closing tag
<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>
XHTML and JSX use the /> ending for elements with no content: void elements such as <br/>, and in JSX any element without children. The pattern allows any content between the name and the /> for attributes.
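To see how the closing and self-closing patterns sit next to the opening-tag pattern, here is a small classifier sketch in Python. The function name `classify` and its return labels are made up for this example; the three patterns are the ones from the quick reference, applied with `fullmatch` so the whole string has to be a single tag.

```python
import re

# Patterns from the quick reference above.
OPENING = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)
CLOSING = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")
SELF_CLOSING = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*[^>]*/>")


def classify(tag: str) -> str:
    """Label a single tag string (hypothetical helper for illustration)."""
    if CLOSING.fullmatch(tag):
        return "closing"
    if SELF_CLOSING.fullmatch(tag):
        return "self-closing"
    if OPENING.fullmatch(tag):
        return "opening"
    return "unknown"


for tag in ["<br/>", '<img src="a.png" />', "</div >", '<div class="card">', "<!doctype html>"]:
    print(tag, "->", classify(tag))
```

This only classifies strings that are already isolated tags; it makes no attempt to pair opening and closing tags.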
Specific tag (e.g., only h1-h6)
<h[1-6]([^>]*)>(.*?)</h[1-6]>
This matches an <h1> through <h6> opening tag (with optional attributes), captures the inner text, and then matches a closing heading tag.
Important caveat: the non-greedy .*? only handles single-line headings, because . does not match newlines by default. For headings that span multiple lines, enable dotall mode (the s flag in JavaScript and PHP, re.DOTALL in Python) so . matches newlines too.
Also: this pattern does not enforce that the opening and closing levels match. <h1>...</h3> will match because both sides use h[1-6]. To enforce matching, use a backreference: <(h[1-6])([^>]*)>(.*?)</\1>.
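A minimal Python sketch of the backreference form with dotall enabled. The function name `extract_headings` and the sample document are assumptions for illustration only.

```python
import re

# Backreference form: \1 must repeat whatever group 1 captured (h1..h6).
# DOTALL lets . cross newlines; IGNORECASE accepts <H2> as well.
HEADING = re.compile(r"<(h[1-6])([^>]*)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def extract_headings(html: str) -> list[tuple[str, str]]:
    """Return (level, inner text) pairs for well-paired headings (hypothetical helper)."""
    return [(m.group(1).lower(), m.group(3).strip()) for m in HEADING.finditer(html)]


doc = """<h1 class="title">Intro</h1>
<h2>Multi
line</h2>
<h3>Broken</h4>"""

print(extract_headings(doc))
# [('h1', 'Intro'), ('h2', 'Multi\nline')]
```

The mismatched <h3>...</h4> pair is skipped because \1 requires a closing </h3>.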
Extracting attribute values
href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))
This isolates the href attribute and captures its value into one of three groups depending on quoting style. Run it inside an opening-tag match to extract specifically the attribute you want.
In code, you usually run this in a loop over attribute names rather than building one mega-pattern.
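One way to sketch that loop in Python is below. The helper `get_attribute` is hypothetical. It also adds a lookbehind, `(?<![\w-])`, which is not part of the pattern above; it is there so that asking for href does not pick up data-href.

```python
import re
from typing import Optional


def get_attribute(tag: str, name: str) -> Optional[str]:
    """Return the value of one attribute from an opening-tag string, or None.

    Hypothetical helper. The lookbehind (?<![\\w-]) is an addition to the
    article's pattern so that "href" does not match inside "data-href".
    """
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(name) + r"""\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
        re.IGNORECASE,
    )
    m = pattern.search(tag)
    if m is None:
        return None
    # Exactly one of the three groups takes part in a match.
    return next(g for g in m.groups() if g is not None)


tag = """<a href="https://example.com" class='btn primary' target=_blank>"""
values = {name: get_attribute(tag, name) for name in ("href", "class", "target", "rel")}
print(values)
```

Attributes that are absent come back as None, which keeps them distinct from an attribute that is present with an empty value.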
When to use a parser instead
The cases where regex on HTML breaks down:
- Nested same-element tags. <div><div>...</div></div> requires balanced-bracket matching, which regular languages cannot do.
- CDATA, comments, processing instructions. <![CDATA[<div>not a tag</div>]]> contains characters that look like tags but are not.
- HTML entities and character references. &lt;div&gt; is the text <div>, not a tag.
- Attribute values containing < or >. Rare but legal: <a title="1 < 2">.
- Unquoted attribute values containing escaped quotes. Rare but legal.
- Untrusted input. Anyone parsing user-submitted HTML with regex is one cleverly-crafted input away from a security bug.
For any of those cases, use a parser:
- JavaScript: DOMParser in browsers, cheerio or parse5 in Node.
- Python: BeautifulSoup (bs4) or lxml.
- PHP: DOMDocument or html5-php.
- Go: golang.org/x/net/html.
The parser handles every edge case correctly. It is usually slower than a regex, but the speed gained by staying with regex is rarely worth the correctness lost.
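The nested-element case is easy to reproduce. The sketch below uses Python's standard-library `html.parser` rather than BeautifulSoup or lxml, purely to stay dependency-free; the class name `DivDepth` is invented for this example.

```python
import re
from html.parser import HTMLParser

nested = "<div>outer <div>inner</div> tail</div>"

# The non-greedy regex stops at the FIRST </div>, so the outer div is cut short.
regex_result = re.search(r"<div>(.*?)</div>", nested).group(1)
print(regex_result)  # outer <div>inner


class DivDepth(HTMLParser):
    """Track <div> nesting depth (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


parser = DivDepth()
parser.feed(nested)
parser.close()
print(parser.max_depth, parser.depth)  # 2 0
```

The parser sees two levels of nesting and ends balanced at depth 0, which is exactly the bookkeeping a single regex cannot do.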
Examples in JavaScript, Python, and PHP
JavaScript, extracting all <h2> headings from a document:
const html = document.documentElement.outerHTML;
const headings = [];
const pattern = /<h2([^>]*)>(.*?)<\/h2>/gis;
let m;
while ((m = pattern.exec(html)) !== null) {
  headings.push({ attrs: m[1].trim(), text: m[2] });
}
The same task using DOMParser (the right way):
const parser = new DOMParser();
const doc = parser.parseFromString(html, "text/html");
const headings = [...doc.querySelectorAll("h2")].map(h => ({
  attrs: [...h.attributes].map(a => `${a.name}="${a.value}"`).join(" "),
  text: h.textContent,
}));
Python, stripping all <script> and <style> blocks from text:
import re
def strip_dangerous(html: str) -> str:
    # Regex version (acceptable for trusted input, not security-critical).
    # \b stops <scripted> from matching; \s* tolerates a closing tag written as </script >.
    return re.sub(r"<(script|style)\b[^>]*>.*?</\1\s*>", "", html, flags=re.IGNORECASE | re.DOTALL)
The parser version (use this for untrusted input):
from bs4 import BeautifulSoup
def strip_dangerous(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return str(soup)
PHP, extracting all link href values from a string:
function extractHrefs(string $html): array {
    preg_match_all('/<a\s[^>]*href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>/i', $html, $matches);
    $hrefs = [];
    for ($i = 0; $i < count($matches[0]); $i++) {
        $hrefs[] = $matches[1][$i] ?: $matches[2][$i] ?: $matches[3][$i];
    }
    return $hrefs;
}
The parser version:
function extractHrefs(string $html): array {
    $dom = new DOMDocument();
    // @ silences the warnings libxml emits for HTML5 elements and malformed markup.
    @$dom->loadHTML($html);
    $hrefs = [];
    foreach ($dom->getElementsByTagName("a") as $a) {
        $hrefs[] = $a->getAttribute("href");
    }
    return $hrefs;
}
Engine compatibility
The patterns use backreferences (\1) for matched-tag pairs, the s (dotall) flag for multi-line content, and non-capturing groups. Most engines support all three; Go RE2 and Rust's regex crate do not support backreferences, which is the main cross-engine limitation.
| Engine | Tag patterns | Backreferences (\1) | Dotall flag | Parser equivalent |
|---|---|---|---|---|
| JavaScript | Works | Works | s flag (ES2018+) | DOMParser in browsers, cheerio / parse5 in Node |
| Python (re) | Works | Works | re.DOTALL | BeautifulSoup, lxml |
| PHP (PCRE) | Works | Works | s modifier | DOMDocument, html5-php |
| Java | Works | Works | Pattern.DOTALL | Jsoup |
| .NET | Works | Works | RegexOptions.Singleline | HtmlAgilityPack |
| Go (RE2) | Works (no backref) | Not supported | (?s) flag | golang.org/x/net/html |
| Rust (regex crate) | Works (no backref) | Not supported | (?s) flag | scraper crate, html5ever |
| Ruby | Works | Works | m flag (in Ruby m means dotall) | Nokogiri |
| POSIX ERE (grep -E) | Works (basic patterns only) | BRE only (\1 in basic mode) | None | None |
For Go and Rust, the matched-pair patterns (<(h[1-6])>(.*?)</\1>) don't work directly. Either iterate with two passes (collect all opening tags, then for each, find the matching closing tag in a second scan) or use the language's HTML parser.
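The two-pass idea can be sketched as follows. The example is in Python for consistency with the other Python snippets here, and it deliberately avoids \1 so the same structure should carry over to an engine without backreferences; the function name `headings_without_backrefs` is an assumption.

```python
import re

# First pass: find an opening heading tag and remember its level.
OPEN_HEADING = re.compile(r"<(h[1-6])([^>]*)>", re.IGNORECASE)


def headings_without_backrefs(html: str) -> list[tuple[str, str]]:
    """Pair heading tags without using \\1 (hypothetical helper)."""
    results = []
    pos = 0
    while True:
        opening = OPEN_HEADING.search(html, pos)
        if opening is None:
            break
        level = opening.group(1).lower()
        # Second pass: build a literal closing-tag pattern for this level.
        closing = re.compile(r"</" + level + r"\s*>", re.IGNORECASE).search(html, opening.end())
        if closing is None:
            # No matching close: skip this opening tag and keep scanning.
            pos = opening.end()
            continue
        results.append((level, html[opening.end():closing.start()]))
        pos = closing.end()
    return results


print(headings_without_backrefs("<h1>A</h1><h2 id='x'>B</h2><h3>C</h4>"))
# [('h1', 'A'), ('h2', 'B')]
```

Like the backreference version, this does not handle nesting; it only guarantees that the closing level equals the opening level.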
Common mistakes
The bugs I've shipped or seen in code review.
Using .* (greedy) instead of .*? (non-greedy). Against <div>a</div><div>b</div>, the pattern <div>.*</div> captures the entire string because .* matches as much as possible. Use .*? or, better, a negated class like [^<]*.
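A quick Python comparison of the three forms on that exact input; the variable names are only for illustration.

```python
import re

html = "<div>a</div><div>b</div>"

greedy = re.findall(r"<div>(.*)</div>", html)        # .* runs to the LAST </div>
lazy = re.findall(r"<div>(.*?)</div>", html)         # .*? stops at the first </div>
negated = re.findall(r"<div>([^<]*)</div>", html)    # [^<]* cannot cross any tag

print(greedy)   # ['a</div><div>b']
print(lazy)     # ['a', 'b']
print(negated)  # ['a', 'b']
```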
Forgetting the s/dotall flag for multi-line content. Without it, . does not match newlines, so a <div> with content on multiple lines fails to match. Add the flag.
Treating regex-extracted HTML as safe to render. A pattern that extracts <a> tags from arbitrary input can include attacker-injected content. Always sanitise via a parser or library (DOMPurify, bleach, HTMLPurifier) before rendering.
Assuming closing tags match opening tags by position. <h1>...</h3> is invalid HTML but matches a pattern like <h[1-6]>(.*?)</h[1-6]>. Use a backreference (<(h[1-6])>...</\1>) to enforce matching.
Trying to match HTML comments with <!--.*-->. Without the s flag this fails on multi-line comments; with it, a string like <!-- a --> b <!-- c --> is captured as one giant comment because .* is greedy. Use <!--[\s\S]*?--> for the cleanest cross-engine form.
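The three behaviours are easy to check in Python. The sample text below is made up for this example and puts a newline inside the second comment.

```python
import re

text = "<!-- a --> b <!-- c\nd -->"

# No dotall: the multi-line comment is missed entirely.
no_flag = re.findall(r"<!--.*-->", text)
# Dotall + greedy: everything from the first <!-- to the last --> is one match.
greedy_dotall = re.findall(r"<!--.*-->", text, flags=re.DOTALL)
# [\s\S]*? needs no flag and stops at the first -->.
lazy = re.findall(r"<!--[\s\S]*?-->", text)

print(no_flag)        # ['<!-- a -->']
print(greedy_dotall)  # ['<!-- a --> b <!-- c\nd -->']
print(lazy)           # ['<!-- a -->', '<!-- c\nd -->']
```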
Stripping tags with regex and assuming the result is plain text. Attribute values like onerror="alert(1)" and entity-encoded markup like &lt;script&gt; survive a tag-strip. Use a sanitiser library if security matters.
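A small sketch of the entity half of that problem: markup that arrives entity-encoded contains no literal tags, so a tag-stripping regex leaves it alone, and it turns back into markup if anything downstream decodes it. The function `naive_strip_tags` and the payload are hypothetical.

```python
import re
from html import unescape


def naive_strip_tags(text: str) -> str:
    """Remove anything that looks like a tag (hypothetical, deliberately naive)."""
    return re.sub(r"</?[a-zA-Z][^>]*>", "", text)


payload = "<b>hi</b> &lt;script&gt;alert(1)&lt;/script&gt;"

stripped = naive_strip_tags(payload)
print(stripped)            # hi &lt;script&gt;alert(1)&lt;/script&gt;
print(unescape(stripped))  # hi <script>alert(1)</script>
```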
Test cases
| Input | Simple tag pattern | Opening with attrs | Closing tag |
|---|---|---|---|
| <div> | Match | Match | No match |
| <p class="card"> | No match | Match | No match |
| <h1> | Match | Match | No match |
| </div> | No match | No match | Match |
| <br/> | No match | No match | No match |
| <input type="text" required> | No match | Match | No match |
| <!doctype html> | No match | No match | No match |
| <123> | No match | No match | No match |
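The table can be turned into a small self-checking script. This is a sketch that assumes whole-string matching (`fullmatch`), which is how the Match / No match columns read; the names `EXPECTED` and `check` are invented for this example.

```python
import re

SIMPLE = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*>")
OPENING = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)
CLOSING = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")

# (simple, opening-with-attrs, closing) for each row of the table above.
EXPECTED = {
    "<div>": (True, True, False),
    '<p class="card">': (False, True, False),
    "<h1>": (True, True, False),
    "</div>": (False, False, True),
    "<br/>": (False, False, False),
    '<input type="text" required>': (False, True, False),
    "<!doctype html>": (False, False, False),
    "<123>": (False, False, False),
}


def check(tag: str) -> tuple[bool, bool, bool]:
    """Whole-string match result for each of the three patterns."""
    return tuple(p.fullmatch(tag) is not None for p in (SIMPLE, OPENING, CLOSING))


for tag, expected in EXPECTED.items():
    assert check(tag) == expected, tag
print("all rows agree with the table")
```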
FAQ
Can I parse HTML with regex?
For trusted, well-formed HTML in narrow contexts (extracting headings, scanning your own log output, stripping a known set of tags), yes. For arbitrary or untrusted HTML, no. Use a real parser.
HTML allows nested same-name elements, attribute quoting variants, character references, CDATA, and comments that regex cannot handle cleanly. A parser handles all of those correctly.
What regex matches any HTML tag?
<[^>]+> matches a <, one or more non-> characters, and a closing >. It is a quick-and-dirty match for "any tag" but accepts garbage like < > and <"<> because it never checks for an element name.
The cleaner form: <\/?[a-zA-Z][a-zA-Z0-9]*[^>]*>. Matches opening or closing tags, requires a real element name, and allows attributes.
How do I extract headings with regex?
<h2([^>]*)>(.*?)</h2> with non-greedy matching catches single-line h2 tags. For multi-line, add the s flag so . matches newlines.
For all heading levels with backref-enforced matching: <(h[1-6])([^>]*)>(.*?)</\1>. The \1 ensures the closing tag matches the opening level.
Why does my pattern match too much?
Because .* is greedy by default. It matches as much as possible, so <div>a</div><div>b</div> against <div>.*</div> captures the entire string.
Use .*? (non-greedy) to stop at the first closing tag. Or use a character class that explicitly excludes <: <div>([^<]*)</div>.
Is regex safe for untrusted HTML?
No. Untrusted HTML can contain crafted inputs that exploit naive regex matchers. For example, embedded scripts that survive a regex strip but are caught by a parser, or attribute injections that break out of the matched region.
For untrusted HTML, always use a parser (DOMParser in browsers, BeautifulSoup in Python, DOMDocument in PHP) and a sanitiser library (DOMPurify, bleach, HTMLPurifier).
How do I extract an attribute value?
The pattern name\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)) captures the value into one of three groups depending on quoting. Replace name with the specific attribute you want (e.g., href, class).
Run it inside an opening-tag match first, then iterate attributes. Or use a parser, which gives you a clean attribute map without the regex gymnastics.
Why does my backreference pattern fail in Go?
Because Go uses RE2, which does not support backreferences. Patterns like <(h[1-6])>(.*?)</\1> do not compile: the regexp package rejects \1 as an invalid escape sequence. RE2 also does not support lookahead or lookbehind.
For Go, either do two passes (find all opening tags, then for each, scan forward for the matching closing tag) or use golang.org/x/net/html which is the standard HTML parser.
See also
- How to Match a URL with Regex: the href value once you've extracted it from a tag
- How to Match a Hex Color Code with Regex: the colour values inside inline style attributes
- Regex Anchors: why anchoring matters for whole-string tag validation
- Regex Word Boundaries: scanning attribute names out of mixed text
- Regex Lookaheads and Lookbehinds: matching a tag preceded or followed by specific context
- Regex Capturing Groups and Backreferences: the \1 trick that pairs opening and closing tags
- Regex Cheat Sheet: the wider syntax and engine compatibility reference
External reference: the HTML Living Standard parser section describes how real HTML parsers handle the edge cases regex cannot. The famous Stack Overflow answer "You can't parse HTML with regex" is the canonical statement of the problem.





