The shortest regex that matches a simple HTML opening tag is <[a-zA-Z][a-zA-Z0-9]*>. It accepts <div>, <p>, and <h1>, and rejects <123> (numeric first character) and <!doctype html> (the character after < is !, not a letter). Tags with attributes, closing tags, self-closing tags, and attribute capture all need longer patterns. Below I walk through every variant with runnable code in JavaScript, Python, and PHP, give engine notes per language, list the bugs I've shipped, and say honestly when this is the wrong approach (which is most of the time).
HTML is famously not a regular language. A parser handles nested structure, attribute quoting, void elements, character entities, comments, and CDATA correctly; regex handles none of those. The patterns below are for the narrow cases where a parser is overkill (log scanning, quick search-replace in a single file, stripping known tags from trusted text) and never for parsing untrusted HTML.
Quick reference
Simple tag (no attributes):
<[a-zA-Z][a-zA-Z0-9]*>
Opening tag with attributes:
<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>
Closing tag:
</[a-zA-Z][a-zA-Z0-9]*\s*>
Self-closing (XHTML/JSX):
<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>
Simple tag (no attributes)
<[a-zA-Z][a-zA-Z0-9]*>
Matches: <div>, <p>, <h1>, <MyComponent>. Rejects: <123>, </div>, <br/>, <!doctype html>.
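The first character has to be a letter; subsequent characters can be letters or digits. This covers every built-in HTML element name. Custom element names such as <my-widget> contain a hyphen, so they need [a-zA-Z0-9-]* for the trailing part.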
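To make the accept/reject list concrete, here is a minimal Python sketch that runs the simple-tag pattern against the examples above. The names SIMPLE_TAG and is_simple_tag are mine, chosen for illustration, and fullmatch is used on the assumption that each input is a single candidate tag rather than a whole document.

```python
import re

# Pattern copied from this section; the constant name is an illustrative choice.
SIMPLE_TAG = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*>")

def is_simple_tag(text: str) -> bool:
    """Return True only if the whole string is one attribute-free opening tag."""
    return SIMPLE_TAG.fullmatch(text) is not None

samples = ["<div>", "<p>", "<h1>", "<MyComponent>",
           "<123>", "</div>", "<br/>", "<!doctype html>"]
for sample in samples:
    print(f"{sample:18} {is_simple_tag(sample)}")
```

The first four samples print True and the last four print False, which mirrors the Matches/Rejects list.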
Opening tag with attributes
<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>
The part of the pattern after the element name handles attributes: zero or more repetitions of (\s+name(=value)?). The attribute value is one of:
- "double-quoted"
- 'single-quoted'
- unquoted (no whitespace, no >)
This matches <a href="https://example.com">, <input type="text" required>, and <div class='card primary' data-id="42">.
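A short Python sketch of the attribute-aware pattern in use, scanning a string for opening tags. OPENING_TAG and find_opening_tags are illustrative names, not part of any library.

```python
import re

# Pattern copied from this section. The triple-quoted raw string lets the
# pattern contain both quote characters without escaping.
OPENING_TAG = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)

def find_opening_tags(text: str) -> list[str]:
    """Return every opening tag (with its attributes) found in text."""
    return [m.group(0) for m in OPENING_TAG.finditer(text)]

sample = '<a href="https://example.com">link</a> <input type="text" required>'
print(find_opening_tags(sample))
```

One limitation worth knowing: the attribute-name class [a-zA-Z-]+ has no digits, so a tag like <p data-id2="x"> is not matched at all. If your input has such names, widening the class to [a-zA-Z0-9-]+ is one possible adjustment.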
Closing tag
</[a-zA-Z][a-zA-Z0-9]*\s*>
Closing tags have a leading </ and cannot have attributes. The optional whitespace before > allows </div > (rare but legal).
Self-closing tag
<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>
XHTML uses the /> ending for void elements, and JSX uses it for any element without children. The [^>]* between the name and the /> absorbs any attributes.
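The three tag shapes covered so far can be told apart by trying the patterns in a fixed order. The sketch below is one way to do that in Python; classify_tag and the constant names are hypothetical, and the order (closing, then self-closing, then opening) is my choice.

```python
import re

# Patterns copied from the sections above.
OPENING = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)
CLOSING = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")
SELF_CLOSING = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*[^>]*/>")

def classify_tag(text: str):
    """Label a single tag string, or return None if no pattern fits."""
    if CLOSING.fullmatch(text):
        return "closing"
    if SELF_CLOSING.fullmatch(text):
        return "self-closing"
    if OPENING.fullmatch(text):
        return "opening"
    return None

for tag in ["<div>", "</div >", "<br/>", '<img src="a.png" />', "<!-- note -->"]:
    print(f"{tag:22} {classify_tag(tag)}")
```

Because the self-closing pattern relies on [^>]*, a quoted > inside an attribute defeats it: classify_tag('<a title="1 > 2" />') returns None.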
Specific tag (e.g., only h1-h6)
<h[1-6]([^>]*)>(.*?)</h[1-6]>
This matches an <h1> through <h6> opening tag (with optional attributes), captures the inner text, and the matching closing tag.
Important caveat: the non-greedy .*? handles single-line headings only. For headings that span multiple lines, enable dotall mode so . matches newlines too: the s modifier in PHP, re.DOTALL in Python, or the s flag in JavaScript (ES2018+).
Also: this pattern does not enforce that the opening and closing levels match. <h1>...</h3> will match because both sides use h[1-6]. To enforce matching, use a backreference: <(h[1-6])([^>]*)>(.*?)</\1>.
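Here is a small Python sketch of the backreference form, with dotall and case-insensitive matching switched on. The function name and the sample document are made up for illustration.

```python
import re

# Backreference version from this section: \1 must repeat whatever (h[1-6]) captured.
HEADING = re.compile(r"<(h[1-6])([^>]*)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

def extract_headings(text: str) -> list[tuple[str, str]]:
    """Return (level, inner_text) pairs for headings whose tags pair up."""
    return [(m.group(1).lower(), m.group(3).strip()) for m in HEADING.finditer(text)]

doc = "<h1>Title</h1>\n<h2 class='sub'>Multi\nline</h2>\n<h1>Broken</h3>"
print(extract_headings(doc))
```

The mismatched <h1>Broken</h3> is skipped because no </h1> follows it, and the two-line <h2> is captured only because re.DOTALL is set.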
Extracting attribute values
href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))
This isolates the href attribute and captures its value into one of three groups depending on quoting style. Run it inside an opening-tag match to extract specifically the attribute you want.
In code, you usually run this in a loop over attribute names rather than building one mega-pattern.
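A sketch of that loop in Python, building one small pattern per attribute name from the href pattern above. The helper names attribute_pattern and get_attributes are hypothetical.

```python
import re

def attribute_pattern(name: str) -> re.Pattern:
    """Build the three-way quoted/unquoted value pattern for one attribute name."""
    return re.compile(
        rf"""{re.escape(name)}\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
        re.IGNORECASE,
    )

def get_attributes(opening_tag: str, names: list[str]) -> dict[str, str]:
    """Look up each requested attribute in an already-matched opening tag."""
    found = {}
    for name in names:
        m = attribute_pattern(name).search(opening_tag)
        if m:
            # Exactly one of the three groups takes part in a match;
            # the other two are None in Python.
            found[name] = next(g for g in m.groups() if g is not None)
    return found

tag = "<a href=\"https://example.com\" title='Home page' target=_blank>"
print(get_attributes(tag, ["href", "title", "target", "rel"]))
```

Note that the pattern has no boundary before the name, so asking for id against <div data-id="42"> finds the data-id value. If that matters for your input, a lookbehind such as (?<![\w-]) in front of the name is one option.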
When to use a parser instead
The cases where regex on HTML breaks down:
- Nested same-element tags. <div><div>...</div></div> requires balanced-bracket matching, which regular languages cannot do.
- CDATA, comments, processing instructions. <![CDATA[<div>not a tag</div>]]> contains characters that look like tags but are not.
- HTML entities and character references. &lt;div&gt; is the text <div>, not a tag.
- Attribute values containing < or >. Rare but legal: <a title="1 < 2">.
- Unquoted attribute values containing escaped quotes. Rare but legal.
- Untrusted input. Anyone parsing user-submitted HTML with regex is one cleverly-crafted input away from a security bug.
For any of those cases, use a parser:
- JavaScript: DOMParser in browsers, cheerio or parse5 in Node.
- Python: BeautifulSoup (bs4) or lxml.
- PHP: DOMDocument or html5-php.
- Go: golang.org/x/net/html.
The parser handles every edge case correctly. Its extra runtime cost is rarely large enough to justify the fragility of the regex version.
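To see the nested-tag failure in practice, the sketch below runs a non-greedy regex and a parser over the same nested <div>. I use html.parser from the Python standard library here only to keep the example dependency-free; it stands in for BeautifulSoup or lxml, and the DivDepth class is a made-up illustration.

```python
import re
from html.parser import HTMLParser

html_doc = "<div>outer <div>inner</div> tail</div>"

# The non-greedy regex stops at the first </div>, cutting the outer element short.
regex_result = re.search(r"<div>(.*?)</div>", html_doc).group(1)

class DivDepth(HTMLParser):
    """Tracks <div> nesting and records the deepest level reached."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

parser = DivDepth()
parser.feed(html_doc)
parser.close()

print(repr(regex_result))                 # the truncated capture
print(parser.max_depth, parser.depth)     # nesting seen by the parser
```

The regex capture is "outer <div>inner", which is not the content of either element, while the parser sees two levels of nesting and ends balanced at depth zero.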
Examples in JavaScript, Python, and PHP
JavaScript, extracting all <h2> headings from a document:
const html = document.documentElement.outerHTML;
const headings = [];
const pattern = /<h2([^>]*)>(.*?)<\/h2>/gis;
let m;
while ((m = pattern.exec(html)) !== null) {
  headings.push({ attrs: m[1].trim(), text: m[2] });
}
The same task using DOMParser (the right way):
const parser = new DOMParser();
const doc = parser.parseFromString(html, "text/html");
const headings = [...doc.querySelectorAll("h2")].map(h => ({
  attrs: [...h.attributes].map(a => `${a.name}="${a.value}"`).join(" "),
  text: h.textContent,
}));
Python, stripping all <script> and <style> blocks from text:
import re
def strip_dangerous(html: str) -> str:
    # Regex version (acceptable for trusted input, not security-critical):
    return re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.IGNORECASE | re.DOTALL)
The parser version (use this for untrusted input):
from bs4 import BeautifulSoup
def strip_dangerous(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return str(soup)
PHP, extracting all link href values from a string:
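function extractHrefs(string $html): array {
    preg_match_all('/<a\s[^>]*href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>/i', $html, $matches);
    $hrefs = [];
    for ($i = 0; $i < count($matches[0]); $i++) {
        // Unmatched groups come back as '', so take the first non-empty one.
        // Comparing against '' (instead of using ?:) keeps a value like "0".
        $hrefs[] = $matches[1][$i] !== '' ? $matches[1][$i]
            : ($matches[2][$i] !== '' ? $matches[2][$i] : $matches[3][$i]);
    }
    return $hrefs;
}
The parser version: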
function extractHrefs(string $html): array {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $hrefs = [];
    foreach ($dom->getElementsByTagName("a") as $a) {
        $hrefs[] = $a->getAttribute("href");
    }
    return $hrefs;
}
Engine compatibility
The patterns use backreferences (\1) for matched-tag pairs, the s (dotall) flag for multi-line content, and non-capturing groups. Most engines support all three; Go RE2 and Rust's regex crate do not support backreferences, which is the main cross-engine limitation.
| Engine | Tag patterns | Backreferences (\1) | Dotall flag | Parser equivalent |
|---|---|---|---|---|
| JavaScript | Works | Works | s flag (ES2018+) | DOMParser in browsers, cheerio / parse5 in Node |
| Python (re) | Works | Works | re.DOTALL | BeautifulSoup, lxml |
| PHP (PCRE) | Works | Works | s modifier | DOMDocument, html5-php |
| Java | Works | Works | Pattern.DOTALL | Jsoup |
| .NET | Works | Works | RegexOptions.Singleline | HtmlAgilityPack |
| Go (RE2) | Works (no backref) | Not supported | (?s) flag | golang.org/x/net/html |
| Rust (regex crate) | Works (no backref) | Not supported | (?s) flag | scraper crate, html5ever |
| Ruby | Works | Works | m flag (in Ruby m means dotall) | Nokogiri |
| POSIX ERE (grep -E) | Works (basic patterns only) | BRE only (\1 in basic mode) | None | None |
For Go and Rust, the matched-pair patterns (<(h[1-6])>(.*?)</\1>) don't work directly. Either iterate with two passes (collect all opening tags, then for each, find the matching closing tag in a second scan) or use the language's HTML parser.
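The two-pass idea can be sketched as follows. The example is written in Python for consistency with the rest of this article, but it deliberately avoids \1, so the same structure should carry over to engines without backreferences. The function name is hypothetical.

```python
import re

# Pass one: find heading opening tags. No backreference is involved.
OPEN_HEADING = re.compile(r"<(h[1-6])[^>]*>", re.IGNORECASE)

def headings_without_backrefs(html_text: str) -> list[tuple[str, str]]:
    """Return (level, inner_text) pairs using two scans instead of a backreference."""
    results = []
    pos = 0
    while True:
        opening = OPEN_HEADING.search(html_text, pos)
        if opening is None:
            return results
        level = opening.group(1).lower()
        # Pass two: build a closing-tag pattern from the captured level
        # and search for it after the opening tag.
        closing = re.compile(rf"</{level}\s*>", re.IGNORECASE).search(
            html_text, opening.end()
        )
        if closing is None:
            pos = opening.end()  # unmatched opening tag, move on
            continue
        results.append((level, html_text[opening.end():closing.start()]))
        pos = closing.end()

doc = "<h1>Title</h1><h2 id='x'>Sub</h2><h1>Broken</h3>"
print(headings_without_backrefs(doc))
```

Like the backreference pattern, this pairs each opening tag with the first closing tag of the same level, so it shares the same blind spot for nested or malformed markup.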
Common mistakes
The bugs I've shipped or seen in code review.
Using .* (greedy) instead of .*? (non-greedy). Against <div>a</div><div>b</div>, the pattern <div>.*</div> captures the entire string because .* matches as much as possible. Use .*? or, better, a negated class like [^<]*.
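The difference is easy to reproduce. This Python sketch runs the three variants against the string from the paragraph above; the variable names are mine.

```python
import re

sample = "<div>a</div><div>b</div>"

greedy = re.findall(r"<div>.*</div>", sample)      # .* runs to the last </div>
lazy = re.findall(r"<div>.*?</div>", sample)       # .*? stops at the first </div>
negated = re.findall(r"<div>[^<]*</div>", sample)  # [^<]* cannot cross a tag at all

print(greedy)
print(lazy)
print(negated)
```

The greedy form returns one match spanning the whole string; the other two return the two separate elements.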
Forgetting the s/dotall flag for multi-line content. Without it, . does not match newlines, so a <div> with content on multiple lines fails to match. Add the flag.
Treating regex-extracted HTML as safe to render. A pattern that extracts <a> tags from arbitrary input can include attacker-injected content. Always sanitise via a parser or library (DOMPurify, bleach, HTMLPurifier) before rendering.
Assuming closing tags match opening tags by position. <h1>...</h3> is invalid HTML but matches a pattern like <h[1-6]>(.*?)</h[1-6]>. Use a backreference (<(h[1-6])>...</\1>) to enforce matching.
Trying to match HTML comments with <!--.*-->. Without s flag this fails on multi-line comments; with it, a string like <!-- a --> b <!-- c --> is captured as one giant comment because .* is greedy. Use <!--[\s\S]*?--> for the cleanest cross-engine form.
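A quick Python check of the three comment patterns on a string with one single-line and one multi-line comment. The sample text is invented for the demonstration.

```python
import re

text = "<!-- a --> b <!-- c\nd -->"

# Greedy, no dotall: the multi-line comment is missed entirely.
greedy_no_flag = re.findall(r"<!--.*-->", text)
# Greedy with dotall: both comments and the text between them merge into one match.
greedy_dotall = re.findall(r"<!--.*-->", text, flags=re.DOTALL)
# Lazy [\s\S]*? needs no flag and stops at the first -->.
lazy_any_char = re.findall(r"<!--[\s\S]*?-->", text)

print(greedy_no_flag)
print(greedy_dotall)
print(lazy_any_char)
```

Only the last form returns both comments as separate matches.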
Stripping tags with regex and assuming the result is plain text. Attribute values like onerror="alert(1)" and character entities like &lt;script&gt; survive a tag-strip. Use a sanitiser library if security matters.
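The entity case is the easier one to demonstrate. In the sketch below, strip_tags is a deliberately naive stripper of my own (the <[^>]+> pattern is an assumption, not a pattern from this article), and html.unescape stands in for any later step that decodes entities.

```python
import html
import re

def strip_tags(text: str) -> str:
    """Naive tag stripper: removes anything that looks like <...>."""
    return re.sub(r"<[^>]+>", "", text)

sample = "<p>Hello &lt;script&gt;alert(1)&lt;/script&gt;</p>"

stripped = strip_tags(sample)
decoded = html.unescape(stripped)  # what a later decoding step would produce

print(stripped)
print(decoded)
```

The stripped text contains no literal tags, yet one round of entity decoding brings the <script> element back.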
Test cases
| Input | Simple tag pattern | Opening with attrs | Closing tag |
|---|---|---|---|
| <div> | Match | Match | No match |
| <p class="card"> | No match | Match | No match |
| <h1> | Match | Match | No match |
| </div> | No match | No match | Match |
| <br/> | No match | No match | No match |
| <input type="text" required> | No match | Match | No match |
| <!doctype html> | No match | No match | No match |
| <123> | No match | No match | No match |
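The table can be turned into a small self-check. This Python sketch encodes each row as booleans and compares them with what the three patterns actually do under fullmatch; the names PATTERNS, ROWS, and mismatches are illustrative.

```python
import re

# The three patterns named in the table's column headers.
PATTERNS = {
    "simple": re.compile(r"<[a-zA-Z][a-zA-Z0-9]*>"),
    "attrs": re.compile(
        r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
    ),
    "closing": re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>"),
}

# (input, simple, attrs, closing) rows copied from the table.
ROWS = [
    ("<div>", True, True, False),
    ('<p class="card">', False, True, False),
    ("<h1>", True, True, False),
    ("</div>", False, False, True),
    ("<br/>", False, False, False),
    ('<input type="text" required>', False, True, False),
    ("<!doctype html>", False, False, False),
    ("<123>", False, False, False),
]

def mismatches() -> list:
    """Return the rows where a pattern disagrees with the table."""
    bad = []
    for text, *expected in ROWS:
        actual = [p.fullmatch(text) is not None for p in PATTERNS.values()]
        if actual != expected:
            bad.append((text, actual))
    return bad

print(mismatches() or "all rows agree with the table")
```

Adding your own inputs to ROWS is a cheap way to find out where these patterns stop being good enough before they reach production.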
See also
- How to Match a URL with Regex: the href value once you've extracted it from a tag
- How to Match a Hex Color Code with Regex: the colour values inside inline style attributes
- Regex Anchors: why anchoring matters for whole-string tag validation
- Regex Word Boundaries: scanning attribute names out of mixed text
- Regex Lookaheads and Lookbehinds: matching a tag preceded or followed by specific context
- Regex Capturing Groups and Backreferences: the \1 trick that pairs opening and closing tags
- Regex Cheat Sheet: the wider syntax and engine compatibility reference
External reference: the HTML Living Standard parser section describes how real HTML parsers handle the edge cases regex cannot. The famous Stack Overflow answer "You can't parse HTML with regex" is the canonical statement of the problem.





