Match HTML Tags with Regex: Pragmatic Patterns (2026)

The shortest regex that matches a simple HTML opening tag: <[a-zA-Z][a-zA-Z0-9]*>. It accepts <div>, <p>, <h1>, and rejects <123> (numeric first character) and <!doctype html> (a declaration, which starts with ! rather than a letter). For tags with attributes, closing tags, self-closing tags, or attribute capture, the pattern gets longer. Below I walk through every variant with runnable code in JavaScript, Python, and PHP, engine notes per language, and the bugs I've shipped, and I'll be honest about when this is the wrong approach (which is most of the time).

HTML is famously not a regular language. A parser handles nested structure, attribute quoting, void elements, character entities, comments, and CDATA correctly; regex handles none of those. The patterns below are for the narrow cases where a parser is overkill (log scanning, quick search-replace in a single file, stripping known tags from trusted text) and never for parsing untrusted HTML.

Quick reference

Simple tag (no attributes):

code

<[a-zA-Z][a-zA-Z0-9]*>

Opening tag with attributes:

code

<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>

Closing tag:

code

</[a-zA-Z][a-zA-Z0-9]*\s*>

Self-closing (XHTML/JSX):

code

<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>

Simple tag (no attributes)

code

<[a-zA-Z][a-zA-Z0-9]*>

Matches: <div>, <p>, <h1>, <MyComponent>. Rejects: <123>, </div>, <br/>, <!doctype html>.

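The first character has to be a letter; subsequent characters can be alphanumeric. This covers the built-in HTML element names. Custom elements such as <my-widget>, whose names must contain a hyphen, need - added to the character class.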
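A minimal Python sketch of the accept/reject behaviour described above. The helper name is_simple_tag is hypothetical, and fullmatch is used on the assumption that each input string holds exactly one candidate tag.

```python
import re

# Simple opening tag: one letter, then letters or digits, no attributes.
SIMPLE_TAG = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*>")

def is_simple_tag(text: str) -> bool:
    """Return True when the whole string is one attribute-free opening tag."""
    return SIMPLE_TAG.fullmatch(text) is not None

samples = ["<div>", "<MyComponent>", "<123>", "</div>", "<br/>", "<my-widget>"]
results = {s: is_simple_tag(s) for s in samples}

for sample, ok in results.items():
    print(f"{sample:15} {'match' if ok else 'no match'}")
```

The hyphenated <my-widget> is rejected here because the character class contains no hyphen.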

Opening tag with attributes

code

<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>

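The pattern after the element name handles attributes: zero or more of (\s+name(=value)?). Attribute names are limited to letters and hyphens ([a-zA-Z-]+), so names such as data-v2 or xlink:href need a wider class. The attribute value is one of: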

"double-quoted"
'single-quoted'
unquoted (no whitespace, no >)

This matches <a href="https://example.com">, <input type="text" required>, and <div class='card primary' data-id="42">.
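A short Python sketch that runs the attribute pattern against the three examples above. It also shows one consequence of the repeated group that is easy to miss: after a match, the capture groups hold only the last attribute, so this pattern validates a tag rather than enumerating its attributes. The variable names are illustrative.

```python
import re

# Opening tag with attributes, copied from the pattern above.
OPEN_TAG = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)

examples = [
    '<a href="https://example.com">',
    '<input type="text" required>',
    """<div class='card primary' data-id="42">""",
]
all_match = all(OPEN_TAG.fullmatch(tag) for tag in examples)

# A repeated capture group keeps only its final repetition.
m = OPEN_TAG.fullmatch(examples[2])
last_attr = m.group(1)   # the last attribute, with its leading whitespace
last_value = m.group(3)  # the last value, quotes included

print(all_match, repr(last_attr), repr(last_value))
```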

Closing tag

code

</[a-zA-Z][a-zA-Z0-9]*\s*>

Closing tags have a leading </ and cannot have attributes. The optional whitespace before > allows </div > (rare but legal).

Self-closing tag

code

<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>

XHTML and JSX close childless elements with />. The [^>]* between the name and the /> loosely accepts any attributes without validating them.
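To see how the closing and self-closing patterns fit together, here is a small Python sketch that classifies a single tag string. The classify function and the loose OPENING fallback are assumptions added for illustration. Order matters: the self-closing check must run before the loose opening check, because the latter also accepts <br/>.

```python
import re

CLOSING = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")
SELF_CLOSING = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>")
# Loose fallback for opening tags (illustrative, attributes not validated).
OPENING = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*[^>]*>")

def classify(tag: str) -> str:
    """Label one tag string as closing, self-closing, opening, or other."""
    if CLOSING.fullmatch(tag):
        return "closing"
    if SELF_CLOSING.fullmatch(tag):
        return "self-closing"
    if OPENING.fullmatch(tag):
        return "opening"
    return "other"

for tag in ["<div>", "</div >", "<br/>", '<img src="a.png" />', "<!-- note -->"]:
    print(f"{tag:22} {classify(tag)}")
```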

Specific tag (e.g., only h1-h6)

code

<h[1-6]([^>]*)>(.*?)</h[1-6]>

This matches an <h1> through <h6> opening tag (with optional attributes), captures the inner text, and then matches a closing heading tag.

Important caveat: the non-greedy .*? stops at the first closing tag, but without a dotall flag . does not match newlines, so this only handles single-line headings. For headings that span multiple lines, enable dotall: the s flag in JavaScript and PHP, re.DOTALL in Python.

Also: this pattern does not enforce that the opening and closing levels match. <h1>...</h3> will match because both sides use h[1-6]. To enforce matching, use a backreference: <(h[1-6])([^>]*)>(.*?)</\1>.
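A Python sketch of the backreference form with dotall enabled, assuming trusted input with no nested headings. The function name extract_headings and the sample document are made up for the example.

```python
import re

# \1 forces the closing level to equal the opening level.
HEADING = re.compile(r"<(h[1-6])([^>]*)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

def extract_headings(html: str) -> list[tuple[str, str]]:
    """Return (level, inner text) pairs for headings whose tags pair up."""
    return [(m.group(1).lower(), m.group(3).strip()) for m in HEADING.finditer(html)]

doc = '<h1 class="title">Intro</h1>\n<h2>Multi\nline</h2>\n<h1>Broken</h3>'
headings = extract_headings(doc)
print(headings)
```

The mismatched <h1>Broken</h3> is skipped because no </h1> follows it, and the two-line <h2> is found only because of re.DOTALL.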

Extracting attribute values

code

href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))

This isolates the href attribute and captures its value into one of three groups depending on quoting style. Run it inside an opening-tag match to extract specifically the attribute you want.

In code, you usually run this in a loop over attribute names rather than building one mega-pattern.
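The loop over attribute names might look like the following Python sketch. The helpers attr_pattern and get_attrs are hypothetical, and the lookbehind (?<![\w-]) is an addition of this sketch, not part of the pattern above: it keeps href from matching inside data-href.

```python
import re

def attr_pattern(name: str) -> re.Pattern:
    """Build the three-way quoted/unquoted value pattern for one attribute."""
    return re.compile(
        r"(?<![\w-])" + re.escape(name)
        + r"""\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
        re.IGNORECASE,
    )

def get_attrs(tag: str, names: list[str]) -> dict[str, str]:
    """Look up each requested attribute inside one opening tag."""
    found = {}
    for name in names:
        m = attr_pattern(name).search(tag)
        if m:
            # Exactly one of the three groups took part in the match.
            found[name] = next(g for g in m.groups() if g is not None)
    return found

tag = '''<a href="https://example.com" class='btn primary' target=_blank data-href="x">'''
attrs = get_attrs(tag, ["href", "class", "target", "rel"])
print(attrs)
```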

When to use a parser instead

The cases where regex on HTML breaks down:

Nested same-element tags. <div><div>...</div></div> requires balanced-bracket matching, which regular languages cannot do.
CDATA, comments, processing instructions. <![CDATA[<div>not a tag</div>]]> contains characters that look like tags but are not.
HTML entities and character references. &lt;div&gt; is the text <div>, not a tag.
Attribute values containing < or >. Rare but legal: <a title="1 < 2">.
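Quotes inside attribute values. HTML has no backslash escapes: a value carries a quote either by using the other quote style (title='say "hi"') or as a character reference (&quot;). Rare but legal.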
Untrusted input. Anyone parsing user-submitted HTML with regex is one cleverly-crafted input away from a security bug.

For any of those cases, use a parser:

JavaScript: DOMParser in browsers, cheerio or parse5 in Node.
Python: BeautifulSoup (bs4) or lxml.
PHP: DOMDocument or html5-php.
Go: golang.org/x/net/html.

The parser handles every edge case correctly. Its performance cost is rarely large enough to justify the fragility of a regex.
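Two of the failure modes listed above (comments and > inside an attribute value) can be reproduced with Python's standard-library html.parser, used here only to keep the sketch dependency-free. The TagCollector class and the sample string are invented for the illustration.

```python
import re
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the name of every real start tag the parser sees."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

html_doc = '<!-- <div>commented out</div> --><p title="1 > 0">text</p>'

# Regex view: the commented-out <div> is counted as a tag.
regex_tags = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>", html_doc)

# Parser view: the comment is skipped and only <p> is reported.
collector = TagCollector()
collector.feed(html_doc)
collector.close()

print(regex_tags, collector.tags)
```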

Examples in JavaScript, Python, and PHP

JavaScript, extracting all <h2> headings from a document:

javascript

const html = document.documentElement.outerHTML;
const headings = [];
const pattern = /<h2([^>]*)>(.*?)<\/h2>/gis;
let m;
while ((m = pattern.exec(html)) !== null) {
  headings.push({ attrs: m[1].trim(), text: m[2] });
}

The same task using DOMParser (the right way):

javascript

const parser = new DOMParser();
const doc = parser.parseFromString(html, "text/html");
const headings = [...doc.querySelectorAll("h2")].map(h => ({
  attrs: [...h.attributes].map(a => `${a.name}="${a.value}"`).join(" "),
  text: h.textContent,
}));

Python, stripping all <script> and <style> blocks from text:

python

import re

def strip_dangerous(html: str) -> str:
    # Regex version (acceptable for trusted input, not security-critical):
    return re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.IGNORECASE | re.DOTALL)

The parser version (use this for untrusted input):

python

from bs4 import BeautifulSoup

def strip_dangerous(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return str(soup)

PHP, extracting all link href values from a string:

php

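function extractHrefs(string $html): array {
    // Groups 1-3 hold the double-quoted, single-quoted, and unquoted value.
    preg_match_all('/<a\s[^>]*href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>/i', $html, $matches);
    $hrefs = [];
    for ($i = 0; $i < count($matches[0]); $i++) {
        // Only one group is non-empty per match, so concatenation selects it.
        // (Chained ?: would drop a value of "0", which PHP treats as falsy.)
        $hrefs[] = $matches[1][$i] . $matches[2][$i] . $matches[3][$i];
    }
    return $hrefs;
}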

The parser version:

php

function extractHrefs(string $html): array {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $hrefs = [];
    foreach ($dom->getElementsByTagName("a") as $a) {
        $hrefs[] = $a->getAttribute("href");
    }
    return $hrefs;
}

Engine compatibility

The patterns use backreferences (\1) for matched-tag pairs, the s (dotall) flag for multi-line content, and non-capturing groups. Most engines support all three; Go RE2 and Rust's regex crate do not support backreferences, which is the main cross-engine limitation.

Engine	Tag patterns	Backreferences (`\1`)	Dotall flag	Parser equivalent
JavaScript	Works	Works	`s` flag (ES2018+)	`DOMParser` in browsers, `cheerio` / `parse5` in Node
Python (`re`)	Works	Works	`re.DOTALL`	`BeautifulSoup`, `lxml`
PHP (PCRE)	Works	Works	`s` modifier	`DOMDocument`, `html5-php`
Java	Works	Works	`Pattern.DOTALL`	`Jsoup`
.NET	Works	Works	`RegexOptions.Singleline`	`HtmlAgilityPack`
Go (RE2)	Works (no backref)	Not supported	`(?s)` flag	`golang.org/x/net/html`
Rust (`regex` crate)	Works (no backref)	Not supported	`(?s)` flag	`scraper` crate, `html5ever`
Ruby	Works	Works	`m` flag (in Ruby `m` means dotall)	`Nokogiri`
POSIX ERE (`grep -E`)	Works (basic patterns only)	BRE only (`\1` in basic mode)	None	None

For Go and Rust, the matched-pair patterns (<(h[1-6])>(.*?)</\1>) don't work directly. Either iterate with two passes (collect all opening tags, then for each, find the matching closing tag in a second scan) or use the language's HTML parser.
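The two-step approach can be sketched without any backreference. The example is in Python to stay consistent with the other samples, but it deliberately uses only features RE2-style engines also have, so the same structure should carry over to Go or Rust. The function name headings_without_backref is hypothetical.

```python
import re

OPEN_HEADING = re.compile(r"<(h[1-6])[^>]*>", re.IGNORECASE)

def headings_without_backref(html: str) -> list[tuple[str, str]]:
    """Pair each opening heading with the next closing tag of the same level."""
    results = []
    pos = 0
    while True:
        opening = OPEN_HEADING.search(html, pos)
        if opening is None:
            break
        level = opening.group(1).lower()
        # Second step: build the closing pattern from the captured level.
        closing = re.compile(r"</" + level + r"\s*>", re.IGNORECASE)
        end = closing.search(html, opening.end())
        if end is None:
            pos = opening.end()  # unpaired opening tag, keep scanning
            continue
        results.append((level, html[opening.end():end.start()]))
        pos = end.end()
    return results

pairs = headings_without_backref('<h1>A</h1><h2 id="x">B</h2><h3>C</h4>')
print(pairs)
```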

Common mistakes

The bugs I've shipped or seen in code review.

Using .* (greedy) instead of .*? (non-greedy). Against <div>a</div><div>b</div>, the pattern <div>.*</div> captures the entire string because .* matches as much as possible. Use .*? or, better, a negated class like [^<]*.
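The difference is easy to check in Python with the same input string:

```python
import re

html = "<div>a</div><div>b</div>"

greedy = re.findall(r"<div>(.*)</div>", html)      # runs to the last </div>
lazy = re.findall(r"<div>(.*?)</div>", html)       # stops at the first </div>
negated = re.findall(r"<div>([^<]*)</div>", html)  # cannot cross any <

print(greedy)
print(lazy)
print(negated)
```

The negated class also removes the need for a dotall flag, since [^<] matches newlines.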

Forgetting the s/dotall flag for multi-line content. Without it, . does not match newlines, so a <div> with content on multiple lines fails to match. Add the flag.

Treating regex-extracted HTML as safe to render. A pattern that extracts <a> tags from arbitrary input can include attacker-injected content. Always sanitise via a parser or library (DOMPurify, bleach, HTMLPurifier) before rendering.

Assuming closing tags match opening tags by position. <h1>...</h3> is invalid HTML but matches a pattern like <h[1-6]>(.*?)</h[1-6]>. Use a backreference (<(h[1-6])>...</\1>) to enforce matching.

Trying to match HTML comments with <!--.*-->. Without the s flag this fails on multi-line comments; with it, a string like <!-- a --> b <!-- c --> is captured as one giant comment because .* is greedy. Use <!--[\s\S]*?--> for the cleanest cross-engine form.
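A Python comparison of three ways to match comments, assuming the greedy pattern in question is <!--.*--> and the fix is a lazy [\s\S]*? body, which needs no dotall flag:

```python
import re

text = "<!-- a -->\nb\n<!-- c\nd -->"

# Greedy, no dotall: the two-line comment is missed.
no_flag = re.findall(r"<!--.*-->", text)

# Greedy with dotall: everything from the first <!-- to the last -->.
greedy_dotall = re.findall(r"<!--.*-->", text, flags=re.DOTALL)

# Lazy [\s\S]*?: one match per comment, no flag required.
lazy = re.findall(r"<!--[\s\S]*?-->", text)

print(no_flag, greedy_dotall, lazy, sep="\n")
```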

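Stripping tags with regex and assuming the result is plain text. Attribute values like onerror="alert(1)" (in any tag the pattern fails to match) and character entities like &lt;script&gt; survive a tag-strip, and the entity turns back into a real <script> as soon as something decodes it. Use a sanitiser library if security matters.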
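The entity case can be reproduced in a few lines of Python. The input string is invented, and html.unescape stands in for whatever later step decodes character references:

```python
import html
import re

raw = "Hello <b>world</b> &lt;script&gt;alert(1)&lt;/script&gt;"

# The tag-strip removes <b>...</b> but leaves the encoded script untouched.
stripped = re.sub(r"<[^>]+>", "", raw)

# Any later entity decoding resurrects the markup.
decoded = html.unescape(stripped)

print(stripped)
print(decoded)
```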

Test cases

Input	Simple tag pattern	Opening with attrs	Closing tag
`<div>`	Match	Match	No match
`<p class="card">`	No match	Match	No match
`<h1>`	Match	Match	No match
`</div>`	No match	No match	Match
`<br/>`	No match	No match	No match
`<input type="text" required>`	No match	Match	No match
`<!doctype html>`	No match	No match	No match
`<123>`	No match	No match	No match
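The table can be checked mechanically. This Python sketch encodes each row as (input, simple, opening with attributes, closing) and compares it with what the three patterns report. The structure and names are illustrative.

```python
import re

SIMPLE = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*>")
OPEN_ATTRS = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)
CLOSING = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")

# (input, simple tag, opening with attrs, closing tag)
ROWS = [
    ("<div>", True, True, False),
    ('<p class="card">', False, True, False),
    ("<h1>", True, True, False),
    ("</div>", False, False, True),
    ("<br/>", False, False, False),
    ('<input type="text" required>', False, True, False),
    ("<!doctype html>", False, False, False),
    ("<123>", False, False, False),
]

def check(text: str) -> tuple[bool, bool, bool]:
    """Report whether each pattern matches the whole input."""
    return tuple(p.fullmatch(text) is not None for p in (SIMPLE, OPEN_ATTRS, CLOSING))

failures = [row[0] for row in ROWS if check(row[0]) != row[1:]]
print("all rows agree" if not failures else f"mismatch: {failures}")
```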

FAQ

Can I use regex to parse HTML?

For trusted, well-formed HTML in narrow contexts (extracting headings, scanning your own log output, stripping a known set of tags), yes. For arbitrary or untrusted HTML, no. Use a real parser.

HTML allows nested same-name elements, attribute quoting variants, character references, CDATA, and comments that regex cannot handle cleanly. A parser handles all of those correctly.

What is the simplest regex to match any HTML tag?

<[^>]+> matches a <, one or more non-> characters, and a closing >. It is a quick-and-dirty match for "any tag" but accepts garbage like <1> and <"<>.

The cleaner form: <\/?[a-zA-Z][a-zA-Z0-9]*[^>]*>. Matches opening or closing tags, requires a real element name, and allows attributes.

How do I match a specific tag like only <h2> with regex?

<h2([^>]*)>(.*?)</h2> with non-greedy matching catches single-line h2 tags. For multi-line, add the s flag so . matches newlines.

For all heading levels with backref-enforced matching: <(h[1-6])([^>]*)>(.*?)</\1>. The \1 ensures the closing tag matches the opening level.

Why does my regex match across multiple tags?

Because .* is greedy by default. It matches as much as possible, so <div>a</div><div>b</div> against <div>.*</div> captures the entire string.

Use .*? (non-greedy) to stop at the first closing tag. Or use a character class that explicitly excludes <: <div>([^<]*)</div>.

Is it safe to use regex on user-submitted HTML?

No. Untrusted HTML can contain crafted inputs that exploit naive regex matchers: for example, embedded scripts that survive a regex strip but would be caught by a parser, or attribute injections that break out of the matched region.

For untrusted HTML, always use a parser (DOMParser in browsers, BeautifulSoup in Python, DOMDocument in PHP) and a sanitiser library (DOMPurify, bleach, HTMLPurifier).

How do I extract attribute values with regex?

The pattern name\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)) captures the value into one of three groups depending on quoting. Replace name with the specific attribute you want (e.g., href, class).

Run it inside an opening-tag match first, then iterate attributes. Or use a parser, which gives you a clean attribute map without the regex gymnastics.

Why does Go's regexp fail with my HTML pattern?

Because Go uses RE2 syntax, which does not support backreferences. Patterns like <(h[1-6])>(.*?)</\1> fail at compile time: regexp.Compile returns an error (and regexp.MustCompile panics) because \1 is not a valid escape. RE2 also does not support lookahead or lookbehind.

For Go, either do two passes (find all opening tags, then for each, scan forward for the matching closing tag) or use golang.org/x/net/html, the Go project's HTML parser package.

Ishan Karunaratne