TechEarl

How to Match HTML Tags with Regex (And Why You Probably Shouldn't)

Match HTML tags with regex when you must, and the case for parsing instead. Patterns for opening, closing, self-closing, attribute-bearing tags, JavaScript / Python / PHP, engine notes, common mistakes.

By Ishan Karunaratne · 10 min read

The shortest regex that matches a simple HTML opening tag is <[a-zA-Z][a-zA-Z0-9]*>. It accepts <div>, <p>, and <h1>, and rejects <123> (numeric first character) and <!doctype html> (the ! after < is not a letter). For tags with attributes, closing tags, self-closing tags, or attribute capture, the pattern gets longer. Below I walk through every variant with runnable code in JavaScript, Python, and PHP, engine notes per language, and the bugs I've shipped. I'll also be honest about when this is the wrong approach (which is most of the time).

HTML is famously not a regular language. A parser handles nested structure, attribute quoting, void elements, character entities, comments, and CDATA correctly; regex handles none of those. The patterns below are for the narrow cases where a parser is overkill (log scanning, quick search-replace in a single file, stripping known tags from trusted text) and never for parsing untrusted HTML.

Quick reference

Simple tag (no attributes):

code
<[a-zA-Z][a-zA-Z0-9]*>

Opening tag with attributes:

code
<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>

Closing tag:

code
</[a-zA-Z][a-zA-Z0-9]*\s*>

Self-closing (XHTML/JSX):

code
<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>

Simple tag (no attributes)

code
<[a-zA-Z][a-zA-Z0-9]*>

Matches: <div>, <p>, <h1>, <MyComponent>. Rejects: <123>, </div>, <br/>, <!doctype html>.

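The first character has to be a letter; subsequent characters can be letters or digits. That covers every standard HTML element name. Custom elements such as <my-widget> contain a hyphen, so add - to the second character class if you need to match them.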
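A minimal Python sketch of the accept/reject behaviour described above. The helper name is_simple_tag is hypothetical, and re.fullmatch is used on the assumption that each sample should be exactly one tag:

```python
import re

# Pattern from this section: one letter, then any number of letters or digits.
SIMPLE_TAG = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*>")


def is_simple_tag(text: str) -> bool:
    """Return True when the whole string is a single attribute-free opening tag."""
    return SIMPLE_TAG.fullmatch(text) is not None


samples = ["<div>", "<p>", "<h1>", "<MyComponent>",
           "<123>", "</div>", "<br/>", "<!doctype html>"]
for sample in samples:
    print(f"{sample!r}: {is_simple_tag(sample)}")
```

The first four samples report True and the last four report False, matching the lists above.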

Opening tag with attributes

code
<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>

The pattern after the element name handles attributes: zero or more of (\s+name(=value)?). The attribute value is one of:

  • "double-quoted"
  • 'single-quoted'
  • unquoted (no whitespace, no >)

This matches <a href="https://example.com">, <input type="text" required>, and <div class='card primary' data-id="42">.
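As a rough illustration, the attribute-aware pattern can be run over a fragment to pull out every opening tag. The function name find_opening_tags and the sample fragment are made up for this sketch:

```python
import re

# The opening-tag pattern from this section, unchanged.
OPENING_TAG = re.compile(
    r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
)


def find_opening_tags(html: str) -> list[str]:
    """Return the full text of every opening tag found in html."""
    return [m.group(0) for m in OPENING_TAG.finditer(html)]


fragment = (
    '<a href="https://example.com">link</a> '
    '<input type="text" required> '
    "<div class='card primary' data-id=\"42\">"
)
for tag in find_opening_tags(fragment):
    print(tag)
```

The closing </a> is skipped because / is not a letter, so only the three opening tags are printed.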

Closing tag

code
</[a-zA-Z][a-zA-Z0-9]*\s*>

Closing tags have a leading </ and cannot have attributes. The optional whitespace before > allows </div > (rare but legal).

Self-closing tag

code
<[a-zA-Z][a-zA-Z0-9]*[^>]*\/>

XHTML and JSX use the /> ending for void elements (JSX allows it on any element without children). The [^>]* between the name and the /> lets attributes through, but it stops early if an attribute value contains a literal >.
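The closing-tag and self-closing patterns can be combined into a small classifier. This is only a sketch: the classify helper is hypothetical, and the \/ escape is dropped because Python patterns have no delimiter to escape.

```python
import re

CLOSING_TAG = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")
# Same as the self-closing pattern above, minus the backslash before "/".
SELF_CLOSING_TAG = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*[^>]*/>")


def classify(tag: str) -> str:
    """Label a single tag string as 'closing', 'self-closing', or 'other'."""
    if CLOSING_TAG.fullmatch(tag):
        return "closing"
    if SELF_CLOSING_TAG.fullmatch(tag):
        return "self-closing"
    return "other"


for tag in ["</div>", "</div >", "<br/>", '<img src="a.png" />', "<div>"]:
    print(f"{tag!r}: {classify(tag)}")
```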

Specific tag (e.g., only h1-h6)

code
<h[1-6]([^>]*)>(.*?)</h[1-6]>

This matches an <h1> through <h6> opening tag (with optional attributes), captures the inner text, and then matches a closing </h1> through </h6> tag.

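Important caveat: the non-greedy .*? stops at the first closing tag, but . does not match newlines by default, so the pattern only handles single-line headings. For headings that span multiple lines, enable dotall mode so . matches newlines too: the s modifier in PHP, re.DOTALL in Python, or the s flag in JavaScript.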

Also: this pattern does not enforce that the opening and closing levels match. <h1>...</h3> will match because both sides use h[1-6]. To enforce matching, use a backreference: <(h[1-6])([^>]*)>(.*?)</\1>.
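Here is a small Python sketch of the backreference variant with dotall enabled. The function name extract_headings and the sample markup are assumptions for illustration:

```python
import re

# Backreference form from this section; DOTALL lets "." cross line breaks.
HEADING = re.compile(r"<(h[1-6])([^>]*)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def extract_headings(html: str) -> list[tuple[str, str]]:
    """Return (level, inner text) pairs for headings whose levels match."""
    return [(m.group(1).lower(), m.group(3).strip()) for m in HEADING.finditer(html)]


sample = '<h1 class="t">Title</h1>\n<h2>Multi\nline</h2>\n<h1>Broken</h3>'
print(extract_headings(sample))
```

The mismatched <h1>Broken</h3> pair is skipped because \1 requires the closing tag to repeat the captured level.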

Extracting attribute values

code
href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))

This isolates the href attribute and captures its value into one of three groups depending on quoting style. Run it inside an opening-tag match to extract specifically the attribute you want.

In code, you usually run this in a loop over attribute names rather than building one mega-pattern.
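One way that loop might look in Python. The helper names are hypothetical, and the (?<![\w-]) lookbehind is an addition that is not part of the pattern above: it keeps a short name such as id from matching inside data-id.

```python
import re


def attribute_pattern(name: str) -> re.Pattern:
    """Build the value-capturing pattern from this section for one attribute name."""
    # The lookbehind is an assumption added for this sketch, not part of the article's pattern.
    return re.compile(
        rf"""(?<![\w-]){re.escape(name)}\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
        re.IGNORECASE,
    )


def extract_attributes(tag: str, names: list[str]) -> dict[str, str]:
    """Look up each requested attribute in a single opening tag."""
    found = {}
    for name in names:
        m = attribute_pattern(name).search(tag)
        if m:
            # Exactly one of the three groups took part in the match.
            found[name] = next(g for g in m.groups() if g is not None)
    return found


tag = "<a href=\"https://example.com\" title='Home page' target=_blank>"
print(extract_attributes(tag, ["href", "title", "target", "rel"]))
```

Checking each group against None rather than truthiness keeps an empty value such as href="" from being dropped.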

When to use a parser instead

The cases where regex on HTML breaks down:

  • Nested same-element tags. <div><div>...</div></div> requires balanced-bracket matching, which regular languages cannot do.
  • CDATA, comments, processing instructions. <![CDATA[<div>not a tag</div>]]> contains characters that look like tags but are not.
  • HTML entities and character references. &lt;div&gt; is the text <div>, not a tag.
  • Attribute values containing < or >. Rare but legal: <a title="1 < 2">.
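  • Quote characters inside attribute values. HTML has no backslash escaping: a quote inside a value is written as an entity (&quot;) or by switching the outer quote style, and browsers still recover from stray quotes in unquoted values even though the spec forbids them.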
  • Untrusted input. Anyone parsing user-submitted HTML with regex is one cleverly-crafted input away from a security bug.

For any of those cases, use a parser:

  • JavaScript: DOMParser in browsers, cheerio or parse5 in Node.
  • Python: BeautifulSoup (bs4) or lxml.
  • PHP: DOMDocument or html5-php.
  • Go: golang.org/x/net/html.

The parser handles these edge cases correctly. It costs more than a regex at runtime, but that cost is almost always worth paying for the correctness a regex gives up.
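To make the comment case concrete, the sketch below compares a tag-name regex with Python's standard-library html.parser on the same input. TagCollector, regex_tags, and parser_tags are hypothetical names, and html.parser is used here only because it needs no install. It is not one of the parsers listed above.

```python
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def regex_tags(html: str) -> list[str]:
    """Tag names found by a plain regex scan, comments included."""
    return re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>", html)


def parser_tags(html: str) -> list[str]:
    """Tag names reported by the parser, which skips comment content."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


sample = "<p>real</p><!-- <div>commented out</div> -->"
print(regex_tags(sample))
print(parser_tags(sample))
```

The regex reports the commented-out div as a tag; the parser reports only p.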

Examples in JavaScript, Python, and PHP

JavaScript, extracting all <h2> headings from a document:

javascript
const html = document.documentElement.outerHTML;
const headings = [];
const pattern = /<h2([^>]*)>(.*?)<\/h2>/gis;
let m;
while ((m = pattern.exec(html)) !== null) {
  headings.push({ attrs: m[1].trim(), text: m[2] });
}

The same task using DOMParser (the right way):

javascript
const parser = new DOMParser();
const doc = parser.parseFromString(html, "text/html");
const headings = [...doc.querySelectorAll("h2")].map(h => ({
  attrs: [...h.attributes].map(a => `${a.name}="${a.value}"`).join(" "),
  text: h.textContent,
}));

Python, stripping all <script> and <style> blocks from text:

python
import re

def strip_dangerous(html: str) -> str:
    # Regex version (acceptable for trusted input, not security-critical).
    # \b stops "<styles>" from matching; \s* allows "</script >".
    return re.sub(r"<(script|style)\b[^>]*>.*?</\1\s*>", "", html, flags=re.IGNORECASE | re.DOTALL)

The parser version (use this for untrusted input):

python
from bs4 import BeautifulSoup

def strip_dangerous(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return str(soup)

PHP, extracting all link href values from a string:

php
function extractHrefs(string $html): array {
    preg_match_all('/<a\s[^>]*href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>/i', $html, $matches);
    $hrefs = [];
    for ($i = 0; $i < count($matches[0]); $i++) {
        $hrefs[] = $matches[1][$i] ?: $matches[2][$i] ?: $matches[3][$i];
    }
    return $hrefs;
}

The parser version:

php
function extractHrefs(string $html): array {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $hrefs = [];
    foreach ($dom->getElementsByTagName("a") as $a) {
        $hrefs[] = $a->getAttribute("href");
    }
    return $hrefs;
}

Engine compatibility

The patterns use backreferences (\1) for matched-tag pairs, the s (dotall) flag for multi-line content, and non-capturing groups. Most engines support all three; Go RE2 and Rust's regex crate do not support backreferences, which is the main cross-engine limitation.

| Engine | Tag patterns | Backreferences (\1) | Dotall flag | Parser equivalent |
| --- | --- | --- | --- | --- |
| JavaScript | Works | Works | s flag (ES2018+) | DOMParser in browsers, cheerio / parse5 in Node |
| Python (re) | Works | Works | re.DOTALL | BeautifulSoup, lxml |
| PHP (PCRE) | Works | Works | s modifier | DOMDocument, html5-php |
| Java | Works | Works | Pattern.DOTALL | Jsoup |
| .NET | Works | Works | RegexOptions.Singleline | HtmlAgilityPack |
| Go (RE2) | Works (no backref) | Not supported | (?s) flag | golang.org/x/net/html |
| Rust (regex crate) | Works (no backref) | Not supported | (?s) flag | scraper crate, html5ever |
| Ruby | Works | Works | m flag (in Ruby m means dotall) | Nokogiri |
| POSIX ERE (grep -E) | Works (basic patterns only) | BRE only (\1 in basic mode) | None | None |

For Go and Rust, the matched-pair patterns (<(h[1-6])>(.*?)</\1>) don't work directly. Either iterate with two passes (collect all opening tags, then for each, find the matching closing tag in a second scan) or use the language's HTML parser.
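The two-pass idea can be sketched without any backreference. The example is in Python for consistency with the rest of the article, but it uses only features RE2-style engines also support. The function name is hypothetical, and unlike the backreference pattern it tolerates whitespace before the closing >.

```python
import re

# First scan: any heading opening tag, capturing its level. No backreference needed.
OPEN_HEADING = re.compile(r"<(h[1-6])[^>]*>", re.IGNORECASE)


def extract_headings_no_backref(html: str) -> list[tuple[str, str]]:
    """Pair each heading opening tag with the next closing tag of the same level."""
    results = []
    pos = 0
    while True:
        opening = OPEN_HEADING.search(html, pos)
        if opening is None:
            break
        level = opening.group(1).lower()
        # Second scan: look for this specific level's closing tag after the opening tag.
        closing = re.compile(rf"</{level}\s*>", re.IGNORECASE).search(html, opening.end())
        if closing is None:
            # No matching close; skip this opening tag and keep scanning.
            pos = opening.end()
            continue
        results.append((level, html[opening.end():closing.start()]))
        pos = closing.end()
    return results


sample = '<h1>One</h1><h2 id="x">Two</h2><h3>Broken</h4>'
print(extract_headings_no_backref(sample))
```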

Common mistakes

The bugs I've shipped or seen in code review.

Using .* (greedy) instead of .*? (non-greedy). Against <div>a</div><div>b</div>, the pattern <div>.*</div> captures the entire string because .* matches as much as possible. Use .*? or, better, a negated class like [^<]*.
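A quick Python check of the three quantifier choices against that input (the variable names are arbitrary):

```python
import re

html = "<div>a</div><div>b</div>"

# Greedy: ".*" runs to the last </div> in the string.
greedy = re.findall(r"<div>(.*)</div>", html)
# Lazy: ".*?" stops at the first </div> it can reach.
lazy = re.findall(r"<div>(.*?)</div>", html)
# Negated class: cannot cross a "<", so it never spans two elements.
negated = re.findall(r"<div>([^<]*)</div>", html)

print(greedy)   # ['a</div><div>b']
print(lazy)     # ['a', 'b']
print(negated)  # ['a', 'b']
```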

Forgetting the s/dotall flag for multi-line content. Without it, . does not match newlines, so a <div> with content on multiple lines fails to match. Add the flag.
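In Python the difference looks like this, using an invented multi-line sample:

```python
import re

html = "<div>\n  first line\n  second line\n</div>"

# Without DOTALL, "." refuses to cross the newlines, so there is no match.
without_flag = re.search(r"<div>(.*?)</div>", html)
# With DOTALL, "." matches newlines and the whole block is captured.
with_flag = re.search(r"<div>(.*?)</div>", html, flags=re.DOTALL)

print(without_flag)               # None
print(repr(with_flag.group(1)))   # '\n  first line\n  second line\n'
```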

Treating regex-extracted HTML as safe to render. A pattern that extracts <a> tags from arbitrary input can include attacker-injected content. Always sanitise via a parser or library (DOMPurify, bleach, HTMLPurifier) before rendering.

Assuming closing tags match opening tags by position. <h1>...</h3> is invalid HTML but matches a pattern like <h[1-6]>(.*?)</h[1-6]>. Use a backreference (<(h[1-6])>...</\1>) to enforce matching.

Trying to match HTML comments with <!--.*-->. Without the s flag this fails on multi-line comments; with it, a string like <!-- a --> b <!-- c --> is captured as one giant comment because .* is greedy. Use <!--[\s\S]*?--> instead: it is non-greedy and needs no flag, which makes it the cleanest cross-engine form.
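The same comparison as a short Python sketch, with a sample string that contains one single-line and one multi-line comment:

```python
import re

html = "<!-- a --> b <!-- c\nd -->"

# Greedy with DOTALL: both comments and the text between them become one match.
greedy = re.findall(r"<!--.*-->", html, flags=re.DOTALL)
# [\s\S]*? is non-greedy and crosses newlines without needing any flag.
lazy = re.findall(r"<!--[\s\S]*?-->", html)

print(greedy)  # ['<!-- a --> b <!-- c\nd -->']
print(lazy)    # ['<!-- a -->', '<!-- c\nd -->']
```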

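Stripping tags with regex and assuming the result is plain text. A naive strip such as <[^>]*> stops at the first >, so in <img alt=">" onerror="alert(1)"> the text " onerror="alert(1)"> survives. Character entities like &#x3C;script&#x3E; also pass through untouched and turn back into markup once decoded. Use a sanitiser library if security matters.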
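The entity case is easy to reproduce. In this sketch naive_strip_tags is a hypothetical helper standing in for any regex tag-strip, and html.unescape plays the role of whatever later step decodes entities:

```python
import html
import re


def naive_strip_tags(markup: str) -> str:
    """Remove anything that looks like a tag; entities are left untouched."""
    return re.sub(r"<[^>]*>", "", markup)


payload = "<b>hi</b> &#x3C;script&#x3E;alert(1)&#x3C;/script&#x3E;"
stripped = naive_strip_tags(payload)
decoded = html.unescape(stripped)

print(stripped)  # hi &#x3C;script&#x3E;alert(1)&#x3C;/script&#x3E;
print(decoded)   # hi <script>alert(1)</script>
```

The stripped string looks harmless, but decoding it restores a complete script element.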

Test cases

| Input | Simple tag pattern | Opening with attrs | Closing tag |
| --- | --- | --- | --- |
| <div> | Match | Match | No match |
| <p class="card"> | No match | Match | No match |
| <h1> | Match | Match | No match |
| </div> | No match | No match | Match |
| <br/> | No match | No match | No match |
| <input type="text" required> | No match | Match | No match |
| <!doctype html> | No match | No match | No match |
| <123> | No match | No match | No match |
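The table can be turned into a small regression check. This is a sketch that assumes each input is tested as a whole string with re.fullmatch; the names PATTERNS, EXPECTED, and run_table are invented for the example.

```python
import re

# The three patterns from the quick reference, in table column order.
PATTERNS = {
    "simple": r"<[a-zA-Z][a-zA-Z0-9]*>",
    "opening_with_attrs": (
        r"""<[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+(\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?)*\s*>"""
    ),
    "closing": r"</[a-zA-Z][a-zA-Z0-9]*\s*>",
}

# Expected (simple, opening_with_attrs, closing) results, copied from the table.
EXPECTED = {
    "<div>": (True, True, False),
    '<p class="card">': (False, True, False),
    "<h1>": (True, True, False),
    "</div>": (False, False, True),
    "<br/>": (False, False, False),
    '<input type="text" required>': (False, True, False),
    "<!doctype html>": (False, False, False),
    "<123>": (False, False, False),
}


def run_table() -> dict[str, tuple[bool, ...]]:
    """Evaluate every pattern against every input as a whole-string match."""
    return {
        text: tuple(re.fullmatch(p, text) is not None for p in PATTERNS.values())
        for text in EXPECTED
    }


for text, row in run_table().items():
    status = "ok" if row == EXPECTED[text] else "MISMATCH"
    print(f"{text!r}: {row} {status}")
```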

See also

External reference: the HTML Living Standard parser section describes how real HTML parsers handle the edge cases regex cannot. The famous Stack Overflow answer "You can't parse HTML with regex" is the canonical statement of the problem.

