TechEarl

How to Match a URL with Regex

Match a URL with regex. Covers http/https schemes, protocol-relative URLs, ports, paths, query strings, fragments, runnable JavaScript / Python / PHP, engine notes, and the URL parser alternative.

Ishan Karunaratne · 10 min read

The practical regex for matching a URL: ^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$. It accepts http://, https://, or no protocol at all (scheme-less input like example.com/path), a domain with at least one dot, an optional path, query string, and fragment. There is also a stricter form that requires the protocol and allows an optional numeric port, and there is the modern alternative that skips regex entirely and uses the language's built-in URL parser. Below I walk through all three, with runnable code in JavaScript, Python, and PHP, engine-specific notes, and the bugs I've seen most often.

The reason "match a URL" has so many regex variants is that the URL standard (RFC 3986) permits a lot of esoteric forms. The practical pattern matches the URLs that real users type and real APIs return; the strict pattern follows the spec more closely. Decide based on what the input is for.

Quick reference

The practical pattern, ready to paste:

code
^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$

Strict pattern (protocol required, optional port):

code
^(https?):\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

HTTPS only (the one I use in production for webhook URLs):

code
^https:\/\/([\w-]+(\.[\w-]+)+)(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

The practical pattern

code
^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$

Left to right:

  • ^ and $ anchor to the full string.
  • (https?:\/\/)? is an optional protocol. https? matches both http and https. The whole group is optional.
  • ([\w-]+(\.[\w-]+)+) is the domain: one or more "word" characters (letters/digits/underscore) plus hyphens, repeated with dots between segments. Requires at least one dot.
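  • ([\/\w \.-]*)*\/? is the optional path: slashes, word characters, spaces, dots, and hyphens, with an optional trailing slash. Two caveats: the class contains a literal space, so paths with spaces pass, and the nested quantifier (...*)* can backtrack exponentially on long inputs that fail near the end, so cap the input length before matching untrusted data.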
  • (\?[^\s#]*)? is the optional query string (everything from ? to a # or whitespace).
  • (#[^\s]*)? is the optional fragment (everything from # to whitespace).

This pattern accepts https://example.com, example.com/path, sub.example.com/path?query=1#section, and most things in between.
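To see which part of a URL lands in which group, here is a minimal Python sketch that runs the practical pattern against one of the inputs above. The re.ASCII flag and the helper name split_url are my additions for illustration; the group numbers simply follow the order of the opening parentheses.

```python
import re

# The practical pattern from this article. re.ASCII (an addition here) keeps
# \w to ASCII letters, digits, and underscore.
URL_RE = re.compile(
    r"^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$",
    re.ASCII,
)

def split_url(value):
    """Return protocol, domain, query, and fragment, or None if no match."""
    m = URL_RE.fullmatch(value)
    if m is None:
        return None
    return {
        "protocol": m.group(1),   # None when the protocol is omitted
        "domain": m.group(2),
        "query": m.group(5),
        "fragment": m.group(6),
    }

parts = split_url("sub.example.com/path?query=1#section")
print(parts)
```

The path group is left out on purpose: a repeated capture group keeps only its last repetition, so it is not a reliable way to read the path back. If you need the path, a URL parser is the better tool.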

The strict pattern (with protocol and port)

If you want to require the protocol and explicitly handle ports:

code
^(https?):\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

The differences:

  • (https?) is required (no ? after the group).
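  • (:[0-9]{1,5})? is an optional port of one to five digits. That admits 0 and anything up to 99999, while valid ports stop at 65535, so range-check the number in code if it matters.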
  • The path uses [^\s?#]* so it stops at the first space, ?, or #.

Use this when the URL is coming from a trusted source and you want to reject obviously-broken inputs like htttps://example.com (note the triple t).
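The strict pattern only checks that the port is one to five digits, so a range check has to happen in code. A minimal Python sketch, assuming you want ports 1 to 65535; the function name is_valid_strict_url and the re.ASCII flag are my additions:

```python
import re

# The strict pattern from this article; group 4 is the optional ":port".
STRICT_URL_RE = re.compile(
    r"^(https?):\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$",
    re.ASCII,
)

def is_valid_strict_url(value):
    """Shape check with the regex, then a numeric range check on the port."""
    m = STRICT_URL_RE.fullmatch(value)
    if m is None:
        return False
    port = m.group(4)              # e.g. ":8080", or None when absent
    if port is None:
        return True
    return 1 <= int(port[1:]) <= 65535

print(is_valid_strict_url("https://example.com:8080/path"))
print(is_valid_strict_url("https://example.com:99999/path"))
```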

Examples in JavaScript, Python, and PHP

JavaScript:

javascript
const urlPattern = /^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$/;
function isValidUrl(input) {
  return urlPattern.test(input);
}
isValidUrl("https://example.com");           // true
isValidUrl("example.com/path?q=1");          // true
isValidUrl("ftp://example.com");             // false (not http/https)

Python:

python
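import re

# re.ASCII keeps \w to [A-Za-z0-9_], the same set JavaScript uses.
URL_RE = re.compile(
    r"^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$",
    re.ASCII,
)

def is_valid_url(value: str) -> bool:
    # fullmatch rejects a trailing newline, which $ alone would let through.
    return bool(URL_RE.fullmatch(value))

is_valid_url("https://example.com/path")    # True
is_valid_url("not a url")                   # False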

PHP:

php
function isValidUrl(string $value): bool {
    // The D modifier stops $ from matching before a trailing newline.
    $pattern = '/^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$/D';
    return (bool) preg_match($pattern, $value);
}

isValidUrl("https://techearl.com/regex-match-url");  // true
isValidUrl("javascript:alert(1)");                   // false

For PHP specifically, the built-in alternative is filter_var($value, FILTER_VALIDATE_URL). It is stricter than the regex above and refuses URLs without a scheme.

When to skip regex and use a URL parser instead

For anything more than "is this URL-shaped", a regex is the wrong tool. Every modern language has a URL parser that handles edge cases the regex cannot: internationalised domain names (例え.テスト), userinfo (user:pass@host), IPv6 literals (http://[::1]/), percent-encoding, all of it.

JavaScript:

javascript
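function isValidUrl(input) {
  try {
    const url = new URL(input);
    // new URL() also parses javascript:, data:, and file: URLs, so check the scheme.
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false;
  }
}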

Python:

python
from urllib.parse import urlparse

def is_valid_url(value: str) -> bool:
    try:
        result = urlparse(value)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

PHP:

php
function isValidUrl(string $value): bool {
    return filter_var($value, FILTER_VALIDATE_URL) !== false;
}

The trade-off: parsers are more correct but slower than regex. For high-volume input validation (form fields on a busy site, log scanning), regex wins. For "is this safe to redirect to?", use the parser and inspect specific fields (scheme, host, port).
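As a rough illustration of "inspect specific fields", here is a Python sketch built on urllib.parse.urlsplit. The helper name inspect_url and the ALLOWED_SCHEMES set are assumptions for the example, not part of any library.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web URLs are wanted

def inspect_url(value):
    """Return (scheme, host, port) for an acceptable URL, else None."""
    try:
        parts = urlsplit(value)
        port = parts.port   # raises ValueError for a malformed or out-of-range port
    except ValueError:
        return None
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return None
    # hostname is already lower-cased by urlsplit; port is an int or None.
    return parts.scheme, parts.hostname, port

print(inspect_url("https://Example.com:8443/x"))
print(inspect_url("javascript:alert(1)"))
```

Returning the parsed fields rather than a bare boolean lets the caller make the next decision (allow-list the host, restrict the port) without parsing twice.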

Engine compatibility

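The practical and strict patterns use only widely supported features (anchors, character classes, groups, quantifiers), so they compile unmodified in every engine below. One behavioural difference to watch: \w is ASCII-only in JavaScript and Go, but Unicode-aware by default in Python 3 (pass re.ASCII) and in the Rust regex crate. The per-engine notes are about the parser fallback you reach for when you need correctness over speed.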

| Engine | Parser equivalent | Per-engine note |
| --- | --- | --- |
| JavaScript | new URL(input) | Throws on invalid input; wrap in try/catch. Supports IDN and IPv6 literals out of the box. |
| Python | urllib.parse.urlparse | Returns a struct even for non-URL input; check scheme and netloc are non-empty. |
| PHP (PCRE) | filter_var($v, FILTER_VALIDATE_URL) | Follows RFC 2396 (the older spec). Rejects IDN without idn_to_ascii preprocessing. |
| Java | java.net.URI(s).toURL() | URI parses, toURL() enforces a known scheme. |
| .NET | Uri.TryCreate(s, UriKind.Absolute, out _) | The recommended cross-version approach. |
| Go (RE2) | net/url.Parse | Returns no error for partial URLs; check u.Scheme and u.Host explicitly. RE2 lacks lookahead so any pattern with (?=...) needs rewriting. |
| Rust (regex crate) | url::Url::parse (url crate) | No lookahead, no backreferences. Stick to the practical pattern. |
| Ruby | URI.parse(s) | Raises on invalid; rescue URI::InvalidURIError. |

For cross-language form validation where the same pattern runs on the frontend and the backend, keep to the practical form. Anything richer should defer to the language's URL parser.

Common mistakes

The bugs I see most often.

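Allowing any scheme without thinking. A pattern like ^[a-z]+:\/\/ accepts any scheme that is followed by //, which includes file:///etc/passwd and javascript://-prefixed payloads. A looser ^[a-z]+: also lets through javascript:, data:, and vbscript:. Always restrict the scheme to the ones you actually want (https? for web URLs, or just https for security-sensitive contexts).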

Forgetting the second anchor. ^https?:\/\/[\w-]+ accepts https://exampleEXTRA_GARBAGE_HERE because nothing pins the end. Anchor both sides for validation.
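A short Python sketch of the difference. UNANCHORED is the pattern quoted above; ANCHORED is a simplified host-only pattern I made up for the comparison, not one of the article's patterns. It also shows a Python-specific wrinkle: $ matches just before a trailing newline, so fullmatch is the safer call for validation.

```python
import re

UNANCHORED = re.compile(r"^https?:\/\/[\w-]+")
# Hypothetical host-only pattern, anchored at both ends.
ANCHORED = re.compile(r"^https?:\/\/[\w-]+(\.[\w-]+)+$", re.ASCII)

candidate = "https://example.com/<script>alert(1)</script>"

prefix_only = UNANCHORED.match(candidate) is not None    # prefix alone satisfies it
whole_string = ANCHORED.fullmatch(candidate) is not None # whole string must fit

# $ tolerates one trailing newline; fullmatch does not.
dollar_newline = ANCHORED.match("https://example.com\n") is not None
fullmatch_newline = ANCHORED.fullmatch("https://example.com\n") is not None

print(prefix_only, whole_string, dollar_newline, fullmatch_newline)
```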

Treating regex-validated URLs as safe to redirect to. A URL can be "shaped right" and still point at an attacker-controlled host. For open-redirect prevention, parse the URL and inspect the host against an allow-list.
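A minimal sketch of the allow-list check in Python. The host names in ALLOWED_REDIRECT_HOSTS and the function name is_safe_redirect are placeholders; the point is that the comparison is made against the parsed hostname, not against the raw string.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; replace with the hosts your application owns.
ALLOWED_REDIRECT_HOSTS = {"example.com", "app.example.com"}

def is_safe_redirect(target):
    """Accept only https URLs whose parsed host is on the allow-list."""
    try:
        parts = urlsplit(target)
    except ValueError:
        return False
    return parts.scheme == "https" and parts.hostname in ALLOWED_REDIRECT_HOSTS

print(is_safe_redirect("https://example.com/welcome"))
print(is_safe_redirect("https://example.com@evil.test/welcome"))
```

The second call is the interesting one: the string starts with https://example.com, but everything before @ is userinfo, so the parsed host is evil.test and the check fails.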

Not allowing the protocol-relative form when you should. Patterns that force https?:\/\/ reject //cdn.example.com/file.js, which is legal in HTML and common in CDN configs. Decide whether to accept this case and adjust.
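If you decide to accept the protocol-relative form, one way is to make only the "https?:" part optional while keeping the "//" mandatory. The pattern below is my own variant of the strict pattern, offered as a sketch rather than a tested production rule:

```python
import re

# Hypothetical variant of the strict pattern: "https?:" is optional, "//" is not,
# so both absolute and protocol-relative URLs pass but bare hosts do not.
PROTOCOL_RELATIVE_OK = re.compile(
    r"^(https?:)?\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$",
    re.ASCII,
)

def accepts(value):
    return PROTOCOL_RELATIVE_OK.fullmatch(value) is not None

print(accepts("//cdn.example.com/file.js"))
print(accepts("cdn.example.com/file.js"))
```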

Storing the raw input instead of the parsed form. Two URLs that resolve to the same resource (HTTPS://Example.com/Path and https://example.com/Path) compare unequal as strings. Always normalise via the URL parser before storing or comparing.
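A minimal normalisation sketch in Python, assuming the URL carries no userinfo (so lower-casing the whole network location is safe). The function name normalise_url is mine. The path is left untouched because paths are case-sensitive.

```python
from urllib.parse import urlsplit, urlunsplit

def normalise_url(value):
    """Lower-case the scheme and host; leave path, query, and fragment alone."""
    parts = urlsplit(value)
    # Assumption: no user:pass@ prefix, so lower-casing all of netloc is safe.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )

a = normalise_url("HTTPS://Example.com/Path")
b = normalise_url("https://example.com/Path")
print(a == b)
```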

Trusting the path part to be free of HTML. A URL like https://example.com/<script> is valid as a URL but unsafe to render unescaped. The regex validates the shape; HTML-escape on output regardless.
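For illustration, escaping on output with Python's standard html module. The render_link helper is hypothetical; in a real application the template engine's auto-escaping would normally do this for you.

```python
import html

def render_link(url):
    """Build an anchor tag with the URL escaped for both attribute and text."""
    safe = html.escape(url, quote=True)
    return f'<a href="{safe}">{safe}</a>'

print(render_link("https://example.com/<script>"))
```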

Test cases: matches and non-matches

| Input | Practical pattern | Notes |
| --- | --- | --- |
| https://example.com | Match | Standard |
| http://example.com | Match | Standard |
| example.com/path | Match | No scheme |
| example.com | Match | Just domain |
| https://example.com/path?q=1&p=2#anchor | Match | Full URL |
| https://sub.example.co.uk:8080/path | No match (the strict pattern matches) | Port |
| htp://example.com | No match | Wrong scheme |
| https:// | No match | Domain required |
| https://example | No match | No TLD |
| javascript:alert(1) | No match | Not a URL scheme we accept |
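The table can be turned into a small regression check. This Python sketch runs every input through the practical pattern and, for the port row, through the strict pattern as well, since the practical pattern has no port group. The names CASES and run_cases are assumptions for the example.

```python
import re

PRACTICAL = re.compile(
    r"^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$",
    re.ASCII,
)
STRICT = re.compile(
    r"^(https?):\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$",
    re.ASCII,
)

# (input, expected result for the practical pattern)
CASES = [
    ("https://example.com", True),
    ("http://example.com", True),
    ("example.com/path", True),
    ("example.com", True),
    ("https://example.com/path?q=1&p=2#anchor", True),
    ("https://sub.example.co.uk:8080/path", False),  # needs the strict pattern
    ("htp://example.com", False),
    ("https://", False),
    ("https://example", False),
    ("javascript:alert(1)", False),
]

def run_cases():
    """Return (input, actual) pairs for the practical pattern."""
    return [(url, PRACTICAL.fullmatch(url) is not None) for url, _ in CASES]

for (url, actual), (_, expected) in zip(run_cases(), CASES):
    print("ok  " if actual == expected else "FAIL", url)
```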

FAQ

Should I use regex or a URL parser?

Use a URL parser (new URL() in JavaScript, urlparse in Python, filter_var in PHP) when correctness matters, for example when deciding whether to redirect a user to the URL or when storing it in a database.

Use regex when speed matters more than handling every edge of the URL spec, or when you need to enforce something the parser doesn't (only HTTPS, only specific domains, no userinfo).

Why does my regex match non-HTTP schemes like javascript:?

If your pattern uses [a-z]+: or [a-z]+:\/\/ without restricting the scheme, it will match any scheme written in that form. The practical pattern in this article uses https?:\/\/, which only allows http and https.

Other dangerous schemes to explicitly reject in user input: javascript:, data:, vbscript:, file:. Always inspect the scheme; never blindly redirect to a user-provided URL.

Does the pattern match internationalised domain names?

No, at least not in JavaScript or PHP, where [\w-] covers only ASCII letters, digits, underscore, and hyphen. (In Python 3, \w is Unicode-aware unless you pass re.ASCII.) Internationalised domains like 例え.テスト use Unicode and are encoded as Punycode (xn--r8jz45g.xn--zckzah) for DNS purposes. The Punycode form is plain ASCII, so it does match.

If you need to accept internationalised domains, use a URL parser instead. Some parsers, such as JavaScript's new URL(), convert the host to Punycode for you.
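Python's urlparse does not convert the host for you, but the standard library can do the encoding step on its own. A minimal sketch, with to_ascii_host as a made-up helper name; note that the built-in "idna" codec implements the older IDNA 2003 rules, which is an assumption worth checking against your requirements.

```python
def to_ascii_host(host):
    """Encode a Unicode hostname to its ASCII (Punycode) form."""
    # Python's built-in "idna" codec follows IDNA 2003, not the newer IDNA 2008.
    return host.encode("idna").decode("ascii")

encoded = to_ascii_host("例え.テスト")
print(encoded)
```

Once the host is in this form, an ASCII-only pattern can validate it like any other domain.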

How do I match only HTTPS URLs?

Replace https? with https in the pattern: ^https:\/\/([\w-]+(\.[\w-]+)+).... The s is no longer optional, so http:// URLs fail to match.

This is the pattern to use when you want to enforce TLS on user-submitted links (webhooks, OAuth callbacks, payment-success URLs).

Why does PHP's filter_var reject URLs that the regex accepts?

PHP's FILTER_VALIDATE_URL follows RFC 2396 (the older URL spec) and rejects URLs with internationalised domains or some Unicode characters even after percent-encoding. It also requires a scheme by default.

Be aware that FILTER_FLAG_PATH_REQUIRED tightens validation rather than relaxing it: it forces the URL to include a path component, so http://example.com fails while http://example.com/ passes. For a more permissive check, drop the flags entirely and fall back to a regex like the one in this article or a dedicated URL parser.

How do I extract the domain from a URL?

Capture the host portion in a group: ^https?:\/\/([^\/\s:?#]+). After matching, the host is in group 1. This handles ports correctly by stopping at the first :.

For anything more involved (extracting userinfo, ports, paths separately), use the URL parser. See the domain-matching guide for the standalone domain pattern.
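The host-capture pattern in use, as a Python sketch (extract_host is a name chosen for the example). The last call shows why userinfo needs a parser: the capture stops at the first colon, which in that input belongs to user:pass rather than to a port.

```python
import re

HOST_RE = re.compile(r"^https?:\/\/([^\/\s:?#]+)")

def extract_host(url):
    """Return the text between '://' and the first '/', ':', '?', or '#'."""
    m = HOST_RE.match(url)
    return m.group(1) if m else None

print(extract_host("https://sub.example.co.uk:8080/path"))
print(extract_host("https://user:pass@example.com/"))
```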

Does the pattern handle query strings and fragments?

Yes. The practical pattern includes (\?[^\s#]*)? for the optional query string and (#[^\s]*)? for the optional fragment. Both stop at whitespace; the query also stops at # so the fragment can take over.

What it does not do is validate the query-string structure (key=value pairs, percent-encoding). For that, parse the URL and use the parser's query iterator.
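In Python, the query iterator is urllib.parse.parse_qsl. A short sketch, with query_pairs as an illustrative helper name:

```python
from urllib.parse import parse_qsl, urlsplit

def query_pairs(url):
    """Return the decoded (key, value) pairs from the URL's query string."""
    # keep_blank_values=True preserves keys with an empty value, e.g. "empty=".
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)

print(query_pairs("https://example.com/path?q=1&p=2#anchor"))
print(query_pairs("https://example.com/?name=a%20b&empty="))
```

The fragment is split off by urlsplit before the query is parsed, and percent-encoded values come back decoded.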

See also

External reference: try the pattern interactively at regex101.com and see every token explained. For URL parsing details, the WHATWG URL Standard is the modern spec implemented by browsers.


