TechEarl

How to Match a URL with Regex

Match a URL with regex. Covers http/https schemes, protocol-relative URLs, ports, paths, query strings, fragments, runnable JavaScript / Python / PHP, engine notes, and the URL parser alternative.

Ishan Karunaratne · 9 min read

The practical regex for matching a URL: ^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$. It accepts http://, https://, or no scheme at all (bare input like example.com/path), a domain with at least one dot, an optional path, query string, and fragment. There is also a stricter form that requires the scheme and allows an optional port of up to five digits, and there is the modern alternative that skips regex entirely and uses the language's built-in URL parser. Below I walk through all three, with runnable code in JavaScript, Python, and PHP, engine-specific notes, and the bugs I've seen most often.

The reason "match a URL" has so many regex variants is that the URL standard (RFC 3986) permits a lot of esoteric forms. The practical pattern matches the URLs that real users type and real APIs return; the strict pattern follows the spec more closely. Decide based on what the input is for.

Quick reference

The practical pattern, ready to paste:

code
^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$

Strict pattern (protocol required, optional port):

code
^(https?):\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

HTTPS only (the one I use in production for webhook URLs):

code
^https:\/\/([\w-]+(\.[\w-]+)+)(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

The practical pattern

code
^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$

Left to right:

  • ^ and $ anchor to the full string.
  • (https?:\/\/)? is an optional protocol. https? matches both http and https. The whole group is optional.
  • ([\w-]+(\.[\w-]+)+) is the domain: one or more "word" characters (letters/digits/underscore) plus hyphens, repeated with dots between segments. Requires at least one dot.
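  • ([\/\w \.-]*)*\/? is the optional path: any run of slashes, word characters, dots, hyphens, and literal spaces (note the space inside the class). There is no colon in the class, so input with a port such as :8080 is rejected. The starred group wrapped in a second star can backtrack heavily on long input that fails to match, so cap the input length before testing untrusted strings.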
  • (\?[^\s#]*)? is the optional query string (everything from ? to a # or whitespace).
  • (#[^\s]*)? is the optional fragment (everything from # to whitespace).

This pattern accepts https://example.com, example.com/path, sub.example.com/path?query=1#section, and most things in between.
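The nested quantifier in the path group deserves a closer look. Below is a minimal Python sketch that compares the practical pattern with a flattened variant whose path is a single character class. FLAT_RE, SAMPLES, and same_verdict are illustrative names, not part of the original pattern. The variant also changes the capture group numbering, so treat it as an alternative for match/no-match checks rather than a drop-in replacement.

```python
import re

# The practical pattern exactly as given above.
PRACTICAL_RE = re.compile(
    r"^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$"
)

# Hypothetical flattened variant: ([\/\w \.-]*)*\/? collapses to [\/\w \.-]*
# because "/" is already in the class and a starred class needs no outer star.
FLAT_RE = re.compile(
    r"^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)(\?[^\s#]*)?(#[^\s]*)?$"
)

SAMPLES = [
    "https://example.com",
    "example.com/path?q=1",
    "example.com/a b/c",              # the class contains a literal space
    "https://example.com:8080/path",  # no ":" in the path class
    "ftp://example.com",
    "not a url",
]


def same_verdict(value: str) -> bool:
    """True when both patterns agree on whether value matches."""
    return bool(PRACTICAL_RE.fullmatch(value)) == bool(FLAT_RE.fullmatch(value))


for sample in SAMPLES:
    matched = bool(FLAT_RE.fullmatch(sample))
    print(f"{sample!r}: match={matched}, agree={same_verdict(sample)}")
```

Both patterns describe the same set of strings. The flattened form gives the engine only one way to split a path, which keeps failed matches cheap.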

The strict pattern (with protocol and port)

If you want to require the protocol and explicitly handle ports:

code
^(https?):\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

The differences:

  • (https?) is required (no ? after the group).
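  • (:[0-9]{1,5})? is an optional port of one to five digits. The regex checks only the digit count, so it accepts anything from 0 to 99999; the real range (0 to 65535) needs a separate numeric check.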
  • The path uses [^\s?#]* so it stops at the first space, ?, or #.

Use this when the URL is coming from a trusted source and you want to reject obviously-broken inputs like htttps://example.com (note the triple t).
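To pull the pieces out of a match and enforce the real port range, one option is to rewrite the strict pattern with named groups. This is a sketch under assumptions: the group names and the parse_strict helper are mine, not part of the original pattern, and the 65535 ceiling is the standard TCP/UDP port limit.

```python
import re
from typing import Optional

# Strict pattern from above, rewritten with named groups (names are illustrative).
STRICT_RE = re.compile(
    r"^(?P<scheme>https?)://"
    r"(?P<host>[\w-]+(?:\.[\w-]+)+)"
    r"(?::(?P<port>[0-9]{1,5}))?"
    r"(?P<path>/[^\s?#]*)?"
    r"(?P<query>\?[^\s#]*)?"
    r"(?P<fragment>#\S*)?$"
)


def parse_strict(value: str) -> Optional[dict]:
    """Return the named parts, or None if the shape or the port range is wrong."""
    m = STRICT_RE.fullmatch(value)
    if m is None:
        return None
    parts = m.groupdict()
    # The regex only guarantees one to five digits; enforce the real range here.
    if parts["port"] is not None and int(parts["port"]) > 65535:
        return None
    return parts


print(parse_strict("https://sub.example.co.uk:8080/path?q=1#top"))
print(parse_strict("https://example.com:99999/"))  # None: port out of range
```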

Examples in JavaScript, Python, and PHP

JavaScript:

javascript
const urlPattern = /^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$/;
function isValidUrl(input) {
  return urlPattern.test(input);
}
isValidUrl("https://example.com");           // true
isValidUrl("example.com/path?q=1");          // true
isValidUrl("ftp://example.com");             // false (not http/https)

Python:

python
import re
URL_RE = re.compile(
    r"^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$"
)

def is_valid_url(value: str) -> bool:
    # fullmatch, not match: in Python, $ also matches just before a trailing "\n"
    return bool(URL_RE.fullmatch(value))

is_valid_url("https://example.com/path")    # True
is_valid_url("not a url")                   # False

PHP:

php
function isValidUrl(string $value): bool {
    // The D modifier stops $ from matching before a trailing newline
    $pattern = '/^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$/D';
    return (bool) preg_match($pattern, $value);
}

isValidUrl("https://techearl.com/regex-match-url");  // true
isValidUrl("javascript:alert(1)");                   // false

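For PHP specifically, the built-in alternative is filter_var($value, FILTER_VALIDATE_URL). It refuses URLs without a scheme, so it is stricter than the regex above on that point, but it accepts schemes other than HTTP (ssh:// or mailto:, for example), so check the scheme separately if you only want http and https.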

When to skip regex and use a URL parser instead

For anything more than "is this URL-shaped", a regex is the wrong tool. Every modern language has a URL parser that handles edge cases the regex cannot: internationalised domain names (例え.テスト), userinfo (user:pass@host), IPv6 literals (http://[::1]/), percent-encoding, all of it.

JavaScript:

javascript
function isValidUrl(input) {
  try {
    new URL(input);
    return true;
  } catch {
    return false;
  }
}

Python:

python
from urllib.parse import urlparse

def is_valid_url(value: str) -> bool:
    try:
        result = urlparse(value)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

PHP:

php
function isValidUrl(string $value): bool {
    return filter_var($value, FILTER_VALIDATE_URL) !== false;
}

The trade-off: parsers are more correct but usually slower than a simple regex. For high-volume shape checks (form fields on a busy site, log scanning), regex wins, provided you cap the input length. For "is this safe to redirect to?", use the parser and inspect specific fields (scheme, host, port).

Engine compatibility

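The practical and strict patterns use only widely supported features (anchors, character classes, quantifiers, groups) and no lookaround or backreferences, so they port to every engine below with at most delimiter changes. One behavioural difference: \w is ASCII-only in JavaScript but Unicode-aware by default in Python 3, so the Python version also accepts non-ASCII hosts unless you compile with re.ASCII. The per-engine notes are about the parser fallback you reach for when you need correctness over speed.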

Engine | Parser equivalent | Per-engine note
--- | --- | ---
JavaScript | new URL(input) | Throws on invalid input; wrap in try/catch. Supports IDN and IPv6 literals out of the box.
Python | urllib.parse.urlparse | Returns a struct even for non-URL input; check scheme and netloc are non-empty.
PHP (PCRE) | filter_var($v, FILTER_VALIDATE_URL) | Follows RFC 2396 (the older spec). Rejects IDN without idn_to_ascii preprocessing.
Java | new java.net.URI(s).toURL() | URI parses, toURL() enforces a known scheme.
.NET | Uri.TryCreate(s, UriKind.Absolute, out _) | The recommended cross-version approach.
Go (RE2) | url.Parse (net/url) | Returns no error for partial URLs; check u.Scheme and u.Host explicitly. RE2 lacks lookahead, so any pattern with (?=...) needs rewriting.
Rust (regex crate) | url::Url::parse (url crate) | No lookahead, no backreferences. Stick to the practical pattern.
Ruby | URI.parse(s) | Raises on invalid input; rescue URI::InvalidURIError.

For cross-language form validation where the same pattern runs on the frontend and the backend, keep to the practical form. Anything richer should defer to the language's URL parser.

Common mistakes

The bugs I see most often.

Allowing any scheme without thinking. A pattern like ^[a-z]+:\/\/ accepts file:///etc/passwd and javascript://%0Aalert(1), and dropping the \/\/ requirement lets in javascript:, data:, and vbscript: as well. Always restrict the scheme to the ones you actually want (https? for web URLs, or just https for security-sensitive contexts).

Forgetting the second anchor. ^https?:\/\/[\w-]+ accepts https://example"><script>alert(1)</script> because nothing pins the end: the regex stops after example and never looks at the rest. Anchor both sides for validation.
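A short Python sketch of the difference, assuming a simplified host-only pattern (UNANCHORED and ANCHORED are illustrative names, not patterns from earlier in the article). It also shows a Python-specific wrinkle: $ matches just before a final newline, so fullmatch is the safer call for validation.

```python
import re

UNANCHORED = re.compile(r"^https?://[\w-]+")             # start anchor only
ANCHORED = re.compile(r"^https?://[\w-]+(\.[\w-]+)+$")   # both anchors

junk = 'https://example"><script>alert(1)</script>'
print(bool(UNANCHORED.match(junk)))  # True: the prefix matches, the tail is ignored
print(bool(ANCHORED.match(junk)))    # False: the end anchor forces a full check

# In Python, $ also matches just before a trailing newline.
with_newline = "https://example.com\n"
print(bool(ANCHORED.match(with_newline)))      # True
print(bool(ANCHORED.fullmatch(with_newline)))  # False
```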

Treating regex-validated URLs as safe to redirect to. A URL can be "shaped right" and still point at an attacker-controlled host. For open-redirect prevention, parse the URL and inspect the host against an allow-list.
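A minimal sketch of that allow-list check in Python. ALLOWED_HOSTS and is_safe_redirect are hypothetical names, and the host list is a placeholder. It is a starting point under those assumptions, not a complete open-redirect defence.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; replace with the hosts your application trusts.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_safe_redirect(target: str) -> bool:
    """Accept only https URLs whose exact host is on the allow-list."""
    try:
        parts = urlsplit(target)
        host = parts.hostname  # lowercased, with userinfo and port stripped
    except ValueError:
        return False
    return parts.scheme == "https" and host in ALLOWED_HOSTS


print(is_safe_redirect("https://example.com/account"))      # True
print(is_safe_redirect("https://example.com.evil.test/"))   # False
print(is_safe_redirect("https://example.com@evil.test/"))   # False
```

Comparing the parsed hostname, not a substring of the raw string, is what defeats lookalikes such as example.com.evil.test and the userinfo trick in the last line.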

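Not allowing the protocol-relative form when you should. Patterns that force https?:\/\/ reject //cdn.example.com/file.js, which is legal in HTML and common in CDN configs. Both patterns in this article reject it too: the practical pattern accepts a missing scheme, but not a leading //. Decide whether to accept this case and adjust the scheme group.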
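If you do want the protocol-relative form, one possible adjustment is to make only the scheme name optional and keep the slashes. The sketch below applies that change to the practical pattern; RELATIVE_OK_RE and accepts are illustrative names, and the change shifts the capture group numbers by one.

```python
import re

# Practical pattern with the scheme group changed from (https?:\/\/)? to
# ((https?:)?\/\/)? so that "https://", "http://", "//" and no prefix all pass.
RELATIVE_OK_RE = re.compile(
    r"^((https?:)?\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$"
)


def accepts(value: str) -> bool:
    """True when value matches the protocol-relative-friendly variant."""
    return bool(RELATIVE_OK_RE.fullmatch(value))


print(accepts("//cdn.example.com/file.js"))  # True
print(accepts("https://example.com"))        # True
print(accepts("ftp://example.com"))          # False
```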

Storing the raw input instead of the parsed form. Two URLs that resolve to the same resource (HTTPS://Example.com/Path and https://example.com/Path) compare unequal as strings. Always normalise via the URL parser before storing or comparing.
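A minimal normalisation sketch in Python, assuming that lowercasing the scheme and host is all you need. normalise_url is a hypothetical helper. It deliberately drops any user:pass@ part and does not remove default ports or resolve dot segments.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(value: str) -> str:
    """Lowercase the scheme and host; leave path, query, and fragment untouched."""
    parts = urlsplit(value)              # .scheme comes back lowercased
    host = parts.hostname or ""          # .hostname comes back lowercased
    if ":" in host:                      # IPv6 literal: restore the brackets
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Assumption: userinfo (user:pass@) is intentionally not carried over.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(normalise_url("HTTPS://Example.com/Path"))  # https://example.com/Path
```

The path keeps its case because paths are case-sensitive on most servers; only the scheme and host are safe to fold.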

Trusting the path part to be free of HTML. A URL like https://example.com/<script> is valid as a URL but unsafe to render unescaped. The regex validates the shape; HTML-escape on output regardless.
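As a small illustration in Python, the standard library's html.escape covers the output side. render_link is a hypothetical helper, and the sketch assumes the URL has already passed whatever shape and scheme checks you apply.

```python
from html import escape


def render_link(url: str) -> str:
    """Build an anchor tag with the URL escaped for both attribute and text."""
    safe = escape(url, quote=True)  # escapes &, <, >, and both quote characters
    return f'<a href="{safe}">{safe}</a>'


print(render_link("https://example.com/<script>"))
# <a href="https://example.com/&lt;script&gt;">https://example.com/&lt;script&gt;</a>
```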

Test cases: matches and non-matches

Input | Practical pattern | Strict pattern | Notes
--- | --- | --- | ---
https://example.com | Match | Match | Standard
http://example.com | Match | Match | Standard
example.com/path | Match | No match | No scheme
example.com | Match | No match | Just domain
https://example.com/path?q=1&p=2#anchor | Match | Match | Full URL
https://sub.example.co.uk:8080/path | No match | Match | Port
htp://example.com | No match | No match | Wrong scheme
https:// | No match | No match | Domain required
https://example | No match | No match | No dot in the host
javascript:alert(1) | No match | No match | Not a URL scheme we accept
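The table can be turned into a table-driven check. The Python sketch below runs both patterns from the quick reference against the inputs above. The expected values for the strict pattern are my own reading of that pattern, and CASES and failures are illustrative names.

```python
import re

PRACTICAL = re.compile(
    r"^(https?:\/\/)?([\w-]+(\.[\w-]+)+)([\/\w \.-]*)*\/?(\?[^\s#]*)?(#[^\s]*)?$"
)
STRICT = re.compile(
    r"^(https?):\/\/([\w-]+(\.[\w-]+)+)(:[0-9]{1,5})?(\/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$"
)

# (input, expected practical result, expected strict result)
CASES = [
    ("https://example.com", True, True),
    ("http://example.com", True, True),
    ("example.com/path", True, False),
    ("example.com", True, False),
    ("https://example.com/path?q=1&p=2#anchor", True, True),
    ("https://sub.example.co.uk:8080/path", False, True),
    ("htp://example.com", False, False),
    ("https://", False, False),
    ("https://example", False, False),
    ("javascript:alert(1)", False, False),
]


def failures() -> list:
    """Return the rows whose actual results differ from the expected ones."""
    bad = []
    for value, want_practical, want_strict in CASES:
        got = (bool(PRACTICAL.fullmatch(value)), bool(STRICT.fullmatch(value)))
        if got != (want_practical, want_strict):
            bad.append((value, got))
    return bad


print(failures())  # an empty list means every row behaved as expected
```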

See also

External reference: try the pattern interactively at regex101.com and see every token explained. For URL parsing details, the WHATWG URL Standard is the modern spec implemented by browsers.
