Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
257 views
in Technique[技术] by (71.8m points)

php - What is the RFC compliant and working regular expression to check if a string is a valid URL

There is question by the almost the same name already: What is the best regular expression to check if a string is a valid URL

I don't understand this stackoverflow. It seems like I need reputation to comment an answer. As I don't have it, I don't know how to tell/ask that the proposed solution doesn't seem to work. So I'm forced to make a new question and ask for the solution this way?

UPDATE: So it seems that that Reg Exp supports IPV6 and I was to blame as the IPv6 is supposed to go like http://[2620:0:1cfe:face:b00c::3]/.

So only real problem I know with that now is, that it accepts example.org: as valid URL.

Or is PHP to blame?

/**
  * Validate URL - RFC 3987 (IRI)
  *
  * https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url
  *
  * @param string $str_url
  * @return boolean
  */
 function is_url($str_url)
 {
  // RFC 3987 For absolute IRIs (internationalized):
  return (bool) preg_match('/^[a-z](?:[-a-z0-9+.])*:(?://(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:])*@)?(?:[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4}:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+[-a-z0-9._~!$&'()*+,;=:]+)]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=@])*)(?::[0-9]*)?(?:/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@]))*)*|/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@]))+)(?:/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@]))+)(?:/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@])))(?:?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@])|[x{E000}-x{F8FF}x{F0000}-x{FFFFD}|x{100000}-x{10FFFD}/?])*)?(?:#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9._~x{A0}-x{D7FF}x{F900}-x{FDCF}x{FDF0}-x{FFEF}x{10000}-x{1FFFD}x{20000}-x{2FFFD}x{30000}-x{3FFFD}x{40000}-x{4FFFD}x{50000}-x{5FFFD}x{60000}-x{6FFFD}x{70000}-x{7FFFD}x{80000}-x{8FFFD}x{90000}-x{9FFFD}x{A0000}-x{AFFFD}x{B0000}-x{BFFFD}x{C0000}-x{CFFFD}x{D0000}-x{DFFFD}x{E1000}-x{EFFFD}!$&'()*+,;=:@])|[/?])*)?$/iu',$str_url);
 }

Here is the test for it:

$urls=array('http://www.example.org/','http://www.example.org:80/','example.org','ftp://user:pass@example.org/','http://example.org/?cat=5&test=joo','http://www.fi/?cat=5&test=joo','http://[::1]/','http://[2620:0:1cfe:face:b00c::3]/','http://[2620:0:1cfe:face:b00c::3]:80/','');
foreach ($urls as $a)
{
    echo $a."
";
    $a=is_url($a);
    var_dump($a);
}

And that outputs:

"http://www.example.org/" bool(true)
"http://www.example.org:80/" bool(true)
"example.org" bool(false)
"ftp://user:pass@example.org/" bool(true)
"http://example.org/?cat=5&test=joo" bool(true)
"http://www.fi/?cat=5&test=joo" bool(true) 
"http://[::1]/" bool(true)
"http://[2620:0:1cfe:face:b00c::3]/" bool(true)
"http://[2620:0:1cfe:face:b00c::3]:80/" bool(true)
"" bool(false)

So what is the RFC compilicant and working regexp?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Well, if you look at it, the specification is broken down into "chunks". That's how I'd suggest building the regex so that it's easier to read, more maintainable and understandable. So, the parts of the regex are (Optional are italicized):

  1. Scheme
  2. Username/Password
  3. Domain Or IP Address
  4. Port
  5. Path
  6. Query
  7. Anchor

So, we need to build a regex sub-part for each.

  1. Scheme:

    $scheme = "[a-z][a-z0-9+.-]*";
    
  2. Username/Password:

    $username = "([^:@/](:[^:@/])?@)?";
    
  3. Domain or IP Address:

    Now, we need to build up the 3 possible hosts:

    1. Domain Name
    2. IPv4
    3. IPv6

    Domain Name:

    $segment = "([a-z][a-z0-9-]*?[a-z0-9])";
    $domain = "({$segment}.)*{$segment}";
    

    IPv4:

    $segment = "([0|1][0-9]{2}|2([0-4][0-9]|5[0-5]))";
    $ipv4 = "({$segment}.{$segment}.{$segment}.{$segment})";
    

    IPv6:

    $block = "([a-f0-9]{0,4})";
    $rawIpv6 = "({$block}:){2,8}";
    $ipv4sub = "(::ffff:{$ipv4})";
    $ipv6 = "([({$rawIpv6}|{$ipv4sub})])";
    

    Finally:

    $host = "($domain|$ipv4|$ipv6)";
    
  4. Port:

    $port = "(:[d]{1,5})?";
    
  5. Path:

    $path = "([^?;#]*)?";
    
  6. Query:

    $query = "(?[^#;]*)?";
    
  7. Anchor:

    $anchor = "(#.*)?";
    

And the final regex:

$regex = "#^{$scheme}://{$username}{$host}{$port}(/{$path}{$query}{$anchor}|)$#i";

Note that the / is in the regex, and not the path part since path can be empty.

Also note that I have not tested this. It should work, but definitely it needs confirming that each part is correct (as for what to expect in the url).

Also also note that this is only one way of doing it. You could use other tools that don't need regexp or a library or framework that'll be easier to maintain in the long run.

Best of luck


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...