Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

textmatching - How to match URIs in text?

How would one go about spotting URIs in a block of text?

The idea is to turn such runs of text into links. This is pretty simple if one only considers the http(s) and ftp(s) schemes; however, I am guessing the general problem (covering tel, mailto and other URI schemes) is much more complicated, if it is even possible.

I would prefer a solution in C# if possible. Thank you.

1 Reply

Regexes may prove a good starting point for this, though URIs and URLs are notoriously difficult to match with a single pattern.

To illustrate, the simplest of patterns looks fairly complicated (in Perl 5 notation):

\w+:/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:/[^\s/]*))*

This would match http://example.com/foo/bar-baz

and ftp://192.168.0.1/foo/file.txt

but would cause problems for at least these:

  • mailto:support@stackoverflow.com (no match: there is no //, but there is an @)
  • ftp://192.168.0.1.2 (match, but too many numbers, so it's not a valid URI)
  • ftp://1000.120.0.1 (match, but the IP address needs numbers between 0 and 255, so it's not a valid URI)
  • nonexistantscheme://obvious.false.positive
  • http://www.google.com/search?q=uri+regular+expression (match, but query isn't …)

I think this is a case of the 80:20 rule. If you want to catch most things, then I would do as suggested and find a decent regular expression if you can't write one yourself.
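A quick way to check the claims about the pattern is to run it against the sample strings. The sketch below uses Python's `re` module, whose syntax agrees with Perl 5 for the constructs used here. It assumes the pattern is meant to be written with backslash escapes (`\w`, `\d`, `\.`, `\s`), and it uses `fullmatch` so that a partial hit does not count as a match. The names `URI_PATTERN`, `samples` and `matches_whole` are made up for illustration.

```python
import re

# The pattern from the answer, with its backslash escapes written out.
URI_PATTERN = re.compile(r"\w+:/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:/[^\s/]*))*")

# The sample strings discussed in the answer.
samples = [
    "http://example.com/foo/bar-baz",
    "ftp://192.168.0.1/foo/file.txt",
    "mailto:support@stackoverflow.com",
    "ftp://192.168.0.1.2",
    "ftp://1000.120.0.1",
    "nonexistantscheme://obvious.false.positive",
    "http://www.google.com/search?q=uri+regular+expression",
]


def matches_whole(text):
    """Return True if the pattern matches the entire string."""
    return URI_PATTERN.fullmatch(text) is not None


# Map each sample to whether the pattern accepts it.
results = {s: matches_whole(s) for s in samples}

for sample, ok in results.items():
    print(("match    " if ok else "no match ") + sample)
```

Only the mailto: sample is rejected; the malformed IP addresses and the made-up scheme are accepted, which is the false-positive problem the list describes.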

If you're looking at text pulled from fairly controlled sources (e.g. machine generated), then this will be the best course of action.

If you absolutely, positively have to catch every URI that you encounter, and you're looking at text from the wild, then I think I would look for any word with a colon in it, e.g. \s(\w+:\S+)\s. Once you have a suitable candidate for a URI, pass it to a real URI parser in the URI class of whatever library you're using.
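The two-stage idea (a loose pattern to find candidates, then a real parser to judge them) might look roughly like this. It is a sketch in Python rather than C#, using the standard library's `urllib.parse.urlsplit` as the parser; in .NET, `Uri.TryCreate` would be the natural counterpart. Several things here are my assumptions and not part of the answer: the `KNOWN_SCHEMES` allow-list, the IPv4 sanity check, the trimming of trailing punctuation, and the use of a lookbehind instead of surrounding `\s` so that candidates at the very start or end of the text are still found.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Stage 1: loose candidate pattern, "word characters, a colon, non-space".
# (?<!\S) means "preceded by whitespace or the start of the text".
CANDIDATE = re.compile(r"(?<!\S)\w+:\S+")

# Illustrative allow-list of schemes (an assumption, adjust as needed).
KNOWN_SCHEMES = {"http", "https", "ftp", "ftps", "mailto", "tel"}
# Schemes for which a host is expected after "//".
HOST_SCHEMES = {"http", "https", "ftp", "ftps"}


def looks_like_uri(candidate):
    """Stage 2: let a real parser split the candidate, then sanity-check it."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return False
    scheme = parts.scheme.lower()
    if scheme not in KNOWN_SCHEMES:
        return False
    if scheme in HOST_SCHEMES:
        host = parts.hostname
        if not host:
            return False
        # A host made only of digits and dots must be a valid IPv4 address.
        if re.fullmatch(r"[\d.]+", host):
            try:
                ipaddress.IPv4Address(host)
            except ValueError:
                return False
    return True


def find_uris(text):
    """Return the candidates in `text` that survive the parser check."""
    found = []
    for match in CANDIDATE.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the URI.
        candidate = match.group().rstrip(".,;:!?)")
        if looks_like_uri(candidate):
            found.append(candidate)
    return found
```

For example, `find_uris("see http://example.com/foo, not ftp://1000.120.0.1")` keeps the first address and drops the second. The regular expression stays trivial, and all the judgement about what counts as a URI lives in the second stage, where it is easier to tighten or relax.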

If you're interested in why it's so hard to write a URI pattern, then I guess it would be that the definition of a URI is done with a Type-2 grammar, while regular expressions can only parse languages from Type-3 grammars.

