Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


browser - Weird characters in URL

On my web server, when a user requests a URL containing weird characters, I strip those characters and the system logs the case. When I checked the sanitized cases, I found the URLs below. I'm curious what the purpose of these URLs could be.

I checked the IPs: these are real people who use the website normally. But in roughly 1 out of 20 of their requests, the URL has these weird characters at the end.

http://example.com/@%EF%BF%BD%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0,
http://example.com/%60E%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%60E%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/p%EF%BF%BD%1D%01?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%EF%BF%BDC%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%EF%BF%BDR%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD`%EF%BF%BD%EF%BF%BD%7F, agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
http://example.com/%EF%BF%BDe%EF%BF%BDv8%01%EF%BF%BD?o=3&g=P%01%EF%BF%BD&s=&z=%EF%BF%BD%EF%BF%BD%15%01%EF%BF%BD%EF%BF%BD, agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36

http://en.wikipedia.org/wiki/Specials_(Unicode_block)


1 Reply


They are essentially malformed URLs. They can be generated by malware that is trying to exploit website vulnerabilities, by a malfunctioning browser plugin or extension, or by a bug in a JS file (e.g. tracking with Google Analytics) in combination with a specific browser version or operating system. In any case, you can't control which requests a client sends and there is nothing you can do to stop them, so if the HTML/JS code you generate is correct, you have done your job.

If you want to correct those URLs for any reason, you can enable URL rewriting and set a rule with a regular expression filter that transforms them into valid URLs. I don't suggest doing that, though: the web server should respond with a 404 Not Found error, because that is the standard response (it's a client error, after all), and in my opinion this is faster and safer than applying URL rewriting. (A rewriting procedure may contain bugs, which someone could then try to exploit.)
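As a rough illustration of the "just answer 404" approach, here is a minimal Python sketch. The function names `is_malformed` and `status_for` are hypothetical, and treating any U+FFFD or ASCII control character as "malformed" is my assumption, not a rule from any standard; the 200 is only a placeholder for "continue with normal handling".

```python
from urllib.parse import unquote

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def is_malformed(raw_url: str) -> bool:
    """Return True if the percent-decoded URL contains U+FFFD or an
    ASCII control character (assumed policy, adjust to your needs)."""
    decoded = unquote(raw_url, errors="replace")
    return any(
        ch == REPLACEMENT_CHAR or ord(ch) < 0x20 or ord(ch) == 0x7F
        for ch in decoded
    )


def status_for(raw_url: str) -> int:
    """Hypothetical helper: 404 for malformed URLs, 200 as a stand-in
    for normal request handling."""
    return 404 if is_malformed(raw_url) else 200


if __name__ == "__main__":
    print(status_for("/p%EF%BF%BD%1D%01?o=3&g=&s="))
    print(status_for("/index.html?o=3&g=&s="))
```

Nothing is rewritten here: the request is only classified, which keeps the logic small and hard to exploit.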

For the sake of curiosity, you can easily decode those URLs with an online URL decoder of your choice, but essentially you will discover what you already know: those URLs contain a lot of Unicode replacement characters (U+FFFD), encoded as UTF-8.
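If you prefer to decode locally instead of using an online tool, a short Python sketch with the standard library does the same thing. The URL is the last one from the question's log; the variable names are just for illustration.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Last logged URL from the question.
logged = (
    "http://example.com/%EF%BF%BDe%EF%BF%BDv8%01%EF%BF%BD"
    "?o=3&g=P%01%EF%BF%BD&s=&z=%EF%BF%BD%EF%BF%BD%15%01%EF%BF%BD%EF%BF%BD"
)

parts = urlsplit(logged)
path = unquote(parts.path)  # percent-decode the path as UTF-8
query = parse_qs(parts.query, keep_blank_values=True)  # decode each parameter

if __name__ == "__main__":
    # ascii() escapes non-printable characters so they show up in a terminal.
    print(ascii(path))
    for name, values in query.items():
        print(name, ascii(values))
    print("replacement characters in path:", path.count("\ufffd"))
```

The decoded values contain only U+FFFD, a few control characters, and some stray ASCII letters, so there is no hidden readable payload to recover.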

In fact, %EF%BF%BD is the URL-encoded (percent-encoded) form of the 3 bytes (EF BF BD) that represent the Unicode replacement character in UTF-8. You may also see that character written as � (the glyph itself), EF BF BD (its UTF-8 bytes in hex), FFFD (its code point) or ï¿½ (the same three bytes misread as Latin-1), and so on, depending on the representation you choose.
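The different representations can be checked quickly in Python; this is only a sketch, and the variable names are mine.

```python
from urllib.parse import quote, unquote

ch = unquote("%EF%BF%BD")  # percent-decode, interpreting the bytes as UTF-8

code_point = f"U+{ord(ch):04X}"                   # Unicode code point
utf8_hex = ch.encode("utf-8").hex(" ").upper()    # UTF-8 bytes in hex
mojibake = ch.encode("utf-8").decode("latin-1")   # same bytes misread as Latin-1
reencoded = quote(ch)                             # back to the percent-encoded form

if __name__ == "__main__":
    print(code_point, utf8_hex, mojibake, reencoded)
```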

Also, you can check for yourself how the client handles that character. Go here:

http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%EF%BF%BD&mode=char

Press the GO button and, using your browser's developer tools, check what actually happens: the browser encodes the unknown character as %EF%BF%BD before sending it to the web server.
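One plausible way such URLs come into existence (an assumption on my part, not something the logs prove) is that some code on the client decodes garbage bytes as UTF-8 with replacement, and the result is then percent-encoded. The bytes below are made up for the demonstration.

```python
from urllib.parse import quote

# Hypothetical garbage bytes: 0xFF and 0xFE can never appear in valid UTF-8.
raw = b"\xff\xfe\x02"

# A lenient decoder substitutes U+FFFD for each undecodable byte.
text = raw.decode("utf-8", errors="replace")

# Percent-encoding the result yields the pattern seen in the logs.
encoded = quote(text)

if __name__ == "__main__":
    print(ascii(text))
    print(encoded)
```

Once the original bytes have been replaced by U+FFFD they are gone for good, which is why the logged URLs cannot tell you what the client originally meant to send.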

