Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
270 views
in Technique[技术] by (71.8m points)

label empty or too long - python urllib2

I am having a strange situation:

i am curling urls like this:

def check_urlstatus(url):
  h = httplib2.Http()
  try:
      resp = h.request("http://" + url, 'HEAD')        
      if int(resp[0]['status']) < 400:
          return 'ok'
      else:
          return 'bad'
  except httplib2.ServerNotFoundError:
      return 'bad'

if I try to test this with:

if check_urlstatus('.f.de') == "bad": #<--- error happening here
   #..
   #..

it is saying:

UnicodeError: label empty or too long

what is the problem i am causing here?

EDIT: here is the traceback with idna. I guess, it tries to split the input by . and in this case, first label is empty which is the pace before the first ..

enter image description here

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The problem is your URL cannot properly be encoded as per the IDNA rules, which govern how internationalized domain names are converted:

The conversions between ASCII and non-ASCII forms of a domain name are accomplished by algorithms called ToASCII and ToUnicode. These algorithms are not applied to the domain name as a whole, but rather to individual labels. For example, if the domain name is www.example.com, then the labels are www, example, and com. ToASCII or ToUnicode are applied to each of these three separately.

The details of these two algorithms are complex, and are specified in RFC 3490. The following gives an overview of their function.

ToASCII leaves unchanged any ASCII label, but will fail if the label is unsuitable for the Domain Name System. If given a label containing at least one non-ASCII character, ToASCII will apply the Nameprep algorithm, which converts the label to lowercase and performs other normalization, and will then translate the result to ASCII using Punycode[16] before prepending the four-character string "xn--".[17] This four-character string is called the ASCII Compatible Encoding (ACE) prefix, and is used to distinguish Punycode encoded labels from ordinary ASCII labels. The ToASCII algorithm can fail in several ways; for example, the final string could exceed the 63-character limit of a DNS name. A label for which ToASCII fails cannot be used in an internationalized domain name.

In your case a '' (blank) is not a valid domain name character, and you end up with this:

>>> '.f.de'.encode('idna')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/usr/lib/python2.6/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

If you change the domain name to 'a.f.de' it should not raise this exception.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...