Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
602 views
in Technique[技术] by (71.8m points)

python - Why does bs4 return tags and then an empty list to this find_all() method?

Looking at US Census QFD I'm trying to grab the race % by county. The loop I'm building is outside the scope of my question, which concerns this code:

url = 'http://quickfacts.census.gov/qfd/states/48/48507.html'
#last county in TX; for some reason the qfd #'s counties w/ only odd numbers
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

c_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[0] #c = county %
s_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[1] #s = state %

Which grabs the html element including its tags, not just the text within it:

c_black_alone, s_black_alone

(<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>,
 <td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>)

Above ^, I only want the %'s inside the elements...

Furthermore, why does

test_black = soup.find_all("td", text = "Black")

not return the same element as above (or its text), but instead returns an empty bs4 ResultSet object? (Edit: I have been following along with the documentation, so I hope this question doesn't seem too vague...)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

To get the text from those matches, use .text to get all contained text:

>>> soup.find_all("td", attrs={'headers':'rp9'})[0].text
u'96.9%'
>>> soup.find_all("td", attrs={'headers':'rp9'})[1].text
u'80.3%'

Your text search doesn't match anything for two reasons:

  1. A literal string only matches the whole contained text, not a partial match. It'll only work for element with <td>Black</td> as the sole contents.
  2. It will use the .string property, but that property is only set if the text is the only child of a given element. If there are other elements present, the search will fail entirely.

The way around this is by using a lambda instead; it'll be passed the whole element and you can validate each element:

soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)

Demo:

>>> soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
[<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a)  <!-- RHI225213 --> </td>, <td id="re6" valign="top">Black-owned firms, percent, 2007  <!-- SBO315207 --> </td>]

Both of these matches have a comment in the <td> element, making a search with a text match ineffective.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...