Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - BeautifulSoup sibling structure with br tags

I'm trying to parse an HTML document using the BeautifulSoup Python library, but the structure is getting distorted by <br> tags. Here is an example.

Input HTML:

<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>

HTML that BeautifulSoup interprets:

<div>
  some text
  <br>
    <span> some more text </span>
    <br>
      <span> and more text </span>
    </br>
  </br>
</div>

In the source, the two spans are siblings. After parsing with the default parser they no longer are: each <br> is treated as an open tag that wraps everything after it, so the second span ends up nested inside the first <br>'s subtree.

The only solution I can think of is to strip the <br> tags altogether before passing the HTML to BeautifulSoup, but that doesn't seem very elegant, as it requires me to change the input. What's a better way to solve this?



1 Reply


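Your best bet is to extract() the line breaks after parsing, so the input itself stays untouched. It's easier than you think :). extract() removes a tag from the tree and returns it, which is why the interpreter echoes each <br/> in the session below.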

>>> from bs4 import BeautifulSoup as BS
>>> html = """<div>
...   some text <br>
...   <span> some more text </span> <br>
...   <span> and more text </span>
... </div>"""
>>> soup = BS(html, 'lxml')  # parser named explicitly; the <html>/<body> wrapper in the output below is what lxml adds
>>> for linebreak in soup.find_all('br'):
...     linebreak.extract()
... 
<br/>
<br/>
>>> print(soup.prettify())
<html>
 <body>
  <div>
   some text
   <span>
    some more text
   </span>
   <span>
    and more text
   </span>
  </div>
 </body>
</html>
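
To confirm that the spans really end up as siblings, the same idea can be wrapped in a small function and checked directly. This is only a sketch under a few assumptions: the helper name `strip_line_breaks` is made up for illustration, `bs4` is installed, and the standard-library `html.parser` backend is used so nothing else is required. Recent parser versions may already treat <br> as a void element, in which case removing the tags is harmless and simply tidies the tree.

```python
from bs4 import BeautifulSoup

HTML = """<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>"""


def strip_line_breaks(markup, parser="html.parser"):
    """Parse markup and detach every <br> tag from the resulting tree.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all returns a list snapshot, so removing tags while looping is safe.
    for br in soup.find_all("br"):
        br.extract()
    return soup


soup = strip_line_breaks(HTML)
spans = soup.find_all("span")
first, second = spans

# The second span should now be reachable as a sibling of the first.
print(first.find_next_sibling("span") is second)      # True
print([s.get_text(strip=True) for s in spans])        # ['some more text', 'and more text']
```

If the removed tags are never needed again, `br.decompose()` would also work here; it destroys the tag instead of returning it.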
