html - How can I use the python HTMLParser library to extract data from a specific div tag?

Question

Welcome To Ask or Share your Answers For Others

html - How can I use the python HTMLParser library to extract data from a specific div tag?

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

html - How can I use the python HTMLParser library to extract data from a specific div tag?

I am trying to get a value out of a HTML page using the python HTMLParser library. The value I want to get hold of is within this html element:

...
<div id="remository">20</div>
...

This is my HTMLParser class so far:

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.seen = {}

  def handle_starttag(self, tag, attributes):
    if tag != 'div': return
    for name, value in attributes:
    if name == 'id' and value == 'remository':
      #print value
      return

  def handle_data(self, data):
    print data


p = LinksParser()
f = urllib.urlopen("http://domain.com/somepage.html")
html = f.read()
p.feed(html)
p.close()

Can someone point me in the right direction? I want the class functionality to get the value 20.

question from:https://stackoverflow.com/questions/3276040/how-can-i-use-the-python-htmlparser-library-to-extract-data-from-a-specific-div

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-06T17:08:51+0000

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.recording = 0
    self.data = []

  def handle_starttag(self, tag, attributes):
    if tag != 'div':
      return
    if self.recording:
      self.recording += 1
      return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        break
    else:
      return
    self.recording = 1

  def handle_endtag(self, tag):
    if tag == 'div' and self.recording:
      self.recording -= 1

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)

self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.

The data at the end of the parse are left in self.data (a list of strings, possibly empty if no triggering tag was met). Your code from outside the class can access the list directly from the instance at the end of the parse, or you can add appropriate accessor methods for the purpose, depending on what exactly is your goal.

The class could be easily made a bit more general by using, in lieu of the constant literal strings seen in the code above, 'div', 'id', and 'remository', instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it -- I avoided that cheap generalization step in the code above to avoid obscuring the core points (keep track of a count of nested tags and accumulate data into a list when the recording state is active).

Categories

html - How can I use the python HTMLParser library to extract data from a specific div tag?

html - How can I use the python HTMLParser library to extract data from a specific div tag?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags