Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

javascript - HTML tag appears empty when parsing it with BeautifulSoup but has content when opened in browser

I have an issue when parsing an HTML page with BS4. The page contains a hidden div whose content I want to read using BeautifulSoup. That content is generated dynamically by a JavaScript function that is triggered via body onload.

The problem is: when I open the page in my browser, the tag has the content it is supposed to have. When I parse the same page via BS4, the tag is empty.

I could not find any information about BS4 being unable to handle content generated by onload JavaScript, so I am not sure what the issue may be here.

Python script:

import datetime
import time
import urllib.request

from bs4 import BeautifulSoup

version = 1
vNum = str(version)

# Zero-padded YYYYMMDD date string used in the output file name.
dStr = datetime.datetime.now().strftime('%Y%m%d')

resultFile = 'output/classAndIdList-' + dStr + '-v' + vNum + '.txt'
pageListFile = 'input/quickListFR.txt'

urlRoot = 'http://dev.example.com/'

ciList = []

# for url in urls.split('\n'):
with open(pageListFile, mode='r', encoding='utf-8') as f, open(resultFile, 'w') as fOut:
    for l in f:
        u = l.rstrip()
        url = urlRoot + u
        html_content = urllib.request.urlopen(url)
        time.sleep(1)
        html_text = html_content.read()
        # Name the parser explicitly so the result does not depend on which parsers are installed.
        soup = BeautifulSoup(html_text, 'html.parser')
        ciTag = soup.find(id="testDivCSS")
        print(ciTag)
        ciString = ciTag.get_text()
        # print(ciString)
        ciArray = ciString.split(',')
        # print(ciArray)
        for c in ciArray:
            # Write each class/id entry only the first time it is seen.
            if c not in ciList:
                ciList.append(c)
                fOut.write(c + '\n')
                print(c)
        print(u + '... DONE')

Example result page via BeautifulSoup:

Example-page-1.html... DONE
<div id="testDivCSS" style="display: none;"> </div>

And the div as shown in the browser (indicating that the PHP and JavaScript parts work fine):

<div id="testDivCSS" style="display: none;">div#menu_rightup,div#social,div#sidebar,div#specific,div#menu_rightdown,div#footer</div>

1 Reply

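BeautifulSoup cannot handle content that is generated dynamically by JavaScript. It only parses the HTML it is given, and urllib.request.urlopen returns the page source as sent by the server without running any scripts, so the onload function never fills the div. You can use a browser automation tool (such as Selenium) to get the whole page first, including the dynamic part, and then use BeautifulSoup to parse it.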
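
As a rough sketch of that approach, and not a tested drop-in replacement for the script in the question: the code below assumes Selenium with Firefox and geckodriver is installed, and that the onload function fills the div within ten seconds. The function names rendered_div_text and unique_selectors and the timeout value are made up for illustration; the div id testDivCSS comes from the question.

```python
def unique_selectors(text, seen):
    """Split a comma-separated string and return the entries not seen before."""
    fresh = []
    for item in text.split(','):
        item = item.strip()
        # Skip blanks (an unfilled div yields only whitespace) and repeats.
        if item and item not in seen:
            seen.add(item)
            fresh.append(item)
    return fresh


def rendered_div_text(url, div_id="testDivCSS", timeout=10):
    """Load url in a real browser and return the div text after scripts ran."""
    # Third-party imports are local so unique_selectors works without them.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver exist
    try:
        driver.get(url)
        # The div is hidden, so read textContent; .text is empty when hidden.
        # Wait until the onload script has put something into the div.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.ID, div_id)
                       .get_attribute("textContent").strip()
        )
        html = driver.page_source  # DOM after JavaScript has run
    finally:
        driver.quit()

    tag = BeautifulSoup(html, "html.parser").find(id=div_id)
    return tag.get_text() if tag is not None else ""
```

In the loop from the question, the urlopen and BeautifulSoup lines would then become something like ciString = rendered_div_text(url), followed by unique_selectors(ciString, seen) with seen = set() created once before the loop. Starting a browser per page is slow, so for a long page list it may be worth creating the driver once and reusing it.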

Refer to this question: How to retrieve the values of dynamic html content using Python

