javascript - HTML tag appears empty when parsing it with BeautifulSoup but has content when opened in browser

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have an issue parsing an HTML page with BS4. The page contains a hidden div whose content I want to read using BeautifulSoup. That content is generated dynamically by a JavaScript function that is triggered via body onload.

The problem is: when I open the page in my browser, the tag has the content it is supposed to have. When I parse the same page via BS4, the tag is empty.

I could not find any information about BS4 being unable to handle JavaScript content generated on load, so I am not sure what the issue may be here.

Python script:

import urllib.request
from bs4 import BeautifulSoup

import time
import datetime
eT = time.time()

version = 1
vNum = str(version)

t = datetime.datetime.now()

# Zero-padded date stamp in YYYYMMDD form
dStr = t.strftime('%Y%m%d')

resultFile = 'output/classAndIdList-' + dStr + '-v' + vNum + '.txt'
pageListFile = 'input/quickListFR.txt'
f = open(pageListFile, mode='r', encoding='utf-8')

urlRoot = 'http://dev.example.com/'

fOut = open(resultFile, 'w')
ciList = []

# for url in urls.split('\n'):
for l in f:
    u = l.rstrip()  
    url = urlRoot + u
    html_content = urllib.request.urlopen(url)
    time.sleep(1)
    html_text = html_content.read()
    soup = BeautifulSoup(html_text, 'html.parser')
    ciTag = soup.find(id="testDivCSS")
    print(ciTag)
    ciString = ciTag.get_text()
    # print(ciString)
    ciArray = ciString.split(',')
    # print(ciArray)
    for c in ciArray:
        if c not in ciList:
            ciList.append(c)
            fOut.write(c + '\n')
            print(c)
    print(u + '... DONE')       
fOut.close()

Example result page via BeautifulSoup:

Example-page-1.html... DONE
<div id="testDivCSS" style="display: none;"> </div>

And the div in the browser (indicating that the PHP and JavaScript parts work fine):

<div id="testDivCSS" style="display: none;">div#menu_rightup,div#social,div#sidebar,div#specific,div#menu_rightdown,div#footer</div>


1 Reply

深蓝 · Answer 1 · 2021-10-23T20:07:18+0000

BeautifulSoup cannot handle content generated dynamically by JavaScript: it only parses the HTML it is given and does not execute scripts. You can use a browser automation tool (such as Selenium) to load the whole page first, including the dynamic part, and then parse the result with BeautifulSoup.
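
The empty tag can be reproduced without any network request. The following is a minimal sketch, assuming a page shaped like the one in the question; the markup and the `fillDiv` function name are made up for illustration. The parser keeps the script as inert text and never runs it, so the div stays as it was served.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, shaped like the one described in the question:
# the div is empty in the served HTML and is only filled in by a script
# that a browser would run on load.
RAW_HTML = """
<html>
  <head>
    <script>
      function fillDiv() {
        document.getElementById("testDivCSS").textContent = "div#menu_rightup,div#social";
      }
    </script>
  </head>
  <body onload="fillDiv()">
    <div id="testDivCSS" style="display: none;"> </div>
  </body>
</html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The parser sees the div exactly as served: whitespace only.
div_text = soup.find(id="testDivCSS").get_text().strip()

# The script is present in the tree, but only as text that is never executed.
script_source = soup.find("script").string

print(repr(div_text))
print("fillDiv" in script_source)
```

This is the same situation as in the question's script: `urllib.request.urlopen` returns the markup as the server sent it, before any `onload` handler has had a chance to run.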

Refer to this question: How to retrieve the values of dynamic html content using Python
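
A rough sketch of the approach suggested above (render the page in a browser, then parse the result), assuming Selenium 4 with Chrome installed locally. The helper names `fetch_rendered_html` and `extract_selectors` and the 10-second timeout are illustrative choices, not part of the original post.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, div_id="testDivCSS", timeout=10):
    """Load url in headless Chrome and return the DOM after scripts have run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # The div is hidden (display: none), so read textContent rather than
        # the element's visible text, and wait until the script has filled it.
        WebDriverWait(driver, timeout).until(
            lambda d: (
                d.find_element(By.ID, div_id).get_attribute("textContent") or ""
            ).strip()
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_selectors(html, div_id="testDivCSS"):
    """Return the comma-separated entries of the div, or [] if absent or empty."""
    tag = BeautifulSoup(html, "html.parser").find(id=div_id)
    if tag is None:
        return []
    return [s.strip() for s in tag.get_text().split(",") if s.strip()]


# Usage (needs Chrome and the selenium package):
# html = fetch_rendered_html("http://dev.example.com/Example-page-1.html")
# print(extract_selectors(html))
```

In the question's loop, this would replace the `urllib.request.urlopen(url)` and `read()` calls; the explicit wait also makes the fixed `time.sleep(1)` unnecessary. Starting one browser per URL is slow, so for a long page list it may be worth creating the driver once and reusing it.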
