
Python Beautiful Soup and Regex - Double quotes not getting replaced

I am trying to scrape this website using BeautifulSoup and regex. While doing so, I encountered a question containing "double quotes", and I wanted to strip out the double quotes before saving the text to a .txt file. But the double quotes are not being replaced. I tried the .replace() method, but it did not work. The code is as follows:

import re
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.sanfoundry.com/operating-system-mcqs-process-scheduling-queue/'
r = requests.get(url)
soup = bs(r.content, 'html.parser')
data = soup.find_all('div', {'class': 'entry-content'})
data1 = data[0].text
pattern = r'^\d{1,2}[.|)]([\s|\S].*)|(^[a-z]\)\s.*)|^View Answer\s?(Answer:.*)'
#pattern = r'^\d{1,2}[.|)]\s*(.*)|(^[a-z]\)\s.*)|^View Answer\s?(Answer:.*)'
reg = re.compile(pattern)

with open(r'C:\Users\Jeri_Dabba\Google Drive\Python\Data Scraping\yb.txt', 'a') as f:
    for i in data1.split('\n'):
        match = reg.search(i)          # search() returns None when no alternative matches
        if match and match.group(1):   # guard against None before reading group(1)
            y = match.group(1)
            y = y.replace('"', '')
            f.write(y + '\n')

When I checked the .txt file, the double quotes had not been replaced. What might be the problem?

I am new to Python.



1 Reply


This website includes characters that aren't 'normal' double quote characters, i.e. not the ASCII quotation mark " (U+0022).

The site uses left and right double quotation marks, Unicode U+201C (“) and U+201D (”), which '"'.replace() will never match.
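
To confirm which characters are actually present, a quick diagnostic (a minimal sketch, assuming data1 holds the scraped text as in the question) is to print the code point of every non-ASCII character:

# List each distinct non-ASCII character in the scraped text
# together with its Unicode code point.
for ch in sorted(set(c for c in data1 if ord(c) > 127)):
    print(repr(ch), 'U+%04X' % ord(ch))  # e.g. '“' U+201C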

You can replace these:

y = y.replace('"', '')
y = y.replace('“', '')
y = y.replace('”', '')
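
Equivalently, all three quote variants can be removed in one pass with str.translate, which deletes any code point mapped to None; this is a minimal sketch covering just these three characters:

# Map the ASCII quote (U+0022) and the curly quotes (U+201C, U+201D)
# to None so str.translate deletes them in a single pass.
quote_map = {0x0022: None, 0x201C: None, 0x201D: None}
y = y.translate(quote_map)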
