Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Unable to fetch the rest of the names leading to the next pages from a webpage using requests

I've created a script to get names from this website, filtering State Province to Alabama and Country to United States in the search box. The script can parse the names from the first page, but I can't figure out how to get the results from the subsequent pages using requests.

There are two ways to get all the names. Option one: use the "show all 410" link. Option two: follow the next-page button.

I've tried with (capable of grabbing names from the first page):

import re
import requests
from bs4 import BeautifulSoup

URL = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"
params = {
    'errorpath': '/CCI/Verify/CCI/Credential_Verification.aspx'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    # Load the search page first to obtain cookies and the hidden form fields.
    r = s.get(URL)

    # The query-string keys are embedded in inline JavaScript on the page.
    params['WebsiteKey'] = re.search(r"gWebsiteKey[^']+'(.*?)'", r.text).group(1)
    params['hkey'] = re.search(r"gHKey[^']+'(.*?)'", r.text).group(1)
    soup = BeautifulSoup(r.text, "lxml")
    # Copy every named <input> (hidden state fields included) into the POST body.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    # Apply the search filters: State Province = Alabama, Country = United States.
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input4$DropDown1'] = 'AL'
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input5$DropDown1'] = 'United States'

    # Submit the search; the response contains only the first page of results.
    r = s.post(URL, params=params, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

In case someone suggests a solution based on selenium: I've already succeeded with that approach (shown below), but I'd rather not go that route:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"

with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,15)

    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input4_DropDown1']")))).select_by_value("AL")
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input5_DropDown1']")))).select_by_value("United States")
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='SubmitButton']"))).click()
    wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(.,'show all')]"))).click()
    wait.until(EC.invisibility_of_element_located((By.XPATH, "//span[@id='ctl01_LoadingLabel' and .='Loading']")))
    soup = BeautifulSoup(driver.page_source,"lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

How can I get the rest of the names, which are spread across the following pages, using the requests module?

Question from: https://stackoverflow.com/questions/65642333/unable-to-fetch-the-rest-of-the-names-leading-to-the-next-pages-from-a-webpage-u


1 Reply

First, click that link in Chrome with the Network panel open. Then look at the Form Data of the request it sends.

Pay extra attention to __EVENTTARGET and __EVENTARGUMENT.

Next, inspect one of the next-page links. They look like this:

<a onclick="return false;" title="Go to page 2" class="rgCurrentPage" href="javascript:__doPostBack('ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Grid1$ctl00$ctl02$ctl00$ctl07','')"><span>2</span></a>

The two arguments to __doPostBack go into __EVENTTARGET and __EVENTARGUMENT respectively. Everything else should match what you see in the Network panel (headers as well as form data).
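As a rough sketch of that step, the two `__doPostBack` arguments can be read from a pager link's `href` and merged into the form fields of the most recent response. The names `extract_postback`, `InputCollector`, and `build_postback_payload` are hypothetical helpers introduced here, not part of the original answer, and the sample HTML is made up for illustration. The sketch assumes that submit buttons must be left out of the body, since a browser does not send them when a form is submitted from script; whether this particular site needs anything beyond these fields has to be checked against the Network panel.

```python
import re
from html.parser import HTMLParser

# Matches __doPostBack('<target>','<argument>') inside an href value.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")

# Input types a browser leaves out when the form is submitted from script.
SKIPPED_TYPES = {"submit", "button", "image", "reset"}


def extract_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


class InputCollector(HTMLParser):
    """Collects name/value pairs of <input> elements a browser would submit.

    <select> and <textarea> are not collected; set those keys by hand,
    as the question's script does for the two dropdowns.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        input_type = (attr.get("type") or "text").lower()
        if not name or input_type in SKIPPED_TYPES:
            return
        is_toggle = input_type in {"checkbox", "radio"}
        if is_toggle and "checked" not in attr:
            return  # unchecked boxes are not sent
        value = attr.get("value")
        if value is None:
            value = "on" if is_toggle else ""
        self.fields[name] = value


def build_postback_payload(html, event_target, event_argument=""):
    """Build a POST body from the page's inputs plus the postback fields."""
    collector = InputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


# Illustrative input only: a shortened, made-up page and pager link.
sample_html = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc">'
    '<input type="submit" name="SubmitButton" value="Find"></form>'
)
sample_href = "javascript:__doPostBack('Grid1$ctl00$ctl07','')"

target, argument = extract_postback(sample_href)
print(build_postback_payload(sample_html, target, argument))
```

With the question's script, the returned dict could then be sent as s.post(URL, params=params, data=payload), with the dropdown keys set as before and the HTML taken from the latest response so the hidden state fields are current.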

It will be helpful to proxy requests through Charles or Fiddler so you can compare the requests side by side.
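If a proxy is not at hand, a similar comparison can be approximated in code. The sketch below assumes the URL-encoded request body has been copied out of the browser's Network panel as a string, and it reports which fields differ from the payload a script is about to send. The helper name `diff_form_data` and the sample values are hypothetical.

```python
from urllib.parse import parse_qsl


def diff_form_data(browser_body, script_payload):
    """Compare a URL-encoded body captured from the browser with a payload dict."""
    # keep_blank_values=True keeps fields such as an empty __EVENTARGUMENT.
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    shared = browser.keys() & script_payload.keys()
    return {
        # Sent by the browser but absent from the script's payload.
        "missing": sorted(set(browser) - set(script_payload)),
        # Sent by the script but not by the browser.
        "extra": sorted(set(script_payload) - set(browser)),
        # Present in both, with different values.
        "changed": sorted(k for k in shared if browser[k] != script_payload[k]),
    }


# Illustrative values only, not captured from the real site.
captured = "__EVENTTARGET=Grid1%24ctl07&__EVENTARGUMENT=&__VIEWSTATE=abc"
mine = {"__EVENTTARGET": "", "__VIEWSTATE": "abc", "SubmitButton": "Find"}
print(diff_form_data(captured, mine))
```

Fields listed under "missing" or "changed" are the first candidates to fix in the script's payload. Headers still need to be compared separately.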
