python - Unable to fetch the rest of the names leading to the next pages from a webpage using requests

Question

posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I've created a script that collects names from this website after filtering the search box by State/Province (Alabama) and Country (United States). The script parses the names from the first page, but I can't figure out how to get the results from the following pages using requests.

There are two ways to get all the names: the "show all 410" link, or the next button.

I've tried with (capable of grabbing names from the first page):

import re
import requests
from bs4 import BeautifulSoup

URL = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"
params = {
    'errorpath': '/CCI/Verify/CCI/Credential_Verification.aspx'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    r = s.get(URL)

    # Pull the two query-string keys out of the inline JavaScript on the landing page
    params['WebsiteKey'] = re.search(r"gWebsiteKey[^']+'(.*?)'",r.text).group(1)
    params['hkey'] = re.search(r"gHKey[^']+'(.*?)'",r.text).group(1)
    soup = BeautifulSoup(r.text,"lxml")
    # Start from every named input on the form (this includes the hidden ASP.NET state fields)
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    # Apply the State/Province and Country filters
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input4$DropDown1'] = 'AL'
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input5$DropDown1'] = 'United States'

    # Submit the search and print the names from the first results page
    r = s.post(URL,params=params,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

In case someone suggests a Selenium-based solution: I've already had success with that approach, but I'd rather not go that route:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"

with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,15)

    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input4_DropDown1']")))).select_by_value("AL")
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input5_DropDown1']")))).select_by_value("United States")
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='SubmitButton']"))).click()
    wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(.,'show all')]"))).click()
    wait.until(EC.invisibility_of_element_located((By.XPATH, "//span[@id='ctl01_LoadingLabel' and .='Loading']")))
    soup = BeautifulSoup(driver.page_source,"lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

How can I get the names from the subsequent pages of that site using the requests module?

question from:https://stackoverflow.com/questions/65642333/unable-to-fetch-the-rest-of-the-names-leading-to-the-next-pages-from-a-webpage-u

1 Reply

深蓝 · Answer 1 · 2021-10-06T18:46:37+0000

First, click that link in Chrome with the network panel open, then look at the Form Data for the request.

Pay extra attention to __EVENTTARGET and __EVENTARGUMENT.

Next, inspect one of those next links. They will look like this:

<a onclick="return false;" title="Go to page 2" class="rgCurrentPage" href="javascript:__doPostBack('ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Grid1$ctl00$ctl02$ctl00$ctl07','')"><span>2</span></a>

The two __doPostBack arguments go in __EVENTTARGET and __EVENTARGUMENT respectively, and everything else should match what you see in the network panel (headers as well as form data).
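As a rough sketch of that step, the two arguments can be pulled out of the link's href with a regular expression and written into a copy of the form fields. The helper names parse_postback and build_postback_payload are made up for illustration, and the "placeholder" value stands in for whatever hidden fields the latest response actually contains; this only shows the mapping, not a tested solution for the site.

```python
import html
import re

# Matches the two quoted arguments of __doPostBack('target','argument')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_RE.search(html.unescape(href))
    return match.groups() if match else None


def build_postback_payload(form_fields, href):
    """Copy the page's form fields and fill in the two postback fields.

    form_fields is assumed to be the dict of input names and values
    scraped from the most recent response (hypothetical helper).
    """
    parsed = parse_postback(href)
    if parsed is None:
        raise ValueError("no __doPostBack call found in href")
    payload = dict(form_fields)
    payload["__EVENTTARGET"], payload["__EVENTARGUMENT"] = parsed
    return payload


# href copied from the pager link shown above
href = (
    "javascript:__doPostBack('ctl01$TemplateBody$WebPartManager1"
    "$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Grid1$ctl00$ctl02"
    "$ctl00$ctl07','')"
)

# "placeholder" is illustrative; the real value comes from the page
payload = build_postback_payload({"__VIEWSTATE": "placeholder"}, href)
print(payload["__EVENTTARGET"])
print(repr(payload["__EVENTARGUMENT"]))
```

In the question's script, the resulting payload would presumably be sent with the same s.post(URL, params=params, data=payload) call, with the form fields re-read from each response before requesting the next page.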

It will be helpful to proxy requests through Charles or Fiddler so you can compare the requests side by side.
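Once both requests are captured, the comparison itself can be done mechanically. Below is a minimal sketch, assuming the form data from the browser and from the script have each been copied into a dict; diff_form_data and the sample field values are hypothetical and not taken from the site.

```python
def diff_form_data(browser, script):
    """Report where the script's form data disagrees with the browser's."""
    shared = set(browser) & set(script)
    return {
        # fields the browser sent but the script did not
        "missing": sorted(set(browser) - set(script)),
        # fields the script sent but the browser did not
        "extra": sorted(set(script) - set(browser)),
        # fields both sent, with different values
        "changed": sorted(k for k in shared if browser[k] != script[k]),
    }


# Illustrative data only; real values come from the captured requests
browser_form = {
    "__EVENTTARGET": "ctl01$Grid1$ctl07",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": "abc",
}
script_form = {
    "__EVENTTARGET": "",
    "__VIEWSTATE": "abc",
    "SubmitButton": "Find",
}

report = diff_form_data(browser_form, script_form)
for kind, keys in report.items():
    print(kind, keys)
```

Fields listed under "extra" are worth a second look: a payload built from every input on the page may include button values that the browser does not send for a __doPostBack request.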
