Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
533 views
in Technique by (71.8m points)

python 3.x - Change proxy in chromedriver for scraping purposes

I'm scraping Bet365, probably one of the trickiest websites I've encountered, with Selenium and Chrome. The issue with this page is that, even though my scraper sleeps between actions so it never runs faster than a human could, at some point, sometimes, it blocks my IP for a random amount of time (between half an hour and 2 hours).

So I'm looking into proxies to change my IP and resume scraping, and this is where I'm somewhat stuck deciding how to approach it.

I've used 2 different free proxy providers, as follows.

https://gimmeproxy.com

I wasn't able to make this one work (I'm emailing their support), but what I have, which should work, is as follows:

import requests

api = "MY_API_KEY"  # with the free plan I can ask for an IP 240 times a day
adder = "&post=true&supportsHttps=true&maxCheckPeriod=3600"

url = "https://gimmeproxy.com/api/getProxy?"
r = requests.get(url=url, params=adder)

# THIS IS EDITED: the api_key parameter was missing from the query string
apik = "api_key={}".format(api)
r = requests.get(url=url, params=apik + adder)

...and I got no answer, just a 404 Not Found error. EDIT: NOW IT WORKS, MY BAD.
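For completeness, once the request succeeds, the JSON body still has to be turned into the ip:port string that the --proxy-server flag expects. A minimal sketch of that step; the field names (ipPort, ip, port) are assumptions based on gimmeproxy's response format, so check them against the actual payload:

```python
def extract_proxy(payload):
    """Build an 'ip:port' string from a gimmeproxy-style JSON payload.

    Prefers the combined 'ipPort' field and falls back to joining
    'ip' and 'port'. Field names are assumptions, not verified API docs.
    """
    if "ipPort" in payload:
        return payload["ipPort"]
    return "{}:{}".format(payload["ip"], payload["port"])

# Hypothetical usage with the request above:
# proxy = extract_proxy(r.json())

# Offline sanity check with a sample payload:
sample = {"ip": "190.7.158.58", "port": "39871", "ipPort": "190.7.158.58:39871"}
print(extract_proxy(sample))  # 190.7.158.58:39871
```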

My second approach is through another site, sslproxies.

With this one, you scrape the page and get a list of 100 IPs, theoretically checked and working. So I've set up a loop that tries a random IP from that list and, if it doesn't work, deletes it from the list and tries again. This approach works when trying to open Bet365.

import random
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

for n in range(1, 100):
  proxy_index = random.randint(0, len(proxies) - 1)
  proxy = proxies[proxy_index]

  PROXY = proxy['ip'] + ':' + proxy['port']
  chrome_options = webdriver.ChromeOptions()
  chrome_options.add_argument('--proxy-server={}'.format(PROXY))

  url = "https://www.bet365.es"

  try:
     browser = webdriver.Chrome(path, options=chrome_options)
     browser.get(url)
     WebDriverWait(browser, 10)..... # no need to post the whole condition
     break

  except Exception:
     del proxies[proxy_index]
     browser.quit()

Well, with this one I succeeded in opening Bet365, and I'm still checking, but I think this webdriver is going to be much slower than my original one without a proxy.

So, my question is: is it expected that scraping through a proxy will be much slower, or does it depend on the proxy used? If so, can anyone recommend a different (or, surely, better) approach?
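One way to answer the speed question empirically is to time the same page load with and without the proxy. A sketch of that measurement; the timing helpers are plain Python, while the Selenium calls mirror the question's setup and are left as comments since they need live browser instances:

```python
import time

def time_call(fn, repeats=3):
    """Run fn() several times and return the list of elapsed seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - start)
    return elapsed

def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)

# Hypothetical usage against the question's setup, with one direct
# and one proxied browser already created:
# direct  = time_call(lambda: browser_direct.get("https://www.bet365.es"))
# proxied = time_call(lambda: browser_proxied.get("https://www.bet365.es"))
# print("direct  mean: %.2fs" % mean(direct))
# print("proxied mean: %.2fs" % mean(proxied))

# Offline sanity check with a cheap function instead of a page load:
print(mean(time_call(lambda: sum(range(1000)))))
```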



1 Reply

0 votes
by (71.8m points)

I don't see any significant issue in either your approach or your code block. However, another approach would be to make use of the proxies listed in the Free Proxy List, whose Last Checked column shows how recently each entry was verified.

As a solution, you can write a script to grab all the available proxies and build the list dynamically every time you initialize your program. The following program invokes the proxies from the list one by one until a proxied connection is established and verified, by checking that the page title of https://www.bet365.es contains the text bet365. An exception may still arise when the free proxy your program grabbed is overloaded with users trying to route their traffic through it.

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get("https://sslproxies.org/")
    driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//th[contains(., 'IP Address')]"))))
    ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 1]")))]
    ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 2]")))]
    driver.quit()
    proxies = []
    for i in range(0, len(ips)):
        proxies.append(ips[i] + ':' + ports[i])
    print(proxies)
    for i in range(0, len(proxies)):
        try:
            print("Proxy selected: {}".format(proxies[i]))
            options = webdriver.ChromeOptions()
            options.add_argument('--proxy-server={}'.format(proxies[i]))
            driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
            driver.get("https://www.bet365.es")
            if WebDriverWait(driver, 20).until(EC.title_contains("bet365")):
                # Do your scraping here
                break
        except Exception:
            driver.quit()
    print("Proxy was invoked")
    
  • Console Output:

    ['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
    
    Proxy selected: 190.7.158.58:39871
    Proxy selected: 175.139.179.65:54980
    Proxy selected: 186.225.45.146:45672
    Proxy selected: 185.41.99.100:41258
    
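Launching a full Chrome instance for every candidate proxy is expensive, and free proxies fail often, so it can pay to pre-filter the scraped list with a cheap HTTP probe and only hand survivors to Selenium. A sketch of that idea using the standard library's urllib (the 3-second timeout and the httpbin.org probe URL are arbitrary choices of this sketch, not part of the answer above):

```python
import urllib.request

def proxy_works(proxy, probe_url="http://httpbin.org/ip", timeout=3):
    """Return True if an HTTP request routed through `proxy` ('ip:port') succeeds."""
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(probe_url, timeout=timeout) as resp:
            return resp.getcode() == 200
    except Exception:
        return False

# Hypothetical usage with the scraped list from the answer above:
# live = [p for p in proxies if proxy_works(p)]
# print("{} of {} proxies responded".format(len(live), len(proxies)))
```

Only the surviving entries then need a Chrome launch, which keeps the slow part of the loop short.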
