Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


web scraping - BeautifulSoup and List indexing

I am still quite new to web scraping and am practicing some basics, such as this one. I have scraped the categories from the 'th' tags and the players from the 'tr' tags and appended them to two empty lists. The categories come out fine from get_text(), but when I print the players, each string has the rank number in front of the first name and the team abbreviation right after the last name. I am trying to do three things:

1) Output only the first and last name of each player. I could slice each string in the list, but I cannot figure out an easy way to do it. There is probably a quicker way, such as selecting by class inside the tags or calling soup.findAll again on the HTML, or something else I am unaware of, but I do not know what I am missing.

2) Take the rank number before each name and append it to an empty list.

3) Take the team abbreviation letters after each name and append them to an empty list.

Any suggestions would be much appreciated!

from bs4 import BeautifulSoup
import requests

players = []
categories = []

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = BeautifulSoup(source.text, 'lxml')

# Column headers
for i in soup.find_all('th'):
    c = i.get_text()
    categories.append(c)

# One string per table row; rank, name and team run together here
for i in soup.find_all('tr'):
    player = i.get_text()
    players.append(player)

# Skip the header row and keep the 50 player rows
players = players[1:51]

print(categories)
print(players)


1 Reply


In my opinion, APIs are always the best way to go.

However, this can also be done with pandas .read_html(), which parses the tables for you (it uses lxml by default and falls back to BeautifulSoup under the hood).

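import pandas as pd

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'

# read_html returns a list with one DataFrame per <table> on the page
dfs = pd.read_html(url)
# Split the fused Name text into the player name and the trailing uppercase team code
dfs[0][['Name','Team']] = dfs[0]['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
# The page splits the data across two tables; join them on the shared row index
df = dfs[0].join(dfs[1])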
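
To see what the pattern passed to str.extract does, it can be tried on plain strings with the standard re module. This is only a sketch: the sample strings are assumptions reconstructed from the names and teams in this thread, and split_name_team is a hypothetical helper, not part of pandas.

```python
import re

# Same pattern as in the str.extract call above.
NAME_TEAM = re.compile(r'^(.*?)([A-Z]+)$')

# Assumed examples of the fused cell text: player name followed by team code.
samples = ['James HardenHOU', 'CJ McCollumPOR', 'Donovan MitchellUTAH']


def split_name_team(text):
    """Return (name, team); team is None when the text has no uppercase tail."""
    match = NAME_TEAM.match(text)
    return match.groups() if match else (text, None)


pairs = [split_name_team(s) for s in samples]
for name, team in pairs:
    print(name, '|', team)
```

The lazy first group stops at the earliest point where everything left is uppercase, so capitals inside a name (CJ, McCollum) stay with the name. A name that itself ends in capital letters would be split in the wrong place, so it is worth checking the result for such cases.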

Printing a few columns of df gives this output:

print(df[['RK','Name','Team','POS']])

    RK                     Name  Team POS
0    1             James Harden   HOU  SG
1    2            Stephen Curry    GS  PG
2    3             Bradley Beal   WSH  SG
3    4               Trae Young   ATL  PG
4    5             Kevin Durant   BKN  SF
5    6              CJ McCollum   POR  SG
6    7             Kyrie Irving   BKN  PG
7    8             Jaylen Brown   BOS  SG
8    9    Giannis Antetokounmpo   MIL  PF
9   10             Jayson Tatum   BOS  PF
10  11           Damian Lillard   POR  PG
11  12              Luka Doncic   DAL  PG
12  13            Collin Sexton   CLE  PG
13  14              Paul George   LAC  SG
14  15           Brandon Ingram    NO  SF
15  16             Nikola Jokic   DEN   C
16  17             LeBron James   LAL  SF
17  18              Zach LaVine   CHI  SG
18  19           Christian Wood   HOU  PF
19  20            Kawhi Leonard   LAC  SF
20  21              Joel Embiid   PHI   C
21  22             Jerami Grant   DET  PF
22  23            Anthony Davis   LAL  PF
23  24             Jamal Murray   DEN  PG
24  25            Julius Randle    NY  PF
25  26          Malcolm Brogdon   IND  PG
26  27            Fred VanVleet   TOR  SG
27  28           Nikola Vucevic   ORL   C
28  28         Donovan Mitchell  UTAH  SG
29  30             Terry Rozier   CHA  PG
30  31             Devin Booker   PHX  SG
31  32          Khris Middleton   MIL  SF
32  33            Terrence Ross   ORL  SG
33  33           Victor Oladipo   IND  SG
34  35        Russell Westbrook   WSH  PG
35  36         Domantas Sabonis   IND  PF
36  36             De'Aaron Fox   SAC  PG
37  38          Zion Williamson    NO  SF
38  39            Tobias Harris   PHI  SF
39  40              Bam Adebayo   MIA   C
40  41            DeMar DeRozan    SA  SG
41  41         D'Angelo Russell   MIN  SG
42  43           Gordon Hayward   CHA  SF
43  44               Kyle Lowry   TOR  PG
44  44  Shai Gilgeous-Alexander   OKC  SG
45  46              Mike Conley  UTAH  PG
46  47            Malik Beasley   MIN  SG
47  48               RJ Barrett    NY  SG
48  49            Thomas Bryant   WSH   C
49  50            Pascal Siakam   TOR  PF
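
If you would rather stay with BeautifulSoup, the three lists from the question can often be filled without any slicing when the rank, name and team sit in separate tags, because stripped_strings yields each text node on its own. The sketch below assumes that is the case; the HTML is a hypothetical stand-in for the first table, not ESPN's real markup, so inspect the page before relying on it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the first table; the real ESPN markup may differ.
html = """
<table>
  <tr><th>RK</th><th>Name</th></tr>
  <tr><td>1</td><td><a>James Harden</a><span>HOU</span></td></tr>
  <tr><td>2</td><td><a>Stephen Curry</a><span>GS</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

ranks, names, teams = [], [], []
for row in soup.find_all('tr'):
    # stripped_strings yields every non-empty text node in the row.
    parts = list(row.stripped_strings)
    if row.find('td') is None or len(parts) != 3:
        continue  # skip the header row and any row with an unexpected shape
    rank, name, team = parts
    ranks.append(int(rank))
    names.append(name)
    teams.append(team)

print(ranks, names, teams)
```

If the real cells do not separate the three pieces into their own tags, get_text() will still return them fused together, and a regular expression like the one used with str.extract is the fallback.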


...