web crawler - Scrape articles from WSJ with requests, curl and BeautifulSoup


posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm a paid WSJ subscriber and I'm trying to scrape articles for an NLP project. I thought the following code would log me in and keep the session:

import requests

rs = requests.session()
login_url="https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin" 
payload={
    "username":"xxx@email",
    "password":"myPassword",
}
result = rs.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)

Then I request the article I want to parse:

r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y')

But the HTML I get back is still the non-member version.

I also tried another method, using curl to save the cookies after I log in:

curl -c cookies.txt -I "https://www.wsj.com"
curl -v cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html

The result is the same.

I'm not very familiar with how authentication works behind the browser. Can someone explain why both methods above fail, and how I should fix them to reach my goal? Thank you very much.

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:23:37+0000

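Both attempts failed because the login uses OAuth 2.0, not basic authentication. Posting the username and password to the login page URL, or fetching the home page with curl, does not go through the steps that produce an authenticated session.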
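To make the OAuth 2.0 point concrete, it can help to split the login URL from the question into its query parameters. The following is a minimal sketch using only the Python standard library; the helper name `oauth_params` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Login URL copied verbatim from the question.
LOGIN_URL = (
    "https://sso.accounts.dowjones.com/login"
    "?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2"
    "&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin"
    "&scope=openid%20idp_id&response_type=code"
    "&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap"
    "&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj"
    "&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin"
)


def oauth_params(url):
    """Return the query parameters of url as a flat dict (first value of each)."""
    query = urlsplit(url).query  # the "#!/signin" fragment is not part of the query
    return {key: values[0] for key, values in parse_qs(query).items()}


if __name__ == "__main__":
    for key, value in oauth_params(LOGIN_URL).items():
        print(f"{key} = {value}")
```

The `protocol=oauth2`, `response_type=code` and `redirect_uri` parameters suggest an authorization-code style flow. Values such as `nonce` and `state` are normally issued per request, so a URL copied from the browser is unlikely to stay valid for long.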

What happens during login is:

1. Calling the login URL https://accounts.wsj.com/login generates some information on the server side: connection and client_id.
2. Submitting the username and password calls https://sso.accounts.dowjones.com/usernamepassword/login, which needs several parameters: the connection and client_id from the first call, plus some static OAuth2 parameters (scope, response_type, redirect_uri).
3. The response to that login call is a form that auto-submits. The form has three parameters, wa, wresult and wctx (wresult is a JWT), and it posts to https://sso.accounts.dowjones.com/login/callback, which returns a URL with a code parameter such as code=AjKK8g0pZZfvYpju.
4. Calling https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju retrieves the cookies for a valid user session.
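The third step of the flow above mentions that wresult is a JWT. When debugging, it can be useful to look inside such a token. The sketch below decodes the payload segment of a JWT with the standard library only, without verifying the signature. The sample token is built locally from made-up claims and is not a real WSJ token; `b64url` and `jwt_payload` are hypothetical helper names.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode data without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT; the signature is NOT checked."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


# Hypothetical token assembled for the demo (assumption, not server output).
header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
claims = b64url(json.dumps({"sub": "user@gmail.com", "exp": 1635100000}).encode())
sample = f"{header}.{claims}.signature"

if __name__ == "__main__":
    print(jwt_payload(sample))
```

The login flow itself does not need to decode wresult; the value is passed back to the callback unchanged. Decoding is only a way to inspect what the server issued.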

The following bash script implements this flow using curl, grep, pup and jq:

username="user@gmail.com"
password="YourPassword"

login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")

# gawk alternative to grep -P:
#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')

rm -f cookies.txt

IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
      --data-urlencode "username=$username" \
      --data-urlencode "password=$password" \
      --data-urlencode "connection=$connection" \
      --data-urlencode "client_id=$client_id" \
      --data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')

# decode the HTML-escaped double quotes (&#34;) in wctx
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')

code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
     --data-urlencode "wa=$wa" \
     --data-urlencode "wresult=$wresult" \
     --data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")

curl -s -c cookies.txt "$code_url"

# here call your URL loading cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
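Since the question uses requests, the same four steps can be sketched in Python. This is an untested-against-WSJ port of the bash script above, not a confirmed working client: the endpoints and field names are taken from that script and may have changed. The function accepts any session-like object with `get` and `post` methods (for example a `requests.Session`), and the names `InputCollector`, `form_fields` and `login` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, parse_qs

# Endpoints as they appear in the bash script above.
ACCOUNTS_LOGIN = "https://accounts.wsj.com/login"
PASSWORD_LOGIN = "https://sso.accounts.dowjones.com/usernamepassword/login"
LOGIN_CALLBACK = "https://sso.accounts.dowjones.com/login/callback"
REDIRECT_URI = "https://accounts.wsj.com/auth/sso/login"


class InputCollector(HTMLParser):
    """Collect name/value pairs of <input> tags; entities like &#34; are decoded."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def form_fields(html):
    """Return the named <input> fields found in html as a dict."""
    collector = InputCollector()
    collector.feed(html)
    return collector.fields


def login(session, username, password):
    """Run the login steps on session (e.g. a requests.Session) and return it."""
    # Step 1: do not follow the redirect; read connection and client_id from it.
    first = session.get(ACCOUNTS_LOGIN, allow_redirects=False)
    query = parse_qs(urlsplit(first.headers["Location"]).query)
    # Step 2: submit the credentials together with the OAuth2 parameters.
    form_page = session.post(PASSWORD_LOGIN, data={
        "username": username,
        "password": password,
        "connection": query["connection"][0],
        "client_id": query["client_id"][0],
        "scope": "openid idp_id",
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": REDIRECT_URI,
    })
    # Step 3: replay the auto-submitting form (expected fields: wa, wresult, wctx).
    fields = form_fields(form_page.text)
    callback = session.post(LOGIN_CALLBACK, data=fields, allow_redirects=False)
    # Step 4: the URL carrying the code sets the session cookies when visited.
    code_url = urljoin(LOGIN_CALLBACK, callback.headers["Location"])
    session.get(code_url)
    return session
```

Assuming the requests package is installed, usage might look like `rs = login(requests.Session(), "user@gmail.com", "YourPassword")` followed by `rs.get(article_url)`. Unlike the bash script, which reads wa, wresult and wctx by position, this sketch posts back every named input it finds, which assumes the form contains no other named fields.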
