Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

web crawler - Scrape articles from WSJ with requests, curl and BeautifulSoup

I'm a paid member of WSJ and I'm trying to scrape articles for an NLP project. I thought the code below would keep my logged-in session.

import requests

# A Session object keeps cookies across requests
rs = requests.Session()
login_url="https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin" 
payload={
    "username":"xxx@email",
    "password":"myPassword",
}
result = rs.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)

This is the article I want to parse:

r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y')

But the HTML I get back is still the non-member version.

I also tried another approach, using curl to save the cookies after logging in:

curl -c cookies.txt -I "https://www.wsj.com"
curl -v cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html

The result is the same.

I'm not very familiar with how authentication works behind the browser. Can someone explain why both approaches above fail, and how I should fix them to reach my goal? Thank you very much.


1 Reply

Your attempts failed because the login uses the OAuth 2.0 protocol, not basic authentication, so a single POST of the credentials to the login page does not create a session.

What's happening here is:

  • Calling the login URL https://accounts.wsj.com/login generates some information server side: connection and client_id.
  • Submitting the username and password calls https://sso.accounts.dowjones.com/usernamepassword/login, which needs several parameters: the connection and client_id from the previous step, plus some static OAuth 2.0 parameters (scope, response_type, redirect_uri).
  • The response to that login call is an auto-submitting form with three parameters: wa, wresult and wctx (wresult is a JWT). The form posts to https://sso.accounts.dowjones.com/login/callback, which returns a URL with a code parameter such as code=AjKK8g0pZZfvYpju.
  • Calling https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju then sets the cookies for a valid user session.
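
For anyone who prefers Python over shell, here is a minimal sketch of the two parsing steps in the list above: pulling a query parameter out of a redirect URL, and reading the fields of the auto-submitting form. It uses only the standard library. The helper names query_param and form_fields are my own, and the parameter names the server actually returns may differ (the login URL in the question carries client rather than client_id), so treat this as an illustration rather than a tested client.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


def query_param(url, name):
    """Return the first value of query parameter `name` in `url`, or None."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else None


class _InputCollector(HTMLParser):
    """Collect the name/value pairs of every <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr = dict(attrs)
            if attr.get("name"):
                # HTMLParser already decodes entities such as &#34; in
                # attribute values, so no separate unescaping is needed.
                self.fields[attr["name"]] = attr.get("value") or ""


def form_fields(html):
    """Return a dict of the input fields found in an HTML form."""
    parser = _InputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# The code parameter from the last step of the list above.
code_url = "https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju"
print(query_param(code_url, "code"))  # AjKK8g0pZZfvYpju
```

The dict returned by form_fields can be passed as the data argument of a POST to the callback URL, which is the role the wa, wresult and wctx variables play in the shell script.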

The following bash script does this using curl, grep, pup and jq:

username="user@gmail.com"
password="YourPassword"

# fetch only the response headers of the login page; the Location header carries the parameters
login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")

# alternative with gawk, if grep -P is not available
#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')

rm -f cookies.txt

# submit the credentials, then read wa, wresult and wctx from the auto-submitting form
IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
      --data-urlencode "username=$username" \
      --data-urlencode "password=$password" \
      --data-urlencode "connection=$connection" \
      --data-urlencode "client_id=$client_id" \
      --data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')

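# decode the HTML-escaped double quotes (&#34;) in wctx
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')

# post the form to the callback and take the redirect URL (with the code param) from the Location header
code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
     --data-urlencode "wa=$wa" \
     --data-urlencode "wresult=$wresult" \
     --data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")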

curl -s -c cookies.txt "$code_url"

# here call your URL loading cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
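
If you would rather continue in Python after the curl login (for example to parse the article with BeautifulSoup), the cookies.txt written by curl can be loaded with the standard library. This is a sketch and not part of the script above: load_curl_cookies is a hypothetical helper name, and it assumes the file is in the Netscape cookie format that curl -c writes, where HttpOnly cookies carry a #HttpOnly_ prefix.

```python
import http.cookiejar
import os
import tempfile

HTTPONLY_PREFIX = "#HttpOnly_"


def load_curl_cookies(path):
    """Load a cookie file written by `curl -c` into a MozillaCookieJar."""
    with open(path, encoding="utf-8") as src:
        # Strip curl's HttpOnly marker so those lines are parsed as ordinary
        # cookies instead of being skipped as comments.
        lines = [
            line[len(HTTPONLY_PREFIX):] if line.startswith(HTTPONLY_PREFIX) else line
            for line in src
        ]

    # MozillaCookieJar.load() takes a filename, so write the cleaned copy
    # to a temporary file and remove it afterwards.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.writelines(lines)
    try:
        jar = http.cookiejar.MozillaCookieJar()
        # curl stores session cookies with an expiry of 0; the ignore_* flags
        # keep them from being dropped as expired or discardable.
        jar.load(tmp.name, ignore_discard=True, ignore_expires=True)
    finally:
        os.remove(tmp.name)
    return jar
```

With the rs session from the question, rs.cookies.update(load_curl_cookies("cookies.txt")) should then send the same cookies as curl -b cookies.txt does.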
