Edit
I've made a Python library to scrape Tableau dashboards. Using it is more straightforward:
```python
from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/Colorado_COVID19_Data/CO_Home"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

for t in dashboard.worksheets:
    # show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    # show dataframe for this worksheet
    print(t.data)
```
You can run this on repl.it.
Old answer
The graphic seems to be generated in JS from the result of an API call that looks like:

```
POST https://public.tableau.com/TITLE/bootstrapSession/sessions/SESSION_ID
```

The SESSION_ID parameter is located (among other places) in the `tsConfigContainer` textarea of the URL used to build the iframe.
Starting from https://covid19.colorado.gov/hospital-data:

- check the element with class `tableauPlaceholder`
- get the `param` element with attribute `name`; it gives you the URL: `https://public.tableau.com/views/{urlPath}`
- the previous link gives you a textarea with id `tsConfigContainer` containing a bunch of JSON values
- extract the `sessionid` and the root path (`vizql_root`)
- make a POST to `https://public.tableau.com/ROOT_PATH/bootstrapSession/sessions/SESSION_ID` with the `sheetId` as form data
- extract the JSON from the result (the result is not pure JSON)
Code:

```python
import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://covid19.colorado.gov/hospital-data")
soup = BeautifulSoup(r.text, "html.parser")

# get the second tableau container on the page
tableauContainer = soup.findAll("div", {"class": "tableauPlaceholder"})[1]
urlPath = tableauContainer.find("param", {"name": "name"})["value"]

r = requests.get(
    f"https://public.tableau.com/views/{urlPath}",
    params={":showVizHome": "no"},
)
soup = BeautifulSoup(r.text, "html.parser")

# the textarea holds the session configuration as JSON
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

# the response is two length-prefixed JSON chunks: <len>;{...}<len>;{...}
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]
          ["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
```
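The greedy regex works for a response made of two chunks, but since the format is length-prefixed, a more robust approach is to walk the prefixes. A sketch (the function name is mine, and it assumes each prefix is the character count of the chunk that follows; if the payload contains multi-byte characters the count may be in bytes instead, in which case the slicing should be done on the encoded bytes):

```python
import json

def parse_tableau_response(text):
    # Split a "<len>;{json}<len>;{json}" response into its JSON chunks
    # using the length prefixes instead of a greedy regex.
    chunks = []
    i = 0
    while i < len(text):
        sep = text.index(";", i)          # end of the length prefix
        length = int(text[i:sep])          # size of the next chunk
        start = sep + 1
        chunks.append(json.loads(text[start:start + length]))
        i = start + length                 # jump to the next prefix
    return chunks
```

With the POST response from above, `info, data = parse_tableau_response(r.text)` would replace the regex step.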
From there you have all the data. You will need to work out how the data is split up, since it seems all the values are dumped into a single list; the other fields in the JSON object are probably useful for that.