Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Structuring a table using Scrapy Data

I have a website that contains tables (trs and tds). I want to create a structured CSV file from the table data. Because the field names can change depending on the month or the selections made, I'm trying to build the field names from the scraped table itself.

I have been able to iterate through the table and scrape the text I want to use as field names, but I have yet to figure out how to yield that data into the CSV file as column headers.

Right now I scrape them into an Item field named "h1header". When yielded to a CSV file, they appear as rows under the "h1header" key, like so:

Project Owning Org
Project Date Range
Fee Factor
Project Organization
Project Manager
Fee Calculation Method
Project Code
Project Lead
Status
Project Title
Total Project Value
Condition
External System Code
Funded Value
Billing Type

What I would ultimately like is the following:

Project Owning Org, Project Date Range, Fee Factor, Project Organization ...etc

So instead of rows, they would be columns. I could then fill those columns with the values from each of the tables on the page that share the same h1header format. Below is an example of the HTML that I'm scraping. This particular tbody.h1 repeats multiple times on the page, depending on the results.

<table class="report">
<tbody class="h1"><tr><td colspan="22">
<table class="report" >
<tbody class="h1">
<tr>
<td class="label">Project Owning Organization:</td><td>1.02.10</td>
<td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
<td class="label">Fee Factor:</td><td>&#8212;</td>
</tr>
<tr>
<td class="label">Project Organization:</td><td>1.2.26.1</td>
<td class="label">Project Manager:</td><td>Smith, John</td>
<td class="label">Fee Calculation Method:</td><td>&#8212;</td>
</tr>
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green" 
title="Green"></td>
</tr>
<tr>
<td class="label">External System Code:</td><td>&#8212;</td>
<td class="label">Funded Value:</td><td>1,438.00</td>
<td class="label">Billing Type:</td><td>FP</td>
</tr>
</tbody>

There are other tables within this HTML (tbody.h1 and tbody.detail) whose data I will then need to append as additional columns to the above.

I've done this in Java using Beautiful Soup by building arrays and then exporting them as CSV files. Getting the data with Python and Scrapy is FAR easier than it was in Java. I'm sure I'm overcomplicating this, but I'm stuck trying to figure it out, so any guidance would be appreciated!

question from:https://stackoverflow.com/questions/65927304/structuring-a-table-using-scrapy-data


1 Reply


Try this. Note that it uses the simplified_scrapy package rather than Scrapy itself.

from simplified_scrapy import SimplifiedDoc, req, utils

html = '''
    <table class="report">
    <tbody class="h1"><tr><td colspan="22">
    <table class="report" >
    <tbody class="h1">
    <tr>
    <td class="label">Project Owning Organization:</td><td>1.02.10</td>
    <td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
    <td class="label">Fee Factor:</td><td>&#8212;</td>
    </tr>
    <tr>
    <td class="label">Project Organization:</td><td>1.2.26.1</td>
    <td class="label">Project Manager:</td><td>Smith, John</td>
    <td class="label">Fee Calculation Method:</td><td>&#8212;</td>
    </tr>
    <tr>
    <td class="label">Project Code:</td><td>PROJECT.001</td>
    <td class="label">Project Lead:</td><td>Doe, Jane</td>
    <td class="label">Status:</td><td>Backlog</td>
    </tr>
    <tr>
    <td class="label">Project Title:</td><td>Scrapy Project</td>
    <td class="label">Total Project Value:</td><td>1,438.00</td>
    <td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green" 
    title="Green"></td>
    </tr>
    <tr>
    <td class="label">External System Code:</td><td>&#8212;</td>
    <td class="label">Funded Value:</td><td>1,438.00</td>
    <td class="label">Billing Type:</td><td>FP</td>
    </tr>
    </tbody>
    </table>
    </tbody>
</table>
'''
# html = req.get('your url') 
# html = utils.getFileContent('your file path')

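# header = []
rows = []
doc = SimplifiedDoc(html)
# Select every td of the inner report table.
# The loop below assumes the cells alternate: label, value, label, value, ...
tds = doc.selects('table.report>table.report>td')
row = []
for i in range(0, len(tds), 2):
    # Uncomment the header lines to also collect the labels as a header row.
    # header.append(tds[i].text.strip(':'))
    row.append(tds[i+1].text)  # the value cell that follows the label cell

# rows.append(header)
rows.append(row)

# Save the collected rows to test.csv
utils.save2csv('test.csv', rows, mode='a')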
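
If you would rather keep each label attached to its value, one option is to build a dict per table and let the keys become the column names. Below is a minimal sketch of that idea using only the Python standard library, not Scrapy or simplified_scrapy. The names LabelValueParser, table_to_record and records_to_csv are made up for this example, and SAMPLE is a shortened copy of the HTML from the question. It assumes every td with class "label" is followed by exactly one value td.

```python
import csv
import io
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect the text of each <td>, noting whether it has class="label"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []  # list of (is_label, text) tuples, in document order
        self._in_td = False
        self._is_label = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A nested <td> simply restarts the buffer, so only the
            # innermost cells are recorded.
            self._in_td = True
            self._is_label = dict(attrs).get("class") == "label"
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append((self._is_label, "".join(self._buf).strip()))
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)


def table_to_record(html):
    """Pair each label cell with the value cell that follows it."""
    parser = LabelValueParser()
    parser.feed(html)
    record, pending = {}, None
    for is_label, text in parser.cells:
        if is_label:
            pending = text.rstrip(":")  # drop the trailing colon
        elif pending is not None:
            record[pending] = text
            pending = None
    return record


def records_to_csv(records):
    """Return CSV text; the column names are the keys of the first record."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Shortened copy of the question's HTML (an assumption for this sketch).
SAMPLE = """
<table class="report"><tbody class="h1">
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green" title="Green"></td>
</tr>
</tbody></table>
"""

record = table_to_record(SAMPLE)
csv_text = records_to_csv([record])
print(csv_text)
```

With one dict per repeated tbody.h1 block, the labels end up as columns and each block becomes one row. In a Scrapy spider the same idea should carry over by yielding one such dict per block from the parse callback, since the CSV feed exporter writes item keys as columns, but I have not tested that against your page.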
