I'm trying to read multiple CSV files into a single dataframe. While this works using a loop and Pandas' concat function, e.g.
import pandas as pd
files = ['file1.csv', 'file2.csv', etc....]
all_df = []
for filename in files:
    all_df.append(pd.read_csv(filename))
df = pd.concat(all_df)
I find this is too slow when files is a long list (e.g. hundreds of items).
I've tried using Dask, which accepts a list as input and has built-in parallelisation for speed, e.g.
import dask.dataframe as dd
df_dask = dd.read_csv(files)
df = df_dask.compute()
which gives roughly a 2x speed-up.
However, for further speed up, I want the ability to only read in every Nth row of the files.
With Pandas, I can do this using a lambda function and the skiprows argument of read_csv, e.g.

cond = lambda x: x % downsampling != 0

and then, inside the loop, call pd.read_csv(filename, skiprows=cond).
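For completeness, the full Pandas-only version with downsampling looks like this (the downsampling value of 10 is just for illustration):

import pandas as pd

files = ['file1.csv', 'file2.csv', etc....]
downsampling = 10  # illustrative value: keep every 10th row
cond = lambda x: x % downsampling != 0  # skiprows callable: row 0 (the header) is kept since 0 % 10 == 0
all_df = [pd.read_csv(filename, skiprows=cond) for filename in files]
df = pd.concat(all_df)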
However, this doesn't work with Dask: its skiprows argument doesn't accept a callable. I also can't pass integers to skiprows, since each file has a different length, so exactly which rows to skip differs from file to file.
Is there a fast solution? I'm thinking that some sort of downsampling operation that's compatible with Dask could be the answer, but I'm not sure how to implement it. Is this possible, please?
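One direction I've been wondering about (a rough, untested sketch on my part, assuming the skiprows callable can be pushed into dask.delayed and the per-file results reassembled with dd.from_delayed) is:

import pandas as pd
import dask.dataframe as dd
from dask import delayed

files = ['file1.csv', 'file2.csv', etc....]
downsampling = 10  # illustrative value
cond = lambda x: x % downsampling != 0  # same skiprows callable as the Pandas version

# read each file lazily with Pandas (so skiprows accepts the lambda),
# then stitch the delayed pieces into a single Dask dataframe
parts = [delayed(pd.read_csv)(filename, skiprows=cond) for filename in files]
df_dask = dd.from_delayed(parts)
df = df_dask.compute()

but I don't know whether this keeps the parallel-read speed of dd.read_csv, or whether there is a more idiomatic Dask way to do it.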
question from:
https://stackoverflow.com/questions/65927646/how-to-read-every-nth-row-using-dask-read-csv-for-fast-multiple-reading-in-multi