python - How to read every nth row using Dask read_csv for fast reading of multiple files?

I'm trying to read multiple CSV files into a single dataframe. While this works using a loop and pandas' concat function, e.g.

import pandas as pd
files = ['file1.csv', 'file2.csv', ...]  # long list of CSV paths
all_df = []
for filename in files:
    all_df.append(pd.read_csv(filename))
df = pd.concat(all_df)

I find this is too slow when files is a long list (e.g. hundreds of items).

I've tried using Dask, which accepts a list of files as input and has built-in parallelisation, e.g.

import dask.dataframe as dd
df_dask = dd.read_csv(files)
df = df_dask.compute()

which gives ~2x speed up.

However, for a further speed-up, I want to read in only every Nth row of each file.

With pandas, I can do this by passing a callable to the skiprows argument of read_csv, e.g. cond = lambda x: x % downsampling != 0, and then calling pd.read_csv(filename, skiprows=cond) inside the loop.
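
Spelled out, that looks something like this (a minimal sketch, assuming a hypothetical downsampling factor of 10):

import pandas as pd

downsampling = 10  # hypothetical factor: keep every 10th row
# skiprows callable: read_csv skips any row index for which this returns True
cond = lambda x: x % downsampling != 0

all_df = []
for filename in files:
    all_df.append(pd.read_csv(filename, skiprows=cond))
df = pd.concat(all_df)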

However, this doesn't work with Dask: its skiprows argument doesn't accept callables. I also can't pass plain integers to skiprows, since each file has a different length, so exactly which rows to skip differs per file.

Is there a fast solution? I'm thinking some kind of downsampling operation that is compatible with Dask could work, but I'm not sure how to implement it. Is this possible?

Question from: https://stackoverflow.com/questions/65927646/how-to-read-every-nth-row-using-dask-read-csv-for-fast-multiple-reading-in-multi


1 Reply


Elaborating on @quizzical_panini's suggestion to use dask.delayed:

import dask
import pandas as pd

downsampling = 10  # keep every Nth row; here N = 10

@dask.delayed
def custom_pandas_load(file_path):
    # do what you would do if you had one file
    cond = lambda x: x % downsampling != 0
    df = pd.read_csv(file_path, skiprows=cond)
    return df

# dask.compute returns a tuple, hence the unpacking;
# files is the list of CSV paths from the question
[computed_dfs] = dask.compute(
    [custom_pandas_load(file_path) for file_path in files]
)

df_final = pd.concat(computed_dfs)
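
If you'd rather keep the result lazy as a Dask dataframe instead of concatenating in pandas right away, dask.dataframe.from_delayed can wrap the same delayed loads — a minimal sketch building on the function above:

import dask.dataframe as dd

# Build the delayed loads without triggering any computation yet
delayed_dfs = [custom_pandas_load(file_path) for file_path in files]

# Wrap them into a single lazy Dask dataframe; compute only when needed
df_dask = dd.from_delayed(delayed_dfs)
df_final = df_dask.compute()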
