python - How to read every nth row using Dask read_csv for fast reading of multiple files?

I'm trying to read multiple CSV files into a single dataframe. While this works using a loop and pandas' concat function, e.g.

import pandas as pd
files = ['file1.csv', 'file2.csv', ...]  # long list of CSV paths
all_df = []
for filename in files:
    all_df.append(pd.read_csv(filename))
df = pd.concat(all_df)

I find this is too slow when files is a long list (e.g. hundreds of items).

I've tried using Dask, which accepts a list of files as input and has built-in parallelisation, e.g.

import dask.dataframe as dd
df_dask = dd.read_csv(files)
df = df_dask.compute()

which gives ~2x speed up.

However, for a further speed-up, I want to read in only every Nth row of each file.

With pandas, I can do this by passing a callable to the skiprows argument of read_csv, e.g. cond = lambda x: x % downsampling != 0, and then calling pd.read_csv(filename, skiprows=cond) inside the loop.
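
Spelled out, that looks something like this (a minimal sketch, assuming a hypothetical downsampling factor of 10):

import pandas as pd

downsampling = 10  # hypothetical factor: keep every 10th row
# skiprows callable: read_csv skips any row index for which this returns True
cond = lambda x: x % downsampling != 0

all_df = []
for filename in files:
    all_df.append(pd.read_csv(filename, skiprows=cond))
df = pd.concat(all_df)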

However, this doesn't work with Dask: its skiprows argument doesn't accept callables. I also can't pass plain integers to skiprows, since each file has a different length, so exactly which rows to skip differs per file.

Is there a fast solution? I'm thinking some kind of downsampling operation that is compatible with Dask could work, but I'm not sure how to implement it. Is this possible?

Question from: https://stackoverflow.com/questions/65927646/how-to-read-every-nth-row-using-dask-read-csv-for-fast-multiple-reading-in-multi


1 Reply


Elaborating on @quizzical_panini's suggestion to use dask.delayed:

import dask
import pandas as pd

downsampling = 10  # keep every Nth row; here N = 10

@dask.delayed
def custom_pandas_load(file_path):
    # do what you would do if you had one file
    cond = lambda x: x % downsampling != 0
    df = pd.read_csv(file_path, skiprows=cond)
    return df

# dask.compute returns a tuple, hence the unpacking;
# files is the list of CSV paths from the question
[computed_dfs] = dask.compute(
    [custom_pandas_load(file_path) for file_path in files]
)

df_final = pd.concat(computed_dfs)
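
If you'd rather keep the result lazy as a Dask dataframe instead of concatenating in pandas right away, dask.dataframe.from_delayed can wrap the same delayed loads — a minimal sketch building on the function above:

import dask.dataframe as dd

# Build the delayed loads without triggering any computation yet
delayed_dfs = [custom_pandas_load(file_path) for file_path in files]

# Wrap them into a single lazy Dask dataframe; compute only when needed
df_dask = dd.from_delayed(delayed_dfs)
df_final = df_dask.compute()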
