Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
163 views
in Technique[技术] by (71.8m points)

python - Pandas: slow date conversion

I'm reading a huge CSV with a date field in the format YYYYMMDD and I'm using the following lambda to convert it when reading:

import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t:pd.to_datetime(str(t),
                                            format='%Y%m%d', coerce=True))

This function is very slow though.

Any suggestion to improve it?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Note: As @ritchie46's answer states, this solution may be redundant since pandas version 0.25 per the new argument cache_dates that defaults to True

Try using this function for parsing dates:

def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date:pd.to_datetime(date, format=format) for date in date_pd_series.unique()}
    return date_pd_series.map(dates)

Use it like:

df['date-column'] = lookup(df['date-column'], format='%Y%m%d')

Benchmarks:

$ python date-parse.py
to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms

Source: https://github.com/sanand0/benchmarks/tree/master/date-parse


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...