python - Get the same hash value for a Pandas DataFrame each time

Question

Welcome To Ask or Share your Answers For Others

python - Get the same hash value for a Pandas DataFrame each time

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Get the same hash value for a Pandas DataFrame each time

My goal is to get unique hash value for a DataFrame. I obtain it out of .csv file. Whole point is to get the same hash each time I call hash() on it.

My idea was that I create the function

def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_

that is calling underlying numpy array, set it to immutable state and get hash of the buffer.

INLINE UPD.

As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use

hash(df.values.tobytes())

See comments for the Most efficient property to hash for numpy array.

END OF INLINE UPD.

It works for regular pandas array:

In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165

But then I try to apply it to DataFrame obtained from a .csv file:

In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730

Can somebody explain me, how's that possible?

I can create new DataFrame out of it, like

new_data = pd.DataFrame(data=data_from_file.values, 
            columns=data_from_file.columns, 
            index=data_from_file.index)

and it works again

In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:34:47+0000

As of Pandas 0.20.1, you can use the little known (and poorly documented) hash_pandas_object (source code) which was recently made public in pandas.util. It returns one hash value for reach row of the dataframe (and works on series etc. too)

import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)

print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64

You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.

Categories

python - Get the same hash value for a Pandas DataFrame each time

python - Get the same hash value for a Pandas DataFrame each time

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags