Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
250 views
in Technique[技术] by (71.8m points)

python - Map a NumPy array of strings to integers

Problem:

Given an array of string data

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21'), 

I would like a function that returns the indexed dataset

indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')

and a lookup table

lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')

such that

(lookupTable[indexed_dataSet] == dataSet).all()

is true. Note that the indexed_dataSet and lookupTable can both be permuted such that the above holds and that is fine (i.e. it is not necessary that the order of lookupTable is equivalent to the order of first appearance in dataSet).

Slow Solution:

I currently have the following slow solution

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
       Input:
           dataSet         : A length n numpy array to be indexed
       Output:
           indexed_dataSet : A length n numpy array containing values in {0, len(set(dataSet))-1}
           lookupTable     : A lookup table such that lookupTable[indexed_Dataset] = dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    count = -1
    for label in labels:
        count += 1
        indexed_dataSet[np.where(dataSet == label)] = count
        lookupTable[count] = label

    return indexed_dataSet, lookupTable

Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use np.unique with the return_inverse argument:

>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'], 
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])

If you like, you can reconstruct your original array from these two arrays:

>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'], 
      dtype='<U21')

If you use pandas, lookupTable, indexed_dataSet = pd.factorize(dataSet) will achieve the same thing (and potentially be more efficient for large arrays).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...