Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
378 views
in Technique[技术] by (71.8m points)

python - Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas

Dear power Pandas experts:

I'm trying to implement a function to flatten a column of a dataframe which has element of type list, I want for each row of the dataframe where the column has element of type list, all columns but the designated column to be flattened will be duplicated, while the designated column will have one of the value in the list.

The following illustrate my requirements:

input = DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
     A   B
0    1   [a, b]
1    2   c

expected = DataFrame({'A': [1, 1, 2], 'B': ['a', 'b', 'c']}, index=[0, 0, 1])

     A   B
0    1   a
0    1   b
1    2   c

I feel that there might be an elegant solution/concept for it, but I'm struggling.

Here is my attempt, which does not work yet.

def flattenColumn(df, column):
    '''column is a string of the column's name.
    for each value of the column's element (which might be a list), duplicate the rest of columns at the correspdonding row with the (each) value.
    '''
    def duplicate_if_needed(row):
        return concat([concat([row.drop(column, axis = 1), DataFrame({column: each})], axis = 1) for each in row[column][0]])
    return df.groupby(df.index).transform(duplicate_if_needed)

In recognition of alko's help, here is my trivial generalization of the solution to deal with more than 2 columns in a dataframe:

def flattenColumn(input, column):
    '''
    column is a string of the column's name.
    for each value of the column's element (which might be a list),
    duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    column_flat = pandas.DataFrame(
        [
            [i, c_flattened]
            for i, y in input[column].apply(list).iteritems()
            for c_flattened in y
        ],
        columns=['I', column]
    )
    column_flat = column_flat.set_index('I')
    return (
        input.drop(column, 1)
             .merge(column_flat, left_index=True, right_index=True)
    )

The only limitation at the moment is that the order of columns changed, the column flatten would be at the right most, not in its original position. It should be feasible to fix.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I guess easies way to flatten list of lists would be a pure python code, as this object type is not well suited for pandas or numpy. So you can do it with for example

>>> b_flat = pd.DataFrame([[i, x] 
...               for i, y in input['B'].apply(list).iteritems() 
...                    for x in y], columns=list('IB'))
>>> b_flat = b_flat.set_index('I')

Having B column flattened, you can merge it back:

>>> input[['A']].merge(b_flat, left_index=True, right_index=True)
   A  B
0  1  a
0  1  b
1  2  c

[3 rows x 2 columns]

If you want the index to be recreated, as in your expected result, you can add .reset_index(drop=True) to last command.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...