The first step is to obtain the information you want in a natural way ("natural": easy to express in pandas, e.g. using pivot_table()
or groupby()
). In order to make the full product of interval x item
(with 0
for missing pairs), you may use:
df.pivot_table(index='interval', columns='item', values='transaction',
aggfunc=sum, fill_value=0)
# out:
item a b
interval
x1 1 2
x2 2 0
The trick however is how to reshape this into the specific format you asked for. This will involve duplicating the 'item'
column or level (something that pandas, understandably, is not particularly fond of). The following is the full operation in one chained sequence:
df2 = (df
.pivot_table(index='interval', columns='item', values='transaction',
aggfunc=sum, fill_value=0)
.stack().to_frame('count')
.reset_index('item').set_index('item', append=True, drop=False)
.unstack('interval').swaplevel(axis=1)
.sort_index(axis=1, ascending=[True, False])
.reset_index(drop=True)
)
# df2:
interval x1 x2
item count item count
0 a 1 a 2
1 b 2 b 0
You can comment out from the end to see the various stages. Let's break this down line by line after the pivot_table
:
Move item to level-1 multiindex and rename the sum as 'count'
... .stack().to_frame('count')
count
interval item
x1 a 1
b 2
x2 a 2
b 0
Duplicate the item
column (in order to unstack later):
... .reset_index('item').set_index('item', append=True, drop=False)
item count
interval item
x1 a a 1
b b 2
x2 a a 2
b b 0
Unstack the interval
, and swap the levels of the new multiindex columns (note: that's why we needed to duplicate item
: otherwise unstack()
would operate on a regular index (not MultiIndex), and as such would convert to a Series):
... .unstack('interval').swaplevel(axis=1)
interval x1 x2 x1 x2
item item count count
item
a a a 1 2
b b b 2 0
Finally, sort the columns MultiIndex and drop the (now useless) index:
... .sort_index(axis=1, ascending=[True, False])
... .reset_index(drop=True)
interval x1 x2
item count item count
0 a 1 a 2
1 b 2 b 0